`lipidoz.ml`

This subpackage contains utilities for performing double bond identification using machine learning.

Module Reference

`lipidoz.ml.data`

lipidoz.ml.data.load_preml_data(preml_file)

Loads a pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset()) from the specified file.

Parameters:

preml_file (str): path to pre-ml dataset file

Returns:

preml_data (dict(...)): pre-ml dataset

lipidoz.ml.data.load_ml_targets(ml_target_file, rt_corr_func=None)

Loads a list of annotated lipids from a .csv formatted target list with columns:

lipid – lipid name in standard format (str)

adduct – MS adduct (str)

retention time – target retention time (float)

true_dbidx – known double bond index (int)

true_dbopos – known double bond position(s), separated by - if there are multiple at this index; the value is unpacked into a list(int) holding all double bond positions at that index

A list of lipid targets is returned, each defined by the information above, but grouped by lipid/adduct/retention time. Each lipid target contains:

lipid (str)

adduct (str)

retention time (rounded to 2 decimal places, float)

annotations (dict(int:list(int))), a dict mapping db indices to lists of db positions

Parameters:

ml_target_file (str): path to target list .csv
rt_corr_func (func, optional): apply rt correction using provided function

Returns:

targets (dict(...)): dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions)
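The grouping described above can be sketched with the standard library alone. This is an illustration rather than the lipidoz implementation: `load_targets_sketch` and the example rows are hypothetical, and it assumes the .csv has a header row using the column names listed above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical target list (assumes a header row with the documented column names).
EXAMPLE_CSV = """lipid,adduct,retention time,true_dbidx,true_dbopos
PC(16:0_18:1),[M+H]+,12.3456,1,9
PC(18:1_18:2),[M+H]+,11.5,1,9
PC(18:1_18:2),[M+H]+,11.5,2,9-12
"""


def load_targets_sketch(csv_text, rt_corr_func=None):
    """Group rows by (lipid, adduct, retention time) into {db index: [db positions]}."""
    targets = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rt = float(row["retention time"])
        if rt_corr_func is not None:
            rt = rt_corr_func(rt)  # optional retention time correction
        key = (row["lipid"], row["adduct"], round(rt, 2))
        # "9-12" is unpacked into [9, 12]; a single position becomes a one-item list
        positions = [int(p) for p in row["true_dbopos"].split("-")]
        targets[key][int(row["true_dbidx"])] = positions
    return dict(targets)


targets = load_targets_sketch(EXAMPLE_CSV)
for key, annotations in targets.items():
    print(key, annotations)
```

Because the retention time is rounded before it is used in the key, rows for the same lipid and adduct whose retention times agree to 2 decimal places end up in the same target.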

lipidoz.ml.data.split_true_and_false_preml_data(preml_data, targets)

Splits a pre-ml dataset into true/false annotated examples based on a list of target lipids with annotated double bond positions and indices. For a given lipid target, the annotated double bond positions and indices determine which entries are put into the true split; all remaining entries for that target are put into the false split.

Parameters:

preml_data (dict(...)): a pre-ml dataset produced by lipidoz.workflows.collect_pre_ml_dataset()
targets (dict(...)): dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions), produced by lipidoz.ml.data.load_ml_targets()

Returns:

true_preml_data (dict(str:dict(...)))
false_preml_data (dict(str:dict(...))): new pre-ml datasets with entries in ‘targets’ split by annotation (T/F)
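The splitting rule can be illustrated on a simplified stand-in for the pre-ml dataset. The real layout of a pre-ml dataset is not described here, so `entries_by_target` below is a hypothetical structure (target key mapped to candidate entries keyed by db index and db position) and `split_true_false_sketch` is not part of lipidoz.

```python
def split_true_false_sketch(entries_by_target, targets):
    """Split candidate entries into true/false sets using target annotations."""
    true_split, false_split = {}, {}
    for target_key, annotations in targets.items():
        true_split[target_key] = {}
        false_split[target_key] = {}
        candidates = entries_by_target.get(target_key, {})
        for (db_idx, db_pos), example in candidates.items():
            # an entry is "true" only if its position is annotated at its index
            annotated = db_pos in annotations.get(db_idx, [])
            destination = true_split if annotated else false_split
            destination[target_key][(db_idx, db_pos)] = example
    return true_split, false_split


# Hypothetical inputs: one target with three candidate double bond positions.
key = ("PC(16:0_18:1)", "[M+H]+", 12.35)
targets = {key: {1: [9]}}
entries_by_target = {
    key: {(1, 7): "example A", (1, 9): "example B", (1, 11): "example C"}
}

true_split, false_split = split_true_false_sketch(entries_by_target, targets)
print(true_split[key])
print(false_split[key])
```

Only entries belonging to lipids listed in `targets` appear in either split, which matches the description of the return values above.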

lipidoz.ml.data.preml_to_ml_data(preml_data, rt_sampling_augment=False, normalize_intensity=True, debug_flag=None, debug_cb=None)

Takes a pre-ml dataset and performs standard binning and grouping of precursor/fragment RTMZ profiles into arrays.

Parameters:

preml_data (dict(...)): pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset())
rt_sampling_augment (bool, default=False): re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)
normalize_intensity (bool, default=True): normalize the intensities in each 2D RTMZ array so that they are in the range 0 to 1
debug_flag (str, optional): specifies how to dispatch the message and/or plot, None to do nothing
debug_cb (func, optional): callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’

Returns:

ml_data (numpy.ndarray): array of binned data for ML with shape: (targets, 3, 24, 400)
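To make the output shape concrete, here is a minimal NumPy sketch of binning raw RTMZ points onto a fixed grid and stacking the three profiles. It is not the lipidoz implementation. It assumes the 24 axis is retention time and the 400 axis is m/z, that intensities falling in the same bin are summed, and that normalization divides by the array maximum; `bin_rtmz_sketch` and the m/z values are hypothetical.

```python
import numpy as np


def bin_rtmz_sketch(rts, mzs, intensities, rt_bounds, mz_bounds,
                    n_rt_bins=24, n_mz_bins=400, normalize_intensity=True):
    """Sum intensities on a fixed RT x m/z grid, optionally scaled to 0..1."""
    binned, _, _ = np.histogram2d(
        rts, mzs,
        bins=(n_rt_bins, n_mz_bins),
        range=(rt_bounds, mz_bounds),
        weights=intensities,
    )
    if normalize_intensity and binned.max() > 0:
        binned = binned / binned.max()  # assumption: scale by the maximum
    return binned


rng = np.random.default_rng(0)
rt, rt_tol = 10.0, 0.2
channels = []
# hypothetical m/z values for precursor, aldehyde, and criegee
for mz in (760.585, 650.4, 666.4):
    n_points = 5000
    rts = rng.uniform(rt - rt_tol, rt + rt_tol, n_points)
    mzs = rng.uniform(mz - 1.5, mz + 2.5, n_points)
    intensities = rng.uniform(0.0, 1e4, n_points)
    channels.append(
        bin_rtmz_sketch(rts, mzs, intensities,
                        (rt - rt_tol, rt + rt_tol), (mz - 1.5, mz + 2.5))
    )

example = np.stack(channels)   # one target: (3, 24, 400)
ml_data = np.stack([example])  # all targets: (targets, 3, 24, 400)
print(ml_data.shape)
```

The three channels correspond to the precursor and the aldehyde/criegee fragments, so one target contributes one (3, 24, 400) block to the returned array.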

lipidoz.ml.data.get_dataloaders_for_ml(true_data, false_data, val_size=0.2, batch_size=128, shuffle=True, random_state=420)

Splits true/false data into training/validation sets and returns a torch.utils.data.DataLoader instance for each, along with the corresponding dataset sizes. The datasets include the torchvision.transforms.ToTensor transform.

Parameters:

true_data (numpy.ndarray)
false_data (numpy.ndarray): arrays of true, false binned data for ML with shapes (N, 3, 24, 400), where N is the number of training examples in each set
val_size (float, default=0.2): proportion of dataset to split into validation set
batch_size (int, default=128): batch_size parameter for dataloaders. Given the proportion of True/False samples in the complete training data (~7.5%), a batch size of 128 should contain around 10 True examples on average
shuffle (bool, default=True): shuffle parameter for dataloaders
random_state (int, default=420): pRNG seed for deterministic splitting results

Returns:

dataloaders (dict(str:torch.utils.data.DataLoader)): ‘train’ and ‘validate’ DataLoaders (torch.utils.data.DataLoader)
dataset_sizes (dict(str:int)): ‘train’ and ‘validate’ dataset sizes
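The split and the batch-size reasoning can be sketched without torch. The function below is a hypothetical NumPy stand-in that covers only the labelling, seeded shuffling, and train/validate split; the real function wraps the result in DataLoader instances, and how it performs the split internally is not stated here. The arithmetic at the end reads "~7.5%" as the fraction of True examples, which is an assumption.

```python
import numpy as np


def train_val_split_sketch(true_data, false_data, val_size=0.2, random_state=420):
    """Label, shuffle (seeded), and split examples into train/validate sets."""
    X = np.concatenate([true_data, false_data])
    y = np.concatenate([
        np.ones(len(true_data), dtype=int),    # label 1 for True examples
        np.zeros(len(false_data), dtype=int),  # label 0 for False examples
    ])
    order = np.random.default_rng(random_state).permutation(len(X))
    n_val = int(round(val_size * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    splits = {
        "train": (X[train_idx], y[train_idx]),
        "validate": (X[val_idx], y[val_idx]),
    }
    sizes = {name: len(labels) for name, (_, labels) in splits.items()}
    return splits, sizes


# Small placeholder arrays with the documented (N, 3, 24, 400) shape.
true_data = np.zeros((3, 3, 24, 400), dtype=np.float32)
false_data = np.zeros((37, 3, 24, 400), dtype=np.float32)
splits, dataset_sizes = train_val_split_sketch(true_data, false_data)
print(dataset_sizes)

# Expected number of True examples per batch at the default batch size.
expected_true_per_batch = 128 * 0.075
print(expected_true_per_batch)
```

With 40 examples and val_size=0.2, 8 examples go to validation and 32 to training. Reusing the same random_state reproduces the same split.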

`lipidoz.ml.models`

ML Model Base Class Methods

lipidoz.ml.models._Model.load(self, state_dict_path)

Loads parameters for self.model

self.model holds an instance of the untrained model. It is updated with the parameters loaded from the state dict and then sent to self.device.

Parameters:

state_dict_path (str): path to state dict with pre-trained parameters for this model

lipidoz.ml.models._Model.save(self, state_dict_path)

Saves parameters for self.model to file as state dict

Parameters:

state_dict_path (str): path to state dict to save parameters for this model

lipidoz.ml.models._Model.train(self, dataloaders, dataset_sizes, criterion=None, optimizer=None, scheduler=None, epochs=32, debug=False, xent_f_t_weights=[0.9, 0.1])

Trains a model with specified dataset and training parameters

Parameters:

dataloaders (dict(str:torch.utils.data.DataLoader)): ‘train’ and ‘validate’ DataLoaders (torch.utils.data.DataLoader)
dataset_sizes (dict(str:int)): ‘train’ and ‘validate’ dataset sizes
criterion (torch.nn.?, optional): loss function for training, if not provided defaults to torch.nn.CrossEntropyLoss with weight of True examples set to 10% and False to 90% to reflect the imbalance in the training data
optimizer (torch.optim.Optimizer, optional): model optimizer, if not provided defaults to torch.optim.Adam with learning rate of 0.001
scheduler (torch.optim.lr_scheduler.StepLR, optional): learning rate scheduler, if not provided defaults to decaying learning rate by 0.1 every 8 epochs
epochs (int, default=32): number of epochs to train over
debug (bool, default=False): print debugging info
xent_f_t_weights (list(float), default=[0.9, 0.1]): if using the default cross-entropy loss, set weights for [F, T] classes. By default this ratio is 0.9 F to 0.1 T to reflect the approximate imbalance in training examples, but the ratio can be tuned to achieve desired prediction characteristics
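As a worked example of the default schedule, the snippet below computes the learning rate in effect at each epoch, assuming the defaults listed above (initial rate 0.001, decay factor 0.1, step of 8 epochs, 32 epochs) and the usual step-decay rule that torch.optim.lr_scheduler.StepLR applies. `step_lr_sketch` is a hypothetical helper written in plain Python, not a lipidoz or torch function.

```python
import math


def step_lr_sketch(epoch, base_lr=0.001, gamma=0.1, step_size=8):
    """Learning rate in effect during a given (0-based) epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = {epoch: step_lr_sketch(epoch) for epoch in range(32)}
for epoch in (0, 8, 16, 24):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.0e}")
```

Under these assumptions the rate drops by a factor of 10 at epochs 8, 16, and 24, so the last 8 of the 32 default epochs run at one thousandth of the initial rate.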

lipidoz.ml.models._Model.predict(self, X)

Predict class labels for input examples

Parameters:

X (numpy.ndarray): array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also ok, it automatically gets transposed to the proper shape

Returns:

y (numpy.ndarray): array of predictions, 0 for False 1 for True, with shape (N,) where N is the number of examples in the set

lipidoz.ml.models._Model.predict_proba(self, X)

Predicts class probabilities for input examples

Parameters:

X (numpy.ndarray): array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also ok, it automatically gets transposed to the proper shape

Returns:

y (numpy.ndarray): array of label (T/F) probabilities, with shape (N, 2) where N is the number of examples in the set
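The two accepted input layouts and the two output shapes can be illustrated with NumPy. This sketch makes two assumptions that the text above does not confirm: that the probability columns are ordered [F, T] (as in xent_f_t_weights), and that class labels correspond to the more probable column. `to_channels_last_sketch` is a hypothetical helper, not the lipidoz code.

```python
import numpy as np


def to_channels_last_sketch(X):
    """Move a (N, 3, 24, 400) array to (N, 24, 400, 3); leave other shapes as-is."""
    if X.shape[1:] == (3, 24, 400):
        return np.transpose(X, (0, 2, 3, 1))
    return X


# Distinct values so the axis move can be checked element by element.
X = np.arange(2 * 3 * 24 * 400, dtype=np.float32).reshape(2, 3, 24, 400)
X_last = to_channels_last_sketch(X)
print(X_last.shape)

# Hypothetical probabilities with shape (N, 2), columns assumed to be [F, T].
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = proba.argmax(axis=1)  # shape (N,), 0 for False and 1 for True
print(labels)
```

Each row of the probability array sums to 1, and the label array has one entry per example.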

ResNet18 model

class lipidoz.ml.models.resnet18.ResNet18

Model based on pre-trained ResNet18

Methods

`load`(state_dict_path)	Loads parameters for `self.model`
`predict`(X)	Predict class labels for input examples
`predict_proba`(X)	Predicts class probabilities for input examples
`save`(state_dict_path)	Saves parameters for `self.model` to file as state dict
`train`(dataloaders, dataset_sizes[, ...])	Trains a model with specified dataset and training parameters

lipidoz.ml.models.resnet18.ResNet18.__init__(self): Initializes a new instance of the ResNet18 model

`lipidoz.ml.view`

lipidoz.ml.view.plot_preml_example(pre_rtmz, ald_rtmz, crg_rtmz, rt, mzs, rt_tol=0.2, mz_range=(1.5, 2.5), figname=None, rgb=True)

Produces a plot of a pre-machine learning example which consists of raw RTMZ array data (arrays of retention time, m/z, and intensities) for a precursor and aldehyde/criegee Oz fragments. Default rt_tol and mz_range correspond to the normal values used for extracting and binning ML data.

Parameters:

pre_rtmz (tuple(numpy.ndarray(float)))
ald_rtmz (tuple(numpy.ndarray(float)))
crg_rtmz (tuple(numpy.ndarray(float))): raw RTMZ data arrays for precursor and aldehyde/criegee Oz fragments
rt (float): central retention time value to display (same for precursor and Oz fragments)
mzs (tuple(float)): monoisotopic masses for precursor, aldehyde, and criegee
rt_tol (float, default=0.2): tolerance of rt values (relative to rt) to display
mz_range (tuple(float), default=(1.5, 2.5)): lower, upper range of m/z values to display (relative to monoisotopic mass), same for precursor and Oz fragments
figname (str, optional): if provided, save the image to the specified file
rgb (bool, default=True): if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all

lipidoz.ml.view.plot_ml_example(pre_binned, ald_binned, crg_binned, figname=None, rgb=True)

Produces a plot of a machine learning example which consists of binned RTMZ data for a precursor and aldehyde/criegee Oz fragments

Parameters:

pre_binned (numpy.ndarray(float))
ald_binned (numpy.ndarray(float))
crg_binned (numpy.ndarray(float)): binned RTMZ data for precursor and aldehyde/criegee Oz fragments
figname (str, optional): if provided, save the image to the specified file
rgb (bool, default=True): if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all
