lipidoz.ml

This subpackage contains utilities for performing double bond identification using machine learning.

Module Reference

lipidoz.ml.data

lipidoz.ml.data.load_preml_data(preml_file)

Loads a pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset()) from the specified file

Parameters:
preml_file : str

path to pre-ml dataset file

Returns:
preml_data : dict(...)

pre-ml dataset

lipidoz.ml.data.load_ml_targets(ml_target_file, rt_corr_func=None)

Loads a list of annotated lipids from a .csv formatted target list with columns:

  • lipid – lipid name in standard format (str)

  • adduct – MS adduct (str)

  • retention time – target retention time (float)

  • true_dbidx – known double bond index (int)

  • true_dbopos – known double bond position(s) at this index, separated by - if there are several; unpacked into a list(int) of all double bond positions at that index

The lipid targets are returned grouped by lipid/adduct/retention time, each defined by the information above. Each lipid target contains:

  • lipid (str)

  • adduct (str)

  • retention time (rounded to 2 decimal places, float)

  • annotations (dict(int:list(int))), a dict mapping db indices to lists of db positions

Parameters:
ml_target_file : str

path to target list .csv

rt_corr_func : func, optional

apply rt correction using provided function

Returns:
targets : dict(...)

dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions)
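The grouping described above can be sketched with the standard library alone. This is not lipidoz's implementation: the helper name group_targets and the example rows are hypothetical, and the column headers are assumed to match the names listed above exactly.

```python
import csv
import io
from collections import defaultdict


def group_targets(csv_text, rt_corr_func=None):
    """Group annotated rows by (lipid, adduct, retention time)."""
    targets = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rt = float(row["retention time"])
        if rt_corr_func is not None:
            rt = rt_corr_func(rt)  # optional retention time correction
        key = (row["lipid"], row["adduct"], round(rt, 2))
        # "9-12" -> [9, 12]; a single position such as "9" -> [9]
        positions = [int(p) for p in row["true_dbopos"].split("-")]
        targets[key][int(row["true_dbidx"])] = positions
    return dict(targets)


# made-up rows, for illustration only
EXAMPLE_CSV = """lipid,adduct,retention time,true_dbidx,true_dbopos
PC 18:1_18:2,[M+H]+,11.2,1,9
PC 18:1_18:2,[M+H]+,11.2,2,9-12
PC 16:0_18:1,[M+H]+,12.347,2,9
"""

targets = group_targets(EXAMPLE_CSV)
```

Rows that share a lipid, adduct, and rounded retention time end up under one key, with one annotations entry per double bond index.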

lipidoz.ml.data.split_true_and_false_preml_data(preml_data, targets)

Splits a pre-ml dataset into true/false annotated examples based on a list of target lipids with annotated double bond positions and indices. The annotated double bond positions and indices determine which entries are put into the true split for a given lipid target; the rest of the entries for that target are put into the false split.

Parameters:
preml_data : dict(...)

a pre-ml dataset produced by lipidoz.workflows.collect_pre_ml_dataset()

targets : dict(...)

dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions), produced by lipidoz.ml.data.load_ml_targets()

Returns:
true_preml_data : dict(str:dict(...))
false_preml_data : dict(str:dict(...))

new pre-ml datasets with entries in ‘targets’ split by annotation (T/F)
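A simplified sketch of the splitting rule follows. The real pre-ml dataset is a nested dict whose layout is not described here, so the flat candidate tuples, the helper name split_true_false, and the choice to leave out lipids that have no annotations are all assumptions made for illustration.

```python
def split_true_false(candidates, targets):
    """Split candidate entries into annotated (true) and unannotated (false) lists."""
    true_split, false_split = [], []
    for lipid, adduct, rt, db_idx, db_pos in candidates:
        annotations = targets.get((lipid, adduct, rt))
        if annotations is None:
            continue  # not an annotated target; left out of both splits here
        if db_pos in annotations.get(db_idx, []):
            true_split.append((lipid, adduct, rt, db_idx, db_pos))
        else:
            false_split.append((lipid, adduct, rt, db_idx, db_pos))
    return true_split, false_split


# made-up target and candidates, for illustration only
targets = {("PC 16:0_18:1", "[M+H]+", 12.35): {1: [9]}}
candidates = [
    ("PC 16:0_18:1", "[M+H]+", 12.35, 1, 7),
    ("PC 16:0_18:1", "[M+H]+", 12.35, 1, 9),
    ("PC 16:0_18:1", "[M+H]+", 12.35, 1, 12),
    ("PC 18:0_18:1", "[M+H]+", 13.10, 1, 9),  # lipid without annotations
]
true_split, false_split = split_true_false(candidates, targets)
```

Only the candidate whose index and position match the annotation lands in the true split; the other candidates for the same target become false examples.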

lipidoz.ml.data.preml_to_ml_data(preml_data, rt_sampling_augment=False, normalize_intensity=True, debug_flag=None, debug_cb=None)

Takes pre-ml dataset and performs standard binning and grouping of precursor/fragment RTMZ profiles into arrays

Parameters:
preml_data : dict(...)

pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset())

rt_sampling_augment : bool, default=False

re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)

normalize_intensity : bool, default=True

normalize the intensities in each 2D RTMZ array so that they are in the range 0 to 1

debug_flag : str, optional

specifies how to dispatch the debugging message and/or plot; None to do nothing

debug_cb : func, optional

callback function that takes the debugging message as an argument; can be None if debug_flag is not set to ‘textcb’

Returns:
ml_data : numpy.ndarray

array of binned data for ML with shape: (targets, 3, 24, 400)
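To make the (3, 24, 400) shape of each example concrete, here is a rough sketch of binning one RTMZ profile onto a 24 x 400 grid. It is not the library's binning code: the helper bin_rtmz is hypothetical, the window is assumed to span rt ± rt_tol and mz - 1.5 to mz + 2.5 (the defaults documented for lipidoz.ml.view.plot_preml_example), and normalization is assumed to mean dividing by the largest bin.

```python
def bin_rtmz(points, rt, mz, rt_tol=0.2, mz_range=(1.5, 2.5),
             n_rt_bins=24, n_mz_bins=400, normalize=True):
    """Sum intensities of (rt, mz, intensity) points on an RT x m/z grid."""
    rt_min, rt_max = rt - rt_tol, rt + rt_tol
    mz_min, mz_max = mz - mz_range[0], mz + mz_range[1]
    grid = [[0.0] * n_mz_bins for _ in range(n_rt_bins)]
    for p_rt, p_mz, intensity in points:
        if not (rt_min <= p_rt < rt_max and mz_min <= p_mz < mz_max):
            continue  # outside the extraction window
        i = int((p_rt - rt_min) / (rt_max - rt_min) * n_rt_bins)
        j = int((p_mz - mz_min) / (mz_max - mz_min) * n_mz_bins)
        # clamp in case floating point error pushes a point past the last bin
        grid[min(i, n_rt_bins - 1)][min(j, n_mz_bins - 1)] += intensity
    if normalize:
        peak = max(max(row) for row in grid)
        if peak > 0:
            grid = [[value / peak for value in row] for row in grid]
    return grid


# made-up data points, for illustration only
points = [
    (10.00, 760.585, 500.0),
    (10.05, 760.585, 250.0),
    (11.00, 760.585, 999.0),  # outside the RT window, ignored
]
channel = bin_rtmz(points, rt=10.0, mz=760.585)
```

Under these assumptions, stacking one such grid each for the precursor, aldehyde, and Criegee profiles gives a single (3, 24, 400) example.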

lipidoz.ml.data.get_dataloaders_for_ml(true_data, false_data, val_size=0.2, batch_size=128, shuffle=True, random_state=420)

Splits true/false data into training/validation sets, returns torch.utils.data.DataLoader instances for each along with corresponding dataset sizes. Includes the torchvision.transforms.ToTensor transform in datasets

Parameters:
true_data : numpy.ndarray
false_data : numpy.ndarray

arrays of true, false binned data for ML with shapes (N, 3, 24, 400), where N is the number of training examples in each set

val_size : float, default=0.2

proportion of dataset to split into validation set

batch_size : int, default=128

batch_size parameter for dataloaders. Given the proportion of True examples in the complete training data (~7.5%), a batch size of 128 should contain around 10 True examples on average

shuffle : bool, default=True

shuffle parameter for dataloaders

random_state : int, default=420

pRNG seed for deterministic splitting results

Returns:
dataloaders : dict(str:torch.utils.data.DataLoader)

‘train’ and ‘validate’ DataLoaders (torch.utils.data.DataLoader)

dataset_sizes : dict(str:int)

‘train’ and ‘validate’ dataset sizes
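The effect of val_size and random_state, and the batch size arithmetic, can be illustrated without PyTorch. This sketch only mimics a seeded split; the helper split_indices is hypothetical and the library's actual splitting routine and rounding may differ.

```python
import random


def split_indices(n_examples, val_size=0.2, random_state=420):
    """Return (train, validate) index lists from a seeded shuffle."""
    indices = list(range(n_examples))
    random.Random(random_state).shuffle(indices)
    n_val = round(n_examples * val_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000)
dataset_sizes = {"train": len(train_idx), "validate": len(val_idx)}

# expected number of True examples per batch at ~7.5% True, batch size 128
expected_true_per_batch = 0.075 * 128
```

With a fixed seed the same examples land in the same split on every run, and 0.075 x 128 is about 9.6, which matches the "around 10 True examples" estimate above.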

lipidoz.ml.models

ML Model Base Class Methods

lipidoz.ml.models._Model.load(self, state_dict_path)

Loads parameters for self.model

self.model holds an instance of the untrained model; it is updated with the parameters loaded from the state dict and then sent to self.device

Parameters:
state_dict_path : str

path to state dict with pre-trained parameters for this model

lipidoz.ml.models._Model.save(self, state_dict_path)

Saves parameters for self.model to file as state dict

Parameters:
state_dict_path : str

path to state dict to save parameters for this model

lipidoz.ml.models._Model.train(self, dataloaders, dataset_sizes, criterion=None, optimizer=None, scheduler=None, epochs=32, debug=False, xent_f_t_weights=[0.9, 0.1])

Trains a model with specified dataset and training parameters

Parameters:
dataloaders : dict(str:torch.utils.data.DataLoader)

‘train’ and ‘validate’ DataLoaders (torch.utils.data.DataLoader)

dataset_sizes : dict(str:int)

‘train’ and ‘validate’ dataset sizes

criterion : torch.nn.?, optional

loss function for training; if not provided, defaults to torch.nn.CrossEntropyLoss with the weight of True examples set to 10% and False to 90% to reflect the imbalance in the training data

optimizer : torch.optim.Optimizer, optional

model optimizer; if not provided, defaults to torch.optim.Adam with a learning rate of 0.001

scheduler : torch.optim.lr_scheduler.StepLR, optional

learning rate scheduler; if not provided, defaults to decaying the learning rate by a factor of 0.1 every 8 epochs

epochs : int, default=32

number of epochs to train over

debug : bool, default=False

print debugging info

xent_f_t_weights : list(float), default=[0.9, 0.1]

if using the default cross-entropy loss, set weights for the [F, T] classes. By default this ratio is 0.9 F to 0.1 T to reflect the approximate imbalance in training examples, but the ratio can be tuned to achieve desired prediction characteristics
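The default training settings described above can be worked through in plain Python. This is an illustration rather than the training loop itself: step_lr and weighted_cross_entropy are hypothetical helpers that follow the step-decay and class-weighted cross-entropy formulas documented for torch.optim.lr_scheduler.StepLR and torch.nn.CrossEntropyLoss (mean reduction), and the logits are made up.

```python
import math


def step_lr(epoch, base_lr=0.001, step_size=8, gamma=0.1):
    """Learning rate in effect during a given epoch (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


def weighted_cross_entropy(logits, labels, weights=(0.9, 0.1)):
    """Class-weighted cross-entropy, averaged by the total weight of the labels."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        log_norm = math.log(sum(math.exp(z) for z in row))
        total += weights[label] * (log_norm - row[label])  # -w * log softmax
        weight_sum += weights[label]
    return total / weight_sum


# learning rate for each of the default 32 epochs
lrs = [step_lr(epoch) for epoch in range(32)]

# two made-up examples: one labeled False (0), one labeled True (1)
loss = weighted_cross_entropy([[2.0, -1.0], [0.5, 0.5]], labels=[0, 1])
```

Over 32 epochs the learning rate steps through roughly 1e-3, 1e-4, 1e-5, and 1e-6, eight epochs each. Changing the weights passed to weighted_cross_entropy shifts how much each class contributes to the loss, which is the knob xent_f_t_weights exposes.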

lipidoz.ml.models._Model.predict(self, X)

Predict class labels for input examples

Parameters:
X : numpy.ndarray

array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also ok, it automatically gets transposed to the proper shape

Returns:
y : numpy.ndarray

array of predictions, 0 for False 1 for True, with shape (N,) where N is the number of examples in the set

lipidoz.ml.models._Model.predict_proba(self, X)

Predicts class probabilities for input examples

Parameters:
X : numpy.ndarray

array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also ok, it automatically gets transposed to the proper shape

Returns:
y : numpy.ndarray

array of label (T/F) probabilities, with shape (N, 2) where N is the number of examples in the set
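The two accepted input layouts and the relationship between probabilities and labels can be sketched as follows. Both helpers are hypothetical, and the sketch assumes that column 0 of the probability array is the False class, that column 1 is the True class, and that predict picks the more probable class; none of this is confirmed by the reference above.

```python
def labels_from_proba(proba):
    """Pick the more probable class per example: 0 for False, 1 for True."""
    return [1 if p_true > p_false else 0 for p_false, p_true in proba]


def axes_to_channels_last(shape):
    """Axis order that turns either accepted layout into (N, 24, 400, 3)."""
    if tuple(shape[1:]) == (24, 400, 3):
        return (0, 1, 2, 3)  # already in the expected layout
    if tuple(shape[1:]) == (3, 24, 400):
        return (0, 2, 3, 1)  # move the channel axis to the end
    raise ValueError(f"unexpected input shape: {shape}")


# made-up probabilities for two examples
labels = labels_from_proba([(0.9, 0.1), (0.2, 0.8)])
axes = axes_to_channels_last((5, 3, 24, 400))
```

The returned axis order is the kind of permutation that could be passed to numpy.transpose to convert pre-ml style arrays before prediction.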

ResNet18 model

class lipidoz.ml.models.resnet18.ResNet18

Model based on pre-trained ResNet18

Methods

load(state_dict_path)

Loads parameters for self.model

predict(X)

Predict class labels for input examples

predict_proba(X)

Predicts class probabilities for input examples

save(state_dict_path)

Saves parameters for self.model to file as state dict

train(dataloaders, dataset_sizes[, ...])

Trains a model with specified dataset and training parameters

lipidoz.ml.models.resnet18.ResNet18.__init__(self)

Initializes a new instance of the ResNet18 model

lipidoz.ml.view

lipidoz.ml.view.plot_preml_example(pre_rtmz, ald_rtmz, crg_rtmz, rt, mzs, rt_tol=0.2, mz_range=(1.5, 2.5), figname=None, rgb=True)

Produces a plot of a pre-machine learning example which consists of raw RTMZ array data (arrays of retention time, m/z, and intensities) for a precursor and aldehyde/criegee Oz fragments. Default rt_tol and mz_range correspond to the normal values used for extracting and binning ML data.

Parameters:
pre_rtmz : tuple(numpy.ndarray(float))
ald_rtmz : tuple(numpy.ndarray(float))
crg_rtmz : tuple(numpy.ndarray(float))

raw RTMZ data arrays for precursor and aldehyde/criegee Oz fragments

rt : float

central retention time value to display (same for precursor and Oz fragments)

mzs : tuple(float)

monoisotopic masses for precursor, aldehyde, and criegee

rt_tol : float, default=0.2

tolerance of rt values (relative to rt) to display

mz_range : tuple(float), default=(1.5, 2.5)

lower, upper range of m/z values to display (relative to monoisotopic mass), same for precursor and Oz fragments

figname : str, optional

if provided, save the image to the specified file

rgb : bool, default=True

if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all

lipidoz.ml.view.plot_ml_example(pre_binned, ald_binned, crg_binned, figname=None, rgb=True)

Produces a plot of a machine learning example which consists of binned RTMZ data for a precursor and aldehyde/criegee Oz fragments

Parameters:
pre_binned : numpy.ndarray(float)
ald_binned : numpy.ndarray(float)
crg_binned : numpy.ndarray(float)

binned RTMZ data for precursor and aldehyde/criegee Oz fragments

figname : str, optional

if provided, save the image to the specified file

rgb : bool, default=True

if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all