lipidoz.ml
This subpackage contains utilities for performing double bond identification using machine learning.
Module Reference
lipidoz.ml.data
- lipidoz.ml.data.load_preml_data(preml_file)
Loads a pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset()) from the specified file.
- Parameters:
- preml_file
str path to pre-ml dataset file
- Returns:
- preml_data
dict(...) pre-ml dataset
- lipidoz.ml.data.load_ml_targets(ml_target_file, rt_corr_func=None)
Loads a list of annotated lipids from a .csv formatted target list with columns:
- lipid – lipid name in standard format (str)
- adduct – MS adduct (str)
- retention time – target retention time (float)
- true_dbidx – known double bond index (int)
- true_dbopos – known double bond position(s), separated by - if multiple at this index, gets unpacked into a list(int) with all double bond positions at that index
The lipid targets are returned grouped by lipid/adduct/retention time, each defined by the information above. Each lipid target contains:
- lipid (str)
- adduct (str)
- retention time (float, rounded to 2 decimal places)
- annotations (dict(int:list(int))), a dict mapping db indices to lists of db positions
- Parameters:
- ml_target_file
str path to target list .csv
- rt_corr_func
func, optional apply rt correction using provided function
- Returns:
- targets
dict(...) dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions)
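The grouping described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name parse_targets, the example rows, and the choice to apply the retention time correction before rounding are all assumptions. Only the column names come from the description above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical target list; the header names follow the columns listed above.
EXAMPLE_CSV = """lipid,adduct,retention time,true_dbidx,true_dbopos
PC(16:0/18:1),[M+H]+,12.3456,2,9
PC(18:1/18:2),[M+H]+,11.5,1,9
PC(18:1/18:2),[M+H]+,11.5,2,9-12
"""


def parse_targets(handle, rt_corr_func=None):
    """Group rows by (lipid, adduct, retention time) -> {db index: [db positions]}."""
    targets = defaultdict(dict)
    for row in csv.DictReader(handle):
        rt = float(row["retention time"])
        if rt_corr_func is not None:
            # Assumption: the correction is applied before rounding.
            rt = rt_corr_func(rt)
        key = (row["lipid"], row["adduct"], round(rt, 2))
        # Multiple positions at one index are separated by "-".
        positions = [int(p) for p in row["true_dbopos"].split("-")]
        targets[key][int(row["true_dbidx"])] = positions
    return dict(targets)


targets = parse_targets(io.StringIO(EXAMPLE_CSV))
```

In this sketch the two rows for the second lipid collapse into a single target whose annotations hold both double bond indices.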
- lipidoz.ml.data.split_true_and_false_preml_data(preml_data, targets)
Splits a pre-ml dataset into true/false annotated examples based on a list of target lipids with annotated double bond positions and indices. The annotated double bond positions and indices determine which entries are put into the true split for a given lipid target; the rest of the entries for that target are put into the false split.
- Parameters:
- preml_data
dict(...) a pre-ml dataset produced by lipidoz.workflows.collect_pre_ml_dataset()
- targets
dict(...) dict mapping (lipid, adduct, retention time) to annotations (known db indices and positions), produced by lipidoz.ml.data.load_ml_targets()
- Returns:
- true_preml_data
dict(str:dict(...))
- false_preml_data
dict(str:dict(...)) new pre-ml datasets with entries in ‘targets’ split by annotation (T/F)
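The splitting rule can be illustrated on a simplified stand-in for one lipid target. The layout below, where entries are keyed by (db index, db position) pairs, is an assumption made for illustration and is not the actual pre-ml dataset structure. The name split_by_annotation is likewise hypothetical.

```python
def split_by_annotation(entries, annotations):
    """Split entries into (true, false) dicts.

    entries:     {(db_idx, db_pos): data}   (hypothetical, simplified layout)
    annotations: {db_idx: [db_pos, ...]}    (as described for the targets)
    """
    true_split, false_split = {}, {}
    for (db_idx, db_pos), data in entries.items():
        # An entry is "true" only if its position is annotated at its index.
        is_true = db_pos in annotations.get(db_idx, [])
        (true_split if is_true else false_split)[(db_idx, db_pos)] = data
    return true_split, false_split


entries = {(1, 7): "a", (1, 9): "b", (2, 9): "c", (2, 12): "d", (2, 6): "e"}
annotations = {1: [9], 2: [9, 12]}
true_split, false_split = split_by_annotation(entries, annotations)
```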
- lipidoz.ml.data.preml_to_ml_data(preml_data, rt_sampling_augment=False, normalize_intensity=True, debug_flag=None, debug_cb=None)
Takes a pre-ml dataset and performs standard binning and grouping of precursor/fragment RTMZ profiles into arrays.
- Parameters:
- preml_data
dict(...) pre-ml dataset (produced by lipidoz.workflows.collect_pre_ml_dataset())
- rt_sampling_augment
bool, default=False re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)
- normalize_intensity
bool, default=True normalize the intensities in each 2D RTMZ array so that they are in the range 0->1
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- Returns:
- ml_data
numpy.ndarray array of binned data for ML with shape: (targets, 3, 24, 400)
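The effect of normalize_intensity can be sketched on a small 2D array. The exact scaling used by the library is not stated here, so the min-max scaling below is an assumption, as are the function name normalize_2d and the handling of a constant array.

```python
def normalize_2d(array_2d):
    """Min-max scale a 2D list of intensities into the range 0 to 1."""
    flat = [value for row in array_2d for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: a constant array maps to all zeros to avoid dividing by zero.
        return [[0.0 for _ in row] for row in array_2d]
    span = hi - lo
    return [[(value - lo) / span for value in row] for row in array_2d]


raw = [[0.0, 50.0], [200.0, 100.0]]
normalized = normalize_2d(raw)
```

Each of the three 24 x 400 arrays of a target (precursor, aldehyde, criegee) would be scaled separately under this reading of "each 2D RTMZ array".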
- lipidoz.ml.data.get_dataloaders_for_ml(true_data, false_data, val_size=0.2, batch_size=128, shuffle=True, random_state=420)
Splits true/false data into training/validation sets and returns torch.utils.data.DataLoader instances for each, along with the corresponding dataset sizes. Includes the torchvision.transforms.ToTensor transform in the datasets.
- Parameters:
- true_data
numpy.ndarray
- false_data
numpy.ndarray arrays of true, false binned data for ML with shapes (N, 3, 24, 400), where N is the number of training examples in each set
- val_size
float, default=0.2 proportion of dataset to split into validation set
- batch_size
int, default=128 batch_size parameter for dataloaders. Given the proportion of True samples in the complete training data (~7.5%), a batch size of 128 should contain around 10 True examples on average
- shuffle
bool, default=True shuffle parameter for dataloaders
- random_state
int, default=420 pRNG seed for deterministic splitting results
- Returns:
- dataloaders
dict(str:torch.utils.data.DataLoader) ‘train’ and ‘validate’ Dataloaders (torch.utils.data.DataLoader)
- dataset_sizes
dict(str:int) ‘train’ and ‘validate’ dataset sizes
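The batch size reasoning and the seeded split can be checked with a few lines of standard-library Python. This sketch does not reproduce how the library performs the split; split_indices is a hypothetical helper that only shows why a fixed seed gives repeatable training/validation sets.

```python
import random


def split_indices(n_examples, val_size=0.2, random_state=420):
    """Return (train, validate) index lists from a seeded shuffle."""
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(random_state).shuffle(indices)
    n_val = int(round(n_examples * val_size))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000)

# With ~7.5% True examples, a batch of 128 holds about 9.6 of them on average.
expected_true_per_batch = 128 * 0.075
```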
lipidoz.ml.models
ML Model Base Class Methods
- lipidoz.ml.models._Model.load(self, state_dict_path)
Loads parameters for self.model. self.model holds an instance of the untrained model, which is updated with the parameters loaded from the state dict and then sent to self.device.
- Parameters:
- state_dict_path
str path to state dict with pre-trained parameters for this model
- lipidoz.ml.models._Model.save(self, state_dict_path)
Saves parameters for self.model to file as a state dict.
- Parameters:
- state_dict_path
str path to state dict to save parameters for this model
- lipidoz.ml.models._Model.train(self, dataloaders, dataset_sizes, criterion=None, optimizer=None, scheduler=None, epochs=32, debug=False, xent_f_t_weights=[0.9, 0.1])
Trains a model with the specified dataset and training parameters.
- Parameters:
- dataloaders
dict(str:torch.utils.data.DataLoader) ‘train’ and ‘validate’ Dataloaders (torch.utils.data.DataLoader)
- dataset_sizes
dict(str:int) ‘train’ and ‘validate’ dataset sizes
- criterion
torch.nn.?, optional loss function for training, if not provided defaults to torch.nn.CrossEntropyLoss with the weight of True examples set to 10% and False to 90% to reflect the imbalance in the training data
- optimizer
torch.optim.Optimizer, optional model optimizer, if not provided defaults to torch.optim.Adam with a learning rate of 0.001
- scheduler
torch.optim.lr_scheduler.StepLR, optional learning rate scheduler, if not provided defaults to decaying the learning rate by 0.1 every 8 epochs
- epochs
int, default=32 number of epochs to train over
- debug
bool, default=False print debugging info
- xent_f_t_weights
list(float), default=[0.9, 0.1] if using the default cross-entropy loss, set weights for [F, T] classes. By default this ratio is 0.9 F to 0.1 T to reflect the approximate imbalance in training examples, but the ratio can be tuned to achieve desired prediction characteristics
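The default schedule and loss weighting can be worked through by hand. The sketch below assumes that "decaying by 0.1" means multiplying the learning rate by 0.1 every 8 epochs, and that the weighted loss is averaged by the sum of the per-example class weights (the weighted mean that PyTorch documents for CrossEntropyLoss). Both function names are hypothetical and nothing here calls PyTorch.

```python
import math


def step_decay_lr(epoch, base_lr=0.001, step_size=8, gamma=0.1):
    """Learning rate in a given epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


def weighted_cross_entropy(logits, labels, class_weights=(0.9, 0.1)):
    """Class-weighted cross entropy; class_weights are ordered [F, T]."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Negative log of the softmax probability of the labeled class.
        log_norm = math.log(sum(math.exp(value) for value in row))
        nll = log_norm - row[label]
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total


# Over the default 32 epochs the rate steps through 1e-3, 1e-4, 1e-5, 1e-6.
lrs = [step_decay_lr(epoch) for epoch in (0, 7, 8, 16, 24, 31)]

# Uninformative logits give a loss of log(2) regardless of the weights.
loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

Changing class_weights shifts how much each class contributes to the averaged loss, which is the knob xent_f_t_weights exposes.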
- lipidoz.ml.models._Model.predict(self, X)
Predicts class labels for input examples.
- Parameters:
- X
numpy.ndarray array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also accepted; it is automatically transposed to the proper shape
- Returns:
- y
numpy.ndarray array of predictions, 0 for False and 1 for True, with shape (N,) where N is the number of examples in the set
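The axis reordering between the two accepted shapes can be shown on nested lists. This is only an illustration of the index mapping: with NumPy arrays the same reordering is np.transpose(X, (0, 2, 3, 1)), and the helper name channels_first_to_last and the tiny array sizes are assumptions.

```python
def channels_first_to_last(batch):
    """Reorder nested lists from (N, C, H, W) to (N, H, W, C)."""
    out = []
    for example in batch:
        n_channels = len(example)
        height = len(example[0])
        width = len(example[0][0])
        out.append([
            [[example[c][h][w] for c in range(n_channels)] for w in range(width)]
            for h in range(height)
        ])
    return out


# Tiny stand-in for one (3, 24, 400) example: 3 channels, 2 rows, 4 columns.
# Each value encodes its own (channel, row, column) position.
batch = [[[[c * 100 + h * 10 + w for w in range(4)] for h in range(2)] for c in range(3)]]
converted = channels_first_to_last(batch)
```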
- lipidoz.ml.models._Model.predict_proba(self, X)
Predicts class probabilities for input examples.
- Parameters:
- X
numpy.ndarray array of input data for ML with shape (N, 24, 400, 3), where N is the number of examples in the set. Shape (N, 3, 24, 400) (as in pre-ml data) is also accepted; it is automatically transposed to the proper shape
- Returns:
- y
numpy.ndarray array of label (T/F) probabilities, with shape (N, 2) where N is the number of examples in the set
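One plausible way to relate the (N, 2) probabilities to 0/1 labels is to pick the more probable class. The sketch below assumes the columns are ordered [False, True], matching the [F, T] ordering of xent_f_t_weights, and that ties resolve to False. Neither detail is confirmed here, and labels_from_probabilities is a hypothetical helper.

```python
def labels_from_probabilities(probabilities):
    """Map rows of [p_false, p_true] to labels: 0 for False, 1 for True."""
    return [1 if p_true > p_false else 0 for p_false, p_true in probabilities]


probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = labels_from_probabilities(probs)
```

Working from the probabilities instead of the labels also allows a stricter or looser cutoff than the default comparison, if that suits the application.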
ResNet18 model
- class lipidoz.ml.models.resnet18.ResNet18
Model based on pre-trained ResNet18
Methods
- load(state_dict_path) Loads parameters for self.model
- predict(X) Predicts class labels for input examples
- predict_proba(X) Predicts class probabilities for input examples
- save(state_dict_path) Saves parameters for self.model to file as a state dict
- train(dataloaders, dataset_sizes[, ...]) Trains a model with the specified dataset and training parameters
- lipidoz.ml.models.resnet18.ResNet18.__init__(self)
Initializes a new instance of the ResNet18 model
lipidoz.ml.view
- lipidoz.ml.view.plot_preml_example(pre_rtmz, ald_rtmz, crg_rtmz, rt, mzs, rt_tol=0.2, mz_range=(1.5, 2.5), figname=None, rgb=True)
Produces a plot of a pre-machine learning example which consists of raw RTMZ array data (arrays of retention time, m/z, and intensities) for a precursor and aldehyde/criegee Oz fragments. Default rt_tol and mz_range correspond to the normal values used for extracting and binning ML data.
- Parameters:
- pre_rtmz
tuple(numpy.ndarray(float))
- ald_rtmz
tuple(numpy.ndarray(float))
- crg_rtmz
tuple(numpy.ndarray(float)) raw RTMZ data arrays for precursor and aldehyde/criegee Oz fragments
- rt
float central retention time value to display (same for precursor and Oz fragments)
- mzs
tuple(float) monoisotopic masses for precursor, aldehyde, and criegee
- rt_tol
float, default=0.2 tolerance of rt values (relative to rt) to display
- mz_range
tuple(float), default=(1.5, 2.5) lower, upper range of m/z values to display (relative to monoisotopic mass), same for precursor and Oz fragments
- figname
str, optional if provided, save the image to the specified file
- rgb
bool, default=True if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all
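The display bounds implied by rt_tol and mz_range can be sketched as simple arithmetic. The reading below, where the lower m/z bound is subtracted from and the upper bound added to the monoisotopic mass, is an assumption about how "relative to monoisotopic mass" is applied. The function name display_window and the example values are hypothetical.

```python
def display_window(rt, mz, rt_tol=0.2, mz_range=(1.5, 2.5)):
    """Return ((rt_min, rt_max), (mz_min, mz_max)) for one panel."""
    mz_below, mz_above = mz_range
    # Assumption: the window extends mz_below under and mz_above over the mass.
    return (rt - rt_tol, rt + rt_tol), (mz - mz_below, mz + mz_above)


rt_window, mz_window = display_window(rt=10.0, mz=760.5)
```

The same rt window would apply to all three panels, while each panel would get its own m/z window from the corresponding entry in mzs.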
- lipidoz.ml.view.plot_ml_example(pre_binned, ald_binned, crg_binned, figname=None, rgb=True)
Produces a plot of a machine learning example which consists of binned RTMZ data for a precursor and aldehyde/criegee Oz fragments
- Parameters:
- pre_binned
numpy.ndarray(float)
- ald_binned
numpy.ndarray(float)
- crg_binned
numpy.ndarray(float) binned RTMZ data for precursor and aldehyde/criegee Oz fragments
- figname
str, optional if provided, save the image to the specified file
- rgb
bool, default=True if True, use Reds, Greens, Blues color scales for each panel, otherwise use viridis for all