lipidoz.workflows
This module defines the functional components for standard high-level OzID data processing workflows. The functions fall broadly into two categories: those related to isotope distribution analysis and those related to the machine learning-based double bond determination.
Isotope Scoring Target List Format
The isotope scoring workflow expects a target list in .csv format with 3 columns: lipid name, MS adduct, and target retention time. A single header row from the .csv file is always ignored. Lines starting with # are treated as comments and ignored.
lipid,adduct,retention_time
PE(17:0_18:1),[M-H]-,23.70
PE(17:0_20:3),[M-H]-,22.99
PE(17:0_22:4),[M-H]-,23.46
#CE(18:1),[M-H]-,12.34 <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-,23.70
PG(17:0_20:3),[M-H]-,22.99
PG(17:0_22:4),[M-H]-,23.46
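The parsing rules described above (one header row skipped, lines starting with # ignored) can be sketched as follows. This is only an illustration and not LipidOz's own parser: `read_target_list` is a hypothetical helper, and it assumes the header row is the first line of the file.

```python
import csv


def read_target_list(lines):
    """Return (lipid, adduct, retention_time) tuples from target-list lines.

    Hypothetical helper for illustration; assumes the first line is the header.
    """
    targets = []
    for i, line in enumerate(lines):
        if i == 0:
            continue  # the single header row is always skipped
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        lipid, adduct, rt = next(csv.reader([line]))
        targets.append((lipid, adduct, float(rt)))
    return targets


example = """lipid,adduct,retention_time
PE(17:0_18:1),[M-H]-,23.70
#CE(18:1),[M-H]-,12.34 <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-,23.70
"""
targets = read_target_list(example.splitlines())
print(targets)
```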
Note
The target list for lipidoz.workflows.run_isotope_scoring_workflow_infusion() uses the same format
but omits the retention time column. The target list for lipidoz.workflows.run_isotope_scoring_workflow_targeted()
uses the same format with two additional columns for the targeted DB indices and positions.
See the examples below.
lipid,adduct
PE(17:0_18:1),[M-H]-
PE(17:0_20:3),[M-H]-
PE(17:0_22:4),[M-H]-
#CE(18:1),[M-H]- <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-
PG(17:0_20:3),[M-H]-
PG(17:0_22:4),[M-H]-
lipid,adduct,retention_time,db_idx,db_pos
PE(17:0_18:1),[M-H]-,23.70,1,9
PE(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PE(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
#CE(18:1),[M-H]-,12.34,1,9 <- this line is commented out so it will be ignored
# note that multiple target DB indices/positions can be included in one line
# and they are separated by /
PG(17:0_18:1),[M-H]-,23.70,1,9
PG(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PG(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
Structure of LipidOz Results
LipidOz now has multiple workflows for analyzing OzID data in different ways (e.g. isotope distribution analysis, machine learning, hybrid approach), each of which produces its own set of results in the form of extracted/processed data and metadata. The sections below detail the structure of those individual result sets. To organize the different results, an overarching data structure termed lipidoz_results is defined. It is simply a dictionary with sections for storing the results from each of the individual workflows. The structure of lipidoz_results is as follows:
lipidoz_results dictionary
lipidoz_results = {
    # normal/infusion/targeted variants all get packed into this one
    'isotope_scoring_results': {...isotope_scoring_results...},
    'preml_data': {...preml_data...},
    'ml_data': np.array(...),
    # when DL inference is run, put the predictions
    # and probabilities into arrays
    # and store the name of the parameters file used
    # to run the inference
    'ml_pred_lbls': np.array(...),
    'ml_pred_probs': np.array(...),
    'ml_params_file': 'resnet18_SPLA-ULSP-BTLE_params.pt'
}
Structure of run_isotope_scoring_workflow Results
The run_isotope_scoring_workflow function returns a dictionary containing information from double bond
determination analyses performed for a set of lipid species defined in a target list. The results are organized
into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata
about the analysis, including information like input files and tolerances used for data extraction. The
'targets' section contains the analysis results organized in a hierarchical fashion, first by lipid, then by
MS adduct, and finally by target retention time. The results for individual lipid species (defined by a combination of
lipid and MS adduct) are stored underneath these sub-sections.
Note
See Structure of score_db_pos_isotope_dist_polyunsat Results for details regarding the organization of the result sections for individual lipid species.
Note
Results from lipidoz.workflows.run_isotope_scoring_workflow_targeted() are the same as for
lipidoz.workflows.run_isotope_scoring_workflow(), except that the metadata 'workflow' entry is
set to 'isotope_scoring_targeted'.
run_isotope_scoring_workflow results dictionary
isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'rt_tol': 0.25,
        'rt_peak_win': 1.5,
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                '23.45min': {
                    'precursor': {
                        'target_mz': 789.0123,
                        'target_rt': 23.45,
                        'xic_peak_rt': 23.45,
                        'xic_peak_ht': 1e5,
                        'xic_peak_fwhm': 0.15,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                        'xic_fit_img': ...,
                        'saturation_corrected': False
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'target_rt': 23.45,
                                    'xic_peak_rt': 23.45,
                                    'xic_peak_ht': 1e4,
                                    'xic_peak_fwhm': 0.25,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'rt_cos_dist': 0.25,
                                    'isotope_dist_img': ...,
                                    'xic_fit_img': ...,
                                    'saturation_corrected': False,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                },
                # more retention times ...
            },
            # more adducts ...
        },
        # more targets ...
    },
}
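Because the results are nested by lipid, MS adduct, retention time, DB index and DB position, a small generator can be handy for flattening them into one record per fragment. The sketch below is not part of LipidOz: `iter_fragment_results` is a hypothetical helper, and the example dictionary is a hand-made stand-in that only follows the layout shown above.

```python
def iter_fragment_results(isotope_scoring_results):
    """Yield one flat record per lipid/adduct/RT/db index/db position/fragment type."""
    for lipid, adducts in isotope_scoring_results["targets"].items():
        for adduct, rts in adducts.items():
            for rt_label, species in rts.items():
                for db_idx, db_positions in species["fragments"].items():
                    for db_pos, fragments in db_positions.items():
                        for frag_type in ("aldehyde", "criegee"):
                            # a fragment that was not found is stored as None
                            frag = fragments.get(frag_type)
                            yield {
                                "lipid": lipid,
                                "adduct": adduct,
                                "rt": rt_label,
                                "db_idx": db_idx,
                                "db_pos": db_pos,
                                "frag_type": frag_type,
                                "found": frag is not None,
                                "mz_ppm": None if frag is None else frag["mz_ppm"],
                            }


# hand-made stand-in following the documented layout (values are placeholders)
example_results = {
    "metadata": {"workflow": "isotope_scoring"},
    "targets": {
        "PC(16:1_16:0)": {
            "[M+H]+": {
                "23.45min": {
                    "precursor": {"target_mz": 789.0123, "mz_ppm": 10.1},
                    "fragments": {
                        1: {
                            9: {
                                "aldehyde": {"target_mz": 234.5678, "mz_ppm": 10.1},
                                "criegee": None,
                            },
                        },
                    },
                },
            },
        },
    },
}

records = list(iter_fragment_results(example_results))
for record in records:
    print(record)
```

The same traversal should also apply to infusion results, since they keep the same nesting with the label "infusion" in place of a retention time.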
Structure of run_isotope_scoring_workflow_infusion Results
The results from the infusion variant of the isotope scoring workflow are very similar to those from the normal version, except any component having to do with retention time is omitted.
run_isotope_scoring_workflow_infusion results dictionary
isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring_infusion',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/infusion_ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                'infusion': {  # instead of a retention time the label here is just "infusion"
                    'precursor': {
                        'target_mz': 789.0123,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'isotope_dist_img': ...,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                }
            },
            # more adducts ...
        },
        # more targets ...
    },
}
Structure of collect_preml_dataset dataset
The lipidoz.workflows.collect_preml_dataset() function returns a dictionary containing minimally
processed RTMZ data for a set of
lipid species defined in a target list. The dataset contains extracted data for lipid precursor and aldehyde/criegee
OzID fragments for different double bond locations. The dataset is organized
into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata
about the analysis including information like input files and tolerances used for data extraction.
The 'targets' section contains the data for individual lipid species, defined by the lipid, MS adduct, target
retention time, double bond index, and double bond position.
collect_preml_dataset dataset
pre_ml_dataset = {
    'metadata': {
        'workflow': 'pre_ml',
        'lipidoz_version': '0.4.20',
        'oz_data_file': '../../_data/Ultimate-Splash_NEG_O3_Run-1.mza',
        'target_list_file': 'test_target_list.csv',
        'rt_tol': 0.2,
        'd_label': 5,
        'd_label_in_nl': False,
    },
    'targets': {
        'PE(18:1_17:0)|[M+H]+|23.70min|1|1': {  # <lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>
            'pre_data': ...,  # raw RTMZ arrays for precursor
            'ald_data': ...,  # raw RTMZ arrays for aldehyde OzID fragment
            'crg_data': ...,  # raw RTMZ arrays for criegee OzID fragment
            'pre_mz': mz,  # precursor m/z
            'ald_mz': ald_mz,  # aldehyde OzID fragment m/z
            'crg_mz': crg_mz,  # criegee OzID fragment m/z
            'rt': 23.70,  # target retention time
        },
        # ... data for other targets omitted
    },
}
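The target keys in this dataset pack five fields into one "|"-separated string. A minimal sketch for unpacking such a key is shown below; `parse_preml_key` is a hypothetical helper, and it assumes the key layout is exactly the one shown in the comment above (lipid, adduct, retention time with a "min" suffix, DB index, DB position).

```python
def parse_preml_key(key):
    """Split '<lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>' into typed fields.

    Hypothetical helper; assumes the retention time carries a 'min' suffix.
    """
    lipid, adduct, rt, db_idx, db_pos = key.split("|")
    if rt.endswith("min"):
        rt = rt[: -len("min")]
    return lipid, adduct, float(rt), int(db_idx), int(db_pos)


fields = parse_preml_key("PE(18:1_17:0)|[M+H]+|23.70min|1|1")
print(fields)
```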
Module Reference
Isotope Distribution Analysis
- lipidoz.workflows.run_isotope_scoring_workflow(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')
Workflow for performing isotope scoring for the determination of db positions. Inputs are the data file and target list file; output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):
lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-
target retention time
- Parameters:
- oz_data_file
str filename and path for OzID data (.mza format)
- target_list_file
str filename and path for target list (.csv format)
- rt_tol
float retention time tolerance (for MS1 data extraction)
- rt_peak_win
float size of retention time window to extract for fitting retention time peak
- mz_tol
float m/z tolerance for extracting XICs
- d_label
int, optional number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
- d_label_in_nl
bool, optional if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
- progress_cb
function, optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):
lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)
- info_cb
function, optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message
- early_stop_event
threading.Event, optional When the workflow is running in its own thread and this event gets set, processing is stopped gracefully
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- ignore_preferred_ionization
bool, default=False whether to ignore cases where a lipid/adduct combination violates the lipid class’ preferred ionization state
- rt_correction_func
function, optional provide a function that takes an uncorrected retention time as an argument then returns the corrected retention time
- mza_version
str, default='new' temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.
- Returns:
- isotope_scoring_results
dict(...) results dictionary with metadata and scoring information
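The callback and correction parameters described above could be prepared along the following lines. This is a sketch only: the callback bodies, the linear retention time correction and its coefficients are made up for illustration, and the file names and tolerances are borrowed from the example metadata earlier on this page. The actual call is left commented out because it requires LipidOz and a real .mza file.

```python
import threading


def format_progress(lipid, adduct, position, total):
    """Build a progress message from the four documented callback arguments."""
    return f"[{position}/{total}] {lipid} {adduct}"


def progress_cb(lipid, adduct, position, total):
    # called once per processed lipid species
    print(format_progress(lipid, adduct, position, total))


def info_cb(message):
    # receives a single str info message
    print("INFO:", message)


def rt_correction(uncorrected_rt):
    # hypothetical linear correction; the coefficients are placeholders
    return 1.02 * uncorrected_rt - 0.15


# setting this event from another thread asks the workflow to stop gracefully
stop_event = threading.Event()

workflow_kwargs = {
    "oz_data_file": "data/ozid_data_file.mza",
    "target_list_file": "a_target_list.csv",
    "rt_tol": 0.25,
    "rt_peak_win": 1.5,
    "mz_tol": 0.05,
    "progress_cb": progress_cb,
    "info_cb": info_cb,
    "early_stop_event": stop_event,
    "rt_correction_func": rt_correction,
}

# from lipidoz.workflows import run_isotope_scoring_workflow
# isotope_scoring_results = run_isotope_scoring_workflow(**workflow_kwargs)
```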
- lipidoz.workflows.run_isotope_scoring_workflow_targeted(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')
Workflow for performing isotope scoring for the determination of db positions.
! TARGETED VARIANT !
Inputs are the data file and target list file; output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):
lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-
target retention time
target double bond indices separated by “/” (e.g. “1/1/1/2”)
target double bond positions separated by “/” (e.g. “6/7/9/9”)
- Parameters:
- oz_data_file
str filename and path for OzID data (.mza format)
- target_list_file
str filename and path for target list (.csv format)
- rt_tol
float retention time tolerance (for MS1 data extraction)
- rt_peak_win
float size of retention time window to extract for fitting retention time peak
- mz_tol
float m/z tolerance for extracting XICs
- d_label
int, optional number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
- d_label_in_nl
bool, optional if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
- progress_cb
function, optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):
lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)
- info_cb
function, optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message
- early_stop_event
threading.Event, optional When the workflow is running in its own thread and this event gets set, processing is stopped gracefully
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- ignore_preferred_ionization
bool, default=False whether to ignore cases where a lipid/adduct combination violates the lipid class’ preferred ionization state
- rt_correction_func
function, optional provide a function that takes an uncorrected retention time as an argument then returns the corrected retention time
- mza_version
str, default='new' temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.
- Returns:
- isotope_scoring_results
dict(...) results dictionary with metadata and scoring information
- lipidoz.workflows.run_isotope_scoring_workflow_infusion(oz_data_file, target_list_file, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, mza_version='new')
Workflow for performing isotope scoring for the determination of db positions from infusion data. Inputs are the data file and target list file; output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):
lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-
- Parameters:
- oz_data_file
str filename and path for OzID data (.mza format)
- target_list_file
str filename and path for target list (.csv format)
- mz_tol
float m/z tolerance for extracting XICs
- d_label
int, optional number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
- d_label_in_nl
bool, optional if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
- progress_cb
function, optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):
lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)
- early_stop_event
threading.Event, optional When the workflow is running in its own thread and this event gets set, processing is stopped gracefully
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- ignore_preferred_ionization
bool, default=False whether to ignore cases where a lipid/adduct combination violates the lipid class’ preferred ionization state
- mza_version
str, default='new' temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.
- Returns:
- isotope_scoring_results
dict(...) results dictionary with metadata and scoring information
- lipidoz.workflows.save_isotope_scoring_results(isotope_scoring_results, results_file_name)
save the results of the isotope scoring workflow (complete with metadata) to file in pickle format
- Parameters:
- isotope_scoring_results
dict(...) results dictionary with metadata and scoring information
- results_file_name
str filename and path to save the results file under, should have .loz file ending (maintains compatibility with
lipidoz_gui)
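Since the results are saved in pickle format, reading a .loz file back might look like the sketch below. It assumes the file holds a single pickled results dictionary, which the description above suggests but does not spell out; `load_isotope_scoring_results` is a hypothetical helper, and the round trip uses a small stand-in dictionary rather than real workflow output.

```python
import os
import pickle
import tempfile


def load_isotope_scoring_results(results_file_name):
    """Read a results dictionary back from a pickle-format .loz file.

    Hypothetical helper; assumes the file contains one pickled dict.
    """
    with open(results_file_name, "rb") as f:
        return pickle.load(f)


# round trip with a stand-in dictionary (not real workflow output)
stand_in = {"metadata": {"workflow": "isotope_scoring"}, "targets": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_results.loz")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    loaded = load_isotope_scoring_results(path)

print(loaded["metadata"]["workflow"])
```

As with any pickle file, only results files from a trusted source should be loaded, because unpickling can execute arbitrary code.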
- lipidoz.workflows.write_isotope_scoring_report_xlsx(isotope_scoring_results, xlsx_file)
writes results of the isotope scoring workflow to an excel spreadsheet
- Parameters:
- isotope_scoring_results
dict(...) results dictionary from isotope scoring workflow
- xlsx_file
str filename to save report under
Machine Learning
- lipidoz.workflows.collect_preml_dataset(oz_data_file, target_list_file, rt_tol, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, rt_correction_func=None, mza_version='new')
collects a dataset which can be used in training ML models. The dataset is a dictionary with metadata and minimally processed RTMZ data. The RTMZ data is extracted in a window with the following bounds:
target RT +/- rt_tol – this should be set wide enough to accommodate the chromatographic peak
target m/z (M isotope) - 0.5, target m/z (M isotope) + 2.5 – this covers the M, M+1, M+2 isotopes
- Parameters:
- oz_data_file
str filename and path for OzID data (.mza format)
- target_list_file
str filename and path for target list (.csv format)
- rt_tol
float retention time tolerance, defines data extraction window
- d_label
int, optional number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
- d_label_in_nl
bool, optional if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- ignore_preferred_ionization
bool, default=False whether to ignore cases where a lipid/adduct combination violates the lipid class’ preferred ionization state
- rt_correction_func
function, optional provide a function that takes an uncorrected retention time as an argument then returns the corrected retention time
- mza_version
str, default='new' temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.
- Returns:
- pre_ml_dataset
dict(...) dataset used for assembling ML training data
Note
See Structure of collect_preml_dataset dataset for details regarding the organization of the pre-ml dataset.
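The extraction window bounds given in the description of collect_preml_dataset amount to simple arithmetic, sketched below. `extraction_window` is a hypothetical helper written only to make the bounds concrete; it is not a LipidOz function, and the example values are placeholders.

```python
def extraction_window(target_rt, target_mz, rt_tol):
    """Return ((rt_min, rt_max), (mz_min, mz_max)) for one target.

    RT bounds are target RT +/- rt_tol; m/z bounds run from 0.5 below to
    2.5 above the M isotope, which covers the M, M+1 and M+2 isotopes.
    """
    rt_bounds = (target_rt - rt_tol, target_rt + rt_tol)
    mz_bounds = (target_mz - 0.5, target_mz + 2.5)
    return rt_bounds, mz_bounds


rt_bounds, mz_bounds = extraction_window(target_rt=23.70, target_mz=789.0123, rt_tol=0.2)
print(rt_bounds, mz_bounds)
```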
- lipidoz.workflows.convert_multi_preml_datasets_labeled(preml_files, ml_target_files, rt_sampling_augment=True, normalize_intensity=True, rt_corr_funcs=None, debug_flag=None, debug_cb=None)
iterates through pairs of pre-ml dataset files and ml target lists, splits the pre-ml datasets into True/False examples, then converts to binned ml datasets. True/False examples from all pairs are combined and returned as two arrays of ML data
- Parameters:
- preml_files
list(str) paths to pre-ml dataset files
- ml_target_files
list(str) paths to target list .csv files
- rt_sampling_augment
bool, default=True re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)
- normalize_intensity
bool, default=True normalize the intensities in each 2D RTMZ array so that they are in the range 0->1
- rt_corr_funcs
list(function), optional list of RT correction functions, one per ml_target_file, applies retention time corrections
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- Returns:
- true_ml_data
numpy.ndarray
- false_ml_data
numpy.ndarray arrays of binned data for ML, split by annotation, from all pairs with shapes (N, 3, 24, 400), where N is the number of True or False examples in the array
- lipidoz.workflows.convert_multi_preml_datasets_unlabeled(preml_files, normalize_intensity=True, debug_flag=None, debug_cb=None)
iterates through pre-ml dataset files and converts them to binned ml datasets (for unlabeled data)
- Parameters:
- preml_files
list(str) paths to pre-ml dataset files
- normalize_intensity
bool, default=True normalize the intensities in each 2D RTMZ array so that they are in the range 0->1
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- Returns:
- ml_data
numpy.ndarray array of binned data for ML with shape: (N, 3, 24, 400), where N is the number of training examples
Hybrid Workflow
Note
The hybrid workflow uses ML inference to prioritize targets for full analysis using the targeted variant of the isotope distribution analysis. See Structure of LipidOz Results for details on how the data from these different steps is organized in the lipidoz_results dictionary that is returned by this function.
- lipidoz.workflows.hybrid_deep_learning_and_isotope_scoring(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, dl_params_file, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None)
A hybrid workflow that incorporates deep learning inference as a prefilter then performs targeted isotope scoring workflow on predicted True double bond positions
- Parameters:
- oz_data_file
str filename and path for OzID data (.mza format)
- target_list_file
str filename and path for target list (.csv format)
- rt_tol
float retention time tolerance, defines data extraction window
- rt_peak_win
float size of retention time window to extract for fitting retention time peak
- mz_tol
float m/z tolerance for extracting XICs
- dl_params_file
str pre-trained DL model parameters file
- d_label
int, optional number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
- d_label_in_nl
bool, optional if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
- debug_flag
str, optional specifies how to dispatch the message and/or plot, None to do nothing
- debug_cb
func, optional callback function that takes the debugging message as an argument, can be None if debug_flag is not set to ‘textcb’
- Returns:
- lipidoz_results
dict(...) results from DL prefiltering and targeted isotope scoring analysis