`lipidoz.workflows`

This module defines the functional components for standard high-level OzID data processing workflows. The functions fall broadly into two categories: those related to isotope distribution analysis and those related to the machine learning-based double bond determination.

Isotope Scoring Target List Format

The isotope scoring workflow expects a target list in .csv format with 3 columns: lipid name, MS adduct, and target retention time. A single header row from the .csv file is always ignored. Lines starting with # are treated as comments and ignored.

Example target list for isotope scoring

lipid,adduct,retention_time
PE(17:0_18:1),[M-H]-,23.70
PE(17:0_20:3),[M-H]-,22.99
PE(17:0_22:4),[M-H]-,23.46
#CE(18:1),[M-H]-,12.34  <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-,23.70
PG(17:0_20:3),[M-H]-,22.99
PG(17:0_22:4),[M-H]-,23.46
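
The parsing rules described above can be sketched in a few lines of Python. This is a minimal illustration and not LipidOz's own parser: the function name `read_target_list` is hypothetical, and the sketch assumes that the header is the first line of the file and that a comment line is any line whose first non-whitespace character is #.

```python
import csv
import io


def read_target_list(lines):
    """Yield (lipid, adduct, retention_time) tuples from target list lines.

    Hypothetical helper for illustration; assumes the header is the first line.
    """
    rows = iter(lines)
    next(rows, None)  # the single header row is always skipped
    # drop blank lines and comment lines before handing the rest to the CSV reader
    kept = (ln for ln in rows if ln.strip() and not ln.lstrip().startswith("#"))
    for row in csv.reader(kept):
        lipid, adduct, rt = (field.strip() for field in row[:3])
        yield lipid, adduct, float(rt)


example = """lipid,adduct,retention_time
PE(17:0_18:1),[M-H]-,23.70
#CE(18:1),[M-H]-,12.34  <- this line is commented out so it will be ignored
PG(17:0_22:4),[M-H]-,23.46
"""
targets = list(read_target_list(io.StringIO(example)))
```

The same function accepts an open file handle in place of the `io.StringIO` object, since both iterate line by line.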

Note

The target list for lipidoz.workflows.run_isotope_scoring_workflow_infusion() has the same format but omits the retention time column. The target list for lipidoz.workflows.run_isotope_scoring_workflow_targeted() has the same format plus two additional columns for the targeted DB indices and positions. See the examples below.

Example target list for isotope scoring (infusion)

lipid,adduct
PE(17:0_18:1),[M-H]-
PE(17:0_20:3),[M-H]-
PE(17:0_22:4),[M-H]-
#CE(18:1),[M-H]-  <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-
PG(17:0_20:3),[M-H]-
PG(17:0_22:4),[M-H]-

Example target list for isotope scoring (targeted)

lipid,adduct,retention_time,db_idx,db_pos
PE(17:0_18:1),[M-H]-,23.70,1,9
PE(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PE(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
#CE(18:1),[M-H]-,12.34,1,9  <- this line is commented out so it will be ignored
# note that multiple target DB indices/positions can be included in one line
# and they are separated by /
PG(17:0_18:1),[M-H]-,23.70,1,9
PG(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PG(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
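
The two extra columns of the targeted format can be unpacked as sketched below. The helper name `parse_db_targets` is hypothetical, and the sketch assumes that the i-th index is paired with the i-th position, which is how the examples above read but is not stated explicitly.

```python
def parse_db_targets(db_idx, db_pos):
    """Split "/"-separated DB indices and positions into (index, position) pairs.

    Hypothetical helper; assumes indices and positions are paired by order.
    """
    indices = [int(value) for value in db_idx.split("/")]
    positions = [int(value) for value in db_pos.split("/")]
    if len(indices) != len(positions):
        raise ValueError("db_idx and db_pos must have the same number of entries")
    return list(zip(indices, positions))


# values taken from the PE(17:0_20:3) row of the example target list
pairs = parse_db_targets("1/2/3", "6/9/12")
```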

Structure of LipidOz Results

LipidOz has multiple workflows for analyzing OzID data in different ways (e.g. isotope distribution analysis, machine learning, hybrid approach), each of which produces its own set of results in the form of extracted/processed data and metadata. The sections below detail the structure of those individual result sets. To organize the different results, an overarching data structure termed lipidoz_results is defined. It is simply a dictionary with sections for storing the results from each of the individual workflows. The structure of lipidoz_results is as follows:

Layout of lipidoz_results dictionary

lipidoz_results = {
    # normal/infusion/targeted variants all get packed into this one
    'isotope_scoring_results': {...isotope_scoring_results...},
    'preml_data': {...preml_data...},
    'ml_data': np.array(...),
    # when DL inference is run, put the predictions
    # and probabilities into arrays
    # and store the name of the parameters file used
    # to run the inference
    'ml_pred_lbls': np.array(...),
    'ml_pred_probs': np.array(...),
    'ml_params_file': 'resnet18_SPLA-ULSP-BTLE_params.pt'
}

Structure of `run_isotope_scoring_workflow` Results

The run_isotope_scoring_workflow function returns a dictionary containing information from double bond determination analyses performed for a set of lipid species defined in a target list. The results are organized into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata about the analysis, including information like input files and tolerances used for data extraction. The 'targets' section contains the analysis results organized hierarchically: first by lipid, then by MS adduct, and finally by target retention time. The results for individual lipid species (defined by a combination of lipid and MS adduct) are stored underneath these sub-sections.

Note

See Structure of score_db_pos_isotope_dist_polyunsat Results for details regarding the organization of the result sections for individual lipid species.

Note

Results from lipidoz.workflows.run_isotope_scoring_workflow_targeted() are the same as for lipidoz.workflows.run_isotope_scoring_workflow(), except that the metadata 'workflow' entry is set to 'isotope_scoring_targeted'.

Example run_isotope_scoring_workflow results dictionary

isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'rt_tol': 0.25,
        'rt_peak_win': 1.5,
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                '23.45min': {
                    'precursor': {
                        'target_mz': 789.0123,
                        'target_rt': 23.45,
                        'xic_peak_rt': 23.45,
                        'xic_peak_ht': 1e5,
                        'xic_peak_fwhm': 0.15,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                        'xic_fit_img': ...,
                        'saturation_corrected': False
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'target_rt': 23.45,
                                    'xic_peak_rt': 23.45,
                                    'xic_peak_ht': 1e4,
                                    'xic_peak_fwhm': 0.25,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'rt_cos_dist': 0.25,
                                    'isotope_dist_img': ...,
                                    'xic_fit_img': ...,
                                    'saturation_corrected': False,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                },
                # more retention times ...
            },
            # more adducts ...
        },
        # more targets ...
    },
}
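
Because the results are deeply nested, it is often convenient to flatten them before inspection. The generator below is a minimal sketch that relies only on the layout shown above. The name `iter_fragments` and the trimmed-down `demo_results` dictionary are illustrative, not part of LipidOz.

```python
def iter_fragments(isotope_scoring_results):
    """Yield one tuple per aldehyde/criegee entry in the nested 'targets' section."""
    for lipid, adducts in isotope_scoring_results["targets"].items():
        for adduct, rts in adducts.items():
            for rt_label, species in rts.items():
                for db_idx, positions in species["fragments"].items():
                    for db_pos, fragments in positions.items():
                        for frag_type in ("aldehyde", "criegee"):
                            # a fragment that was not found is stored as None
                            yield (lipid, adduct, rt_label, db_idx, db_pos,
                                   frag_type, fragments.get(frag_type))


# trimmed-down stand-in for a real results dictionary
demo_results = {
    "metadata": {"workflow": "isotope_scoring"},
    "targets": {
        "PC(16:1_16:0)": {
            "[M+H]+": {
                "23.45min": {
                    "precursor": {"target_mz": 789.0123},
                    "fragments": {
                        1: {9: {"aldehyde": {"mz_ppm": 10.1}, "criegee": None}},
                    },
                },
            },
        },
    },
}

# keep only the fragments that were actually found
found = [
    (lipid, db_idx, db_pos, frag_type)
    for lipid, _, _, db_idx, db_pos, frag_type, data in iter_fragments(demo_results)
    if data is not None
]
```

The traversal does not interpret the retention time label, so it should apply equally to results where that level holds a different label.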

Structure of `run_isotope_scoring_workflow_infusion` Results

The results from the infusion variant of the isotope scoring workflow are very similar to those from the normal version, except any component having to do with retention time is omitted.

Example run_isotope_scoring_workflow_infusion results dictionary

isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring_infusion',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/infusion_ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                'infusion': {  # instead of a retention time the label here is just "infusion"
                    'precursor': {
                        'target_mz': 789.0123,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'isotope_dist_img': ...,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                }
            },
            # more adducts ...
        },
        # more targets ...
    },
}

Structure of `collect_preml_dataset` dataset

The lipidoz.workflows.collect_preml_dataset() function returns a dictionary containing minimally processed RTMZ data for a set of lipid species defined in a target list. The dataset contains extracted data for lipid precursor and aldehyde/criegee OzID fragments for different double bond locations. The dataset is organized into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata about the analysis including information like input files and tolerances used for data extraction. The 'targets' section contains the data for individual lipid species, defined by the lipid, MS adduct, target retention time, double bond index, and double bond position.

Example collect_preml_dataset dataset

pre_ml_dataset = {
    'metadata': {
        'workflow': 'pre_ml',
        'lipidoz_version': '0.4.20',
        'oz_data_file': '../../_data/Ultimate-Splash_NEG_O3_Run-1.mza',
        'target_list_file': 'test_target_list.csv',
        'rt_tol': 0.2,
        'd_label': 5,
        'd_label_in_nl': False,
    },
    'targets': {
        'PE(18:1_17:0)|[M+H]+|23.70min|1|1': {  # <lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>
            'pre_data': ...,  # raw RTMZ arrays for precursor
            'ald_data': ...,  # raw RTMZ arrays for aldehyde OzID fragment
            'crg_data': ...,  # raw RTMZ arrays for criegee OzID fragment
            'pre_mz': mz,  # precursor m/z
            'ald_mz': ald_mz,  # aldehyde OzID fragment m/z
            'crg_mz': crg_mz,  # criegee OzID fragment m/z
            'rt': 23.70,  # target retention time
        },
        # ... data for other targets omitted
    },
}
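
Each key in the 'targets' section packs five fields into one string, so downstream code usually needs to split it apart again. The sketch below follows the `<lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>` pattern shown above. The names `PremlTarget` and `parse_preml_key` are hypothetical, and the sketch assumes the retention time always carries a "min" suffix as in the example.

```python
from typing import NamedTuple


class PremlTarget(NamedTuple):
    lipid: str
    adduct: str
    target_rt: float
    db_idx: int
    db_pos: int


def parse_preml_key(key):
    """Split a pre-ml target key into its five typed components."""
    # split from the right so the lipid name is kept whole
    lipid, adduct, rt, db_idx, db_pos = key.rsplit("|", 4)
    if rt.endswith("min"):
        rt = rt[:-3]  # strip the "min" unit suffix
    return PremlTarget(lipid, adduct, float(rt), int(db_idx), int(db_pos))


target = parse_preml_key("PE(18:1_17:0)|[M+H]+|23.70min|1|1")
```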

Module Reference

Isotope Distribution Analysis

lipidoz.workflows.run_isotope_scoring_workflow(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')

workflow for performing isotope scoring for the determination of db positions. inputs are the data file and target list file, output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-
target retention time

Parameters:

oz_data_file (str)

filename and path for OzID data (.mza format)

target_list_file (str)

filename and path for target list (.csv format)

rt_tol (float)

retention time tolerance (for MS1 data extraction)

rt_peak_win (float)

size of retention time window to extract for fitting retention time peak

mz_tol (float)

m/z tolerance for extracting XICs

d_label (int, optional)

number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)

d_label_in_nl (bool, optional)

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb (function, optional)

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)

info_cb (function, optional)

optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message

early_stop_event (threading.Event, optional)

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag (str, optional)

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb (func, optional)

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization (bool, default=True)

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

rt_correction_func (function, optional)

a function that takes an uncorrected retention time as an argument and returns the corrected retention time

mza_version (str, default='new')

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:

isotope_scoring_results (dict(...)): results dictionary with metadata and scoring information
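
The callback parameters described above can be illustrated with a short sketch. The helper names (`print_progress`, `make_linear_rt_correction`, `run_example`) are hypothetical. The linear form of the retention time correction is only an assumption for illustration, and the tolerance values are taken from the example metadata earlier on this page. The LipidOz import is placed inside `run_example` so that the helpers can be tried without the package or a data file.

```python
def print_progress(lipid, adduct, pos, total):
    """progress_cb: receives lipid name, adduct, current position and total."""
    msg = f"[{pos}/{total}] {lipid} {adduct}"
    print(msg)
    return msg


def make_linear_rt_correction(slope, intercept):
    """Build an rt_correction_func: uncorrected RT in, corrected RT out.

    A linear correction is assumed here purely for illustration.
    """
    def correct(rt):
        return slope * rt + intercept
    return correct


def run_example(oz_data_file, target_list_file):
    """Call the workflow with the callbacks above (requires lipidoz and real data)."""
    from lipidoz.workflows import run_isotope_scoring_workflow
    return run_isotope_scoring_workflow(
        oz_data_file,
        target_list_file,
        rt_tol=0.25,
        rt_peak_win=1.5,
        mz_tol=0.05,
        progress_cb=print_progress,
        info_cb=print,  # info_cb takes a single str message
        rt_correction_func=make_linear_rt_correction(1.0, -0.1),
    )
```

Keeping the callbacks as plain functions makes them easy to check on their own before passing them to a long-running workflow.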

lipidoz.workflows.run_isotope_scoring_workflow_targeted(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')

workflow for performing isotope scoring for the determination of db positions.

! TARGETED VARIANT !

inputs are the data file and target list file, output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-
target retention time
target double bond indices separated by “/” (e.g. “1/1/1/2”)
target double bond positions separated by “/” (e.g. “6/7/9/9”)

Parameters:

oz_data_file (str)

filename and path for OzID data (.mza format)

target_list_file (str)

filename and path for target list (.csv format)

rt_tol (float)

retention time tolerance (for MS1 data extraction)

rt_peak_win (float)

size of retention time window to extract for fitting retention time peak

mz_tol (float)

m/z tolerance for extracting XICs

d_label (int, optional)

number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)

d_label_in_nl (bool, optional)

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb (function, optional)

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)

info_cb (function, optional)

optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message

early_stop_event (threading.Event, optional)

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag (str, optional)

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb (func, optional)

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization (bool, default=True)

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

rt_correction_func (function, optional)

a function that takes an uncorrected retention time as an argument and returns the corrected retention time

mza_version (str, default='new')

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:

isotope_scoring_results (dict(...)): results dictionary with metadata and scoring information

lipidoz.workflows.run_isotope_scoring_workflow_infusion(oz_data_file, target_list_file, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, mza_version='new')

workflow for performing isotope scoring for the determination of db positions from infusion data. Inputs are the data file and target list file, output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)
MS adduct, e.g., [M+H]+ or [M-2H]2-

Parameters:

oz_data_file (str)

filename and path for OzID data (.mza format)

target_list_file (str)

filename and path for target list (.csv format)

mz_tol (float)

m/z tolerance for extracting XICs

d_label (int, optional)

number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)

d_label_in_nl (bool, optional)

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb (function, optional)

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

lipid name (str)
adduct (str)
current position in target list (int)
total lipids in target list (int)

early_stop_event (threading.Event, optional)

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag (str, optional)

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb (func, optional)

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization (bool, default=False)

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

mza_version (str, default='new')

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:

isotope_scoring_results (dict(...)): results dictionary with metadata and scoring information

lipidoz.workflows.save_isotope_scoring_results(isotope_scoring_results, results_file_name)

save the results of the isotope scoring workflow (complete with metadata) to file in pickle format

Parameters:

isotope_scoring_results (dict(...)): results dictionary with metadata and scoring information
results_file_name (str): filename and path to save the results file under, should have .loz file ending (maintains compatibility with lipidoz_gui)
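
Since the results file is written in pickle format, reading it back might look like the sketch below. This assumes the .loz file is a plain pickle of the results dictionary, which is not confirmed here. The function name `load_isotope_scoring_results` is hypothetical, and the round trip uses `pickle.dump` directly as a stand-in for the save function. As with any pickle, only load files from a trusted source.

```python
import os
import pickle
import tempfile


def load_isotope_scoring_results(results_file_name):
    """Load a results dictionary, assuming the file is a plain pickle."""
    with open(results_file_name, "rb") as f:
        return pickle.load(f)


# stand-in round trip: write a small dictionary with pickle.dump, then reload it
demo_results = {"metadata": {"workflow": "isotope_scoring"}, "targets": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_results.loz")
    with open(demo_path, "wb") as f:
        pickle.dump(demo_results, f)
    reloaded = load_isotope_scoring_results(demo_path)
```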

lipidoz.workflows.write_isotope_scoring_report_xlsx(isotope_scoring_results, xlsx_file)

writes results of the isotope scoring workflow to an excel spreadsheet

Parameters:

isotope_scoring_results (dict(...)): results dictionary from isotope scoring workflow
xlsx_file (str): filename to save report under

Machine Learning

lipidoz.workflows.collect_preml_dataset(oz_data_file, target_list_file, rt_tol, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, rt_correction_func=None, mza_version='new')

collects a dataset which can be used in training ML models. The dataset is a dictionary with metadata and minimally processed RTMZ data. The RTMZ data is extracted in a window with the following bounds:

target RT +/- rt_tol – this should be set wide enough to accommodate the chromatographic peak
target m/z (M isotope) - 0.5, target m/z (M isotope) + 2.5 – this covers the M, M+1, M+2 isotopes
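
As a worked example of the bounds listed above, the sketch below computes the extraction window for a single target. The function name `rtmz_window` is hypothetical, and the input values are borrowed from the examples elsewhere on this page.

```python
def rtmz_window(target_rt, rt_tol, target_mz):
    """Return ((rt_min, rt_max), (mz_min, mz_max)) for the RTMZ extraction window."""
    rt_bounds = (target_rt - rt_tol, target_rt + rt_tol)
    # -0.5 / +2.5 around the M isotope covers the M, M+1 and M+2 isotopes
    mz_bounds = (target_mz - 0.5, target_mz + 2.5)
    return rt_bounds, mz_bounds


# target RT 23.70 min with rt_tol 0.2, precursor m/z 789.0123
(rt_min, rt_max), (mz_min, mz_max) = rtmz_window(23.70, 0.2, 789.0123)
```

For these inputs the window spans roughly 23.5 to 23.9 min in retention time and 788.5123 to 791.5123 in m/z.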

Parameters:

oz_data_file (str): filename and path for OzID data (.mza format)
target_list_file (str): filename and path for target list (.csv format)
rt_tol (float): retention time tolerance, defines data extraction window
d_label (int, optional): number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
d_label_in_nl (bool, optional): if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
debug_flag (str, optional): specifies how to dispatch the message and/or plot, None to do nothing
debug_cb (func, optional): callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'
ignore_preferred_ionization (bool, default=False): whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state
rt_correction_func (function, optional): a function that takes an uncorrected retention time as an argument and returns the corrected retention time
mza_version (str, default='new'): temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:

pre_ml_dataset (dict(...)): dataset used for assembling ML training data

Note

See Structure of collect_preml_dataset dataset for details regarding the organization of the pre-ml dataset.

lipidoz.workflows.convert_multi_preml_datasets_labeled(preml_files, ml_target_files, rt_sampling_augment=True, normalize_intensity=True, rt_corr_funcs=None, debug_flag=None, debug_cb=None)

iterates through pairs of pre-ml dataset files and ml target lists, splits the pre-ml datasets into True/False examples, then converts to binned ml datasets. True/False examples from all pairs are combined and returned as two arrays of ML data

Parameters:

preml_files (list(str)): paths to pre-ml dataset files
ml_target_files (list(str)): paths to target list .csv files
rt_sampling_augment (bool, default=True): re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)
normalize_intensity (bool, default=True): normalize the intensities in each 2D RTMZ array so that they are in the range 0->1
rt_corr_funcs (list(function), optional): list of RT correction functions, one per ml_target_file, applies retention time corrections
debug_flag (str, optional): specifies how to dispatch the message and/or plot, None to do nothing
debug_cb (func, optional): callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:

true_ml_data (numpy.ndarray)
false_ml_data (numpy.ndarray): arrays of binned data for ML, split by annotation, from all pairs, each with shape (N, 3, 24, 400), where N is the number of True or False examples in the array

lipidoz.workflows.convert_multi_preml_datasets_unlabeled(preml_files, normalize_intensity=True, debug_flag=None, debug_cb=None)

iterates through pre-ml dataset files and converts them to binned ml datasets (for unlabeled data)

Parameters:

preml_files (list(str)): paths to pre-ml dataset files
normalize_intensity (bool, default=True): normalize the intensities in each 2D RTMZ array so that they are in the range 0->1
debug_flag (str, optional): specifies how to dispatch the message and/or plot, None to do nothing
debug_cb (func, optional): callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:

ml_data (numpy.ndarray): array of binned data for ML with shape (N, 3, 24, 400), where N is the number of training examples

Hybrid Workflow

Note

The hybrid workflow uses ML inference to prioritize targets for full analysis using the targeted variant of the isotope distribution analysis. See Structure of LipidOz Results for details on how the data from these different steps is organized in the lipidoz_results dictionary that is returned by this function.

lipidoz.workflows.hybrid_deep_learning_and_isotope_scoring(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, dl_params_file, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None)

A hybrid workflow that uses deep learning inference as a prefilter, then performs the targeted isotope scoring workflow on the double bond positions predicted as True.

Parameters:

oz_data_file (str): filename and path for OzID data (.mza format)
target_list_file (str): filename and path for target list (.csv format)
rt_tol (float): retention time tolerance, defines data extraction window
rt_peak_win (float): size of retention time window to extract for fitting retention time peak
mz_tol (float): m/z tolerance for extracting XICs
dl_params_file (str): pre-trained DL model parameters file
d_label (int, optional): number of deuteriums in deuterium-labeled standards (i.e. SPLASH and Ultimate SPLASH mixes)
d_label_in_nl (bool, optional): if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)
debug_flag (str, optional): specifies how to dispatch the message and/or plot, None to do nothing
debug_cb (func, optional): callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:

lipidoz_results (dict(...)): results from DL prefiltering and targeted isotope scoring analysis
