lipidoz.workflows

This module defines the functional components for standard high-level OzID data processing workflows. The functions fall broadly into two categories: those related to isotope distribution analysis and those related to the machine learning-based double bond determination.

Isotope Scoring Target List Format

The isotope scoring workflow expects a target list in .csv format with 3 columns: lipid name, MS adduct, and target retention time. A single header row from the .csv file is always ignored. Lines starting with # are treated as comments and ignored.

Example target list for isotope scoring
lipid,adduct,retention_time
PE(17:0_18:1),[M-H]-,23.70
PE(17:0_20:3),[M-H]-,22.99
PE(17:0_22:4),[M-H]-,23.46
#CE(18:1),[M-H]-,12.34  <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-,23.70
PG(17:0_20:3),[M-H]-,22.99
PG(17:0_22:4),[M-H]-,23.46
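
The format rules above (one skipped header row, `#` comment lines ignored, three columns) can be illustrated with a small reader. This is only a sketch and not the parser LipidOz itself uses: the function name `read_target_list` is hypothetical, and it assumes comment lines are dropped before the header row is skipped.

```python
import csv
import io


def read_target_list(lines):
    """Yield (lipid, adduct, retention_time) tuples from target list lines.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    # assumption: comment lines are removed before the header row is skipped
    data_lines = (ln for ln in lines if not ln.lstrip().startswith("#"))
    reader = csv.reader(data_lines)
    next(reader, None)  # the single header row is always ignored
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        yield row[0], row[1], float(row[2])


example = io.StringIO(
    "lipid,adduct,retention_time\n"
    "PE(17:0_18:1),[M-H]-,23.70\n"
    "#CE(18:1),[M-H]-,12.34  <- commented out\n"
    "PG(17:0_20:3),[M-H]-,22.99\n"
)
targets = list(read_target_list(example))
```

Here `targets` holds only the two uncommented rows, with the retention times converted to floats.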

Note

The target list for lipidoz.workflows.run_isotope_scoring_workflow_infusion() has the same format but omits the retention time column. The target list for lipidoz.workflows.run_isotope_scoring_workflow_targeted() has the same format plus two additional columns for the targeted DB indices and positions. See examples below.

Example target list for isotope scoring (infusion)
lipid,adduct
PE(17:0_18:1),[M-H]-
PE(17:0_20:3),[M-H]-
PE(17:0_22:4),[M-H]-
#CE(18:1),[M-H]-  <- this line is commented out so it will be ignored
PG(17:0_18:1),[M-H]-
PG(17:0_20:3),[M-H]-
PG(17:0_22:4),[M-H]-

Example target list for isotope scoring (targeted)
lipid,adduct,retention_time,db_idx,db_pos
PE(17:0_18:1),[M-H]-,23.70,1,9
PE(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PE(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
#CE(18:1),[M-H]-,12.34,1,9  <- this line is commented out so it will be ignored
# note that multiple target DB indices/positions can be included in one line
# and they are separated by /
PG(17:0_18:1),[M-H]-,23.70,1,9
PG(17:0_20:3),[M-H]-,22.99,1/2/3,6/9/12
PG(17:0_22:4),[M-H]-,23.46,1/2/3/4,3/6/9/12
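
The `/`-separated `db_idx` and `db_pos` fields in the targeted format can be split into index/position pairs as sketched below. The helper name `parse_db_targets` is hypothetical, and the positional pairing (first index with first position, and so on) is an assumption based on the examples above rather than a statement about how LipidOz parses these columns.

```python
def parse_db_targets(db_idx_field, db_pos_field):
    """Pair up "/"-separated DB indices and positions from one target list row.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    idxs = [int(x) for x in db_idx_field.split("/")]
    poss = [int(x) for x in db_pos_field.split("/")]
    if len(idxs) != len(poss):
        raise ValueError("db_idx and db_pos must have the same number of entries")
    # assumption: entries are paired by their order in the two fields
    return list(zip(idxs, poss))


# the PE(17:0_20:3) row from the example above
pairs = parse_db_targets("1/2/3", "6/9/12")
```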

Structure of LipidOz Results

LipidOz has multiple workflows for analyzing OzID data in different ways (e.g., isotope distribution analysis, machine learning, hybrid approach), each of which produces its own set of results in the form of extracted/processed data and metadata. The sections below detail the structure of those individual result sets. To organize the different results, an overarching data structure termed lipidoz_results is defined. It is simply a dictionary with sections for storing the results from each of the individual workflows. The structure of lipidoz_results is as follows:

Layout of lipidoz_results dictionary
lipidoz_results = {
    # normal/infusion/targeted variants all get packed into this one
    'isotope_scoring_results': {...isotope_scoring_results...},
    'preml_data': {...preml_data...},
    'ml_data': np.array(...),
    # when DL inference is run, put the predictions
    # and probabilities into arrays
    # and store the name of the parameters file used
    # to run the inference
    'ml_pred_lbls': np.array(...),
    'ml_pred_probs': np.array(...),
    'ml_params_file': 'resnet18_SPLA-ULSP-BTLE_params.pt'
}

Structure of run_isotope_scoring_workflow Results

The run_isotope_scoring_workflow function returns a dictionary containing information from double bond determination analyses performed for a set of lipid species defined in a target list. The results are organized into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata about the analysis, including information like input files and tolerances used for data extraction. The 'targets' section contains the analysis results organized hierarchically: first by lipid, then by MS adduct, and finally by target retention time. The results for individual lipid species (defined by a combination of lipid and MS adduct) are stored underneath these sub-sections.

Note

See Structure of score_db_pos_isotope_dist_polyunsat Results for details regarding the organization of the result sections for individual lipid species.

Note

Results from lipidoz.workflows.run_isotope_scoring_workflow_targeted() are the same as for lipidoz.workflows.run_isotope_scoring_workflow(), except the metadata 'workflow' entry will be set to 'isotope_scoring_targeted'.

Example run_isotope_scoring_workflow results dictionary
isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'rt_tol': 0.25,
        'rt_peak_win': 1.5,
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                '23.45min': {
                    'precursor': {
                        'target_mz': 789.0123,
                        'target_rt': 23.45,
                        'xic_peak_rt': 23.45,
                        'xic_peak_ht': 1e5,
                        'xic_peak_fwhm': 0.15,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                        'xic_fit_img': ...,
                        'saturation_corrected': False
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'target_rt': 23.45,
                                    'xic_peak_rt': 23.45,
                                    'xic_peak_ht': 1e4,
                                    'xic_peak_fwhm': 0.25,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'rt_cos_dist': 0.25,
                                    'isotope_dist_img': ...,
                                    'xic_fit_img': ...,
                                    'saturation_corrected': False,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                },
                # more retention times ...
            },
            # more adducts ...
        },
        # more targets ...
    },
}
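
Because the results are nested by lipid, adduct, retention time, DB index, DB position, and fragment type, it is often convenient to flatten them before further analysis. The generator below is a sketch of one way to walk the structure shown above. The name `iter_fragment_results` and the toy dictionary are hypothetical, and only keys that appear in the example are used.

```python
def iter_fragment_results(isotope_scoring_results):
    """Yield one flat tuple per OzID fragment that was found.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    for lipid, adducts in isotope_scoring_results["targets"].items():
        for adduct, rts in adducts.items():
            for rt_label, species in rts.items():
                for db_idx, positions in species["fragments"].items():
                    for db_pos, frags in positions.items():
                        for frag_type in ("aldehyde", "criegee"):
                            frag = frags.get(frag_type)
                            if frag is None:
                                continue  # fragment was not found
                            yield (lipid, adduct, rt_label, db_idx, db_pos,
                                   frag_type, frag["mz_cos_dist"])


# toy results dictionary with the same nesting as the example above
toy_results = {
    "metadata": {"workflow": "isotope_scoring"},
    "targets": {
        "PC(16:1_16:0)": {
            "[M+H]+": {
                "23.45min": {
                    "precursor": {"target_mz": 789.0123},
                    "fragments": {
                        1: {9: {"aldehyde": {"mz_cos_dist": 0.15},
                                "criegee": None}},
                    },
                },
            },
        },
    },
}
rows = list(iter_fragment_results(toy_results))
```

Only the aldehyde fragment produces a row here because the criegee section is None.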

Structure of run_isotope_scoring_workflow_infusion Results

The results from the infusion variant of the isotope scoring workflow are very similar to those from the normal version, except any component having to do with retention time is omitted.

Example run_isotope_scoring_workflow_infusion results dictionary
isotope_scoring_results = {
    'metadata': {
        'workflow': 'isotope_scoring_infusion',
        'lipidoz_version': '0.4.20',
        'oz_data_file': 'data/infusion_ozid_data_file.mza',
        'target_list_file': 'a_target_list.csv',
        'mz_tol': 0.05,
        'd_label': None,
        'd_label_in_nl': None,
    },
    'targets': {
        'PC(16:1_16:0)': {
            '[M+H]+': {
                'infusion': {  # instead of a retention time the label here is just "infusion"
                    'precursor': {
                        'target_mz': 789.0123,
                        'mz_ppm': 10.1,
                        'abun_percent': 5.5,
                        'mz_cos_dist': 0.15,
                        'isotope_dist_img': ...,
                    },
                    'fragments': {
                        1: {
                            9: {
                                'aldehyde': {
                                    'target_mz': 234.5678,
                                    'mz_ppm': 10.1,
                                    'abun_percent': 5.5,
                                    'mz_cos_dist': 0.15,
                                    'isotope_dist_img': ...,
                                },
                                # if the fragment was not found the section is set to None
                                'criegee': None
                            },
                            # more db positions ...
                        },
                        # more db indices ...
                    }
                }
            },
            # more adducts ...
        },
        # more targets ...
    },
}

Structure of collect_preml_dataset dataset

The lipidoz.workflows.collect_preml_dataset() function returns a dictionary containing minimally processed RTMZ data for a set of lipid species defined in a target list. The dataset contains extracted data for lipid precursor and aldehyde/criegee OzID fragments for different double bond locations. The dataset is organized into two top-level sections: 'metadata' and 'targets'. The 'metadata' section contains metadata about the analysis including information like input files and tolerances used for data extraction. The 'targets' section contains the data for individual lipid species, defined by the lipid, MS adduct, target retention time, double bond index, and double bond position.

Example collect_preml_dataset dataset
pre_ml_dataset = {
    'metadata': {
        'workflow': 'pre_ml',
        'lipidoz_version': '0.4.20',
        'oz_data_file': '../../_data/Ultimate-Splash_NEG_O3_Run-1.mza',
        'target_list_file': 'test_target_list.csv',
        'rt_tol': 0.2,
        'd_label': 5,
        'd_label_in_nl': False,
    },
    'targets': {
        'PE(18:1_17:0)|[M+H]+|23.70min|1|1': {  # <lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>
            'pre_data': ...,  # raw RTMZ arrays for precursor
            'ald_data': ...,  # raw RTMZ arrays for aldehyde OzID fragment
            'crg_data': ...,  # raw RTMZ arrays for criegee OzID fragment
            'pre_mz': mz,  # precursor m/z
            'ald_mz': ald_mz,  # aldehyde OzID fragment m/z
            'crg_mz': crg_mz,  # criegee OzID fragment m/z
            'rt': 23.70,  # target retention time
        },
        # ... data for other targets omitted
    },
}
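
Each key in the 'targets' section packs five fields into one string, following the `<lipid>|<adduct>|<target_rt>|<db_idx>|<db_pos>` pattern noted in the example. A minimal sketch of unpacking such a key is shown below. The helper name `parse_preml_key` is hypothetical, and it assumes the adduct and the later fields never contain a `|` character.

```python
def parse_preml_key(key):
    """Split a pre-ml target key into its five documented fields.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    # split from the right so that only the last four separators are used
    lipid, adduct, rt_label, db_idx, db_pos = key.rsplit("|", 4)
    # the retention time field carries a "min" suffix, e.g. "23.70min"
    rt_text = rt_label[:-3] if rt_label.endswith("min") else rt_label
    return {
        "lipid": lipid,
        "adduct": adduct,
        "target_rt": float(rt_text),
        "db_idx": int(db_idx),
        "db_pos": int(db_pos),
    }


fields = parse_preml_key("PE(18:1_17:0)|[M+H]+|23.70min|1|1")
```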

Module Reference

Isotope Distribution Analysis

lipidoz.workflows.run_isotope_scoring_workflow(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')

Workflow for performing isotope scoring for the determination of double bond (DB) positions. Inputs are the data file and target list file. The output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

  • lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)

  • MS adduct, e.g., [M+H]+ or [M-2H]2-

  • target retention time

Parameters:
oz_data_file : str

filename and path for OzID data (.mza format)

target_list_file : str

filename and path for target list (.csv format)

rt_tol : float

retention time tolerance (for MS1 data extraction)

rt_peak_win : float

size of retention time window to extract for fitting retention time peak

mz_tol : float

m/z tolerance for extracting XICs

d_label : int, optional

number of deuteriums in deuterium-labeled standards (e.g., SPLASH and Ultimate SPLASH mixes)

d_label_in_nl : bool, optional

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb : function, optional

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

  • lipid name (str)

  • adduct (str)

  • current position in target list (int)

  • total lipids in target list (int)

info_cb : function, optional

optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message

early_stop_event : threading.Event, optional

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization : bool, default=True

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

rt_correction_func : function, optional

a function that takes an uncorrected retention time as an argument and returns the corrected retention time

mza_version : str, default='new'

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:
isotope_scoring_results : dict(...)

results dictionary with metadata and scoring information
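
The callback signatures described for progress_cb and info_cb can be sketched as follows. The factory `make_progress_cb` and the message format are hypothetical. The calls at the bottom only simulate what a workflow run might emit, and whether the reported position counts from 0 or 1 is not stated here.

```python
def make_progress_cb(log):
    """Build a callback with the documented (lipid, adduct, position, total) signature.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    def progress_cb(lipid, adduct, pos, total):
        log.append(f"[{pos}/{total}] {lipid} {adduct}")
    return progress_cb


messages = []
progress_cb = make_progress_cb(messages)
info_cb = messages.append  # info_cb receives a single str info message

# simulate the calls a workflow might make for a two-entry target list
progress_cb("PE(17:0_18:1)", "[M-H]-", 1, 2)
info_cb("example info message")
progress_cb("PE(17:0_20:3)", "[M-H]-", 2, 2)
```

Callbacks like these would be passed to the workflow through its progress_cb and info_cb keyword arguments.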

lipidoz.workflows.run_isotope_scoring_workflow_targeted(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, info_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, rt_correction_func=None, ignore_preferred_ionization=True, mza_version='new')

workflow for performing isotope scoring for the determination of db positions.

! TARGETED VARIANT !

Inputs are the data file and target list file. The output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

  • lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)

  • MS adduct, e.g., [M+H]+ or [M-2H]2-

  • target retention time

  • target double bond indices separated by “/” (e.g. “1/1/1/2”)

  • target double bond positions separated by “/” (e.g. “6/7/9/9”)

Parameters:
oz_data_file : str

filename and path for OzID data (.mza format)

target_list_file : str

filename and path for target list (.csv format)

rt_tol : float

retention time tolerance (for MS1 data extraction)

rt_peak_win : float

size of retention time window to extract for fitting retention time peak

mz_tol : float

m/z tolerance for extracting XICs

d_label : int, optional

number of deuteriums in deuterium-labeled standards (e.g., SPLASH and Ultimate SPLASH mixes)

d_label_in_nl : bool, optional

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb : function, optional

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

  • lipid name (str)

  • adduct (str)

  • current position in target list (int)

  • total lipids in target list (int)

info_cb : function, optional

optional callback function that gets called at several intermediate steps and gives information about data processing details. The callback function takes a single argument, which is a str info message

early_stop_event : threading.Event, optional

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization : bool, default=True

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

rt_correction_func : function, optional

a function that takes an uncorrected retention time as an argument and returns the corrected retention time

mza_version : str, default='new'

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:
isotope_scoring_results : dict(...)

results dictionary with metadata and scoring information
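
An rt_correction_func only needs to map an uncorrected retention time to a corrected one. The sketch below builds such a function from a linear model. The factory name `make_linear_rt_correction`, the choice of a linear model, and the coefficients are all hypothetical and would need to come from a real calibration.

```python
def make_linear_rt_correction(slope, intercept):
    """Return a function mapping uncorrected RT to corrected RT.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    def correct_rt(rt):
        return slope * rt + intercept
    return correct_rt


# assumption: a constant shift of -0.25 min, chosen only for this example
rt_correction_func = make_linear_rt_correction(slope=1.0, intercept=-0.25)
corrected = rt_correction_func(23.70)
```

The resulting callable takes one argument and returns one value, which matches the description of the rt_correction_func parameter.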

lipidoz.workflows.run_isotope_scoring_workflow_infusion(oz_data_file, target_list_file, mz_tol, d_label=None, d_label_in_nl=None, progress_cb=None, early_stop_event=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, mza_version='new')

Workflow for performing isotope scoring for the determination of double bond (DB) positions from infusion data. Inputs are the data file and target list file. The output is a dictionary containing metadata about the analysis and the analysis results for all of the lipids in the target list. The target list should have columns containing the following information (in order, 1 header row is skipped):

  • lipid name in standard abbreviated format, with FA composition fully specified, e.g., PC(18:1_16:0) or TG(16:0/18:1/20:2)

  • MS adduct, e.g., [M+H]+ or [M-2H]2-

Parameters:
oz_data_file : str

filename and path for OzID data (.mza format)

target_list_file : str

filename and path for target list (.csv format)

mz_tol : float

m/z tolerance for extracting XICs

d_label : int, optional

number of deuteriums in deuterium-labeled standards (e.g., SPLASH and Ultimate SPLASH mixes)

d_label_in_nl : bool, optional

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

progress_cb : function, optional

optional callback function that gets called every time an individual lipid species has been processed. This callback function should take as arguments (in order):

  • lipid name (str)

  • adduct (str)

  • current position in target list (int)

  • total lipids in target list (int)

early_stop_event : threading.Event, optional

when the workflow is running in its own thread and this event gets set, processing is stopped gracefully

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization : bool, default=False

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

mza_version : str, default='new'

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:
isotope_scoring_results : dict(...)

results dictionary with metadata and scoring information
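
The early_stop_event parameter follows the usual threading.Event pattern: the caller sets the event and the worker checks it between units of work. The sketch below shows that pattern with a stand-in loop. `stand_in_workflow` is hypothetical and is not the lipidoz implementation. In real use the thread target would call one of the workflow functions with `early_stop_event=stop_event`.

```python
import threading


def stand_in_workflow(targets, early_stop_event=None):
    """Hypothetical stand-in for a workflow loop, NOT the lipidoz implementation."""
    processed = []
    for target in targets:
        # check the event between targets so the loop can exit cleanly
        if early_stop_event is not None and early_stop_event.is_set():
            break
        processed.append(target)
    return processed


stop_event = threading.Event()
results = {}


def worker():
    results["processed"] = stand_in_workflow(
        ["a", "b", "c"], early_stop_event=stop_event
    )


# request a stop before the worker starts so the demonstration is deterministic
stop_event.set()
thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

Because the event is already set when the loop starts, no targets are processed. In an interactive application the event would typically be set from a cancel button handler while the worker thread is running.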

lipidoz.workflows.save_isotope_scoring_results(isotope_scoring_results, results_file_name)

save the results of the isotope scoring workflow (complete with metadata) to file in pickle format

Parameters:
isotope_scoring_results : dict(...)

results dictionary with metadata and scoring information

results_file_name : str

filename and path to save the results file under, should have .loz file ending (maintains compatibility with lipidoz_gui)
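
Since the results are saved in pickle format, a saved file can presumably be read back with the standard pickle module. The sketch below assumes the .loz file is a plain pickle of the results dictionary, which is not confirmed here. The helper name `load_results` is hypothetical, and `pickle.dump` stands in for save_isotope_scoring_results so that the example runs without LipidOz or real data. Only unpickle files from sources you trust.

```python
import os
import pickle
import tempfile


def load_results(results_file_name):
    """Load a results dictionary from a pickle file.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    with open(results_file_name, "rb") as f:
        return pickle.load(f)


toy_results = {"metadata": {"workflow": "isotope_scoring"}, "targets": {}}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_results.loz")
    with open(path, "wb") as f:
        # stand-in for save_isotope_scoring_results (assumed plain pickle)
        pickle.dump(toy_results, f)
    reloaded = load_results(path)
```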

lipidoz.workflows.write_isotope_scoring_report_xlsx(isotope_scoring_results, xlsx_file)

writes results of the isotope scoring workflow to an excel spreadsheet

Parameters:
isotope_scoring_results : dict(...)

results dictionary from isotope scoring workflow

xlsx_file : str

filename to save report under

Machine Learning

lipidoz.workflows.collect_preml_dataset(oz_data_file, target_list_file, rt_tol, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None, ignore_preferred_ionization=False, rt_correction_func=None, mza_version='new')

Collects a dataset which can be used in training ML models. The dataset is a dictionary with metadata and minimally processed RTMZ data. The RTMZ data is extracted in a window with the following bounds:

  • target RT +/- rt_tol – this should be set wide enough to accommodate the chromatographic peak

  • target m/z (M isotope) - 0.5 to target m/z (M isotope) + 2.5 – this covers the M, M+1, M+2 isotopes
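
The window bounds listed above amount to simple arithmetic on the target RT and m/z. The sketch below computes them for one target. The function name `extraction_window` and the example values are hypothetical, and whether the bounds are inclusive is not stated here.

```python
def extraction_window(target_rt, target_mz, rt_tol):
    """Return ((rt_min, rt_max), (mz_min, mz_max)) for one target.

    Hypothetical helper for illustration, not part of lipidoz.
    """
    rt_bounds = (target_rt - rt_tol, target_rt + rt_tol)
    # -0.5 / +2.5 around the M isotope covers the M, M+1, and M+2 isotopes
    mz_bounds = (target_mz - 0.5, target_mz + 2.5)
    return rt_bounds, mz_bounds


# example values chosen only for illustration
rt_bounds, mz_bounds = extraction_window(target_rt=23.70, target_mz=700.0, rt_tol=0.2)
```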

Parameters:
oz_data_file : str

filename and path for OzID data (.mza format)

target_list_file : str

filename and path for target list (.csv format)

rt_tol : float

retention time tolerance, defines data extraction window

d_label : int, optional

number of deuteriums in deuterium-labeled standards (e.g., SPLASH and Ultimate SPLASH mixes)

d_label_in_nl : bool, optional

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

ignore_preferred_ionization : bool, default=False

whether to ignore cases where a lipid/adduct combination violates the lipid class' preferred ionization state

rt_correction_func : function, optional

a function that takes an uncorrected retention time as an argument and returns the corrected retention time

mza_version : str, default='new'

temporary measure for indicating whether the scan indexing needs to account for partitioned scan data ('new') or not ('old'). This is only temporary: at some point the mza version will be encoded as metadata in the file and this accommodation can be made automatically.

Returns:
pre_ml_dataset : dict(...)

dataset used for assembling ML training data

Note

See Structure of collect_preml_dataset dataset for details regarding the organization of the pre-ml dataset.

lipidoz.workflows.convert_multi_preml_datasets_labeled(preml_files, ml_target_files, rt_sampling_augment=True, normalize_intensity=True, rt_corr_funcs=None, debug_flag=None, debug_cb=None)

Iterates through pairs of pre-ml dataset files and ml target lists, splits the pre-ml datasets into True/False examples, then converts them to binned ml datasets. True/False examples from all pairs are combined and returned as two arrays of ML data.

Parameters:
preml_files : list(str)

paths to pre-ml dataset files

ml_target_files : list(str)

paths to target list .csv files

rt_sampling_augment : bool, default=True

re-sample RT dimension from RTMZ data multiple times in order to augment training examples (~10x)

normalize_intensity : bool, default=True

normalize the intensities in each 2D RTMZ array so that they are in the range 0->1

rt_corr_funcs : list(function), optional

list of RT correction functions, one per ml_target_file, used to apply retention time corrections

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:
true_ml_data : numpy.ndarray
false_ml_data : numpy.ndarray

arrays of binned data for ML, split by annotation, from all pairs with shapes: (N, 3, 24, 400), where N is the number of True or False examples in the array

lipidoz.workflows.convert_multi_preml_datasets_unlabeled(preml_files, normalize_intensity=True, debug_flag=None, debug_cb=None)

Iterates through pre-ml dataset files and converts them to binned ml datasets (for unlabeled data).

Parameters:
preml_files : list(str)

paths to pre-ml dataset files

normalize_intensity : bool, default=True

normalize the intensities in each 2D RTMZ array so that they are in the range 0->1

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:
ml_data : numpy.ndarray

array of binned data for ML with shape: (N, 3, 24, 400), where N is the number of examples

Hybrid Workflow

Note

The hybrid workflow uses ML inference to prioritize targets for full analysis using the targeted variant of the isotope distribution analysis. See Structure of LipidOz Results for details on how the data from these different steps is organized in the lipidoz_results dictionary that is returned by this function.

lipidoz.workflows.hybrid_deep_learning_and_isotope_scoring(oz_data_file, target_list_file, rt_tol, rt_peak_win, mz_tol, dl_params_file, d_label=None, d_label_in_nl=None, debug_flag=None, debug_cb=None)

A hybrid workflow that uses deep learning inference as a prefilter, then performs the targeted isotope scoring workflow on the double bond positions predicted to be True.

Parameters:
oz_data_file : str

filename and path for OzID data (.mza format)

target_list_file : str

filename and path for target list (.csv format)

rt_tol : float

retention time tolerance, defines data extraction window

rt_peak_win : float

size of retention time window to extract for fitting retention time peak

mz_tol : float

m/z tolerance for extracting XICs

dl_params_file : str

pre-trained DL model parameters file

d_label : int, optional

number of deuteriums in deuterium-labeled standards (e.g., SPLASH and Ultimate SPLASH mixes)

d_label_in_nl : bool, optional

if deuterium labels are present, indicates whether they are included in the neutral loss during OzID fragmentation. This is False if the deuteriums are on the lipid head group and True if they are at the end of a FA tail (meaning that the aldehyde and criegee fragment formulas must be adjusted to account for loss of the label during fragmentation)

debug_flag : str, optional

specifies how to dispatch the message and/or plot, None to do nothing

debug_cb : function, optional

callback function that takes the debugging message as an argument, can be None if debug_flag is not set to 'textcb'

Returns:
lipidoz_results : dict(...)

results from DL prefiltering and targeted isotope scoring analysis