Identifying Metabolites using a Statistical Modelling Approach

This tutorial demonstrates how to use the MetAssign program to produce probabilistic identifications, both of peaks and of each entry in a given list of compounds. A description of the final output files is also provided.

Data files

This tutorial uses three replicates of a standard mixture of 104 metabolites, run on an Orbitrap Exactive mass spectrometer coupled to an HPLC system. The files can be downloaded here. This archive includes the data files, the identity databases and any intermediate files produced by the processing pipeline. Extracting the zip file produces a folder called tutorial. Change to this directory and start R to follow the rest of the tutorial.

Alternatively, you can access this data set with the following commands in R. We assume that you have read/write access to your "D:" drive (on a machine running Microsoft Windows). If not, you will have to alter the file paths accordingly.

	setwd ("D:/")
	unzip ("")
	file.remove ("")
	setwd ("tutorial")

More detail on the general data processing pipeline is given on the data processing page.


The inputs to MetAssign consist of a PeakML file with multiple replicates (produced, for example, by the Combine program) and a list of identification databases that will be used to annotate each of the peak sets in the PeakML file with identities. The outputs consist of the annotated PeakML file and a tab-separated-value (TSV) file, where each row corresponds to a database metabolite and each column corresponds to a property of that metabolite (e.g. the probability that the metabolite is present). For this demonstration, there is a single input PeakML file (combined_filtered_gapfilled.peakml) and two identification databases: std1_POS.filtered.xml, which contains the metabolites in the samples, and std1_POS_1000.xml, which contains a list of 1000 chemical compounds that are very close in mass to the metabolites in the sample. The purpose of these files is to show how MetAssign can distinguish compounds that are almost identical in mass.
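Before running MetAssign it can be worth confirming that the three input files named above are where R expects them. The following is a minimal sketch, assuming the working directory is the extracted tutorial folder; the labels peakml, standards and similar_mass are hypothetical names chosen for this illustration and are not part of MetAssign.

```r
# File names are taken from the tutorial text. The working directory is
# assumed to be the extracted "tutorial" folder.
inputs <- c(
  peakml       = "combined_filtered_gapfilled.peakml",
  standards    = "std1_POS.filtered.xml",
  similar_mass = "std1_POS_1000.xml"
)

# Check each file and keep the (hypothetical) labels on the result.
present <- file.exists(inputs)
names(present) <- names(inputs)

if (all(present)) {
  message("All MetAssign input files found.")
} else {
  message("Missing input file(s): ", paste(inputs[!present], collapse = ", "))
}
```

If any file is reported as missing, check the current directory with getwd() and change to the tutorial folder with setwd() before continuing.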


Once the PeakML file is generated, the mzmatch.ipeak.sort.MetAssign command can be used to identify the peaks and the metabolites present in the sample. The command usage from R is:


After identification using MetAssign, the annotations on the various peaks can be accessed using the ConvertToText program, which in this tutorial produces the identified_peaks.tsv file.

The format of the identified_peaks.tsv file is Tab-Separated Values. Looking at peak id 21 in this file, we see:
	id      mass    RT      rep1    rep2    rep3    filteredIdentification

	21	170.08105210187662      942.3519897460938       9363905.0       5595895.5       1699790.25
		StdMix1_7, 4-Aminobenzoate, M+CH3OH+H, [12C]8[1H]12[14N]1[16O]3, 0.36500;
		StdMix1_98, 3-Hydroxyphenylacetate, M+NH4, [12C]8[1H]12[14N]1[16O]3, 0.35000;
		StdMix1_19, Pyridoxine, M+H, [12C]8[1H]12[14N]1[16O]3, 0.28500

Note that this output has been folded to fit on this page. The first field of the line is the numeric identifier of the peak, in this case 21. The next field is the mass (170.08105210187662), followed by the retention time (942.3519897460938). The next three fields are the intensities of the peak in each of the three replicates (9363905.0, 5595895.5 and 1699790.25). The final field (filteredIdentification) is a string whose components are separated by '; '. Each component gives the proportion of draws that the Monte-Carlo sampler spent at that particular assignment. So in this case, peak 21 was assigned to Pyridoxine 28.5% of the time. The adduct in this case was M+H and the particular isotopic peak was [12C]8[1H]12[14N]1[16O]3. Note that the sub-fields in a component are separated by ', '.
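The two-level separator scheme described above can be unpacked in R with strsplit. The following is a minimal sketch that uses the identification string of peak 21 copied from the example. The column names given to the sub-fields are hypothetical labels for this illustration, and the code assumes that every component has exactly five sub-fields and that no compound name itself contains ', '.

```r
# Identification string for peak 21, copied from the example above and
# unfolded onto a single line.
ident <- paste(
  "StdMix1_7, 4-Aminobenzoate, M+CH3OH+H, [12C]8[1H]12[14N]1[16O]3, 0.36500",
  "StdMix1_98, 3-Hydroxyphenylacetate, M+NH4, [12C]8[1H]12[14N]1[16O]3, 0.35000",
  "StdMix1_19, Pyridoxine, M+H, [12C]8[1H]12[14N]1[16O]3, 0.28500",
  sep = "; "
)

# Split into components on "; ", then each component into sub-fields on ", ".
components <- strsplit(ident, "; ", fixed = TRUE)[[1]]
fields <- strsplit(components, ", ", fixed = TRUE)

# Assumption: every component has exactly five sub-fields, as in this example.
stopifnot(all(lengths(fields) == 5))

# One row per candidate assignment; the column names are hypothetical labels.
assignments <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(assignments) <- c("compoundId", "compoundName", "adduct",
                        "isotope", "proportion")
assignments$proportion <- as.numeric(assignments$proportion)

# Most frequently sampled assignment for this peak.
best <- assignments[which.max(assignments$proportion), ]
print(best$compoundName)

# Total proportion covered by the listed assignments.
print(sum(assignments$proportion))
```

For this peak the three listed proportions (0.365, 0.350 and 0.285) add up to 1, and the most frequently sampled assignment is 4-Aminobenzoate, only slightly ahead of the other two candidates.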

MetAssign also produces probabilistic identifications of the metabolites present in the sample. For this example, these are contained in the file identified_metabolites.tsv. An example line from the file for component 14 of the sample is:

	compoundId      compoundName    p.1     p.2     p.3     p.4     p.5     p.combined

	StdMix1_14      L-Tryptophan    1.0     1.0     1.0     1.0     0.26    0.852

The identifier in this case is StdMix1_14, which corresponds to L-Tryptophan. The numbers that follow are probabilistic classifiers produced by the MetAssign algorithm: each of the classifiers p.1 to p.5 is the probability that that many peaks (one for p.1, up to five for p.5) were produced by the given metabolite. The p.combined classifier is a compromise between the other classifiers and can be used as the probability that the metabolite is present in the sample. So for the purposes of this example, the probability that L-Tryptophan is present is 85.2%.
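The relationship between the classifiers in the example row can be checked with a little arithmetic. The sketch below rebuilds the L-Tryptophan row in R and compares p.combined with the plain average of p.1 to p.5. For this row the two agree, since (1.0 + 1.0 + 1.0 + 1.0 + 0.26) / 5 = 0.852, but this is an observation about one example row and not a confirmed definition of how MetAssign computes p.combined. The threshold value is a hypothetical cut-off chosen for illustration only.

```r
# Example row for L-Tryptophan, copied from the text above.
metabolite <- data.frame(
  compoundId = "StdMix1_14", compoundName = "L-Tryptophan",
  p.1 = 1.0, p.2 = 1.0, p.3 = 1.0, p.4 = 1.0, p.5 = 0.26,
  p.combined = 0.852,
  stringsAsFactors = FALSE
)

# Average of the five individual classifiers for each row.
classifier_cols <- paste0("p.", 1:5)
mean_p <- rowMeans(metabolite[, classifier_cols])

# Observation for this row only: the average equals p.combined.
matches_mean <- isTRUE(all.equal(unname(mean_p), metabolite$p.combined))
print(matches_mean)

# Hypothetical cut-off for calling a metabolite "present".
threshold <- 0.5
likely_present <- metabolite$compoundName[metabolite$p.combined >= threshold]
print(likely_present)
```

The same filtering step could be applied to the whole file after loading it, for example with read.delim("identified_metabolites.tsv"), assuming the file has the header row shown above and sits in the working directory.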