Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst
Jianguo Xia1 & David S Wishart1–3
1Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. 2Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada. 3National Research Council, National Institute for Nanotechnology, Edmonton, Alberta, Canada. Correspondence should be addressed to D.S.W. (david.wishart@ualberta.ca).
MetaboAnalyst is an integrated web-based platform for comprehensive analysis of quantitative metabolomic data. It is designed to be used by biologists (with little or no background in statistics) to perform a variety of complex metabolomic data analysis tasks. These include data processing, data normalization, statistical analysis and high-level functional interpretation. This protocol provides a step-wise description of how to format and upload data to MetaboAnalyst, how to process and normalize data, how to identify significant features and patterns through univariate and multivariate statistical methods and, finally, how to use metabolite set enrichment analysis and metabolic pathway analysis to help elucidate possible biological mechanisms. The complete protocol can be executed in ~45 min.
INTRODUCTION
Metabolomics is primarily concerned with comprehensive analysis of all small-molecule compounds that can be found in biological samples, such as cells, tissues or biofluids1. Because of its utility in identifying biomarkers of disease and in measuring biochemical phenotypes, the field of metabolomics has grown rapidly in recent years. This growth has also been aided by advances in analytical technologies, such as high-resolution nuclear magnetic resonance (NMR) spectroscopy, mass spectrometry (MS) and various compound separation techniques2,3. As with other omics technologies, bioinformatics has a key role in facilitating the storage, dissemination and interpretation of metabolomic data. In particular, bioinformaticians have developed a number of comprehensive spectral, compound and biofluid databases4–6, as well as a variety of software tools for data processing, compound identification and compound quantification7–12. With these basic bioinformatics tools now in place, the focus in metabolomic software development has gradually shifted away from basic compound identification and more toward functional interpretation and pathway analysis (i.e., systems biology).
There are two general approaches to performing a metabolomics study: chemometric approaches and quantitative approaches13. Chemometric approaches (also known as nontargeted or untargeted methods) use raw, unannotated peak lists, binned spectral data or aligned spectral profiles in combination with multivariate statistics to identify spectral features that are statistically different between two (or more) different sample populations. Those features (peaks, retention times, masses, chemical shifts) identified as being significant may or may not be identified in subsequent analysis steps. Because chemometric methods do not make compound identification a priority, a major challenge with this approach is the subsequent identification step and the handling and elimination of false positives or spectral noise.
In contrast, quantitative metabolomics (also known as targeted profiling) requires compound identification and quantification before any further analysis. Multivariate statistical methods are then applied to the resulting concentration data to identify metabolites that are statistically different between two (or more) different sample populations. In quantitative metabolomics, compound identification and quantification are usually achieved by comparing the MS or NMR spectra of the biological samples of interest with a set of chemical standards or a reference spectral library. Obviously, a key limitation to quantitative metabolomics is the accurate identification and quantification of compounds, especially in complex mixtures.
Although still in use today, chemometric approaches were more widespread when compound identification was hampered by the lack of comprehensive spectral databases and appropriate compound identification/quantification software. However, as many metabolomics researchers learned, without a list of named compounds, it is extremely difficult to identify the affected pathways, to infer a mechanism of action or to develop any kind of biological understanding. It is also very difficult to patent an unknown peak or an unnamed spectral feature. With the availability of several comprehensive metabolomic databases and improved spectral analysis tools4–6,14,15, compound identification has become much easier, and now quantitative metabolomics is becoming much more widely used in the metabolomics community16–19. In response to this trend toward quantitative metabolomics, as well as the growing community shift toward using open-access, web-based tools in many omics applications, we have developed a web-based software tool called MetaboAnalyst20.
MetaboAnalyst was specifically designed to address a wide variety of common metabolomic research and educational needs, including conventional biomarker identification, the extraction of diagnostic or prognostic metabolite patterns, general metabolite annotation, putative pathway identification, functional or biological interpretation of metabolomic data, general data exploration, online class instruction for multivariate statistics, general data visualization, the creation of plots/figures for publications and presentations, MS and/or NMR data normalization and large-scale error-checking of MS and NMR metabolomic data. Although MetaboAnalyst is certainly capable of being used for standard chemometric applications, it is mainly designed to support quantitative metabolomics. MetaboAnalyst is particularly unique among metabolomic analysis tools, in that it provides comprehensive support for multiple data
nature protocols | VOL.6 NO.6 | 2011 | 743
types (NMR, gas chromatography-MS (GC-MS) and liquid chromatography-MS (LC-MS) data), multiple data processing procedures, a wide range of statistical and machine learning methods, and tools for high-level functional interpretation. MetaboAnalyst also provides a user-friendly interface that guides non-experts through the data analysis process. In addition, it offers intuitive visualization tools and generates a detailed analysis report at the end of each session. Since its release in 2009, MetaboAnalyst has been heavily used by researchers in the metabolomics community. Currently, the server is being accessed by an average of ~50 unique users per day. This has necessitated multiple server upgrades and the development of a very extensive set of frequently asked questions and tutorials. On the basis of user feedback, MetaboAnalyst has also undergone several updates to improve its support for binary (two-group) analysis and to extend its support for multiple-group analysis. One of the most recent enhancements has been the incorporation of metabolite set enrichment analysis (MSEA)21 and metabolic pathway analysis22 into MetaboAnalyst to assist in the high-level functional interpretation of quantitative metabolomic data. These additions should make MetaboAnalyst a true one-stop shop for metabolomic data analysis.
Comparison with other available tools for metabolomic data analysis
Perhaps the most widely used tool in metabolomics data analysis today is SIMCA-P+ (Umetrics). SIMCA-P+ is a commercial desktop application with a nicely designed graphical user interface that supports a wide variety of data transformations and multivariate statistical analyses, including principal component analysis (PCA), partial least squares-discriminant analysis (PLS-DA) and orthogonal projection into latent structure (see Box 1 for glossary). SAS (Statistical Analysis System from the SAS Institute) is another
stand-alone commercial software package that is also commonly used in many metabolomics studies. Similar to SIMCA-P+, SAS supports a wide range of data transformations as well as sophisticated univariate and multivariate analyses. Unlike SIMCA-P+, SAS lacks a graphical interface and is generally accessed through application programming interfaces. Generally speaking, the normalization, clustering, multivariate statistics and many of the graphs generated by means of MetaboAnalyst (and the accompanying protocols) could be generated using SIMCA-P+ and/or SAS. However, neither SIMCA-P+ nor SAS supports metabolomic-specific data processing (for NMR and/or MS data), nor do they offer high-level functional interpretation through automated metabolite annotation, MSEA or metabolic pathway analysis. Furthermore, MetaboAnalyst is a freely available, web-based application with extensive graphical output and an easy-to-use graphical user interface. This makes it somewhat more accessible, easier to learn and far easier to use than either SIMCA-P+ or SAS. To the best of our knowledge, there are only two other freely available web-based metabolomic data processing tools: MeltDB23 and the metaP-Server24. However, neither would be able to perform most of the data processing or interpretive steps described in this protocol. MeltDB was primarily built for MS-based metabolomics data storage, administration, analysis and annotation, whereas metaP-Server was designed to support exploratory metabolomic data analysis using mainly univariate summary statistics. A detailed feature comparison for these five tools is given in Table 1.
Limitations of the protocol and software
Because of space restrictions, the protocols/procedures outlined in this paper will not be able to illustrate all of the functional capabilities that can be found in MetaboAnalyst. In particular, the clustering, classification and machine learning tools for data processing will not be discussed here.
Similarly, some of the metabolite annotation functions will not be presented either. Although it is important to note some of the limitations of this particular protocol, it is also important to note that the software itself also has some shortcomings. In particular, MetaboAnalyst has relatively limited metabolite annotation capabilities, it does not support or integrate other kinds of omics data and it has limited capabilities for processing and visualizing raw MS spectral files. This limitation with MS spectral files is primarily the result of hardware restrictions, both with respect to the MetaboAnalyst server and with respect to the speed of Internet data transfers (bandwidth). Raw MS spectral files are often too large (greater than hundreds of Mb) to be routinely or rapidly uploaded to a remote server. Furthermore, spectral processing (including peak picking, alignment and annotation) is a computationally intensive exercise that usually requires multiple iterations and careful manual inspection to achieve optimal results. Consequently, we believe that these tasks are better handled by locally installed software packages rather than through a web-based application. Indeed, many freely available tools have been developed for MS spectral processing, including MetAlign11, MZmine25, Met-IDEA26, MSFACTS27, Tagfinder28 or XCMS8, to name just a few. By avoiding this data transfer bottleneck and by limiting its preferred input format to partially processed data, such as peak lists or concentrations, MetaboAnalyst is able to offer much more efficient data analysis and visualization services to a much wider user base.

Table 1 | Comparison of different metabolomics data analysis/interpretation programs.
Tool Software type License Data input Graphical interface Normalization Univariate analysis Multivariate analysis Clustering Classification Enrichment analysis Pathway analysis Pathway visualization Integration with other omics data Peak annotation ++ MetaboAnalyst Web-based Free Data table, NMR, MS, GC-MS data, compound/peak lists +++ +++ +++ +++ +++ ++ ++ +++ ++ + +++ + MeltDB Web-based Free (registry required) Raw mass spectral files ++ + ++ ++ ++ metaP-Server Web-based Free Data table ++ + +++ + +++ SIMCA-P+ Stand-alone Commercial Data table +++ ++ SAS Stand-alone Commercial Data table +/ ++ +++ +++ ++ ++
The level of support for a particular feature is rated by the number of +, with +++ as the highest.

[Figure 1 flowchart. Data processing and normalization: inputs are concentration tables, peak lists, spectral bins and MS spectra; steps include data pre-processing (peak detection/alignment, retention time correction, noise filtering), data integrity check (Step 3), missing value imputation, outlier removal and compound name standardization (Step 7). Statistical analysis: univariate analysis (fold changes, t-tests, volcano plots, ANOVA (Step 8A)); multivariate analysis (PCA (Steps 11–14), PLS-DA (Steps 15–19)); clustering (hierarchical cluster, SOM, K-means); classification (random forests, SVM). High-level functional interpretation: pathway analysis (Steps 21–31; 15 organisms, 1,173 pathways; pathway enrichment analysis, pathway topology analysis, interactive visualization) and enrichment analysis (Steps 32–35; 6,295 metabolite sets in 7 categories; over-representation analysis, single sample profiling, quantitative enrichment analysis).]
Figure 1 | Flowchart for MetaboAnalyst. MetaboAnalyst is composed of three main functional modules responsible for data processing, statistical analysis and high-level functional interpretation. Different data inputs are first processed to produce appropriate data matrices. A wide array of univariate and multivariate statistical analyses can then be performed on these data matrices. If compound identities are known, users can perform enrichment analysis or pathway analysis after compound name standardization. Corresponding PROCEDURE step numbers are indicated in the figure.

Figure 2 | Data upload view. This screenshot shows MetaboAnalyst's available data analysis modules, with the Statistical Analysis module being selected for data upload. Clicking the tab labeled Enrichment Analysis or Pathway Analysis will allow users to upload data for the corresponding data analysis. The navigation tree is located on the left with the current step (Upload) highlighted.

Analysis overview
The procedure described here provides a step-by-step protocol for using MetaboAnalyst to fully analyze quantitative metabolomic data. It begins with a general overview of the program, followed by a detailed description on how to format and upload data, how to cleanse the data, how to normalize it and how to identify significant features or generate lists of important metabolites. It concludes with a description on how to perform MSEA and how to perform metabolic pathway analysis. Although the protocol is specific to MetaboAnalyst, many of the early stage statistical steps can be readily adapted to other statistical analysis packages (such as SIMCA-P+ and SAS). As noted earlier, not all of MetaboAnalyst's options or data analysis paths can be discussed in detail. However, the protocol described here should be applicable to many common data analysis scenarios in metabolomics.
MetaboAnalyst consists of three main modules: (i) a data processing module; (ii) a statistics module; and (iii) a high-level functional interpretation module. The data processing module is responsible for data input, data processing and data normalization. The statistics module supports a number of statistical (univariate, multivariate) and machine learning methods for feature selection, clustering and classification. The high-level functional interpretation module includes enrichment analysis and pathway analysis. The enrichment analysis offers MSEA using several comprehensive metabolite set libraries. The pathway analysis offers pathway enrichment analysis and pathway topology analysis through a Google Maps-style interactive pathway visualization system. As illustrated in Figure 1, the data processing module is the entryway to access the other two modules. The statistics module, which is perhaps the
most important module in MetaboAnalyst, is designed for general-purpose metabolomic data analysis and can be used to analyze a number of different data types, including compound concentration data, peak lists or binned spectral data (i.e., both targeted and non-targeted data). For high-level functional interpretation, only quantitative metabolomic data (i.e., compound concentration data or a list of metabolite names) can be accepted. It is important to note that MetaboAnalyst's high-level functional analysis is organism specific as dictated by MetaboAnalyst's underlying knowledgebase. For enrichment analysis, the collection of ~6,300 metabolite sets was compiled primarily from human studies. Therefore, users need to provide their own custom metabolite sets if they wish to perform enrichment analysis for other organisms. MetaboAnalyst's pathway analysis currently supports 15 model organisms with ~1,200 precompiled Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Before using this option, users need to decide whether these predefined libraries are applicable to their organism(s) under study. To perform high-level functional analysis, one critical step is to match compound names between users' data and MetaboAnalyst's
knowledgebase. As there is currently no universally accepted set of metabolite names or IDs, we have implemented an automated compound disambiguator to convert various compound IDs and synonyms to Human Metabolome Database (HMDB) compound names for MSEA and to KEGG compound names for pathway analysis. In some cases, there will be redundancies and conflicts due to the different naming schemas adopted by different databases. Those compounds with name conflicts will be highlighted for subsequent manual inspection. We recommend that users try the recently released Chemical Translation Service29 (http://cts.fiehnlab.ucdavis.edu) to clarify these ambiguities before performing any kind of high-level analysis.
MetaboAnalyst uses a navigation tree to guide users through its different analysis procedures (Fig. 2). All the available functions
are represented as tree nodes and these nodes are organized into different branches or functional categories. Users may click the corresponding nodes to navigate among different MetaboAnalyst functions. Depending on the context, some tree nodes may be disabled when the required preliminary steps have not been performed by the user. The current node is always highlighted during the analysis, as shown in Figure 2.
This protocol is organized into five sections: (i) data formatting, uploading and processing; (ii) identifying important features using univariate analysis; (iii) multivariate statistical analysis; (iv) MSEA; and (v) metabolic pathway analysis. Two compound concentration data sets are provided to demonstrate these procedures. The first data set contains metabolite concentrations of 39 bovine rumen samples measured by proton NMR. The rumen samples were collected from dairy cows fed with different proportions of barley grain. The samples are labeled in four groups (0, 15, 30 and 45), indicating different percentages of barley in the diet. The second data set contains metabolite concentrations of 77 urine samples from cancer patients, also measured by proton NMR. The samples are divided into two groups: control or cachexic (significant muscle loss).

[Figure 3 panels: Before normalization; After normalization. The box-plot axis lists the individual compounds.]
Figure 3 | Data normalization view. The graph summarizes the distribution of input data values before and after normalization. The box plots on the top show the concentration distributions of individual compounds, whereas the bottom plots show the overall concentration distribution based on kernel density estimation.
2011 Nature America, Inc. All rights reserved.
Data formatting, processing and normalization
This section describes how to upload various data types into MetaboAnalyst, followed by explanations on how to perform data cleansing and normalization. The basic idea is to transform any uploaded data into a matrix, with samples in rows and features in columns. Three basic data formats are supported by MetaboAnalyst (Box 2). The most common type is a data table containing compound concentrations, peak intensities or spectral bins. These kinds of data can be easily viewed and edited using any spreadsheet program. The second data type consists of multiple peak lists, as picked from multiple spectra (NMR, MS or GC-MS). These kinds of data can be obtained from most spectral processing programs. The third data type corresponds to raw MS spectra saved in open exchange formats, such as netCDF or mzXML. More detailed information regarding data input formats, including example data sets, is available on the MetaboAnalyst website under the Data Formats and FAQs links.
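As an illustration of the samples-in-rows layout described above, the following minimal Python sketch reads a small CSV table into a matrix. The column order (sample name, class label, then one column per compound), the function name read_concentration_table and the example values are assumptions made for illustration only; the formats actually accepted by MetaboAnalyst are those defined in Box 2.

```python
import csv
import io


def read_concentration_table(text):
    """Parse CSV text into (sample_names, labels, feature_names, matrix).

    Assumed layout (hypothetical): column 1 = sample name, column 2 = class
    label, remaining columns = one compound each. Empty cells become None so
    that missing values stay visible for later handling.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    feature_names = header[2:]
    sample_names = [row[0] for row in body]
    labels = [row[1] for row in body]
    matrix = [
        [float(cell) if cell.strip() else None for cell in row[2:]]
        for row in body
    ]
    return sample_names, labels, feature_names, matrix


# Invented values, used only to show the shape of the data.
EXAMPLE = """Sample,Label,Acetate,Alanine,Glucose
0-1-1,0,45.2,1.3,0.9
15-1-1,15,51.7,,1.1
"""

samples, labels, features, matrix = read_concentration_table(EXAMPLE)
```

Each inner list of matrix is one sample (row) and each position within it is one feature (column), which is the orientation the later processing steps assume.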
Depending on the type of uploaded data, different preprocessing procedures may be used to convert the raw data into an appropriate data matrix. Compound concentration data (measured by NMR, GC-MS or LC-MS) are usually of high quality and do not normally need a preprocessing step. Binned spectral data usually contain a great deal of baseline noise and often require baseline filtering. For NMR and MS peak lists, MetaboAnalyst will first align the peaks across all samples. For GC/LC-MS spectra, MetaboAnalyst performs peak detection, peak alignment and retention time correction sequentially using the XCMS package8. MetaboAnalyst also supports some limited peak annotation/identification from raw peak lists. This annotation function can be accessed by going to the Other Utilities tab and clicking on the NMR/MS peak search tool bar. This particular function uses the HMDB peak search tools to score and identify peaks. MetaboAnalyst's MS peak search also identifies common adducts.
After the data have been converted into a data matrix, a data integrity check is performed to ensure that the data are valid and suitable for subsequent analysis. This data integrity check includes checking for data values, sample size, group labels and other data features. Box 3 describes some of the approaches available in MetaboAnalyst for dealing with missing values and outliers.
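The kind of checks listed above (data values, sample size, group labels) can be sketched in a few lines of Python. This is not MetaboAnalyst's own check: the function name integrity_report, the specific rules and the threshold of three samples per group are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def integrity_report(labels, matrix, min_per_group=3):
    """Summarize basic problems in a samples-by-features matrix.

    labels: one class label per sample; matrix: list of rows, with None
    marking a missing value. min_per_group is a hypothetical threshold.
    """
    problems = []
    if len(labels) != len(matrix):
        problems.append("number of labels does not match number of samples")
    if len({len(row) for row in matrix}) > 1:
        problems.append("rows have different numbers of features")
    group_sizes = Counter(labels)
    if len(group_sizes) < 2:
        problems.append("at least two groups are needed for comparison")
    for group, size in sorted(group_sizes.items()):
        if size < min_per_group:
            problems.append(f"group {group!r} has only {size} sample(s)")
    missing = sum(value is None for row in matrix for value in row)
    return {
        "groups": dict(group_sizes),
        "missing_values": missing,
        "problems": problems,
    }


# Invented data: five samples, two features, one missing value.
report = integrity_report(
    ["a", "a", "a", "b", "b"],
    [[1.0, 2.0], [1.1, None], [0.9, 2.2], [3.0, 4.1], [2.8, 3.9]],
)
```

Returning a report rather than raising an error mirrors the summary-then-decide flow described in the text, where the user chooses how to handle missing values after seeing the summary.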
It is often necessary to normalize metabolomic data before starting any kind of statistical analysis, for several reasons. First, normalization can reduce systematic bias or technical variation. Second, metabolite concentrations or peak intensities usually span several orders of magnitude (sub-micromolar to millimolar). Consequently, the variance from the more abundant metabolites will tend to dominate the variance-covariance matrix and obscure small but potentially significant signals. This can lead to misidentification of significant changes or a failure to identify significant changes, particularly with conventional multivariate statistical approaches. In addition, many statistical methods assume that data values follow a Gaussian distribution. Therefore, it is important to perform appropriate data transformations to bring the data closer to a Gaussian (bell-shaped) distribution. MetaboAnalyst provides many useful methods for data normalization (Box 4). The effect of these normalization procedures on users' data can be visualized with a diagnostic plot (Fig. 3).

Significant feature identification using univariate methods
This section provides the detailed steps on how to identify features of interest using classical univariate statistical methods, such as Student's t-test, analysis of variance (ANOVA) or correlation analysis. It also describes how to use a method developed for analyzing high-dimensional data, namely, significance analysis of microarrays (SAM)30. Metabolomic data sets are intrinsically high dimensional, with the number of features (peaks, metabolites) ranging from a few dozen to hundreds or even thousands. They represent snapshots of global biochemical profiles of individual organisms. Most of these features are expected to be within normal physiological variations, and only a few may be significantly associated with the conditions or phenotypes of interest. The identification of those key features is the first step toward finding useful biomarkers or explaining the underlying biological process. Depending on the specific questions being asked or the information already known, MetaboAnalyst offers a number of different strategies to perform feature identification and assessment (Box 5). MetaboAnalyst also supports feature (or peak) annotation after significant features (peaks or bins) have been identified. This utility can be accessed under the Peak search node located near the bottom of the navigation tree.

[Figure 4 panels: (a) score plot with groups 0, 15, 30 and 45 (axis: Component 1 (15.9%)); (b) accuracy, R2 and Q2 against number of components (1–5); (c) histogram with frequency on the y axis.]
Figure 4 | Multivariate analysis using PLS-DA. (a) PLS-DA 3D score plot. (b) Bar plots showing the three performance measures (prediction accuracy, R2 and Q2) using different numbers of components. The red * indicates the best values of the currently selected measures (Q2). (c) The result of permutation tests summarized by a histogram. (d) The top 15 compounds ranked by VIP scores.
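For readers who want to see what a two-group univariate screen computes, the sketch below ranks features by a Welch t statistic and reports a log2 fold change for each. It is an illustration under stated assumptions, not the procedure MetaboAnalyst runs: the function names and the concentration values are invented, Welch's unequal-variance form is one of several t-test variants, and P values are left out because they require a t-distribution from a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for two independent groups (unequal variances).

    Assumes each group has at least two values and is not constant in
    both groups at once (otherwise the standard error is zero).
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def log2_fold_change(a, b):
    """log2 of the ratio of group means; assumes positive concentrations."""
    return math.log2(mean(a) / mean(b))


def screen_features(group_a, group_b):
    """Rank features by |t|; inputs map feature name -> list of values."""
    results = [
        (name, log2_fold_change(group_a[name], group_b[name]),
         welch_t(group_a[name], group_b[name]))
        for name in group_a
    ]
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


# Invented concentrations for two hypothetical groups.
case = {"Acetate": [20.0, 22.0, 21.0, 23.0], "Glucose": [5.1, 5.0, 5.2, 4.9]}
control = {"Acetate": [10.0, 12.0, 11.0, 13.0], "Glucose": [5.0, 5.2, 4.9, 5.1]}
ranked = screen_features(case, control)
```

Plotting the fold change against the significance of each feature is the idea behind the volcano plot listed among the univariate options in Figure 1.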
Figure 5 | Results from metabolite set enrichment analysis. (a) The result table summarizing the matched metabolite sets ranked by their P values. (b) The detailed view of a matched metabolite set (accessed by clicking the corresponding bar icon on the last table column).
Multivariate data analysis
Multivariate statistics involves the simultaneous observation and analysis of more than two statistical variables. Because metabolomic data usually consist of dozens of features (compounds, peaks), many of which change as a function of time, phenotype or experimental conditions, multivariate data analysis is ideal for analyzing metabolomic data. Multivariate analysis includes a number of techniques, such as multivariate ANOVA, multivariate regression analysis, PCA, factor analysis and discriminant analysis. MetaboAnalyst supports two widely used multivariate methods: PCA and PLS-DA. These two methods are very useful for exploratory data analysis through dimensional reduction and data visualization (Box 6). MetaboAnalyst is also able to generate a variety of colorful, two- or three-dimensional graphs, such as score plots, loading plots and other kinds of diagnostic plots (Fig. 4). This section describes the detailed steps needed to perform PCA and PLS-DA using the example data sets and how to interpret the results.

Metabolite set enrichment analysis
This section describes the detailed steps to perform MSEA. MSEA is the metabolomic counterpart of the gene set enrichment analysis (GSEA)31, which has been widely used in gene expression data analysis. The key idea behind GSEA is to investigate the enrichment of predefined groups of functionally related genes (or gene sets) instead of individual genes. This approach has been shown to be good at identifying significant as well as subtle but coordinated expression changes among a group of related genes. As groups of genes are usually associated with biological functions or biological pathways, GSEA also greatly facilitates higher-level functional interpretation. MSEA has been implemented in MetaboAnalyst, using the same concepts underlying GSEA (Fig. 5). Similar to GSEA, there are
two essential components for MSEA: (i) the algorithms for enrichment analysis and (ii) the comprehensive libraries of functionally related metabolite sets. Box 7 provides more details about these two components.

Metabolic pathway analysis
This section describes the basic steps to perform metabolic pathway analysis and visualization of the results. Pathway analysis has proven to be an invaluable tool in understanding complex relationships among genes and proteins in genomics and proteomics studies32–35. Most pathway analysis tools focus on visually displaying and highlighting matched genes, proteins or metabolites and do not support more quantitative or statistical analysis. To address this issue, we have integrated two pathway analysis approaches: pathway enrichment analysis and pathway topology analysis. The results can be visualized intuitively using a Google Maps-style visualization system (Fig. 6). Box 8 provides additional details on the main features offered by MetaboAnalyst's pathway analysis utilities.
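Enrichment of a metabolite set or pathway in a list of compounds is commonly posed as an over-representation question: given how many compounds of the set appear in the list, how surprising is that overlap? The sketch below answers it with a hypergeometric upper-tail probability. Using the hypergeometric test here is an assumption made for illustration (the algorithms MetaboAnalyst actually offers are described in Boxes 7 and 8), and the function name and numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, universe, set_size, list_size):
    """P(X >= overlap) for a hypergeometric draw.

    universe: number of compounds in the reference collection
    set_size: how many of those belong to the metabolite set or pathway
    list_size: number of compounds in the user's list
    overlap: how many listed compounds fall in the set
    """
    total = comb(universe, list_size)
    largest = min(set_size, list_size)
    favourable = sum(
        comb(set_size, i) * comb(universe - set_size, list_size - i)
        for i in range(overlap, largest + 1)
    )
    return favourable / total


# Hypothetical numbers: 10 reference compounds, a set of 3, a list of 4,
# and all 3 set members present in the list.
p_value = hypergeom_upper_tail(3, 10, 3, 4)
```

A small probability means the overlap is larger than expected by chance. When many sets are tested at once, the resulting P values would normally be adjusted for multiple testing before ranking.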
MATERIALS
EQUIPMENT SETUP
A PC with an Internet connection
Browser requirements: MetaboAnalyst has been tested on all modern web browsers that are JavaScript enabled, including Mozilla Firefox 3.0+, Safari 4.0+, Chrome 5.0+ (Google), Opera 10.0+ and Internet Explorer 8.0 (Microsoft).
Data files: MetaboAnalyst has a number of example data sets for format illustration purposes as well as for testing purposes. Users can directly select a testing data set in MetaboAnalyst's data upload page without actually downloading it. For this protocol, we will download a concentration data set and then re-upload it to better illustrate how local or user-generated data files may be handled. First, go to the MetaboAnalyst home page and then click the Data Formats link on the left menu bar. In the Data Formats page, under the Comma Separated Value (CSV) format, click and download the first concentration file, "Compound concentration data set: cow, four groups", and save it as cow_diet.csv. The second concentration file to be retrieved is "Compound concentration data set: human, two groups". Save this file as human_cachexia.csv.
Figure 6 | Metabolic pathway analysis and visualization. (a) The metabolome view showing all metabolic pathways arranged according to the scores from enrichment analysis (y axis) and from topology analysis (x axis). (b) The pathway view showing the corresponding metabolic pathway after clicking any node in the metabolome view. The matched metabolites are highlighted according to their P values. Users can zoom or drag the pathway map to view a subset of the compounds. (c) The compound view showing the concentration distribution of the corresponding metabolite after clicking any matched compound node. The P value and the node importance are indicated below.
PROCEDURE
Data upload, processing and normalization TIMING 5–10 min
1| Starting up: Go to the MetaboAnalyst home page and click the 'click here to start' link to enter the data upload page.
CRITICAL STEP Although most browsers support multiple tabs, do not access MetaboAnalyst from more than one tab during an analysis. Opening multiple connections to MetaboAnalyst within the same browser causes the session data to be overwritten.
? TROUBLESHOOTING
2| Data upload: Depending on the type of analysis they wish to perform, users can upload their data using any of the three available tab options: Statistical Analysis, Enrichment Analysis or Pathway Analysis (Fig. 2). Here we show how to upload data from the Statistical Analysis tab, which is selected by default (data upload instructions for Enrichment Analysis are provided in Steps 21–24, and data upload directions for Pathway Analysis are given in Step 32). In the 'Upload your data' section, users can upload either a comma-separated value (CSV) file or a compressed (ZIP) file (see Box 2 for more details). For the example we use here, choose 'Concentrations' as the data type and 'Samples in rows (unpaired)' as the data format. Click the Browse button to locate the cow_diet.csv file and click the Submit button.
CRITICAL STEP Users must specify the data type and data format that match their data. Failure to do so will result in MetaboAnalyst launching the wrong data processing procedure.
CRITICAL STEP Users can also perform paired analysis in MetaboAnalyst. For any kind of paired data comparison, there must be an even number of samples (n). For CSV-formatted data, the pairing information must be given by the class labels as integer values between −n/2 and −1 and between 1 and n/2. Samples whose class labels have the same absolute integer value are considered to be pairs (e.g., −18 is paired with +18). For ZIP-formatted data, users need to upload a separate text file (.txt) to give the pair information. Each pair is specified as two sample names (without a suffix) separated by a colon, with one pair per row.
? TROUBLESHOOTING
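The class-label rule for paired CSV data can be checked before upload. The following is a minimal sketch, assuming the labels have already been read into a list of integers; the function name `check_paired_labels` is hypothetical and is not part of MetaboAnalyst.

```python
from collections import Counter


def check_paired_labels(labels):
    """Return the sorted pair IDs if `labels` describes a valid paired design.

    A valid design has an even number of samples n, and every pair ID
    1..n/2 appears exactly once with a negative sign and once with a
    positive sign. Raises ValueError otherwise.
    """
    n = len(labels)
    if n == 0 or n % 2 != 0:
        raise ValueError("paired analysis needs an even, non-zero number of samples")
    half = n // 2
    counts = Counter(labels)
    for pair_id in range(1, half + 1):
        if counts[pair_id] != 1 or counts[-pair_id] != 1:
            raise ValueError(f"pair {pair_id} needs exactly one -{pair_id} and one +{pair_id}")
    # All n labels are accounted for, so no stray labels can remain.
    return list(range(1, half + 1))
```

Calling `check_paired_labels([-1, 1, -2, 2])` returns the pair IDs, whereas a list with an odd length or an unmatched label raises an error that points at the offending pair.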
752 | VOL.6 NO.6 | 2011 | nature protocols
3| Data integrity checking: If the data have been uploaded successfully, a data integrity check is performed. After this check is completed, MetaboAnalyst will provide a summary of the data characteristics. Two common issues that often arise with metabolomic data are missing values and outliers (see Box 3 for more details). To handle missing values, users can click the 'Missing value imputation' button and choose among several options to either exclude or replace these values. Outlier identification and removal is an iterative process and is usually performed in combination with preliminary exploratory data analysis. See Step 28 for an example. For this particular data set, we accept the data as is, so click the Skip button to go to the normalization step.
4| Data normalization: There are two normalization procedures: row-wise normalization and column-wise normalization. The characteristics of the different normalization procedures are discussed in Box 4. In the data normalization page, choose normalization by a reference sample and then select the first sample name, 0-1-1, for row-wise normalization.
CRITICAL STEP The reference sample is generally the sample in the control group with the fewest missing values. Alternatively, users can choose to use a pseudo-reference sample created by averaging all samples in the control group. For high-quality data in which samples in the same group are very homogeneous, the effects of either procedure should be very similar.
5| Select auto-scaling for column-wise normalization.
6| After the normalization steps have been completed, click Next to view a graphic summary of the normalization effects on the data (Fig. 3).
7| Compound name standardization (optional): This step is only applicable for compound concentration data. Click the 'Name check' node under the Processing branch. The results of the name conversion process will be shown as a table. Compounds without an exact match in MetaboAnalyst's name library will be highlighted in either yellow (approximate match found) or red (no match found). Users should manually examine the compounds with approximate matches and choose the correct one. Otherwise, the first match in the candidate name list will be used. Click the Submit button to finish the name checking. Note that after this step, all three major nodes on the navigation tree (Statistics, Enrichment and Pathway) should be enabled. Note also that if the data are uploaded under the Enrichment Analysis or Pathway Analysis tab, the compound name mapping will be performed by default. The data are now processed, normalized and ready for a variety of downstream analysis procedures.
Identification of significant features with univariate methods TIMING ~10 min
8| Identification of significantly different features: MetaboAnalyst directly supports significant feature (metabolite) identification using several methods, including t-tests, ANOVA, volcano plots, SAM and others. Use option A for ANOVA-based feature selection or option B for SAM-based selection.
(A) ANOVA-based feature selection
(i) As the data in cow_diet.csv contain four groups, one can use ANOVA methods to select important features. Click the ANOVA node on the navigation tree to enter the 'One-way ANOVA and post hoc analysis' page.
(ii) Significant features are identified with the default P value threshold of 0.05. As the ANOVA F-test only indicates that at least two groups differ, the post hoc analysis further tests which groups differ from each other. MetaboAnalyst offers two commonly used methods: Fisher's least significant difference (LSD) and Tukey's honestly significant difference (HSD). Tukey's HSD is generally more conservative than Fisher's LSD.
(iii) Click the 'view details' link to see a data table from the ANOVA and post hoc tests using Fisher's LSD (the default). Users can click any compound name to view a box plot summary of its concentrations in different groups.
(B) SAM-based feature selection
(i) SAM is designed to control false positives when running multiple tests on high-dimensional data. To use the SAM method, click the SAM node on the MetaboAnalyst navigation tree.
(ii) The default view is the 'Step 1' tab, which contains two plots to help users select a suitable delta value. The left plot shows how the false discovery rate (FDR) changes with different delta values, and the right plot shows the number of significant compounds identified at different delta values. For example, using the default delta value of 0.6 will identify ~25 compounds with an FDR of ~0.3; using a delta value of 1.0 will identify ~20 significant compounds with an FDR of less than 0.1. Enter 1.0 as the new delta value and click Submit.
(iii) The 'Step 2' tab shows a typical SAM plot with the delta value equal to 1.0. Click the 'View details' link to see the SAM results table. A total of 21 compounds were identified above the chosen threshold. Note that the top ten compounds are almost exactly the same as those identified using ANOVA.
9| Identification of other features with patterns of interest: This step allows users to investigate trends or patterns in metabolite concentration changes. Click the Correlations node on the navigation tree to enter the Correlation Analysis page. There are two types of correlation analysis that can be performed in MetaboAnalyst: correlation with a defined pattern (option A) or correlation with a specific feature (option B).
(A) Correlation with a defined pattern
(i) Here we will attempt to identify those metabolites whose concentrations increase with the percentage of grain in the diet. Choose the predefined pattern 1-2-3-4 from the 'select a predefined pattern' drop-down list, which corresponds to a linear concentration increase in groups 0, 15, 30 and 45. Alternatively, users can specify their own patterns in the 'define your own pattern' text field.
(ii) Click the Submit button beside the drop-down list used in the previous step. The result is shown in Figure 7a. The light blue bars show those metabolites with a negative correlation and the light pink bars show those with a positive correlation with the given pattern of change.
(iii) Click the 'View details' link to see a table of all the compounds listed as well as their correlation coefficients. Clicking any compound name will generate a graphic summary of its concentration distribution within each group (Fig. 7b).
(B) Correlation with a specific feature
(i) On the basis of the above analysis and a review of the literature, we know that elevated levels of endotoxin are important for initiating certain inflammatory responses. We are interested in identifying other metabolites with patterns of change similar to endotoxin. We will use the default Pearson r as the distance measure and then select Endotoxin from the 'Select a feature' drop-down list.
(ii) Click the Submit button. The resulting image shows a number of other features that are either positively or negatively correlated with endotoxin levels. The details can be obtained by following the 'view details' link.
Figure 7 | Correlation analysis to identify compounds with a specific pattern. (a) Correlation plot showing the top 25 compounds correlated with a given pattern, 1-2-3-4 (a linear concentration increase under different conditions). The compounds are represented as horizontal bars along an axis of correlation coefficients, with light pink indicating positive correlations and light blue indicating negative correlations. Users can click the 'view details' link to see a detailed table. (b) Box plots summarizing the concentration distributions of a selected compound in groups 0, 15, 30 and 45.
10| Report generation and result download: Click the Download node on the navigation tree. MetaboAnalyst will generate a detailed analysis report based on the steps that the user has previously executed. The report contains a brief description of each method used, followed by the graphical and textual results based on the last parameter set. The normalized data, as well as any graphs generated during the analysis, are also available for download.
? TROUBLESHOOTING
Multivariate data analysis TIMING ~10 min
11| Data exploration and visualization with PCA: PCA summarizes data into a few components that explain most of the data variance. The main characteristics of PCA are discussed in Box 6. Click the PCA node on the navigation tree to enter the PCA page. This page shows six main output panels from MetaboAnalyst's PCA analysis. The default view is a pairwise score plot from the top five PCs, with the diagonal panels showing the explained variance.
2011 Nature America, Inc. All rights reserved.
12| Click the '2D score plot' tab to see a detailed score plot using PC1 and PC2. The samples are labeled and colored according to their group memberships. In this view, users should look first for outliers; if there are obvious outliers, use the DataEditor under the Processing navigation tree to exclude them. Outlier removal should be carried out with considerable care, and outliers should be removed only if there is some clear justification (sample stability problems, sample collection issues, instrument problems, typographical errors and so on). Next, users should investigate sample dispersion; if the data points in the score plot are not well dispersed or show a high degree of skewing, this may be due to insufficient normalization. Click the Normalization node under the Processing branch to choose a different normalization procedure. In particular, autoscaling or range scaling can be very effective for correcting severely skewed data.
13| In our case, no obvious outliers or skewed distribution can be detected. Furthermore, some modest separation or clustering is noticeable among different groups. There are also some clusters that appear to overlap with each other. Users can click the '3D score plot' tab to see whether a better separation can be identified with an extra dimension (i.e., an extra principal component).
14| Identification of influential or important features: If good separation patterns are seen in a score plot, users should go to the 'Loading plot' and 'Biplot' views to identify those features that are most responsible for the separation. The loading plot can be viewed either as a scatter plot or a bar plot, as specified by the user. In this particular case, as there are no clear separations, it is very difficult to identify the features that are important. We will use a supervised method, PLS-DA, for this purpose.
15| Data exploration and visualization with PLS-DA: PLS-DA can perform both classification and feature selection. The main characteristics of PLS-DA are discussed in Box 6. Click the PLS-DA node on the navigation tree to start this analysis. The default view is a pairwise summary of the score plots of the top five components.
16| Click the '2D Score plot' tab for a detailed view of the separation patterns. A much better separation is obtained with PLS-DA compared with the PCA result obtained in Steps 11–13. The '3D Score plot' shows an almost perfect separation with the first three components (Fig. 4a).
17| Choosing the optimal number of components: MetaboAnalyst calculates R2 and Q2, which are two common performance measures in assessing PLS-DA models. R2 corresponds to the sum of squares captured by the model, whereas Q2 is the cross-validated R2. MetaboAnalyst also calculates prediction accuracies through cross-validation. Click the 'Cross Validation' tab to start the process. Users can choose tenfold cross-validation or leave-one-out cross-validation (LOOCV). In this case, we will choose LOOCV and click the Submit button. The result indicates that using the top two components gives the best performance based on Q2 measures (Fig. 4b). Click the 'view details' link to get a detailed table of the calculated values.
? TROUBLESHOOTING
18| Result validation: As noted earlier, PLS-DA tends to overfit the data, and this can often lead to false separations or incorrect classification. As a result, PLS-DA models need to be validated to see whether the separation is statistically significant or is due to random noise. This can be carried out using permutation tests. In each permutation, a PLS-DA model is built between the data (X) and the permuted class labels (Y) using the optimal number of components determined in the previous step. MetaboAnalyst provides two kinds of performance measures. The first is the separation distance, which is defined as
the ratio of the between-group sum of squares to the within-group sum of squares (B/W ratio), as suggested by Bijlsma et al. (ref. 36). The second is the prediction accuracy. This is the default approach used by MetaboAnalyst. Click the Permutation button to view the results. The resulting histogram summarizes the distribution of the permutation test scores, with the red arrow indicating the performance based on the original labels. The further the arrow is to the right of the distribution, the more significant the separation between the groups. Figure 4c shows a typical permutation result based on separation distance. As seen in this figure, the score for the original class assignment lies well outside the distribution obtained using the permuted data and is therefore very significant. A P value < 0.0005 is reported on the basis of 2,000 permutations.
? TROUBLESHOOTING
19| Identification of important features: Click the 'Var. Importance' tab to see a list of important features identified on the basis of the variable importance in projection (VIP) score (Fig. 4d). For multiple-group analysis, the VIP score is calculated for each component. The overall VIP score shown in the figure is the average across all the selected components. Users can also use the coefficient-based importance measure by clicking on the corresponding radio button and then pressing the Submit button. For multiple-group discriminant analysis, one predictor is built for each group, so the number of predictors equals the number of groups. The overall coefficient-based importance is the average of the feature coefficients in all predictors. Click the 'View details' link to see the individual VIP scores in each selected component, or the coefficients in each group predictor if the coefficient-based importance is used.
? TROUBLESHOOTING
20| Report generation and result download: Click the Download node to download all the data, tables and figures produced from this particular analysis.
? TROUBLESHOOTING
Metabolite set enrichment analysis TIMING 5–10 min
21| In the Upload page, click the Enrichment Analysis tab.
22| There are three drop-down panels for three different types of enrichment analysis (see Box 7 for more details). Each method accepts a different data type: a list of compound names entered in a single-column format for over-representation analysis; a list of compound concentrations entered as a two-column table for single-sample profiling (SSP); and a concentration table (CSV) with samples in rows and metabolites in columns for quantitative enrichment analysis (QEA). The phenotype information must be placed in the second column and can be binary, multiclass or continuous. Click the third drop-down panel, 'A concentration table (quantitative enrichment analysis)'.
23| In the open page, click Browse to locate the human_cachexia.csv data file.
24| Ensure that the selected compound label type is 'compound names' and the phenotype label is 'Discrete (Classification)', and then click Submit.
? TROUBLESHOOTING
25| Compound name conversion: The purpose of this step is to compare and convert the compound names to common compound names used in the HMDB. The compound identities can be specified by common names or major database IDs (e.g., KEGG, PubChem, HMDB, MetLin, BiGG and so on). MetaboAnalyst's compound name/ID conversion is based on a name-mapping table from the HMDB. Each HMDB compound ID is associated with a common name, a set of synonyms and compound IDs used in other major metabolomic databases. Any naming inconsistency is flagged and displayed to users for manual inspection and correction (see Step 7 for more details).
CRITICAL STEP Users must label compounds with either common compound names or common database IDs. Abbreviated names usually cannot be recognized. Unmatched or unidentified compounds will be excluded from downstream analyses.
26| Concentration comparison (optional): This step is only applicable when the uploaded data are a list of compound concentrations used for SSP. The basic idea behind SSP is to compare the measured concentration value of each compound with its normal reference range in the corresponding biofluid. For common human biofluids, such as blood, urine or cerebrospinal fluid, normal concentration ranges are known for many metabolites. In clinical metabolomic studies, it is often desirable to know whether certain metabolite concentrations in a given sample are higher or lower than their normal ranges. This procedure is designed to provide this kind of analysis. Click 'Conc. check' to start the concentration comparison. By default, only compounds with concentrations above or below all the known or reported normal ranges will be selected for further investigation. Users can override this default selection by manually selecting or deselecting compounds after inspecting the concentration comparison plots, as well as the original reports, which are shown by clicking the image icon in the Details column.
27| Data normalization (optional): This step is only applicable when the uploaded data are a concentration table. In this case, we select 'Normalization by a reference sample', and then choose 'create a pooled average sample from the control group'. Choose Autoscaling for column-wise normalization. See Box 4 for more details.
28| Data visualization and outlier detection (optional): The purpose of this step is to check whether the data values are relatively homogeneous and to detect outliers. Click the PCA node to open the PCA page. On the 2D score plot, a clear outlier, PIF_115, is noticeable, as it is far away from all other data points. This particular outlier is due to sample deterioration or contamination. Follow the route Processing → DataEditor and select PIF_115 under the 'Sample Editor' tab, click Remove and then click Finish to go back to the normalization page. Perform the data normalization as done in Step 27. Recheck the PCA score plot. This time, no obvious outlier should be detected. Follow Enrichment → Set param. to specify the parameters for enrichment analysis.
29| Set parameters for enrichment analysis: In this step, users must specify a metabolite set library (or upload a custom metabolite set library) to start the analysis (see Box 7 for details). Users can also indicate whether a filter should be applied to exclude metabolite sets containing very few compounds. In this case, we use the default 'Pathway-associated metabolite sets' and click the Next button to view the result.
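The normalization chosen in Step 27 can be illustrated with a short sketch. It assumes that normalization by a reference sample means dividing each sample by the median of its feature-wise quotients to the reference (a probabilistic-quotient style scaling, which the text here does not spell out) and that autoscaling means mean-centering each column and dividing by its standard deviation. The function names are hypothetical and this is not MetaboAnalyst's own code.

```python
from statistics import mean, median, stdev


def pooled_reference(control_rows):
    """Average the control samples feature by feature to form a pseudo-reference."""
    return [mean(col) for col in zip(*control_rows)]


def normalize_by_reference(rows, reference):
    """Row-wise step: divide each sample by its median quotient to the reference.

    Assumption: quotient-based scaling; features with a zero reference are skipped.
    """
    out = []
    for row in rows:
        quotients = [x / r for x, r in zip(row, reference) if r != 0]
        factor = median(quotients)  # a zero factor would need special handling
        out.append([x / factor for x in row])
    return out


def autoscale(rows):
    """Column-wise step: mean-center each feature and divide by its sample SD.

    Constant columns (SD of zero) are not handled in this sketch.
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]
```

Applying the row-wise step first and the column-wise step second mirrors the order in which the two options are selected on the normalization page.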
30| View the MSEA results: The MSEA result is presented both graphically and in a detailed table (Fig. 5a). The horizontal bar graph summarizes the most significant metabolite sets identified during the analysis. The bars are colored on the basis of their P values, and the bar length is based on the fold enrichment, calculated as the actual number of matches divided by the expected number of matches (for over-representation analysis) or as the calculated statistic divided by the expected statistic (for QEA). The Bonferroni-corrected P value and FDR are also provided. Users can click the image icon in the Details column of each matched metabolite set to view all its constituent metabolites, with matched ones highlighted in red (Fig. 5b), as well as SMPDB pathway images (ref. 37), when available.
31| Report generation and result download: Click the Download node to download the analysis report, images and the processed data.
? TROUBLESHOOTING
Metabolic pathway analysis TIMING ~10 min
32| Data upload and processing: In the Upload page, click the Pathway Analysis tab to get started with the human_cachexia.csv data. Users can enter either a list of compound names or a concentration table. The data upload and processing steps are similar to those involved in the enrichment analysis. Please see Steps 21–25 for more details.
? TROUBLESHOOTING
33| Set parameters for pathway analysis: Three parameters must be specified for pathway analysis. These are the pathway library, the algorithm for pathway enrichment analysis and the algorithm for topology analysis (see Box 8 for more details). Users can also supply a reference metabolome to correct for any potential bias in the enrichment analysis. The reference metabolome is specified as a list of KEGG compound IDs. In this case, we select the Homo sapiens library and use the default 'Global Test' and 'Relative Betweenness Centrality' for pathway enrichment analysis and pathway topology analysis, respectively.
34| Result visualization: The results from the pathway analysis are presented in two parts: a graphical output in the top section and a table containing all the numerical results at the bottom. Users can intuitively explore the results by pointing and clicking on various graphic elements. There are three types of views (Fig. 6). The left panel is the metabolome view, which displays all the matched pathways as circles (Fig. 6a). The color and size of each circle are based on its P value and pathway impact value, respectively. Pointing the mouse over different nodes will show the corresponding pathway names. Clicking the nodes of interest will launch the corresponding pathway view on the right panel (Fig. 6b). Users can zoom or drag to focus on a particular section of the pathway. Clicking on any matched compound node (with highlighted background) will show the corresponding compound view, which contains a detailed summary of the compound concentrations and importance measure, as well as the P value (Fig. 6c).
35| Report generation and result download: Click the Download node to get the complete analysis report as well as the processed data and images produced during the analysis.
? TROUBLESHOOTING
? TROUBLESHOOTING
Troubleshooting advice can be found in Table 2.
TABLE 2 | Troubleshooting table.
Step: 1
Problem: The content of the home page does not show up
Possible reason: JavaScript is disabled in your browser
Possible solution: For Mozilla Firefox 3.0+, go to Tools → Options → Content, then select the checkbox beside 'Enable JavaScript'. For Internet Explorer 8.0, go to Tools → Internet options → Security, then select Internet from the Zone icons. Click the 'Custom level' button. From the list of available options, make sure the Disable radio button is not selected under the 'Active scripting' item. For Safari 4.0+, go to Edit → Preferences → Security, then select the checkbox beside 'Enable JavaScript'. Please check the documentation for other browsers on how to enable JavaScript
Steps: 2, 24 and 32
Possible reason: Non-unique or unusual names; small sample size; wrong data formats; unrecognized zip format
Possible solution: Make sure sample or feature (peak/compound) names are unique and consist of a combination of English letters, underscores or numbers; the names should contain no spaces or other special characters; make sure there are at least three samples per group; make sure the selected data format matches your data; for Microsoft Excel users, choose 'CSV (Macintosh)' to generate a .csv file; for WinZip (v12.0) users, choose 'Legacy compression (Zip 2.0 compatible)' for compression
Problem: No image is generated
Possible solution: These procedures require a minimum of five samples per group
Problem: No PDF report is generated
Possible reason: Some of the expected data were not generated
Possible solution: Set appropriate parameter values to make sure the resulting images are generated; make sure there are a minimum of five samples per group for PLS-DA analysis
TIMING
The time required to perform the steps described in the protocol depends on the data set size as well as the number of active users connected to the web server. For the test data sets used for these protocols, most results should be returned within a few seconds after a user has selected the appropriate parameters. The most time-consuming computational step is probably the permutation test used by PLS-DA (15–20 s for 1,000 permutations). The most time-consuming non-computational step is typically the data visualization or data inspection step. Data upload, processing and normalization (Steps 2–7) should take about 5–10 min; feature selection using univariate analysis (Steps 8–10) usually takes around 3–5 min; and multivariate analysis (Steps 11–20) takes ~10 min. For high-level functional analysis, MSEA (Steps 21–31) should take 5–10 min, whereas metabolic pathway analysis (Steps 32–35) should take ~10 min. Once the data have been uploaded, a modestly experienced user should be able to execute the complete protocol in 30–40 min.
ANTICIPATED RESULTS
Graphical output
The graphical outputs produced during the analysis procedures are given in Figures 1–7. Some of the algorithms in MetaboAnalyst use time-dependent random number generators to calculate certain statistical values, so the results may vary slightly among runs.
Data processing results
The data integrity check for the data in cow_diet.csv will detect four groups with a total of 51 zero values and no missing values. The data integrity check for human_cachexia will yield two groups with no zero or missing values.
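The kind of summary reported by the data integrity check can be approximated offline. The sketch below assumes a CSV layout with samples in rows, the sample name in the first column and the class label in the second, and it treats empty cells and `NA` as missing; both the layout and the function name `integrity_summary` are assumptions made for illustration, not a description of MetaboAnalyst's internal checks.

```python
import csv
import io
from collections import Counter


def integrity_summary(csv_text):
    """Count samples per group, zero values and missing values in a concentration table."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    groups = Counter()
    zeros = 0
    missing = 0
    for row in reader:
        groups[row[1]] += 1  # second column holds the class label (assumed)
        for cell in row[2:]:
            cell = cell.strip()
            if cell == "" or cell.upper() == "NA":
                missing += 1
            elif float(cell) == 0.0:
                zeros += 1
    return {"groups": dict(groups), "zeros": zeros, "missing": missing}


example = "Sample,Group,A,B\ns1,0,1.5,0\ns2,0,,2.0\ns3,15,0,NA\n"
summary = integrity_summary(example)
```

For the three-sample example, `summary` reports two groups, two zero values and two missing values, which is the same kind of tally the protocol describes for cow_diet.csv.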
Feature selection using univariate methods
In MetaboAnalyst's ANOVA analysis of the cow_diet.csv data, the top five compounds identified with the default threshold should be endotoxin, 3-PP, glucose, isobutyrate and methylamine. The top five compounds identified using the SAM method will be the same. In correlation analysis using the predefined 1-2-3-4 pattern, endotoxin and alanine are the top two compounds that will be positively correlated with this pattern, whereas 3-PP and aspartate are the top two compounds that will be negatively correlated with this pattern. The same compounds should be identified as being correlated or anticorrelated with endotoxin, using Pearson r.
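The pattern-based correlation from Step 9 can be sketched as follows: the predefined increasing pattern (levels 1, 2, 3 and 4 for groups 0, 15, 30 and 45) is expanded into one template value per sample, and each compound is ranked by its Pearson r against that template. This is a minimal sketch under that assumption; the function names and the toy concentrations are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_by_pattern(features, sample_groups, pattern):
    """Rank features by correlation with a per-group pattern, most positive first.

    features: compound name -> list of concentrations, one per sample
    sample_groups: group label of each sample, in the same order
    pattern: group label -> expected relative level
    """
    template = [pattern[g] for g in sample_groups]
    scores = {name: pearson_r(vals, template) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


sample_groups = [0, 0, 15, 15, 30, 30, 45, 45]
pattern = {0: 1, 15: 2, 30: 3, 45: 4}
toy = {
    "rising": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9],
    "falling": [4.0, 4.2, 3.1, 2.9, 2.0, 2.1, 1.0, 0.9],
}
ranking = rank_by_pattern(toy, sample_groups, pattern)
```

Compounds at the top of `ranking` would correspond to the pink bars in the correlation plot and those at the bottom to the blue bars.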
Multivariate data analysis
The score plot from the PCA analysis of the cow_diet.csv data should not show a clear separation, with groups 1 and 2 overlapping substantially and group 3 slightly overlapping with groups 2 and 4. A much better group separation will be achieved through PLS-DA. Using PLS-DA, the five most important compounds identified by VIP will be endotoxin, 3-PP, alanine, methylamine and glucose. The best PLS-DA model will use just the top two components, based on the Q2 score estimated from LOOCV (0.814). The permutation test based on 2,000 permutations should yield P < 5e-04, which is very significant.
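The logic of the permutation test can be sketched with a one-dimensional stand-in for the PLS-DA scores. The sketch computes the B/W ratio for the original labels, recomputes it after shuffling the labels, and counts how often a shuffled ratio reaches the observed one. It is an illustration only: a real run would rebuild the PLS-DA model in each permutation, and the function names here are hypothetical. If no shuffled ratio reaches the observed value in 2,000 permutations, the P value can only be bounded as P < 1/2,000, which is presumably how a bound of 5e-04 arises.

```python
import random


def bw_ratio(values, labels):
    """Between-group sum of squares divided by within-group sum of squares."""
    grand = sum(values) / len(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    between = sum(len(vs) * (means[g] - grand) ** 2 for g, vs in groups.items())
    within = sum((v - means[g]) ** 2 for g, vs in groups.items() for v in vs)
    return between / within  # assumes some within-group variation


def permutation_test(values, labels, n_perm=2000, seed=0):
    """Return the observed B/W ratio and the number of permutations that reach it."""
    rng = random.Random(seed)
    observed = bw_ratio(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if bw_ratio(values, shuffled) >= observed:
            hits += 1
    return observed, hits


scores = [1.0, 2.0, 3.0, 4.0, 5.0, 11.0, 12.0, 13.0, 14.0, 15.0]
classes = ["a"] * 5 + ["b"] * 5
observed, hits = permutation_test(scores, classes)
```

With only ten samples a few shuffles reproduce the original split by chance, so `hits` is small but not necessarily zero; the empirical P value is `hits / n_perm` when `hits` is positive.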
Metabolite set enrichment analysis
All compound names from the human_cachexia.csv data set should be found to have an exact match during the name conversion step. The PCA score plot should not show a clear separation, although it should show PIF_115 as being a clear outlier. In the enrichment analysis using the pathway-based metabolite sets, the top five metabolic pathways that appear to be associated with cachexia will be pyrimidine metabolism, beta-alanine metabolism, ketone body metabolism, purine metabolism and glutamate metabolism.
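The fold enrichment described in Step 30 for over-representation analysis (matched number divided by expected number of matches) can be sketched in a few lines. The hypergeometric P value computed alongside it is an assumption about the underlying test rather than something stated in this section, and the function name and arguments are hypothetical.

```python
from math import comb


def over_representation(hits, set_size, query_size, universe_size):
    """Expected matches, fold enrichment and a one-sided hypergeometric P value.

    hits: compounds in the query list that belong to the metabolite set
    set_size: compounds in the metabolite set
    query_size: compounds in the query list
    universe_size: compounds in the reference metabolome
    """
    expected = query_size * set_size / universe_size
    fold = hits / expected
    upper = min(set_size, query_size)
    # Probability of seeing at least `hits` matches by chance.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(hits, upper + 1)
    )
    p_value = tail / comb(universe_size, query_size)
    return expected, fold, p_value


expected, fold, p_value = over_representation(hits=2, set_size=2, query_size=2, universe_size=4)
```

A Bonferroni correction of the kind reported in the results table would multiply each P value by the number of metabolite sets tested and cap the result at 1.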
Metabolic pathway analysis
The top five pathways from the human_cachexia.csv data set that should be identified by pathway enrichment analysis alone are pyrimidine metabolism, pantothenate and CoA biosynthesis, beta-alanine metabolism, synthesis and degradation of ketone bodies and propanoate metabolism. Note that three of these pathways are similar to those previously identified by MSEA. The top three pathways identified by topology analysis alone should be glycine, serine and threonine metabolism; pyruvate metabolism; and taurine and hypotaurine metabolism. Overall, three pathways (pantothenate and CoA biosynthesis; citrate cycle (TCA cycle); and alanine, aspartate and glutamate metabolism) appear to be perturbed as a consequence of cachexia, as these will be located in the diagonal area of the plot with relatively good scores from both analyses.
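The topology score used in Step 33 can be illustrated with a small sketch of relative betweenness centrality. It assumes the pathway is an unweighted directed graph of compounds, and that the pathway impact is the betweenness of the matched compounds divided by the total betweenness of all nodes; that last definition is an assumption for illustration, since this section does not spell it out. The function names and the toy graph are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted directed graph (node -> list of successors)."""
    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        sigma = {v: 0 for v in nodes}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):  # accumulate dependencies from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


def pathway_impact(graph, matched):
    """Share of total betweenness carried by the matched compounds (assumed definition)."""
    score = betweenness(graph)
    total = sum(score.values())
    if total == 0:
        return 0.0
    return sum(score[m] for m in matched if m in score) / total


chain = {"a": ["b"], "b": ["c"], "c": []}
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

In the linear `chain`, all shortest paths pass through `b`, so matching `b` gives the full impact, whereas in the `diamond` the two middle compounds share it equally.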
ACKNOWLEDGMENTS We thank the Canadian Institutes for Health Research (CIHR) and the Alberta Ingenuity Fund (AIF; now part of Alberta Innovates Technology Futures) for financial support.
AUTHOR CONTRIBUTIONS J.X. and D.S.W. prepared and tested the protocol and wrote the article.
COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.
Published online at http://www.natureprotocols.com/. Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.
1. Fiehn, O. Metabolomics – the link between genotypes and phenotypes. Plant Mol. Biol. 48, 155–171 (2002).
2. Wishart, D.S. Quantitative metabolomics using NMR. Trends Analyt. Chem. 27, 228–237 (2008).
3. Dunn, W.B. & Ellis, D.I. Metabolomics: current analytical platforms and methodologies. Trends Analyt. Chem. 24, 285–294 (2005).
4. Wishart, D.S. et al. HMDB: the human metabolome database. Nucleic Acids Res. 35, D521–D526 (2007).
5. Lundberg, P. et al. MDL – The Magnetic Resonance Metabolomics Database http://mdl.imv.liu.se (European Society for Magnetic Resonance in Medicine and Biology, ESMRMB, 2005).
6. Smith, C.A. et al. METLIN – a metabolite mass spectral database. Ther. Drug Monit. 27, 747–751 (2005).
7. Weljie, A.M., Newton, J., Mercier, P., Carlson, E. & Slupsky, C.M. Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal. Chem. 78, 4430–4442 (2006).
8. Smith, C.A., Want, E.J., O'Maille, G., Abagyan, R. & Siuzdak, G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem. 78, 779–787 (2006).
9. Zhao, Q., Stoyanova, R., Du, S., Sajda, P. & Brown, T.R. HiRes – a tool for comprehensive assessment and interpretation of metabolomic data. Bioinformatics 22, 2562–2564 (2006).
10. Xia, J., Bjorndahl, T.C., Tang, P. & Wishart, D.S. MetaboMiner – semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics 9, 507 (2008).
11. Lommen, A. MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem. 81, 3079–3086 (2009).
12. Katajamaa, M., Miettinen, J. & Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22, 634–636 (2006).
13. Wishart, D.S. Current progress in computational metabolomics. Brief. Bioinform. 8, 279–293 (2007).
14. Cui, Q. et al. Metabolite identification via the Madison Metabolomics Consortium Database. Nat. Biotechnol. 26, 162–164 (2008).
15. Wishart, D.S. et al. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 37, D603–D610 (2009).
16. Henderson, J.P. et al. Quantitative metabolomics reveals an epigenetic blueprint for iron acquisition in uropathogenic Escherichia coli. PLoS Pathog. 5, e1000305 (2009).
17. Altmaier, E. et al. Variation in the human lipidome associated with coffee consumption as revealed by quantitative targeted metabolomics. Mol. Nutr. Food Res. 53, 1357–1365 (2009).
18. Ewald, J.C., Heux, S. & Zamboni, N. High-throughput quantitative metabolomics: workflow for cultivation, quenching, and analysis of yeast in a multiwell format. Anal. Chem. 81, 3623–3629 (2009).
19. Zulak, K.G., Weljie, A.M., Vogel, H.J. & Facchini, P.J. Quantitative 1H NMR metabolomics reveals extensive metabolic reprogramming of primary and secondary metabolism in elicitor-treated opium poppy cell cultures. BMC Plant Biol. 8, 5 (2008).
20. Xia, J., Psychogios, N., Young, N. & Wishart, D.S. MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res. 37, W652–W660 (2009).
21. Xia, J. & Wishart, D.S. MSEA: a web-based tool to identify biologically meaningful patterns in quantitative metabolomics data. Nucleic Acids Res. 38, W71–W77 (2010).
22. Xia, J. & Wishart, D.S. MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics 26, 2342–2344 (2010).
23. Neuweger, H. et al. MeltDB: a software platform for the analysis and integration of metabolomics experiment data. Bioinformatics 24, 2726–2732 (2008).
24. Kastenmuller, G., Romisch-Margl, W., Wagele, B., Altmaier, E. & Suhre, K. metaP-server: a web-based metabolomics data analysis tool. J. Biomed. Biotechnol. 2011, (2010).
25. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).
26. Broeckling, C.D., Reddy, I.R., Duran, A.L., Zhao, X. & Sumner, L.W. METIDEA: data extraction tool for mass spectrometry-based metabolomics. Anal. Chem. 78, 4334–4341 (2006).
27. Duran, A.L., Yang, J., Wang, L.J. & Sumner, L.W. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 19, 2283–2293 (2003).
28. Luedemann, A., Strassburg, K., Erban, A. & Kopka, J. TagFinder for the quantitative analysis of gas chromatography–mass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics 24, 732–737 (2008).
29. Wohlgemuth, G., Haldiya, P.K., Willighagen, E., Kind, T. & Fiehn, O. The Chemical Translation Service – a web-based tool to improve standardization of metabolomic reports. Bioinformatics 26, 2647–2648 (2010).
30. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116–5121 (2001).
31. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
32. Salomonis, N. et al. GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8, 217 (2007).
33. Goffard, N., Frickey, T. & Weiller, G. PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res. 37, W335–W339 (2009).
34. Hu, Z. et al. VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37, W115–W121 (2009).
35. Goffard, N. & Weiller, G. PathExpress: a web-based tool to identify relevant pathways in gene expression data. Nucleic Acids Res. 35, W176–W181 (2007).
36. Bijlsma, S. et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal. Chem. 78, 567–574 (2006).
37. Frolkis, A. et al. SMPDB: the small molecule pathway database. Nucleic Acids Res. 38, D480–D487 (2010).
38. Efron, B., Tibshirani, R., Storey, J.D. & Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001).
39. Trygg, J. & Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 16, 119–128 (2002).
40. Wang, T. et al. Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis. BMC Bioinformatics 10, 83 (2009).
41. Stacklies, W., Redestig, H., Scholz, M., Walther, D. & Selbig, J. pcaMethods – a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 1164–1167 (2007).
42. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281–4290 (2006).
43. van den Berg, R.A., Hoefsloot, H.C., Westerhuis, J.A., Smilde, A.K. & van der Werf, M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).
44. Pavlidis, P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods 31, 282–289 (2003).
45. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
46. Westerhuis, C.A. et al. Assessment of PLSDA cross validation. Metabolomics 4, 81–89 (2007).
47. Goeman, J.J., van de Geer, S.A., de Kort, F. & van Houwelingen, H.C. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93–99 (2004).
48. Hummel, M., Meister, R. & Mansmann, U. GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics 24, 78–85 (2008).
49. Aittokallio, T. & Schwikowski, B. Graph-based methods for analysing networks in cell biology. Brief Bioinform. 7, 243–255 (2006).