A proteomics sample metadata representation for multiomics integration and big data analysis

The amount of proteomics data in public repositories is growing at an unprecedented rate1,2. ProteomeXchange (PX) is a consortium of proteomics resources, including the PRIDE database2, PASSEL and PeptideAtlas3, MassIVE4, jPOST5,6, iProX7, and Panorama Public8. As of July 2021, over 27,000 datasets had been submitted to PX data repositories. PX datasets cover the whole spectrum of protein mass spectrometry (MS) analytical methods and experimental designs, which enables biologists and clinicians to study different aspects of the proteome. As data deposition has become common practice, reuse of public datasets has become increasingly popular. However, thus far, data reuse has largely been limited to benchmarking studies and applications related to peptide and protein identification, with resources such as PeptideAtlas3 and GPMDB9 systematically reanalyzing data from PX10. Recently, new efforts like ProteomicsDB11, MassIVE.quant4, and Expression Atlas12 have started to include reanalyzed quantitative public datasets to present baseline and differential protein expression. However, the scalability and broad reuse of public quantitative experiments have been limited by the lack of sample metadata annotation that unambiguously associates the samples included in each dataset with the corresponding data files13,14.

Since 2012, PX resources have been capturing a general dataset description, including the dataset title, description, instrument, protein modifications included in the search, and submitters/principal investigators, among other data1. The files included in each dataset are, on one hand, the output of the corresponding instrument (e.g., RAW files), and on the other hand, the processed results, which can be represented, e.g., in standard file formats such as mzIdentML15 or mzTab16. Currently, all PX partners mandate two types of information for each dataset: a general dataset description and the files containing the different required data types. Unfortunately, the experimental design and sample-related information are frequently missing in the datasets or are stored in ad hoc ways and/or formats1. Information about the biological samples such as the analyzed organ, tissue, disease, or cell line, and the links between the samples and the corresponding data files are often lacking.

Sample-related metadata and their relationship with the data files are well captured in two widespread file formats called ISA-TAB17 and MAGE-TAB (MicroArray Gene Expression Tabular)18, which are used in metabolomics and transcriptomics, respectively. As of May 2021, ArrayExpress has stored over 74,000 functional genomics datasets in the MAGE-TAB format18,19. In both formats, a tab-delimited file is used to annotate the sample metadata and link the metadata to the corresponding data files. While MAGE-TAB was originally designed for microarray experiments, it has been successfully adapted to high-throughput RNA-sequencing and single-cell RNA-Seq experiments20.

Here we introduce an extension and implementation of the MAGE-TAB format for proteomics (MAGE-TAB-Proteomics). The format has been developed in collaboration with the Proteomics Standards Initiative (PSI), the organization in charge of developing open-standard formats in the field21. We have also developed general guidelines about what information needs to be encoded in MAGE-TAB to improve the reproducibility and enable the reanalysis of proteomics datasets. In addition, we have crowdsourced the annotation of over 200 existing public datasets according to these guidelines, covering different analytical methods and experimental designs. Finally, we have developed an ecosystem of tools to validate MAGE-TAB-Proteomics files and integrate the metadata in the PRIDE database, the most popular PX resource. The full specification document describing all aspects of MAGE-TAB-Proteomics version 1.0, the current implementations, as well as application examples, is available at the PSI website (https://psidev.info/magetab).

Repurposing MAGE-TAB for proteomics

MAGE-TAB encodes the sample metadata annotations and the information linking the metadata to the corresponding data files in two different files: the Investigation Description Format (IDF) and the Sample and Data Relationship Format (SDRF). In the following, we describe how we adapted these formats to the specific needs of proteomics.

Providing study-description information in IDF

The IDF file contains information describing the study, including, e.g., authors/submitters, protocols, and publications (Supplementary Note 1). The IDF format contains a series of key/value pairs, where each key represents a different property. For example, “Experiment Description” should be followed by a free-text description of the experiment (which would be the value). Most of the fields can contain more than one value, so that multiple values (e.g., multiple analysis software tools) can be defined in a single IDF file. Since 2012, PX dataset descriptions have been provided using PX XML (http://proteomecentral.proteomexchange.org/schemas/proteomeXchange-1.4.0.html), an XML file format that captures information equivalent to that included in IDF, making the two formats easily interchangeable (Supplementary Note 1). Therefore, we developed the IDF component of MAGE-TAB based on the existing PX XML format.
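The key/value layout described above can be illustrated with a minimal Python sketch. It assumes that each IDF line holds a key followed by one or more tab-separated values; the function name parse_idf and the example field "Person Last Name" with its values are illustrative assumptions, not taken from the specification text quoted here.

```python
def parse_idf(text):
    """Parse tab-delimited IDF text into a dict mapping each key to a list of values."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, *values = line.split("\t")
        # A key may carry several values; keep only the non-empty cells.
        cleaned = [value.strip() for value in values if value.strip()]
        fields.setdefault(key.strip(), []).extend(cleaned)
    return fields


# Hypothetical two-line IDF fragment used only for illustration.
example_idf = (
    "Experiment Description\tLabel-free analysis of two phenotypes\n"
    "Person Last Name\tDoe\tRoe\n"
)

idf = parse_idf(example_idf)
print(idf["Experiment Description"])
print(idf["Person Last Name"])
```

Storing every value as a list keeps single-valued and multi-valued fields uniform, which simplifies a later mapping to an XML representation.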

Linking samples to data files with SDRF

SDRF is a tab-delimited file that describes the samples and allows their mapping to the data files1. As shown in Fig. 1a, SDRF includes the annotation of (i) biological sample metadata; (ii) the relationships between samples and data files; (iii) (technical) metadata of RAW data files; and (iv) the variables under study (called factor values). Each row in an SDRF file corresponds to one relationship between a sample and a data file (an MS RAW file or a channel included in a given RAW file in the case of labeling-based proteomics). Each column corresponds to an attribute/property of the sample or the file, and the value in each cell is the specific value of the property (Fig. 1a).
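As a rough illustration of the row model, the following Python sketch reads a small SDRF-like table and groups data files by sample. The table content, the helper names, and the column name comment[data file] are assumptions made for this example; csv.reader is used instead of csv.DictReader because an SDRF header may repeat a column name.

```python
import csv
import io

# Hypothetical SDRF fragment: one row per sample-to-data-file relationship.
sdrf_text = (
    "source name\tcharacteristics[organism]\tassay name\t"
    "comment[data file]\tfactor value[phenotype]\n"
    "sample 1\tHomo sapiens\trun 1\tsample1_tr1.raw\tcontrol\n"
    "sample 1\tHomo sapiens\trun 2\tsample1_tr2.raw\tcontrol\n"
    "sample 2\tHomo sapiens\trun 3\tsample2_tr1.raw\ttreated\n"
)


def read_sdrf(text):
    """Return (header, rows) from tab-delimited SDRF text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]  # drop empty lines
    return header, rows


def files_per_sample(header, rows):
    """Map each source name to the list of data files linked to it."""
    sample_index = header.index("source name")
    file_index = header.index("comment[data file]")  # assumed column name
    mapping = {}
    for row in rows:
        mapping.setdefault(row[sample_index], []).append(row[file_index])
    return mapping


header, rows = read_sdrf(sdrf_text)
sample_files = files_per_sample(header, rows)
print(sample_files)
```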

Fig. 1: SDRF-Proteomics representation for a label-free-based experiment without fractionation.

a Experimental design, including two biological replicates and two technical replicates per biological replicate. The biological and technical replicates are defined by the variable under study (e.g., phenotype). b The SDRF tab-delimited file, including the three main sections highlighted: sample metadata, data file properties, and the variables under study (factor values).

All the properties in the SDRF must be encoded as ontology terms, whereas the values of the properties can be encoded as ontology terms, numerical values, or free text. To facilitate the annotation, validation, and processing of SDRF files, a list of ontologies has been defined that can be used for encoding each property. For example, most of the sample properties are included in the Experimental Factor Ontology22 (EFO, https://www.ebi.ac.uk/efo/), while most of the data-file properties are included in the PSI-MS controlled vocabulary (https://www.ebi.ac.uk/ols/ontologies/ms) and the PRIDE ontology (https://www.ebi.ac.uk/ols/ontologies/pride).

Each sample in an SDRF file has a unique identifier (source name), and every sample property is encoded using the prefix characteristics (e.g., characteristics[organism part]). Each data file also has a unique identifier (assay name), and every file property has the prefix comment (e.g., comment[instrument], comment[fraction identifier]). Finally, the variables under study must be specified with the prefix factor value (e.g., factor value[tissue]). The MAGE-TAB-Proteomics specification defines the minimum information that should be provided for every sample and data file (https://github.com/bigbio/proteomics-metadata-standard/raw/master/psi-document/HUPO-PSI-MAGETAB-Proteomics_latest.docx). For every sample in a proteomics experiment, the following properties must be provided: organism, organism part, and biological replicate accession. For every data file, the following properties are required: fraction identifier, technical replicate accession, label (in the case of labeling methods), and data-file name. Biological and technical replicates should be explicitly included using the terms characteristics[biological replicate] and comment[technical replicate], respectively (Fig. 1a). The biological replicate field is considered a property of the samples, whereas the technical replicate is considered a property of the data files.
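The prefix convention lends itself to a simple header check. The sketch below is a minimal illustration and not the validator mentioned earlier: it sorts header columns by prefix and reports which of the required properties listed above are absent. The column name data file and the exact spelling of the required names are assumptions; label is left out because it applies only to labeling methods.

```python
import re

# Matches "characteristics[...]", "comment[...]", and "factor value[...]",
# tolerating a stray space before the opening bracket.
COLUMN_PATTERN = re.compile(
    r"^(characteristics|comment|factor value)\s*\[(.+)\]$", re.IGNORECASE
)

# Required properties as listed in the text; the spelling is an assumption.
REQUIRED = {
    "characteristics": {"organism", "organism part", "biological replicate"},
    "comment": {"fraction identifier", "technical replicate", "data file"},
}


def classify_columns(header):
    """Group property names by their prefix (sample, data file, factor value)."""
    sections = {"characteristics": set(), "comment": set(), "factor value": set()}
    for column in header:
        match = COLUMN_PATTERN.match(column.strip())
        if match:
            sections[match.group(1).lower()].add(match.group(2).strip().lower())
    return sections


def missing_required(header):
    """Return the required properties that the header does not contain."""
    sections = classify_columns(header)
    missing = {}
    for prefix, names in REQUIRED.items():
        absent = names - sections[prefix]
        if absent:
            missing[prefix] = sorted(absent)
    return missing


example_header = [
    "source name",
    "characteristics[organism]",
    "characteristics[organism part]",
    "assay name",
    "comment[fraction identifier]",
    "comment[technical replicate]",
    "comment[data file]",
    "factor value[tissue]",
]
print(missing_required(example_header))
```

In this example the header lacks characteristics[biological replicate], so that property is the only one reported.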

A second category of fields includes information that is not mandatory but recommended. Each PX repository can define which of the recommended fields must be provided in their resource, depending on the experiment types. The current PX templates request the submitters to provide the following properties for every data file: instrument model, cleavage agent, fragment-mass tolerance, precursor-mass tolerance, and mass modifications (e.g., post-translational modifications (PTMs) and artifactual modifications considered in the analysis). Most of the values of these properties can be encoded as a combination of multiple key/value pairs (e.g., methionine oxidation can be specified as AC = UNIMOD:35;NT = Oxidation;MT = Variable;TA = M). We believe that this represents the minimum set of information that is necessary in practical terms for understanding and reusing MS-based proteomics datasets. Furthermore, other properties such as labeling can be provided. Importantly, the set of mandatory properties can be readily expanded in case the PX community decides to extend its metadata requirements in the future. For now, we have defined additional templates (Supplementary Table 1), each of which lists the recommended properties for a given experiment type. Submitters can use the template corresponding to their experiment to streamline annotation. For example, the cell line (characteristics[cell line]) is a recommended metadata item for cell-line experiments; for every human dataset, the disease under study should be provided in the field characteristics[disease] and the control samples should be labeled with the value “normal”.
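A cell that combines several key/value pairs, such as the methionine oxidation example above, can be split with a few lines of Python. This is a minimal sketch; the function name parse_key_value_cell is hypothetical, and the sketch assumes that pairs are separated by semicolons and that neither keys nor values contain a semicolon.

```python
def parse_key_value_cell(cell):
    """Split a 'KEY=value;KEY=value' cell into a dict, ignoring extra spaces."""
    pairs = {}
    for part in cell.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


modification = parse_key_value_cell(
    "AC = UNIMOD:35;NT = Oxidation;MT = Variable;TA = M"
)
print(modification["AC"], modification["NT"])
```

Using partition rather than split keeps any further "=" characters inside the value intact.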

Multiplexing and fractionation

Transcriptomics datasets typically show a one-to-one relationship between each sample and data file. While this is also the case for some proteomics data, two popular experimental designs in proteomics—sample multiplexing and fractionation—follow different patterns (Fig. 2).

Fig. 2: SDRF-Proteomics file for an experiment combining TMT labeling and sample fractionation.

a TMT experimental design with three samples and three fractions. b SDRF representation for a TMT experiment with three samples and three fractions resulting in nine rows where samples are repeated for each fraction and data-file information is repeated for each labeling channel, which is encoded using the property comment[label].

In multiplexed quantitative experiments (e.g., based on tandem mass tag (TMT) labeling), multiple samples can be related to the same data file (Fig. 2a). In these cases, all the data-file properties (e.g., instrument) should be repeated for each sample. The different samples included in the same data file can be distinguished using the label property (e.g., comment[label] = TMT128N).

When fractionation is used, the sample information should be repeated for each data file. A property called fraction identifier is used to make clear which fractions correspond to a given data file (e.g., comment[fraction identifier] = 1).
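The row layout for a design that combines multiplexing and fractionation can be sketched in Python as follows. The example mirrors the three-sample, three-fraction design of Fig. 2 under the assumption that each fraction is acquired as one RAW file; the sample names, label assignment, file names, and the column name comment[data file] are illustrative.

```python
from itertools import product

# Hypothetical channel assignment: one TMT label per sample.
samples = {"sample 1": "TMT126", "sample 2": "TMT127N", "sample 3": "TMT128N"}
fractions = [1, 2, 3]


def build_rows(samples, fractions):
    """Create one row per (fraction, sample) pair.

    Sample information is repeated for every fraction, and data-file
    information is repeated for every labeling channel.
    """
    rows = []
    for fraction, (sample, label) in product(fractions, samples.items()):
        rows.append(
            {
                "source name": sample,
                "assay name": f"run {fraction}",
                "comment[label]": label,
                "comment[fraction identifier]": str(fraction),
                "comment[data file]": f"fraction_{fraction}.raw",
            }
        )
    return rows


tmt_rows = build_rows(samples, fractions)
data_files = {row["comment[data file]"] for row in tmt_rows}
print(len(tmt_rows), len(data_files))
```

Three samples and three fractions give nine rows that refer to three distinct data files.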

These features make MAGE-TAB-Proteomics highly flexible and applicable to complex experimental designs involving both fractionation and multiplexing (Fig. 2b). While the duplication of information can be perceived as redundant, it enables a streamlined data reanalysis because each row/line of the SDRF can be processed individually. In addition, it facilitates meta-analysis with simple operations such as merging different SDRFs coming from different datasets or splitting an SDRF for a given dataset by a specific property of the data file or the sample.
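Because every row is self-contained, merging and splitting reduce to basic list operations. The following Python sketch represents each row as a dictionary, which assumes that column names are unique within a row; the placeholder value "not available" for columns absent from one dataset, the function names, and the example rows are assumptions of this illustration.

```python
from collections import defaultdict


def merge_rows(*datasets, missing="not available"):
    """Concatenate rows from several SDRFs over the union of their columns."""
    columns = []
    for rows in datasets:
        for row in rows:
            for column in row:
                if column not in columns:
                    columns.append(column)  # preserve first-seen column order
    return [
        {column: row.get(column, missing) for column in columns}
        for rows in datasets
        for row in rows
    ]


def split_rows(rows, column):
    """Group rows by the value of one sample or data-file property."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


# Hypothetical rows from two datasets with partly different columns.
dataset_a = [
    {
        "source name": "sample 1",
        "characteristics[organism part]": "liver",
        "comment[instrument]": "instrument A",
    }
]
dataset_b = [
    {"source name": "sample 2", "characteristics[organism part]": "kidney"},
    {"source name": "sample 3", "characteristics[organism part]": "liver"},
]

merged = merge_rows(dataset_a, dataset_b)
by_tissue = split_rows(merged, "characteristics[organism part]")
print(len(merged), sorted(by_tissue))
```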
