Standardisation: the most difficult flower to grow. |
The PSI (Proteomics Standards Initiative) 2014
Meeting was held this year in Frankfurt (13-17 April) and I can say I'm now
part of this history. First, I will try to describe in a couple of sentences (I
will surely fail) the incredible venue, the Schloss Reinhartshausen Kempinski. When I
saw the hotel for the first time, the first thing that came to my mind was those films from
the 50s. Everything was elegant, classic, sophisticated - from the decoration to
the smallest latch. The food was incredible and the service was first class from the
moment you set foot on the front step and throughout the whole stay.
Standardization is the process of developing
and implementing technical standards. Standardization can help to maximize
compatibility, interoperability, safety, repeatability, or quality. It can also
facilitate commoditization of formerly custom processes. In bioinformatics, the standardization of file formats, vocabulary, and
resources is a job that all of us appreciate but that, for several reasons, nobody
wants to do. First of all, standardization in bioinformatics means that you
need to organize and merge different experimental and in-silico pipelines to have a common way to represent the
information. In proteomics, for example, you can use different sample
preparation methods, combined with different fractionation techniques and different
mass spectrometers, and finally different search engines and
post-processing tools. This diversity of possible combinations is needed
because it allows us to explore different solutions for complex problems. (See also: Standardization
in Proteomics: From raw data to metadata files).
HUPO-PSI 2014 Venue: Kempinski Echerback Hotel. |
The Proteomics Standards Initiative formally started
in 2002, so it has been running for more than 12 years. Since
the first manuscript published by the group (Meeting Review: The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data), it has addressed major challenges in this topic for the
community:
“There
was a remarkable consensus between delegates attending the PSI meeting to the
effect that valuable data would be lost without public repositories and common
interchange formats making information accessible to the scientific community…
All such efforts require support from the user community and from the
scientific press and funding agencies.”
The HUPO-PSI consortium has been working in
four major groups: (i) Molecular Interactions, (ii) Mass Spectrometry, (iii)
Proteomics Informatics, (iv) Protein Separations. From my point of view, the
major results obtained under the PSI umbrella are:
- Definition of guidelines and controlled vocabularies to report proteomics and molecular interactions data [1][2].
- Development of PSI standard file formats (mzML, mzIdentML, mzQuantML, qcML, mzTab, PSI-MI, MITAB).
- Implementation of different resources and tools for standardization, visualization and sharing of proteomics data (PRIDE, IntAct, Reactome, PRIDE Inspector, PRIDE Converter, ProteoWizard, etc).
Description of major outcomes and results
Guidelines and controlled
vocabularies to report proteomics and molecular interactions data: The minimum information about experiments
series [1][2] is a collection of manuscripts and guidelines that encourage the
standardised collection, integration, storage and dissemination of proteomics
data. Within it, the HUPO-PSI develops guidance modules for reporting the use of techniques such as gel
electrophoresis, mass spectrometry and protein interaction networks.
The MIAPE and MIMIx guidelines are divided into
various modules:
· Study design and sample.
· Experimental motivation and design;
factors of interest; origin and preprocessing of biological material; numbers
of replicates; relationship to other studies; miscellaneous administrative
detail.
· Separations and sample handling.
· Column chromatography
· Capillary electrophoresis.
· Mass spectrometry.
· Informatics for mass spectrometry.
· Gel electrophoresis.
· Gel image informatics.
· Protein and peptide arrays.
· Statistical analysis of data
· Molecular interaction experiments.
Different paths and ideas, but only those well supported and structured are successful. |
From my point of view, the more successful modules in the list above, in terms of data standards,
resources, tools and benefits provided to the proteomics community, are mass spectrometry,
informatics for mass spectrometry and molecular interaction experiments. These
modules demonstrate the importance of having a good idea, a progressive field
and a powerful community behind it. The mass spectrometry and informatics for mass
spectrometry modules have been led by the PeptideAtlas and PRIDE groups, among
others. These groups have built their pipelines, data and tools on the
progress of the controlled vocabularies, standards and guidelines for data
publication and dissemination. The molecular interaction module has been a
cornerstone of the development of the IntAct
database (http://www.ebi.ac.uk/intact/)
and PSICQUIC (https://code.google.com/p/psicquic/).
Some notes from the meeting and current status of each module:
The mass spectrometry guidelines have guided the development of
standards for MS/MS representation and the final development of mzML. mzML
is still under active development; advances in technologies provide new
challenges, which need to be met by this standard, including the application
of mzML to metabolomics, SWATH-MS and other data-independent acquisition
workflows, and ion mobility MS. mzML is suitable for metabolomics; only the
addition of new CV terms is required to meet the current needs of this community.
To tackle the issue of data compression, the use of mgzip combined with a new
compression method, MS-Numpress, will yield mzML files that are often smaller
than vendor files.
mzML all about compression using MS-Numpress and mgzip |
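To make the compression discussion a bit more concrete, here is a minimal Python sketch of how a peak array is typically stored in an mzML binary data array: packed as little-endian 64-bit floats, optionally zlib-compressed, then base64-encoded. It does not implement MS-Numpress or mgzip; the function names and the m/z values are my own hypothetical examples.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian 64-bit doubles (CV term MS:1000523),
    optionally zlib-compress them (MS:1000574), then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse of encode_array: base64-decode, decompress, unpack doubles."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


# Hypothetical m/z values, evenly spaced (not real instrument data).
mz = [400.0 + 0.01 * i for i in range(2000)]

plain = encode_array(mz, compress=False)
packed = encode_array(mz, compress=True)

print("uncompressed base64 length:", len(plain))
print("zlib + base64 length:      ", len(packed))
```

Plain zlib is lossless, so the decoded values are identical to the input. MS-Numpress works differently (it trades a small, controlled loss of precision for much better compression), which is why combining the two ideas is attractive.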
mzML is a mature file format: it can
represent chromatography information and MS information in the same file, and with
the new improvements the file size has decreased considerably
compared with competitors such as mz5 and mzXML. The mass spectrometry community
still has some challenges for the future with the evolution of topics such
as ion mobility and DIA (data-independent acquisition). Issues remaining to be
resolved by this group include deciding how synthesized MS2
spectra acquired from MSE (a data-independent approach that acquires MS1
and MS2 mass spectra in an unbiased and parallel manner) and merged or
clustered spectra should be captured in mzML.
The informatics for mass
spectrometry group has been involved in the development and implementation of standards to
represent the process of identification, quality assessment and post-processing
of mass spectrometry data. Apart from the ontologies and the important work done
in standardization, the main output of this group is the release of mzIdentML.
mzIdentML was released in 2012 and is the successor of the pepXML and protXML file
formats. The mzIdentML standard for peptide/protein identification still requires some updates to meet the needs of protein grouping,
including statistical thresholds for protein groups, support for peptide-level
statistics, support for the use of multiple search engines and the
first support for chemical cross-linking studies.
mzTab (Laurel) & mzIdentML(Hardy) |
Recently this group developed mzTab,
a lightweight file format for peptide/protein identification and quantitation. mzTab
has been used for proteomics but also for metabolomics. As a bioinformatician
and user of mzTab, I really like its general way of modelling quantitation experiments. As a developer, I like the simple way it
represents different kinds of complex data in one format; it is really simple in terms of
data structure and easy to learn. Also, the size of one complete experiment
should be about 1/20 of the corresponding mzIdentML file. Both file formats are now supported by PRIDE Inspector to check the quality of proteomics experiments.
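To show why I call the data structure simple, here is a minimal Python sketch of an mzTab reader. It relies only on the fact that every mzTab line is tab-separated and starts with a three-letter code (MTD for metadata, PSH/PSM for the PSM header and rows, and so on). The sample content is hypothetical and heavily trimmed: a real file has many more mandatory columns, and `parse_mztab` is my own illustrative helper, not part of any official library.

```python
from collections import defaultdict

# Hypothetical, trimmed-down mzTab content (real files carry more columns).
SAMPLE = "\n".join([
    "COM\tHypothetical example, not a real dataset",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tToy experiment",
    "PSH\tsequence\tPSM_ID\taccession\tcharge",
    "PSM\tPEPTIDEK\t1\tP12345\t2",
    "PSM\tSAMPLER\t2\tP67890\t3",
])

# Each header line prefix announces the columns of one row type.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text):
    """Return (metadata dict, {row prefix: list of row dicts})."""
    metadata = {}
    headers = {}
    rows = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "COM":  # comment line, ignored
            continue
        if prefix == "MTD":
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = fields[1:]
        elif prefix in headers:
            rows[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, dict(rows)


metadata, rows = parse_mztab(SAMPLE)
print(metadata["mzTab-version"])
for psm in rows["PSM"]:
    print(psm["sequence"], psm["accession"], psm["charge"])
```

All values come back as strings; a real reader would also validate the mandatory columns and convert types.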
The molecular interaction
experiments module
has been working on the development of molecular interaction resources,
standards, controlled vocabularies, etc. The main work of the molecular
interaction group in this meeting was related to extending the PSI-MI XML standard to enable:
· The ability to exchange “abstracted”
data, i.e. knowledge built from experimental data such as protein complex composition/topology
and cooperative interactions.
· The ability to add information on
dynamic interactions.
· The capture of the
causality of molecular interactions, which needs further discussion with external
groups to ensure we have adequate data capture in an appropriate format.
The group then discussed the development of the
JAMI Java application programming interfaces. JAMI is a single Java library
designed to unify both the MITAB and PSI-XML standard formats by providing a
common Java framework, while hiding the complexity of both from the naive user,
similar to other proteomics libraries such as ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api).
Once the first version of the JAMI core data model has been released,
subsequent tool development will be easier, as tools will not need to be format
specific.
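For readers who have never opened a MITAB file, here is a minimal Python sketch of what the tabular format looks like underneath. This is not the JAMI API (JAMI is a Java library); it only assumes the MITAB 2.5 layout of 15 tab-separated columns, with "|" between multiple values and "-" for an empty field. The column labels, the helper `parse_mitab_line` and the identifiers in the sample line are my own hypothetical placeholders.

```python
# My own short labels for the 15 MITAB 2.5 columns, in file order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]


def parse_mitab_line(line):
    """Split one MITAB 2.5 line into {column label: list of values}.

    Naive sketch: it splits on '|' without handling pipes that might
    appear inside quoted text.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError("expected at least 15 tab-separated columns")
    record = {}
    for name, value in zip(MITAB25_COLUMNS, fields):
        record[name] = [] if value == "-" else value.split("|")
    return record


# Hypothetical interaction; accessions and identifiers are placeholders.
SAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:P67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2014)", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])

rec = parse_mitab_line(SAMPLE_LINE)
print(rec["id_a"], rec["id_b"], rec["detection_method"])
```

A library such as JAMI hides exactly this kind of parsing, so the same code can read MITAB or PSI-MI XML without caring which one it was given.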
On the last day, a complete session was dedicated to ProteomeXchange: major results and future challenges. Some of the partners involved in ProteomeXchange gave talks about their resources and tools. The advances and future developments in resources such as PRIDE, PeptideAtlas and MassIVE were presented and discussed in detail. These three resources are the main partners of the consortium at the moment. Robert Chalkley also gave a really nice introduction to MS-Viewer, a web-based spectral viewer for proteomics results.
These are some of my quick notes, together with some documentation from "Meeting New Challenges: The 2014 HUPO‐PSI/COSMOS Workshop." HUPO-PSI is history and I was part of it. Future challenges have emerged from discussions and new ideas. I met really nice people during the meeting, people who make our life easier in the lab.
References
[1] Taylor CF, Paton NW, Lilley KS et al. The
minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007 Aug;25(8):887-93.