I'm the chosen one.
File formats (the way
that we represent, store and exchange our data) are a fundamental
piece of bioinformatics; more than that, they are one of the milestones of
the Information Era. In some fields the topic is more settled than in others, but
it is still on the table for most of us. To get a quick idea, look at the
evolution of general-purpose standards in recent years: XML,
JSON and, more recently, YAML.
What happens in computational proteomics? Here is a great picture that summarizes the breadth of file formats in computational proteomics, from Eric Deutsch's manuscript in MCP [1]:
Incredible? In proteomics
and mass spectrometry, file formats cover a wide range of processes, workflows
and analytical protocols, divided into two main groups: Informatics Analysis and MS
Analysis. The starting point is the RAW files
(vendor formats). I don't want to explain all of these file formats in detail;
I'll dedicate a post to them in the near future. Here I will talk about RAW files, why it is important to deal
with them, and who is doing that job very well.
The key to interpreting RAW
data directly has been the development of specific
software to parse the binary content of these raw files into intelligible data,
a tedious and time-consuming task that typically needs to be redone each time a
new machine or a new version of an existing machine or its operating software
appears [2]. Next to the above-mentioned caveats associated with proprietary raw
data formats, there is also the very real problem of “aging” that comes with
any binary formatted data. As time goes by, support for certain formats tends
to evaporate and within the space of several years, readers can no longer be
found for the format.
As a result, most new software and tools in computational proteomics avoid handling the original RAW files and use standard formats such as mzXML, mzML or simple peak files. Most search engines and quantitation or visualisation tools are based on those files, which are simpler to exchange and read. But who exports the original data (RAWs) to those files? The vendors? No: PROTEOWIZARD.
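To give an idea of how simple a peak file is to read, here is a minimal Python sketch that parses MGF text (one of the simple peak formats). The function name `parse_mgf` and the example spectrum are made up for illustration, and the sketch assumes the common layout of `KEY=value` parameter lines followed by "m/z intensity" pairs between `BEGIN IONS` and `END IONS`; real files may carry extra details that it ignores.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict[str, object]]:
    """Parse MGF-formatted text into a list of spectra (illustrative sketch)."""
    spectra: List[Dict[str, object]] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z and intensity (any further columns are ignored)
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Hypothetical spectrum, not taken from a real run.
example = """BEGIN IONS
TITLE=example spectrum
PEPMASS=500.25
CHARGE=2+
100.1 2000
200.2 1500.5
END IONS
"""

spectra = parse_mgf(example)
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

Compare this with a vendor RAW file, where the same information sits in an undocumented binary layout.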
Originally published in Bioinformatics in 2008 [3], ProteoWizard has played its data conversion role better than any other tool. ProteoWizard provides a modular and extensible set of open source, cross-platform tools and libraries. The framework includes different tools for data conversion and a core API for parsing different data formats [4]. In addition to the open mzML, mzXML, mzIdentML, and mzData XML formats, a variety of proprietary formats can also be handled.
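In practice the conversion is usually done with ProteoWizard's `msconvert` command-line tool. The Python sketch below only builds and launches such a command; it assumes `msconvert` is installed and on the PATH, and the helper names (`build_msconvert_command`, `convert`) and file names are hypothetical. Check `msconvert --help` on your installation before relying on any flag.

```python
import shutil
import subprocess
from typing import List

# Output-format flags of msconvert used in this sketch.
FORMAT_FLAGS = {"mzML": "--mzML", "mzXML": "--mzXML", "mgf": "--mgf"}


def build_msconvert_command(raw_file: str, out_dir: str, fmt: str = "mzML") -> List[str]:
    """Return the msconvert argument list for converting one file."""
    if fmt not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {fmt}")
    return ["msconvert", raw_file, FORMAT_FLAGS[fmt], "-o", out_dir]


def convert(raw_file: str, out_dir: str, fmt: str = "mzML") -> None:
    """Run msconvert on raw_file; fails early if the tool is not installed."""
    if shutil.which("msconvert") is None:
        raise RuntimeError("msconvert was not found on PATH")
    subprocess.run(build_msconvert_command(raw_file, out_dir, fmt), check=True)


# Show the command that would be run for a hypothetical file.
print(" ".join(build_msconvert_command("sample.raw", "converted", "mzML")))
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect the command on a machine where ProteoWizard is not installed.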
One of the things I really like about this tool is its simple and modular design, which allows the conversion of different proprietary formats to a common data model; see the figure from the Nature Biotechnology paper:
One thing still missing is that, to be fully functional, the tool must be installed and run on a Windows system (the vendors' fault). Here (https://github.com/jmchilton/proteowizard-wine-packager) you can find a Linux wrapper to run it with Wine on a Linux machine (I haven't tested it). See also this help page from TPP.
Supported Data Formats
- WIFF, T2D (with DataExplorer)
- MassHunter (.d directories)
- FID, .d directories, XMASSXML
- RAW
- Raw directories
- mzML
- mzXML
- MGF
- MS2/CMS2/BMS2
- mz5
Other tools?
- ms2mz by bioproximity: simple utility for converting between common mass spectrometer file formats.
- APLToMGFConverter: converts MaxQuant APL (Andromeda peak lists) to MGF.
- CompassXport: converts Bruker analysis.baf and analysis.yep files to mzXML.
- dat2mgf: converts Mascot results files back to MGF
- DataAnalysis2TPP: converts MGF from Bruker DataAnalysis to TPP-friendly format for use with XPRESS and ASAPRatio
- MassWolf: converts MassLynx format to mzXML
- MGF to .dta File Converter: converts MGF to .dta
- mz2mgf: converts mzData files to MGF
- mzBruker: converts Bruker analysis.baf files to mzXML
- mzStar: converts SCIEX/ABI Analyst format (WIFF) to mzXML
- mzXML2Other: converts mzXML to SEQUEST .dta, MGF, and Micromass .pkl
- PklFileMerger: merges individual Q-TOF .pkl files into a single file for database searching.
- ReAdW: converts Xcalibur native acquisition files to mzXML
- T2D converter: converts ABI SCIEX 4700/4800 t2d files to mzXML
- unfinnigan: reads Thermo .raw files without MsFileReader
- wiff2dta: converts ABI WIFF to .dta
- X2XML: converts from almost any format (Thermo, Bruker, Agilent and Micromass) to mzXML
References
[1] Deutsch, Eric W. "File formats commonly used in mass spectrometry proteomics." Molecular & Cellular Proteomics 11.12 (2012): 1612-1621.
[2] Martens, Lennart, et al. "Do we want our data raw? Including binary mass spectrometry data in public proteomics data repositories." Proteomics 5.13 (2005): 3501-3505.
[3] Kessner, Darren, et al. "ProteoWizard: open source software for rapid proteomics tools development." Bioinformatics 24.21 (2008): 2534-2536.
[4] Perez-Riverol, Yasset, et al. "Open source libraries and frameworks for mass spectrometry based proteomics: A developer's perspective." Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics 1844.1 (2014): 63-76.