Proteomics data analysis is in the middle of a major transition: we are moving from small experiments (e.g., a handful of RAW files and samples) to large-scale experiments. While the average number of RAW files per dataset in PRIDE has not grown in the last six years (Figure 1, left), we can now see multiple experiments with more than 1,000 RAW files (Figure 1, right).
Figure 1: Boxplot of the number of files per dataset in PRIDE (left: outliers removed; right: outliers included).
At the same time, file sizes show a trend towards larger RAW files (Figure 2).
Figure 2: Boxplot of file size by dataset in PRIDE (outliers removed).
How, then, can proteomics data analysis be moved towards large-scale and elastic compute architectures such as cloud infrastructures or high-performance computing (HPC) clusters?
Proteomics data analysis starts with reading the MS and MS/MS information from the RAW files (Figure 3).
Figure 3: Computational proteomics workflow (source: MCP).
Until very recently, the development of computational proteomics tools has been skewed towards Windows-based software such as Proteome Discoverer, MaxQuant, and PEAKS. An important driver of this bias has been the lack of cross-platform libraries to access the instrument output files (RAW files) of the major instrument vendors. This limited the development of Unix-based tools and frameworks, confining proteomics data analysis to desktop computers.
An important breakthrough was achieved in 2016 when Thermo Scientific released the first cross-platform application programming interface (API) that enables access to Thermo RAW files from all their instruments on all commonly used operating systems. Importantly, this provides the enticing possibility to move proteomics into Linux/UNIX environments, including scalable clusters and cloud environments.
ThermoRawFileParser
ThermoRawFileParser enables Unix-based and cloud architectures to read binary Thermo RAW files and convert them into standard file formats such as mzML. This facilitates the use of open-source tools that read mzML (an open, XML-based file format) and removes the dependency on Windows. Which tools support mzML as input? Check Table 1; a short example of a downstream search follows the table.
Table 1: Tools and frameworks that support mzML as input.

| Tool | URL | Type |
| --- | --- | --- |
| Trans-Proteomic Pipeline | http://tools.proteomecenter.org/software.php | Framework |
| OpenMS | https://www.openms.de/ | Framework |
| MSGF+ | https://github.com/MSGFPlus/msgfplus | Search engine |
| Crux | http://cruxtoolkit.sourceforge.net/ | Search engine |
| Kojak | http://www.kojak-ms.org/ | Cross-linking |
| Pyteomics | http://packages.python.org/pyteomics/ | Python library |
| Ursgal | https://github.com/ursgal/ursgal | Framework |
| pyOpenMS | https://www.openms.de/ | Python library |
| Dinosaur | https://github.com/fickludd/dinosaur | Quantification |
| Quandenser | https://github.com/statisticalbiotechnology/quandenser | Quantification |
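To illustrate how a converted mzML file plugs into these tools, here is a minimal sketch of an MSGF+ search on the conversion output. The file names (sample.mzML, human.fasta) and the search parameters are placeholders, not a recommended configuration.

# Hypothetical example: search a converted mzML file with MSGF+.
# -s: input spectra (the mzML produced by ThermoRawFileParser)
# -d: protein sequence database (FASTA)
# -t: precursor mass tolerance
# -o: identification results (mzIdentML)
java -Xmx4G -jar MSGFPlus.jar -s sample.mzML -d human.fasta -t 10ppm -o sample.mzid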
ThermoRawFileParser has multiple options (check the project here), including native support for writing the output directly into Amazon S3 buckets; a short usage example follows the option list below.
optional subcommands are xic|query (use [subcommand] -h for more info):
-i, --input=VALUE The raw file input (Required). Specify this or an
input directory -d.
-d, --input_directory=VALUE
The directory containing the raw files (Required).
Specify this or an input raw file -i.
-o, --output=VALUE The output directory. Specify this or an output
file -b. Specifying neither writes to the input
directory.
-b, --output_file=VALUE The output file. Specify this or an output
directory -o. Specifying neither writes to the
input directory.
-f, --format=VALUE The spectra output format: 0 for MGF, 1 for mzML,
2 for indexed mzML, 3 for Parquet. Defaults to
mzML if no format is specified.
-g, --gzip GZip the output file.
-p, --noPeakPicking Don't use the peak picking provided by the native
Thermo library. By default peak picking is
enabled.
-l, --logging=VALUE Optional logging level: 0 for silent, 1 for
verbose.
-e, --ignoreInstrumentErrors
Ignore missing properties by the instrument.
-u, --s3_url[=VALUE] Optional property to write directly the data into
S3 Storage.
-k, --s3_accesskeyid[=VALUE]
Optional key for the S3 bucket to write the file
output.
-t, --s3_secretaccesskey[=VALUE]
Optional key for the S3 bucket to write the file
output.
-n, --s3_bucketName[=VALUE]
S3 bucket name
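As a minimal sketch of a typical invocation (all paths, URLs, keys, and the bucket name below are placeholders, and the exact option combinations may need adjusting for your setup), converting a RAW file to gzipped, indexed mzML looks like this; on Linux/macOS the tool runs under mono, on Windows drop the leading mono:

# Convert a single RAW file to gzipped, indexed mzML (-f=2).
mono ThermoRawFileParser.exe -i=/data/sample.raw -o=/data/mzml/ -f=2 -g

# The same conversion, written directly into an S3 bucket
# (endpoint URL, access/secret keys, and bucket name are placeholders).
mono ThermoRawFileParser.exe -i=/data/sample.raw -f=2 -g -u=https://s3.example.org -k=MY_ACCESS_KEY -t=MY_SECRET_KEY -n=my-bucket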
One of the most relevant features of the tool is its simple design, which makes it easy to extend with new file formats and options. In addition, for each release of ThermoRawFileParser, a Conda package and Singularity/Docker containers are created, enabling the deployment of the tool in Docker-based cloud architectures and on HPC clusters.
Installing the tool takes a single line:
From Conda:
conda install -c conda-forge -c bioconda thermorawfileparser=1.2.1
If you don't know how to use Conda, here is an introduction.
From Docker:
docker pull quay.io/biocontainers/thermorawfileparser:1.2.1--0
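To run the containerized version, mount a local directory with RAW files into the container. A sketch, assuming the local path /path/to/raw is a placeholder and that the container exposes the ThermoRawFileParser.sh wrapper (the wrapper name is an assumption based on the bioconda package and may differ between versions):

# Mount the RAW file directory into the container and convert in place.
# /path/to/raw is a placeholder; the wrapper name may vary by version.
docker run -v /path/to/raw:/data quay.io/biocontainers/thermorawfileparser:1.2.1--0 ThermoRawFileParser.sh -i=/data/sample.raw -o=/data -f=2 -g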
I hope this tool will boost the development of cloud-based proteomics solutions and of more open-source, Unix-based tools and frameworks.
Citation:
Hulstaert, N., Shofstahl, J., Sachsenberg, T., Walzer, M., Barsnes, H., Martens, L., & Perez-Riverol, Y. (2019). ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. Journal of Proteome Research, 19(1), 537-542. [paper]