Monday 24 February 2020

ThermoRAWFileParser: A small step towards cloud proteomics solutions

Proteomics data analysis is in the middle of a big transition. We are moving from small experiments (e.g. a couple of RAW files or samples) to large-scale experiments. While the average number of RAW files per dataset in PRIDE hasn't grown in the last 6 years (Figure 1), we can see multiple experiments with more than 1,000 RAW files (Figure 1, right).

Figure 1: Boxplot of the number of files per dataset in PRIDE (left: outliers removed; right: outliers included)

On the other hand, file sizes show a clear trend towards larger RAW files (Figure 2).

Figure 2: Boxplot of file size per dataset in PRIDE (outliers removed)

How, then, can proteomics data analysis be moved towards large-scale and elastic compute architectures such as cloud infrastructures or high-performance computing (HPC) clusters?

Proteomics data analysis starts with reading the MS and MS/MS information from the RAW files (Figure 3).

Figure 3: Computational proteomics workflow (Source: MCP) 
Until very recently, the development of computational proteomics tools has been skewed towards Windows-based software such as Proteome Discoverer, MaxQuant, and PEAKS DB. An important driver for this bias has been the lack of cross-platform libraries to access the instrument output files (RAW files) from the major instrument vendors. This limitation has held back the development of Unix-based tools and frameworks, confining proteomics data analysis to desktop computers.

An important breakthrough was achieved in 2016 when Thermo Scientific released the first cross-platform application programming interface (API) that enables access to Thermo RAW files from all their instruments on all commonly used operating systems. Importantly, this provides the enticing possibility to move proteomics into Linux/UNIX environments, including scalable clusters and cloud environments. 

ThermoRawFileParser

ThermoRawFileParser enables Unix-based and cloud architectures to read binary Thermo RAW files and convert them into standard file formats such as mzML. This facilitates the use of open-source tools that read mzML (an open, XML-based file format) and removes the dependency on Windows. Which tools support mzML as input? Check Table 1.

Table 1: Tools and frameworks that support mzML as input

Tool | URL | Type
Trans-Proteomic Pipeline | http://tools.proteomecenter.org/software.php | Framework
OpenMS | https://www.openms.de/ | Framework
MSGF+ | https://github.com/MSGFPlus/msgfplus | Search engine
Crux | http://cruxtoolkit.sourceforge.net/ | Search engine
Kojak | http://www.kojak-ms.org/ | Cross-linking
Pyteomics | http://packages.python.org/pyteomics/ | Python library
Ursgal | https://github.com/ursgal/ursgal | Framework
pyOpenMS | https://www.openms.de/ | Python library
Dinosaur | https://github.com/fickludd/dinosaur | Quantification
Quandenser | https://github.com/statisticalbiotechnology/quandenser | Quantification
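
For example, a converted mzML file can go straight into one of the tools in Table 1. A minimal sketch with MSGF+ (the jar location, FASTA database and file names are placeholders for your own setup):

# Search the converted mzML file with MSGF+ on any Unix machine
java -Xmx4G -jar MSGFPlus.jar -s sample.mzML -d human_proteome.fasta -o sample.mzid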

ThermoRawFileParser has multiple options (check the project here), including native support for writing the output directly to Amazon S3 buckets.

optional subcommands are xic|query (use [subcommand] -h for more info]):
  -i, --input=VALUE          The raw file input (Required). Specify this or an
                               input directory -d.
  -d, --input_directory=VALUE
                             The directory containing the raw files (Required).
                               Specify this or an input raw file -i.
  -o, --output=VALUE         The output directory. Specify this or an output
                               file -b. Specifying neither writes to the input
                               directory.
  -b, --output_file=VALUE    The output file. Specify this or an output
                               directory -o. Specifying neither writes to the
                               input directory.
  -f, --format=VALUE         The spectra output format: 0 for MGF, 1 for mzML,
                               2 for indexed mzML, 3 for Parquet. Defaults to
                               mzML if no format is specified.
  -g, --gzip                 GZip the output file.
  -p, --noPeakPicking        Don't use the peak picking provided by the native
                               Thermo library. By default peak picking is
                               enabled.
  -l, --logging=VALUE        Optional logging level: 0 for silent, 1 for
                               verbose.
  -e, --ignoreInstrumentErrors
                             Ignore missing properties by the instrument.
  -u, --s3_url[=VALUE]       Optional property to write directly the data into
                               S3 Storage.
  -k, --s3_accesskeyid[=VALUE]
                             Optional key for the S3 bucket to write the file
                               output.
  -t, --s3_secretaccesskey[=VALUE]
                             Optional key for the S3 bucket to write the file
                               output.
  -n, --s3_bucketName[=VALUE]
                             S3 bucket name
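
For example, converting a single RAW file to gzipped, indexed mzML, or streaming the output directly to an S3 bucket, only needs the options above (paths, the S3 endpoint, keys and bucket name are placeholders; on Linux the executable is typically launched through mono):

# Convert one RAW file to gzipped, indexed mzML in /data/mzml/
mono ThermoRawFileParser.exe -i=/data/sample.raw -o=/data/mzml/ -f=2 -g

# Convert to mzML and write the result directly into an S3 bucket (placeholder endpoint, keys and bucket)
mono ThermoRawFileParser.exe -i=/data/sample.raw -f=1 -u=https://s3.example.org -k=ACCESS_KEY_ID -t=SECRET_ACCESS_KEY -n=my-proteomics-bucket
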
One of the most relevant features of the tool is its simple, modular design, which makes it easy to extend with new file formats and options. In addition, for each release of ThermoRawFileParser a conda package and Singularity/Docker containers are created, enabling the deployment of the tool in container-based cloud architectures and HPC clusters.

The tool can be installed with a single line:

From Conda:

conda install -c conda-forge -c bioconda thermorawfileparser==1.2.1
If you don't know how to use Conda, here is an introduction.

From Docker:

docker pull quay.io/biocontainers/thermorawfileparser:1.2.1--0
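
The same image can then run the conversion directly, for example (a sketch: /path/to/raw is a placeholder, and the wrapper script name below is the bioconda entry point, so check the container if it differs in your version):

# Mount a local directory with RAW files and convert it inside the container to indexed mzML
docker run --rm -v /path/to/raw:/data quay.io/biocontainers/thermorawfileparser:1.2.1--0 \
    ThermoRawFileParser.sh -i=/data/sample.raw -o=/data -f=2
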
I hope this tool will boost the development of cloud-based proteomics solutions and more open-source, Unix-based tools and frameworks.

Citation:
Hulstaert, N., Shofstahl, J., Sachsenberg, T., Walzer, M., Barsnes, H., Martens, L., & Perez-Riverol, Y. (2019). ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. Journal of Proteome Research, 19(1), 537-542. [paper]
