Monday 24 February 2020

ThermoRAWFileParser: A small step towards cloud proteomics solutions

Proteomics data analysis is in the middle of a big transition. We are moving from small experiments (e.g. a couple of RAW files or samples) to large-scale experiments. While the average number of RAW files per dataset in PRIDE hasn't grown in the last 6 years (Figure 1), we can see multiple experiments with more than 1,000 RAW files (Figure 1, right).

Figure 1: Boxplot of the number of files per dataset in PRIDE (left: outliers removed; right: outliers included)

On the other hand, file sizes show a clear trend towards larger RAW files (Figure 2).

Figure 2: Boxplot of file size per dataset in PRIDE (outliers removed)

How, then, can proteomics data analysis be moved towards large-scale and elastic compute architectures such as cloud infrastructures or high-performance computing (HPC) clusters?

Proteomics data analysis starts with reading the MS and MS/MS information from the RAW files (Figure 3).

Figure 3: Computational proteomics workflow (Source: MCP) 
Until very recently, the development of computational proteomics tools has been skewed towards Windows-based software such as Proteome Discoverer, MaxQuant, and PEAKS DB. An important driver for this bias has been the lack of cross-platform libraries to access the instrument output files (RAW files) from the major instrument vendors. This limitation has held back the development of Unix-based tools and frameworks, confining proteomics data analysis to desktop computers.

An important breakthrough was achieved in 2016 when Thermo Scientific released the first cross-platform application programming interface (API) that enables access to Thermo RAW files from all their instruments on all commonly used operating systems. Importantly, this provides the enticing possibility to move proteomics into Linux/UNIX environments, including scalable clusters and cloud environments. 

ThermoRawFileParser

ThermoRawFileParser enables Unix-based and cloud architectures to read binary Thermo RAW files and convert them into standard file formats such as mzML. This facilitates the use of open-source tools that read mzML (an open, XML-based file format) and removes the dependency on Windows. Which tools support mzML as input? Check Table 1.

Table 1: Tools and frameworks that support mzML as input

Tool | URL | Type
Trans-Proteomic Pipeline | http://tools.proteomecenter.org/software.php | Framework
OpenMS | https://www.openms.de/ | Framework
MSGF+ | https://github.com/MSGFPlus/msgfplus | Search engine
Crux | http://cruxtoolkit.sourceforge.net/ | Search engine
Kojak | http://www.kojak-ms.org/ | Cross-linking
Pyteomics | http://packages.python.org/pyteomics/ | Python library
Ursgal | https://github.com/ursgal/ursgal | Framework
pyOpenMS | https://www.openms.de/ | Python library
Dinosaur | https://github.com/fickludd/dinosaur | Quantification
Quandenser | https://github.com/statisticalbiotechnology/quandenser | Quantification
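
For example, a converted mzML file can go straight into one of the tools in Table 1. A minimal sketch with MSGF+ (the jar location, FASTA database and file names are placeholders for your own setup):

# Search the converted mzML file with MSGF+ on any Unix machine
java -Xmx4G -jar MSGFPlus.jar -s sample.mzML -d human_proteome.fasta -o sample.mzid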

ThermoRawFileParser has multiple options (check the project here), including native support for writing the output directly to Amazon S3 buckets.

optional subcommands are xic|query (use [subcommand] -h for more info]):
  -i, --input=VALUE          The raw file input (Required). Specify this or an
                               input directory -d.
  -d, --input_directory=VALUE
                             The directory containing the raw files (Required).
                               Specify this or an input raw file -i.
  -o, --output=VALUE         The output directory. Specify this or an output
                               file -b. Specifying neither writes to the input
                               directory.
  -b, --output_file=VALUE    The output file. Specify this or an output
                               directory -o. Specifying neither writes to the
                               input directory.
  -f, --format=VALUE         The spectra output format: 0 for MGF, 1 for mzML,
                               2 for indexed mzML, 3 for Parquet. Defaults to
                               mzML if no format is specified.
  -g, --gzip                 GZip the output file.
  -p, --noPeakPicking        Don't use the peak picking provided by the native
                               Thermo library. By default peak picking is
                               enabled.
  -l, --logging=VALUE        Optional logging level: 0 for silent, 1 for
                               verbose.
  -e, --ignoreInstrumentErrors
                             Ignore missing properties by the instrument.
  -u, --s3_url[=VALUE]       Optional property to write directly the data into
                               S3 Storage.
  -k, --s3_accesskeyid[=VALUE]
                             Optional key for the S3 bucket to write the file
                               output.
  -t, --s3_secretaccesskey[=VALUE]
                             Optional key for the S3 bucket to write the file
                               output.
  -n, --s3_bucketName[=VALUE]
                             S3 bucket name
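
For example, converting a single RAW file to gzipped, indexed mzML, or streaming the output directly to an S3 bucket, only needs the options above (paths, the S3 endpoint, keys and bucket name are placeholders; on Linux the executable is typically launched through mono):

# Convert one RAW file to gzipped, indexed mzML in /data/mzml/
mono ThermoRawFileParser.exe -i=/data/sample.raw -o=/data/mzml/ -f=2 -g

# Convert to mzML and write the result directly into an S3 bucket (placeholder endpoint, keys and bucket)
mono ThermoRawFileParser.exe -i=/data/sample.raw -f=1 -u=https://s3.example.org -k=ACCESS_KEY_ID -t=SECRET_ACCESS_KEY -n=my-proteomics-bucket
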
One of the most relevant features of the tool is its simple, modular design, which makes it easy to extend with new file formats and options. In addition, for each release of ThermoRawFileParser a conda package and Singularity/Docker containers are created, enabling the deployment of the tool in container-based cloud architectures and HPC clusters.

The tool can be installed with a single line:

From Conda:

conda install -c conda-forge -c bioconda thermorawfileparser==1.2.1
If you don't know how to use Conda, here is an introduction.

From Docker:

docker pull quay.io/biocontainers/thermorawfileparser:1.2.1--0
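
The same image can then run the conversion directly, for example (a sketch: /path/to/raw is a placeholder, and the wrapper script name below is the bioconda entry point, so check the container if it differs in your version):

# Mount a local directory with RAW files and convert it inside the container to indexed mzML
docker run --rm -v /path/to/raw:/data quay.io/biocontainers/thermorawfileparser:1.2.1--0 \
    ThermoRawFileParser.sh -i=/data/sample.raw -o=/data -f=2
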
I hope this tool will boost the development of cloud-based proteomics solutions and more open-source, Unix-based tools and frameworks.

Citation:
Hulstaert, N., Shofstahl, J., Sachsenberg, T., Walzer, M., Barsnes, H., Martens, L., & Perez-Riverol, Y. (2019). ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. Journal of Proteome Research, 19(1), 537-542. [paper]
