BigBio Notes: 2020

Monday, 24 February 2020

The need for a Sample to Data standard

Experimental design is a cornerstone of modern science, especially for data scientists. But how can an experimental design be captured for better reuse, reproducibility, and understanding of the original results? How can we write the complexity of our experimental design into a file?

This post addresses those questions.

ThermoRAWFileParser: A small step towards cloud proteomics solutions

Proteomics data analysis is in the middle of a big transition. We are moving from small experiments (e.g. a couple of RAW files and samples) to large-scale experiments. While the average number of RAW files per dataset in PRIDE hasn't grown in the last six years (Figure 1), there are multiple experiments with more than 1000 RAW files (Figure 1, right).

Figure 1: Box plot of the number of files per dataset in PRIDE (left: outliers removed; right: outliers included)

On the other hand, file size shows a trend towards larger RAW files (Figure 2).

Figure 2: Box plot of file size by dataset in PRIDE (outliers removed)

How, then, can proteomics data analysis be moved towards large-scale, elastic compute architectures such as cloud infrastructures or high-performance computing (HPC) clusters?