Monday 30 September 2024

The Quest for the Perfect Spectrum – A Benchmark in Consensus Generation

In proteomics, where mass spectrometry generates oceans of data, efficient processing is critical for extracting meaningful insights. As with any large dataset, some information is redundant, and mass spectra are no exception. Grouping similar spectra into clusters and generating a consensus spectrum, a representative spectrum for each group, has emerged as a key step in modern proteomics workflows. Several methods exist for this task, and in 2022 we systematically compared them in the manuscript "A Comprehensive Evaluation of Consensus Spectrum Generation Methods in Proteomics".
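To make the idea concrete, here is a minimal Python sketch of one naive consensus strategy: binning the m/z axis and merging the peaks shared by the spectra in a cluster. The bin width and the averaging rules are illustrative choices only, not the methods benchmarked in the manuscript.

```python
import numpy as np

def consensus_spectrum(spectra, bin_width=0.02):
    """Build a naive consensus spectrum from a cluster of spectra.

    spectra: list of (mz, intensity) array pairs, one pair per spectrum.
    Peaks falling into the same m/z bin are merged; the consensus
    intensity is the mean intensity and the consensus m/z is the
    intensity-weighted mean m/z of the bin.
    """
    bins = {}
    for mz, intensity in spectra:
        for m, i in zip(mz, intensity):
            key = int(m / bin_width)
            bins.setdefault(key, []).append((m, i))

    cons_mz, cons_int = [], []
    for key in sorted(bins):
        peaks = np.array(bins[key])          # shape (n_peaks, 2)
        weights = peaks[:, 1]
        cons_mz.append(np.average(peaks[:, 0], weights=weights))
        cons_int.append(weights.mean())
    return np.array(cons_mz), np.array(cons_int)

# Example: three toy MS/MS spectra assigned to the same cluster
cluster = [
    (np.array([100.01, 250.05, 300.10]), np.array([10.0, 50.0, 5.0])),
    (np.array([100.02, 250.06]),         np.array([12.0, 48.0])),
    (np.array([100.00, 250.04, 410.20]), np.array([9.0, 55.0, 3.0])),
]
mz, intensity = consensus_spectrum(cluster)
print(mz, intensity)
```

Real implementations differ mainly in how they weight peaks, filter noise, and decide which fragments are shared, which is exactly what the benchmark set out to compare.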


Monday 24 February 2020

The need for a Sample to Data standard

Experimental design is a cornerstone of modern science, especially for data scientists. But how can an experimental design be captured for better reuse, reproducibility, and understanding of the original results? How can we write the complexity of our experimental design into a file?

This post is about that.



ThermoRAWFileParser: A small step towards cloud proteomics solutions

Proteomics data analysis is in the middle of a big transition. We are moving from small experiments (e.g. a couple of RAW files and samples) to large-scale experiments. While the average number of RAW files per dataset in PRIDE hasn't grown in the last 6 years (Figure 1), we can see multiple experiments with more than 1000 RAW files (Figure 1, right).

Figure 1: Box plot of the number of files per dataset in PRIDE (left: outliers removed; right: outliers included)

On the other hand, file sizes show a trend towards larger RAW files (Figure 2).

Figure 2: Box plot of file size per dataset in PRIDE (outliers removed)

So, how can proteomics data analysis be moved towards large-scale, elastic compute architectures such as cloud infrastructures or high-performance computing (HPC) clusters?
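One pragmatic ingredient, and the reason ThermoRAWFileParser matters here, is converting vendor RAW files into open formats in a way that parallelises trivially across cloud or HPC nodes. Below is a minimal Python sketch of that idea; the converter command name (ThermoRawFileParser.sh, as shipped in its Bioconda/BioContainers packaging) and the -i/-o flags are assumptions recalled from the tool's usage notes, so check them against your installed version.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

RAW_DIR = Path("raw")    # directory holding the vendor .raw files
OUT_DIR = Path("mzml")   # destination for the converted spectra
OUT_DIR.mkdir(exist_ok=True)

def convert(raw_file: Path) -> int:
    """Convert one RAW file with ThermoRawFileParser.

    The command name and the -i/-o flags are assumed from the tool's
    usage notes; adjust them to match your installed version.
    """
    cmd = ["ThermoRawFileParser.sh", f"-i={raw_file}", f"-o={OUT_DIR}"]
    return subprocess.run(cmd, check=True).returncode

if __name__ == "__main__":
    raw_files = sorted(RAW_DIR.glob("*.raw"))
    # Each file is independent, so the conversion is embarrassingly
    # parallel and maps directly onto cloud batch jobs or HPC job arrays.
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(convert, raw_files))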

Friday 15 November 2019

Chan Zuckerberg Initiative hit a home run

Maintaining bioinformatics infrastructure is really challenging with the current grant schemes. Most of the time, what lies ahead for your software are plans such as bug fixing, documentation, user support, and refactoring to make the software faster or compatible with different platforms or architectures... This is almost impossible to sell by itself in a grant proposal. It needs to be blurred or hidden in the proposal behind new improvements, data support or data analysis, probably with new ideas that the developer and the PI have never even tried before. But still, you need to get the grant.


In summary, the bioinformatics community is struggling to get money to maintain and support bioinformatics software and databases. Different initiatives and grant calls have been created in recent years to support and sustain the bioinformatics open-source community, but they have encouraged the creation of "new/novel" things rather than supporting the development of well-established tools. And here is where Mark and the Chan Zuckerberg Initiative succeeded today.

The Chan Zuckerberg Initiative (CZI) announced the final list of bioinformatics open-source projects awarded a CZI grant. After reading the list, it was clear to me that this group of reviewers really knows what they are doing, and that CZI is really pushing to maintain bioinformatics infrastructure and software.


ELIXIR supports the development of BioContainers



Originally posted on the ELIXIR website.
Building on the success of the 2018 ELIXIR Implementation Study on BioContainers, an open-source registry of over 81,000 containerised tools for life science research, ELIXIR has launched three follow-up projects to progress and maintain the work in this valuable area.

What are containers and why are they important?

Data produced by life science research is always on the increase. Likewise, there has been a huge increase in the number of tools and platforms required to support data-intensive analysis. These tools, code, back-end software and compute environments come in many flavours, meaning it can be challenging to select the appropriate tools and run them smoothly in a local setting.
Software containers package up code and its related dependencies so that software applications run quickly and reliably from one computing environment to another. As an analogy, it's like buying flat-pack furniture from IKEA: the instructions, components and tools all come in a single package, enabling you to carry out the task (build the furniture) without having to source the required tools yourself (screws, Allen keys, etc.).
This is an incredibly useful method of sharing analysis pipelines, software tools and documentation with collaborators and the wider scientific community thereby enabling reproducibility. Researchers can run the analysis without having to install dependencies and worry about the specific version of software that the person who originally created the pipeline used.
Diagram: what a container is made of (tools and bins/libraries) and how it runs on different operating systems (e.g. macOS or Linux)
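To ground the analogy in something runnable, here is a minimal Python sketch that launches a containerised tool through the Docker CLI. The samtools image name and tag are placeholders picked for illustration; the point is that whoever runs the same image gets the same tool version and dependencies, regardless of the host system.

```python
import subprocess

# Illustrative placeholder: any published BioContainers image would do.
IMAGE = "quay.io/biocontainers/samtools:1.15--h1170115_0"

def run_in_container(image: str, *tool_args: str) -> str:
    """Run a tool inside its container and return its stdout.

    Because the tool and every dependency ship inside the image,
    nothing has to be installed on the host beyond Docker itself.
    """
    cmd = ["docker", "run", "--rm", image, *tool_args]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Ask the containerised samtools for its version string; the answer
    # is identical on a laptop, an HPC node, or a cloud VM.
    print(run_in_container(IMAGE, "samtools", "--version"))
```

The same idea carries over to Singularity on HPC systems where Docker is not available.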

The BioContainers Registry

In 2018, ELIXIR launched a six-month Implementation Study that aimed to build an infrastructure to help scientists within ELIXIR publish their software containers in a standardised manner. The result was BioContainers, an open-source registry of containers for life science research, which provided over 8,100 containerised tools. BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. Conda) and containers (e.g. Docker, Singularity).
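For those who prefer to script against the registry, BioContainers also advertises a GA4GH Tool Registry Service (TRS) API. The sketch below queries it for tools by name; the base URL and response fields follow the TRS convention as I recall it from the BioContainers documentation, so treat them as assumptions and check the current docs before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed base URL for the BioContainers GA4GH TRS endpoint.
BASE = "https://api.biocontainers.pro/ga4gh/trs/v2"

def search_tools(name: str, limit: int = 5) -> list:
    """Search the registry for tools whose name matches `name`."""
    query = urlencode({"name": name, "limit": limit})
    with urllib.request.urlopen(f"{BASE}/tools?{query}") as response:
        return json.loads(response.read())

if __name__ == "__main__":
    for tool in search_tools("samtools"):
        print(tool.get("name"), "-", (tool.get("description") or "")[:60])
```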

Tuesday 12 November 2019

If you haven't submitted to bioRxiv, well, you should

We all know that the traditional publication process delays the dissemination of new research, often by months, sometimes by years.


Preprint servers decouple dissemination of research papers from their evaluation and certification by journals, allowing researchers to share work immediately, receive feedback from a much larger audience, and provide evidence of productivity long before formal publication. The arXiv preprint server, launched in 1991 and currently hosted by Cornell University, has demonstrated the effectiveness of this approach.