Friday 15 November 2019

Chan Zuckerberg Initiative hit a home run

Maintaining bioinformatics infrastructure is really challenging under current grant schemes. Most of the time, the plans ahead for your software look like this: bug fixing, documentation, user support, refactoring to make the software faster or compatible with different platforms and architectures... This is almost impossible to sell on its own in a grant proposal. It has to be blurred or hidden among new improvements, data support or data analysis, probably with new ideas that the developer and the PI haven't even tried before. But still, you need to get the grant.


In summary, the bioinformatics community is struggling to get money to maintain and support bioinformatics software and databases. Different initiatives and grant calls have been created in recent years to support and sustain the bioinformatics open-source community, but they have encouraged the creation of "new/novel" things rather than supporting the development of well-established tools. And this is where Mark and the Chan Zuckerberg Initiative succeeded today.

The Chan Zuckerberg Initiative (CZI) announced the final list of bioinformatics open-source projects awarded CZI grants. After reading the list, it was clear to me that this group of reviewers really knows what they are doing, and that CZI is really pushing forward to maintain bioinformatics infrastructure and software.


ELIXIR supports the development of BioContainers



Originally posted in ELIXIR 
Building on the success of the 2018 ELIXIR Implementation Study on BioContainers, an open-source registry of over 81,000 containerised tools for life science research, ELIXIR has launched three follow-up projects to progress and maintain the work in this valuable area.

What are containers and why are they important?

Data produced by life science research is always on the increase. Likewise, there has been a huge increase in the number of tools and platforms required to support data-intensive analysis. These tools, code, back-end software and compute environments come in many flavours, meaning it can be challenging to select the appropriate tools and run them smoothly in a local setting.
Software containers package up code and related dependencies so that software applications can run quickly and reliably across computing environments. As an analogy, it's like buying flat-pack furniture from IKEA: you have the instructions, components and tools all provided in a single package, enabling you to carry out the task (build the furniture) without having to source the required tools (screws, Allen keys, instructions etc.).
This is an incredibly useful way of sharing analysis pipelines, software tools and documentation with collaborators and the wider scientific community, thereby enabling reproducibility. Researchers can run the analysis without having to install dependencies or worry about the specific software versions that the person who originally created the pipeline used.
[Diagram: a container packages tools and bins/libraries and is compatible with different operating systems (e.g. Mac OS or Linux)]
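To make the flat-pack analogy concrete, here is a minimal sketch of pulling and running a containerised tool from the BioContainers registry with Docker. The image name and tag are illustrative, not verified against the registry; look up the actual tag on the BioContainers site before running this.

```shell
# Pull a containerised tool from the BioContainers registry on quay.io
# (image tag is illustrative -- check the registry for real tags)
docker pull quay.io/biocontainers/samtools:1.9--h8571acd_11

# Run the tool, mounting the current directory so it can see local data;
# no local installation of samtools or its dependencies is needed
docker run --rm -v "$(pwd)":/data -w /data \
    quay.io/biocontainers/samtools:1.9--h8571acd_11 \
    samtools --version
```

The same image runs identically on a laptop, a cluster node or a cloud VM, which is exactly the reproducibility benefit described above.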

The BioContainers Registry

In 2018 ELIXIR launched a six-month Implementation Study that aimed to build an infrastructure to help scientists within ELIXIR publish their software containers in a standardised manner. The result was the development of an open-source registry of containers for life science research, called BioContainers, which provided over 8,100 containerised tools. BioContainers is a community-driven project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. Conda) and containers (e.g. Docker, Singularity).

Tuesday 12 November 2019

If you haven't submitted to bioRxiv, well, you should

We all know that the traditional publication process delays the dissemination of new research, often by months, sometimes by years.


Preprint servers decouple dissemination of research papers from their evaluation and certification by journals, allowing researchers to share work immediately, receive feedback from a much larger audience, and provide evidence of productivity long before formal publication. The arXiv preprint server, launched in 1991 and currently hosted by Cornell University, has demonstrated the effectiveness of this approach.

Monday 11 November 2019

Proteomic identification through database search, by David L. Tabb

The best presentation for understanding peptide/protein identification algorithms in proteomics.


Sunday 10 November 2019

10-minute guide to Bioconda

Bioinformatics is complicated, what with its arcane command-line interfaces, complex workflows, and massive datasets. For new bioinformaticians, just installing the software can be a problem.

But the good news is that the bioinformatics community already has a solution to this problem: Bioconda + BioContainers.
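As a minimal sketch, getting started with Bioconda follows the channel setup documented by the project; the tool installed here (samtools) is just an example, and any Bioconda package name could take its place.

```shell
# One-time channel configuration -- order matters, since each --add puts
# the channel at the top of the priority list (conda-forge ends up highest)
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# Install an example tool into its own environment and check it works
conda create -y -n samtools-env samtools
conda activate samtools-env
samtools --version
```

Keeping each tool in its own environment avoids dependency clashes, and every Bioconda package is also automatically built as a BioContainers image, so the same recipe serves both installation routes.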

Saturday 9 November 2019

Where to deposit my proteomics data: ProteomeXchange

The ProteomeXchange (PX) consortium (http://www.proteomexchange.org) aggregates the major proteomics resources and, since its inception in 2012, has aimed to standardize the submission and dissemination of public mass spectrometry (MS) proteomics data worldwide.



Some stats are always welcome. In terms of the distribution of datasets across individual resources, 12,335 datasets (87.1%) had been submitted to PRIDE, followed by MassIVE (1,126 datasets, 7.9%), jPOST (352 datasets, 2.5%), iProX (174 datasets, 1.2%), PASSEL (139 datasets, 1.0%) and Panorama Public (43 datasets, 0.3%).

Wednesday 6 November 2019

List of major preprint servers - where to go

The most well-known preprint server is probably arXiv (pronounced like 'archive'). It started as a server for preprints in physics and has since expanded to various subjects, including mathematics, computer science, and economics. The arXiv server is now run by the Cornell University Library and contains 1.37 million preprints so far.


The Open Science Framework provides an open-source framework to help researchers and institutions set up their own preprint servers; one such example is SocArXiv for the social sciences. On their website, you can browse more than 2 million preprints, including preprints on arXiv, and many of them have their own preprint digital object identifier (DOI). Where a preprint has since been published, it also links to the publication's DOI.

Cold Spring Harbor Laboratory set up bioRxiv, a preprint server for biology, in 2013 to complement arXiv. The bioRxiv server has a direct transfer service to several journals such as Science and PNAS, and a bit over 60% of papers in bioRxiv end up published in peer-reviewed journals.

In more recent years a lot of new servers have popped up covering almost every field including the social sciences, arts, and humanities fields. Here’s a quick overview of some of the rest:

arXiv - Physics, mathematics, computer science, and economics
EngrXiv - Engineering
ChemRxiv - Chemical sciences
PsyArXiv - Psychological sciences
SportaRxiv - Sport and exercise science
PaleoarXiv - Paleontology
LawArXiv - Law
AgriXiv - Agricultural sciences
NutriXiv - Nutritional sciences
MarXiv - Ocean and marine-climate sciences
EarthArXiv - Earth sciences
Preprints.org - Arts & Humanities, Behavioral Sciences, Biology, Chemistry, Earth Sciences, Engineering, Life Sciences, Materials Science, Mathematics & Computer Science, Medicine & Pharmacology, Physical Sciences, Social Sciences