Saturday 28 November 2015

Protein identification with Comet, PeptideProphet and ProteinProphet using BioDocker containers


Proteomics data analysis is dominated by database search engine strategies. Perhaps the most common protocol today is to retrieve raw data from a mass spectrometer, convert it from a binary format to a text-based format, and then process it with a database search algorithm. The resulting data need to be statistically filtered in order to converge on a final list of identified peptides and proteins.

Among search engines, Comet (the youngest son of SEQUEST) is one of the most popular nowadays. Today we are going to show how to run a simple analysis protocol using the Comet database search engine, followed by statistical analysis with PeptideProphet and ProteinProphet, two of the best-known and most robust processing algorithms for proteomics data.

This pipeline is available in the Trans-Proteomic Pipeline (TPP); however, several users prefer to use the individual components rather than the whole TPP. The big difference here is how we are going to do it. Instead of going through a step-by-step guide on how to install and configure Comet and TPP, we are going to run the pipeline using Docker containers from the BioDocker project (you can get more information on the project here).
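To give an idea of what this looks like in practice, here is a minimal sketch of the whole pipeline as Docker commands. The image names and the Comet/xinteract parameters below are assumptions for illustration; check the BioDocker repositories for the actual image tags and the tools' documentation for the right options:

# pull the (hypothetical) BioDocker images for Comet and the TPP tools
docker pull biodckr/comet
docker pull biodckr/tpp

# run the Comet search; /data is mounted from the host and contains
# the spectra (sample.mzXML), the FASTA database and comet.params
docker run -v /home/user/data:/data -w /data biodckr/comet comet -Pcomet.params -Ddb.fasta sample.mzXML

# statistically validate the results with PeptideProphet and
# ProteinProphet through the TPP's xinteract wrapper
docker run -v /home/user/data:/data -w /data biodckr/tpp xinteract -Op -dDECOY_ -Ninteract.pep.xml sample.pep.xml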

Wednesday 21 October 2015

Installing Mesos on your Mac

1- Homebrew is an open-source package management system for the Mac that simplifies installation of packages from source. Install it with:

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

2- Once you have Homebrew installed, you can install Mesos on your laptop with these two commands:

brew update
brew install mesos

You will need to wait while the most current, stable version of Mesos is downloaded, compiled, and installed on your machine. Homebrew will let you know when it’s finished by displaying a beer emoji in your terminal and a message like the following:

/usr/local/Cellar/mesos/0.19.0: 83 files, 24M, built in 17.4 minutes

Start Your Mesos Cluster

3- Running Mesos on your machine: Now that you have Mesos installed on your laptop, it's easy to start your Mesos cluster. To see Mesos in action, spin up an in-memory master with the following command:

/usr/local/sbin/mesos-master --registry=in_memory --ip=127.0.0.1

A Mesos cluster needs at least one Mesos Master to coordinate and dispatch tasks onto Mesos Slaves. When experimenting on your laptop, a single master is all you need. Once your Mesos Master has started, you can visit its management console at http://localhost:5050.



Since a Mesos Master needs slaves onto which it will dispatch jobs, you might also want to run some of those. Mesos Slaves can be started by running the following command for each slave you wish to launch:

sudo /usr/local/sbin/mesos-slave --master=127.0.0.1:5050
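With the master and at least one slave running, you can check that the cluster actually schedules work. If your Mesos build ships the mesos-execute helper (the install path may differ from the one shown here), it submits a one-off command as a task:

# run a trivial task on the cluster and watch it get scheduled
/usr/local/bin/mesos-execute --master=127.0.0.1:5050 --name=hello-mesos --command="echo hello from mesos"

The task should appear in the management console at http://localhost:5050 while it runs.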


Wednesday 30 September 2015

First Scrum Board

Here is my first Scrum board, used to guide the release of the OmicsDI project.

Team members update the task board continuously throughout each sprint; if someone thinks of a new task (“test a new machine learning algorithm”), she writes a new card and puts it on the wall. Either during or before the daily scrum, estimates are changed (up or down) and cards are moved around the board.

Each row on the Scrum board is a user story, which is the unit of work we encourage teams to use for their product backlog.

During the sprint planning meeting, the team selects the product backlog items they can complete during the next sprint. Each product backlog item is turned into multiple sprint backlog items, each represented by one task card that is placed on the Scrum board.

  • Story (User Story): The story description (“As a user we want to…”) shown on that row.
  • Ongoing:  Any card being worked on goes here. The programmer who chooses to work on it moves it over when she's ready to start the task. Often, this happens during the daily scrum when someone says, “I'm going to work on the boojum today.”
  • Testing: A lot of tasks have corresponding test task cards. So, if there's a “Code the boojum class” card, there is likely one or more task cards related to testing: “Test the boojum,” “Write FitNesse tests for the boojum,” and “Write FitNesse fixture for the boojum.”
  • Done: Cards pile up over here when they're done. They're removed at the end of the sprint. Sometimes we remove some or all during a sprint if there are a lot of cards.

Optionally, depending on the team, the culture, the project and other considerations:
  • Notes: Just a place to jot a note or two.
  • Tests Specified: We like to do “Story Test-Driven Development,” or “Acceptance Test-Driven Development,” which means the tests are written before the story is coded. Many teams find that it helps to have acceptance tests identified before coding begins on a particular story. This column just contains a checkmark to indicate the tests are specified.

Friday 11 September 2015

An API for all MS-based file formats

We recently released and published our first Java API (Application Programming Interface) for the most common file formats in proteomics: not only MS files but also identification files such as mzIdentML and mzTab.

ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api)

The library allows end-users and developers to use a common data structure for proteomics, independently of the file type. But first, let's try to understand what an API is.

What is an API?

Imagine you are a builder or a civil engineer and you are building a bridge: different components, blocks and teams need to be coordinated and fitted together for the final result. Poor communication between team members, mismatched block sizes or conflicting building plans will only produce strange results.

In the simplest terms, APIs are sets of requirements, data structures and objects that govern how applications and software components can talk to each other. An API is a set of routines and protocols that provides building blocks for computer programmers and web developers to build software applications. In the past, APIs were largely associated with computer operating systems and desktop applications. In recent years, though, we have seen the emergence of Web APIs (Web Services).


What is ms-data-core-api?

The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. 

The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl and apl (text-based), and mzXML and mzData (XML-based). It can also be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, the library includes a set of algorithms and tools whose implementation illustrates the simplicity of developing applications with it.
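To give a flavour of the programming model, here is a minimal sketch of reading protein identifications from an mzIdentML file. The class and method names (MzIdentMLControllerImpl, getProteinIds, getProteinById) follow the library's controller pattern as I remember it, so double-check them against the version on GitHub before copying anything:

import java.io.File;
import uk.ac.ebi.pride.utilities.data.controller.DataAccessController;
import uk.ac.ebi.pride.utilities.data.controller.impl.ControllerImpl.MzIdentMLControllerImpl;
import uk.ac.ebi.pride.utilities.data.core.Protein;

public class ReadMzIdentML {
    public static void main(String[] args) {
        // each supported format has its own controller, but they all
        // implement the same DataAccessController interface
        DataAccessController controller = new MzIdentMLControllerImpl(new File("example.mzid"));

        // iterate over the identifications through the common data model
        for (Comparable proteinId : controller.getProteinIds()) {
            Protein protein = controller.getProteinById(proteinId);
            System.out.println(protein.getDbSequence().getAccession());
        }

        controller.close();
    }
}

The point is that swapping example.mzid for a PRIDE XML or mzTab file only changes which controller you instantiate; the rest of the code stays the same.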

Saturday 29 August 2015

DIA-Umpire Pipeline Using BioDocker containers.


The complexity of some bioinformatics software is well known and has been discussed in papers, blog posts and elsewhere; especially software that depends on many components and tools, which makes it almost impossible for a new user to try it for the first time. @BioDocker aims to simplify the process of testing, compiling and deploying bioinformatics software. Our previous post shows how to use the TPP software from the Institute for Systems Biology.

Recently, Data Independent Acquisition (DIA) methods have been receiving a lot of attention from the proteomics community, especially SWATH. Here we are going to demonstrate the importance of Docker through the use of a complex and powerful pipeline called DIA-Umpire, showing how to download it, run it and obtain the results.
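The flavour of the run is the same as in the Comet post: pull the container once, then execute the pipeline against a mounted data directory. The image name below is a placeholder for whatever tag the BioDocker registry provides, and the DIA_Umpire_SE arguments are only illustrative:

# pull a (hypothetical) BioDocker image for DIA-Umpire
docker pull biodckr/dia-umpire

# run the signal-extraction step; the SWATH run and the parameter
# file live in the mounted /data directory
docker run -v /home/user/data:/data biodckr/dia-umpire java -Xmx8G -jar DIA_Umpire_SE.jar /data/swath_run.mzXML /data/diaumpire.se_params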

Monday 24 August 2015

Moving Bioinformatics to the Cloud

We constantly see new technologies being developed and released to the public; for researchers who work with molecular biology, the ones that get our attention normally come from new laboratory methodologies or instruments. In this post we are going to talk about something different that is catching the attention of researchers who work with molecular biology, and more specifically bioinformatics. I'm writing about a technological innovation that comes from the computational field and can have a great impact on how we do biological analysis with bioinformatics software.


A few years ago a cloud startup called dotCloud developed a new piece of software called Docker, intended only for internal use. The software was so successful that, just two years after releasing Docker to the public, the newly formed Docker company has an estimated worth of $1 bn.


What is Docker and Why Does it Matter?


Docker can be employed in several ways in different environments. What it does, basically, is provide the user with isolated, containerized software that can be executed apart from the host operating system. It is very similar to what a virtual machine does; the difference is that there is no guest operating system. These containers use some of the host's system libraries and apply abstraction layers to the execution of the software inside; in the end you have an isolated environment, with custom software inside, that can be shared.


What does this have to do with Bioinformatics?


Imagine that you are a senior researcher, or even a recently accepted student, trying to learn how to do some analysis. You are a lab specialist, but computers are not your thing. Now imagine that the software you are trying to run needs a Linux operating system with gcc version 4.9.3 and some libraries like GD. Sounds bad, right? That's where Docker comes in. Docker allows developers to ship software inside a container, that is, a custom environment with all the necessary tools and configuration to run a specific program; what you have to do is just download the container and execute the program inside. Running a Docker container is just as simple as running a program on the command line.
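To make that concrete, here is roughly what downloading and running a container looks like (the image and command are just an example, not a specific bioinformatics tool):

# download a container image published by its developers
docker pull ubuntu:14.04

# run a program inside the container; it sees the container's own
# libraries and files, not the host's
docker run ubuntu:14.04 cat /etc/lsb-release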


Benefits for Bioinformatics


For a bioinformatician this brings several other benefits. Something that is getting attention today is how to deal with reproducible research in the bioinformatics field. Different computers with different configurations, libraries and software versions can produce different results from the same analysis. If we could turn the environment from a variable into a constant, that problem would be greatly reduced.


The BioDocker Project


In 2014, a new project called BioDocker was founded. Recently, the project adopted a community-driven policy: the main idea is to get feedback from the community and to benefit from the expertise of each member. The goal is to provide containerized bioinformatics tools to the general public. For developer bioinformaticians, the project also provides specifications, settings and guidelines on how to produce your own BioDocker containers. By defining guidelines like these, we hope that the use of Docker becomes more common, helping people to deal more easily with different software and reducing the problems with reproducible research.


Wrapping up


Docker is a new technology that is gaining a lot of ground nowadays, and slowly it is gaining space in the bioinformatics field as well. It is definitely worth taking some time to learn how to work with it.

Thursday 13 August 2015

The future of Proteomics: The Consensus


After the big Nature papers about the Human Proteome [1][2], the proteomics community has been divided by the same well-known topics that divided genomics before: same reasons, same discussions [3-7]. No one argues about the technical issues or the instrument settings, nothing about the sample processing, not even about the analytical method (most of both projects are "common" bottom-up experiments). The main issues are data-analysis problems, and they remain open computational proteomics challenges.

Monday 27 July 2015

one big lesson I just learned

I come from a small country with no resources, no big industries or capital (Cuba), but with a big tradition of friendship and solidarity. In my previous institute (surprisingly, a big biotech company) we shared all our ideas openly; we discussed our results, thoughts, etc. without thinking about competition or plagiarism, or that someone from a collaborating group could take your ideas and results to sell them to others or claim them as his own.

The picture completely changed: after one year abroad, the one big thing I learned is that outside my farm and my small country, time, ideas and contacts are gold. In science you have people with whom you can work and collaborate, because they are open by nature (not only because their source code is on GitHub) but also because they share, they help, they support, and they give their ideas without concern. People who like to talk about science, who encourage young researchers without fear of others, without fear of being open.

But you have other people: people who always look for competition, who steal what is not theirs, who look for ideas to be recognised, for contacts, for papers, for citations. The good thing is that I have learned, and I can recognise them. I can give them my ideas and my time, because they need them more than I do. In the end, the friendly ones, the collaborative ones, the ones who share, open up, help and support: we are more, and not only the ones who have their code on GitHub.

Monday 8 June 2015

first tweet with more than 1k RTs and my post with more than 5k visits

Happy to see my first post with more than 5k visits:
 

Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection



and my first tweet with more than 1k RTs:


Thanks to my Readers.

Yasset

Sunday 31 May 2015

I love technical notes and short manuscripts

One of my first papers, in 2012 (here), was related to support vector machines (SVMs). It was a simple algorithm that improved the method to compute the isoelectric point of peptides using SVMs. The first time I presented the results to my colleagues, one of them asked me: "are you planning to publish this?". One of the senior co-authors said, "we can write a big research manuscript, explaining other algorithms, compare them, use other datasets, etc". Another (a computer scientist) said, "we can explore other features from peptides including topological indexes.. and write a full research manuscript about.."....
I was very clear from the very beginning: we would write a Technical Note or Letter.

Sunday 10 May 2015

A Trans-Proteomic Pipeline (TPP) Docker container

By +Felipe Leprevost & +Yasset Perez-Riverol

In my first post on this blog, I will teach you how to use a Docker container with the Trans-Proteomic Pipeline software already installed.

Docker is a great new technology that allows us to create GNU/Linux containers with specific software inside. All kinds of software can be "containerized", including ones that rely on graphical user interfaces.

The whole idea of using a Docker container is built on having software that is isolated from the host OS and yet can interact with the outside world. GNU/Linux containers like Docker are very useful even in the scientific world, where bioinformatics applications are used every day.

Using Docker with bioinformatics software helps to solve some issues we face, like reproducibility, for example. We wrote about this last year [1]. You can also check for more containers with bioinformatics applications on the BioDocker webpage.

Here I am going to describe how to install and use one of the most powerful software suites for proteomics data analysis, the Trans-Proteomic Pipeline (TPP).


Unfortunately, if you are a GNU/Linux user (like me) and your job involves MS/MS data analysis (also, like me), you will probably have a harsh time trying to install TPP. Almost all the tutorials available on the Web focus on Windows users, so novice bioinformaticians, or those who are not too versed in GNU/Linux, can have a hard time.

With a Docker TPP container you can just download it and use it on the command line; the container itself behaves like an executable, so imagine the possibilities.
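Just to sketch the idea before we prepare the environment (the image name here is a placeholder; take the real one from the BioDocker webpage), using the container looks like this:

# download the (placeholder) TPP image once
docker pull biodckr/tpp

# then call any TPP tool as if it were a local executable,
# with your data mounted under /data
docker run -v /home/user/data:/data -w /data biodckr/tpp xinteract sample.pep.xml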

Let's begin by preparing your environment for Docker. The first thing you have to do is install some libraries that are essential for the Docker daemon to run properly. If you are running an Ubuntu OS, you can skip this step. If you are on a different OS, like Linux Mint for example, you need to follow these steps.

Sunday 8 March 2015

GPMDB identifications by Original source



[Chart of GPMDB identifications by original source; source link not preserved]
Tuesday 20 January 2015

Bioinformatics for Proteomics Course, Bergen, April 21-24th 2015

The course will include lectures and practicals on open-access software for the analysis of mass spectrometry generated proteomics data. The tools covered, spanning both protein identification and quantification, include SearchGUI, PeptideShaker, MaxQuant, Perseus and Skyline.
  
Topics Covered:
  • Why is the experimental design important?
  • What is a protein database?
  • How to convert raw mass spectrometry data to the required formats?
  • What is a proteomics search engine and how do they work?
  • What is protein inference and why is it important?
  • How to interpret and validate proteomics results?
  • What is functional analysis of proteomics data?
  • How to share and reprocess proteomics data?
  • How to quantify proteins?
  
Special Guest Lecture:
"Introduction to mass spectrometry based proteomics" by Prof. Dr. Lennart Martens from Ghent University and VIB, Ghent, Belgium.

For more details and registration please see the course details.

Friday 2 January 2015

Brazil: A place for Science and Friendship


It's really difficult to break stereotypes, especially for developing countries like Brazil. If you mention Brazil's name around the world, it is immediately associated with sports, music, beaches, rum and the "País do Carnaval". If you ask someone in the streets of Germany or China about personalities from Brazil, they will mention Pelé. Breaking stereotypes is a task of years or centuries, but we are going in the right direction.

Last December I attended the 2nd Proteomics Meeting of the Brazilian Proteomics Society, held jointly with the 2nd Pan American HUPO Meeting at the Hotel Ferradura/Ferradura Resort, Búzios, Rio de Janeiro State, Brazil. The venue was gorgeous: mountains close to a small bay that offers calm, clear waters and the open sea. We arrived after two hours by car from Rio's international airport. My plan was to give a talk about PRIDE and ProteomeXchange, but more than that, my talk was about whether we really need to share our proteomics data.

Thursday 1 January 2015

Trends in Mass Spec Instruments

I'm not doing marketing for any of the mass spec producers. Here is a recent statistic I compiled about the use of different mass spec instruments, based on the public data in PRIDE Archive. It can help researchers to evaluate which are the most popular and well-established instruments.


Happy New Year!!!