Sunday 10 May 2015

A Trans-Proteomic Pipeline (TPP) Docker container

By Felipe Leprevost & Yasset Perez-Riverol

In my first post on this blog, I will show you how to use a Docker container that comes with the Trans-Proteomic Pipeline software already installed.

Docker is a great new technology that allows us to create GNU/Linux containers with specific software inside. All kinds of software can be "containerized", including ones that rely on graphical user interfaces.

The whole idea behind a Docker container is to have software that is isolated from the host OS yet can still interact with the outside world. GNU/Linux containers, like Docker, are very useful in the scientific world too, where bioinformatics applications are used every day.

Using Docker with bioinformatics software helps to solve some of the issues we face, such as reproducibility. We wrote about this last year [1]. You can also find more containers with bioinformatics applications on the BioDocker webpage.

Here I am going to describe how to install and use one of the most powerful software packages for proteomics data analysis, the Trans-Proteomic Pipeline (TPP).


Unfortunately, if you are a GNU/Linux user (like me) and your job involves MS/MS data analysis (also like me), you will probably have a hard time trying to install TPP. Almost all the tutorials available on the Web focus on Windows users, so novice bioinformaticians, or those who are not very comfortable with GNU/Linux, can struggle.

With a Docker TPP container you can just download it and use it on the command line. The container itself behaves like an executable, so imagine the possibilities.

Let's begin by preparing your environment for Docker. The first thing you have to do is install some libraries that are essential for the Docker daemon to run properly. If you are running Ubuntu, you can skip this step. If you are on a different OS, like Linux Mint for example, you need to follow these steps.


Preparing your system


sudo apt-get install cgroup-lite
sudo apt-get install lxc



Now we can download and install the latest Docker version:

Installing Docker

If you are running a Debian-based system you can run the following commands; otherwise I advise you to check the installation guide on the Docker website.

wget -qO- https://get.docker.com/ | sh
sudo reboot


Apparently the Docker version available in the official repositories is not the latest one, and you are likely to run into errors while using it, so just follow the steps above and you will be fine.
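Once the machine is back up, it is worth confirming that the docker client is actually installed before going further. This is only a small sketch: have_cmd is a helper name I made up for illustration, and the only Docker call used is docker --version.

```shell
# Return 0 if the named command is on the PATH, 1 otherwise.
# (have_cmd is a hypothetical helper, not part of Docker.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print the client version if Docker is installed; otherwise say so.
if have_cmd docker; then
    docker --version
else
    echo "docker is not installed (or not on the PATH)"
fi
```

If the version is printed, the client is in place and you can move on to downloading the image.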


Downloading the Image


Docker works by building containers from images. There is a large repository of images on the Docker website called Docker Hub. We are going to download the TPP image from it. Run the following command:

docker pull hexabio/tpp-4.8.0

Here, hexabio is the name of the repository and tpp-4.8.0 is the name of the image. You will see some progress bars indicating the status of your download. After that you can run the following command to check that you got the image:

docker images

You are going to see something like this:


hexabio/tpp-4.8.0   latest              1830f991f551        43 minutes ago      637.4 MB


This indicates everything went well.
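If you would rather script this check than read the table, something along these lines should work. It is a sketch that relies on docker images -q, which prints only the image IDs for the given name; the function name image_present is my own invention.

```shell
IMAGE="hexabio/tpp-4.8.0"

# Succeed only if `docker images -q` lists at least one ID for the image.
# (image_present is a hypothetical helper name.)
image_present() {
    [ -n "$(docker images -q "$1" 2>/dev/null)" ]
}

if image_present "$IMAGE"; then
    echo "$IMAGE is available locally"
else
    echo "$IMAGE not found; run: docker pull $IMAGE"
fi
```

Remember that, depending on your setup, docker may need to be run through sudo for this to report the image correctly.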

Testing the Container


To see if everything is OK we will just call the tandem executable with the following command:

docker run hexabio/tpp-4.8.0 tandem

NOTE: The Docker daemon needs to be used by a sudoer user, so keep in mind that every example below must be executed with the 'sudo' command. I'm omitting the sudo just to keep the text cleaner.

The run parameter tells the Docker daemon that you are going to execute something using an image; that way the container is automatically assembled and executed.

Next comes hexabio/tpp-4.8.0; that's the name of the image. The last part is the command that you want to execute. In our case it is the tandem software.

You should get the standard output from tandem complaining that there is no input XML file.


Running TPP


My goal here is not to teach how to use TPP, so I will presume that you know what you are doing.

TPP needs some input files in order to run the analysis properly, so go to a separate directory on your machine and create a new directory called tpp_root. Inside tpp_root you will place all the files needed to run the Tandem software. In my directory I have the following files (notice that we are going to run the next set of commands from outside the tpp_root directory):

default_input.xml
input.xml
sample.mzML
database_td.fa
taxonomy.xml
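Before starting the container it can save some time to confirm that all of these files are really in place. The following is a rough sketch using only the file names listed above; the variable REQUIRED_FILES and the function missing_inputs are names I chose for illustration.

```shell
# Files listed in this post as inputs for the Tandem run.
REQUIRED_FILES="default_input.xml input.xml sample.mzML database_td.fa taxonomy.xml"

# Print the name of each required file that is missing from the given
# directory. Returns 0 when nothing is missing, 1 otherwise.
# (missing_inputs is a hypothetical helper name.)
missing_inputs() {
    dir="$1"
    status=0
    for f in $REQUIRED_FILES; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

From the directory that contains tpp_root, calling missing_inputs tpp_root prints nothing when everything is there, and otherwise prints one missing file name per line.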

Being an isolated environment, my TPP container cannot see my desktop directories. What we are going to do now is tell the container to map the tpp_root/ directory inside itself, in a directory called /data/ (yes, you should use the trailing slash). In the end it behaves very much like a shared folder: the container can read those files and can also write inside the directory:

docker run -v "$(pwd)/tpp_root/":/data/ hexabio/tpp-4.8.0 tandem /data/input.xml

The -v flag is the volume parameter; that's how tandem will find those files. It says that the tpp_root directory on my OS must be mapped inside the container at /data/. Docker expects the host side of the mapping to be an absolute path, which is why the command prefixes tpp_root/ with $(pwd). The /data/ directory is created automatically inside the container when you execute the command. The last part of the command says that I want to execute tandem and that the input.xml file is located in the /data/ directory.
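Since every step of the pipeline needs the same volume mapping, a small wrapper can cut down on typing. This is a sketch under a few assumptions: tpp_run, DOCKER and TPP_IMAGE are names I made up, the host directory is converted to an absolute path before being handed to -v, and the script is run with sh or bash.

```shell
# DOCKER can be overridden before calling tpp_run, e.g. DOCKER="sudo docker".
DOCKER="${DOCKER:-docker}"
TPP_IMAGE="hexabio/tpp-4.8.0"

# tpp_run HOST_DIR COMMAND [ARGS...]
# Mounts HOST_DIR (converted to an absolute path) at /data/ and runs
# COMMAND inside the container. (tpp_run is a hypothetical helper name.)
tpp_run() {
    host_dir="$(cd "$1" && pwd)" || return 1
    shift
    # $DOCKER is left unquoted on purpose so "sudo docker" splits in two words.
    $DOCKER run -v "$host_dir/":/data/ "$TPP_IMAGE" "$@"
}
```

With this in place, the Tandem run becomes tpp_run tpp_root tandem /data/input.xml, executed from the directory that contains tpp_root.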

It is imperative that the paths inside your XML files use the /data/ path, not the tpp_root/ path; otherwise the TPP software will not find any of your files.
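If your XML files already contain host paths, you can preview what they would look like with container paths before editing anything. The sketch below is one possible way to do it with sed; to_container_paths is a name I invented, and the pattern simply assumes that every path you care about contains tpp_root/.

```shell
# Print the given XML file with every path that runs through "tpp_root/"
# rewritten to start at "/data/". The original file is left untouched.
# (to_container_paths is a hypothetical helper name.)
to_container_paths() {
    sed 's#[^"<> ]*tpp_root/#/data/#g' "$1"
}
```

For example, to_container_paths tpp_root/input.xml prints the rewritten file to the screen, so you can inspect the result and only then apply the same changes to the real file.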

If everything goes well you should see the initial output from the analysis on your screen. Now you can continue the pipeline following the same basic ideas I showed.

If you are interested in how Docker works and want to use other bioinformatics applications, I suggest you take a look at the BioDocker website. I'm continually containerizing new bioinformatics software and you are welcome to join in and help. If you need some software inside a container and don't know how to do it, I'm also glad to help; just send an e-mail.


[1] Leprevost Fda V, Barbosa VC, Francisco EL, Perez-Riverol Y and Carvalho PC. On best practices in the development of bioinformatics software. Front Genet. 2014 Jul 2;5:199. doi: 10.3389/fgene.2014.00199. eCollection 2014.


