Tuesday 12 November 2013

My List of Most Active Twitter Users in Proteomics

Recently, I published a list of my top influential authors in Computational Proteomics. That list was created using my PhD references and other resources such as LinkedIn, Twitter and Google Scholar. I will try to do the same here using the most active Twitter accounts that I follow. Twitter can be incredibly powerful for both consuming and contributing to the dialogue in your field, and it is an excellent real-time source of new publications, fresh developments and current opinion. If you like and use Twitter, these are some of the Twitter accounts I follow (in no particular order) in Proteomics:

Thursday 7 November 2013

News: JBioWH WebServices

We have decided to develop a JBioWH web service to provide the JBioWH data over the internet. The source code is still under development, but you can test the server at:
 
Only the DataSet module is available for now; you can retrieve the dataset info
using the server URL. The web service can return data in both XML and JSON.
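
As a quick illustration, here is a minimal Java sketch of how a client could request the dataset information in JSON. The endpoint used below is just a placeholder (not the real server URL), and the Accept header is simply one common way for REST services to negotiate the response format.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DataSetClient {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for illustration only -- replace with the real JBioWH server address
        URL url = new URL("http://example.org/jbiowh-ws/dataset");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Ask for JSON; "application/xml" would request the XML representation instead
        conn.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        conn.disconnect();
    }
}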

We are open to developing any web services requested by users, so let me know
about your specific needs.

Regards 


Friday 1 November 2013

Integrating the Biological Universe

Integrating biological data is perhaps one of the most daunting tasks any bioinformatician has to face. From a cursory look, it is easy to see two major obstacles standing in the way: (i) the sheer amount of existing data, and (ii) the staggering variety of resources and data types used by the different groups working in the field (reviewed in [1]). In fact, the topic of data integration has a long-standing history in computational biology and bioinformatics. A comprehensive picture of this problem can be found in recent papers [2], but this short comment will serve both to illustrate some of the hurdles of data integration and as a not-so-shameless plug for our contribution towards a solution.

"Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue." [1]

Some basic concepts

Traditionally, biological database integration efforts come in three main flavors:
  • Federated: Sometimes termed portal, navigational or link integration, it is based on the use of hyperlinks to join data from disparate sources; early examples include SRS and Entrez. Using the federated approach, it is relatively easy to provide current, up-to-date information, but maintaining the hyperlinks requires considerable effort.
  • Mediated or View Integration:  Provides a unified query interface and collects the results from various data sources (BioMart).
  • Warehouse: In this approach, the different data sources are stored in one place; examples include BioWarehouse and JBioWH. While it provides faster querying over joined datasets (see the sketch after this list), it also requires extra care to keep the underlying databases fully up to date.
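
To make the warehouse approach concrete, the sketch below runs a single SQL join across datasets that have already been loaded into one relational database. The table and column names are purely illustrative; they do not reflect the actual JBioWH or BioWarehouse schemas, and a JDBC driver for your database is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WarehouseJoinSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative connection settings; a real setup would point to the warehouse instance
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/warehouse", "user", "password");
        Statement stmt = conn.createStatement();
        // One local join replaces several remote look-ups against separate resources
        ResultSet rs = stmt.executeQuery(
                "SELECT g.symbol, p.accession, pw.name " +
                "FROM gene g " +
                "JOIN protein p ON p.gene_id = g.id " +
                "JOIN pathway pw ON pw.id = p.pathway_id " +
                "WHERE g.symbol = 'TP53'");
        while (rs.next()) {
            System.out.println(rs.getString("symbol") + "\t"
                    + rs.getString("accession") + "\t"
                    + rs.getString("name"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}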

Monday 28 October 2013

One step ahead in Bioinformatics using Package Repositories

About a year ago I published a post about in-house tools in research and how using this type of software may end up undermining the quality of a manuscript and the reproducibility of its results. While I can certainly relate to someone reluctant to release nasty code (i.e. not commented, not well-tested, not documented), I still think we must provide (as supporting information) all "in-house" tools that have been used to reach a result we intend to publish. This applies especially to manuscripts dealing with software packages, tools, etc. I am willing to cut some slack to journals such as Analytical Chemistry or Molecular & Cellular Proteomics, whose editorial staffs are, and rightly so, more concerned about quality issues involving raw data and experimental reproducibility; but in the case of Bioinformatics, BMC Bioinformatics, several members of the Nature family and others at the forefront of bioinformatics, methinks we should hold them to a higher standard. Some of these journals would greatly benefit from implementing a review system from the point of view of software production, moving bioinformatics, and science in general, one step forward in terms of reproducibility and software reusability. What do you think would happen if the following were checked during peer review?

Thursday 24 October 2013

Creating an Open Source Revolution in Computational Proteomics

First of all, I don't want to discuss Open Source itself in this post, or its strengths & weaknesses. This post is about the most useful open-source packages, frameworks and libraries in the field of computational proteomics (a short version of our manuscript "Open source libraries and frameworks for Mass Spectrometry based Proteomics: A developer's perspective").


Schema of the possible computational processing steps of a proteomics data set.

In proteomics, as in other omics fields, the bioinformatics efforts can be divided into three major areas: data processing, storage and visualization. From MS/MS preprocessing to the post-processing of identification results, and even though the objectives of these libraries and packages can vary significantly, they usually share a number of features. Common use cases include the handling of protein and peptide sequences, the parsing of the output files produced by the various proteomics search engines, and the visualization of MS-related information (including mass spectra and chromatograms).
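
As a small taste of the parsing use case, here is a sketch that reads a hypothetical tab-separated identification report (spectrum, peptide, protein, score) and keeps only the matches above an arbitrary score threshold. Real search engines each have their own output formats, which is precisely why dedicated parsing libraries are so valuable.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PsmReportFilter {
    public static void main(String[] args) throws IOException {
        // Hypothetical tab-separated report with columns: spectrum, peptide, protein, score
        BufferedReader in = new BufferedReader(new FileReader("psm_report.tsv"));
        String line = in.readLine(); // skip the header row
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            double score = Double.parseDouble(fields[3]);
            if (score >= 30.0) { // arbitrary threshold, for illustration only
                System.out.println(fields[1] + "\t" + fields[2] + "\t" + score);
            }
        }
        in.close();
    }
}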

Tuesday 22 October 2013

Some Reasons to Rename my Blog as BioCode's Notes


Hi Dear Readers:

I’ve decided that it would be prudent, exposure-wise,  to change the name of my professional blog to BioCode's Notes, for a number of reasons:

1. People into bioinformatics comprise a significant part of my (alas, still small) readership. They tend to be always hungry for code tips, language comparisons, and other things that do not fit neatly under the umbrella of "computational proteomics".

2. My own work is straying more and more from computational proteomics per se into other problems linking biology (Proteomics, Genomics, Life Sciences) with programming (R, Java, Perl, C++). Biocoding is now my bread-and-butter…

3. I need a shorter, catchier name that is easy to use in coffee talks, presentations, or when sharing links with friends.

4. I also decided to add a blog mascot, our T-rex:
              Truth      => Science is about Truth.
              Tea        => UK Science.
              STaTisTics => OK, this one's got as many 'S' as 'T', but the latter is more frequent in English.
              T-rex      => The future belongs to Big Data, which we'll use (and are already using) to trace back the march of evolution to our preferred species, including the dinosaurs. And last, but not least, this is Abel's (my son) favorite animal.

Hope you enjoy this idea.
Yasset

Saturday 19 October 2013

Which are the best programming languages for a bioinformatician?

This is a basic question when you (as a programmer, biologist or mass spectrometrist) start a career in bioinformatics: what is your favorite programming language for bioinformatics? This poll will give you a short picture of which languages are mandatory in computational proteomics & bioinformatics. Which languages would you recommend to a student wishing to enter the world of bioinformatics? We can use this post to comment on the strengths and weaknesses of each language.

Some polls and discussions about this topic can be found at:

Thursday 17 October 2013

Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection

Bioinformatics is becoming more and more a Data Mining field. Every passing day, Genomics and Proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying to reduce the dimensionality of a specific dataset is, therefore, critical if one wishes to analyze and understand their impact on a model, or to identify what attributes produce a specific biological effect.

For instance, consider a predictive model C1A1 + C2A2 + C3A3 + … + CnAn = S, where the Ci are constants, the Ai are features or attributes, and S is the predicted output (retention time, toxicity, score, etc.). It is essential to identify which of those features (A1, A2, A3 … An) are the most relevant to the model and to understand how they correlate with S, since working with such a subset enables the researcher to discard a lot of irrelevant and redundant information.
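
The full post walks through this in R, but the core idea of a simple correlation filter fits in a few lines of code in any language: compute the Pearson correlation between each feature Ai and the output S, then rank (or drop) features accordingly. Here is a minimal Java sketch; the toy data is made up purely for illustration.

public class CorrelationFilter {

    // Pearson correlation between a feature column and the output
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Toy data: rows are samples, columns are the features A1..A3
        double[][] features = {
            {1.0, 0.2, 5.0},
            {2.0, 0.1, 4.0},
            {3.0, 0.4, 6.0},
            {4.0, 0.3, 5.5}
        };
        double[] output = {1.1, 2.0, 3.2, 3.9}; // the predicted output S

        for (int j = 0; j < features[0].length; j++) {
            double[] column = new double[features.length];
            for (int i = 0; i < features.length; i++) {
                column[i] = features[i][j];
            }
            // Features with a low absolute correlation to S are candidates for removal
            System.out.printf("A%d: |r| = %.3f%n", j + 1, Math.abs(pearson(column, output)));
        }
    }
}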


Monday 14 October 2013

What is your tool for peptide/protein identification?

In 2012 we published a poll about the most used software for peptide/protein identification in proteomics in the Computational Proteomics LinkedIn group. I have decided to reproduce the poll here because LinkedIn removed this option, and here I also have the opportunity to add more software packages to the list.

Wednesday 9 October 2013

My List of Most Influential Authors in Computational Proteomics (according to Article References, Google Scholar, Twitter, LinkedIn, Microsoft Academic Search and ResearchGate)

Young researchers starting their careers will often look for reviews, opinions and research manuscripts from the most influential authors of their chosen field. In science, however, unlike many other topics on the Internet, ranked lists or manuscript repositories of top authors sorted by research topic are hard to come by. For some researchers, the idea of such a task brings the words 'wasted time' to mind; the most critical condemn it as a frivolous pursuit. Maybe so. In my opinion, however, it is an excellent starting point.

Home page of ResearchGate, with more than 3 million users

These days, more people than ever are involved in science and research. Just look at ResearchGate's homepage: there are over 3 million people there, and we're only counting ResearchGate users. Once-simple undertakings, such as finding the right manuscript to cite, the most authoritative group on a topic, or the best software application for a specific task, have become increasingly difficult for graduate students navigating this ocean of data, despite the availability of services such as Google Scholar or PubMed. The situation will only worsen in the future, as is easy to see by simply tallying the number of published papers in the fields of Proteomics, Genomics, Bioinformatics and Computational Proteomics since 1997:

Number of published manuscripts in PubMed per year (1997-2012). The statistics were generated using the Medline Trend service: http://dan.corlan.net/medline-trend.html

In 2012 alone, over 6,000 and 17,000 manuscripts were published in the fields of proteomics and bioinformatics, respectively. Our young field, computational proteomics, published more than four hundred papers. Perhaps well-established PIs or group leaders can easily tell derivative or me-too contributions apart from groundbreaking work, but young scientists, who spend most of their time implementing someone else's ideas, can certainly have a hard time doing so. Although technology has come to the rescue with today's mixture of search engines and social networking tools (ResearchGate, Google Scholar, Twitter and LinkedIn among them), the best way to harness its power is, precisely, to start from a ranked list of the most authoritative voices within a field of research, whose whereabouts can then be traced in the scientific literature, the blogosphere, and anywhere else.


Tuesday 1 October 2013

Celebrating Ten Years of Mann and Aebersold’s “Mass spectrometry-based proteomics” review.

In 2003 Mann & Aebersold reviewed on the pages of Nature the challenges and perspectives of the then-nascent field of MS-based proteomics. Mass spectrometry (MS) has since entrenched itself as the method of choice for analyzing complex protein samples, and MS-based proteomics has become an indispensable technology for interpreting genomic data and performing protein analyses (primary sequence, post-translational modifications (PTMs) or protein–protein interactions).

" The ability of mass spectrometry to identify and, increasingly, to precisely quantify thousands of proteins from complex samples can be expected to impact broadly on biology and medicine."
The manuscript by Mann & Aebersold is one of the most cited manuscripts in the field of MS-based proteomics. For this reason it is one of the "core papers" in proteomics and computational proteomics, outlining most of the basic concepts required to understand the fundamentals of this discipline.

Ten years after its publication, the main workflow described in the manuscript has not changed dramatically. In this period, the major advances have been related to the development of Thermo's Orbitrap mass spectrometers (Velos, LTQ, Exactive, etc.) and new fragmentation types (ETD, HCD). Separation techniques (electrophoretic and chromatographic) were explored extensively over these ten years. Aebersold pioneered the use of OFFGEL electrophoresis and electrophoretic fractionation at the peptide level in 2005 (Heller 2005), and Mann's group developed the FASP method for sample preparation before protein digestion (Wiśniewski JR et al. 2009); both have contributed significantly to the dramatic increase in the number of identified proteins characterizing today's proteomics projects. Surprisingly, the development of electrophoretic methods in the last three years looks like a "passed-on topic". In ten years we have moved from identifying at most 500 species in complex samples to identifying 60% of the human proteome.

Friday 27 September 2013

Retrieve Weka Weights from Linear Regression

A small Weka helper to retrieve the regression weights from a trained SMOreg model (with a linear kernel):

import java.util.ArrayList;
import java.util.List;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMOreg;
import weka.classifiers.functions.supportVector.Kernel;
// ...


// 'svm' is a trained SMOreg model (with a linear kernel), held as a field of the enclosing class
public double[] getCoefficients(){
       String svmClassifier = svm.toString();
       List<Double> coefficients = new ArrayList<Double>();
       double[] coefficientsArray = null;
       // SMOreg only prints the attribute weights when a linear kernel is used
       if(svmClassifier.contains("weights (not support vectors):")){
          String[] svmClassStrings = svmClassifier.split(System.lineSeparator());
          int i = 0;
          boolean lastFound = false;
          while( i < svmClassStrings.length && !lastFound){
              String currentString = svmClassStrings[i];
              if(currentString.contains("weights (not support vectors):")){
                  i++;
                  // Each weight line looks like " +   0.0497 * (normalized) attributeName"
                  while(i < svmClassStrings.length && !lastFound){
                      currentString = svmClassStrings[i];
                      String[] parts = currentString.split("\\s+");
                      if(parts.length < 5){
                          // The weight block ends at the first line with fewer tokens
                          lastFound = true;
                      }else{
                          // parts[1] is the sign and parts[2] the value, e.g. "+" and "0.0497"
                          String number = parts[1] + parts[2];
                          coefficients.add(Double.parseDouble(number));
                      }
                      i++;
                  }
              }
              i++;
          }
          // Copy the collected weights into a primitive array
          coefficientsArray = new double[coefficients.size()];
          for(int j = 0; j < coefficients.size(); j++){
              coefficientsArray[j] = coefficients.get(j);
          }
       }
       return coefficientsArray;
}
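
For context, the trained model could be obtained along the lines of the sketch below (assuming a hypothetical regression ARFF file and the default linear PolyKernel, which is what makes SMOreg print the weight block parsed above); treat it as a sketch rather than a drop-in snippet.

import weka.classifiers.functions.SMOreg;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
// ...

// "train.arff" is a hypothetical regression dataset with the target as the last attribute
Instances data = DataSource.read("train.arff");
data.setClassIndex(data.numAttributes() - 1);

SMOreg svm = new SMOreg();
svm.buildClassifier(data);

double[] weights = getCoefficients(); // the helper above parses them from svm.toString()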
 

Wednesday 6 March 2013

#HavanaBioinfo2012 Hard, but Awesome Experience


From the 8th to the 11th of last December, the "I Bioinformatics for Biotechnology Applications" workshop (#HavanaBioinfo2012) was held at the hotel "Occidental Miramar", located in an elegant area of Havana, Cuba. Putting on a bioinformatics workshop takes a lot of different pieces (small ones and big ones): you need to write lots of emails to invited speakers, and you need a nice and comfortable place with plenty of tables and chairs, good food, and drinks.

I started in October, writing the first emails to EBI friends, advisers and professors (www.ebi.ac.uk), and to be completely honest it was fantastic, because most of them accepted from the very beginning. Thanks to Henning Hermjakob and Alex Bateman (@Alexbateman1) for the support. In the end the workshop was fully subscribed, with more than 45 attendees and sixteen speakers, participating in two poster sessions and a panel discussion. Speakers from the EBI, a Belgian university, Mascot (UK) and Bioinformatics Solutions (Canada) accepted the invitation to come here and give one or two lectures about proteomics, genomics and bioinformatics.