BigBio Notes: October 2014

Wednesday, 29 October 2014

What is BioHackathon 2014?

BioHackathon 2014 (http://www.biohackathon.org/) starts in a week. It will be my first time at this kind of "meeting". I will give a talk about PRIDE and ProteomeXchange and the future development of both platforms (the complete list of talks is below).

But first, a quick introduction to BioHackathon. The National Bioscience Database Center (NBDC) and the Database Center for Life Science (DBCLS) have organized an annual BioHackathon since 2008. It focuses mainly on standardization (ontologies, controlled vocabularies, metadata) and interoperability of bioinformatics data and web services, with the aim of improving the integration (semantic web, web services, data integration), preservation and utilization of databases in the life sciences. This year, we will focus on the standardization and utilization of human genome information with Semantic Web technologies, in addition to our previous efforts on semantic interoperability and standardization of bioinformatics data and Web services.

Ontologies versus controlled vocabularies.

While minimum data standards describe the types of data elements to be captured, using standard vocabularies as the values for those data elements is just as important for interoperability. In many cases, groups develop term lists (controlled vocabularies) that specify which words and phrases should be used as the values for a given data element. Ideally, each term is accompanied by a textual definition that describes what the term means, so that the term is used consistently. However, many bioinformaticians have begun to develop and adopt ontologies that can take the place of vocabularies as these lists of allowed terms. Like a controlled vocabulary, an ontology is a domain-specific dictionary of terms and definitions. But an ontology also captures the semantic relationships between the terms, which allows logical inference about the entities represented by the ontology and about the data annotated with the ontology's terms.

The semantic relationships incorporated into an ontology represent universal relations between the classes its terms stand for, based on previously established knowledge about the entities those terms describe. An ontology is a representation of universals; it describes what is general in reality, not what is particular. Thus, ontologies describe classes of entities, whereas databases tend to describe instances of entities.
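The difference between a flat term list and an ontology can be sketched in a few lines of Python. This is only a toy illustration: the three terms, their definitions, the `is_a` links and the sample records are made up for the example and are not copied from any OBO ontology.

```python
from collections import deque

# Controlled vocabulary: a flat list of allowed terms with definitions.
# (Illustrative terms and definitions, not taken from a real ontology.)
CONTROLLED_VOCABULARY = {
    "cell": "The basic structural unit of an organism.",
    "blood cell": "A cell found in blood.",
    "erythrocyte": "A blood cell that carries oxygen.",
}

# Ontology: the same terms plus semantic relationships between classes.
# Here only is_a links are modelled (child class -> parent classes).
IS_A = {
    "erythrocyte": ["blood cell"],
    "blood cell": ["cell"],
    "cell": [],
}

# Database records describe instances, each annotated with one class.
SAMPLES = {"sample_1": "erythrocyte", "sample_2": "cell"}


def ancestors(term, is_a):
    """Return every class reachable from `term` by following is_a links."""
    seen = set()
    queue = deque(is_a.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(is_a.get(parent, []))
    return seen


def find_samples(target, samples, is_a):
    """Find samples annotated with `target` or with any of its subclasses."""
    return sorted(
        name
        for name, term in samples.items()
        if term == target or target in ancestors(term, is_a)
    )
```

With only the vocabulary, a search for "blood cell" would miss `sample_1`, because its annotation is the literal string "erythrocyte". With the `is_a` links, `find_samples("blood cell", SAMPLES, IS_A)` can infer that an erythrocyte is a blood cell and return that sample.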

The Open Biomedical Ontology (OBO) library was established in 2001 as a repository of ontologies developed for use by the biomedical research community (http://sourceforge.net/projects/obo). In some cases, the ontology is composed of a highly focused set of terms to support the data annotation needs of a specific model organism community (e.g. the Plasmodium Life Cycle Ontology). In other cases, the ontology covers a broader set of terms that is intended to provide comprehensive coverage of an entire life science domain (e.g. the Cell Type Ontology). The European Bioinformatics Institute has also developed the Ontology Lookup Service (OLS) that provides a web service interface to query multiple OBO ontologies from a single location with a unified output format (http://www.ebi.ac.uk/ontology-lookup/). Both the BioPortal and the OLS permit users to browse individual ontologies and search for terms across ontologies according to term name and certain associated attributes.

Thursday, 23 October 2014

Which journals release more public proteomics data!!!

I'm a big fan of data and the -omics family. I also like the idea of making more and more of our data publicly available to others, not only for reuse but also to guarantee the reproducibility and quality assessment of the results (Making proteomics data accessible and reusable: Current state of proteomics databases and repositories). I'm wondering which of these journals (list - http://scholar.google.co.uk/) encourage their submitters and authors to make their data publicly available:

Journal	h5-index	h5-median
Molecular & Cellular Proteomics	74	101
Journal of Proteome Research	70	91
Proteomics	60	76
Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics	52	78
Journal of Proteomics	49	60
Proteomics - Clinical Applications	35	43
Proteome Science	23	32

A simple count based on PRIDE data gives:

Number of PRIDE projects by Journal
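A count like this could be produced roughly as follows. This is a minimal sketch, not the script actually used: it assumes the project metadata has already been exported as a list of records with a `journal` field, and the records, accessions and field names below are hypothetical placeholders, not real PRIDE data.

```python
from collections import Counter

# Journals from the table above.
JOURNALS = [
    "Molecular & Cellular Proteomics",
    "Journal of Proteome Research",
    "Proteomics",
    "Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics",
    "Journal of Proteomics",
    "Proteomics - Clinical Applications",
    "Proteome Science",
]

# Hypothetical placeholder records; real ones would come from PRIDE metadata.
EXAMPLE_PROJECTS = [
    {"accession": "EXAMPLE-1", "journal": "Proteomics"},
    {"accession": "EXAMPLE-2", "journal": "Journal of Proteome Research"},
    {"accession": "EXAMPLE-3", "journal": "Proteomics"},
    {"accession": "EXAMPLE-4", "journal": None},  # dataset with no publication
]


def projects_by_journal(projects, journals):
    """Count projects per journal, most projects first, ties by name."""
    counts = Counter(
        p["journal"] for p in projects if p.get("journal") in journals
    )
    # Journals with no projects are kept with a count of 0.
    rows = [(journal, counts.get(journal, 0)) for journal in journals]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


for journal, n in projects_by_journal(EXAMPLE_PROJECTS, JOURNALS):
    print(f"{journal}\t{n}")
```

Projects without a linked publication are simply skipped, so the totals only reflect datasets that can be tied to one of the listed journals.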

Analysis of histone modifications with PEAKS 7: A response to the search engine comparison from the PEAKS Team

Recently we posted a comparison of different search engines for PTM studies (Evaluation of Proteomic Search Engines for PTMs Identification). After some discussion of the results in that post, the PEAKS Team published a blog post with a reanalysis of the dataset. Here are the results:

Originally posted on the PEAKS blog:

The complex nature of histone modification patterns has posed a challenge for bioinformatics analysis over the years. Yuan et al. [1] conducted a study using two datasets from human HeLa histone samples to benchmark the performance of current proteomic search engines. The article was published in J Proteome Res. 2014 Aug 28 (PubMed), and the two datasets, HCD_Histone and CID_Histone (PXD001118), were made publicly available through ProteomeXchange. With this data, the article compares and evaluates the performance and capability of eight proteomic search engines: pFind, Mascot, SEQUEST, ProteinPilot, PEAKS 6, OMSSA, TPP and MaxQuant.

The study used PEAKS 6 for the comparison between search engines. However, PEAKS 7, released in November 2013, is the latest available version of the PEAKS Studio software. PEAKS 7 offers not only better performance than PEAKS 6 but also many additional and improved features. Our team has reanalyzed the two datasets, HCD_Histone and CID_Histone, with PEAKS 7 to update the identification results presented in the publication by Yuan et al. The updated results show that it is instead PEAKS, pFind and Mascot that identify the most confident results.
