Integrating biological data is perhaps one of the most daunting tasks any bioinformatician has to face. Even a cursory look reveals two major obstacles: (i) the sheer amount of existing data, and (ii) the staggering variety of resources and data types used by the different groups working in the field (reviewed in [1]). In fact, data integration has a long-standing history in computational biology and bioinformatics. A comprehensive picture of the problem can be found in recent papers [2], but this short comment will serve to illustrate some of the hurdles of data integration and as a not-so-shameless plug for our contribution towards a solution.
"Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue." [1]
Some basic concepts
Traditionally, biological database integration efforts come in three main flavors:
- Federated: Sometimes termed portal, navigational or link integration, it is based on the use of hyperlinks to join data from disparate sources; early examples include SRS and Entrez. Using the federated approach, it is relatively easy to provide current, up-to-date information, but maintaining the hyperlinks requires considerable effort.
- Mediated or View Integration: Provides a unified query interface and collects the results from various data sources (BioMart).
- Warehouse: In this approach different data sources are stored in one place; examples include BioWarehouse and JBioWH. While it provides faster querying over joined datasets, it also requires extra care to maintain the underlying databases completely updated.
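To make the warehouse flavor concrete, here is a minimal sketch of the idea: copy records from two sources into one local store, then answer a joined question with a single query. Everything in it is an assumption made for illustration. The table names, accessions, gene names and drug names are invented, and SQLite from the Python standard library stands in for a real warehouse back end.

```python
import sqlite3

# Hypothetical records standing in for two independently maintained sources.
PROTEINS = [
    ("ACC001", "GENE_A"),
    ("ACC002", "GENE_B"),
    ("ACC003", "GENE_C"),
]
DRUG_TARGETS = [
    ("drug_x", "ACC001"),
    ("drug_y", "ACC001"),
    ("drug_z", "ACC002"),
]


def build_warehouse(conn):
    """Load both sources into one local schema (the 'warehouse')."""
    conn.execute("CREATE TABLE protein (accession TEXT PRIMARY KEY, gene TEXT)")
    conn.execute("CREATE TABLE drug_target (drug TEXT, accession TEXT)")
    conn.executemany("INSERT INTO protein VALUES (?, ?)", PROTEINS)
    conn.executemany("INSERT INTO drug_target VALUES (?, ?)", DRUG_TARGETS)
    conn.commit()


def drugs_per_gene(conn):
    """Count drugs per gene with one local join, no remote calls involved."""
    rows = conn.execute(
        "SELECT p.gene, COUNT(t.drug) "
        "FROM protein p "
        "LEFT JOIN drug_target t ON t.accession = p.accession "
        "GROUP BY p.gene ORDER BY p.gene"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_warehouse(connection)
    print(drugs_per_gene(connection))
```

The join itself is cheap because both tables live in the same place. The price is the loading step, which has to be repeated whenever either source publishes a new release.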
The Tower of Babel
Conceptual integration is an unavoidable task regardless of database type. For instance, the term “protein” covers different content in UniProt and in DrugBank. Were they to be integrated, the two “protein terms” would have to be represented under a “global protein concept” and a new data type, followed by implementation of the appropriate underlying entities and attributes. It is often argued that a single, central representation is possible only through a series of compromises that would ultimately limit the interpretation of the underlying data. To compound matters, data file formats change (daily) due to new standards or novel resources. New cross-references are added to databases in each release, meaning that biological data integration projects must walk the extra mile to update schemas and cross-references between datasets as well as to add new databases.
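The “global protein concept” can be sketched in a few lines, under loud assumptions: the field names below (`acc`, `protein_name`, `target_acc` and so on) are invented for the example and are not the real UniProt or DrugBank layouts, and the merge rule is just one possible compromise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalProtein:
    """Hypothetical unified record covering both source-specific views."""
    accession: str
    name: str
    sequence: Optional[str] = None
    drugs: tuple = ()
    sources: tuple = ()


def from_sequence_source(rec):
    # A sequence-centric source: knows the sequence, nothing about drugs.
    return GlobalProtein(
        accession=rec["acc"],
        name=rec["protein_name"],
        sequence=rec["seq"],
        sources=("sequence_db",),
    )


def from_drug_source(rec):
    # A drug-centric source: a "protein" here is a target with linked drugs.
    return GlobalProtein(
        accession=rec["target_acc"],
        name=rec["target_name"],
        drugs=tuple(rec["drug_ids"]),
        sources=("drug_db",),
    )


def merge(a, b):
    """Combine two views of the same protein into one global record."""
    if a.accession != b.accession:
        raise ValueError("records describe different proteins")
    return GlobalProtein(
        accession=a.accession,
        name=a.name or b.name,  # compromise: the first record's name wins
        sequence=a.sequence or b.sequence,
        drugs=tuple(sorted(set(a.drugs) | set(b.drugs))),
        sources=tuple(sorted(set(a.sources) | set(b.sources))),
    )
```

Note the comment on `name`: as soon as the two sources disagree, the global record has to pick one answer and discard the other, which is exactly the kind of compromise mentioned above.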
Large central resources at NCBI or EBI make great efforts to integrate various databases and provide a fundamental, high-level service for individual biologists around the world. However, querying distributed data comes with certain inherent limitations:
- Concerns about confidentiality often prevent enterprises from using such public services.
- High-performance applications, such as artificial intelligence studies, may require thousands of queries to be executed within one experiment; a machine learning algorithm that relies on database querying may need to retrieve gigabytes of data for each query, which makes the process slow and vulnerable to network failures.
- Some projects may require the flexible assembly of tailored datasets.
- Emerging technologies such as the molecular diagnostics envisioned by the concept of personalized medicine will raise database querying demands far above their current level, challenging the performance of currently available systems and databases.
- Last but not least, broadband internet access is far from ubiquitous. The current bandwidth limitations of many countries make the use of centralized services difficult, even for teaching purposes.
We recently developed Java BioWareHouse (JBioWH), an open-source framework aimed at users who need to query multiple public datasets in a flexible way using their personal computer or a local workstation. JBioWH is based on Java (duh!) and MySQL-based data warehousing, and is used to construct application-specific databases. Three Java APIs were designed for this purpose: jbiowh-core, which includes basic classes such as the logger and utilities for file handling; jbiowh-dbms, which includes classes used to connect to and directly manipulate the relational schema using the mysql-connector-java API; and jbiowh-persistence, which includes the JPA classes and controllers to access data, through the EclipseLink API, in an object-oriented manner. We then built and released two main applications on top of these APIs: JBioWH Parser (used to insert the data into the relational schema) and Desktop Client (a graphical interface for data visualization).
An online wiki is available on the project web site at http://code.google.com/p/jbiowh/wiki/Summary?tm=6, along with a JBioWH discussion list where you can find examples and ask questions about how to use the framework.
Biological data integration poses obstacles that are both large and complex. However, here’s hoping that the field will continue to attract innovative and far-sighted scientists to further bridge the gap between data and researchers.
"JBioWH: an open-source Java framework for bioinformatics data integration. Vera R, Perez-Riverol Y, Perez S, Ligeti B, Kertész-Farkas A, Pongor S. Database (Oxford). 2013 Jul 11;2013:bat051. doi: 10.1093/database/bat051. Print 2013."
by Yasset Perez-Riverol & Roberto Vera
Read More Here:
1. Burge,S., Attwood,T.K., Bateman,A., Berardini,T.Z., Cherry,M., O’Donovan,C., Xenarios,I. and Gaudet,P. (2012) Biocurators and Biocuration: surveying the 21st century challenges. Database, 2012.
2. Sansone,S.-A., Rocca-Serra,P., Field,D., Maguire,E., Taylor,C., Hofmann,O., Fang,H., Neumann,S., Tong,W., Amaral-Zettler,L., et al. (2012) Toward interoperable bioscience data. Nature Genetics, 44, 121–6.