Showing posts with label bioconductor.
Saturday, 29 August 2015
DIA-Umpire Pipeline Using BioDocker Containers
Thursday, 13 August 2015
The future of Proteomics: The Consensus

After the big Nature papers about the Human Proteome [1][2], the proteomics community has been divided by the same well-known topics that genomics faced before: same reasons, same discussions [3-7]. No one is debating the technical issues, the instrument settings, the sample processing, or even the analytical method (both projects are mostly "common" bottom-up experiments). The main points of contention are data-analysis problems, and they remain open computational proteomics challenges.
Labels:
Andromeda,
big data,
bioconductor,
Bioinformatic,
biological databases,
computational proteomics,
data analysis,
false discovery rates
Friday, 1 November 2013
Integrating the Biological Universe
Integrating biological data is perhaps one of the most daunting tasks any bioinformatician has to face. From a cursory look, it is easy to see two major obstacles standing in the way: (i) the sheer amount of existing data, and (ii) the staggering variety of resources and data types used by the different groups working in the field (reviewed at [1]). In fact, the topic of data integration has a long-standing history in computational biology and bioinformatics. A comprehensive picture of this problem can be found in recent papers [2], but this short comment will serve to illustrate some of the hurdles of data integration and as a not-so-shameless plug for our contribution towards a solution.
"Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue." [1]
"Reflecting the data-driven nature of modern biology, databases have grown considerably both in size and number during the last decade. The exact number of databases is difficult to ascertain. While not exhaustive, the 2011 Nucleic Acids Research (NAR) online database collection lists 1330 published biodatabases (1), and estimates derived from the ELIXIR database provider survey suggest an approximate annual growth rate of ∼12% (2). Globally, the numbers are likely to be significantly higher than those mentioned in the online collection, not least because many are unpublished, or not published in the NAR database issue." [1]
Some basic concepts
Traditionally, biological database integration efforts come in three main flavors:
- Federated: Sometimes termed portal, navigational or link integration, it is based on the use of hyperlinks to join data from disparate sources; early examples include SRS and Entrez. Using the federated approach, it is relatively easy to provide current, up-to-date information, but maintaining the hyperlinks requires considerable effort.
- Mediated or View Integration: Provides a unified query interface and collects the results from various data sources on the fly (e.g. BioMart); see the short sketch after this list.
- Warehouse: In this approach the different data sources are stored in one place; examples include BioWarehouse and JBioWH. While it provides faster querying over joined datasets, it also requires extra care to keep the underlying databases up to date.
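To make the mediated/view flavour a bit more concrete, here is a minimal sketch using the Bioconductor package biomaRt, which sends a single query to the BioMart service and collects the joined result as a data frame; the dataset, attribute and filter names below are only illustrative and may differ between Ensembl releases, so check listDatasets() and listAttributes() before relying on them.

## Mediated / view integration with biomaRt: one query, joined results
## collected from the remote BioMart service.
## install.packages("BiocManager"); BiocManager::install("biomaRt")
library(biomaRt)

# Connect to the Ensembl gene mart, human dataset (names are illustrative)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# A single query joining gene symbols, Ensembl gene IDs and UniProt accessions
res <- getBM(attributes = c("hgnc_symbol", "ensembl_gene_id", "uniprotswissprot"),
             filters    = "hgnc_symbol",
             values     = c("ALB", "TP53", "EGFR"),
             mart       = mart)
head(res)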
Labels:
big data,
bioconductor,
Bioinformatics,
biological databases,
computational proteomics,
data analysis,
EBI,
java,
JBioWH,
proteomics
Monday, 28 October 2013
One step ahead in Bioinformatics using Package Repositories
About a year ago I published a post about in-house tools in research and how using this type of software may end up undermining the quality of a manuscript and the reproducibility of its results. While I can certainly relate to someone reluctant to release nasty code (i.e., not commented, not well tested, not documented), I still think we must provide (as supporting information) all "in-house" tools that have been used to reach a result we intend to publish. This applies especially to manuscripts dealing with software packages, tools, etc. I am willing to cut some slack to journals such as Analytical Chemistry or Molecular & Cellular Proteomics, whose editorial staffs are, and rightly so, more concerned about quality issues involving raw data and experimental reproducibility; but for journals like Bioinformatics, BMC Bioinformatics, several members of the Nature family and others at the forefront of bioinformatics, methinks we should hold them to a higher standard. Some of these journals would greatly benefit from implementing a review system from the point of view of software production, moving bioinformatics and science in general one step forward in terms of reproducibility and software reusability. What do you think would happen if the following were checked during peer review?
Labels:
bioconductor,
Bioinformatics,
computational proteomics,
in-house programs,
java,
open source,
perl,
PHP,
R,
science & research
Thursday, 17 October 2013
Introduction to Feature selection for bioinformaticians using R, correlation matrix filters, PCA & backward selection
Bioinformatics is becoming more and more a Data Mining field. Every passing day, Genomics and Proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying to reduce the dimensionality of a specific dataset is, therefore, critical if one wishes to analyze and understand their impact on a model, or to identify what attributes produce a specific biological effect.
For instance, consider a predictive model C1A1 + C2A2 + C3A3 + … + CnAn = S, where the Ci are constants, the Ai are features or attributes, and S is the predictor output (retention time, toxicity, score, etc.). It is essential to identify which of those features (A1, A2, …, An) are most relevant to the model and to understand how they correlate with S, as working with such a subset lets the researcher discard a lot of irrelevant and redundant information.
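A minimal sketch in R of the three strategies named in the title (correlation-matrix filtering, PCA and backward selection), using simulated data and the caret package; the cutoffs, feature names and model choices below are only illustrative.

library(caret)   # findCorrelation(), rfe()

set.seed(42)
n <- 200
# Simulated feature matrix A1..A10; A2 is made almost identical to A1
X <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(X) <- paste0("A", 1:10)
X$A2 <- X$A1 + rnorm(n, sd = 0.05)
S <- 2 * X$A1 - 3 * X$A5 + rnorm(n)      # predictor output

# 1. Correlation-matrix filter: drop features correlated above 0.9
corMat    <- cor(X)
tooHigh   <- findCorrelation(corMat, cutoff = 0.9)
Xfiltered <- X[, -tooHigh]

# 2. PCA: project the remaining features onto orthogonal components
pca <- prcomp(Xfiltered, scale. = TRUE)
summary(pca)                             # variance explained per component

# 3. Backward selection (recursive feature elimination) with caret
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
sel  <- rfe(Xfiltered, S, sizes = 2:5, rfeControl = ctrl)
predictors(sel)                          # attributes retained by the search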
Labels:
big data,
bioconductor,
Bioinformatics,
caret,
computational proteomics,
FactoMineR,
feature selection,
R,
statistics
Saturday, 25 August 2012
Why R for Mass Spectrometrists and Computational Proteomics
Why R:
Integrating statistical analysis and in silico predictions of the data behind a manuscript, or of the data generated in daily research, is now common practice. Mass spectrometrists, biologists and bioinformaticians commonly use programs like Excel, Calc or other office tools to generate their charts and statistical analyses. In recent years, however, many computational biologists, especially those from the genomics field, have come to regard R and Bioconductor as fundamental tools for their research.
R is a modern, functional programming language that allows rapid development of ideas; it is a language and environment for statistical computing and graphics. Its rich set of built-in functions makes it ideal for high-volume analyses and statistical studies.
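As a small, self-contained illustration of the kind of everyday analysis often done in a spreadsheet, here is a sketch with made-up protein intensities: a two-group comparison plus a quick graphic in a few lines of base R.

set.seed(1)
control   <- rnorm(20, mean = 100, sd = 10)   # simulated intensities, group 1
treatment <- rnorm(20, mean = 115, sd = 10)   # simulated intensities, group 2

t.test(control, treatment)                    # Welch two-sample t-test

boxplot(list(Control = control, Treatment = treatment),
        ylab = "Intensity", main = "Simulated protein abundance")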
Labels:
bioconductor,
computational proteomics,
data analysis,
distributions,
insilico analysis,
mass spectrometry,
principal component analysis,
R,
statistics,
venn diagrams