Tuesday, 26 August 2014

Adding CITATION to your R package

Original post from Robin's Blog:

Software is very important in science – but good software takes time and effort that could be used to do other work instead. I believe that it is important to do this work – but to make it worthwhile, people need to get credit for their work, and in academia that means citations. However, it is often very difficult to find out how to cite a piece of software – sometimes it is hidden away somewhere in the manual or on the web-page, but often it requires sending an email to the author asking them how they want it cited. The effort that this requires means that many people don’t bother to cite the software they use, and thus the authors don’t get the credit that they need. We need to change this, so that software – which underlies a huge amount of important scientific work – gets the recognition it deserves.

Making Your Code Citable

Original post from GitHub Guides:

Digital Object Identifiers (DOI) are the backbone of the academic reference and metrics system. If you’re a researcher writing software, this guide will show you how to make the work you share on GitHub citable by archiving one of your GitHub repositories and assigning a DOI with the data archiving tool Zenodo.
ProTip: This tutorial is aimed at researchers who want to cite GitHub repositories in academic literature. Provided you’ve already set up a GitHub repository, this tutorial can be completed without installing any special software. If you haven’t yet created a project on GitHub, start first by uploading your work to a repository.

Wednesday, 20 August 2014

ProteoStats: Computing false discovery rates in proteomics

By Amit K. Yadav (@theoneamit) & Yasset Perez-Riverol (@ypriverol):

Perl is a legacy language that many modern programmers find abstruse. I’m passionate about not letting a programming language such as Perl die. Although it is used less in computational proteomics, it is still widely used in bioinformatics, and I’m keen to write about new open-source Perl libraries that are easy to use. Two years ago, I wrote a post about InSilicoSpectro and how it can be used to study protein databases, as I did in “In silico analysis of accurate proteomics, complemented by selective isolation of peptides”.

Today’s post is about ProteoStats [1], a Perl library for False Discovery Rate (FDR) related calculations in proteomics studies. Some background for non-experts:

One of the central and most widely used approaches in shotgun proteomics is the use of database search tools to assign spectra to peptides (these assignments are called Peptide Spectrum Matches, or PSMs). To evaluate the quality of the assignments, these programs need to estimate and correct for population-wise error rates to keep the number of false positives under control. In that sense, the best strategy to control false positives is the target-decoy approach. Originally proposed by Elias & Gygi in 2007, the so-called classical FDR strategy involves a concatenated target-decoy (TD) database search for FDR estimation. This calculation is either done by the search engine or with in-house scripts that are typically unpublished, unbenchmarked and implemented in different ways.
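
To make the formulas concrete, here is a minimal sketch (not ProteoStats code; the class name and PSM scores are invented for illustration) of how the classical concatenated estimate and the separate-search estimate are computed from the counts of target and decoy PSMs passing a score threshold:

import java.util.List;

// Minimal sketch, not ProteoStats code: all scores below are invented for illustration.
public class TargetDecoyFdrSketch {

    // Count PSM scores at or above the acceptance threshold.
    static long countAbove(List<Double> scores, double threshold) {
        return scores.stream().filter(s -> s >= threshold).count();
    }

    public static void main(String[] args) {
        List<Double> targetScores = List.of(55.2, 48.1, 40.7, 33.9, 28.4, 22.0);
        List<Double> decoyScores  = List.of(30.5, 21.3, 18.9);
        double threshold = 25.0;

        long t = countAbove(targetScores, threshold); // target PSMs passing the threshold
        long d = countAbove(decoyScores, threshold);  // decoy PSMs passing the threshold

        // Concatenated target-decoy search (Elias & Gygi, 2007): FDR = 2D / (T + D)
        double fdrConcatenated = 2.0 * d / (t + d);

        // Separate target and decoy searches (Kall et al., 2008): FDR = D / T
        double fdrSeparate = (double) d / t;

        System.out.printf("T=%d, D=%d, FDR(concatenated)=%.3f, FDR(separate)=%.3f%n",
                t, d, fdrConcatenated, fdrSeparate);
    }
}

With a concatenated search the accepted decoy count is doubled because false matches are assumed to hit the target and decoy halves of the database equally often; with separate searches the decoy count is used directly.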

So far, the only library developed to compute FDR at the spectrum, peptide and protein levels is MAYU [2]. But while MAYU only uses the classical FDR approach, ProteoStats provides options for five different strategies for calculating the FDR; the only prerequisite is that you search against a separate TD database, as proposed by Kall et al. (2008) [3]. ProteoStats also provides a programming interface that can read the native output of the most widely used search tools and report FDR-related statistics. For tools that are not supported, pepXML, which has become a de facto standard output format, can be read directly, along with tabular text formats such as TSV and CSV (or any other well-defined separator).

Sunday, 8 June 2014

Thesis: Development of computational methods for analysing proteomic data for genome annotation

Thesis by Markus Brosch (2009) on computational proteomics methods for analysing proteomic data for genome annotation.

Notes from Abstract

Proteomic mass spectrometry is a method for sequencing gene product fragments, enabling the validation and refinement of existing gene annotation as well as the detection of novel protein-coding regions. However, the application of proteomics data to genome annotation is hindered by the lack of suitable tools and methods to achieve automatic data processing and genome mapping at high accuracy and throughput.

In the first part of this project I evaluate the scoring schemes of “Mascot”, a routinely used peptide identification tool, for low and high mass accuracy data and show that they are not sufficiently accurate. I develop an alternative scoring method that provides more sensitive peptide identification, specifically for high accuracy data, while allowing the user to fix the false discovery rate. Building on this, I use the machine learning algorithm “Percolator” to further extend my Mascot scoring scheme with a large set of orthogonal scoring features that assess the quality of a peptide-spectrum match.

To close the gap between high-throughput peptide identification and large-scale genome annotation analysis I introduce a proteogenomics pipeline. A comprehensive database is the central element of this pipeline, enabling the efficient mapping of known and predicted peptides to their genomic loci, each of which is associated with supplemental annotation information such as gene and transcript identifiers.
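
As a rough, hypothetical illustration (not taken from the thesis; the class, identifiers and coordinates below are invented), the kind of lookup such a database enables can be pictured as a map from peptide sequences to annotated genomic loci:

import java.util.List;
import java.util.Map;

// Hypothetical sketch only: identifiers and coordinates are invented for illustration
// and are not taken from the thesis or any real annotation.
public class PeptideToGenomeSketch {

    // A genomic locus annotated with the gene and transcript it belongs to.
    record GenomicLocus(String chromosome, int start, int end, String strand,
                        String geneId, String transcriptId) {}

    public static void main(String[] args) {
        // The database conceptually behaves like a map from peptide sequence to its genomic loci.
        Map<String, List<GenomicLocus>> peptideToLoci = Map.of(
                "SAMPLEPEPTIDER",
                List.of(new GenomicLocus("chr2", 1234500, 1234545, "+",
                        "GENE0001", "TRANSCRIPT0001")));

        peptideToLoci.forEach((peptide, loci) -> loci.forEach(locus ->
                System.out.println(peptide + " -> " + locus.chromosome() + ":"
                        + locus.start() + "-" + locus.end()
                        + " (" + locus.geneId() + ", " + locus.transcriptId() + ")")));
    }
}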

In the last part of my project the pipeline is applied to a large mouse MS dataset. I show the value and the level of coverage that can be achieved for validating genes and gene structures, while also highlighting the limitations of this technique. Moreover, I show where peptide identifications facilitated the correction of existing annotation, such as re-defining the translated regions or splice boundaries. 

Finally, I propose a set of novel genes that are identified with high confidence by the MS analysis pipeline but largely lack transcriptional or conservation evidence.



Java Optimization Tips (Memory, CPU Time and Code)


There are several common optimization techniques that apply regardless of the language being used. Some of these, such as global register allocation, are sophisticated strategies for allocating machine resources (for example, CPU registers) and don't apply to Java bytecode. We'll focus on techniques that involve restructuring code and substituting equivalent operations within a method.

EntrySet vs KeySet
-----------------------------------------

More efficient (each entry already carries its key and value, so no extra lookup is needed):

for (Map.Entry entry : map.entrySet()) {
    Object key = entry.getKey();
    Object value = entry.getValue();
}

than iterating over the keys and calling get() for each one, which performs an additional map lookup per iteration:

for (Object key : map.keySet()) {
    Object value = map.get(key);
}


Avoid creating threads without run methods
------------------------------------

Usage Example: 

public class Test
{
 public void method() throws Exception
 {
  new Thread().start();  //VIOLATION: no Runnable and no overridden run(), so the thread does nothing
 }
}
Should be written as:

public class Test
{
 public void method(Runnable r) throws Exception
 {
  new Thread(r).start();  //FIXED: the thread executes the supplied Runnable
 }
}

Initialise an ArrayList with its expected size if you know it in advance
--------------------------------------------
 
For example, use this code if you expect your ArrayList to store around 1000 objects:

List<String> list = new ArrayList<>(1000); // pre-sizing avoids repeated resizing and copying of the backing array


Use ternary operators
----------------------------------------

class Use_ternary_operator_correction
{
 public boolean test(String value)
 {
  if(value.equals("AppPerfect"))  // VIOLATION
  {
   return true;
  }
  else
  {
   return false;
  }
 }
}

Should be written as (the boolean expression can be returned directly, which is even simpler than a ternary):


class Use_ternary_operator_correction
{
 public boolean test(String value)
 {
  return value.equals("AppPerfect"); // CORRECTION
 }
}


Always declare constant fields static
----------------------------------------

public class Always_declare_constant_field_static_violation
{
 final int MAX = 1000; // VIOLATION
 final String NAME = "Noname"; // VIOLATION
}

Should be written as:

public class Always_declare_constant_field_static_correction
{
 static final int MAX = 1000; // CORRECTION
 static final String NAME = "Noname"; // CORRECTION
}

Sunday, 6 April 2014

SWATH-MS and next-generation targeted proteomics

For proteomics, two main LC-MS/MS strategies have been used thus far. In both, the sample proteins are converted by proteolysis into peptides, which are then separated by (capillary) liquid chromatography; the strategies differ in the mass spectrometric method used.

The first and most widely used strategy is known as shotgun proteomics or discovery proteomics. For this method, the MS instrument is operated in data-dependent acquisition (DDA) mode, where fragment ion (MS2) spectra for selected precursor ions detectable in a survey (MS1) scan are generated (Figure 1 - Discovery workflow). The resulting fragment ion spectra are then assigned to their corresponding peptide sequences by sequence database searching (See Open source libraries and frameworks for mass spectrometry based proteomics: A developer's perspective).

The second main strategy is referred to as targeted proteomics. There, the MS instrument is operated in selected reaction monitoring (SRM) mode (also called multiple reaction monitoring) (Figure 1 - Targeted Workflow). With this method, a sample is queried for the presence and quantity of a limited set of peptides that have to be specified prior to data acquisition. SRM does not require the explicit detection of the targeted precursors but proceeds by acquiring, sequentially across the LC retention time domain, predefined pairs of precursor and product ion masses, called transitions, several of which constitute a definitive assay for the detection of a peptide in a complex sample (See Targeted proteomics).
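
To make the terminology concrete, here is a small hypothetical sketch (the peptide sequence, m/z values and retention-time window are invented, and it is not tied to any particular SRM software) of what a transition and a peptide assay contain:

import java.util.List;

// Hypothetical sketch: the peptide sequence, m/z values and retention-time window
// below are invented purely to illustrate the structure of an SRM assay.
public class SrmAssaySketch {

    // One transition: a predefined precursor/product ion m/z pair that is monitored.
    record Transition(double precursorMz, double productMz) {}

    // A peptide assay: several transitions acquired within a retention-time window.
    record PeptideAssay(String peptideSequence, double rtStartMin, double rtEndMin,
                        List<Transition> transitions) {}

    public static void main(String[] args) {
        PeptideAssay assay = new PeptideAssay(
                "EXAMPLEPEPTIDEK",                       // hypothetical peptide
                22.0, 26.0,                              // illustrative RT window (minutes)
                List.of(new Transition(600.3, 744.5),    // illustrative m/z pairs
                        new Transition(600.3, 657.4),
                        new Transition(600.3, 544.3)));

        System.out.println("Monitoring " + assay.transitions().size()
                + " transitions for peptide " + assay.peptideSequence());
    }
}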

Figure 1 - Discovery and Targeted proteomics workflows