Sunday, 8 June 2014

Thesis: Development of computational methods for analysing proteomic data for genome annotation

Thesis by Markus Brosch (2009) on computational methods for analysing proteomic data for genome annotation.

Notes from Abstract

Proteomic mass spectrometry is a method that enables sequencing of gene product fragments, allowing the validation and refinement of existing gene annotation as well as the detection of novel protein-coding regions. However, the application of proteomics data to genome annotation is hindered by the lack of suitable tools and methods to achieve automatic data processing and genome mapping at high accuracy and throughput.

In the first part of this project I evaluate the scoring schemes of "Mascot", a routinely used peptide identification software, for low and high mass accuracy data and show that they are not sufficiently accurate. I develop an alternative scoring method that provides more sensitive peptide identification, specifically for high accuracy data, while allowing the user to fix the false discovery rate. Building upon this, I utilise the machine learning algorithm "Percolator" to further extend my Mascot scoring scheme with a large set of orthogonal scoring features that assess the quality of a peptide-spectrum match.

To close the gap between high throughput peptide identification and large scale genome annotation analysis I introduce a proteogenomics pipeline. A comprehensive database is the central element of this pipeline, enabling the efficient mapping of known and predicted peptides to their genomic loci, each of which is associated with supplemental annotation information such as gene and transcript identifiers.

In the last part of my project the pipeline is applied to a large mouse MS dataset. I show the value and the level of coverage that can be achieved for validating genes and gene structures, while also highlighting the limitations of this technique. Moreover, I show where peptide identifications facilitated the correction of existing annotation, such as re-defining the translated regions or splice boundaries. 

Moreover, I propose a set of novel genes that are identified by the MS analysis pipeline with high confidence, but largely lack transcriptional or conservation evidence.



Java Optimization Tips (Memory, CPU Time and Code)


There are several common optimization techniques that apply regardless of the language being used. Some of these techniques, such as global register allocation, are sophisticated strategies to allocate machine resources (for example, CPU registers) and don't apply to Java bytecodes. We'll focus on the techniques that basically involve restructuring code and substituting equivalent operations within a method.  

EntrySet vs KeySet
-----------------------------------------

More efficient:

for (Map.Entry entry : map.entrySet()) {
    Object key = entry.getKey();
    Object value = entry.getValue();
}

than:

for (Object key : map.keySet()) {
    Object value = map.get(key);
}
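As a runnable illustration of the entry-set pattern (the class and method names here are mine, purely for the example): summing the values of a map in a single pass over its entry set avoids the extra hash lookup that `map.get(key)` would cost on every iteration.

```java
import java.util.HashMap;
import java.util.Map;

public class EntrySetDemo {
    // Sums all values in one pass over the entry set; no extra map.get() per key.
    public static int sumValues(Map<String, Integer> map) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            sum += entry.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        System.out.println(sumValues(map)); // prints 6
    }
}
```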


Avoid creating threads without a Runnable
------------------------------------

Usage Example: 

public class Test
{
    public void method() throws Exception
    {
        new Thread().start();  // VIOLATION: this thread has nothing to run
    }
}

Should be written as:

public class Test
{
    public void method(Runnable r) throws Exception
    {
        new Thread(r).start();  // FIXED
    }
}
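A complete sketch of the fixed pattern, passing the work as a lambda (the helper name `runOnThread` is my own, used only for this example):

```java
public class RunnableDemo {
    // Starts a new thread with actual work to do and waits for it to finish.
    public static void runOnThread(Runnable task) throws InterruptedException {
        Thread worker = new Thread(task); // FIXED: the thread has a task to run
        worker.start();
        worker.join(); // block until the worker is done
    }

    public static void main(String[] args) throws InterruptedException {
        runOnThread(() -> System.out.println("working in " + Thread.currentThread().getName()));
    }
}
```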

Initialise the ArrayList if you know the size in advance
--------------------------------------------
 
For example, use this code if you expect your ArrayList to store around 1000 objects:

List<String> list = new ArrayList<>(1000);
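A self-contained sketch (the class name is illustrative): constructing the list with an initial capacity means the backing array is allocated once up front, instead of being grown and copied repeatedly as elements are added.

```java
import java.util.ArrayList;
import java.util.List;

public class PresizedListDemo {
    // Builds a list of n elements with the backing array allocated once up front.
    public static List<Integer> build(int n) {
        List<Integer> list = new ArrayList<>(n); // avoids repeated internal array resizes
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(build(1000).size()); // prints 1000
    }
}
```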


Use ternary operators
----------------------------------------

class Use_ternary_operator_correction
{
    public boolean test(String value)
    {
        if (value.equals("AppPerfect"))  // VIOLATION
        {
            return true;
        }
        else
        {
            return false;
        }
    }
}

Should be written as:

class Use_ternary_operator_correction
{
    public boolean test(String value)
    {
        return value.equals("AppPerfect"); // CORRECTION
    }
}
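When the two branches return different values rather than plain true/false, the conditional (ternary) operator collapses the if/else into a single expression. A minimal sketch (class and method names are mine, for illustration only):

```java
public class TernaryDemo {
    // One-line ternary replacing a full if/else that returns different values.
    public static int max(int a, int b) {
        return (a > b) ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7)); // prints 7
    }
}
```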


Always declare constant fields static
----------------------------------------

public class Always_declare_constant_field_static_violation
{
    final int MAX = 1000; // VIOLATION
    final String NAME = "Noname"; // VIOLATION
}

Should be written as:

public class Always_declare_constant_field_static_correction
{
    static final int MAX = 1000; // CORRECTION
    static final String NAME = "Noname"; // CORRECTION
}

Sunday, 6 April 2014

SWATH-MS and next-generation targeted proteomics

For proteomics, two main LC-MS/MS strategies have been used thus far. They have in common that the sample proteins are converted by proteolysis into peptides, which are then separated by (capillary) liquid chromatography. They differ in the mass spectrometric method used.

The first and most widely used strategy is known as shotgun proteomics or discovery proteomics. For this method, the MS instrument is operated in data-dependent acquisition (DDA) mode, where fragment ion (MS2) spectra for selected precursor ions detectable in a survey (MS1) scan are generated (Figure 1 - Discovery workflow). The resulting fragment ion spectra are then assigned to their corresponding peptide sequences by sequence database searching (See Open source libraries and frameworks for mass spectrometry based proteomics: A developer's perspective).

The second main strategy is referred to as targeted proteomics. Here, the MS instrument is operated in selected reaction monitoring (SRM; also called multiple reaction monitoring) mode (Figure 1 - Targeted Workflow). With this method, a sample is queried for the presence and quantity of a limited set of peptides that have to be specified prior to data acquisition. SRM does not require the explicit detection of the targeted precursors but proceeds by acquiring, sequentially across the LC retention time domain, predefined pairs of precursor and product ion masses, called transitions, several of which constitute a definitive assay for the detection of a peptide in a complex sample (see Targeted proteomics).
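The transition concept above can be sketched as a simple data structure. This is purely illustrative: the class, field names and m/z values below are my own invention, not from any cited software.

```java
import java.util.List;

// Illustrative model of an SRM "transition": a predefined precursor/product m/z pair.
public class SrmTransition {
    final double precursorMz; // mass-to-charge of the targeted peptide precursor
    final double productMz;   // mass-to-charge of one expected fragment ion

    SrmTransition(double precursorMz, double productMz) {
        this.precursorMz = precursorMz;
        this.productMz = productMz;
    }

    public static void main(String[] args) {
        // Several transitions together constitute the assay for one peptide.
        List<SrmTransition> assay = List.of(
                new SrmTransition(523.8, 658.3),
                new SrmTransition(523.8, 787.4),
                new SrmTransition(523.8, 886.5));
        System.out.println("transitions in assay: " + assay.size()); // prints 3
    }
}
```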

Figure 1 - Discovery and Targeted proteomics workflows

Monday, 3 March 2014

Most-read articles from the Journal of Proteome Research for 2013.

1.  Protein Digestion: An Overview of the Available Techniques and Recent Developments
    Linda Switzar, Martin Giera, Wilfried M. A. Niessen
    DOI: 10.1021/pr301201x

2.  Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment
    Jürgen Cox, Nadin Neuhauser, Annette Michalski, Richard A. Scheltema, Jesper V. Olsen, Matthias Mann
    DOI: 10.1021/pr101065j

3.  Evaluation and Optimization of Mass Spectrometric Settings during Data-dependent Acquisition Mode: Focus on LTQ-Orbitrap Mass Analyzers
    Anastasia Kalli, Geoffrey T. Smith, Michael J. Sweredoski, Sonja Hess
    DOI: 10.1021/pr3011588

4.  An Automated Pipeline for High-Throughput Label-Free Quantitative Proteomics
    Hendrik Weisser, Sven Nahnsen, Jonas Grossmann, Lars Nilse, Andreas Quandt, Hendrik Brauer, Marc Sturm, Erhan Kenar, Oliver Kohlbacher, Ruedi Aebersold, Lars Malmström
    DOI: 10.1021/pr300992u

5.  Proteome Wide Purification and Identification of O-GlcNAc-Modified Proteins Using Click Chemistry and Mass Spectrometry
    Hannes Hahne, Nadine Sobotzki, Tamara Nyberg, Dominic Helm, Vladimir S. Borodkin, Daan M. F. van Aalten, Brian Agnew, Bernhard Kuster
    DOI: 10.1021/pr300967y

6.  A Proteomics Search Algorithm Specifically Designed for High-Resolution Tandem Mass Spectra
    Craig D. Wenger, Joshua J. Coon
    DOI: 10.1021/pr301024c

7.  Analyzing Protein–Protein Interaction Networks
    Gavin C. K. W. Koh, Pablo Porras, Bruno Aranda, Henning Hermjakob, Sandra E. Orchard
    DOI: 10.1021/pr201211w

8.  Combination of FASP and StageTip-Based Fractionation Allows In-Depth Analysis of the Hippocampal Membrane Proteome
    Jacek R. Wisniewski, Alexandre Zougman, Matthias Mann
    DOI: 10.1021/pr900748n

9.  The Biology/Disease-driven Human Proteome Project (B/D-HPP): Enabling Protein Research for the Life Sciences Community
    Ruedi Aebersold, Gary D. Bader, Aled M. Edwards, Jennifer E. van Eyk, Martin Kussmann, Jun Qin, Gilbert S. Omenn
    DOI: 10.1021/pr301151m

10. Comparative Study of Targeted and Label-free Mass Spectrometry Methods for Protein Quantification
    Linda IJsselstijn, Marcel P. Stoop, Christoph Stingl, Peter A. E. Sillevis Smitt, Theo M. Luider, Lennard J. M. Dekker
    DOI: 10.1021/pr301221f

Wednesday, 19 February 2014

In the era of science communication, why do you need Twitter, a professional blog and ImpactStory?

Where is the information? Where are the scientifically relevant results? Where are the good ideas? Are these things (only) in journals? I usually prefer to write about bioinformatics and how we should include, annotate and cite our bioinformatics tools inside research papers (The importance of Package Repositories for Science and Research, The problem of in-house tools); but this post represents my take on the future of scientific publications and their dissemination based on the manuscript “Beyond the paper” (1).

In the not too distant future, today's science journals will be replaced by a set of decentralized, interoperable services that are built on a core infrastructure of open data and evolving standards, like the Internet itself. What the journal did in the past for a single article, social media and internet resources are now doing for the entire scholarly output. We are immersed in a transition to another science communication system, one that will tap into Web technology to significantly improve dissemination. I prefer to represent the future of science communication as a block diagram in which the four main components: (i) Data, (ii) Publications, (iii) Dissemination and (iv) Certification/Reward are completely interconnected:

Friday, 7 February 2014

Solving Invalid signature in JNLP

I got this error each time I ran my JNLP:

invalid SHA1 signature file digest for

I found some discussions about possible solutions:

http://stackoverflow.com/questions/8176166/invalid-sha1-signature-file-digest

http://stackoverflow.com/questions/11673707/java-web-start-jar-signing-issue

But the problem was still there. I solved it by using the plugin option <unsignAlreadySignedJars>true</unsignAlreadySignedJars>, which removes previous signatures and so avoids possible signature duplications:



<plugin>
  <groupId>org.codehaus.mojo.webstart</groupId>
  <artifactId>webstart-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>jnlp-building</id>
      <phase>package</phase>
      <goals>
        <goal>jnlp</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <!-- Include all the dependencies -->
    <excludeTransitive>false</excludeTransitive>
    <unsignAlreadySignedJars>true</unsignAlreadySignedJars>
    <verbose>true</verbose>
    <verifyjar>true</verifyjar>
    <!-- The path where the libraries are stored -->
    <libPath>lib</libPath>
    <jnlp>
      <inputTemplate>webstart/jnlp-template.vm</inputTemplate>
      <outputFile>ProteoLimsViewer.jnlp</outputFile>
      <mainClass>cu.edu.cigb.biocomp.proteolims.gui.ProteoLimsViewer</mainClass>
    </jnlp>
    <sign>
      <keystore>keystorefile</keystore>
      <alias>proteolimsviewer</alias>
      <storepass>password</storepass>
      <keypass>password</keypass>
      <keystoreConfig>
        <delete>false</delete>
        <gen>false</gen>
      </keystoreConfig>
    </sign>
    <!-- building process -->
    <pack200>false</pack200>
    <verbose>true</verbose>
  </configuration>
</plugin>