BigBio Notes: June 2014

Sunday 8 June 2014

Thesis: Development of computational methods for analysing proteomic data for genome annotation

Thesis by Markus Brosch in 2009 about Computational proteomics methods for analysing proteomic data for genome annotation.

Notes from Abstract

Proteomic mass spectrometry is a method that enables sequencing of gene product fragments, enabling the validation and reﬁnement of existing gene annotation as well as the detection of novel protein coding regions. However, the application of proteomics data to genome annotation is hindered by the lack of suitable tools and methods to achieve automatic data processing and genome mapping at high accuracy and throughput.

In the ﬁrst part of this project I evaluate the scoring schemes of “Mascot”, which is a peptide identiﬁcation software that is routinely used, for low and high mass accuracy data and show these to be not suﬃciently accurate. I develop an alternative scoring method that provides more sensitive peptide identiﬁcation speciﬁcally for high accuracy data, while allowing the user to ﬁx the false discovery rate. Building upon this, I utilise the machine learning algorithm “Percolator” to further extend my Mascot scoring scheme with a large set of orthogonal scoring features that assess the quality of a peptide-spectrum match.

To close the gap between high throughput peptide identiﬁcation and large scale genome annotation analysis I introduce a proteogenomics pipeline. A comprehensive database is the central element of this pipeline, enabling the eﬃcient mapping of known and predicted peptides to their genomic loci, each of which is associated with supplemental annotation information such as gene and transcript identiﬁers.

In the last part of my project the pipeline is applied to a large mouse MS dataset. I show the value and the level of coverage that can be achieved for validating genes and gene structures, while also highlighting the limitations of this technique. Moreover, I show where peptide identiﬁcations facilitated the correction of existing annotation, such as re-deﬁning the translated regions or splice boundaries.

Moreover, I propose a set of novel genes that are identiﬁed by the MS analysis pipeline with high conﬁdence, but largely lack transcriptional or conservational evidence.

Java Optimization Tips (Memory, CPU Time and Code)

There are several common optimization techniques that apply regardless of the language being used. Some of these techniques, such as global register allocation, are sophisticated strategies to allocate machine resources (for example, CPU registers) and don't apply to Java bytecodes. We'll focus on the techniques that basically involve restructuring code and substituting equivalent operations within a method.

EntrySet vs KeySet

-----------------------------------------


More efficient:

for (Map.Entry entry : map.entrySet()) {
    Object key = entry.getKey();
    Object value = entry.getValue();
}

than:

for (Object key : map.keySet()) {
    Object value = map.get(key);
}

Avoid to create threads without run methods

------------------------------------


Usage Example: 

public class Test
{
 public void method() throws Exception
 {
  new Thread().start();  //VIOLATION
 }
}
Should be written as:

public class Test
{
 public void method(Runnable r) throws Exception
 {
  new Thread(r).start();  //FIXED
 }
}

Initialise the ArrayList if you know in advance the size

--------------------------------------------

 
For example, use this code if you expect your ArrayList to store around 1000 objects:

List str = new ArrayList(1000)

Use ternary operators

----------------------------------------


class Use_ternary_operator_correction
{
 public boolean test(String value)
 {
  if(value.equals("AppPerfect"))  // VIOLATION
  {
   return true;
  }
  else
  {
   return false;
  }
 }
}

Should be written as:


class Use_ternary_operator_correction
{
 public boolean test(String value)
 {
  return value.equals("AppPerfect"); // CORRECTION
 }
}

Always declare constant fields Static


public class Always_declare_constant_field_static_violation
{
 final int MAX = 1000; // VIOLATION
 final String NAME = "Noname"; // VIOLATION
}

Should be written as:

public class Always_declare_constant_field_static_correction
{
 static final int MAX = 1000; // CORRECTION
 static final String NAME = "Noname"; // VIOLATION
}