In contrast with genomics,
bioinformaticians in proteomics don’t have a "big" and "complete" perl
library for proteomics data analysis. It could be related with the
"heterogeneity" in proteomics. A lot of different instruments,
protocols, properties. Also genomic have a huge community
(bioinformaticians) and standardize tools (instruments and software’s).
In 2006 Collinge and Masselot published an open-source perl library
named InSilicoSpectro. The aim was provide a set of recurrent functions that are necessary for proteomics data analysis.
Some of the Illustrative functions are: mz list file format conversions, protein sequence digestion, theoretical peptide and fragment mass computations, graphical display, matching with experimental data, isoelectric point estimation (with different methods), and peptide retention time prediction.
At the end of the manuscript abstract the authors says: "We believe that
InSilicoSpectro will be of great help to bioinformaticians, without
detailed knowledge of proteomics specifics, and to mass spectrometrists
with computer programming interest as well"
But what we can do with InsilicoSpectro & Bioperl:
Reading a Fasta File and make a Tryptic Digestion:
----------------
#!/usr/bin/perl
use Bio::SeqIO;
use Bio::Seq;
use IO::String;
use InSilicoSpectro;
use InSilicoSpectro::InSilico::MassCalculator;
use InSilicoSpectro::InSilico::CleavEnzyme;
use InSilicoSpectro::InSilico::AASequence;
use InSilicoSpectro::InSilico::Peptide;
use InSilicoSpectro::InSilico::IsoelPoint;
use InSilicoSpectro::InSilico::ExpCalibrator;
undef $InSilicoSpectro::InSilico::MassCalculator::invalidElementCall;
$inFile = Bio::SeqIO->new(-file => "$ARGV[0]", -format => 'fasta');
my $enzyme = $ARGV[1]; // Enzyme
my $miss = $ARGV[2]; // Miss Cleavage Sites
my $name_out = $ARGV[3]; // out put file
InSilicoSpectro::init("insilicodef.xml"); // file of InsilicoSpectro Definitions
open (OUTDATA, ">$name_out") or die("Error: cannot open file $name_out\n");
while (my $Protein = $inFile->next_seq()){
$id = $Protein->display_id(); # Id. Protein.
$seq = $Protein->seq(); # String of sequence
$description = $Protein->description(); # Description of the sequence
use Bio::SeqIO;
use Bio::Seq;
use IO::String;
use InSilicoSpectro;
use InSilicoSpectro::InSilico::MassCalculator;
use InSilicoSpectro::InSilico::CleavEnzyme;
use InSilicoSpectro::InSilico::AASequence;
use InSilicoSpectro::InSilico::Peptide;
use InSilicoSpectro::InSilico::IsoelPoint;
use InSilicoSpectro::InSilico::ExpCalibrator;
undef $InSilicoSpectro::InSilico::MassCalculator::invalidElementCall;
$inFile = Bio::SeqIO->new(-file => "$ARGV[0]", -format => 'fasta');
my $enzyme = $ARGV[1]; // Enzyme
my $miss = $ARGV[2]; // Miss Cleavage Sites
my $name_out = $ARGV[3]; // out put file
InSilicoSpectro::init("insilicodef.xml"); // file of InsilicoSpectro Definitions
open (OUTDATA, ">$name_out") or die("Error: cannot open file $name_out\n");
while (my $Protein = $inFile->next_seq()){
$id = $Protein->display_id(); # Id. Protein.
$seq = $Protein->seq(); # String of sequence
$description = $Protein->description(); # Description of the sequence
my $protein = new InSilicoSpectro::InSilico::Peptide(sequence=>"$seq",modif=>"");
my $proteinSequence = new InSilicoSpectro::InSilico::AASequence(sequence=>$seq, AC=>$id);
my $proteinSequence = new InSilicoSpectro::InSilico::AASequence(sequence=>$seq, AC=>$id);
$mass_value = $protein->getMass();
$protein_mass_value = sprintf("%.6f", $mass_value); // Get the protein mass
$protein_mass_value = sprintf("%.6f", $mass_value); // Get the protein mass
@result = digestByRegExp(protein=>$ProteinSequence,minMass=>"0",nmc=>$misscleave,
enzyme=>InSilicoSpectro::InSilico::CleavEnzyme::getFromDico("$enzyme"));
foreach $p (@result){
$i++;
$peptide = $p->sequence;
print (OUTDATA ">$id\t$description\t$protein_mass_value\n");
print (OUTDATA "$peptide\n");
}foreach $p (@result){
$i++;
$peptide = $p->sequence;
print (OUTDATA ">$id\t$description\t$protein_mass_value\n");
print (OUTDATA "$peptide\n");
}close OUTDATA or die("Error: cannot close file $name_out\n"); # wait for close output file.
---------------------
Whit a simple script the user can obtain very good results without efforts. One of the key feature is the use of this library for protein database analysis. Proteomic Identifications are mainly based on Database Search. Is a common practice when you are writing your manuscript put some "estimations" or predictions about the possible behavior of the experiments. Most of the time you need to use database knowledge and a statistical background of the database, even before the experiment design...
An example:
Distribution of mass for Unique Peptides for different peptide tolerance error in ppm:
The process could be divided in three steps. Each of then could be computed using InsilicoSpectro. We can reuse the first script to digest the protein sequences and put some code inside to filter by peptide mass and retrieve the peptide with the mass annotated:
foreach $p (@result){
$i++;
$peptide = $p->sequence;
$i++;
$peptide = $p->sequence;
my $peptideInsilico = new InSilicoSpectro::InSilico::Peptide(sequence=>"$peptide",modif=>"");
$mass_value = $peptideInsilico->getMass();
$peptide_mass = sprintf("%.6f", $mass_value); // Get the protein mass
$peptide_mass = sprintf("%.6f", $mass_value); // Get the protein mass
if (($peptide_mass >= $ARGV[4]) && ($peptide_mass <= $ARGV[5])){
print (OUTDATA "$peptide\t$peptide_mass\n");
}
}
This small change filter all peptide between 800-3500 (QTOF resolution) and compute the mass of each peptide. After this small change, an R script or a common perl algorithm can help to retrieve the histogram by peptide mass.
Another key feature of InsilicoSpectro is the estimation of different peptide/protein properties like isoelectric point, retention time.. Also function for Mascot .dat peptide/spectrum match extraction and processing.
the user can predict easily the isoelectric point with algorithm developed by David Tabb and different datatasets:
Amino acid | NH2 | COOH | C | D | E | H | K | R | Y |
EMBOSS | 8.6 | 3.6 | 8.5 | 3.9 | 4.1 | 6.5 | 10.8 | 12.5 | 10.1 |
DTASelect | 8.0 | 3.1 | 8.5 | 4.4 | 4.4 | 6.5 | 10.0 | 12.0 | 10.0 |
Solomon | 9.6 | 2.4 | 8.3 | 3.9 | 4.3 | 6.0 | 10.5 | 12.5 | 10.1 |
Sillero | 8.2 | 3.2 | 9.0 | 4.0 | 4.5 | 6.4 | 10.4 | 12.0 | 10.0 |
Rodwell | 8.0 | 3.1 | 8.33 | 3.68 | 4.25 | 6.0 | 11.5 | 11.5 | 10.07 |
Patrickios | 11.2 | 4.2 | - | 4.2 | 4.2 | - | 11.2 | 11.2 | - |
Lehninger | 9.69 | 2.34 | 8.33 | 3.86 | 4.25 | 6.0 | 10.5 | 12.4 | 10.0 |
Retention time could be compute with Krokin and Petritis methods with simple functions.
An example of isoelectric point calculation is:
$pi = InSilicoSpectro::InSilico::IsoelPoint->new(method=>"iterative",current=>"Lehninger",%settings);
$piPeptide = $pi->predict(peptide => uc $peptide);
An important module of the library is related with the spectrum processing. Even process is considered as "more specialized", some researcher with less bioinformatician background uses these functions to evaluate the spectrum quality...
InsilicoSpectro give an excellent opportunity to process your database and your identification data...
Take a look!!!
http://search.cpan.org/~alexmass/InSilicoSpectro-1.3.24/
http://pubs.acs.org/doi/abs/10.1021/pr0504236
http://pubs.acs.org/doi/abs/10.1021/ac070488n
http://pubs.acs.org/doi/abs/10.1021/pr700840y
http://pubs.acs.org/doi/abs/10.1021/pr200031z
Yes you can compute the isoelectric point, retention time, also you can do some basics functionalities such as digestions, parsing, etc. Are you interested to compute the isoelectric point?
ReplyDelete