By Amit K. Yadav (@theoneamit) & Yasset Perez-Riverol (@ypriverol):
Perl is a legacy language thought to be abstruse by many modern programmers. I’m passionate with the idea of not letting die a programming language such as Perl. Even when the language is used less in Computational Proteomics, it is still widely used in Bioinformatics. I’m enthusiastic writing about new open-source libraries in Perl that can be easily used. Two years ago, I wrote a post about InSilicoSpectro and how it can be used to study protein databases like I did in “In silico analysis of accurate proteomics, complemented by selective isolation of peptides”.
Perl is a legacy language thought to be abstruse by many modern programmers. I’m passionate with the idea of not letting die a programming language such as Perl. Even when the language is used less in Computational Proteomics, it is still widely used in Bioinformatics. I’m enthusiastic writing about new open-source libraries in Perl that can be easily used. Two years ago, I wrote a post about InSilicoSpectro and how it can be used to study protein databases like I did in “In silico analysis of accurate proteomics, complemented by selective isolation of peptides”.
Today’s post is about ProteoStats [1], a Perl library for False Discovery Rate (FDR) related calculations in proteomics studies. Some background for non-experts:
One of the central and most widely used approach for shotgun proteomics is the use of database search tools to assign spectra to peptides (called as Peptide Spectrum Matches or PSMs). To evaluate the quality of the assignments, these programs need to calculate/correct for population wise error rates to keep the number of false positives under control. In that sense, the best strategy to control the false positives is the target-decoy approach. Originally proposed by Elias & Gygi in 2007, the so-called classical FDR strategy or formula proposed involved a concatenated target-decoy (TD) database search for FDR estimation. This calculation is either done by the search engine or using scripts (in-house, non-published, not benchmarked, different implementations).
So far, the only library developed to compute FDR at spectra level, peptide level and protein level FDRs is MAYU [2]. But, while MAYU only uses the classical FDR approach, ProteoStats provides options for 5 different strategies for calculating the FDR. The only prerequisite being that you need to search using a separate TD database as proposed by Kall et al (2008) [3]. Also, ProteoStats provides a programming interface that can read the native output from most widely used search tools and provide FDR related statistics. In case of tools not supported, pepXML, which has become a de facto standard output format, can be directly read along with tabular text based formats like TSV and CSV (or any other well-defined separator).
Here, some concepts will be explained using the functions provided by the library. Let’s start:
A generic project starts from a database search, which can be done either with a (1) concatenated TD database (single search) or (2) separate T-D databases (two separate searches). Researchers then apply either Elias & Gygi formula (Concatenated) or Kall’s formula (Separate).
FDR estimation formulae/methods in ProteoStats
The ProteoStats library supports the following formulations on FDR estimation:(1) Concatenated target decoy search based FDR [4], Concatenated FDR (FDRC)
This method was proposed by Elias and Gygi [4], although the basic concept of reverse database searching was based on Peng et al. [5] The database consists of combined set of proteins from target and their reversed sequences as decoy. In general, the ratio of false positives to true positives is estimated to be the FDR at the given threshold. The concatenated search is the most popular method of FDR estimation. The assumption is that for every decoy passing the threshold, there must be a corresponding false hit in target. Thus, the false positives are estimated by doubling the decoy count. The formula is-
(2) Simple/Separate target-decoy based FDR by Kall et. al [3], Separate/Simple FDR (FDRS)
This method was proposed by Kall et al as an alternative to combined search strategy. Kall et al suggested that combining the target and decoy databases overestimates the FDR and the decoy distribution no longer matches the target incorrect population. In this method, target and decoy searches are conducted separately. Each spectrum has one best target PSM and one best decoy PSM. Since the two searches are independent, it is assumed that the number of false positives amongst the targets is same as the number of decoys passing the threshold. It is also referred to as the simple FDR method by the authors. False positives are estimated by the count of decoys above the threshold.
Simple FDR described above assumes the size of true and false hits to be same or similar which is not an entirely correct assumption. Though all decoys contribute fully to the null distribution, all targets are not correct. Target distribution is a bimodal distribution representing the true and false target hits. This causes a non-balanced ratio of the true and false hits leading to overestimation of FDR. If a correction factor, Percentage of Incorrect Targets (PIT), is introduced in the equation 2 above, the FDR estimation is more accurate and increases sensitivity by allowing more correct hits at the same FDR. PIT is traditionally known as π0 or true negatives. This formulation depends on the percentage of incorrect target hits (PIT) that contribute to the negative (false) distribution as a weight in the above method to prevent overestimation of FDR and thus enhances the number of identifications. It should be noted that the name PIT is misleading since it is not a percentage but a fraction.
(4) Refined formulation on Kall’s method [6], Refined Separate FDR (FDRRS)
In this formulation, the authors propose for calculating the FDR in the correct reference population by not doubling the estimated false targets by observing decoy hits above threshold directly. All decoys above threshold should not be taken as false positives unless they score more than target PSMs for the given spectrum. This simplification leads to inflation of decoy population and thus overestimation of FDR. The hits could be above threshold only in target (target only) or only in decoy (decoy only). When both are above threshold, either target could be better (target better) or the decoy (decoy better). Only decoy better are considered to be false positives and their count is doubled as suggested in Elias and Gygi method. The FDR in the correct reference population is then calculated by estimating the false positive PSM by subtracting from this population (decoy better + target better + target only), and dividing the result by the same number. The formula thus becomes-
(5) Refined method on Elias et al method [7], FDRRC
In this formulation, the authors argue that since decoy hits are obviously wrong, they can be disregarded in FDR estimation and thus the FDRRC formula is changed to yield the following formula-
ProteoStats supports these two methods and some of their alternate formulae described above, which provide better results. The newer formulae seem to work better but not much in practice due to code availability. It also follows that proteomics community has not settled for a single formula and these differ in basic concept. A primary testing of FDR method suited to a lab protocol is thus recommended. The five methods are explained in the paper. The details for installation, dependencies and developer documentation can be found in Documentation downloaded with the code.
Quick Guide, Examples, Data & Results files
Example Scripts Many example scripts are provided in the tests folder.Data:
The files are provided in the DATA folder. Every file name is self-explanatory with a tag for target or decoy and algorithm name. For every algorithm, file names are like this-
QTOF_target_masswiz.csv
QTOF_decoy_masswiz.csv
Results:
The example result files are provided in the OUTPUT folder.
File Parsers
Some inbuilt functions can be used to parse XML or native search engine files into tabular text (CSV/TSV) formats. For example X!Tandem XML, pepXML, Mascot dat etc.
FDR estimation
The user has two files per search after database search - one Target and one corresponding
Decoy file. Choose the script corresponding to the search algorithm used for database searching.
use
MODULES::PSMFileIO; #file reading & parsing
use
MODULES::PepXMLParser; #parsing
pepXML files
#parse
tandem XML file
my
$tandemCSV = ReadTandemXML($tar);
# Tandem
target/decoy XML file
#parse
Sequest XLS file from Thermo ProteomeDiscoverer
my
$SequestCSV = ReadSequestXLS($tar);
# Tandem
target/decoy XML file
#parse
pepXML file to TSV XLS file
my $pepXML2tsv = ReadPepXML($tar); # pepXML
to TSV file
#parse
Mascot dat file, convert to pepXML and parse the PepXML file
system("Mascot2XML $tar -notgz
-nodta -D$tardb"); # convert to pepXML using
Mascot2XML from TPP
my $pepXML2tsv = ReadPepXML($tarpepXML);
# pepXML
to TSV file
Test_MassWiz.pl
Test_Tandem.pl
Test_OMSSA.pl
Test_Mascot.pl
Test_Sequest.pl
Test_pepXML.pl
Test_AnyText.pl
Edit any one of the script to enter the file names in @Tars, @Decs and @Outs arrays. Run the following command perl Test_MassWiz.pl and press enter to calculate FDR for input files. These scripts have a complete pipeline.
Use Modules Examples
Initially, the program requires the use pragmas for calling specific modules from library. If you are writing your code in tests folder, you need to specify the base directory (ProteoStats, so that it can find MODULES folder) by use lib ‘../’ pragma.
use
strict;
use
warnings;
use
lib '../'; #to define base directory
use
MODULES::FDRCalculator; #main module that
controls calling others
use
MODULES::PSMFileIO; # Handles input file
reading and Output writing
use
MODULES::PepXMLParser; # Handles pepXML
parsing
use
MODULES::SeparateFDR; #
Calculates FDRS
use
MODULES::ConcatenatedFDR; # Calculates
FDRC
use
MODULES::FDRPIT; Calculates FDRPIT
use
MODULES::RefinedConcatenatedFDR; Calculates
FDRRC
use
MODULES::RefinedSeparateFDR; Calculates
FDRRS
use
MODULES::ChartMaker; # Creates ScatterPlot, Histogram,
ROC curves
use MODULES:: ComparisonVenn; # Creates Venn, Compares PSMs/PeptidesApart from the FDR and q-value calculations, ProteoStats can also be used for following tasks-
(1) Generating ROC curves
ROC curves can be plotted by the function ROCfromFDRFile.
use MODULES::ChartMaker; #
#define the parameters/arguments
my @FDRfiles = (‘FDR1_mascot.csv’,
‘FDR2_masswiz.csv’,
‘FDR3_omssa.csv’); #input files
my @seriesname = (‘Mascot’,‘MassWiz,‘OMSSA’); #corresponding seriesname
my $out = ‘MyROC.xlsx’ ; # Excel Output
my $FDRthr = 0.05; # FDR cutoff to show in ROC
my $qvalcol = -1; #column number for q-value. -1 means last column
#Call the ROC function with the arguments
$out=ROCfromFDRFile(\@FDRfiles,\@seriesname,$out,$FDRThr,$qvalcol);
exit;
(2) Generating Scatter Plots
Scatterplots can be generated using the function ScatterPlot_FDRFile.
use
MODULES::ChartMaker; #
#define
the parameters/arguments
my
@FDRfiles = (‘FDR1_masswiz.csv’,
‘FDR2_masswiz.csv’,
‘FDR3_masswiz.csv’);
#input
files
my
$xcol = 5;
my
$xcolname = ‘Mass’;
my
$ycol = 7; #last column
my
$ycolname = ‘Score’;
my
$out = ‘MyScatter.xlsx’ ; # Excel
Output
#Call
the ScatterPlot function with the arguments
$out= ScatterPlot_FDRFile(\@FDRfiles,$xcol,$xcolname,$ycol,$ycolname,$out);
exit;
(3) Generating Histograms
Histograms can be generated using the function HistogramsfromFile function.
use
MODULES::ChartMaker;
#define
the input/arguments
my @files = (‘FDR_MassWiz1.csv’,
‘FDR_ MassWiz2.csv’, ‘FDR_ MassWiz3.csv’); #FDR Files
my
$ use_linear_axes =
1;
my
$use_integral_bins = 0;
HistogramsfromFile(@files, @columns, $bins)
(4) Comparing FDR outputs and Creating Venn Diagrams
This is a handy utility that can compare two or three FDR files. The FDR files need to contain only the filtered PSMs , else it will compare everything without any cutoff of score/p-value/e-value.
Start by defining the input files and Venn legends. Define 2 files for two set comparison and 3 for a three way comparison. Also define the column numbers for the FDR file for scans, peptide and protein columns.
use MODULES::ComparisonVenn;
#define
the input/arguments
my $FDR1 = ‘FDR_MassWiz1.csv’;
#FDR File 1
my $FDR2 = ‘FDR_ MassWiz2.csv’;
#FDR File 2
my $FDR3 = ‘FDR_ MassWiz3.csv’;
#FDR File 3
my
$out = ‘Comparison_MassWiz_3_replicates.csv’; # OUTPUT
File
#
Define the labels for Venn
my $legend1 = ‘MassWiz_rep1’;
my $legend2 = ‘MassWiz_rep2’;
my $legend3 = ‘MassWiz_rep3’;
#Define
column numbers for ScanID, peptide and Proteins
my $scancol = 0;
#scan column
my $pepcol = 3;
#peptide column
my $protcol = 7;
#protein column
Read the files and get the data structures as hash references for easy comparison. Dereference the hashes.
#Get
Hashrefs for Spectra and Peptides for comparison
my($SpecRef1,$PepRef1)
=
ReadSpecPep($FDR1,$scancol,$pepcol,$protcol);
my($SpecRef2,$PepRef2)
=
ReadSpecPep($FDR2,$scancol,$pepcol,$protcol);
my($SpecRef3,$PepRef3) = ReadSpecPep($FDR3,$scancol,$pepcol,$protcol);
#Dereference
the hashrefs
my %SpecFDR1 = %$SpecRef1;
my %SpecFDR2 = %$SpecRef2;
my %SpecFDR3 = %$SpecRef3;
my %PepFDR1 = %$PepRef1;
my %PepFDR2 = %$PepRef2;
my %PepFDR3 = %$PepRef3;
Create Venn diagrams for 2 or 3 sets as per requirement. Define a title for chart and call the function CreateVenn2 or CreateVenn3 as shown in example below.
##Create
Venn Charts for 3 or 2 sets passing spectra comparison
#
Venn 3 sets ScanIDs
my $title1 = 'Spectra
Comparison'; #Chart title
my $VennSpec3 = CreateVenn3 ([keys%SpecFDR1], [keys%SpecFDR2], [keys%SpecFDR3],
$legend1, $legend2, $legend3, $title1, "$out.spectra.png" ) ;
print
"Spectra venn created in file $VennSpec3\n";
#or
try Venn 2 sets ScanIDs
my $VennSpec2=CreateVenn2
( [ keys%SpecFDR1 ] , [ keys%SpecFDR2 ] ,
$legend1,$legend2,$title1,"$out.spectra.png") ;
print
"Spectra venn created in file $VennSpec2\n";
##Create
Venn Charts for 3 or 2 sets for peptide comparison
#
Venn 3 sets peptides
my
$title2 =
'Peptides Comparison';
my $VennPep3 = CreateVenn3
([ keys%PepFDR1], [keys%PepFDR2], [keys%PepFDR3],
$legend1, $legend2, $legend3, $title2, "$out.peptide.png" ) ;
print
"Peptide venn created in file $VennPep3\n";
Finally, the comparisons can also be made and results written to a CSV file. The function CompareFDR3 or CompareFDR2 can be used with the hash references. The CSV file $out (defined in the input at the beginning of this code) contains the output of the comparisons.
Reference List
[1] Yadav, A. K.; Kadimi, P. K.; Kumar, D.; Dash, D. ProteoStats--a library for estimating false discovery rates in proteomics pipelines. Bioinformatics 2013, 29 (21), 2799-2800.
[2] Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell Proteomics 2009, 8 (11), 2405-2417.
[3] Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome. Res. 2008, 7 (1), 29-34.
[4] Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207-214.
[5] Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome. Res. 2003, 2 (1), 43-50.
[6] Navarro, P.; Vazquez, J. A refined method to calculate false discovery rates for peptide identification using decoy databases. J. Proteome. Res. 2009, 8 (4), 1792-1796.
[7] Cerqueira, F. R.; Graber, A.; Schwikowski, B.; Baumgartner, C. MUDE: a new approach for optimizing sensitivity in the target-decoy search strategy for large-scale peptide/protein identification. J. Proteome Res. 2010, 9 (5), 2265-2277.
No comments:
Post a Comment