By +Yasset Perez-Riverol and +Enrique Audain :
Isoelectric point (pI) can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge is equal to zero. Currently, there are available different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Peptide/Protein fractionation according to their pI is widely used in current proteomics sample preparation procedures previous to the LC-MS/MS analysis. The experimental pI records generated by pI-based fractionation procedures are a valuable information to validate the confidence of the identifications, to remove false positive and and could be used to re-compute peptide/protein posterior error probabilities in MS-based proteomics experiments.
Isoelectric point (pI) can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge is equal to zero. Currently, there are available different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Peptide/Protein fractionation according to their pI is widely used in current proteomics sample preparation procedures previous to the LC-MS/MS analysis. The experimental pI records generated by pI-based fractionation procedures are a valuable information to validate the confidence of the identifications, to remove false positive and and could be used to re-compute peptide/protein posterior error probabilities in MS-based proteomics experiments.
Theses approaches require an accurate theoretical prediction of pI. Even thought
there are several tools/methods to predict the isoelectric point, it remains
hard to define beforehand what methods perform well on a specific dataset.
We believe that the best way to compute the isoelectric point (pI) is to have a complete package with most of the algorithms and methods in the state of the art that can do the job for you [2]. We recently developed an R package (pIR) to
compute isoelectric point using long-standing and novels pI methods that can be
grouped in three main categories : a) iterative, b) Bjellvist-based methods and c) machine learning methods. In addition, pIR also offers a statistical
and graphical framework to evaluate the performance of each method and its
capability to “detect” outliers (those peptides/protein with theoretical pI
biased from experimental value) in a high-throughput environment.
First lets install the package:
First, we need to install
devtools
:install.packages("devtools")
library(devtools)
Then we just call:
install_github("ypriverol/pIR")
library(pIR)
How to compute peptide/protein isoelectric
point using the SVM method using pIR?
Currently, pIR contains several support vector machines (SVM-based) pre-trained
models that can be used to predict the isoelectric point from any amino acid
sequence [1]. This models has been trained using different dataset at different pH
range. Thus, the accuracy in the
prediction will be strongly depend of the chosen method. To choose a specific model, the option
available are the following:
- default (3.0-10.0 pH, ~7000 features) (1)
- heller (4.0-6.0 pH, ~6000 features) (2)
- branca (3.15-9.15 pH, ~69000 features) (3)
Additional details about the dataset used to
build each svm-model can be seen in the references.
To compute the pI of a single sequence, use the
command bellow:
----------------------------------------------------------------------------------------
> pI <-
pISVMpeptide(sequence = "AGAAPYVQAFDSLLAGPVAE",
model="default")
> #the result
will be...
> pI
[1] 4.015568[1]
----------------------------------------------------------------------------------------
> pI <-
pISVMpeptide (sequence = "AGAAPYVQAFDSLLAGPVAE",
model="heller")
> #the result
will be...
[1] 4.178931
----------------------------------------------------------------------------------------
> # using the SVM model branca
> pI <- pISVMpeptide (sequence =
"AGAAPYVQAFDSLLAGPVAE", model="branca")
> #The result will be...
> pI
[1] 4.049737
----------------------------------------------------------------------------------------
Most of the tools allow to compute the isoelectric point (example above) to individual sequences. However, to predict the isoelectric point of a list of peptides/proteins (for example, a short list of sequences contained in a file) the following option could be used:
----------------------------------------------------------------------------------------
#reading sequences from any file
>peptides <- read.table (file=pepFile, header =
TRUE, sep = "\t")
#rename column name
>colnames (peptides) <-c("sequence")
#computing pI using SVM method with default option.
> peptides <- mdply (peptides, function
(sequence) {pISVMpeptide(sequence = sequence, model="default")})
>colnames (peptides) <-c ("sequence",
“pIsvm”)
----------------------------------------------------------------------------------------
Even though pIR supports the isoelectric point prediction using different pre-builded SVM models, it is possible to train a new model from a more complex dataset (for example, up to 100000 sequences). It must contains the experimental pI values associated to each sequence. This could be a time-consuming process depending of the size of dataset. The results in the prediction will be strongly related with the quality of the dataset supplied. Bellow, the steps to perform this analysis:
----------------------------------------------------------------------------------------
#reading sequences from any file
peptides <-
read.table(file="data/svmDataDefault.csv", header = TRUE, sep =
",")
#rename column name
colnames(peptides) <-c("sequence",
"pIExp")
#building dataset with features
data <- svmPIBuildData(originalData = peptides)
#Optionality, to compute pI using SVM method with
default option.
data <- pISVMsequences(df=data, model =
"default", newModel = FALSE)
#Build a new model from the current dataset by choose
newModel flag = TRUE, and predict new values.
data <- pISVMsequences(df=data, newModel = TRUE)
----------------------------------------------------------------------------------------
An Iterative algorithm to predict the isoelectric point value works following three major steps. Firstly, it count the numbers of copies of the amino acids which play a role in determining pI and to propose pH values. Second, it determines the expected charge on the polypeptide for a particular pH value. And finally, it determines the expected proportion of charged and uncharged side chains for a particular amino acid, which are assessed from the supplied pK values (ref). There are several pK sets defined that could be used according to experimental setting evaluated. PIR supports the isoelectric point prediction based on Iterative method by choose the current pK set available from literature. Bellow some examples:
-----------library(pIR)
seq
<- "GLPRKILCAIAKKKGKCKGPLKLVCKC"
pI
<- pIIterative(sequence = seq, pkSetMethod = "solomon")
print(pI)
#The
result will be 10.526
#Computing
all pI values using Iterative method.
>
sequence <- "AADCEVEQWDSDEPIPAK"
>
pIvalues <- computeAllIterativeValues(seq = sequence)
#The
result will be...
>
pIvalues
method values
1 solomon 3.4161
2 rodwell 3.3749
3 emboss 3.5322
4 lehninger 3.3711
5 grimsley 3.3012
6 patrickios 3.4220
7 DtaSelect 3.7848
8 toseland 3.3571
9 thurlkill 3.4784
10 nozaki_tanford
3.6445
-------------------------------
Scaling-up the isoelectric point prediction
The first idea with pIR was to benchmark the state-of-art isoelectric point prediction algorithms and compare its performance. Let’s take the Branca’s dataset like example to perform a comparative analysis. Since this dataset was collected using a high resolution isoelectrofocusing technology, it puts a high quality data out to do this kind of analysis.
First, use to computePIValues function to predict all isoelectric point variable. This function take a data.frame in the way of: sequence pIExp, and return a new data.frame with all pI values computed using all methods available in pIR. This operation could be time-consuming depending of the size of the dataset.
#load dataset
#compute pI values
#view output
sequence experimental
calibrated expasy skoog solomon rodwell emboss lehninger grimsley patrickios
DtaSelect toseland [...truncated]
AADCEVEQWDSDEPIPAK
3.615 3.712 3.712 3.359
3.416 3.375 3.532
3.371 3.301 3.422
3.785 3.357
AADPPAENSSAPEAEQGGAE
3.615 3.703 3.446 2.776
2.896 3.118 3.257
2.844 3.120 3.501
3.314 3.144
AADSICDGDLVDSQIR 3.615 3.770 3.770 3.414
3.467 3.383 3.662
3.426 3.283 3.598
3.987 3.338
AADTDGDGQVNYEEFVR
3.615 3.769 3.769 3.418
3.474 3.421 3.606
3.430 3.343 3.501
3.872 3.397
AAEEAFVNDIDESSPGTEWER 3.615
3.714 3.714 3.364 3.423
3.409 3.495 3.376
3.358 3.723 3.714
3.413
ADAEDLLDSFLSNILQDCR
3.615 3.710 3.710 3.353
3.409 3.340 3.575
3.366 3.246 3.501
3.872 3.303
ADAEGESDLENSR 3.615 3.834
3.834 3.492 3.548 3.515
3.638 3.502 3.457
3.899 3.872 3.510
ADDPSSYMEVVQAANTSGNWEELVK
3.615 3.834 3.834 3.492
3.548 3.515 3.638
3.502 3.457 3.501
3.872 3.510
ADEEPTPADGR 3.615 3.917
3.917 3.578 3.632 3.585
3.738 3.588 3.523
4.024 3.987 3.574
Visualizing results with pIR
Correlation
> rawCorr <- plotRawCorrelation(branca)
> png("peptideRawCorrelation.png", width = 800, height = 800)
> multiplot(plotlist = rawCorr, cols=3)
> dev.off()
Figure 1. Correlation between theoretical and experimental isoelectric points using 14 different pI method on Branca dataset. |
Histogram
> histogram <- plotHistFunc(branca)
> png("Distributions.png", width = 800, height = 800)
> multiplot(plotlist = histogram, cols=3)
> dev.off()
Figure 2. Distribution of isoelectric point for different methods and the experimental distribution. The y-axis is the number of peptides for the corresponding isoelectric point value (x-axis). |
Plotting outliers dispersion
> outliers_dis <- plotOutlierOverall(branca)
> png("OutlierDispersion.png", width = 800, height = 800)
> multiplot(plotlist = outliers_dis, cols=3)
> dev.off()
Outliers distribution
> outliers <- plotOutlierDistribution(branca)
> png("OutlierDistribution.png", width = 800, height = 800)
> multiplot(plotlist = outliers, cols=3)
> dev.off()
Figure 4. Distribution of outliers and non-outliers populations of the Branca dataset. An outliers is defined if: Abs (pIexperimental – pItheoretical) ≥ 2*RMSE. |
Visualizing fractions
> fractions <- plotFractionCorrelation(branca)
> png("FractionCorrelation.png", width = 800, height = 800)
> multiplot(plotlist = fractions, cols=3)
> dev.off()
Figure 5. Experimental vs. theoretical isoelectric point for different peptide fractions of an in-gel electrophoresis experiment (Branca dataset). |
References
3. Branca, R.M., et al. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nature methods 2014;11(1):59-62.
No comments:
Post a Comment