Thursday, 17 October 2013

Introduction to Feature selection for bioinformaticians using R: correlation matrix filters, PCA & backward selection

Bioinformatics is becoming more and more of a data-mining field. Every passing day, genomics and proteomics yield bucketloads of multivariate data (genes, proteins, DNA, identified peptides, structures), and every one of these biological data units is described by a number of features: length, physicochemical properties, scores, etc. Careful consideration of which features to select when trying to reduce the dimensionality of a specific dataset is therefore critical if one wishes to analyze and understand their impact on a model, or to identify which attributes produce a specific biological effect.

For instance, consider a predictive model C1A1 + C2A2 + C3A3 + … + CnAn = S, where the Ci are constants, the Ai are features or attributes, and S is the predictor output (retention time, toxicity, score, etc.). It is essential to identify which of those features (A1, A2, …, An) are most relevant to the model and to understand how they correlate with S: working with such a subset lets the researcher discard a lot of irrelevant and redundant information.
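As a toy illustration (simulated data; the names A1, A2, A3 and S below are placeholders, not part of any real dataset), a linear fit in R makes the constants Ci explicit and already hints at which features matter:

set.seed(1)
A1 <- rnorm(100); A2 <- rnorm(100); A3 <- rnorm(100)
#simulate three candidate features; A3 will be irrelevant by construction

S <- 2*A1 - 1.5*A2 + rnorm(100, sd = 0.1)
#build the response from A1 and A2 only

fit <- lm(S ~ A1 + A2 + A3)
summary(fit)$coefficients
#the estimated coefficient (Ci) for A3 is close to zero, flagging it as a candidate for removal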


There are two main approaches to this selection process:
  • Filter approaches: you select the features first, then use this subset to execute classification or clustering algorithms, etc.;
  • Wrapper approaches: the feature search is wrapped around the learning algorithm itself, which is repeatedly fit on different candidate subsets (covered further below).

One of the simplest and most powerful filter approaches is the use of correlation matrix filters. In regression and data mining problems, variables may be highly correlated with one another, i.e. "redundant". For example, in cheminformatics, aromatic rings, bond counts and carbon atom counts can be very tightly correlated. If so, any one of these variables could be used as a proxy for all the others, and it is best to keep the feature that is most likely to be the direct cause of toxicity, absorption or a specific response distribution.

Correlation Matrix:

R Example: Removing features with pairwise correlation above 0.70

library(corrplot)
#corrplot: the library used to visualize the correlation matrix.

datMy <- read.table("data.csv", header = TRUE)
#read the data file using the read.table function.

datMy.scale <- scale(datMy[, 2:ncol(datMy)], center = TRUE, scale = TRUE)
#scale all the features (from column 2 because column 1 is the predictor output)

corMatMy <- cor(datMy.scale)
#compute the correlation matrix

corrplot(corMatMy, order = "hclust")
#visualize the matrix, clustering features by correlation index.


Resulting Output:

[Figure: correlation matrix plot for the 400 features, with many highly correlated variables clustered together.]

After inspecting the matrix, we set the correlation threshold at 0.70.

library(caret)
#the findCorrelation function used below comes from the caret package

highlyCor <- findCorrelation(corMatMy, 0.70)
#Apply the correlation filter at 0.70,
#then remove all the variables whose correlation with another variable is above 0.70.
datMyFiltered.scale <- datMy.scale[,-highlyCor]
corMatMy <- cor(datMyFiltered.scale)
corrplot(corMatMy, order = "hclust")

Resulting Output:

[Figure: correlation matrix after applying the filter.]

It is thus possible to filter out “redundant” features by examining the correlation matrix in detail. Remember that the closer the correlation between two variables is to 1, the more related their behavior and the more redundant one is with respect to the other.
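If you prefer to read the redundant pairs off as a table instead of from the plot, a small sketch like the following (run here on the unfiltered matrix; corMatFull and corPairs are names introduced only for this example) lists every pair above the threshold:

corMatFull <- cor(datMy.scale)
#recompute the correlation matrix before filtering

corPairs <- which(abs(corMatFull) > 0.70 & upper.tri(corMatFull), arr.ind = TRUE)
#row/column indices of the variable pairs whose absolute correlation exceeds 0.70

data.frame(var1 = rownames(corMatFull)[corPairs[, 1]],
           var2 = colnames(corMatFull)[corPairs[, 2]],
           r = round(corMatFull[corPairs], 2))
#table of redundant pairs with their correlation coefficients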

Using PCA

A more sophisticated alternative to the correlation matrix analysis is to perform a Principal Component Analysis (PCA). Feature extraction approaches transform data in a high-dimensional space into a space of fewer dimensions. Principal component analysis, the most important linear technique for reducing dimensionality, performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In other words, PCA builds a new set of features by selecting the axes that maximize data variance.

Principal Component Analysis (PCA) is a multivariate technique that summarizes systematic patterns of variation in the data. From a data analysis standpoint, PCA is used for studying one table of observations and variables with the idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation of the data. Therefore, PCA can be used to bring down a “complex” data set to a lower dimensionality, in order to reveal the structures or the dominant types of variations in both the observations and the variables.
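Before moving to the dedicated R package below, the same idea can be checked quickly with base R's prcomp (a minimal sketch on the datMy object read earlier, again skipping column 1, the predictor output):

pca.base <- prcomp(datMy[, 2:ncol(datMy)], center = TRUE, scale. = TRUE)
#center and standardize the features, then compute the principal components

summary(pca.base)$importance[, 1:5]
#"Proportion of Variance" shows how much each PC explains; "Cumulative Proportion" shows how much is retained if we stop at that PC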

PCA in R

In R, several functions from a number of different packages can be used to perform PCA. My suggestion is FactoMineR, whose typical PCA output consists of a set of eigenvalues, a table with the scores or principal components (PCs), and a table of loadings (the correlations between the variables and the PCs).

R Example: PCA function using FactoMineR for 400 features & 5 PCs

require(FactoMineR) 
# PCA with function PCA

datMy <- read.table("data.csv", header = TRUE)
#read the data file using the read.table function.

pca <- PCA(datMy[, 2:ncol(datMy)], scale.unit = TRUE, ncp = 5, graph = TRUE)
#scale all the features (column 1, the predictor output, is excluded as before); ncp: number of dimensions kept in the results (5 by default)

dimdesc(pca)
#dimdesc sorts the variables most strongly associated with each PC; it is very useful when you have many variables.
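The eigenvalues, scores and loadings mentioned above can be pulled directly from the returned object using the standard FactoMineR slots:

pca$eig
#eigenvalues and percentage of variance explained by each component

pca$ind$coord
#scores: coordinates of the observations on the principal components

pca$var$coord
#loadings: correlations between the original variables and the principal components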

Here you can find an excellent tutorial on FactoMineR and Principal Component Analysis in R.


Wrapper Approaches with Backward Selection

"Wrapper" approaches can be viewed as built-in functions to optimize the number of predictors in the optimization or regression problem.  Many feature selection routines use a "wrapper" approach to find appropriate variables such that an algorithm searching through feature space repeatedly fits the model with different predictor sets. The best predictor set is determined by some measure of performance (correlation R^2, root-mean-square deviation). An example of one search routine is backwards selection (a.k.a. recursive feature elimination). In many cases, using these models with built-in feature selection will be more efficient than algorithms where the search routine for the right predictors is external to the model. Built-in feature selection typically couples the predictor search algorithm with parameter estimation, and is usually optimized with a single objective function (e.g. error rates or likelihood).

Using Built-in Backward Selection

The algorithm first fits the model to all predictors, and each predictor is ranked according to its relevance to the model. At each iteration of feature selection, the Si top-ranked predictors are retained (where Si is the subset size being evaluated), the model is refit and performance is re-assessed. Built-in backward selection is thus used for at least three purposes: predictor selection, model fitting and performance evaluation. Unless the number of samples is large, especially relative to the number of variables, one static training set may not be able to fulfill all these needs.
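To make the loop concrete before switching to caret, here is a minimal hand-rolled sketch of backward elimination with a linear model, ranking predictors by the absolute value of their t-statistic (simulated data; this only illustrates the idea and is not the caret implementation used below):

set.seed(2)
X <- data.frame(matrix(rnorm(100 * 6), ncol = 6))
#six candidate predictors (columns X1 to X6)

y <- 3*X$X1 - 2*X$X2 + rnorm(100)
#only X1 and X2 actually drive the response

vars <- names(X)
while (length(vars) > 2) {
  fit <- lm(y ~ ., data = X[, vars, drop = FALSE])
  tval <- abs(summary(fit)$coefficients[-1, "t value"])
  #rank the remaining predictors by |t|
  vars <- setdiff(vars, names(which.min(tval)))
  #drop the weakest predictor and refit on the next iteration
}
vars
#predictors retained after the backward passes (X1 and X2 in this simulation)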

The "crantastic" package caret contains functions for training and plotting classification and regression models. In this case, the rfe function is used to obtain the potential selection. It has several arguments:

  • x, a matrix or data frame of predictor variables
  • y, a vector (numeric or factor) of outcomes
  • sizes, an integer vector for the specific subset sizes that should be tested (which must not include ncol(x))
  • rfeControl, a list of options that can be used to specify the model and the methods for prediction, ranking etc.
For a specific model, a set of functions must be specified in rfeControl$functions. There are a number of pre-defined sets of functions for several models, including: linear regression (in the object lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs) and functions that can be used with caret's train function (caretFuncs).
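For instance, switching the worked example below from caretFuncs to random forests would only require a different control object (a brief sketch; rfFuncs, "cv" and number are standard rfeControl options, while trainDescr and trainClass are created in the example that follows):

ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
#random-forest ranking/fitting functions with 10-fold cross-validation

#rfProfile <- rfe(x = trainDescr, y = trainClass, sizes = c(1:5), rfeControl = ctrl)
#same call pattern as the SVM example below, just with a different set of functions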

R example: Selecting features using backward selection and the caret package

library(caret);
#load caret library

data_features<-as.matrix(read.table("data-features.csv",sep="\t", header=TRUE));
#load data features

data_class<-as.matrix(read.table('data.csv', header=TRUE));
#load data classes

data_features<- scale(data_features, center=TRUE, scale=TRUE);
#scale data features

inTrain <- createDataPartition(data_class, p = 3/4, list = FALSE); 
#Divide the dataset into training and test sets

#Create the Training Dataset for Descriptors 
trainDescr <- data_features[inTrain,];

# Create the Testing dataset for Descriptors
testDescr <- data_features[-inTrain,];

trainClass <- data_class[inTrain];
testClass <- data_class[-inTrain];


descrCorr <- cor(trainDescr);
highCorr <- findCorrelation(descrCorr, 0.70);
trainDescr <- trainDescr[, -highCorr];
testDescr <- testDescr[, -highCorr];
# Here we include a correlation matrix analysis to remove the redundant features before the backward selection

svmProfile <- rfe(x = trainDescr, y = trainClass, sizes = c(1:5),
                  rfeControl = rfeControl(functions = caretFuncs, number = 2),
                  method = "svmRadial", fit = FALSE);
#caret function: rfe performs the backward selection, sizes gives the candidate feature-set sizes,
#and method = "svmRadial" (passed on to train) makes the underlying model a support vector machine with a radial kernel.
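
Once rfe has finished, the resulting object can be inspected with the usual caret helpers:

svmProfile
#summary of the resampling performance for each subset size

predictors(svmProfile)
#names of the features in the selected subset

plot(svmProfile, type = c("g", "o"))
#performance profile across the tested subset sizes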


Finally, if you want to use this approach in practice, you should take a look at (and cite) this manuscript:
   "Perez-Riverol Y, Audain E, Millan A, Ramos Y, Sanchez A, Vizcaino JA, et al.
    Isoelectric point optimization using peptide descriptors and support vector
    machines. J Proteomics. 2012;75:2269–74."

I would also like to recommend an excellent review on feature selection in bioinformatics.
