Evaluating Test Sets

A function, postResample, can be used to obtain the same performance measures as were generated by train for regression or classification.
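
For example, with a numeric outcome, postResample computes performance statistics from vectors of predictions and observed values (the vectors below are made up for illustration; recent versions of caret also report the MAE):

# hypothetical predictions and observed values for a regression model
pred <- c(1.9, 3.1, 4.2, 4.9, 6.1)
obs  <- c(2.0, 3.0, 4.0, 5.0, 6.0)
postResample(pred = pred, obs = obs)  # RMSE, Rsquared (and MAE)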

caret also contains several functions that can be used to describe the performance of classification models. The functions sensitivity, specificity, posPredValue and negPredValue can be used to characterize performance where there are two classes. By default, the first level of the outcome factor is used to define the "positive" result (i.e. the event of interest), although this can be changed.
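
For example, the positive argument of sensitivity (and of confusionMatrix) overrides the default choice of event; a small sketch with made-up factors:

# made-up factors with levels M and R; "M" (the first level) is the
# default "positive" class, but this can be overridden:
pred <- factor(c("M", "R", "R", "M", "R"), levels = c("M", "R"))
obs  <- factor(c("M", "R", "M", "M", "R"), levels = c("M", "R"))
sensitivity(pred, obs)                  # uses "M" as the event
sensitivity(pred, obs, positive = "R")  # treat "R" as the event instead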

The function confusionMatrix can also be used to summarize the results of a classification model. This example uses objects from the webpage for Model Training and Tuning:

testPred <- predict(gbmFit3, testing)
postResample(testPred, testing$Class)
Accuracy    Kappa 
  0.8627   0.7252 
sensitivity(testPred, testing$Class)
[1] 0.8519
confusionMatrix(testPred, testing$Class)
Confusion Matrix and Statistics

          Reference
Prediction  M  R
         M 23  3
         R  4 21
               Accuracy : 0.863         
                 95% CI : (0.737, 0.943)
    No Information Rate : 0.529         
    P-Value [Acc > NIR] : 5.01e-07      
                  Kappa : 0.725         
 Mcnemar's Test P-Value : 1             
            Sensitivity : 0.852         
            Specificity : 0.875         
         Pos Pred Value : 0.885         
         Neg Pred Value : 0.840         
             Prevalence : 0.529         
         Detection Rate : 0.451         
   Detection Prevalence : 0.510         
      Balanced Accuracy : 0.863         
       'Positive' Class : M             

The "no--information rate" is the largest proportion of the observed classes (there were more actives than inactives in this test set). A hypothesis test is also computed to evaluate whether the overall accuracy rate is greater than the rate of the largest class. Also, the prevalence of the "positive event" is computed from the data (unless passed in as an argument), the detection rate (the rate of true events also predicted to be events) and the detection prevalence (the prevalence of predicted events).

Suppose a 2x2 table with notation:

           Reference
Predicted  Event  No Event
Event      A      B
No Event   C      D

The formulas used here are:

Sensitivity = A / (A + C)
Specificity = D / (B + D)
Prevalence = (A + C) / (A + B + C + D)
PPV = (Sensitivity * Prevalence) / ((Sensitivity * Prevalence) + ((1 - Specificity) * (1 - Prevalence)))
NPV = (Specificity * (1 - Prevalence)) / (((1 - Sensitivity) * Prevalence) + (Specificity * (1 - Prevalence)))
Detection Rate = A / (A + B + C + D)
Detection Prevalence = (A + B) / (A + B + C + D)
Balanced Accuracy = (Sensitivity + Specificity) / 2

When there are three or more classes, confusionMatrix will show the confusion matrix and a set of "one-versus-all" results. For example, in a three class problem, the sensitivity of the first class is calculated against all the samples in the second and third classes (and so on).
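
A small sketch with made-up three-class factors; the byClass element of the result holds the one-versus-all statistics:

# hypothetical three-class example
set.seed(1)
lev  <- c("A", "B", "C")
obs  <- factor(sample(lev, 100, replace = TRUE), levels = lev)
pred <- factor(sample(lev, 100, replace = TRUE), levels = lev)
confusionMatrix(pred, obs)$byClass  # one row of statistics per class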

A resampled estimate of the training set performance can also be obtained using confusionMatrix.train. For each resampling iteration, a confusion matrix is created from the hold-out samples and these values can be aggregated to diagnose issues with the model fit.

For example:

confusionMatrix(gbmFit3)

Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 

(entries are percentages of table totals)
          Reference
Prediction    M    R
         M 46.4 10.1
         R  7.1 36.4

These values are the percentages that hold-out samples landed in the confusion matrix during resampling. There are several methods for normalizing these values. See ?confusionMatrix.train for details.
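
For example, assuming the gbmFit3 object from above, passing norm = "none" should show the raw counts summed over the resamples instead of percentages (see ?confusionMatrix.train for the available options):

confusionMatrix(gbmFit3, norm = "none")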

Evaluating Class Probabilities

The package also contains two functions for evaluating class probability predictions for data sets with two classes: lift curves and calibration plots.

The lift function can be used to evaluate probability thresholds that capture a certain percentage of hits. The function requires a set of sample probability predictions (not from the training set) and the true class labels. For example, we can simulate two-class samples using the twoClassSim function and fit a set of models to the training set:

library(caret)

set.seed(1)  # hypothetical seed, for reproducibility
trainingSim <- twoClassSim(1000)
evalSim     <- twoClassSim(1000)
testingSim  <- twoClassSim(1000)
ctrl <- trainControl(method = "cv",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fdaModel <- train(Class ~ ., data = trainingSim,
                  method = "fda",
                  metric = "ROC",
                  tuneLength = 20,
                  trControl = ctrl)
ldaModel <- train(Class ~ ., data = trainingSim,
                  method = "lda",
                  metric = "ROC",
                  trControl = ctrl)
c5Model <- train(Class ~ ., data = trainingSim,
                 method = "C5.0",
                 metric = "ROC",
                 tuneLength = 10,
                 trControl = ctrl)
## A summary of the resampling results:
  TrainROC TrainSens TrainSpec method
1    0.964    0.9227    0.8596    fda
2   0.9044    0.8343    0.7913    lda
3   0.9496    0.8676    0.8531   C5.0

From these models, we can predict the evaluation set and save the probabilities of being the first class:

evalResults <- data.frame(Class = evalSim$Class)
evalResults$FDA <- predict(fdaModel, evalSim, type = "prob")[,"Class1"]
evalResults$LDA <- predict(ldaModel, evalSim, type = "prob")[,"Class1"]
evalResults$C5.0 <- predict(c5Model, evalSim, type = "prob")[,"Class1"]
head(evalResults)
   Class     FDA     LDA   C5.0
1 Class1 0.99073 0.88382 0.8446
2 Class1 0.99166 0.75724 0.8882
3 Class1 0.88943 0.88838 0.5732
4 Class2 0.03117 0.01405 0.1690
5 Class1 0.75646 0.93207 0.4824
6 Class2 0.16122 0.05242 0.3310

The lift function does the calculations and the corresponding plot method produces the lift curve (although some call this the gain curve). The values argument creates reference lines:

liftData <- lift(Class ~ FDA + LDA + C5.0, data = evalResults)
plot(liftData, values = 60, auto.key = list(columns = 3,
                                            lines = TRUE,
                                            points = FALSE))
[Figure: lift curves for the FDA, LDA and C5.0 models on the evaluation set]

From this we can see that, to find 60 percent of the hits, a little more than 30 percent of the data can be sampled (when ordered by the probability predictions). The LDA model does somewhat worse than the other two models.
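
caret also supplies a ggplot method for lift objects, so (assuming ggplot2 is installed) an equivalent plot should be produced by:

library(ggplot2)
ggplot(liftData, values = 60)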

The other function is for probability calibration; similar functions exist in the gbm package, the rms package, and others. These plots can be used to assess whether the value of the probability prediction is consistent with the event rate in the data. The format for the function is very similar to the lift function:

calData <- calibration(Class ~ FDA + LDA + C5.0,
                       data = evalResults,
                       cuts = 13)
plot(calData, type = "l", auto.key = list(columns = 3,
                                          lines = TRUE,
                                          points = FALSE))
[Figure: calibration plots for the FDA, LDA and C5.0 models]
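
As with lift, there is a ggplot method for calibration objects:

ggplot(calData)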

Processing Affy Arrays

For Affymetrix gene chip data, RMA processing (Irizarry, 2003) is a popular method of processing gene expression data. However, for predictive modeling, it has a drawback in that the processing is batch oriented; if an additional sample is collected, the RMA processing must be repeated using all the samples. This is mainly because of two steps in the processing: the quantile normalization process and the calculation of expression. Quantile normalization normalizes the data such that the between-chip distributions have equivalent quantiles and is very effective in improving the quality of the data. It is possible to let the samples in the training set define the reference distribution for normalization. For the expression calculation, a robust method such as a trimmed mean can be used to summarize the probe level data into a single summary metric per sample.
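
To make the reference-based normalization idea concrete, here is a minimal, illustrative sketch (not the affy implementation): each value in a new sample is replaced by the training-set reference value with the same rank.

# illustrative only: map one new sample onto a fixed reference distribution
# of the same length by matching ranks
normalizeToReference <- function(x, ref) {
  sort(ref)[rank(x, ties.method = "first")]
}
set.seed(2)                    # hypothetical data
ref  <- rexp(1000)             # reference distribution (e.g., a training batch)
newx <- rnorm(1000, mean = 5)  # a new sample on a different scale
normed <- normalizeToReference(newx, ref)
all.equal(sort(normed), sort(ref))  # TRUE: the distributions now match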

For example:

library(affy)   # ReadAffy, bg.correct, computeExprSet, etc.
library(caret)  # provides normalize2Reference and trimMean

# first, let affy/expresso know that the custom method exists
normalize.AffyBatch.methods <- c(normalize.AffyBatch.methods, "normalize2Reference")

# Batch 1 (training set): background correct, quantile normalize, summarize
RawData <- ReadAffy(celfile.path = FilePath)
Batch1Step1 <- bg.correct(RawData, "rma")
Batch1Step2 <- normalize.AffyBatch.quantiles(Batch1Step1)
referencePM <- pm(Batch1Step2)[, 1]  # reference distribution for later batches
Batch1Step3 <- computeExprSet(Batch1Step2, "pmonly", "trimMean")

# Batch 2 (new samples): normalize to the stored reference distribution
RawData2 <- ReadAffy(celfile.path = FilePath2)  # FilePath2: location of the new CEL files
Batch2Step1 <- bg.correct(RawData2, "rma")
Batch2Step2 <- normalize.AffyBatch.normalize2Reference(Batch2Step1, ref = referencePM)
Batch2Step3 <- computeExprSet(Batch2Step2, "pmonly", "trimMean")