1 Cross validation for logistic regression

Cross validation is an alternative to the training/testing split. For k-fold cross validation, the dataset is divided into k parts. In each iteration, one part serves as the test set and the remaining k-1 parts serve as the training set. The out-of-sample performance measures from the k iterations are averaged.
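
As a rough illustration, here is a minimal sketch of the procedure done by hand, assuming a hypothetical data frame dat with a 0/1 response y (the cv.glm function used below automates all of this):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # randomly assign each row to a fold
cv_cost <- numeric(k)
for (i in 1:k) {
  fit  <- glm(y ~ ., family = binomial, data = dat[folds != i, ])  # train on the other k-1 folds
  pred <- predict(fit, newdata = dat[folds == i, ], type = "response")
  cv_cost[i] <- mean(dat$y[folds == i] != (pred > 0.5))  # out-of-sample misclassification rate
}
mean(cv_cost)  # average the k out-of-sample measures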

Note

  1. We use the entire dataset for cross validation

  2. We need to use glm instead of lm to fit the model (if we want to use the cv.glm function in the boot package); see the sketch after this list

  3. The default measure of performance is the Mean Squared Error (MSE). If we want to use another measure, we need to define a cost function.
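
For instance, a minimal sketch of points 2 and 3, again assuming a hypothetical data frame dat with a 0/1 response y: fit the model with glm, and call cv.glm without a cost argument to get the default average squared error:

library(boot)
fit <- glm(y ~ ., family = binomial, data = dat)  # glm, not lm, so cv.glm can use it
cv_default <- cv.glm(data = dat, glmfit = fit, K = 10)
cv_default$delta  # raw and adjusted 10-fold CV estimates of the mean squared error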

1.1 Cross validation for logistic regression

Refer to lecture slides and Elements of Statistical Learning book (section 7.10) for more advice on cross validation.

pcut <- 0.5
#Symmetric cost: plain misclassification rate
cost1 <- function(r, pi, pcut){
  # r: observed 0/1 response; pi: predicted probability of 1
  mean(((r==0)&(pi>pcut)) | ((r==1)&(pi<pcut)))
}

#Asymmetric cost
cost2 <- function(r, pi, pcut){
  weight1 <- 2
  weight0 <- 1
  c1 <- (r==1)&(pi<pcut) #logical vector - true if actual 1 but predict 0
  c0 <- (r==0)&(pi>pcut) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}
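
As a quick illustration with toy vectors (not from the credit data), both cost functions can be called directly on observed labels and predicted probabilities:

r  <- c(1, 0, 1, 1, 0)            # observed labels
pi <- c(0.9, 0.6, 0.2, 0.7, 0.1)  # predicted probabilities
cost1(r, pi, pcut)  # 0.4 -- two of five observations misclassified
cost2(r, pi, pcut)  # 0.6 -- the false negative (pi = 0.2) is weighted twice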

We can use the same cost function as defined before, but you need to modify it so that it takes only two inputs: the observed \(Y\) and the predicted probability. Then cv.glm can recognize it for cross-validation with an asymmetric cost.

costfunc <- function(obs, pred.p){
    weight1 <- 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 <- 1   # define the weight for "true=0 but pred=1" (FP)
    pcut <- 1/(1+weight1/weight0)     # optimal cutoff implied by the weights
    c1 <- (obs==1)&(pred.p < pcut)    # count for "true=1 but pred=0" (FN)
    c0 <- (obs==0)&(pred.p >= pcut)   # count for "true=0 but pred=1" (FP)
    cost <- mean(weight1*c1 + weight0*c0)  # misclassification rate with weights
    return(cost)  # explicitly return the value of the cost
}

Below is 10-fold cross validation; note that you should use the full dataset. In cv.glm, the default cost function is the average squared error.

credit_data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header = TRUE)
library(dplyr)
credit_data <- rename(credit_data, default = default.payment.next.month)
credit_data$SEX <- as.factor(credit_data$SEX)
credit_data$EDUCATION <- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE <- as.factor(credit_data$MARRIAGE)

library(boot)
credit_glm1 <- glm(default ~ ., family = binomial, data = credit_data)
cv_result <- cv.glm(data = credit_data, glmfit = credit_glm1, cost = costfunc, K = 10)
cv_result$delta[2]
## [1] 0.6778167

The first component of delta is the raw cross-validation estimate of prediction error. The second component is the adjusted cross-validation estimate. The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
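
Both components are returned together, so they can be compared directly:

cv_result$delta     # c(raw estimate, adjusted estimate) of the chosen cost
cv_result$delta[1]  # the raw 10-fold CV estimate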

Keep in mind that the CV score is the averaged model error; here, it is the cost you defined above. You may also use the F-score for cross-validation, but again, you need to define an F-score function. (Exercise!)
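
One possible sketch of such a function (an illustration, not the only way to do the exercise; it uses a hypothetical cutoff of 0.5 and returns 1 minus the F-score, so that a smaller value is better, since cv.glm treats the output as a cost):

costF <- function(obs, pred.p){
    pcut <- 0.5                       # hypothetical cutoff; adjust to the problem
    pred <- as.numeric(pred.p >= pcut)
    TP <- sum(obs == 1 & pred == 1)   # true positives
    FP <- sum(obs == 0 & pred == 1)   # false positives
    FN <- sum(obs == 1 & pred == 0)   # false negatives
    F1 <- 2*TP / (2*TP + FP + FN)     # F-score; assumes at least one positive label or prediction
    return(1 - F1)                    # smaller cost = higher F-score
}
# cv.glm(data = credit_data, glmfit = credit_glm1, cost = costF, K = 10)$delta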
