1 Objective

The objective of this case is to help you understand logistic regression (binary classification) and some important ideas such as cross-validation, the ROC curve, and the cut-off probability.

2 Credit Card Default Data

We will use the Credit Card Default Data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could cause people to fail to pay their credit card balance.

We first load the credit scoring data. Comma-separated values (CSV) files are easy to load with read.csv().

credit_data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)

Let's look at what information we have.

colnames(credit_data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Let’s look at what proportion of people actually defaulted in this sample.

mean(credit_data$default.payment.next.month)
## [1] 0.2193333

The name of the response variable is too long! Let's shorten it by renaming. Recall the rename() function from dplyr.

library(dplyr)
credit_data<- rename(credit_data, default=default.payment.next.month)

How about the variable type and summary statistics?

str(credit_data)    # structure - see variable type
summary(credit_data) # summary statistics

We see that all variables are int, but we know that SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factors.

credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
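
Alternatively, since dplyr is already loaded, you could convert several columns in one step with across() (a sketch; across() is available in recent versions of dplyr):

credit_data <- credit_data %>%
  mutate(across(c(SEX, EDUCATION, MARRIAGE), as.factor))  # convert all three columns to factors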

We omit the rest of the EDA here, but you shouldn't skip it when doing your own data analysis.
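
As one minimal EDA sketch (just a single example; you should explore much more), we can look at the default rate across education levels:

# default rate within each EDUCATION level (row proportions)
prop.table(table(credit_data$EDUCATION, credit_data$default), margin = 1)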


3 Logistic Regression

Randomly split the data into training (80%) and testing (20%) datasets:

index <- sample(nrow(credit_data),nrow(credit_data)*0.80)
credit_train = credit_data[index,]
credit_test = credit_data[-index,]
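
Note that sample() draws a different split every time. If you want your split (and all downstream results) to be reproducible, set a seed before sampling; the seed value below is arbitrary.

set.seed(12345)   # any fixed seed works; run this before the sample() call above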

3.1 Train a logistic regression model with all variables

credit_glm0<- glm(default~., family=binomial, data=credit_train)
summary(credit_glm0)
## 
## Call:
## glm(formula = default ~ ., family = binomial, data = credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2204  -0.6959  -0.5311  -0.2787   3.3126  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.036e+00  1.549e-01  -6.686 2.29e-11 ***
## LIMIT_BAL   -8.374e-07  2.841e-07  -2.947 0.003205 ** 
## SEX2        -1.410e-01  5.476e-02  -2.575 0.010012 *  
## EDUCATION2  -1.576e-01  6.367e-02  -2.476 0.013302 *  
## EDUCATION3  -2.250e-01  8.561e-02  -2.628 0.008581 ** 
## EDUCATION4  -1.216e+00  3.349e-01  -3.630 0.000283 ***
## MARRIAGE2   -1.313e-01  6.209e-02  -2.114 0.034521 *  
## MARRIAGE3   -1.607e-01  2.348e-01  -0.684 0.493670    
## AGE          7.730e-03  3.357e-03   2.303 0.021292 *  
## PAY_0        5.973e-01  3.180e-02  18.783  < 2e-16 ***
## PAY_2        7.976e-02  3.649e-02   2.186 0.028808 *  
## PAY_3        5.966e-02  4.021e-02   1.484 0.137892    
## PAY_4        5.305e-02  4.489e-02   1.182 0.237349    
## PAY_5        4.748e-02  4.831e-02   0.983 0.325690    
## PAY_6        1.324e-02  4.006e-02   0.331 0.741000    
## BILL_AMT1   -6.889e-06  2.034e-06  -3.388 0.000705 ***
## BILL_AMT2    4.778e-06  2.521e-06   1.895 0.058085 .  
## BILL_AMT3    1.826e-06  2.190e-06   0.834 0.404334    
## BILL_AMT4   -5.368e-06  2.548e-06  -2.107 0.035120 *  
## BILL_AMT5    5.308e-06  2.796e-06   1.899 0.057592 .  
## BILL_AMT6   -7.715e-07  2.123e-06  -0.363 0.716315    
## PAY_AMT1    -1.097e-05  3.513e-06  -3.123 0.001789 ** 
## PAY_AMT2    -6.712e-06  3.215e-06  -2.088 0.036838 *  
## PAY_AMT3    -3.556e-06  3.418e-06  -1.040 0.298139    
## PAY_AMT4    -5.010e-06  3.070e-06  -1.632 0.102657    
## PAY_AMT5    -2.095e-06  3.210e-06  -0.653 0.514046    
## PAY_AMT6    -3.219e-06  2.152e-06  -1.496 0.134588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10071  on 9599  degrees of freedom
## Residual deviance:  8762  on 9573  degrees of freedom
## AIC: 8816
## 
## Number of Fisher Scoring iterations: 5

You have seen glm() before. In this lab, it is the main function used to build the logistic regression model, because logistic regression is a member of the generalized linear model family. In glm(), the only new argument is family, which specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example, family=binomial(logit) (the default link is logit). You can also specify family=binomial(link = "probit") to run a probit regression. You may also use glm() to build many other generalized linear models.
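
For instance, a probit fit on the same training data would look like the following (the object name credit_glm0_probit is only illustrative; we keep using the logit model in the rest of this lab):

credit_glm0_probit <- glm(default~., family=binomial(link = "probit"), data=credit_train)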

3.2 Binary Classification

As we discussed in the lecture, people may be more interested in the classification results, but to classify we first have to define a cut-off probability.

The tables below illustrate the impact of choosing different cut-off probabilities: a large cut-off probability results in few cases being predicted as 1, while a small cut-off probability results in many cases being predicted as 1.

pred_glm0_train <- predict(credit_glm0, type="response")  # type="response" returns predicted probabilities

table(credit_train$default, (pred_glm0_train > 0.9)*1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0 7498    8
##     1 2080   14
table(credit_train$default, (pred_glm0_train > 0.5)*1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0 7299  207
##     1 1546  548
table(credit_train$default, (pred_glm0_train > 0.2)*1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0 4613 2893
##     1  615 1479
table(credit_train$default, (pred_glm0_train > 0.0001)*1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0    2 7504
##     1    0 2094
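
Beyond reading these tables, you can also compute the misclassification rate (MR), false positive rate (FPR), and false negative rate (FNR) directly for a given cut-off. Below is a minimal sketch; the function name class_metrics and the 0.5 cut-off are only for illustration.

class_metrics <- function(r, pi, pcut){
  pred <- (pi > pcut)*1                    # predicted class at this cut-off
  MR  <- mean(r != pred)                   # misclassification rate
  FPR <- sum(r==0 & pred==1)/sum(r==0)     # false positive rate
  FNR <- sum(r==1 & pred==0)/sum(r==1)     # false negative rate
  c(MR = MR, FPR = FPR, FNR = FNR)
}
class_metrics(r = credit_train$default, pi = pred_glm0_train, pcut = 0.5)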

Therefore, determining the optimal cut-off probability is crucial. The simplest way to determine the cut-off is to use the proportion of "1"s in the original data, as sketched below. In the next section we introduce a more appropriate way to determine the optimal p-cut.
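
For example (the name pcut_naive is only illustrative):

pcut_naive <- mean(credit_train$default)   # proportion of "1"s in the training data
table(credit_train$default, (pred_glm0_train > pcut_naive)*1, dnn=c("Truth","Predicted"))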

3.3 Asymmetric cost

In the case of lending, the cost function can reflect the trade-off between the risk of giving a loan to someone who cannot pay (predict 0, truth 1) and the risk of rejecting someone who qualifies (predict 1, truth 0). Depending on the business situation, one may need asymmetric costs for false positives and false negatives. Meanwhile, a binary classification decision rule requires choosing a cut-off probability: as before, a large cut-off results in few cases being predicted as 1, and a small cut-off results in many cases being predicted as 1.

  1. The symmetric cost function and the asymmetric cost function with a 5:1 cost ratio:
#Symmetric cost (equal weight on both types of error: this is the misclassification rate)
cost1 <- function(r, pi, pcut){
  mean(((r==0)&(pi>pcut)) | ((r==1)&(pi<pcut)))
}

#Asymmetric cost
cost2 <- function(r, pi, pcut){
  weight1 <- 5
  weight0 <- 1
  c1 <- (r==1)&(pi<pcut) #logical vector - true if actual 1 but predict 0
  c0 <-(r==0)&(pi>pcut) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}
pcut <- 1/(5+1)  # with 5:1 costs, the theoretically optimal cut-off is weight0/(weight0+weight1) = 1/6
#Symmetric cost
cost1(r = credit_train$default, pi = pred_glm0_train, pcut)
## [1] 0.46375
#Asymmetric cost
cost2(r = credit_train$default, pi = pred_glm0_train, pcut)
## [1] 0.6704167
  2. (Optional) Symmetric cost and asymmetric cost with different cut-off probabilities:
pcuts <- costs1 <- costs2 <- matrix(c(0.9,0.5,0.2,0.0001), nrow = 4, ncol = 1)

costs1[] <- vapply(X = pcuts, FUN = cost1, FUN.VALUE = numeric(1), 
                   r = credit_train$default, pi = pred_glm0_train)
costs2[] <- vapply(X = pcuts, FUN = cost2, FUN.VALUE = numeric(1), 
                   r = credit_train$default, pi = pred_glm0_train)

All_costs <- as.matrix(cbind(costs1, costs2))
dimnames(All_costs) <- list(c(0.9,0.5,0.2,0.0001), c("Symm cost", "Asym cost"))
All_costs
##       Symm cost Asym cost
## 0.9   0.2175000 1.0841667
## 0.5   0.1826042 0.8267708
## 0.2   0.3654167 0.6216667
## 1e-04 0.7816667 0.7816667
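
To search for the optimal cut-off more finely, you can evaluate the asymmetric cost over a grid of candidate cut-offs and pick the one with the smallest cost. Below is a sketch using the 5:1 weights defined above; the grid and the object names (p_seq, cost_seq, optimal_pcut) are illustrative.

p_seq <- seq(0.01, 0.6, by = 0.01)          # candidate cut-off probabilities
cost_seq <- sapply(p_seq, function(p)
  cost2(r = credit_train$default, pi = pred_glm0_train, pcut = p))
optimal_pcut <- p_seq[which.min(cost_seq)]  # cut-off minimizing the asymmetric cost
optimal_pcut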

4 Summary

4.1 Things to remember

  • Know how to use glm() to build a logistic regression model;

  • Know how to get the ROC curve and AUC based on predicted probabilities (see the sketch after this list);

  • Know how to get the PR curve and its AUC based on predicted probabilities;

  • Know how to do binary classification and calculate MR, FPR, FNR, and cost;

  • Know how to use LASSO for logistic regression
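
As a pointer for the ROC/AUC item above, one common approach uses the pROC package; this is only a sketch and assumes pROC is installed (the object name roc_glm0 is illustrative).

library(pROC)
roc_glm0 <- roc(credit_train$default, pred_glm0_train)  # ROC from predicted probabilities
plot(roc_glm0)
auc(roc_glm0)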

4.2 Guide for Assignment

  • EDA

  • Train logistic model

  • Prediction (ROC, AUC; PR, AUC)

  • Model comparison using AUC

  • Find optimal cut-off based on training sample

  • Classification – Obtain the confusion matrix and calculate the MR, asymmetric cost, FPR, Precision, and Recall.

  • Build new models by Variable Selection

  • Calculate all criteria

  • Comprehensive comparison
