1 Objective

The objective of this case is to help you understand logistic regression (binary classification) and some important ideas such as cross-validation, the ROC curve, and the cut-off probability.

2 Credit Card Default Data

We will use the Credit Card Default data for this lab and illustration. The details of the data can be found at http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients. Think about what kinds of factors could cause people to fail to pay their credit card balance.

We first load the credit scoring data. Comma-separated values (CSV) files are easy to load with read.csv().

credit_data <- read.csv(file = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)

Let’s look at what information we have.

colnames(credit_data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Let’s look at how many people actually defaulted in this sample.

mean(credit_data$default.payment.next.month)
## [1] 0.2193333
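If you prefer raw counts over the proportion, a quick optional check:

table(credit_data$default.payment.next.month)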

The name of the response variable is too long! I want to make it shorter by renaming it. Recall the rename() function.

library(dplyr)
credit_data<- rename(credit_data, default=default.payment.next.month)

How about the variable type and summary statistics?

str(credit_data)    # structure - see variable type
summary(credit_data) # summary statistics

We see that all variables are of type int, but we know SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factors.

credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
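
An optional sanity check that the conversion worked:

sapply(credit_data[c("SEX", "EDUCATION", "MARRIAGE")], class)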

We omit the rest of the EDA here, but you shouldn’t skip it when you are doing your own data analysis.

3 Logistic Regression

Randomly split the data to training (80%) and testing (20%) datasets:

index <- sample(nrow(credit_data),nrow(credit_data)*0.80)
credit_train = credit_data[index,]
credit_test = credit_data[-index,]
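
Note that sample() draws a random subset, so your split (and all downstream numbers) will differ between runs. If you want a reproducible split, set a seed first; the seed value below is arbitrary:

set.seed(2023)  # any fixed value works; this one is arbitrary
index <- sample(nrow(credit_data), nrow(credit_data)*0.80)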

3.1 Train a logistic regression model with all variables

credit_glm0 <- glm(default~., family=binomial, data=credit_train)
summary(credit_glm0)
## 
## Call:
## glm(formula = default ~ ., family = binomial, data = credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1722  -0.6932  -0.5366  -0.2874   3.3046  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.091e+00  1.540e-01  -7.086 1.38e-12 ***
## LIMIT_BAL   -7.182e-07  2.815e-07  -2.551 0.010745 *  
## SEX2        -1.020e-01  5.485e-02  -1.859 0.063065 .  
## EDUCATION2  -1.583e-01  6.307e-02  -2.510 0.012063 *  
## EDUCATION3  -1.828e-01  8.430e-02  -2.168 0.030150 *  
## EDUCATION4  -1.525e+00  3.975e-01  -3.837 0.000124 ***
## MARRIAGE2   -1.197e-01  6.187e-02  -1.934 0.053098 .  
## MARRIAGE3    1.553e-03  2.258e-01   0.007 0.994514    
## AGE          7.835e-03  3.333e-03   2.351 0.018718 *  
## PAY_0        5.849e-01  3.155e-02  18.542  < 2e-16 ***
## PAY_2        1.044e-01  3.609e-02   2.894 0.003805 ** 
## PAY_3        4.764e-02  3.993e-02   1.193 0.232863    
## PAY_4        2.720e-02  4.498e-02   0.605 0.545393    
## PAY_5        5.239e-02  4.771e-02   1.098 0.272179    
## PAY_6        1.048e-03  4.049e-02   0.026 0.979357    
## BILL_AMT1   -6.422e-06  2.073e-06  -3.098 0.001949 ** 
## BILL_AMT2    2.600e-06  2.697e-06   0.964 0.335095    
## BILL_AMT3    2.881e-06  2.278e-06   1.265 0.205932    
## BILL_AMT4   -4.468e-06  2.544e-06  -1.756 0.079084 .  
## BILL_AMT5    3.867e-06  2.965e-06   1.304 0.192194    
## BILL_AMT6    3.909e-07  2.317e-06   0.169 0.866053    
## PAY_AMT1    -8.793e-06  3.411e-06  -2.578 0.009952 ** 
## PAY_AMT2    -1.124e-05  3.666e-06  -3.065 0.002178 ** 
## PAY_AMT3    -1.675e-06  3.114e-06  -0.538 0.590628    
## PAY_AMT4    -4.896e-06  3.010e-06  -1.627 0.103792    
## PAY_AMT5    -3.289e-06  3.306e-06  -0.995 0.319864    
## PAY_AMT6    -1.762e-06  2.140e-06  -0.823 0.410482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10040.1  on 9599  degrees of freedom
## Residual deviance:  8778.3  on 9573  degrees of freedom
## AIC: 8832.3
## 
## Number of Fisher Scoring iterations: 6

You have seen glm() before. In this lab, it is the main function used to build a logistic regression model, because logistic regression is a member of the generalized linear model family. In glm(), the only new thing is family, which specifies the distribution of your response variable. You may also specify the link function after the name of the distribution, for example family=binomial(link = "logit") (logit is the default link). You can specify family=binomial(link = "probit") to run a probit regression instead. You may also use glm() to build many other generalized linear models.
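
For instance, a probit version of the model above can be fit like this (a sketch; the object name credit_glm_probit is just for illustration):

# probit fit on the same training data
credit_glm_probit <- glm(default ~ ., family = binomial(link = "probit"),
                         data = credit_train)
summary(credit_glm_probit)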

3.2 Variable Selection

3.2.1 Variable Selection with Stepwise Approach

We can use the same variable selection procedures as for linear regression models, i.e., forward, backward, and stepwise. Caution: this can take a long time, since the number of predictor variables is not small and the sample size is large.

credit_glm_back <- step(credit_glm0) # backward elimination (the default when you don't specify a scope)
summary(credit_glm_back)
credit_glm_back$deviance
AIC(credit_glm_back)
BIC(credit_glm_back)

You can also try model selection with BIC, which usually results in a simpler model than the AIC criterion:

credit_glm_back_BIC <- step(credit_glm0, k=log(nrow(credit_train)))  # k = log(n) gives the BIC penalty
summary(credit_glm_back_BIC)
credit_glm_back_BIC$deviance
AIC(credit_glm_back_BIC)
BIC(credit_glm_back_BIC)

Exercise: Try forward and stepwise selection procedures to see if they deliver the same best model.
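
If you want a starting point (a sketch, not the full answer): forward selection needs a null model and an explicit scope, while stepwise search can start from the full model. The object names below are just for illustration.

credit_glm_null <- glm(default ~ 1, family = binomial, data = credit_train)
# forward selection: start from the null model, search up to the full model;
# formula(terms(credit_glm0)) expands the "." into the full variable list
credit_glm_fwd <- step(credit_glm_null,
                       scope = list(lower = default ~ 1,
                                    upper = formula(terms(credit_glm0))),
                       direction = "forward")
# stepwise (both directions), starting from the full model
credit_glm_step <- step(credit_glm0, direction = "both")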


3.2.2 Variable selection with LASSO

Be careful: LASSO requires x to be a numeric matrix, so we need to manually convert the categorical variables (“SEX”, “EDUCATION” and “MARRIAGE”) to dummy variables. For simplicity, you can directly convert a categorical variable to numeric with as.numeric() only if you have evidence that it has a monotonic relationship with the response, for example, if the probability of default increases or decreases as the EDUCATION level goes from 1 to 4. This can be checked from a two-way contingency table by calculating the default proportion at each education level.
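
For example, a quick way to compute the default proportion at each EDUCATION level (using the renamed default variable from earlier):

# default rate at each education level
tapply(credit_data$default, credit_data$EDUCATION, mean)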

Here I will show how to convert categorical variables to dummy variables.

dummy <- model.matrix(~ ., data = credit_data)
# look at first few rows of data
head(dummy)

The function model.matrix() can automatically convert categorical variables to dummy variables. It also creates a column of 1s, which we don’t need at this time. That column of 1s is used for estimating the intercept if you write the algorithm yourself, but most available functions create that column automatically during estimation.

credit_data_lasso <- data.frame(dummy[,-1])  # drop the intercept column of 1s

Now let’s get data prepared for LASSO.

# index was created in the earlier train/test split; we reuse it here so the
# partition is the same:
# index <- sample(nrow(credit_data), nrow(credit_data)*0.80)
credit_train_X = as.matrix(select(credit_data_lasso, -default)[index,])
credit_test_X = as.matrix(select(credit_data_lasso, -default)[-index,])
credit_train_Y = credit_data_lasso[index, "default"]
credit_test_Y = credit_data_lasso[-index, "default"]
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0
credit_lasso <- glmnet(x=credit_train_X, y=credit_train_Y, family = "binomial")
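
Before cross-validation, you can optionally inspect the solution path, i.e., how each coefficient shrinks toward zero as \(\lambda\) increases:

plot(credit_lasso, xvar = "lambda", label = TRUE)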

Perform cross-validation to determine the shrinkage parameter.

credit_lasso_cv<- cv.glmnet(x=credit_train_X, y=credit_train_Y, family = "binomial", type.measure = "class")
plot(credit_lasso_cv)

For logistic regression, we can specify type.measure = "class" so that the CV error will be the misclassification error.
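
Other loss measures are available as well; for example, type.measure = "auc" uses the area under the ROC curve as the CV criterion (a sketch; the object name is for illustration):

credit_lasso_cv_auc <- cv.glmnet(x = credit_train_X, y = credit_train_Y,
                                 family = "binomial", type.measure = "auc")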

Get the coefficients at the optimal \(\lambda\). Here lambda.min is the \(\lambda\) that minimizes the CV error, while lambda.1se is the largest \(\lambda\) whose CV error is within one standard error of the minimum; the latter typically yields a sparser model.

coef(credit_lasso, s=credit_lasso_cv$lambda.min)
## 27 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.111218e+00
## LIMIT_BAL   -7.114730e-07
## SEX2        -9.566709e-02
## EDUCATION2  -1.418756e-01
## EDUCATION3  -1.613499e-01
## EDUCATION4  -1.448724e+00
## MARRIAGE2   -1.129672e-01
## MARRIAGE3    .           
## AGE          7.523725e-03
## PAY_0        5.864557e-01
## PAY_2        1.015633e-01
## PAY_3        5.115741e-02
## PAY_4        2.584251e-02
## PAY_5        5.040374e-02
## PAY_6        3.784503e-03
## BILL_AMT1   -3.880204e-06
## BILL_AMT2    7.496496e-08
## BILL_AMT3    1.269392e-06
## BILL_AMT4   -9.995223e-07
## BILL_AMT5    1.751675e-06
## BILL_AMT6    4.347703e-07
## PAY_AMT1    -6.277688e-06
## PAY_AMT2    -1.017124e-05
## PAY_AMT3    -3.212839e-06
## PAY_AMT4    -2.969138e-06
## PAY_AMT5    -3.320249e-06
## PAY_AMT6    -1.647002e-06
coef(credit_lasso, s=credit_lasso_cv$lambda.1se)
## 27 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.227832e+00
## LIMIT_BAL   -6.053830e-07
## SEX2        -4.802123e-02
## EDUCATION2  -2.098067e-02
## EDUCATION3   .           
## EDUCATION4  -9.177075e-01
## MARRIAGE2   -5.940404e-02
## MARRIAGE3    .           
## AGE          4.791246e-03
## PAY_0        5.808326e-01
## PAY_2        9.297812e-02
## PAY_3        5.333530e-02
## PAY_4        1.890388e-02
## PAY_5        4.770851e-02
## PAY_6        4.727086e-03
## BILL_AMT1   -1.925545e-06
## BILL_AMT2    .           
## BILL_AMT3    .           
## BILL_AMT4    .           
## BILL_AMT5    .           
## BILL_AMT6    .           
## PAY_AMT1    -3.534711e-06
## PAY_AMT2    -4.633002e-06
## PAY_AMT3    -1.200810e-06
## PAY_AMT4    -3.553567e-07
## PAY_AMT5    -1.724196e-06
## PAY_AMT6    -6.429552e-07

3.2.2.1 Prediction

# in-sample prediction
pred_lasso_train <- predict(credit_lasso, newx=credit_train_X, s=credit_lasso_cv$lambda.1se, type = "response")
# out-of-sample prediction
pred_lasso_test <- predict(credit_lasso, newx=credit_test_X, s=credit_lasso_cv$lambda.1se, type = "response")
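
Both objects contain predicted probabilities (because of type = "response"). To turn them into class predictions you need a cut-off probability; here is a minimal sketch assuming a 0.5 cut-off, though with asymmetric misclassification costs you would usually choose a different one:

pcut <- 0.5  # an assumed cut-off; tune it for your cost structure
pred_class_test <- as.numeric(pred_lasso_test > pcut)
mean(pred_class_test != credit_test_Y)  # out-of-sample misclassification rate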