1 Credit Score Data

1.1 Load Data

credit.data <- read.csv("data/credit0.csv", header=T)

We remove X9 and id from the data since we will not be using them for prediction.

credit.data$X9 = NULL
credit.data$id = NULL
credit.data$Y = as.factor(credit.data$Y)

Now split the data 90/10 as training/testing datasets:

id_train <- sample(nrow(credit.data),nrow(credit.data)*0.90)
credit.train = credit.data[id_train,]
credit.test = credit.data[-id_train,]

The training dataset has 61 variables, 4500 obs.

You are already familiar with the credit scoring set. Let’s define a cost function for benchmarking testing set performance. Note this is slightly different from the one we used for searching for optimal cut-off probability in logistic regression. Here the 2nd argument is the predict class instead of the predict probability (since many methods are not based on predict probability).

creditcost <- function(observed, predicted){
  weight1 = 10
  weight0 = 1
  c1 = (observed==1)&(predicted == 0) #logical vector - true if actual 1 but predict 0
  c0 = (observed==0)&(predicted == 1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

go to top

2 Discriminant Analysis

Linear Discriminant Analysis (LDA) (in-sample and out-of-sample performance measure) is illustrated here. The following illustrate the usage of an arbitrary cut off probability.

2.1 In-sample

library(MASS)
credit.train$Y = as.factor(credit.train$Y)
credit.lda <- lda(Y~.,data=credit.train)
prob.lda.in <- predict(credit.lda,data=credit.train)
pcut.lda <- .15
pred.lda.in <- (prob.lda.in$posterior[,2]>=pcut.lda)*1
table(credit.train$Y,pred.lda.in,dnn=c("Obs","Pred"))

##    Pred
## Obs    0    1
##   0 3891  336
##   1  160  113

mean(ifelse(credit.train$Y != pred.lda.in, 1, 0))

## [1] 0.1102222

2.2 Out-of-sample

lda.out <- predict(credit.lda,newdata=credit.test)
cut.lda <- .12
pred.lda.out <- as.numeric((lda.out$posterior[,2]>=cut.lda))
table(credit.test$Y,pred.lda.out,dnn=c("Obs","Pred"))

##    Pred
## Obs   0   1
##   0 413  60
##   1  14  13

mean(ifelse(credit.test$Y != pred.lda.out, 1, 0))

## [1] 0.148

creditcost(credit.test$Y, pred.lda.out)

## [1] 0.4

go to top