1 Gasoline Price Case: Indicator (Dummy) Variables

1.1 Do you believe the simple linear model can do the magic?

1.1.1 YES!

Give me your reasons.

1.1.2 NO!

Why?

1.2 Import the data

Gas_prices <- readxl::read_xlsx(path = "data/Gas_prices_1.xlsx")
colnames(Gas_prices)

##  [1] "Date"           "Year"           "Month"          "Unleaded"
##  [5] "Crude_price"    "SP500"          "PDI"            "CPI"
##  [9] "RR_sales"       "Unemp"          "Housing"        "Demand"
## [13] "Production"     "Imports"        "Stocks"         "Time"
## [17] "L1_Unleaded"    "L1_Crude_price" "L1_SP500"       "L1_PDI"
## [21] "L1_CPI"         "L1_RR_sales"    "L1_Unemp"       "L1_Housing"
## [25] "L1_Demand"      "L1_Production"  "L1_Imports"     "L1_Stocks"
## [29] "Recession"

Gas_prices_clean <- na.omit(Gas_prices)
if (!require("forecast")){install.packages("forecast")}

1.3 Box plot of Unleaded by Recession

library("PerformanceAnalytics")
Gas_prices_clean$Recession <- as.factor(Gas_prices_clean$Recession)
# summary(Gas_prices_clean)
# Default plot
library(ggplot2)
e <- ggplot(Gas_prices_clean, aes(x = Recession, y = Unleaded))
e + geom_boxplot()

What is your guess here?

1.3.1 A simple regression model we obtained earlier:

mod_mul1 <- lm(formula = Unleaded ~ L1_Crude_price, 
              data = Gas_prices_clean)
summary(mod_mul1)
## 
## Call:
## lm(formula = Unleaded ~ L1_Crude_price, data = Gas_prices_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.658 -10.606  -3.224  10.459  64.539 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    66.19172    2.72392   24.30   <2e-16 ***
## L1_Crude_price  2.87386    0.04317   66.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.8 on 237 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.949 
## F-statistic:  4431 on 1 and 237 DF,  p-value: < 2.2e-16

1.3.2 What can we do to include the recession factor?

mod_mul_rec <- lm(Unleaded ~ L1_Crude_price + Recession,
                  data = Gas_prices_clean)
summary(mod_mul_rec)
## 
## Call:
## lm(formula = Unleaded ~ L1_Crude_price + Recession, data = Gas_prices_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.863 -10.247  -2.713   9.345  62.256 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     63.44396    2.45479  25.845  < 2e-16 ***
## L1_Crude_price   2.95248    0.03979  74.205  < 2e-16 ***
## Recession1     -62.34339    7.92719  -7.865 1.32e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.56 on 236 degrees of freedom
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9594 
## F-statistic:  2815 on 2 and 236 DF,  p-value: < 2.2e-16
coefs <- coef(mod_mul_rec)
Data_Recession <- subset(Gas_prices_clean, Recession == 1)
pred_rec <- predict(mod_mul_rec, newdata = Data_Recession)
# empty plotting window covering the full data range
plot(NULL, xlim = range(Gas_prices_clean$L1_Crude_price), 
     ylim = range(Gas_prices_clean$Unleaded), 
     ylab = "Unleaded", xlab = "L1_Crude_price")
# both lines share the estimated slope; the recession line shifts the intercept by the dummy coefficient
abline(a = coefs[1], b = coefs[2])
abline(a = coefs[1] + coefs[3], b = coefs[2], col = "red")
text(x = c(40, 80), y = c(350, 200), pos = 4, labels = c("Non-Recession", "Recession"),
     col = c("black", "red"))


2 Boston Housing Data Case

The Boston housing data is a built-in dataset in the MASS package, so you do not need to download it externally. MASS ships with R, so there is no need to run install.packages("MASS") to download and install it, but you do need to load the package.

2.1 Load Data

library(MASS)
data(Boston); #this data is in MASS package
colnames(Boston) 
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

You can find details of the dataset from help document.

?Boston

The original data are 506 observations on 14 variables, with medv being the response variable \(y\).

2.2 Exploratory Data Analysis (EDA)

We have introduced many EDA techniques in lab 2. We will briefly go through some of them here.

dim(Boston) 
## [1] 506  14
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
# str(Boston)
# summary(Boston)

We skip the Exploratory Data Analysis (EDA) in this notes, but you should not omit it in your HW and Cases. EDA is very important and always the first analysis to do before any modeling.
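
For reference, a minimal EDA sketch might look like the following (the variables shown are only illustrative choices):

hist(Boston$medv, main = "Distribution of medv", xlab = "medv")  # response distribution
pairs(Boston[, c("medv", "lstat", "rm", "crim")])                # pairwise scatter plots
round(cor(Boston), 2)                                            # correlation matrix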

2.3 Preparation

2.3.1 Splitting data to training and testing samples

Next we sample 90% of the original data and use it as the training set. The remaining 10% is used as the test set. The regression model will be built on the training set, and the future performance of your model will be evaluated with the test set.

# consider calling set.seed() first so that the 90/10 split is reproducible
sample_index <- sample(nrow(Boston),nrow(Boston)*0.90)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

2.3.2 (Optional) Standardization

If we want our results to be invariant to the units and the parameter estimates \(\beta_i\) to be comparable, we can standardize the variables. Essentially we are replacing the original values with their z-scores.

1st Way: create new variables manually.

Boston$sd.crim <- (Boston$crim-mean(Boston$crim))/sd(Boston$crim); 

This does the same thing.

Boston$sd.crim <- scale(Boston$crim); 

2nd way: if you have many variables to standardize, doing it one at a time as above is tedious. You can use a loop like the one below, which standardizes every variable other than the last one, \(y\).

for (i in 1:(ncol(Boston_train)-1)){
  Boston_train[,i] <- scale(Boston_train[,i])
}

Standardization is less important in linear regression because it only changes the interpretation of the coefficients; it does not affect model estimation or inference.
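
One caveat worth noting (a sketch, not part of the original notes): if you standardize the training set, the test set should be scaled with the training means and standard deviations rather than its own:

pred_cols <- setdiff(names(Boston_test), "medv")       # the 13 predictor columns
ctr <- colMeans(Boston[sample_index, pred_cols])       # training-set means
scl <- apply(Boston[sample_index, pred_cols], 2, sd)   # training-set sds
Boston_test_std <- Boston_test
Boston_test_std[, pred_cols] <- scale(Boston_test[, pred_cols], center = ctr, scale = scl)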


3 Model Building

Your task is to build the best model with the training data. You can refer to the regression and variable selection code on the slides for a more detailed description of linear regression.

The following model spells out every \(x\) variable in the formula:

model_1 <- lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat, data=Boston_train)

To include all variables in the model, you can write the same statement in this simpler way.

model_1 <- lm(medv~., data=Boston_train)
summary(model_1)
## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7552  -2.7514  -0.4968   1.6540  25.0881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.4842     0.2218 101.387  < 2e-16 ***
## crim         -0.9112     0.2925  -3.115 0.001958 ** 
## zn            1.1667     0.3395   3.436 0.000646 ***
## indus         0.3541     0.4472   0.792 0.428862    
## chas          0.7809     0.2304   3.390 0.000762 ***
## nox          -2.1489     0.4626  -4.645 4.49e-06 ***
## rm            2.4271     0.3015   8.051 7.71e-15 ***
## age           0.3796     0.3907   0.972 0.331722    
## dis          -2.9483     0.4396  -6.707 6.10e-11 ***
## rad           2.8890     0.6088   4.745 2.82e-06 ***
## tax          -2.2954     0.6795  -3.378 0.000794 ***
## ptratio      -2.0767     0.2956  -7.026 8.10e-12 ***
## black         0.8365     0.2627   3.184 0.001558 ** 
## lstat        -4.0515     0.3720 -10.892  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.73 on 441 degrees of freedom
## Multiple R-squared:  0.7378, Adjusted R-squared:   0.73 
## F-statistic: 95.43 on 13 and 441 DF,  p-value: < 2.2e-16

But, is this the model you want to use?

3.1 (Optional) Interaction terms in model

If you suspect that the effect of one predictor x1 on the response y depends on the value of another predictor x2, you can add interaction terms to the model. To specify an interaction, put : between the two interacting variables. For example:

lm(medv~crim+zn+crim:zn, data=Boston_train)
# The following way automatically adds the main effects of crim and zn
lm(medv~crim*zn, data=Boston_train)

For now we will not investigate the interactions of variables.

4 Model Assessment

Suppose that everything in the model diagnostics is okay; in other words, the model we have built is a fairly valid model. Then we need to evaluate the model's performance in terms of different metrics.

Commonly used metrics include MSE, (adjusted) \(R^2\), AIC, BIC for in-sample performance, and MSPE for out-of-sample performance.

4.1 In-sample model evaluation (train error)

The MSE of the regression is the square of the 'Residual standard error' in the above summary. It is the sum of squared residuals (SSE) divided by the degrees of freedom, \(n-p-1\). In some textbooks the sum of squared residuals is called the residual sum of squares (RSS). The MSE of the regression is an unbiased estimator of the variance of \(\epsilon\), the error term in the regression model.

model_summary <- summary(model_1)
(model_summary$sigma)^2
## [1] 22.37703
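
As a quick check, the same number can be computed directly from the residuals:

sum(residuals(model_1)^2) / model_1$df.residual   # SSE / (n-p-1)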

\(R^2\) of the model

model_summary$r.squared
## [1] 0.7377558

Adjusted \(R^2\) of the model. If you add a variable (or several as a group), SSE will decrease and \(R^2\) will increase, but adjusted \(R^2\) could go either way.

model_summary$adj.r.squared
## [1] 0.7300253
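
For reference, with \(n\) observations and \(p\) predictors the adjusted \(R^2\) is

\[R^2_{adj} = 1-\frac{SSE/(n-p-1)}{SST/(n-1)} = 1-(1-R^2)\,\frac{n-1}{n-p-1},\]

which is why it can decrease when an added variable reduces SSE only slightly.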

AIC and BIC of the model: these are information criteria, and smaller values indicate a better fit.

AIC(model_1)
## [1] 2721.17
BIC(model_1)
## [1] 2782.975

Because BIC, AIC, and adjusted \(R^2\) all include a complexity penalty in their definitions, they are better indicators of how well a model will perform on future data when comparing across different models.
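
For reference, the general definitions are

\[\text{AIC} = -2\log L + 2k, \qquad \text{BIC} = -2\log L + k\log n,\]

where \(L\) is the maximized likelihood, \(k\) the number of estimated parameters, and \(n\) the sample size; the term involving \(k\) is the complexity penalty, and BIC penalizes model size more heavily than AIC once \(\log n > 2\).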

4.2 Out-of-sample prediction (test error)

To evaluate how the model performs on future data, we use predict() to get the predicted values from the test set.

#pi is a vector that contains predicted values for test set.
pi <- predict(object = model_1, newdata = Boston_test)

As with any other function, you can write the above statement the following way, as long as the arguments are in the correct order.

pi <- predict(model_1, Boston_test)

The most common measure is the Mean Squared Error (MSE): the average of the squared differences between the predicted and actual values. Computed on the test set, this is the Mean Squared Prediction Error (MSPE).

mean((pi - Boston_test$medv)^2)
## [1] 31.88281

A less popular measure is the Mean Absolute Error (MAE). As you can probably guess, instead of averaging the squared errors, MAE averages the absolute values of the errors.

mean(abs(pi - Boston_test$medv))
## [1] 3.128037

Note that if you omit the second argument of predict(), it gives you the in-sample predictions on the training set:

predict(model_1)

Which is the same as

model_1$fitted.values

4.3 Cross-validation

Instead of fitting a model on a pre-specified 90% training sample and evaluating the MSPE on the hold-out 10% testing sample, it is more reliable to use cross-validation for out-of-sample performance evaluation. For k-fold cross-validation, the dataset is divided into k parts of (roughly) equal sample size. Each part serves as the testing sample in turn, and the remaining k-1 parts together serve as the training sample. This training/testing procedure is performed k times. The CV score is usually the average of the out-of-sample performance metric across the k iterations.
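
Conceptually the procedure looks like the hand-rolled sketch below (cv.glm() further down does this for you; the seed and the full-model formula are only illustrative):

k <- 10
set.seed(123)
folds <- sample(rep(1:k, length.out = nrow(Boston)))   # random fold labels
cv_mse <- numeric(k)
for (j in 1:k) {
  fit <- lm(medv ~ ., data = Boston[folds != j, ])     # train on the other k-1 folds
  pred <- predict(fit, newdata = Boston[folds == j, ]) # predict the held-out fold
  cv_mse[j] <- mean((pred - Boston$medv[folds == j])^2)
}
mean(cv_mse)   # CV score: average out-of-sample MSE over the k folds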

Note

  1. We use the entire dataset for cross validation

  2. We need to use glm() instead of lm() to fit the model (if we want to use the cv.glm() function in the boot package)

  3. The default cost is the Mean Squared Prediction Error (MSPE). If we want to use another measure we need to define a cost function.

10-fold Cross Validation

library(boot)
model.glm1 = glm(medv~., data = Boston)
cv.glm(data = Boston, glmfit = model.glm1, K = 10)$delta[2]
## [1] 23.31722

Compare with a second model (dropping indus and age):

model.glm2 = glm(medv~. -indus -age, data = Boston)
cv.glm(data = Boston, glmfit = model.glm2, K = 10)$delta[2]
## [1] 23.32833

LOOCV (Leave-one-out Cross Validation)

cv.glm(data = Boston, glmfit = model.glm1, K = nrow(Boston))$delta[2]

Now we define a MAD cost function as the CV score. The function takes two input vectors: pi (predicted values) and r (actual values).

MAD_cost = function(pi, r){
  return(mean(abs(pi-r)))
}
cv.glm(data = Boston, glmfit = model.glm1, cost = MAD_cost, K = 10)$delta[2]
## [1] 3.38637

Exercise: Do 10-fold cross-validation for all candidate models, and compare the CV score in terms of MSE and MAE.

4.4 Robust Linear Regression

Traditional linear regression uses ordinary least squares (OLS) for estimation, but OLS is very sensitive to outliers: the estimates can be severely affected by them. As an alternative, robust estimation can be used to deal with this issue. The most common robust estimator is the M-estimator.

In the MASS package, the function rlm() is designed for robust linear regression.

lm_rob <- rlm(medv~., data = Boston_train)
pred_rob <- predict(lm_rob, newdata = Boston_test)
mspe_rob <- mean((pred_rob - Boston_test$medv)^2)
mspe_rob
## [1] 36.73636
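
To see whether the robust fit helps out of sample, it can be scored side by side with the OLS fit; a sketch, assuming model_1 from Section 3 is the OLS fit on Boston_train:

mspe_ols <- mean((predict(model_1, newdata = Boston_test) - Boston_test$medv)^2)
c(OLS = mspe_ols, Robust = mspe_rob)   # compare out-of-sample MSE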


5 Variable Selection

5.1 Compare Model Fit Manually

model_1 <- lm(medv~., data = Boston_train)
model_2 <- lm(medv~crim+zn, data = Boston_train)
summary(model_1)
summary(model_2)
AIC(model_1); BIC(model_1)
AIC(model_2); BIC(model_2)

Exercise: Compare the MSE, \(R^2\), and MSPE of these models.

5.2 Best Subset Regression

The leaps package provides procedures for best subset regression.

install.packages('leaps')
library(leaps)

Which subset of variables should you include in order to minimize BIC?

#regsubsets only takes data frame as input
subset_result <- regsubsets(medv~.,data=Boston_train, nbest=2, nvmax = 14)
summary(subset_result)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston_train, nbest = 2, 
##     nvmax = 14)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 2 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 1  ( 2 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   " "  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 2  ( 2 )  " "  " " " "   " "  " " " " " " " " " " " " "*"     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 3  ( 2 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     "*"   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     "*"   "*"  
## 4  ( 2 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 2 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     "*"   "*"  
## 6  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 6  ( 2 )  "*"  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 1 )  " "  "*" " "   " "  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 7  ( 2 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 8  ( 2 )  "*"  " " " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"  
## 9  ( 1 )  "*"  " " " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 9  ( 2 )  " "  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 10  ( 1 ) "*"  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 10  ( 2 ) "*"  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 2 ) "*"  "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 2 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
plot(subset_result, scale="bic")

Each row represents a model. A black cell indicates that the variable is included in the model; a white cell indicates that it is not. The scale argument can be “Cp”, “adjr2”, “r2”, or “bic”.

What is the problem with best subset regression? If there are \(p\) candidate predictor variables, the number of possible nonempty subsets is \(2^p - 1\). If you try best subset regression with more than 50 variables, you might need to wait your entire life to get the result.
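
To put numbers on that:

2^13 - 1   # Boston's 13 predictors: 8,191 candidate subsets -- manageable
2^50 - 1   # 50 predictors: about 1.1e15 subsets -- not feasible to enumerate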

5.3 Forward/Backward/Stepwise Regression Using AIC

To perform forward, backward, or stepwise regression in R, we need to define the starting points:

nullmodel=lm(medv~1, data=Boston_train)
fullmodel=lm(medv~., data=Boston_train)

nullmodel is the intercept-only model with no predictor variables in it, while fullmodel is the model with every variable in it.

5.3.1 Backward Elimination

model_step_b <- step(fullmodel,direction='backward')
## Start:  AIC=1411.54
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - age      1      0.98  9520.1 1409.6
## - indus    1      5.17  9524.3 1409.8
## <none>                  9519.1 1411.5
## - chas     1     74.25  9593.3 1413.1
## - zn       1    200.58  9719.7 1419.0
## - crim     1    233.87  9752.9 1420.6
## - tax      1    258.09  9777.2 1421.7
## - black    1    296.04  9815.1 1423.5
## - nox      1    349.18  9868.3 1425.9
## - rad      1    407.01  9926.1 1428.6
## - dis      1    988.69 10507.8 1454.5
## - ptratio  1   1159.66 10678.7 1461.8
## - rm       1   1953.39 11472.5 1494.5
## - lstat    1   2073.55 11592.6 1499.2
## 
## Step:  AIC=1409.59
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## - indus    1      5.13  9525.2 1407.8
## <none>                  9520.1 1409.6
## - chas     1     73.74  9593.8 1411.1
## - zn       1    205.95  9726.0 1417.3
## - crim     1    234.09  9754.2 1418.6
## - tax      1    258.59  9778.7 1419.8
## - black    1    295.06  9815.1 1421.5
## - nox      1    388.63  9908.7 1425.8
## - rad      1    412.71  9932.8 1426.9
## - dis      1   1067.66 10587.7 1456.0
## - ptratio  1   1172.64 10692.7 1460.4
## - rm       1   2021.81 11541.9 1495.2
## - lstat    1   2379.06 11899.1 1509.1
## 
## Step:  AIC=1407.84
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     black + lstat
## 
##           Df Sum of Sq     RSS    AIC
## <none>                  9525.2 1407.8
## - chas     1     78.66  9603.8 1409.6
## - zn       1    201.00  9726.2 1415.3
## - crim     1    236.66  9761.9 1417.0
## - black    1    292.62  9817.8 1419.6
## - tax      1    296.78  9822.0 1419.8
## - nox      1    393.90  9919.1 1424.3
## - rad      1    429.29  9954.5 1425.9
## - dis      1   1140.05 10665.2 1457.3
## - ptratio  1   1171.15 10696.3 1458.6
## - rm       1   2020.12 11545.3 1493.3
## - lstat    1   2375.32 11900.5 1507.1

5.3.2 Forward Selection

model_step_f <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
## Start:  AIC=2035.25
## medv ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + lstat    1   22304.4 17391 1661.8
## + rm       1   20032.4 19663 1717.6
## + ptratio  1   11787.0 27908 1877.0
## + indus    1   10179.9 29515 1902.4
## + tax      1    9814.3 29881 1908.0
## + nox      1    7700.8 31995 1939.1
## + rad      1    6762.2 32933 1952.3
## + crim     1    6490.9 33204 1956.0
## + age      1    6063.3 33632 1961.8
## + zn       1    5524.3 34171 1969.1
## + black    1    4521.9 35173 1982.2
## + dis      1    2865.9 36829 2003.2
## + chas     1     766.9 38928 2028.4
## <none>                 39695 2035.2
## 
## Step:  AIC=1661.75
## medv ~ lstat
## 
##           Df Sum of Sq   RSS    AIC
## + rm       1    3943.2 13448 1546.8
## + ptratio  1    2779.3 14612 1584.5
## + chas     1     475.6 16915 1651.1
## + dis      1     474.9 16916 1651.2
## + tax      1     406.0 16985 1653.0
## + black    1     296.6 17094 1655.9
## + zn       1     235.0 17156 1657.6
## + indus    1     225.7 17165 1657.8
## + crim     1     192.1 17199 1658.7
## + age      1     179.3 17212 1659.0
## + rad      1      79.0 17312 1661.7
## <none>                 17391 1661.8
## + nox      1       1.8 17389 1663.7
## 
## Step:  AIC=1546.75
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS    AIC
## + ptratio  1   1798.13 11650 1483.4
## + black    1    589.43 12858 1528.4
## + tax      1    583.79 12864 1528.6
## + crim     1    381.68 13066 1535.7
## + rad      1    303.90 13144 1538.3
## + chas     1    285.84 13162 1539.0
## + dis      1    179.90 13268 1542.6
## + indus    1    146.48 13301 1543.8
## + zn       1     91.66 13356 1545.6
## <none>                 13448 1546.8
## + nox      1     39.69 13408 1547.4
## + age      1      1.17 13446 1548.7
## 
## Step:  AIC=1483.44
## medv ~ lstat + rm + ptratio
## 
##         Df Sum of Sq   RSS    AIC
## + black  1    462.50 11187 1467.0
## + dis    1    285.08 11364 1474.2
## + crim   1    182.80 11467 1478.2
## + chas   1    150.97 11499 1479.5
## + tax    1    117.61 11532 1480.8
## + nox    1     52.27 11597 1483.4
## <none>               11650 1483.4
## + age    1     20.36 11629 1484.6
## + indus  1     11.92 11638 1485.0
## + rad    1      5.68 11644 1485.2
## + zn     1      5.62 11644 1485.2
## 
## Step:  AIC=1467.01
## medv ~ lstat + rm + ptratio + black
## 
##         Df Sum of Sq   RSS    AIC
## + dis    1    391.53 10796 1452.8
## + chas   1    131.66 11055 1463.6
## + crim   1     67.73 11119 1466.2
## <none>               11187 1467.0
## + age    1     32.32 11155 1467.7
## + tax    1     20.86 11166 1468.2
## + rad    1     20.42 11167 1468.2
## + zn     1      8.78 11178 1468.7
## + nox    1      6.29 11181 1468.8
## + indus  1      0.47 11187 1469.0
## 
## Step:  AIC=1452.8
## medv ~ lstat + rm + ptratio + black + dis
## 
##         Df Sum of Sq   RSS    AIC
## + nox    1    475.79 10320 1434.3
## + indus  1    175.96 10620 1447.3
## + tax    1    148.81 10647 1448.5
## + crim   1    134.84 10661 1449.1
## + zn     1    115.90 10680 1449.9
## + chas   1     83.63 10712 1451.3
## + age    1     82.30 10713 1451.3
## <none>               10796 1452.8
## + rad    1      1.80 10794 1454.7
## 
## Step:  AIC=1434.29
## medv ~ lstat + rm + ptratio + black + dis + nox
## 
##         Df Sum of Sq   RSS    AIC
## + zn     1   118.194 10202 1431.0
## + chas   1   113.092 10207 1431.3
## + crim   1    93.809 10226 1432.1
## + rad    1    65.130 10255 1433.4
## <none>               10320 1434.3
## + indus  1    21.471 10298 1435.3
## + age    1     9.965 10310 1435.8
## + tax    1     8.623 10311 1435.9
## 
## Step:  AIC=1431.05
## medv ~ lstat + rm + ptratio + black + dis + nox + zn
## 
##         Df Sum of Sq   RSS    AIC
## + crim   1   133.158 10068 1427.1
## + chas   1   121.590 10080 1427.6
## <none>               10202 1431.0
## + rad    1    37.392 10164 1431.4
## + tax    1    36.422 10165 1431.4
## + indus  1    21.193 10180 1432.1
## + age    1     3.052 10198 1432.9
## 
## Step:  AIC=1427.07
## medv ~ lstat + rm + ptratio + black + dis + nox + zn + crim
## 
##         Df Sum of Sq     RSS    AIC
## + rad    1   141.938  9926.4 1422.6
## + chas   1   108.830  9959.5 1424.1
## <none>               10068.3 1427.1
## + indus  1    21.139 10047.2 1428.1
## + tax    1     6.864 10061.5 1428.8
## + age    1     5.091 10063.3 1428.8
## 
## Step:  AIC=1422.61
## medv ~ lstat + rm + ptratio + black + dis + nox + zn + crim + 
##     rad
## 
##         Df Sum of Sq    RSS    AIC
## + tax    1    322.56 9603.8 1409.6
## + chas   1    104.44 9822.0 1419.8
## <none>               9926.4 1422.6
## + indus  1     36.46 9890.0 1422.9
## + age    1      0.97 9925.4 1424.6
## 
## Step:  AIC=1409.58
## medv ~ lstat + rm + ptratio + black + dis + nox + zn + crim + 
##     rad + tax
## 
##         Df Sum of Sq    RSS    AIC
## + chas   1    78.657 9525.2 1407.8
## <none>               9603.8 1409.6
## + indus  1    10.049 9593.8 1411.1
## + age    1     0.417 9603.4 1411.6
## 
## Step:  AIC=1407.84
## medv ~ lstat + rm + ptratio + black + dis + nox + zn + crim + 
##     rad + tax + chas
## 
##         Df Sum of Sq    RSS    AIC
## <none>               9525.2 1407.8
## + indus  1    5.1281 9520.1 1409.6
## + age    1    0.9403 9524.3 1409.8

5.3.3 Stepwise Selection (Output Omitted)

model_step_s <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')

One caution when comparing fit statistics using AIC: the exact definition varies by program and procedure.

extractAIC(model_1)
## [1]   14.000 1411.544
AIC(model_1)
## [1] 2704.778
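
For a Gaussian lm the two values differ only by an additive constant that depends on \(n\), so model rankings agree as long as you stay within one function; a quick check:

n <- nobs(model_1)
AIC(model_1) - extractAIC(model_1)[2]   # extractAIC() drops the likelihood constant
n * (log(2 * pi) + 1) + 2               # the dropped constant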

Exercise

  1. Compare the in-sample and out-of-sample performance of these reduced models.
  2. Conduct 10-fold cross-validation on the full sample and compare the CV scores.
  • For pros and cons of variable/model selection using the common fit statistics ((adjusted) \(R^2\), MSE, AIC, BIC, etc.), refer to Ch9 of “Applied Linear Regression Models” by Kutner et al.
  • For other variable selection methods, refer to sections 3.4 - 3.8 of “Elements of Statistical Learning” (free online).


5.4 Lasso and Elastic net

Two state-of-the-art automatic variable selection techniques for predictive modeling, Lasso [1] and Elastic net [2], are provided in the glmnet package. These methods are generally better than stepwise regression, especially when dealing with a large number of predictor variables.

For a more theoretical review of these techniques refer to section 3.4 - 3.8 of “Elements of Statistical Learning” (Free Online). For more details of the glmnet package you can watch Professor Trevor Hastie’s presentation.

In short, Lasso and Elastic net are solving this optimization problem [2] :

\[\begin{array}{rl} \hat{\beta} = & \underset{\beta}{\text{argmin}}\; \|y-X\beta\|^2_2 \\ \text{subject to} & (1-\alpha)\|\beta\|_1+\alpha\|\beta\|^2_2\le t \end{array}\]

Note that when the above problem is solved without the constraint, we have the familiar least squares regression model. When the constraint is added, depending on how large \(t\) is, some of the “unimportant” predictor variables will become exactly 0 and are forced out of the model. In this formulation, Lasso is the special case of Elastic net with \(\alpha = 0\).

Two of the most important tuning parameters are \(t\) and \(\alpha\) in the above equation.

  • \(t\) is controlled by adding the left-hand side of the constraint as a penalty term in the objective, with penalty parameter \(\lambda\). By default, 100 different values of \(\lambda\) are evaluated. When \(\lambda\) is large, fewer variables are included in the model. You can choose \(\lambda\) yourself if you know how many variables you want in the model, or use cv.glmnet to pick it by cross-validation.

  • The default \(\alpha = 1\) (note that glmnet defines \(\alpha\) in the opposite way from the equation above) gives the Lasso penalty, which is fine in most cases.

The cases for using Elastic net are [3]:

  • Use Elastic net when variables are highly correlated and you want to select more than one predictor variable from a group of correlated variables; Lasso tends to select only one variable from such a group and ignore the others.

  • Use elastic net when there are more variables than observations.

Here we demonstrate Lasso on the Boston Housing data. For parsimony we do not use a training/testing split in this example.

Note: If you have a binary response, you can use the family = "binomial" option in the glmnet() function.

library(MASS)
data(Boston);
colnames(Boston) 
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
install.packages('glmnet')
library(glmnet)

glmnet() does not take a data frame as input, so we have to supply the x matrix and y vector explicitly.

lasso_fit = glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1)
#lambda = 0.5
coef(lasso_fit,s=0.5)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 14.177440452
## crim        -0.013381578
## zn           .          
## indus        .          
## chas         1.565138250
## nox         -0.014613729
## rm           4.237566862
## age          .          
## dis         -0.081500956
## rad          .          
## tax          .          
## ptratio     -0.739162698
## black        0.005955512
## lstat       -0.513806573
#lambda = 1
coef(lasso_fit,s=1)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 15.283205304
## crim         .          
## zn           .          
## indus        .          
## chas         0.074725982
## nox          .          
## rm           3.862575281
## age          .          
## dis          .          
## rad          .          
## tax          .          
## ptratio     -0.620249174
## black        0.001973097
## lstat       -0.496888716
# Solution Paths
plot(lasso_fit)

#use 5-fold cross validation to pick lambda
cv_lasso_fit = cv.glmnet(x = as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), y = Boston$medv, alpha = 1, nfolds = 5)
plot(cv_lasso_fit)

The best \(\lambda\) (the s argument) is given by:

cv_lasso_fit$lambda.min
## [1] 0.009170489
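
A common, more conservative alternative is lambda.1se, the largest \(\lambda\) whose CV error is within one standard error of the minimum; it yields a sparser model:

cv_lasso_fit$lambda.1se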

Given a selected s, you can use predict() this way to obtain predictions:

Boston.insample.prediction = predict(lasso_fit, as.matrix(Boston[, -c(which(colnames(Boston)=='medv'))]), s = cv_lasso_fit$lambda.min)
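
Fitting an Elastic net instead only requires changing alpha; a minimal sketch (the value 0.5 is chosen purely for illustration):

X <- as.matrix(Boston[, -which(colnames(Boston) == 'medv')])
enet_fit <- glmnet(x = X, y = Boston$medv, alpha = 0.5)                     # mixed L1/L2 penalty
cv_enet_fit <- cv.glmnet(x = X, y = Boston$medv, alpha = 0.5, nfolds = 5)   # pick lambda by CV
coef(enet_fit, s = cv_enet_fit$lambda.min)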

Exercise: For a high-dimensional dataset with n < p (n = 200, p = 500), is it possible to fit a linear regression on it directly? Fit a model with the first five variables. How about LASSO variable selection?

[1]: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
[2]: Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
[3]: “Regularization and Variable Selection via the Elastic Net” (talk slides): http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
