1 mcycle motorcycle accident dataset

mcycle is a data frame giving a series of measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets. It is a univariate case: we are interested in the relationship between the time in milliseconds after impact and the head acceleration.

library(MASS)
data('mcycle')
str(mcycle)
summary(mcycle)
# Rename the variables for ease of usage
Y <- mcycle$accel
X <- mcycle$times

#Scatterplot
plot(Y~X, xlab="time",ylab="Acceleration", main="Scatterplot of Acceleration against Time")

2 Simple Linear Regression

We can start by assuming a linear relationship and fitting a simple linear regression of acceleration on time.

lm_mod <- lm(Y~X, data= mcycle)
summary(lm_mod)

Fitted Regression Line

But once we draw the fitted line on the scatterplot, we can clearly see that the linearity assumption is violated.

plot(X, Y, xlab="Times", ylab="Acceleration", main="Simple Linear Regression Line")
abline(lm_mod, col="blue", lwd = 1)

3 Polynomial Regression

If we assume there could be a polynomial relationship, we can try polynomial regression. The coefficients can be easily estimated using least squares linear regression because this is just a standard linear model with predictors \(x, x^2, x^3, \dots, x^d\).
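As a quick optional illustration of this point, the design matrix of a degree-3 polynomial fit is just an ordinary linear-model matrix whose columns are \(x, x^2, x^3\) (this snippet is only for inspection and is not needed for the fits below):

# Optional illustration: a degree-3 polynomial fit uses an ordinary
# design matrix whose columns are the intercept, X, X^2 and X^3.
head(model.matrix(~ poly(X, 3, raw = TRUE)))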

3.1 Quadratic

We can fit a quadratic regression as follows, simply by adding I(X^2) to the formula:

quad_mod <- lm(Y~X+I(X^2), data=mcycle) 
summary(quad_mod)
## 
## Call:
## lm(formula = Y ~ X + I(X^2), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.527 -30.817   9.589  29.210 104.728 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -15.22700   15.49471  -0.983  0.32757   
## X            -2.30749    1.20439  -1.916  0.05757 . 
## I(X^2)        0.05935    0.02038   2.912  0.00422 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.06 on 130 degrees of freedom
## Multiple R-squared:  0.1437, Adjusted R-squared:  0.1306 
## F-statistic: 10.91 on 2 and 130 DF,  p-value: 4.167e-05

It seems that the fitted line captures the curvature a little bit better than the linear regression.

plot(X ,Y ,xlab="Times", main = "Quadratic",ylab="Acceleration",cex=.5)
lines(X,quad_mod$fitted.values, col="blue", lwd = 1)

Is this model superior to the simple linear regression model?

In order to answer this question, we can conduct an ANOVA F test to compare these two nested models; a small p-value indicates that the quadratic term significantly improves the fit.

anova(lm_mod,quad_mod)
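As an optional cross-check of the nested F test, the two models can also be compared with information criteria, where lower values are better (a sketch using base R's AIC() and BIC()):

# Optional cross-check: compare both models by AIC and BIC
AIC(lm_mod, quad_mod)
BIC(lm_mod, quad_mod)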

3.2 Fifth-degree Polynomial

We can also try a higher-order polynomial regression, for example a fifth-degree polynomial.

poly_mod <- lm(Y~poly(X,5,raw=T),data=mcycle) 
summary(poly_mod)
## 
## Call:
## lm(formula = Y ~ poly(X, 5, raw = T), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.271 -21.285   0.975  25.386  82.371 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.059e+02  3.499e+01  -3.026    0.003 ** 
## poly(X, 5, raw = T)1  4.978e+01  1.036e+01   4.804 4.31e-06 ***
## poly(X, 5, raw = T)2 -6.359e+00  1.015e+00  -6.266 5.30e-09 ***
## poly(X, 5, raw = T)3  2.969e-01  4.253e-02   6.982 1.45e-10 ***
## poly(X, 5, raw = T)4 -5.742e-03  7.932e-04  -7.239 3.83e-11 ***
## poly(X, 5, raw = T)5  3.919e-05  5.406e-06   7.250 3.60e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.9 on 127 degrees of freedom
## Multiple R-squared:  0.5264, Adjusted R-squared:  0.5078 
## F-statistic: 28.24 on 5 and 127 DF,  p-value: < 2.2e-16

You can also assess the model performance with the residual variance, R-squared, adjusted R-squared, AIC and BIC.

poly_mod_summary <- summary(poly_mod)
(poly_mod_summary$sigma)^2        # estimated error variance
poly_mod_summary$r.squared        # R-squared
poly_mod_summary$adj.r.squared    # adjusted R-squared
AIC(poly_mod)
BIC(poly_mod)
plot(X ,Y ,xlab="Times", main = "Fifth-degree polynomial",ylab="Acceleration",cex=.5)
lines(X,poly_mod$fitted.values, col="blue", lwd = 1)

This fit captures the curvature of the samples on the left-hand side of the scatterplot even better than the two models above.


4 Splines

4.1 Regression Splines

In order to fit regression splines in R, we use the bs() function from the splines library. By default, cubic splines are produced; with no knots specified, this amounts to a cubic polynomial with no interior knots.

library(splines)
reg_sp <- lm(Y~bs(X),data=mcycle)
summary(reg_sp)
## 
## Call:
## lm(formula = Y ~ bs(X), data = mcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.083 -28.772   2.935  31.004  94.108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41.26      15.42   2.675 0.008433 ** 
## bs(X)1       -258.14      41.07  -6.285 4.66e-09 ***
## bs(X)2        110.67      28.33   3.906 0.000151 ***
## bs(X)3        -72.91      27.99  -2.605 0.010274 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40 on 129 degrees of freedom
## Multiple R-squared:  0.3303, Adjusted R-squared:  0.3147 
## F-statistic: 21.21 on 3 and 129 DF,  p-value: 3.134e-11
plot(X ,Y ,xlab="Times", main = "Regression Spline",ylab="Acceleration",cex=.5)
lines(X,reg_sp$fitted.values, col="blue", lwd = 1)

conf_interval <- predict(reg_sp, interval="confidence",
                         level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)

You can also specify the knot locations.

fit_sp <- lm(Y ~ bs(X, knots = c(15.6, 23.4, 34.8)), data = mcycle)  # knots at chosen time points
summary(fit_sp)
AIC(fit_sp)

You can also specify the degrees of freedom.

reg_sp2=lm(Y~bs(X,df=10),data=mcycle) 

plot(X ,Y ,xlab="Times", main = "Regression Spline with df=10",ylab="Acceleration",cex=.5)
lines(X,reg_sp2$fitted.values, col="blue", lwd = 1)

conf_interval <- predict(reg_sp2, interval="confidence",
                         level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)

4.2 Natural Cubic Splines

Natural cubic splines can be fit with the ns() function. Here the degrees of freedom are pre-specified, and several values are tried to see which curve fits the data best (an informal comparison of the three fits follows below).

#First: Natural Spline- pre-specified degree of freedom=4
fit2=lm(Y~ns(X,df=4),data=mcycle) 
plot(X ,Y,main= "Natural Cubic Spline with df=4", xlab="Times", ylab="Acceleration") 
lines(X, fit2$fitted.values)

conf_interval <- predict(fit2, interval="confidence",
                         level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)

fit2c=lm(Y~ns(X,df=10),data=mcycle) 

plot(X ,Y , main= "Natural Cubic Spline with df=10", xlab="Times", ylab="Acceleration") 
lines(X, fit2c$fitted.values)

conf_interval <- predict(fit2c, interval="confidence",
                         level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)

fit2d=lm(Y~ns(X,df=20),data=mcycle) 

plot(X ,Y, main= "Natural Cubic Spline with df=20", xlab="Times", ylab="Acceleration") 
lines(X, fit2d$fitted.values)

conf_interval <- predict(fit2d, interval="confidence",
                         level = 0.95)
lines(X, conf_interval[,2], col="red", lty=2)
lines(X, conf_interval[,3], col="red", lty=2)
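Since the three natural-spline fits differ only in their degrees of freedom, one informal way to compare them is by AIC (a sketch; lower values suggest a better balance of fit and complexity):

# Informal comparison of the three natural cubic spline fits by AIC
AIC(fit2, fit2c, fit2d)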

4.3 Smoothing/Penalized Spline

We now use penalized splines, where a penalty/smoothing parameter controls the smoothness; many knots can be used, and knot locations do not need to be carefully selected. The s() function from the mgcv package specifies smooth terms inside a gam() formula.
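A minimal sketch of such a fit, assuming the mgcv package is installed, could look like this; gam() selects the smoothing parameter automatically (by GCV by default), so no knots need to be chosen by hand:

library(mgcv)
# Penalized (smoothing) spline fit of acceleration on time
gam_mod <- gam(accel ~ s(times), data = mcycle)
summary(gam_mod)

plot(X, Y, xlab = "Times", ylab = "Acceleration",
     main = "Penalized Spline (mgcv)", cex = .5)
lines(X, fitted(gam_mod), col = "blue", lwd = 1)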
