Chapter 12 OLS with Polynomials and Interactions
Sample R
code for Ramsey and Schafer’s Statistical Sleuth 3ed for POL90: Statistics.
12.1 Introduction
This document is intended to help describe how to undertake analyses introduced as examples in the Second Edition of the Statistical Sleuth (2002) by Fred Ramsey and Dan Schafer. More information about the book can be found at. This file as well as the associated reproducible analysis source file can be found at .
We also set some options to improve legibility of graphs and output.
trellis.par.set(theme=col.mosaic()) # get a better color scheme for lattice
options(digits=3)
The specific goal of this document is to demonstrate how to calculate the quantities described in Chapter 10: Inferential Tools for Multiple Regression using R.
12.2 Galileo’s data on the motion of falling bodies
Galileo investigated the relationship between height and horizontal distance. This is the question addressed in case study 10.1 in the Sleuth.
12.2.1 Data coding, summary statistics and graphical display
We begin by reading the data and summarizing the variables.
<- Sleuth3::case1001
galileo summary(galileo)
## Distance Height
## Min. :253 Min. : 100
## 1st Qu.:366 1st Qu.: 250
## Median :451 Median : 450
## Mean :434 Mean : 493
## 3rd Qu.:514 3rd Qu.: 700
## Max. :573 Max. :1000
head(galileo, 2)
## Distance Height
## 1 253 100
## 2 337 200
There we a total of 7 trials of Galileo’s experiment. For each trial, he recorded the initial height and then measured the horizontal distance as shown in Display 10.1 (page 268).
We can start to explore this relationship by creating a scatterplot of Galileo’s horizontal distances versus initial heights. The following graph is akin to Display 10.2 (page 269).
plot(Distance ~ Height, data=galileo)
12.2.2 2.2 Models
The first model that we created is a cubic model as interpreted on page 269 and summarized in Display 10.13 (page 286).
<- lm(Distance ~ Height, data=galileo)
lm0 summary(lm0)
##
## Call:
## lm(formula = Distance ~ Height, data = galileo)
##
## Residuals:
## 1 2 3 4 5 6 7
## -50.05 0.62 25.29 31.29 25.29 -2.38 -30.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 269.712 24.312 11.09 0.00010 ***
## Height 0.333 0.042 7.93 0.00051 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.7 on 5 degrees of freedom
## Multiple R-squared: 0.926, Adjusted R-squared: 0.912
## F-statistic: 62.9 on 1 and 5 DF, p-value: 0.000513
plot(Distance ~ Height, data=galileo)
abline(lm0)
<- lm(Distance ~ Height+I(Height^2) + I(Height^3),
lm1 data=galileo)
summary(lm1)
##
## Call:
## lm(formula = Distance ~ Height + I(Height^2) + I(Height^3), data = galileo)
##
## Residuals:
## 1 2 3 4 5 6 7
## -2.4036 3.5809 1.8917 -4.4688 -0.0804 2.3216 -0.8414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.56e+02 8.33e+00 18.71 0.00033 ***
## Height 1.12e+00 6.57e-02 16.98 0.00044 ***
## I(Height^2) -1.24e-03 1.38e-04 -8.99 0.00290 **
## I(Height^3) 5.48e-07 8.33e-08 6.58 0.00715 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.01 on 3 degrees of freedom
## Multiple R-squared: 0.999, Adjusted R-squared: 0.999
## F-statistic: 1.6e+03 on 3 and 3 DF, p-value: 2.66e-05
We next decrease the polynomial for Height by one degree to obtain a quadratic model as interpreted on page 269 and summarized in Display 10.7 (page 277). This model is used for most of the following results.
<- lm(Distance ~ Height+I(Height^2), data=galileo)
lm2 summary(lm2)
##
## Call:
## lm(formula = Distance ~ Height + I(Height^2), data = galileo)
##
## Residuals:
## 1 2 3 4 5 6 7
## -14.31 9.17 13.52 1.94 -6.18 -12.61 8.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.00e+02 1.68e+01 11.93 0.00028 ***
## Height 7.08e-01 7.48e-02 9.47 0.00069 ***
## I(Height^2) -3.44e-04 6.68e-05 -5.15 0.00676 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.6 on 4 degrees of freedom
## Multiple R-squared: 0.99, Adjusted R-squared: 0.986
## F-statistic: 205 on 2 and 4 DF, p-value: 9.33e-05
We can fit an equivalent model (with different parametrization) using the function:
<- lm(Distance ~ poly(Height, 5), data=galileo)
lmpoly summary(lmpoly)
##
## Call:
## lm(formula = Distance ~ poly(Height, 5), data = galileo)
##
## Residuals:
## 1 2 3 4 5 6 7
## 0.1594 -0.7323 1.1159 -0.9275 0.4882 -0.1196 0.0159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 434.000 0.646 671.93 0.00095 ***
## poly(Height, 5)1 267.116 1.709 156.31 0.00407 **
## poly(Height, 5)2 -70.194 1.709 -41.08 0.01550 *
## poly(Height, 5)3 26.378 1.709 15.44 0.04118 *
## poly(Height, 5)4 -5.960 1.709 -3.49 0.17777
## poly(Height, 5)5 3.132 1.709 1.83 0.31794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.71 on 1 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.27e+03 on 5 and 1 DF, p-value: 0.0105
The following figure presents the predicted values from the quadratic model and the original data points akin to Display 10.2 (page 269).
<- transform(galileo, pred = predict(lm2))
galileo xyplot(pred+Distance ~ Height, auto.key=TRUE, data=galileo)
To obtain the expected values of \(\hat{\mu}\left(\mathrm{Distance}|\mathrm{Height} = 0 \right)\) and \(\hat{\mu}\left(\mathrm{Distance}|\mathrm{Height} = 250\right)\), we used the command with the quadratic model as shown in Display 10.7 (page 277).
predict(lm2, interval="confidence", data.frame(Height=c(0, 250)))
## fit lwr upr
## 1 200 153 246
## 2 356 337 374
We can also verify the above confidence interval calculations with the following code:
355.1+c(-1, 1)*6.62*qt(.975, 4)
## [1] 337 373
To verify numbers on page 279, an interval for the predicted values , we used the following code:
predict(
lm2, interval="predict",
data.frame(Height=c(0, 250))
)
## fit lwr upr
## 1 200 140 260
## 2 356 313 398
Lastly, we display the ANOVA table for the quadratic model that is interpreted on page 283 (Display 10.11).
anova(lm2) %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
---|---|---|---|---|---|
Height | 1 | 71350.79 | 71350.79 | 383.57 | 0.0000 |
I(Height^2) | 1 | 4927.13 | 4927.13 | 26.49 | 0.0068 |
Residuals | 4 | 744.08 | 186.02 |
12.3 3 Echolocation in bats
How do bats make their way about in the dark? Echolocation requires a lot of energy. Does it depend on mass and species? This is the question addressed in case study 10.2 in the Sleuth.
12.3.1 Data coding, summary statistics and graphical display
We begin by reading the data, performing transformations where necessary and summarizing the variables.
<- Sleuth3::case1002
bats <- transform(bats, logmass = log(Mass))
bats <- transform(bats, logenergy = log(Energy))
bats head(bats,2)
## Mass Type Energy logmass logenergy
## 1 779 non-echolocating bats 43.7 6.66 3.78
## 2 628 non-echolocating bats 34.8 6.44 3.55
A total of nrow(bats)
flying vertebrates were included in this study. There were nrow(subset(bats, Type=="echolocating bats"))
echolocating bats, nrow(subset(bats, Type=="non-echolocating bats"))
non-echolocating bats, and nrow(subset(bats, Type=="non-echolocating birds"))
non-echolocating birds. For each subject their mass and flight energy expenditure were recorded as shown in Display 10.3
(page 270).
We can next observe the pattern between log(energy expenditure) as a function of log(body mass) for each group with a scatterplot. The following figure is akin to Display 10.4 (page 271).
xyplot(
~ Mass,
Energy group=Type,
scales=list(y=list(log=TRUE),
x=list(log=TRUE)),
auto.key=TRUE,
data=bats
)
12.3.2 2.1 Multiple regression
We first evaluate a multiple regression model for log(energy expenditure) given type of species and log(body mass) as defined on page 272 and shown in Display 10.6 (page 273).
<- lm(logenergy ~ logmass, data=bats)
lm1 #stargazer(lm0, type='html')
stargazer(lm1, type = 'html')
Dependent variable: | |
logenergy | |
logmass | 0.809*** |
(0.027) | |
Constant | -1.470*** |
(0.137) | |
Observations | 20 |
R2 | 0.981 |
Adjusted R2 | 0.979 |
Residual Std. Error | 0.180 (df = 18) |
F Statistic | 908.000*** (df = 1; 18) |
Note: | p<0.1; p<0.05; p<0.01 |
<- lm(logenergy ~ logmass + Type, data=bats)
lm2 #summary(lm1)
stargazer(lm1, lm2, type='html')
Dependent variable: | ||
logenergy | ||
(1) | (2) | |
logmass | 0.809*** | 0.815*** |
(0.027) | (0.045) | |
Typenon-echolocating bats | -0.079 | |
(0.203) | ||
Typenon-echolocating birds | 0.024 | |
(0.158) | ||
Constant | -1.470*** | -1.500*** |
(0.137) | (0.150) | |
Observations | 20 | 20 |
R2 | 0.981 | 0.982 |
Adjusted R2 | 0.979 | 0.978 |
Residual Std. Error | 0.180 (df = 18) | 0.186 (df = 16) |
F Statistic | 908.000*** (df = 1; 18) | 284.000*** (df = 3; 16) |
Note: | p<0.1; p<0.05; p<0.01 |
Next, we calculated confidence intervals for the coefficients which are interpreted on page 274.
confint(lm1)
## 2.5 % 97.5 %
## (Intercept) -1.756 -1.180
## logmass 0.752 0.865
exp(confint(lm1))
## 2.5 % 97.5 %
## (Intercept) 0.173 0.307
## logmass 2.122 2.375
Since the significance of a model depends on which variables are included, the Sleuth proposes two other models, one only looking at the type of flying animal and the other allows the three groups to have different straight-line regressions with mass. These two models are displayed below and discussed on pages 274–275.
summary(lm(logenergy ~ Type, data=bats))
##
## Call:
## lm(formula = logenergy ~ Type, data = bats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8872 -0.3994 0.0236 0.4932 1.5253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.653 0.422 1.55 0.14058
## Typenon-echolocating bats 2.743 0.597 4.59 0.00026 ***
## Typenon-echolocating birds 2.134 0.488 4.38 0.00041 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.845 on 17 degrees of freedom
## Multiple R-squared: 0.595, Adjusted R-squared: 0.548
## F-statistic: 12.5 on 2 and 17 DF, p-value: 0.000458
summary(lm(logenergy ~ Type * logmass, data=bats))
##
## Call:
## lm(formula = logenergy ~ Type * logmass, data = bats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2515 -0.1264 -0.0095 0.0812 0.3284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4705 0.2477 -5.94 3.6e-05 ***
## Typenon-echolocating bats 1.2681 1.2854 0.99 0.34
## Typenon-echolocating birds -0.1103 0.3847 -0.29 0.78
## logmass 0.8047 0.0867 9.28 2.3e-07 ***
## Typenon-echolocating bats:logmass -0.2149 0.2236 -0.96 0.35
## Typenon-echolocating birds:logmass 0.0307 0.1028 0.30 0.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.19 on 14 degrees of freedom
## Multiple R-squared: 0.983, Adjusted R-squared: 0.977
## F-statistic: 163 on 5 and 14 DF, p-value: 6.7e-12
To construct the confidence bands discussed on page 277 and shown in Display 10.9 (page 279) we used the following code:
<- predict(
pred
lm1, se.fit=TRUE,
newdata=data.frame(
Type=c("non-echolocating birds", "non-echolocating birds"),
logmass=c(log(100), log(400))
)
)<- pred$fit[1]
pred.fit pred.fit
## 1
## 2.26
<- pred$se.fit[1]
pred.se pred.se
## 1
## 0.0409
<- sqrt(4*qf(.95, 4, 16));
multiplier multiplier
## [1] 3.47
<- exp(pred.fit-pred.se*multiplier)
lower lower
## 1
## 8.28
<- exp(pred.fit+pred.se*multiplier)
upper upper
## 1
## 11
# for the other reference points
<- predict(
pred2
lm1, se.fit=TRUE,
newdata=data.frame(
Type=c("non-echolocating bats", "non-echolocating bats"),
logmass=c(log(100), log(400))
)
)
<- predict(
pred3
lm1, se.fit=TRUE,
newdata=data.frame(
Type=c("echolocating bats", "echolocating bats"),
logmass=c(log(100), log(400))
)
)
.9 <- rbind(
table10c("Intercept estimate", "Standard error"),
round(cbind(pred2$fit, pred2$se.fit), 4),
round(cbind(pred3$fit, pred3$se.fit), 4)
)
.9 table10
## [,1] [,2]
## "Intercept estimate" "Standard error"
## 1 "2.2555" "0.0409"
## 2 "3.3765" "0.05"
## 1 "2.2555" "0.0409"
## 2 "3.3765" "0.05"
Next we can assess the model by evaluating the extra sums of squares \(F\)-test for testing the equality of intercepts in the parallel regression lines model as shown in Display 10.10 (page 282).
<- lm(logenergy ~ logmass, data=bats)
lm1 <- lm(logenergy ~ logmass + Type, data=bats)
lm2
anova(lm1, lm2) %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
---|---|---|---|---|---|---|
1 | 18 | 0.58 | ||||
2 | 16 | 0.55 | 2 | 0.03 | 0.43 | 0.6593 |
We can also compare the full model with interaction terms and the reduced model (without interaction terms) with the extra sum of squares \(F\)-test as described in Display 10.12 (page 284).
<- lm(logenergy ~ logmass * Type, data=bats)
lm3 stargazer(lm1, lm2, lm3, type = 'html',
omit.stat = c("f", "ser"))
Dependent variable: | |||
logenergy | |||
(1) | (2) | (3) | |
logmass | 0.809*** | 0.815*** | 0.805*** |
(0.027) | (0.045) | (0.087) | |
Typenon-echolocating bats | -0.079 | 1.270 | |
(0.203) | (1.280) | ||
Typenon-echolocating birds | 0.024 | -0.110 | |
(0.158) | (0.385) | ||
logmass:Typenon-echolocating bats | -0.215 | ||
(0.224) | |||
logmass:Typenon-echolocating birds | 0.031 | ||
(0.103) | |||
Constant | -1.470*** | -1.500*** | -1.470*** |
(0.137) | (0.150) | (0.248) | |
Observations | 20 | 20 | 20 |
R2 | 0.981 | 0.982 | 0.983 |
Adjusted R2 | 0.979 | 0.978 | 0.977 |
Note: | p<0.1; p<0.05; p<0.01 |
anova(lm2, lm3) %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
---|---|---|---|---|---|---|
1 | 16 | 0.55 | ||||
2 | 14 | 0.50 | 2 | 0.05 | 0.67 | 0.5265 |
anova(lm1, lm2, lm3) %>%
xtable() %>%
print(type = "html",
comment = FALSE) # turn off message about package
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) | |
---|---|---|---|---|---|---|
1 | 18 | 0.58 | ||||
2 | 16 | 0.55 | 2 | 0.03 | 0.41 | 0.6713 |
3 | 14 | 0.50 | 2 | 0.05 | 0.67 | 0.5265 |
This supplement was put together by Omar Wasow. It builds on prior work by Project Mosaic, an “NSF-funded initiative to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum.” We are grateful to Nicholas Horton, Linda Loi, Kate Aloisio, and Ruobing Zhang for their effort developing The Statistical Sleuth in R.(horton2013statistical?)