# load packages
library(tidyverse)
library(tidymodels)
library(knitr) # for tables
library(patchwork) # for laying out plots
library(rms) # for vif
# set default theme and larger font size for ggplot2
::theme_set(ggplot2::theme_minimal(base_size = 20)) ggplot2
MLR: Inference
STA 210 - Summer 2022
Welcome
Annoucement
- Class observation on June 1 by Ed and June 8 by Ben
- Certificate in College Teaching (CCT program): Observation Requirement
Topics
Conduct a hypothesis test for \(\beta_j\)
Calculate a confidence interval for \(\beta_j\)
Inference pitfalls
Computational setup
Data: rail_trail
- The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
- Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.
<- read_csv(here::here("slides", "data/rail_trail.csv"))
rail_trail rail_trail
# A tibble: 90 × 7
volume hightemp avgtemp season cloudcover precip day_type
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 501 83 66.5 Summer 7.60 0 Weekday
2 419 73 61 Summer 6.30 0.290 Weekday
3 397 74 63 Spring 7.5 0.320 Weekday
4 385 95 78 Summer 2.60 0 Weekend
5 200 44 48 Spring 10 0.140 Weekday
6 375 69 61.5 Spring 6.60 0.0200 Weekday
7 417 66 52.5 Spring 2.40 0 Weekday
8 629 66 52 Spring 0 0 Weekend
9 533 80 67.5 Summer 3.80 0 Weekend
10 547 79 62 Summer 4.10 0 Weekday
# … with 80 more rows
Source: Pioneer Valley Planning Commission via the mosaicData package.
Variables
Outcome:
volume
estimated number of trail users that day (number of breaks recorded)
Predictors
hightemp
daily high temperature (in degrees Fahrenheit)avgtemp
average of daily low and daily high temperature (in degrees Fahrenheit)season
one of “Fall”, “Spring”, or “Summer”cloudcover
measure of cloud cover (in oktas)precip
measure of precipitation (in inches)day_type
one of “weekday” or “weekend”
Conduct a hypothesis test for \(\beta_j\)
Review: Simple linear regression (SLR)
ggplot(rail_trail, aes(x = hightemp, y = volume)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "High temp (F)", y = "Number of riders")
SLR model summary
<- linear_reg() %>%
rt_slr_fit set_engine("lm") %>%
fit(volume ~ hightemp, data = rail_trail)
tidy(rt_slr_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -17.1 59.4 -0.288 0.774
2 hightemp 5.70 0.848 6.72 0.00000000171
SLR hypothesis test
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -17.1 59.4 -0.288 0.774
2 hightemp 5.70 0.848 6.72 0.00000000171
- Set hypotheses: \(H_0: \beta_1 = 0\) and \(H_A: \beta_1 \ne 0\)
- Calculate test statistic and p-value: The test statistic is \(t = 6.72\) with a degrees of freedom of 88, and a p-value < 0.0001.
- State the conclusion: With a small p-value, we reject \(H_0\). The data provide strong evidence that high temperature is a helpful predictor for the number of daily riders.
Multiple linear regression
<- linear_reg() %>%
rt_mlr_main_fit set_engine("lm") %>%
fit(volume ~ hightemp + season, data = rail_trail)
tidy(rt_mlr_main_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -125. 71.7 -1.75 0.0841
2 hightemp 7.54 1.17 6.43 0.00000000692
3 seasonSpring 5.13 34.3 0.150 0.881
4 seasonSummer -76.8 47.7 -1.61 0.111
MLR hypothesis test: hightemp
- Set hypotheses: \(H_0: \beta_{hightemp} = 0\) and \(H_A: \beta_{hightemp} \ne 0\), given
season
is in the model
- Calculate test statistic and p-value: The test statistic is \(t = 6.43\) with a degrees of freedom of 86, and a p-value < 0.0001.
- State the conclusion: With such a small p-value, the data provides strong evidence against \(H_0\), i.e., the data provide strong evidence that high temperature for the day is a helpful predictor in a model given
season
is in the model
The model for season = Spring
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 1 - 76.84 \times 0 \\ &= -120.10 + 7.54 \times \texttt{hightemp} \end{aligned} \]
The model for season = Summer
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 1 \\ &= -202.07 + 7.54 \times \texttt{hightemp} \end{aligned} \]
The model for season = Fall
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -125.23 | 71.66 | -1.75 | 0.08 |
hightemp | 7.54 | 1.17 | 6.43 | 0.00 |
seasonSpring | 5.13 | 34.32 | 0.15 | 0.88 |
seasonSummer | -76.84 | 47.71 | -1.61 | 0.11 |
\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 0 \\ &= -125.23 + 7.54 \times \texttt{hightemp} \end{aligned} \]
The models
Same slope, different intercepts
season = Spring
: \(-120.10 + 7.54 \times \texttt{hightemp}\)season = Summer
: \(-202.07 + 7.54 \times \texttt{hightemp}\)season = Fall
: \(-125.23 + 7.54 \times \texttt{hightemp}\)
Application exercise
Ex 1. Recreate the following visualization in R based on the results of the model.
Application exercise
Ex 2. Add an interaction effect between hightemp
and season
and comment on the significance of the interaction predictors. Time permitting, visualize the interaction model as well.
Confidence interval for \(\beta_j\)
Confidence interval for \(\beta_j\)
The \(C%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
In context, we are \(C%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, on average, holding all else constant.
expect + on average + hold all else constant
p: # of predictors in the model w/ dummy variables
Confidence interval for \(\beta_j\)
tidy(rt_mlr_main_fit, conf.int = TRUE)
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -125. 71.7 -1.75 0.0841 -268. 17.2
2 hightemp 7.54 1.17 6.43 0.00000000692 5.21 9.87
3 seasonSpring 5.13 34.3 0.150 0.881 -63.1 73.4
4 seasonSummer -76.8 47.7 -1.61 0.111 -172. 18.0
CI for hightemp
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -125. 71.7 -1.75 0.0841 -268. 17.2
2 hightemp 7.54 1.17 6.43 0.00000000692 5.21 9.87
3 seasonSpring 5.13 34.3 0.150 0.881 -63.1 73.4
4 seasonSummer -76.8 47.7 -1.61 0.111 -172. 18.0
We are 95% confident that for every degrees Fahrenheit the day is warmer, we expect the number of riders to increase by 5.21 to 9.87, on average, holding season constant.
CI for seasonSpring
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -125. 71.7 -1.75 0.0841 -268. 17.2
2 hightemp 7.54 1.17 6.43 0.00000000692 5.21 9.87
3 seasonSpring 5.13 34.3 0.150 0.881 -63.1 73.4
4 seasonSummer -76.8 47.7 -1.61 0.111 -172. 18.0
We are 95% confident that the number of riders on a Spring day is expected to be lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.
Construct HT and CI based on simulation
- CI: Bootstrap observations, fit model, obtain Bootstrap distribution
- HT: Permute observations, fit model, obtain dist of permuted samples.
- Q: Permute wrt single variable? wrt observations? diff permutations for diff variables?
- Permutation wrt observations! Hold other variables constant!
Inference pitfalls
Large sample sizes
Small sample sizes
Connections between CI and HT
- Instead of checking p-values, we can use CI to do hypothesis testing.
- If CI include 0, do not reject; Or else, reject \(H_0\)
Notes on permutation test
- Permutation is sensitive to outliers and high-leverage points
- Small repetitions may not be enough. (Bootstrap will be more representative)