MLR: Inference

STA 210 - Summer 2022

Author

Yunran Chen

Welcome

Annoucement

Class observation on June 1 by Ed and June 8 by Ben
Certificate in College Teaching (CCT program): Observation Requirement

Topics

Conduct a hypothesis test for \(\beta_j\)
Calculate a confidence interval for \(\beta_j\)
Inference pitfalls

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)      # for tables
library(patchwork)  # for laying out plots
library(rms)        # for vif

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

Data: `rail_trail`

The Pioneer Valley Planning Commission (PVPC) collected data for ninety days from April 5, 2005 to November 15, 2005.
Data collectors set up a laser sensor, with breaks in the laser beam recording when a rail-trail user passed the data collection station.

rail_trail <- read_csv(here::here("slides", "data/rail_trail.csv"))
rail_trail

# A tibble: 90 × 7
   volume hightemp avgtemp season cloudcover precip day_type
    <dbl>    <dbl>   <dbl> <chr>       <dbl>  <dbl> <chr>   
 1    501       83    66.5 Summer       7.60 0      Weekday 
 2    419       73    61   Summer       6.30 0.290  Weekday 
 3    397       74    63   Spring       7.5  0.320  Weekday 
 4    385       95    78   Summer       2.60 0      Weekend 
 5    200       44    48   Spring      10    0.140  Weekday 
 6    375       69    61.5 Spring       6.60 0.0200 Weekday 
 7    417       66    52.5 Spring       2.40 0      Weekday 
 8    629       66    52   Spring       0    0      Weekend 
 9    533       80    67.5 Summer       3.80 0      Weekend 
10    547       79    62   Summer       4.10 0      Weekday 
# … with 80 more rows

Source: Pioneer Valley Planning Commission via the mosaicData package.

Variables

Outcome:

volume estimated number of trail users that day (number of breaks recorded)

Predictors

hightemp daily high temperature (in degrees Fahrenheit)
avgtemp average of daily low and daily high temperature (in degrees Fahrenheit)
season one of “Fall”, “Spring”, or “Summer”
cloudcover measure of cloud cover (in oktas)
precip measure of precipitation (in inches)
day_type one of “weekday” or “weekend”

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

ggplot(rail_trail, aes(x = hightemp, y = volume)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "High temp (F)", y = "Number of riders")

SLR model summary

rt_slr_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(volume ~ hightemp, data = rail_trail)

tidy(rt_slr_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)   -17.1     59.4      -0.288 0.774        
2 hightemp        5.70     0.848     6.72  0.00000000171

SLR hypothesis test

# A tibble: 2 × 5
  term        estimate std.error statistic       p.value
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)   -17.1     59.4      -0.288 0.774        
2 hightemp        5.70     0.848     6.72  0.00000000171

Set hypotheses: \(H_0: \beta_1 = 0\) and \(H_A: \beta_1 \ne 0\)

Calculate test statistic and p-value: The test statistic is \(t = 6.72\) with a degrees of freedom of 88, and a p-value < 0.0001.

State the conclusion: With a small p-value, we reject \(H_0\). The data provide strong evidence that high temperature is a helpful predictor for the number of daily riders.

Multiple linear regression

rt_mlr_main_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(volume ~ hightemp + season, data = rail_trail)

tidy(rt_mlr_main_fit)

# A tibble: 4 × 5
  term         estimate std.error statistic       p.value
  <chr>           <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)   -125.       71.7     -1.75  0.0841       
2 hightemp         7.54      1.17     6.43  0.00000000692
3 seasonSpring     5.13     34.3      0.150 0.881        
4 seasonSummer   -76.8      47.7     -1.61  0.111

MLR hypothesis test: `hightemp`

Set hypotheses: \(H_0: \beta_{hightemp} = 0\) and \(H_A: \beta_{hightemp} \ne 0\), given season is in the model

Calculate test statistic and p-value: The test statistic is \(t = 6.43\) with a degrees of freedom of 86, and a p-value < 0.0001.

State the conclusion: With such a small p-value, the data provides strong evidence against \(H_0\), i.e., the data provide strong evidence that high temperature for the day is a helpful predictor in a model given season is in the model

The model for `season = Spring`

term	estimate	std.error	statistic	p.value
(Intercept)	-125.23	71.66	-1.75	0.08
hightemp	7.54	1.17	6.43	0.00
seasonSpring	5.13	34.32	0.15	0.88
seasonSummer	-76.84	47.71	-1.61	0.11

\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 1 - 76.84 \times 0 \\ &= -120.10 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The model for `season = Summer`

term	estimate	std.error	statistic	p.value
(Intercept)	-125.23	71.66	-1.75	0.08
hightemp	7.54	1.17	6.43	0.00
seasonSpring	5.13	34.32	0.15	0.88
seasonSummer	-76.84	47.71	-1.61	0.11

\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 1 \\ &= -202.07 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The model for `season = Fall`

term	estimate	std.error	statistic	p.value
(Intercept)	-125.23	71.66	-1.75	0.08
hightemp	7.54	1.17	6.43	0.00
seasonSpring	5.13	34.32	0.15	0.88
seasonSummer	-76.84	47.71	-1.61	0.11

\[ \begin{aligned} \widehat{volume} &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times \texttt{seasonSpring} - 76.84 \times \texttt{seasonSummer} \\ &= -125.23 + 7.54 \times \texttt{hightemp} + 5.13 \times 0 - 76.84 \times 0 \\ &= -125.23 + 7.54 \times \texttt{hightemp} \end{aligned} \]

The models

Same slope, different intercepts

season = Spring: \(-120.10 + 7.54 \times \texttt{hightemp}\)
season = Summer: \(-202.07 + 7.54 \times \texttt{hightemp}\)
season = Fall: \(-125.23 + 7.54 \times \texttt{hightemp}\)

Application exercise

📋 github.com/sta210-s22/ae-8-rail-trail

Ex 1. Recreate the following visualization in R based on the results of the model.

Application exercise

📋 ae-8-rail-trail

Ex 2. Add an interaction effect between hightemp and season and comment on the significance of the interaction predictors. Time permitting, visualize the interaction model as well.

Confidence interval for \(\beta_j\)

The \(C%\) confidence interval for \(\beta_j\) \[\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\] where \(t^*\) follows a \(t\) distribution with \(n - p - 1\) degrees of freedom.
In context, we are \(C%\) confident that for every one unit increase in \(x_j\), we expect \(y\) to change by LB to UB units, on average, holding all else constant.
expect + on average + hold all else constant
p: # of predictors in the model w/ dummy variables

Confidence interval for \(\beta_j\)

tidy(rt_mlr_main_fit, conf.int = TRUE)

# A tibble: 4 × 7
  term         estimate std.error statistic       p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)   -125.       71.7     -1.75  0.0841         -268.       17.2 
2 hightemp         7.54      1.17     6.43  0.00000000692     5.21      9.87
3 seasonSpring     5.13     34.3      0.150 0.881           -63.1      73.4 
4 seasonSummer   -76.8      47.7     -1.61  0.111          -172.       18.0

CI for `hightemp`

# A tibble: 4 × 7
  term         estimate std.error statistic       p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)   -125.       71.7     -1.75  0.0841         -268.       17.2 
2 hightemp         7.54      1.17     6.43  0.00000000692     5.21      9.87
3 seasonSpring     5.13     34.3      0.150 0.881           -63.1      73.4 
4 seasonSummer   -76.8      47.7     -1.61  0.111          -172.       18.0

We are 95% confident that for every degrees Fahrenheit the day is warmer, we expect the number of riders to increase by 5.21 to 9.87, on average, holding season constant.

CI for `seasonSpring`

# A tibble: 4 × 7
  term         estimate std.error statistic       p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)   -125.       71.7     -1.75  0.0841         -268.       17.2 
2 hightemp         7.54      1.17     6.43  0.00000000692     5.21      9.87
3 seasonSpring     5.13     34.3      0.150 0.881           -63.1      73.4 
4 seasonSummer   -76.8      47.7     -1.61  0.111          -172.       18.0

We are 95% confident that the number of riders on a Spring day is expected to be lower by 63.1 to higher by 73.4 compared to a Fall day, on average, holding high temperature for the day constant.

Construct HT and CI based on simulation

CI: Bootstrap observations, fit model, obtain Bootstrap distribution
HT: Permute observations, fit model, obtain dist of permuted samples.
Q: Permute wrt single variable? wrt observations? diff permutations for diff variables?
Permutation wrt observations! Hold other variables constant!

Inference pitfalls

Large sample sizes

Danger

If the sample size is large enough, the test will likely result in rejecting \(H_0: \beta_j = 0\) even \(x_j\) has a very small effect on \(y\). (t-statistics increases)

Consider the practical significance of the result not just the statistical significance.
Use the confidence interval to draw conclusions instead of relying only p-values.

Small sample sizes

Danger

If the sample size is small, there may not be enough evidence to reject \(H_0: \beta_j=0\).

When you fail to reject the null hypothesis, DON’T immediately conclude that the variable has no association with the response.
There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

Connections between CI and HT

Instead of checking p-values, we can use CI to do hypothesis testing.
If CI include 0, do not reject; Or else, reject \(H_0\)

Notes on permutation test

Permutation is sensitive to outliers and high-leverage points
Small repetitions may not be enough. (Bootstrap will be more representative)

Welcome

Annoucement

Topics

Computational setup

Data: rail_trail

Variables

Conduct a hypothesis test for \(\beta_j\)

Review: Simple linear regression (SLR)

SLR model summary

SLR hypothesis test

Multiple linear regression

MLR hypothesis test: hightemp

The model for season = Spring

The model for season = Summer

The model for season = Fall

The models

Application exercise

Application exercise

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

Confidence interval for \(\beta_j\)

CI for hightemp

CI for seasonSpring

Construct HT and CI based on simulation

Inference pitfalls

Large sample sizes

Small sample sizes

Connections between CI and HT

Notes on permutation test

Data: `rail_trail`

MLR hypothesis test: `hightemp`

The model for `season = Spring`

The model for `season = Summer`

The model for `season = Fall`

CI for `hightemp`

CI for `seasonSpring`