```
# load packages
library(tidyverse) # for data wrangling and visualization
library(tidymodels) # for modeling
library(openintro) # for the duke_forest dataset
library(scales) # for pretty axis labels
library(knitr) # for pretty tables
library(kableExtra) # also for pretty tables
library(patchwork) # arrange plots
# set default theme and larger font size for ggplot2
::theme_set(ggplot2::theme_minimal(base_size = 20)) ggplot2
```

# SLR: Model diagnostics

STA 210 - Summer 2022

# Welcome

## Previous Questions

- Interpretation of CI: In frequentists’ world, we do not say we are 95% confident that the slope falls into [90,200], or the slope is between 90 and 200. Since the slope is a fixed value that always unknown to us instead of a variable! We should interpret it as : if we have repeated trials, 95% of time, caculated CI will cover the true slope. Since the CI here is a variable which vary with realizations.

## Computational setup

# Mathematical models for inference

## The regression model, revisited

```
<- linear_reg() %>%
df_fit set_engine("lm") %>%
fit(price ~ area, data = duke_forest)
tidy(df_fit) %>%
kable(digits = 2)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |

area | 159.48 | 18.17 | 8.78 | 0.00 |

## HT for the slope

**Hypotheses:** \(H_0: \beta_1 = 0\) vs. \(H_A: \beta_1 \ne 0\)

**Test statistic:** Number of standard errors the estimate is away from the null

\[ T = \frac{\text{Estimate - Null}}{\text{Standard error}} \\ \]

**p-value:** Probability of observing a test statistic at least as extreme (in the direction of the alternative hypothesis) from the null value as the one observed

\[ p-value = P(|t| > \text{test statistic}), \]

calculated from a \(t\) distribution with \(n - 2\) degrees of freedom

## HT: Test statistic

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |

area | 159.48 | 18.17 | 8.78 | 0.00 |

\[ t = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}} = \frac{159.48 - 0}{18.17} = 8.78 \]

## HT: p-value

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |

area | 159.48 | 18.17 | 8.78 | 0.00 |

## Understanding the p-value

Magnitude of p-value | Interpretation |
---|---|

p-value < 0.01 | strong evidence against \(H_0\) |

0.01 < p-value < 0.05 | moderate evidence against \(H_0\) |

0.05 < p-value < 0.1 | weak evidence against \(H_0\) |

p-value > 0.1 | effectively no evidence against \(H_0\) |

## HT: Conclusion, in context

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |

area | 159.48 | 18.17 | 8.78 | 0.00 |

- The data provide convincing evidence that the population slope \(\beta_1\) is different from 0.
- The data provide convincing evidence of a linear relationship between area and price of houses in Duke Forest.

## CI for the slope

\[ \text{Estimate} \pm \text{ (critical value) } \times \text{SE} \]

\[ \hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1} \]

where \(t^*\) is calculated from a \(t\) distribution with \(n-2\) degrees of freedom

## CI: Critical value

`qt`

quantile of t-dist

```
# confidence level: 95%
qt(0.975, df = nrow(duke_forest) - 2)
```

`[1] 1.984984`

```
# confidence level: 90%
qt(0.95, df = nrow(duke_forest) - 2)
```

`[1] 1.660881`

```
# confidence level: 99%
qt(0.995, df = nrow(duke_forest) - 2)
```

`[1] 2.628016`

## 95% CI for the slope: Calculation

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 |

area | 159.48 | 18.17 | 8.78 | 0.00 |

\[\hat{\beta}_1 = 159.48 \hspace{15mm} t^* = 1.98 \hspace{15mm} SE_{\hat{\beta}_1} = 18.17\]

\[159.48 \pm 1.98 \times 18.17 = (123.50, 195.46)\]

## 95% CI for the slope: Computation

```
tidy(df_fit, conf.int = TRUE, conf.level = 0.95) %>%
kable(digits = 2)
```

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|

(Intercept) | 116652.33 | 53302.46 | 2.19 | 0.03 | 10847.77 | 222456.88 |

area | 159.48 | 18.17 | 8.78 | 0.00 | 123.41 | 195.55 |

## Confidence interval for predictions

- Suppose we want to answer the question
*“What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”* - We said reporting a single estimate for the slope is not wise, and we should report a plausible range instead
- Similarly, reporting a single prediction for a new value is not wise, and we should report a plausible range instead

## Two types of predictions

Prediction for the

**mean**: “What is the average predicted sale price of Duke Forest houses that are 2,800 square feet?”Prediction for an

**individual observation**: “What is the predicted sale price of a Duke Forest house that is 2,800 square feet?”

Which would you expect to be more variable? The average prediction or the prediction for an individual observation? Based on your answer, how would you expect the widths of plausible ranges for these two predictions to compare?

## Uncertainty in predictions

**Confidence interval for the mean outcome:** \[\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE}_{\hat{\boldsymbol{\mu}}}}\]

**Prediction interval for an individual observation:** \[\hat{y} \pm t_{n-2}^* \times \color{purple}{\mathbf{SE_{\hat{y}}}}\]

## Standard errors

**Standard error of the mean outcome:** \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

**Standard error of an individual outcome:** \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

## Standard errors

**Standard error of the mean outcome:** \[SE_{\hat{\mu}} = \hat{\sigma}_\epsilon\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

**Standard error of an individual outcome:** \[SE_{\hat{y}} = \hat{\sigma}_\epsilon\sqrt{\mathbf{\color{purple}{\Large{1}}} + \frac{1}{n} + \frac{(x-\bar{x})^2}{\sum\limits_{i=1}^n(x_i - \bar{x})^2}}\]

## Confidence interval

The 95% **confidence interval** for the *mean* outcome:

```
<- tibble(area = 2800)
new_house
predict(df_fit, new_data = new_house, type = "conf_int", level = 0.95)
```

```
# A tibble: 1 × 2
.pred_lower .pred_upper
<dbl> <dbl>
1 529351. 597060.
```

We are 95% confident that mean sale price of Duke Forest houses that are 2,800 square feet is between $529,351 and $597,060.

## Prediction interval

The 95% **prediction interval** for the *individual* outcome:

`predict(df_fit, new_data = new_house, type = "pred_int", level = 0.95)`

```
# A tibble: 1 × 2
.pred_lower .pred_upper
<dbl> <dbl>
1 226438. 899973.
```

We are 95% confident that predicted sale price of a Duke Forest house that is 2,800 square feet is between $226,438 and $899,973.

## Comparing intervals

## Extrapolation

Calculate the prediction interval for the sale price of a “tiny house” in Duke Forest that is 225 square feet.

# Model conditions

## Model conditions

**Linearity:**There is a linear relationship between the outcome and predictor variables**Constant variance:**The variability of the errors is equal for all values of the predictor variable, i.e. the errors are homeoscedastic**Normality:**The errors follow a normal distribution**Independence:**The errors are independent from each other

## Linearity

✅ The residuals vs. fitted values plot should not show a random scatter of residuals (no distinguishable pattern or structure)

## Residuals vs. fitted values

```
<- augment(df_fit$fit)
df_aug
ggplot(df_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
ylim(-1000000, 1000000) +
labs(
x = "Fitted value", y = "Residual",
title = "Residuals vs. fitted values"
)
```

## Application exercise

## Non-linear relationships

## Constant variance

✅ The vertical spread of the residuals should be relatively constant across the plot

## Non-constant variance

## Normality

## Independence

We can often check the independence assumption based on the context of the data and how the observations were collected

If the data were collected in a particular order, examine a scatterplot of the residuals versus order in which the data were collected

✅ If this is a random sample of Duke Houses, the error for one house does not tell us anything about the error for another use

## Recap

Used residual plots to check conditions for SLR:

- Linearity
- Constant variance

- Normality
- Independence

Which of these conditions are required for fitting a SLR? Which for simulation-based inference for the slope for an SLR? Which for inference with mathematical models?

- fitting: linearity + independence
- inference (simulation-based): linearity + independence
- inference (mathematical): … + normality + constant variance