STA 210 - Summer 2022
Yunran Chen
Logistic regression for binary response variable
Relationship between odds and probabilities
Use logistic regression model to calculate predicted odds and probabilities
Quantitative outcome variable: linear regression
Categorical outcome variable:
Logistic regression: 2 outcomes (e.g., 1: Yes, 0: No)
Multinomial logistic regression: 3+ outcomes (e.g., 1: Democrat, 2: Republican, 3: Independent)
Students in grades 9-12 were surveyed about health risk behaviors, including whether they usually get 7 or more hours of sleep.
Sleep7
1: yes
0: no
library(tidyverse)
library(Stat2Data)  # contains the YouthRisk2009 data

data(YouthRisk2009)
sleep <- YouthRisk2009 %>%
  as_tibble() %>%
  filter(!is.na(Age), !is.na(Sleep7))
sleep %>%
  relocate(Age, Sleep7)
# A tibble: 446 × 6
Age Sleep7 Sleep SmokeLife SmokeDaily MarijuaEver
<int> <int> <fct> <fct> <fct> <int>
1 16 1 8 hours Yes Yes 1
2 17 0 5 hours Yes Yes 1
3 18 0 5 hours Yes Yes 1
4 17 1 7 hours Yes No 1
5 15 0 4 or less hours No No 0
6 17 0 6 hours No No 0
7 17 1 7 hours No No 0
8 16 1 8 hours Yes No 0
9 16 1 8 hours No No 0
10 18 0 4 or less hours Yes Yes 1
# … with 436 more rows
Outcome: \(Y\) = 1: yes, 0: no
Outcome: Probability of getting 7+ hours of sleep
🛑 This model produces predictions outside of 0 and 1.
✅ This model (called a logistic regression model) only produces predictions between 0 and 1.
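A quick check makes the bound concrete (a sketch; `plogis()` is base R's logistic function, i.e. the probability form of the model):

```r
# plogis(x) = exp(x) / (1 + exp(x)), the probability form of the model
plogis(c(-100, -5, 0, 5, 100))
# every value is strictly between 0 and 1, even for extreme inputs
```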
Method | Outcome | Model |
---|---|---|
Linear regression | Quantitative | \(Y = \beta_0 + \beta_1~ X\) |
Logistic regression | Binary | \(\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1 ~ X\) |
Suppose there is a 70% chance it will rain tomorrow. What are the odds it will rain tomorrow?
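Working through the rain example with \(p = 0.7\):

\[\omega = \frac{p}{1-p} = \frac{0.7}{0.3} \approx 2.33\]

so the odds of rain are about 2.33, or 7 to 3.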
# A tibble: 2 × 3
Sleep7 n p
<int> <int> <dbl>
1 0 150 0.336
2 1 296 0.664
\(P(\text{7+ hours of sleep}) = P(Y = 1) = p = 0.664\)
\(P(\text{< 7 hours of sleep}) = P(Y = 0) = 1 - p = 0.336\)
\(\text{odds of 7+ hours of sleep} = \frac{p}{1-p} = \frac{0.664}{0.336} = 1.976\)
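A `dplyr` pipeline along these lines (a sketch, assuming the `sleep` tibble created earlier) produces the counts and proportions shown above:

```r
sleep %>%
  count(Sleep7) %>%        # n = number of students in each category
  mutate(p = n / sum(n))   # p = proportion in each category
```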
odds
\[\omega = \frac{\pi}{1-\pi}\]
probability
\[\pi = \frac{\omega}{1 + \omega}\]
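As a check, converting the sleep odds back to a probability recovers the value we started with:

\[\pi = \frac{\omega}{1 + \omega} = \frac{1.976}{1 + 1.976} = 0.664\]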
\[\text{probability} = \pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
Logit form: \[\log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~X\]
Probability form:
\[\pi = \frac{\exp\{\beta_0 + \beta_1~X\}}{1 + \exp\{\beta_0 + \beta_1~X\}}\]
This dataset is from an ongoing cardiovascular study of residents of Framingham, Massachusetts. We want to use age to predict whether a randomly selected adult is at high risk of having coronary heart disease in the next 10 years.
high_risk
: 1: high risk of coronary heart disease in the next 10 years, 0: not high risk
age
: Age at exam time (in years)
library(tidyverse)

heart_disease <- read_csv(here::here("slides", "data/framingham.csv")) %>%
  select(age, TenYearCHD) %>%
  drop_na() %>%  # drop observations with NAs
  mutate(high_risk = as.factor(TenYearCHD)) %>%
  select(age, high_risk)
heart_disease
# A tibble: 4,240 × 2
age high_risk
<dbl> <fct>
1 39 0
2 46 0
3 48 0
4 61 1
5 46 0
6 43 0
7 63 1
8 45 0
9 52 0
10 43 0
# … with 4,230 more rows
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -5.561 | 0.284 | -19.599 | 0 |
age | 0.075 | 0.005 | 14.178 | 0 |
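A fit along these lines produces the coefficient table above (a sketch using the tidymodels interface; a base-R `glm(high_risk ~ age, family = binomial)` call gives the same estimates; the object name `heart_fit` is ours, not from the original slides):

```r
library(tidymodels)

# logistic regression of high_risk on age via the glm engine
heart_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(high_risk ~ age, data = heart_disease)

tidy(heart_fit)  # term, estimate, std.error, statistic, p.value
```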
\[\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -5.561 + 0.075 \times \text{age}\] where \(\hat{\pi}\) is the predicted probability of being high risk
# A tibble: 4,240 × 8
high_risk age .fitted .resid .std.resid .hat .sigma .cooksd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 39 -2.65 -0.370 -0.370 0.000466 0.895 0.0000165
2 0 46 -2.13 -0.475 -0.475 0.000322 0.895 0.0000192
3 0 48 -1.98 -0.509 -0.509 0.000288 0.895 0.0000199
4 1 61 -1.01 1.62 1.62 0.000706 0.895 0.000968
5 0 46 -2.13 -0.475 -0.475 0.000322 0.895 0.0000192
6 0 43 -2.35 -0.427 -0.427 0.000384 0.895 0.0000183
7 1 63 -0.858 1.56 1.56 0.000956 0.895 0.00113
8 0 45 -2.20 -0.458 -0.458 0.000342 0.895 0.0000189
9 0 52 -1.68 -0.585 -0.585 0.000262 0.895 0.0000244
10 0 43 -2.35 -0.427 -0.427 0.000384 0.895 0.0000183
# … with 4,230 more rows
For observation 1
\[\text{predicted odds} = \hat{\omega} = \frac{\hat{\pi}}{1-\hat{\pi}} = \exp\{-2.650\} = 0.071\]
# A tibble: 4,240 × 2
.pred_0 .pred_1
<dbl> <dbl>
1 0.934 0.0660
2 0.894 0.106
3 0.878 0.122
4 0.733 0.267
5 0.894 0.106
6 0.913 0.0870
7 0.702 0.298
8 0.900 0.0996
9 0.843 0.157
10 0.913 0.0870
# … with 4,230 more rows
\[\text{predicted probabilities} = \hat{\pi} = \frac{\exp\{-2.650\}}{1 + \exp\{-2.650\}} = 0.066\]
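In R, both conversions for observation 1 (whose `.fitted` log-odds is -2.650) can be done directly (a sketch):

```r
log_odds <- -2.650
exp(log_odds)     # predicted odds:        exp(-2.650) is about 0.071
plogis(log_odds)  # predicted probability: exp(-2.650)/(1 + exp(-2.650)) is about 0.066
```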
For a logistic regression, the default prediction is the class with the higher predicted probability.
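With a tidymodels fit, both prediction types are available via `predict()` (a sketch; `heart_fit` is a hypothetical fitted model object, not named in the original slides):

```r
predict(heart_fit, new_data = heart_disease)                 # default: predicted class (.pred_class)
predict(heart_fit, new_data = heart_disease, type = "prob")  # probabilities: .pred_0, .pred_1
```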
Logistic regression for binary response variable
Relationship between odds and probabilities
Used logistic regression model to calculate predicted odds and probabilities