Probabilities, odds, odds ratios

STA 210 - Summer 2022

Yunran Chen

Welcome

Announcements

Exam 2 scores for part 1 are posted
Project proposals + AE 9 due Wednesday, June 8, at 11:59pm

Exam 2

Conceptual part
Applied part

Topics

Use the odds ratio to compare the odds of two groups
Interpret the coefficients of a logistic regression model with
- a single categorical predictor
- a single quantitative predictor
- multiple predictors

Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(knitr)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

Odds ratios

Risk of coronary heart disease

This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease.

high_risk:
- 1: High risk of having heart disease in next 10 years
- 0: Not high risk of having heart disease in next 10 years
age: Age at exam time (in years)
education: 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College

High risk vs. education

Education	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403

Compare the odds for two groups

We want to compare the risk of heart disease for those with a High School diploma/GED and those with a college degree.
We’ll use the odds to compare the two groups.

\[ \text{odds} = \frac{P(\text{success})}{P(\text{failure})} = \frac{\text{# of successes}}{\text{# of failures}} \]
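The two forms of the odds formula agree because the group size cancels out of the ratio. A minimal sketch in base R, using the Some high school counts from the table (the variable names are illustrative and not part of the course code):

```r
# Counts for the Some high school group, taken from the table above
n_high_risk     <- 323
n_not_high_risk <- 1397

# Probability of "success" (high risk) and "failure" (not high risk)
p_success <- n_high_risk / (n_high_risk + n_not_high_risk)
p_failure <- 1 - p_success

# Odds computed two ways: from probabilities and directly from counts
odds_from_probs  <- p_success / p_failure
odds_from_counts <- n_high_risk / n_not_high_risk

# Going back from odds to a probability: p = odds / (1 + odds)
p_from_odds <- odds_from_counts / (1 + odds_from_counts)
```

Both `odds_from_probs` and `odds_from_counts` should come out to about 0.231, and `p_from_odds` should recover `p_success`.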

Compare the odds for two groups

Odds of having high risk for the High school or GED group: \(\frac{147}{1106} = 0.133\)
Odds of having high risk for the College group: \(\frac{70}{403} = 0.174\)
Based on this, we see those with a college degree had higher odds of having high risk for heart disease than those with a high school diploma or GED.
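The same calculation can be done for all four education groups at once. A small sketch in base R, with the counts typed in from the table rather than read from the `heart_disease` data (the data frame name `risk_counts` is an assumption made for this example):

```r
# Counts copied from the High risk vs. education table
risk_counts <- data.frame(
  education     = c("Some high school", "High school or GED",
                    "Some college or vocational school", "College"),
  high_risk     = c(323, 147, 88, 70),
  not_high_risk = c(1397, 1106, 601, 403)
)

# Odds of high risk = number at high risk / number not at high risk
risk_counts$odds <- risk_counts$high_risk / risk_counts$not_high_risk

# Round only for display; keep full precision for later calculations
round(risk_counts$odds, 3)
```

Rounded to three decimals, the High school or GED and College values should match the 0.133 and 0.174 computed by hand.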

Odds ratio (OR)

Let’s summarize the relationship between the two groups. To do so, we’ll use the odds ratio (OR).

\[ OR = \frac{\text{odds}_1}{\text{odds}_2} = \frac{\omega_1}{\omega_2} \]

OR: College vs. High school or GED

\[OR = \frac{\text{odds}_{College}}{\text{odds}_{HS}} = \frac{0.174}{0.133} = \mathbf{1.308}\]

The odds of having high risk for heart disease for those with a college degree are 1.31 times the odds for those with a high school diploma or GED.
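The odds ratio can also be computed straight from the four cell counts, which avoids carrying rounding error from the intermediate odds. A sketch in base R; `odds_ratio()` is a hypothetical helper written for this example, not a function from any package used in the course:

```r
# Hypothetical helper: odds ratio of group 1 relative to group 2,
# computed directly from the four cell counts
odds_ratio <- function(success_1, failure_1, success_2, failure_2) {
  (success_1 / failure_1) / (success_2 / failure_2)
}

# College (70 high risk, 403 not) vs. High school or GED (147, 1106)
or_college_vs_hs <- odds_ratio(70, 403, 147, 1106)

# The same ratio from odds already rounded to three decimals
or_from_rounded <- 0.174 / 0.133

round(c(unrounded = or_college_vs_hs, rounded = or_from_rounded), 3)
```

The two versions differ only in the third decimal place (about 1.307 versus 1.308), so either rounds to roughly 1.31.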

OR: College vs. Some high school

\[OR = \frac{\text{odds}_{College}}{\text{odds}_{Some HS}} = \frac{70/403}{323/1397} = 0.751\]

The odds of having high risk for heart disease for those with a college degree are 0.751 times the odds for those with some high school.

More natural interpretation

An odds ratio is often easier to interpret when it is stated so that it is greater than 1. Here that means swapping the two groups, which gives 1/0.751 = 1.33.
The odds of having high risk for heart disease for those with some high school are 1.33 times the odds for those with a college degree.
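Switching which group is in the numerator simply inverts the odds ratio. A short base R sketch using the table counts (object names are illustrative):

```r
# Odds from the table counts
odds_college <- 70 / 403
odds_some_hs <- 323 / 1397

# College relative to Some high school (less than 1)
or_college_vs_some_hs <- odds_college / odds_some_hs

# Swap numerator and denominator to compare in the other direction
or_some_hs_vs_college <- odds_some_hs / odds_college

# The two ratios are reciprocals of each other
round(c(or_college_vs_some_hs, or_some_hs_vs_college), 3)
```

Both numbers describe the same relationship; the choice of direction only affects how the sentence reads.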

Making the table 1

First, rename the levels of the categorical variables:

heart_disease <- heart_disease %>%
  mutate(
    high_risk_names = if_else(high_risk == "1", "High risk", "Not high risk"),
    education_names = case_when(
      education == "1" ~ "Some high school",
      education == "2" ~ "High school or GED",
      education == "3" ~ "Some college or vocational school",
      education == "4" ~ "College"
    ),
    education_names = fct_relevel(education_names, "Some high school", "High school or GED", "Some college or vocational school", "College")
  )

Making the table 2

Then, make the table:

heart_disease %>%
  count(education_names, high_risk_names) %>%
  pivot_wider(names_from = high_risk_names, values_from = n) %>%
  kable(col.names = c("Education", "High risk", "Not high risk"))

Deeper look into the code

heart_disease %>%
  count(education_names, high_risk_names)

# A tibble: 8 × 3
  education_names                   high_risk_names     n
  <fct>                             <chr>           <int>
1 Some high school                  High risk         323
2 Some high school                  Not high risk    1397
3 High school or GED                High risk         147
4 High school or GED                Not high risk    1106
5 Some college or vocational school High risk          88
6 Some college or vocational school Not high risk     601
7 College                           High risk          70
8 College                           Not high risk     403

Deeper look into the code

heart_disease %>%
  count(education_names, high_risk_names) %>%
  pivot_wider(names_from = high_risk_names, values_from = n)

# A tibble: 4 × 3
  education_names                   `High risk` `Not high risk`
  <fct>                                   <int>           <int>
1 Some high school                          323            1397
2 High school or GED                        147            1106
3 Some college or vocational school          88             601
4 College                                    70             403

Deeper look into the code

heart_disease %>%
  count(education_names, high_risk_names) %>%
  pivot_wider(names_from = high_risk_names, values_from = n) %>%
  kable()

education_names	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403

Deeper look into the code

heart_disease %>%
  count(education_names, high_risk_names) %>%
  pivot_wider(names_from = high_risk_names, values_from = n) %>%
  kable(col.names = c("Education", "High risk", "Not high risk"))

Education	High risk	Not high risk
Some high school	323	1397
High school or GED	147	1106
Some college or vocational school	88	601
College	70	403

Logistic regression: categorical predictor

Categorical predictor

Recall: Education - 1 = Some High School, 2 = High School or GED, 3 = Some College or Vocational School, 4 = College

# logistic_reg() with the "glm" engine already uses the binomial family
heart_edu_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(high_risk ~ education, data = heart_disease)

tidy(heart_edu_fit) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-1.464	0.062	-23.719	0.000
education2	-0.554	0.107	-5.159	0.000
education3	-0.457	0.130	-3.520	0.000
education4	-0.286	0.143	-1.994	0.046

Interpreting `education4` - log-odds

The log-odds of having high risk for heart disease are expected to be 0.286 less for those with a college degree compared to those with some high school (the baseline group).

Interpreting `education4` - odds

The odds of having high risk for heart disease for those with a college degree are expected to be 0.751 (exp(-0.286)) times the odds for those with some high school.

Coefficients + odds ratios

The model coefficient, -0.286, is the expected difference in the log-odds between the College group and the Some high school group (the baseline).

Therefore, \(e^{-0.286}\) = 0.751 is the factor by which the odds are expected to multiply when going from the Some high school group to the College group. This factor is the odds ratio.

\[ OR = e^{\hat{\beta}_j} = \exp\{\hat{\beta}_j\} \]
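With a single categorical predictor, the fitted coefficients can be recovered directly from the table of counts: the intercept is the log-odds for the baseline group, and each other coefficient is a log odds ratio against that baseline. A sketch in base R, assuming the counts in the table are the data the education model was fit to; it uses `glm()` on grouped counts instead of the tidymodels workflow, and the object names are illustrative:

```r
# Counts from the High risk vs. education table; education coded 1-4
risk_counts <- data.frame(
  education     = factor(1:4),
  high_risk     = c(323, 147, 88, 70),
  not_high_risk = c(1397, 1106, 601, 403)
)

# Log-odds per group, then differences from the baseline (education = 1)
log_odds <- log(risk_counts$high_risk / risk_counts$not_high_risk)
by_hand  <- c(log_odds[1], log_odds[-1] - log_odds[1])

# The same model fit to grouped counts (successes, failures)
grouped_fit <- glm(cbind(high_risk, not_high_risk) ~ education,
                   family = binomial, data = risk_counts)

# Compare the hand calculation, the fitted coefficients, and exp(coefficient)
round(cbind(by_hand, fitted = coef(grouped_fit),
            exp_coef = exp(coef(grouped_fit))), 3)
```

In the `exp_coef` column, the intercept row is the odds for the baseline group, while the `education2` to `education4` rows are odds ratios relative to that baseline.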

Logistic regression: quantitative predictor

Quantitative predictor

# logistic_reg() with the "glm" engine already uses the binomial family
heart_age_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(high_risk ~ age, data = heart_disease)

tidy(heart_age_fit) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-5.619	0.288	-19.498	0
age	0.076	0.005	14.174	0

Interpreting `age`: log-odds

For each additional year in age, the log-odds of having high risk for heart disease are expected to increase by 0.076.

Interpreting `age`: odds

For each additional year in age, the odds of having high risk for heart disease are expected to multiply by a factor of 1.08 (exp(0.076)).
Alternate interpretation: For each additional year in age, the odds of having high risk for heart disease are expected to increase by 8%.
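The arithmetic behind both interpretations, plus the effect of a larger age gap, can be checked in a few lines. A base R sketch using the rounded age coefficient from the output (variable names are illustrative):

```r
# Estimated age coefficient from the model output (log-odds scale)
beta_age <- 0.076

# Multiplicative change in the odds for one additional year of age
or_one_year <- exp(beta_age)

# The same change expressed as a percent increase in the odds
pct_one_year <- (or_one_year - 1) * 100

# A 10-year difference multiplies the odds by exp(10 * beta),
# which is exp(beta)^10, not 10 * exp(beta)
or_ten_years <- exp(10 * beta_age)

round(c(or_one_year = or_one_year, pct_one_year = pct_one_year,
        or_ten_years = or_ten_years), 2)
```

Because the model is linear on the log-odds scale, the yearly factor compounds: a 10-year difference in age corresponds to the odds roughly doubling (a factor of about 2.14).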

Logistic regression: multiple predictors

Multiple predictors

# logistic_reg() with the "glm" engine already uses the binomial family
heart_edu_age_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(high_risk ~ education + age, data = heart_disease)

tidy(heart_edu_age_fit) %>%
  kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	-5.385	0.308	-17.507	0.000
education2	-0.242	0.112	-2.162	0.031
education3	-0.235	0.134	-1.761	0.078
education4	-0.020	0.148	-0.136	0.892
age	0.073	0.005	13.385	0.000

Interpretation in terms of log-odds

education4: The log-odds of having high risk for heart disease are expected to be 0.020 less for those with a college degree compared to those with some high school, holding age constant.

Interpretation in terms of log-odds

age: For each additional year in age, the log-odds of having high risk for heart disease are expected to increase by 0.073, holding education level constant.

Interpretation in terms of odds

education4: The odds of having high risk for heart disease for those with a college degree are expected to be 0.98 (exp(-0.020)) times the odds for those with some high school, holding age constant.

Interpretation in terms of odds

age: For each additional year in age, the odds of having high risk for heart disease are expected to multiply by a factor of 1.08 (exp(0.073)), holding education level constant.
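"Holding the other predictor constant" can be made concrete by computing the model's odds for pairs of people who differ in only one variable. A sketch in base R using the rounded coefficients from the output; `odds_for()` is a hypothetical helper that only covers the College and Some high school groups:

```r
# Coefficients copied from the model output (log-odds scale)
b_intercept <- -5.385
b_college   <- -0.020   # education4 vs. the Some high school baseline
b_age       <-  0.073

# Hypothetical helper: odds for a person with a college degree
# (college = 1) or some high school (college = 0) at a given age
odds_for <- function(college, age) {
  exp(b_intercept + b_college * college + b_age * age)
}

# Same age, different education: the ratio is exp(b_college) at any age
odds_for(1, 50) / odds_for(0, 50)
odds_for(1, 65) / odds_for(0, 65)

# Same education, ages one year apart: the ratio is exp(b_age)
odds_for(1, 51) / odds_for(1, 50)
odds_for(0, 51) / odds_for(0, 50)
```

The first pair of ratios should both be about 0.98 and the second pair about 1.08, whichever age or education level is held fixed. This holds because the model has no interaction between education and age.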

Recap

Use the odds ratio to compare the odds of two groups
Interpret the coefficients of a logistic regression model with
- a single categorical predictor
- a single quantitative predictor
- multiple predictors
