STA 210 - Summer 2022

Author

Yunran Chen

# Welcome

## Topics

• Introduce multinomial logistic regression

• Interpret model coefficients

• Inference for a coefficient $$\beta_{jk}$$

## Computational setup

# load packages
library(tidyverse)
library(tidymodels)
library(NHANES)
library(knitr)
library(patchwork)

# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))

# Generalized Linear Models

## Generalized Linear Models (GLMs)

• In practice, there are many different types of outcome variables:

• Binary: Win or Lose
• Nominal: Democrat, Republican or Third Party candidate
• Ordered: Movie rating (1 - 5 stars)
• and others…
• Predicting each of these outcomes requires a generalized linear model, a broader class of models that generalize the multiple linear regression model

Note

Recommended reading for more details about GLMs: Generalized Linear Models: A Unifying Theory.

## Binary outcome (Logistic)

• Given $$P(y_i=1|x_i)= \hat{\pi}_i\hspace{5mm} \text{ and } \hspace{5mm}P(y_i=0|x_i) = 1-\hat{\pi}_i$$

$\log\Big(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\Big) = \hat{\beta}_0 + \hat{\beta}_1 x_{i}$

• We can calculate $$\hat{\pi}_i$$ by solving the logit equation:

$\hat{\pi}_i = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i}}}$

## Binary outcome (Logistic)

• Suppose we consider $$y=0$$ the baseline category such that

$P(y_i=1|x_i) = \hat{\pi}_{i1} \hspace{2mm} \text{ and } \hspace{2mm} P(y_i=0|x_i) = \hat{\pi}_{i0}$

• Then the logistic regression model is

$\log\bigg(\frac{\hat{\pi}_{i1}}{1- \hat{\pi}_{i1}}\bigg) = \log\bigg(\frac{\hat{\pi}_{i1}}{\hat{\pi}_{i0}}\bigg) = \hat{\beta}_0 + \hat{\beta}_1 x_i$

• Slope, $$\hat{\beta}_1$$: When $$x$$ increases by one unit, the odds of $$y=1$$ versus the baseline $$y=0$$ are expected to multiply by a factor of $$e^{\hat{\beta}_1}$$

• Intercept, $$\hat{\beta}_0$$: When $$x=0$$, the predicted odds of $$y=1$$ versus the baseline $$y=0$$ are $$\exp\{\hat{\beta}_0\}$$

## Multinomial outcome variable

• Suppose the outcome variable $$y$$ is categorical and can take values $$1, 2, \ldots, K$$ such that $$(K > 2)$$

• Multinomial Distribution:

$P(y=1) = \pi_1, P(y=2) = \pi_2, \ldots, P(y=K) = \pi_K$

such that $$\sum\limits_{k=1}^{K} \pi_k = 1$$

## Multinomial Logistic Regression

• If we have an explanatory variable $$x$$, then we want to fit a model such that $$P(y = k) = \pi_k$$ is a function of $$x$$

• Choose a baseline category. Let’s choose $$y=1$$. Then,

$\log\bigg(\frac{\pi_{ik}}{\pi_{i1}}\bigg) = \beta_{0k} + \beta_{1k} x_i$

• In the multinomial logistic model, we have a separate equation for each category of the outcome relative to the baseline category

• If the outcome has $$K$$ possible categories, there will be $$K-1$$ equations as part of the multinomial logistic model

## Multinomial Logistic Regression

• Suppose we have a outcome variable $$y$$ that can take three possible outcomes that are coded as “A”, “B”, “C”

• Let “A” be the baseline category. Then

$\log\bigg(\frac{\pi_{iB}}{\pi_{iA}}\bigg) = \beta_{0B} + \beta_{1B}x_i \\[10pt] \log\bigg(\frac{\pi_{iC}}{\pi_{iA}}\bigg) = \beta_{0C} + \beta_{1C} x_i$

# Data

## NHANES Data

• National Health and Nutrition Examination Survey is conducted by the National Center for Health Statistics (NCHS)

• The goal is to “assess the health and nutritional status of adults and children in the United States”

• This survey includes an interview and a physical examination

## NHANES Data

• We will use the data from the NHANES R package

• Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years

• The data in this package is modified for educational purposes and should not be used for research

• Original data can be obtained from the NCHS website for research purposes

• Type ?NHANES in console to see list of variables and definitions

## Variables

Goal: Use a person’s age and whether they do regular physical activity to predict their self-reported health rating.

• Outcome: HealthGen: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.

• Predictors:

• Age:Age at time of screening (in years). Participants 80 or older were recorded as 80.
• PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities.

## The data

nhanes_adult <- NHANES %>%
filter(Age >= 18) %>%
select(HealthGen, Age, PhysActive) %>%
drop_na() %>%
mutate(obs_num = 1:n())
glimpse(nhanes_adult)
Rows: 6,710
Columns: 4
$HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …$ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
$PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …$ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

# Fitting a multinomial logistic regression

## Model in R

Use the multinom_reg() function with the "nnet" engine:

health_fit <- multinom_reg() %>%
set_engine("nnet") %>%
fit(HealthGen ~ Age + PhysActive, data = nhanes_adult)

## Model result

health_fit
parsnip model object

Call:
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = data,
trace = FALSE)

Coefficients:
(Intercept)           Age PhysActiveYes
Vgood   1.2053460  0.0009101848    -0.3209047
Good    1.9476261 -0.0023686122    -1.0014925
Fair    0.9145492  0.0030462534    -1.6454297
Poor   -1.5211414  0.0221905681    -2.6556343

Residual Deviance: 17588.88
AIC: 17612.88 

## Next steps

What function do we use to get the model summary, i.e., coefficient estimates.

tidy(health_fit)
Error in model.frame.default(formula = HealthGen ~ Age + PhysActive, data = data): 'data' must be a data.frame, environment, or list

## Looking inside the result of fit()

What is the name of the dataset in the call? Is it right?

health_fit$fit$call
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = data,
trace = FALSE)

## Repair, and get back on track

health_fit <- repair_call(health_fit, data = nhanes_adult)
health_fit$fit$call
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = nhanes_adult,
trace = FALSE)
tidy(health_fit)
# A tibble: 12 × 6
y.level term           estimate std.error statistic  p.value
<chr>   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 Vgood   (Intercept)    1.21       0.145       8.33  8.42e-17
2 Vgood   Age            0.000910   0.00246     0.369 7.12e- 1
3 Vgood   PhysActiveYes -0.321      0.0929     -3.45  5.51e- 4
4 Good    (Intercept)    1.95       0.141      13.8   1.39e-43
5 Good    Age           -0.00237    0.00242    -0.977 3.29e- 1
6 Good    PhysActiveYes -1.00       0.0901    -11.1   1.00e-28
7 Fair    (Intercept)    0.915      0.164       5.57  2.61e- 8
8 Fair    Age            0.00305    0.00288     1.06  2.90e- 1
9 Fair    PhysActiveYes -1.65       0.107     -15.3   5.69e-53
10 Poor    (Intercept)   -1.52       0.290      -5.24  1.62e- 7
11 Poor    Age            0.0222     0.00491     4.52  6.11e- 6
12 Poor    PhysActiveYes -2.66       0.236     -11.3   1.75e-29

## Model output

tidy(health_fit)
# A tibble: 12 × 6
y.level term           estimate std.error statistic  p.value
<chr>   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 Vgood   (Intercept)    1.21       0.145       8.33  8.42e-17
2 Vgood   Age            0.000910   0.00246     0.369 7.12e- 1
3 Vgood   PhysActiveYes -0.321      0.0929     -3.45  5.51e- 4
4 Good    (Intercept)    1.95       0.141      13.8   1.39e-43
5 Good    Age           -0.00237    0.00242    -0.977 3.29e- 1
6 Good    PhysActiveYes -1.00       0.0901    -11.1   1.00e-28
7 Fair    (Intercept)    0.915      0.164       5.57  2.61e- 8
8 Fair    Age            0.00305    0.00288     1.06  2.90e- 1
9 Fair    PhysActiveYes -1.65       0.107     -15.3   5.69e-53
10 Poor    (Intercept)   -1.52       0.290      -5.24  1.62e- 7
11 Poor    Age            0.0222     0.00491     4.52  6.11e- 6
12 Poor    PhysActiveYes -2.66       0.236     -11.3   1.75e-29

## Model output, with CI

tidy(health_fit, conf.int = TRUE)
# A tibble: 12 × 8
y.level term         estimate std.error statistic  p.value conf.low conf.high
<chr>   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 Vgood   (Intercept)   1.21e+0   0.145       8.33  8.42e-17  0.922     1.49
2 Vgood   Age           9.10e-4   0.00246     0.369 7.12e- 1 -0.00392   0.00574
3 Vgood   PhysActiveY… -3.21e-1   0.0929     -3.45  5.51e- 4 -0.503    -0.139
4 Good    (Intercept)   1.95e+0   0.141      13.8   1.39e-43  1.67      2.22
5 Good    Age          -2.37e-3   0.00242    -0.977 3.29e- 1 -0.00712   0.00238
6 Good    PhysActiveY… -1.00e+0   0.0901    -11.1   1.00e-28 -1.18     -0.825
7 Fair    (Intercept)   9.15e-1   0.164       5.57  2.61e- 8  0.592     1.24
8 Fair    Age           3.05e-3   0.00288     1.06  2.90e- 1 -0.00260   0.00869
9 Fair    PhysActiveY… -1.65e+0   0.107     -15.3   5.69e-53 -1.86     -1.43
10 Poor    (Intercept)  -1.52e+0   0.290      -5.24  1.62e- 7 -2.09     -0.952
11 Poor    Age           2.22e-2   0.00491     4.52  6.11e- 6  0.0126    0.0318
12 Poor    PhysActiveY… -2.66e+0   0.236     -11.3   1.75e-29 -3.12     -2.19   

## Model output, with CI

y.level term estimate std.error statistic p.value conf.low conf.high
Vgood (Intercept) 1.205 0.145 8.325 0.000 0.922 1.489
Vgood Age 0.001 0.002 0.369 0.712 -0.004 0.006
Vgood PhysActiveYes -0.321 0.093 -3.454 0.001 -0.503 -0.139
Good (Intercept) 1.948 0.141 13.844 0.000 1.672 2.223
Good Age -0.002 0.002 -0.977 0.329 -0.007 0.002
Good PhysActiveYes -1.001 0.090 -11.120 0.000 -1.178 -0.825
Fair (Intercept) 0.915 0.164 5.566 0.000 0.592 1.237
Fair Age 0.003 0.003 1.058 0.290 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.000 -1.856 -1.435
Poor (Intercept) -1.521 0.290 -5.238 0.000 -2.090 -0.952
Poor Age 0.022 0.005 4.522 0.000 0.013 0.032
Poor PhysActiveYes -2.656 0.236 -11.275 0.000 -3.117 -2.194

## Fair vs. Excellent Health

The baseline category for the model is Excellent.

The model equation for the log-odds a person rates themselves as having “Fair” health vs. “Excellent” is

$\log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive}$

## Interpretations

$\log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive}$

For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.

The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn’t do physical activity, holding age constant.

## Interpretations

$\log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive}$

The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).

⚠️ Need to mean-center age for the intercept to have a meaningful interpretation!

## Hypothesis test for $$\beta_{jk}$$

The test of significance for the coefficient $$\beta_{jk}$$ is

Hypotheses: $$H_0: \beta_{jk} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{jk} \neq 0$$

Test Statistic: $z = \frac{\hat{\beta}_{jk} - 0}{SE(\hat{\beta_{jk}})}$

P-value: $$P(|Z| > |z|)$$,

where $$Z \sim N(0, 1)$$, the Standard Normal distribution

## Confidence interval for $$\beta_{jk}$$

• We can calculate the C% confidence interval for $$\beta_{jk}$$ using $$\hat{\beta}_{jk} \pm z^* SE(\hat{\beta}_{jk})$$, where $$z^*$$ is calculated from the $$N(0,1)$$ distribution.

• We are $$C\%$$ confident that for every one unit change in $$x_{j}$$, the odds of $$y = k$$ versus the baseline will multiply by a factor of $$\exp\{\hat{\beta}_{jk} - z^* SE(\hat{\beta}_{jk})\}$$ to $$\exp\{\hat{\beta}_{jk} + z^* SE(\hat{\beta}_{jk})\}$$, holding all else constant.

## Interpreting CIs for $$\beta_{jk}$$

tidy(health_fit, conf.int = TRUE) %>%
filter(y.level == "Fair") %>%
kable(digits = 3)
y.level term estimate std.error statistic p.value conf.low conf.high
Fair (Intercept) 0.915 0.164 5.566 0.00 0.592 1.237
Fair Age 0.003 0.003 1.058 0.29 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.00 -1.856 -1.435

We are 95% confident, that for each additional year in age, the odds a person rates themselves as having fair health versus excellent health will multiply by 0.997 (exp(-0.003)) to 1.009 (exp(0.009)) , holding physical activity constant.

## Interpreting CIs for $$\beta_{jk}$$

tidy(health_fit, conf.int = TRUE) %>%
filter(y.level == "Fair") %>%
kable(digits = 3)
y.level term estimate std.error statistic p.value conf.low conf.high
Fair (Intercept) 0.915 0.164 5.566 0.00 0.592 1.237
Fair Age 0.003 0.003 1.058 0.29 -0.003 0.009
Fair PhysActiveYes -1.645 0.107 -15.319 0.00 -1.856 -1.435

We are 95% confident that the odds a person who does physical activity will rate themselves as having fair health versus excellent health are 0.156 (exp(-1.856 )) to 0.238 (exp(-1.435)) times the odds for a person who doesn’t do physical activity, holding age constant.

## Recap

• Introduce multinomial logistic regression

• Interpret model coefficients

• Inference for a coefficient $$\beta_{jk}$$