```
# load packages
library(tidyverse)
library(tidymodels)
library(NHANES)
library(knitr)
library(patchwork)
# set default theme and larger font size for ggplot2
::theme_set(ggplot2::theme_minimal(base_size = 20)) ggplot2
```

# Multinomial Logistic Regression (MultiLR)

STA 210 - Summer 2022

# Welcome

## Topics

Introduce multinomial logistic regression

Interpret model coefficients

Inference for a coefficient \(\beta_{jk}\)

## Computational setup

# Generalized Linear Models

## Generalized Linear Models (GLMs)

In practice, there are many different types of outcome variables:

**Binary**: Win or Lose**Nominal**: Democrat, Republican or Third Party candidate**Ordered**: Movie rating (1 - 5 stars)- and others…

Predicting each of these outcomes requires a

**generalized linear model**, a broader class of models that*generalize*the multiple linear regression model

## Binary outcome (Logistic)

Given \(P(y_i=1|x_i)= \hat{\pi}_i\hspace{5mm} \text{ and } \hspace{5mm}P(y_i=0|x_i) = 1-\hat{\pi}_i\)

\[ \log\Big(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\Big) = \hat{\beta}_0 + \hat{\beta}_1 x_{i} \]

We can calculate \(\hat{\pi}_i\) by solving the logit equation:

\[ \hat{\pi}_i = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_{i}}} \]

## Binary outcome (Logistic)

Suppose we consider \(y=0\) the

**baseline category**such that\[ P(y_i=1|x_i) = \hat{\pi}_{i1} \hspace{2mm} \text{ and } \hspace{2mm} P(y_i=0|x_i) = \hat{\pi}_{i0} \]

Then the logistic regression model is

\[ \log\bigg(\frac{\hat{\pi}_{i1}}{1- \hat{\pi}_{i1}}\bigg) = \log\bigg(\frac{\hat{\pi}_{i1}}{\hat{\pi}_{i0}}\bigg) = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

Slope, \(\hat{\beta}_1\): When \(x\) increases by one unit, the odds of \(y=1\) versus the baseline \(y=0\) are expected to multiply by a factor of \(e^{\hat{\beta}_1}\)

Intercept, \(\hat{\beta}_0\): When \(x=0\), the predicted odds of \(y=1\) versus the baseline \(y=0\) are \(\exp\{\hat{\beta}_0\}\)

## Multinomial outcome variable

Suppose the outcome variable \(y\) is categorical and can take values \(1, 2, \ldots, K\) such that \((K > 2)\)

**Multinomial Distribution:**\[ P(y=1) = \pi_1, P(y=2) = \pi_2, \ldots, P(y=K) = \pi_K \]

such that \(\sum\limits_{k=1}^{K} \pi_k = 1\)

## Multinomial Logistic Regression

If we have an explanatory variable \(x\), then we want to fit a model such that \(P(y = k) = \pi_k\) is a function of \(x\)

Choose a baseline category. Let’s choose \(y=1\). Then,

\[ \log\bigg(\frac{\pi_{ik}}{\pi_{i1}}\bigg) = \beta_{0k} + \beta_{1k} x_i \]

In the multinomial logistic model, we have a separate equation for each category of the outcome relative to the baseline category

- If the outcome has \(K\) possible categories, there will be \(K-1\) equations as part of the multinomial logistic model

## Multinomial Logistic Regression

Suppose we have a outcome variable \(y\) that can take three possible outcomes that are coded as “A”, “B”, “C”

Let “A” be the baseline category. Then

\[ \log\bigg(\frac{\pi_{iB}}{\pi_{iA}}\bigg) = \beta_{0B} + \beta_{1B}x_i \\[10pt] \log\bigg(\frac{\pi_{iC}}{\pi_{iA}}\bigg) = \beta_{0C} + \beta_{1C} x_i \]

# Data

## NHANES Data

National Health and Nutrition Examination Survey is conducted by the National Center for Health Statistics (NCHS)

The goal is to

*“assess the health and nutritional status of adults and children in the United States”*This survey includes an interview and a physical examination

## NHANES Data

We will use the data from the

`NHANES`

R packageContains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years

The data in this package is modified for educational purposes and should

**not**be used for researchOriginal data can be obtained from the NCHS website for research purposes

Type

`?NHANES`

in console to see list of variables and definitions

## Variables

**Goal:** Use a person’s age and whether they do regular physical activity to predict their self-reported health rating.

Outcome:

`HealthGen`

: Self-reported rating of participant’s health in general. Excellent, Vgood, Good, Fair, or Poor.Predictors:

`Age`

:Age at time of screening (in years). Participants 80 or older were recorded as 80.`PhysActive`

: Participant does moderate to vigorous-intensity sports, fitness or recreational activities.

## The data

```
<- NHANES %>%
nhanes_adult filter(Age >= 18) %>%
select(HealthGen, Age, PhysActive) %>%
drop_na() %>%
mutate(obs_num = 1:n())
```

`glimpse(nhanes_adult)`

```
Rows: 6,710
Columns: 4
$ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …
$ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
$ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …
$ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
```

## Exploratory data analysis

## Exploratory data analysis

# Fitting a multinomial logistic regression

## Model in R

Use the `multinom_reg()`

function with the `"nnet"`

engine:

```
<- multinom_reg() %>%
health_fit set_engine("nnet") %>%
fit(HealthGen ~ Age + PhysActive, data = nhanes_adult)
```

## Model result

` health_fit`

```
parsnip model object
Call:
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = data,
trace = FALSE)
Coefficients:
(Intercept) Age PhysActiveYes
Vgood 1.2053460 0.0009101848 -0.3209047
Good 1.9476261 -0.0023686122 -1.0014925
Fair 0.9145492 0.0030462534 -1.6454297
Poor -1.5211414 0.0221905681 -2.6556343
Residual Deviance: 17588.88
AIC: 17612.88
```

## Next steps

What function do we use to get the model summary, i.e., coefficient estimates.

`tidy(health_fit)`

`Error in model.frame.default(formula = HealthGen ~ Age + PhysActive, data = data): 'data' must be a data.frame, environment, or list`

## Looking inside the result of `fit()`

What is the name of the dataset in the call? Is it right?

`$fit$call health_fit`

```
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = data,
trace = FALSE)
```

## Repair, and get back on track

```
<- repair_call(health_fit, data = nhanes_adult)
health_fit $fit$call health_fit
```

```
nnet::multinom(formula = HealthGen ~ Age + PhysActive, data = nhanes_adult,
trace = FALSE)
```

`tidy(health_fit)`

```
# A tibble: 12 × 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Vgood (Intercept) 1.21 0.145 8.33 8.42e-17
2 Vgood Age 0.000910 0.00246 0.369 7.12e- 1
3 Vgood PhysActiveYes -0.321 0.0929 -3.45 5.51e- 4
4 Good (Intercept) 1.95 0.141 13.8 1.39e-43
5 Good Age -0.00237 0.00242 -0.977 3.29e- 1
6 Good PhysActiveYes -1.00 0.0901 -11.1 1.00e-28
7 Fair (Intercept) 0.915 0.164 5.57 2.61e- 8
8 Fair Age 0.00305 0.00288 1.06 2.90e- 1
9 Fair PhysActiveYes -1.65 0.107 -15.3 5.69e-53
10 Poor (Intercept) -1.52 0.290 -5.24 1.62e- 7
11 Poor Age 0.0222 0.00491 4.52 6.11e- 6
12 Poor PhysActiveYes -2.66 0.236 -11.3 1.75e-29
```

## Model output

`tidy(health_fit)`

```
# A tibble: 12 × 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Vgood (Intercept) 1.21 0.145 8.33 8.42e-17
2 Vgood Age 0.000910 0.00246 0.369 7.12e- 1
3 Vgood PhysActiveYes -0.321 0.0929 -3.45 5.51e- 4
4 Good (Intercept) 1.95 0.141 13.8 1.39e-43
5 Good Age -0.00237 0.00242 -0.977 3.29e- 1
6 Good PhysActiveYes -1.00 0.0901 -11.1 1.00e-28
7 Fair (Intercept) 0.915 0.164 5.57 2.61e- 8
8 Fair Age 0.00305 0.00288 1.06 2.90e- 1
9 Fair PhysActiveYes -1.65 0.107 -15.3 5.69e-53
10 Poor (Intercept) -1.52 0.290 -5.24 1.62e- 7
11 Poor Age 0.0222 0.00491 4.52 6.11e- 6
12 Poor PhysActiveYes -2.66 0.236 -11.3 1.75e-29
```

## Model output, with CI

`tidy(health_fit, conf.int = TRUE)`

```
# A tibble: 12 × 8
y.level term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Vgood (Intercept) 1.21e+0 0.145 8.33 8.42e-17 0.922 1.49
2 Vgood Age 9.10e-4 0.00246 0.369 7.12e- 1 -0.00392 0.00574
3 Vgood PhysActiveY… -3.21e-1 0.0929 -3.45 5.51e- 4 -0.503 -0.139
4 Good (Intercept) 1.95e+0 0.141 13.8 1.39e-43 1.67 2.22
5 Good Age -2.37e-3 0.00242 -0.977 3.29e- 1 -0.00712 0.00238
6 Good PhysActiveY… -1.00e+0 0.0901 -11.1 1.00e-28 -1.18 -0.825
7 Fair (Intercept) 9.15e-1 0.164 5.57 2.61e- 8 0.592 1.24
8 Fair Age 3.05e-3 0.00288 1.06 2.90e- 1 -0.00260 0.00869
9 Fair PhysActiveY… -1.65e+0 0.107 -15.3 5.69e-53 -1.86 -1.43
10 Poor (Intercept) -1.52e+0 0.290 -5.24 1.62e- 7 -2.09 -0.952
11 Poor Age 2.22e-2 0.00491 4.52 6.11e- 6 0.0126 0.0318
12 Poor PhysActiveY… -2.66e+0 0.236 -11.3 1.75e-29 -3.12 -2.19
```

## Model output, with CI

y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|

Vgood | (Intercept) | 1.205 | 0.145 | 8.325 | 0.000 | 0.922 | 1.489 |

Vgood | Age | 0.001 | 0.002 | 0.369 | 0.712 | -0.004 | 0.006 |

Vgood | PhysActiveYes | -0.321 | 0.093 | -3.454 | 0.001 | -0.503 | -0.139 |

Good | (Intercept) | 1.948 | 0.141 | 13.844 | 0.000 | 1.672 | 2.223 |

Good | Age | -0.002 | 0.002 | -0.977 | 0.329 | -0.007 | 0.002 |

Good | PhysActiveYes | -1.001 | 0.090 | -11.120 | 0.000 | -1.178 | -0.825 |

Fair | (Intercept) | 0.915 | 0.164 | 5.566 | 0.000 | 0.592 | 1.237 |

Fair | Age | 0.003 | 0.003 | 1.058 | 0.290 | -0.003 | 0.009 |

Fair | PhysActiveYes | -1.645 | 0.107 | -15.319 | 0.000 | -1.856 | -1.435 |

Poor | (Intercept) | -1.521 | 0.290 | -5.238 | 0.000 | -2.090 | -0.952 |

Poor | Age | 0.022 | 0.005 | 4.522 | 0.000 | 0.013 | 0.032 |

Poor | PhysActiveYes | -2.656 | 0.236 | -11.275 | 0.000 | -3.117 | -2.194 |

## Fair vs. Excellent Health

The baseline category for the model is `Excellent`

.

The model equation for the log-odds a person rates themselves as having “Fair” health vs. “Excellent” is

\[ \log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]

## Interpretations

\[ \log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]

For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.

The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn’t do physical activity, holding age constant.

## Interpretations

\[ \log\Big(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\Big) = 0.915 + 0.003 ~ \text{age} - 1.645 ~ \text{PhysActive} \]

The odds a 0 year old person who doesn’t do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).

⚠️ **Need to mean-center age for the intercept to have a meaningful interpretation!**

## Hypothesis test for \(\beta_{jk}\)

The test of significance for the coefficient \(\beta_{jk}\) is

**Hypotheses**: \(H_0: \beta_{jk} = 0 \hspace{2mm} \text{ vs } \hspace{2mm} H_a: \beta_{jk} \neq 0\)

**Test Statistic**: \[z = \frac{\hat{\beta}_{jk} - 0}{SE(\hat{\beta_{jk}})}\]

**P-value**: \(P(|Z| > |z|)\),

where \(Z \sim N(0, 1)\), the Standard Normal distribution

## Confidence interval for \(\beta_{jk}\)

We can calculate the

**C% confidence interval**for \(\beta_{jk}\) using \(\hat{\beta}_{jk} \pm z^* SE(\hat{\beta}_{jk})\), where \(z^*\) is calculated from the \(N(0,1)\) distribution.We are \(C\%\) confident that for every one unit change in \(x_{j}\), the odds of \(y = k\) versus the baseline will multiply by a factor of \(\exp\{\hat{\beta}_{jk} - z^* SE(\hat{\beta}_{jk})\}\) to \(\exp\{\hat{\beta}_{jk} + z^* SE(\hat{\beta}_{jk})\}\), holding all else constant.

## Interpreting CIs for \(\beta_{jk}\)

```
tidy(health_fit, conf.int = TRUE) %>%
filter(y.level == "Fair") %>%
kable(digits = 3)
```

y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|

Fair | (Intercept) | 0.915 | 0.164 | 5.566 | 0.00 | 0.592 | 1.237 |

Fair | Age | 0.003 | 0.003 | 1.058 | 0.29 | -0.003 | 0.009 |

Fair | PhysActiveYes | -1.645 | 0.107 | -15.319 | 0.00 | -1.856 | -1.435 |

We are 95% confident, that for each additional year in age, the odds a person rates themselves as having fair health versus excellent health will multiply by 0.997 (exp(-0.003)) to 1.009 (exp(0.009)) , holding physical activity constant.

## Interpreting CIs for \(\beta_{jk}\)

```
tidy(health_fit, conf.int = TRUE) %>%
filter(y.level == "Fair") %>%
kable(digits = 3)
```

y.level | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|

Fair | (Intercept) | 0.915 | 0.164 | 5.566 | 0.00 | 0.592 | 1.237 |

Fair | Age | 0.003 | 0.003 | 1.058 | 0.29 | -0.003 | 0.009 |

Fair | PhysActiveYes | -1.645 | 0.107 | -15.319 | 0.00 | -1.856 | -1.435 |

We are 95% confident that the odds a person who does physical activity will rate themselves as having fair health versus excellent health are 0.156 (exp(-1.856 )) to 0.238 (exp(-1.435)) times the odds for a person who doesn’t do physical activity, holding age constant.

## Recap

Introduce multinomial logistic regression

Interpret model coefficients

Inference for a coefficient \(\beta_{jk}\)