```
# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))
```

# Types of predictors

STA 210 - Summer 2022

# Welcome

## Announcements

- Congratulations on finishing Exam 1!

## Topics

Mean-centering quantitative predictors

Using indicator variables for categorical predictors

Using interaction terms

## Computational setup

# Introduction

## Data: Peer-to-peer lender

Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the `loan50` data frame in the **openintro** R package.

```
# A tibble: 50 × 4
annual_income debt_to_income verified_income interest_rate
<dbl> <dbl> <fct> <dbl>
1 59000 0.558 Not Verified 10.9
2 60000 1.31 Not Verified 9.92
3 75000 1.06 Verified 26.3
4 75000 0.574 Not Verified 9.92
5 254000 0.238 Not Verified 9.43
6 67000 1.08 Source Verified 9.92
7 28800 0.0997 Source Verified 17.1
8 80000 0.351 Not Verified 6.08
9 34000 0.698 Not Verified 7.97
10 80000 0.167 Source Verified 12.6
# … with 40 more rows
```

## Variables

**Predictors**:

- `annual_income`: Annual income
- `debt_to_income`: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- `verified_income`: Whether borrower’s income source and amount have been verified (`Not Verified`, `Source Verified`, `Verified`)

**Outcome**:

- `interest_rate`: Interest rate for the loan

## Outcome: `interest_rate`

min | median | max |
---|---|---|
5.31 | 9.93 | 26.3 |
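A summary like this can be computed with **dplyr**. The sketch below is one possible way to do it, not necessarily the code behind the slide; the object name `rate_summary` is made up for illustration.

```r
library(dplyr)
library(openintro)  # provides the loan50 data frame

# Minimum, median, and maximum of the outcome variable
rate_summary <- loan50 %>%
  summarise(
    min    = min(interest_rate),
    median = median(interest_rate),
    max    = max(interest_rate)
  )

rate_summary
```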

## Predictors

## Data manipulation 1: Rescale income

```
# convert annual income from dollars to thousands of dollars
loan50 <- loan50 %>%
  mutate(annual_income_th = annual_income / 1000)

ggplot(loan50, aes(x = annual_income_th)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Annual income (in $1000s)")
```

## Outcome vs. predictors

## Fit regression model

```
# fit a linear model with the lm engine
int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_to_income + verified_income + annual_income_th,
      data = loan50)
```

## Summarize model results

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 |
debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |
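A table of this shape can be produced with `tidy()` and `kable()`. The following is a self-contained sketch that refits the model; the name `int_tidy` is an assumption, and the exact formatting options used for the slide are not known.

```r
library(tidymodels)
library(openintro)
library(knitr)

loan50 <- loan50 %>%
  mutate(annual_income_th = annual_income / 1000)

int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_to_income + verified_income + annual_income_th,
      data = loan50)

# tidy() returns one row per term; conf.int = TRUE adds conf.low and conf.high
int_tidy <- tidy(int_fit, conf.int = TRUE)

# round to three decimals for display
kable(int_tidy, digits = 3)
```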

Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?

# Mean-centered variables

## Mean-centering

If we are interested in interpreting the intercept, we can **mean-center** the quantitative predictors in the model.

We can mean-center a quantitative predictor \(X_j\) using the following:

\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]

If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.
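To see why only the intercept changes, add and subtract \(\hat{\beta}_j \bar{X}_j\) for each quantitative predictor (a short derivation; the trailing dots stand for the categorical terms, which are unaffected):

\[\hat{y} = \hat{\beta}_0 + \sum_{j} \hat{\beta}_j X_j + \dots = \Big(\hat{\beta}_0 + \sum_{j} \hat{\beta}_j \bar{X}_j\Big) + \sum_{j} \hat{\beta}_j X_{j_{Cent}} + \dots\]

The slopes \(\hat{\beta}_j\) are the same in both forms, and the new intercept is the old intercept plus \(\sum_{j} \hat{\beta}_j \bar{X}_j\).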

## Data manipulation 2: Mean-center numeric predictors

```
# subtract each predictor's mean so that 0 corresponds to the average borrower
loan50 <- loan50 %>%
  mutate(
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th)
  )
```

## Visualize mean-centered predictors

## Using mean-centered variables in the model

How do you expect the model to change if we use `debt_inc_cent` and `annual_income_th_cent` in the model?

```
# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.44 0.977 9.66 1.50e-12 7.48 11.4
2 debt_inc_cent 0.671 0.676 0.993 3.26e- 1 -0.690 2.03
3 verified_incomeSourc… 2.21 1.40 1.58 1.21e- 1 -0.606 5.03
4 verified_incomeVerif… 6.88 1.80 3.82 4.06e- 4 3.25 10.5
5 annual_income_th_cent -0.0205 0.0114 -1.80 7.79e- 2 -0.0434 0.00238
```

## Original vs. mean-centered model

**Original**

term | estimate |
---|---|
(Intercept) | 10.726 |
debt_to_income | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th | -0.021 |

**Mean-centered**

term | estimate |
---|---|
(Intercept) | 9.444 |
debt_inc_cent | 0.671 |
verified_incomeSource Verified | 2.211 |
verified_incomeVerified | 6.880 |
annual_income_th_cent | -0.021 |
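The comparison can be checked numerically: the slopes should match, and the centered intercept should equal the original intercept plus each quantitative slope times that predictor's mean. This is a sketch; the names `loans`, `spec`, `orig_fit`, `cent_fit`, and `implied_intercept` are made up for illustration.

```r
library(tidymodels)
library(openintro)

loans <- loan50 %>%
  mutate(
    annual_income_th      = annual_income / 1000,
    debt_inc_cent         = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th)
  )

spec <- linear_reg() %>% set_engine("lm")

orig_fit <- spec %>%
  fit(interest_rate ~ debt_to_income + verified_income + annual_income_th,
      data = loans)
cent_fit <- spec %>%
  fit(interest_rate ~ debt_inc_cent + verified_income + annual_income_th_cent,
      data = loans)

# $fit holds the underlying lm object
b_orig <- coef(orig_fit$fit)
b_cent <- coef(cent_fit$fit)

# intercept implied by the original fit when both quantitative predictors are at their means
implied_intercept <- b_orig[["(Intercept)"]] +
  b_orig[["debt_to_income"]]   * mean(loans$debt_to_income) +
  b_orig[["annual_income_th"]] * mean(loans$annual_income_th)

# both comparisons should report TRUE (up to floating-point tolerance)
all.equal(implied_intercept, b_cent[["(Intercept)"]])
all.equal(unname(b_orig[-1]), unname(b_cent[-1]))
```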

# Indicator variables

## Dummy variables

Suppose there is a categorical variable with \(K\) categories (levels).

- We can treat the levels as values of a continuous variable, or
- We can make \(K-1\) indicator variables (the default)

An **indicator variable** takes values 1 or 0:

- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category
- all 0s if the observation belongs to the benchmark category

## Indicator variables

We can also make \(K\) indicator variables, one indicator for each category.

An **indicator variable** takes values 1 or 0:

- 1 if the observation belongs to that category
- 0 if the observation does not belong to that category

No intercept in this case
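The two codings can be compared with base R's `model.matrix()`. The sketch below uses a small hypothetical vector containing the three levels of `verified_income`, not the actual loan data; `X_dummy` and `X_full` are illustrative names.

```r
# Toy factor with the three levels of verified_income (hypothetical values)
verified_income <- factor(
  c("Not Verified", "Source Verified", "Verified", "Not Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# Default coding: an intercept plus K - 1 = 2 indicator columns
X_dummy <- model.matrix(~ verified_income)

# Removing the intercept gives K = 3 indicator columns, one per level
X_full <- model.matrix(~ 0 + verified_income)

colnames(X_dummy)
colnames(X_full)
```

In `X_dummy` the first level has no column of its own, so it acts as the baseline. In `X_full` every row has exactly one 1, which is why the intercept has to be dropped.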

## Data manipulation 3: Create indicator variables for `verified_income`

Since

```
# one 0/1 indicator column per level of verified_income
loan50 <- loan50 %>%
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )
```

```
# A tibble: 3 × 4
verified_income not_verified source_verified verified
<fct> <dbl> <dbl> <dbl>
1 Not Verified 1 0 0
2 Verified 0 0 1
3 Source Verified 0 1 0
```

## Indicators in the model

- We will use \(K-1\) of the indicator variables in the model.
- The **baseline** is the category that doesn’t have a term in the model.
- The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
- This approach is also called **dummy coding**.

```
# A tibble: 3 × 3
verified_income source_verified verified
<fct> <dbl> <dbl>
1 Not Verified 0 0
2 Verified 0 1
3 Source Verified 1 0
```

## Interpreting `verified_income`

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 |
debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 |
verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 |
verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 |
annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 |

- The baseline category is `Not Verified`.
- People with source verified income are expected to take a loan with an interest rate that is 2.211 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
- People with verified income are expected to take a loan with an interest rate that is 6.880 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
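As a worked example, the estimates in the table above give the expected interest rate for each verification group when both centered predictors equal 0, i.e. at the mean debt-to-income ratio and mean annual income. The variable names here are illustrative only.

```r
# Estimates copied from the model output above
intercept       <- 9.444
source_verified <- 2.211
verified        <- 6.880

# Expected interest rate (in %) at mean debt-to-income and mean annual income
expected_rate <- c(
  "Not Verified"    = intercept,                   # baseline: intercept only
  "Source Verified" = intercept + source_verified,
  "Verified"        = intercept + verified
)

expected_rate
```

This gives 9.444 for the baseline, 11.655 for source verified, and 16.324 for verified.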

# Interaction terms

## Interaction terms

- Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
- This is an **interaction effect**.
- To account for this, we can include **interaction terms** in the model.

## Interest rate vs. annual income

The lines are not parallel, indicating there is an **interaction effect**. The slope of annual income differs based on the income verification.
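A plot of this kind, with one fitted line per verification level, could be drawn roughly as follows. This is a sketch under the assumption that income is plotted in thousands of dollars; it is not the original plotting code, and `loans` and `p` are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(openintro)

loans <- loan50 %>%
  mutate(annual_income_th = annual_income / 1000)

# One least-squares line per level of verified_income;
# non-parallel lines suggest an interaction with annual income
p <- ggplot(loans, aes(x = annual_income_th, y = interest_rate,
                       color = verified_income)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Annual income (in $1000s)", y = "Interest rate",
       color = "Income verification")

p
```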

## Interaction term in model

```
# verified_income * annual_income_th_cent adds both main effects and their interaction
int_cent_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_inc_cent + annual_income_th_cent +
        verified_income * annual_income_th_cent,
      data = loan50)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.484 | 0.989 | 9.586 | 0.000 |
debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 |
annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 |
verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 |
verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 |
annual_income_th_cent:verified_incomeSource Verified | -0.016 | 0.026 | -0.643 | 0.523 |
annual_income_th_cent:verified_incomeVerified | -0.032 | 0.033 | -0.979 | 0.333 |

## Interpreting interaction terms

- What the interaction means: The effect of annual income on the interest rate differs by -0.016 percentage points per thousand dollars when the income is source verified compared to when it is not verified, holding all else constant.
- Interpreting `annual_income` for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.
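The same arithmetic gives the slope of annual income for every verification group: the baseline slope plus that group's interaction coefficient. The sketch below copies the estimates from the model output; the variable names are illustrative.

```r
# Estimates copied from the interaction model output
slope_baseline      <- -0.007  # annual_income_th_cent (slope for Not Verified)
int_source_verified <- -0.016  # annual_income_th_cent:verified_incomeSource Verified
int_verified        <- -0.032  # annual_income_th_cent:verified_incomeVerified

# Change in interest rate per additional $1000 of income, by group
income_slope <- c(
  "Not Verified"    = slope_baseline,
  "Source Verified" = slope_baseline + int_source_verified,
  "Verified"        = slope_baseline + int_verified
)

income_slope
```

This gives slopes of -0.007, -0.023, and -0.039 for the three groups.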

## Data manipulation 4: Create interaction variables

Defining the interaction variable in the model formula as `verified_income * annual_income_th_cent` is an implicit data manipulation step as well.

```
Rows: 50
Columns: 9
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent <dbl> -0.16511719, 0.…
$ annual_income_th_cent <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified` <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified` <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified` <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified` <dbl> 0.00, 0.00, -11…
```

# Transformation

## Data manipulation 5: Transformation on variables

- Linearity is with respect to the coefficients \(\beta\): \(y = \beta_0+\beta_1 x^2\) is also a linear regression model.
- For right-skewed, long-tailed variables (common in financial data), a log transformation can decrease the variability of the data and make it conform more closely to the normal distribution.

## Data manipulation 5: log transformation

## Interpretation on log-scale

As annual income increases by 10% (\(\log(x+0.1x)=\log(1.1\times x)=\log(1.1)+\log x\)), the interest rate is expected to change by \(\beta \times \log(1.1)\) on average, holding … constant.

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 35.144 | 14.200 | 2.475 | 0.017 |
debt_to_income | 0.725 | 0.671 | 1.081 | 0.286 |
verified_incomeSource Verified | 2.140 | 1.397 | 1.532 | 0.133 |
verified_incomeVerified | 7.032 | 1.809 | 3.888 | 0.000 |
annual_income_log | -2.338 | 1.260 | -1.855 | 0.070 |
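The sketch below shows how such a model might be fit and how the 10% interpretation is computed. It assumes `annual_income_log` is the natural log of annual income in dollars, which the slides do not state explicitly; `loans`, `log_fit`, and `beta_log` are illustrative names.

```r
library(tidymodels)
library(openintro)

# Assumption: annual_income_log = log(annual_income), natural log of income in dollars
loans <- loan50 %>%
  mutate(annual_income_log = log(annual_income))

log_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_to_income + verified_income + annual_income_log,
      data = loans)

beta_log <- coef(log_fit$fit)[["annual_income_log"]]

# Expected change in interest rate for a 10% increase in annual income
beta_log * log(1.1)

# Same calculation using the estimate reported in the table
-2.338 * log(1.1)
```

With the reported estimate, a 10% increase in annual income corresponds to an expected change of about -0.22 percentage points in the interest rate.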

# Wrap up

## Recap

Mean-centering quantitative predictors

Using indicator variables for categorical predictors

Using interaction terms

## Looking backward

Data manipulation, with **dplyr** (from **tidyverse**):

```
loan50 %>%
  select(interest_rate, annual_income, debt_to_income, verified_income) %>%
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )
```

## Looking forward

**Feature engineering**, with **recipes** (from **tidymodels**):

```
loan_rec <- recipe(~ ., data = loan50) %>%
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) %>%
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) %>%
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) %>%
  # 4. create interaction variables
  step_interact(terms = ~ annual_income_th:verified_income)
```

## Recipe

`loan_rec`

```
Recipe
Inputs:
role #variables
predictor 25
Operations:
Variable mutation for annual_income / 1000
Centering for all_numeric_predictors()
Dummy variables from verified_income
Interactions with annual_income_th:verified_income
```