# load packages
library(tidyverse)
library(tidymodels)
library(openintro)
library(patchwork)
library(knitr)
library(kableExtra)
library(colorblindr)
# set default theme and larger font size for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 20))
Types of predictors
STA 210 - Summer 2022
Welcome
Announcements
- Congratulations on finishing Exam 1!
Topics
- Mean-centering quantitative predictors 
- Using indicator variables for categorical predictors 
- Using interaction terms 
Computational setup
Introduction
Data: Peer-to-peer lender
Today’s data is a sample of 50 loans made through a peer-to-peer lending club. The data is in the loan50 data frame in the openintro R package.
# A tibble: 50 × 4
   annual_income debt_to_income verified_income interest_rate
           <dbl>          <dbl> <fct>                   <dbl>
 1         59000         0.558  Not Verified            10.9 
 2         60000         1.31   Not Verified             9.92
 3         75000         1.06   Verified                26.3 
 4         75000         0.574  Not Verified             9.92
 5        254000         0.238  Not Verified             9.43
 6         67000         1.08   Source Verified          9.92
 7         28800         0.0997 Source Verified         17.1 
 8         80000         0.351  Not Verified             6.08
 9         34000         0.698  Not Verified             7.97
10         80000         0.167  Source Verified         12.6 
# … with 40 more rows
Variables
Predictors:
- annual_income: Annual income
- debt_to_income: Debt-to-income ratio, i.e. the percentage of a borrower’s total debt divided by their total income
- verified_income: Whether borrower’s income source and amount have been verified (levels: Not Verified, Source Verified, Verified)
Outcome: interest_rate: Interest rate for the loan
Outcome: interest_rate

| min | median | max | 
|---|---|---|
| 5.31 | 9.93 | 26.3 | 
Predictors

Data manipulation 1: Rescale income
loan50 <- loan50 %>%
  mutate(annual_income_th = annual_income / 1000)
ggplot(loan50, aes(x = annual_income_th)) +
  geom_histogram(binwidth = 20) +
  labs(title = "Annual income (in $1000s)")
Outcome vs. predictors

Fit regression model
int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_to_income + verified_income  + annual_income_th,
      data = loan50)
Summarize model results
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | 
|---|---|---|---|---|---|---|
| (Intercept) | 10.726 | 1.507 | 7.116 | 0.000 | 7.690 | 13.762 | 
| debt_to_income | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 | 
| verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 | 
| verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 | 
| annual_income_th | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 | 
Describe the subset of borrowers who are expected to get an interest rate of 10.726% based on our model. Is this interpretation meaningful? Why or why not?
Mean-centered variables
Mean-centering
If we are interested in interpreting the intercept, we can mean-center the quantitative predictors in the model.
We can mean-center a quantitative predictor \(X_j\) using the following:
\[X_{j_{Cent}} = X_{j}- \bar{X}_{j}\]
If we mean-center all quantitative variables, then the intercept is interpreted as the expected value of the response variable when all quantitative variables are at their mean value.
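A short derivation may help show why centering changes only the intercept. It assumes a model with quantitative predictors \(X_1, \dots, X_p\) and writes \(\beta_j\) for the coefficients of the uncentered model; this notation is introduced here for illustration.

\[
\begin{aligned}
\hat{y} &= \beta_0 + \sum_{j=1}^{p} \beta_j X_j \\
        &= \Big(\beta_0 + \sum_{j=1}^{p} \beta_j \bar{X}_j\Big) + \sum_{j=1}^{p} \beta_j \big(X_j - \bar{X}_j\big)
\end{aligned}
\]

The slopes \(\beta_j\) are unchanged, and the intercept of the centered model is \(\beta_0 + \sum_j \beta_j \bar{X}_j\): the expected response when every quantitative predictor is at its mean (and any categorical predictor is at its baseline level).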
Data manipulation 2: Mean-center numeric predictors
loan50 <- loan50 %>%
  mutate(
    debt_inc_cent = debt_to_income - mean(debt_to_income), 
    annual_income_th_cent = annual_income_th - mean(annual_income_th)
    )
Visualize mean-centered predictors

Using mean-centered variables in the model
How do you expect the model to change if we use debt_inc_cent and annual_income_th_cent in the model?
# A tibble: 5 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)             9.44      0.977      9.66  1.50e-12   7.48    11.4    
2 debt_inc_cent           0.671     0.676      0.993 3.26e- 1  -0.690    2.03   
3 verified_incomeSourc…   2.21      1.40       1.58  1.21e- 1  -0.606    5.03   
4 verified_incomeVerif…   6.88      1.80       3.82  4.06e- 4   3.25    10.5    
5 annual_income_th_cent  -0.0205    0.0114    -1.80  7.79e- 2  -0.0434   0.00238
Original vs. mean-centered model
| term | estimate | 
|---|---|
| (Intercept) | 10.726 | 
| debt_to_income | 0.671 | 
| verified_incomeSource Verified | 2.211 | 
| verified_incomeVerified | 6.880 | 
| annual_income_th | -0.021 | 
| term | estimate | 
|---|---|
| (Intercept) | 9.444 | 
| debt_inc_cent | 0.671 | 
| verified_incomeSource Verified | 2.211 | 
| verified_incomeVerified | 6.880 | 
| annual_income_th_cent | -0.021 | 
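The comparison above (same slopes, different intercept) can be checked numerically. The following is a minimal sketch, not the code used for these slides: it uses base R `lm()` and only the first ten rows of loan50 as printed earlier, and it leaves out verified_income to keep the example small, so its estimates will not match the tables. The object names (`loans`, `raw_fit`, `cent_fit`) are made up for illustration.

```r
# First ten rows of loan50, copied from the tibble printed earlier
loans <- data.frame(
  annual_income  = c(59000, 60000, 75000, 75000, 254000,
                     67000, 28800, 80000, 34000, 80000),
  debt_to_income = c(0.558, 1.31, 1.06, 0.574, 0.238,
                     1.08, 0.0997, 0.351, 0.698, 0.167),
  interest_rate  = c(10.9, 9.92, 26.3, 9.92, 9.43,
                     9.92, 17.1, 6.08, 7.97, 12.6)
)

# Rescale income, then mean-center both quantitative predictors
loans$annual_income_th      <- loans$annual_income / 1000
loans$debt_inc_cent         <- loans$debt_to_income - mean(loans$debt_to_income)
loans$annual_income_th_cent <- loans$annual_income_th - mean(loans$annual_income_th)

raw_fit  <- lm(interest_rate ~ debt_to_income + annual_income_th, data = loans)
cent_fit <- lm(interest_rate ~ debt_inc_cent + annual_income_th_cent, data = loans)

# The slopes should be identical in the two fits
slopes_match <- all.equal(unname(coef(raw_fit)[-1]), unname(coef(cent_fit)[-1]))

# The centered intercept should equal the original intercept plus
# each slope times the mean of its predictor
shifted_intercept <- coef(raw_fit)[1] +
  coef(raw_fit)[2] * mean(loans$debt_to_income) +
  coef(raw_fit)[3] * mean(loans$annual_income_th)
intercept_match <- all.equal(unname(shifted_intercept), unname(coef(cent_fit)[1]))

slopes_match
intercept_match
```

Because this sketch has no categorical predictor, the centered intercept is also just the sample mean of interest_rate.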
Indicator variables
Dummy variables
- Suppose there is a categorical variable with \(K\) categories (levels) 
- We can treat the levels as values of a continuous variable, or 
- We can make \(K-1\) indicator variables (by default) 
- An indicator variable takes values 1 or 0:
  - 1 if the observation belongs to that category
  - 0 if the observation does not belong to that category
  - all indicators are 0 if the observation belongs to the baseline (benchmark) category
 
Indicator variables
- We can also make \(K\) indicator variables, one indicator for each category 
- An indicator variable takes values 1 or 0:
  - 1 if the observation belongs to that category
  - 0 if the observation does not belong to that category
- The model has no intercept in this case 
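The two coding schemes described above can be sketched with base R's `model.matrix()`. This is an illustration on a three-row toy factor, not the slides' own code; the factor levels are set explicitly here so that Not Verified is the baseline, and the names `k_minus_1` and `k_full` are made up for the example.

```r
# Toy factor with the three levels of verified_income (K = 3)
verified_income <- factor(
  c("Not Verified", "Verified", "Source Verified"),
  levels = c("Not Verified", "Source Verified", "Verified")
)

# K - 1 indicators plus an intercept (R's default treatment coding):
# the baseline level, Not Verified, is the row where both indicators are 0
k_minus_1 <- model.matrix(~ verified_income)
k_minus_1

# K indicators and no intercept: one column per level,
# and exactly one 1 in every row
k_full <- model.matrix(~ 0 + verified_income)
k_full
```

Both matrices have three columns here; they describe the same fitted values, and only the meaning of the coefficients differs.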
Data manipulation 3: Create indicator variables for verified_income
loan50 <- loan50 %>%
  mutate(
    not_verified = if_else(verified_income == "Not Verified", 1, 0),
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0)
  )
# A tibble: 3 × 4
  verified_income not_verified source_verified verified
  <fct>                  <dbl>           <dbl>    <dbl>
1 Not Verified               1               0        0
2 Verified                   0               0        1
3 Source Verified            0               1        0
Indicators in the model
- We will use \(K-1\) of the indicator variables in the model.
- The baseline is the category that doesn’t have a term in the model.
- The coefficients of the indicator variables in the model are interpreted as the expected change in the response compared to the baseline, holding all other variables constant.
- This approach is also called dummy coding.
# A tibble: 3 × 3
  verified_income source_verified verified
  <fct>                     <dbl>    <dbl>
1 Not Verified                  0        0
2 Verified                      0        1
3 Source Verified               1        0
Interpreting verified_income
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | 
|---|---|---|---|---|---|---|
| (Intercept) | 9.444 | 0.977 | 9.663 | 0.000 | 7.476 | 11.413 | 
| debt_inc_cent | 0.671 | 0.676 | 0.993 | 0.326 | -0.690 | 2.033 | 
| verified_incomeSource Verified | 2.211 | 1.399 | 1.581 | 0.121 | -0.606 | 5.028 | 
| verified_incomeVerified | 6.880 | 1.801 | 3.820 | 0.000 | 3.253 | 10.508 | 
| annual_income_th_cent | -0.021 | 0.011 | -1.804 | 0.078 | -0.043 | 0.002 | 
- The baseline category is Not Verified.
- People with source verified income are expected to take a loan with an interest rate that is 2.211 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
- People with verified income are expected to take a loan with an interest rate that is 6.880 percentage points higher, on average, than the rate on loans to those whose income is not verified, holding all else constant.
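As a worked example of the arithmetic, using the estimates from the mean-centered model above, consider a borrower whose debt-to-income ratio and annual income are both at their sample means (so both centered predictors equal 0). The expected interest rate by verification status is then:

\[
\begin{aligned}
\text{Not Verified:} \quad & 9.444 \\
\text{Source Verified:} \quad & 9.444 + 2.211 = 11.655 \\
\text{Verified:} \quad & 9.444 + 6.880 = 16.324
\end{aligned}
\]

Each indicator coefficient is the vertical shift from the baseline group's expected rate.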
Interaction terms
Interaction terms
- Sometimes the relationship between a predictor variable and the response depends on the value of another predictor variable.
- This is an interaction effect.
- To account for this, we can include interaction terms in the model.
Interest rate vs. annual income
The lines are not parallel, indicating there is an interaction effect: the slope of annual income differs by income verification status.
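The figure for this slide is not reproduced in the text. A plot of this kind could be drawn roughly as follows; this is a sketch under the assumption that the original showed one least-squares line per verification level, and the object names `loan_plot_data` and `p` are made up for the example.

```r
library(ggplot2)
library(openintro)

# Rescale income to thousands of dollars, as in "Data manipulation 1"
loan_plot_data <- transform(loan50, annual_income_th = annual_income / 1000)

# One fitted line per level of verified_income;
# non-parallel lines suggest an interaction
p <- ggplot(loan_plot_data,
            aes(x = annual_income_th, y = interest_rate,
                color = verified_income)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Annual income (in $1000s)",
    y = "Interest rate",
    color = "Income verification"
  )
p
```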

Interaction term in model
int_cent_int_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_inc_cent + annual_income_th_cent +
        verified_income * annual_income_th_cent,
      data = loan50)
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | 9.484 | 0.989 | 9.586 | 0.000 | 
| debt_inc_cent | 0.691 | 0.685 | 1.009 | 0.319 | 
| annual_income_th_cent | -0.007 | 0.020 | -0.341 | 0.735 | 
| verified_incomeSource Verified | 2.157 | 1.418 | 1.522 | 0.135 | 
| verified_incomeVerified | 7.181 | 1.870 | 3.840 | 0.000 | 
| annual_income_th_cent:verified_incomeSource Verified | -0.016 | 0.026 | -0.643 | 0.523 | 
| annual_income_th_cent:verified_incomeVerified | -0.032 | 0.033 | -0.979 | 0.333 | 
Interpreting interaction terms
- What the interaction means: The slope of annual income (the expected change in interest rate per additional thousand dollars of income) is 0.016 lower when the income is source verified than when it is not verified, holding all else constant.
- Interpreting annual_income for source verified: If the income is source verified, we expect the interest rate to decrease by 0.023 percentage points (-0.007 + -0.016) for each additional thousand dollars in annual income, holding all else constant.
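The same arithmetic gives the slope of annual income (per additional $1000) for each verification level, using the estimates in the table above:

\[
\begin{aligned}
\text{Not Verified:} \quad & -0.007 \\
\text{Source Verified:} \quad & -0.007 + (-0.016) = -0.023 \\
\text{Verified:} \quad & -0.007 + (-0.032) = -0.039
\end{aligned}
\]

The main effect of annual_income_th_cent is the slope for the baseline group, and each interaction coefficient is the difference between that group's slope and the baseline slope.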
Data manipulation 4: Create interaction variables
Defining the interaction variable in the model formula as verified_income * annual_income_th_cent is an implicit data manipulation step as well.
Rows: 50
Columns: 9
$ `(Intercept)`                                          <dbl> 1, 1, 1, 1, 1, …
$ debt_inc_cent                                          <dbl> -0.16511719, 0.…
$ annual_income_th_cent                                  <dbl> -27.17, -26.17,…
$ `verified_incomeNot Verified`                          <dbl> 1, 1, 0, 1, 1, …
$ `verified_incomeSource Verified`                       <dbl> 0, 0, 0, 0, 0, …
$ verified_incomeVerified                                <dbl> 0, 0, 1, 0, 0, …
$ `annual_income_th_cent:verified_incomeNot Verified`    <dbl> -27.17, -26.17,…
$ `annual_income_th_cent:verified_incomeSource Verified` <dbl> 0.00, 0.00, 0.0…
$ `annual_income_th_cent:verified_incomeVerified`        <dbl> 0.00, 0.00, -11…
Transformation
Data manipulation 5: Transformation on variables
- Linearity is with respect to \(\beta\): \(y = \beta_0+\beta_1 x^2\) is also a linear regression
- For right-skewed variables with a long tail (common in financial data), a log transformation decreases the variability of the data and makes its distribution closer to normal
Data manipulation 5: log transformation
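The code for this step is not shown in the text. A minimal sketch of how it might look is below, assuming the natural log of annual_income in dollars and the column name annual_income_log that appears in the model table; `loan50_log` and `log_fit` are names made up for the example.

```r
library(tidyverse)
library(tidymodels)
library(openintro)

# Log-transform annual income (natural log; an assumption)
loan50_log <- loan50 %>%
  mutate(annual_income_log = log(annual_income))

# Check how the transformation changes the shape of the distribution
ggplot(loan50_log, aes(x = annual_income_log)) +
  geom_histogram(bins = 10) +
  labs(title = "Log of annual income")

# Refit the main-effects model with the log-transformed predictor
log_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(interest_rate ~ debt_to_income + verified_income + annual_income_log,
      data = loan50_log)

tidy(log_fit)
```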


Interpretation on log-scale
As annual income increases by 10% (\(\log(x+0.1x)=\log(1.1\times x)=\log(1.1)+\log x\)), the interest rate is expected to change by \(\beta \times \log(1.1)\) on average, holding … constant.
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | 35.144 | 14.200 | 2.475 | 0.017 | 
| debt_to_income | 0.725 | 0.671 | 1.081 | 0.286 | 
| verified_incomeSource Verified | 2.140 | 1.397 | 1.532 | 0.133 | 
| verified_incomeVerified | 7.032 | 1.809 | 3.888 | 0.000 | 
| annual_income_log | -2.338 | 1.260 | -1.855 | 0.070 | 
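Plugging the estimate for annual_income_log into the formula above gives a concrete number. This assumes the natural log was used (the default for `log()` in R):

\[
\hat{\beta} \times \log(1.1) = -2.338 \times 0.0953 \approx -0.223
\]

Under that assumption, a 10% increase in annual income is associated with an expected decrease in the interest rate of about 0.22 percentage points.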
Wrap up
Recap
- Mean-centering quantitative predictors 
- Using indicator variables for categorical predictors 
- Using interaction terms 
Looking backward
Data manipulation, with dplyr (from tidyverse):
loan50 %>%
  select(interest_rate, annual_income, debt_to_income, verified_income) %>%
  mutate(
    # 1. rescale income
    annual_income_th = annual_income / 1000,
    # 2. mean-center quantitative predictors
    debt_inc_cent = debt_to_income - mean(debt_to_income),
    annual_income_th_cent = annual_income_th - mean(annual_income_th),
    # 3. create dummy variables for verified_income
    source_verified = if_else(verified_income == "Source Verified", 1, 0),
    verified = if_else(verified_income == "Verified", 1, 0),
    # 4. create interaction variables
    `annual_income_th_cent:verified_incomeSource Verified` = annual_income_th_cent * source_verified,
    `annual_income_th_cent:verified_incomeVerified` = annual_income_th_cent * verified
  )
Looking forward
Feature engineering, with recipes (from tidymodels):
loan_rec <- recipe( ~ ., data = loan50) %>%
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) %>%
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) %>%
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) %>%
  # 4. create interaction variables
  step_interact(terms = ~ annual_income_th:verified_income)
Recipe
loan_rec
Recipe
Inputs:
      role #variables
 predictor         25
Operations:
Variable mutation for annual_income / 1000
Centering for all_numeric_predictors()
Dummy variables from verified_income
Interactions with annual_income_th:verified_income
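Printing a recipe only lists the planned steps; the data are not transformed until the recipe is prepped and baked. The following is a hedged sketch of that last part, not the slides' own code. It makes three assumptions: the data are first reduced to the four variables used in these slides, interest_rate is declared as the outcome, and the interaction refers to the dummy columns created by `step_dummy()` through `starts_with("verified_income")`. The names `loan_small`, `loan_rec_sketch`, and `loan_baked` are made up for the example.

```r
library(tidymodels)
library(openintro)

# Keep only the variables used in these slides
loan_small <- loan50 %>%
  select(interest_rate, annual_income, debt_to_income, verified_income)

loan_rec_sketch <- recipe(interest_rate ~ ., data = loan_small) %>%
  # 1. rescale income
  step_mutate(annual_income_th = annual_income / 1000) %>%
  # 2. mean-center quantitative predictors
  step_center(all_numeric_predictors()) %>%
  # 3. create dummy variables for verified_income
  step_dummy(verified_income) %>%
  # 4. interact income with the dummy columns created in step 3
  step_interact(terms = ~ annual_income_th:starts_with("verified_income"))

# Estimate the step parameters (e.g. the means), then apply the steps
loan_baked <- loan_rec_sketch %>%
  prep() %>%
  bake(new_data = NULL)

glimpse(loan_baked)
```

The baked data should contain centered numeric predictors, the dummy columns, and one interaction column per dummy, which mirrors the dplyr pipeline under "Looking backward".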