Data Wrangling

STA 210 - Summer 2022

Yunran Chen

Welcome

Topics

Data Cleaning/Wrangling
Course Evaluation (10-15min)

Data Cleaning using tidyverse

Raw data to understanding, insight, and knowledge
Workflow for real-world data analysis

Packages in Tidyverse

Data Wrangling

Focus on data wrangling

Data import (readr, tibble)
Tidy data (tidyr)
Wrangle (dplyr, stringr, lubridate,janitor)

library(tidyverse)
library(cowplot)

Data import

Data import using `readr`

read_csv, …

Extract the certain type of data

readr::parse_*: parse the characters/numbers only

function `parse` in pkg `readr`

parse_number("$100")

[1] 100

parse_number("20%")

[1] 20

parse_number("It cost $123.45")

[1] 123.45

# Used in America
parse_number("$123,456,789")

[1] 123456789

# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))

[1] 123456789

# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))

[1] 123456789

function `parse` in pkg `readr`

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

x1

[1] "El Ni\xf1o was particularly bad this year"

x2

[1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"

parse_character(x1, locale = locale(encoding = "Latin1"))

[1] "El Niño was particularly bad this year"

parse_character(x2, locale = locale(encoding = "Shift-JIS"))

[1] "こんにちは"

function `parse` in pkg `readr`

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

[1] "2015-01-01"

parse_date("01/02/15", "%m/%d/%y")

[1] "2015-01-02"

parse_date("01/02/15", "%d/%m/%y")

[1] "2015-02-01"

parse_date("01/02/15", "%y/%m/%d")

[1] "2001-02-15"

parse_datetime("2010-10-01T2010")

[1] "2010-10-01 20:10:00 UTC"

library(hms)
parse_time("01:10 am")

01:10:00

Other packages for data importing

Package haven: SPSS, Stata, SAS file
Package readxl: Excel file .xls, .xlsx
Package jsonlite/htmltab: json, html
use as_tibble to coerce a data frame to a tibble

`janitor` package can help with cleaning names

clean_names,remove_empty_cols,remove_empty_rows

`janitor` package

clean_names,remove_empty_cols,remove_empty_rows

Tidy data

`pivot_longer` function in `tidyr` pkg

pivot_longer: from wide to long

table4 %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

`pivot_wider` function in `tidyr` pkg

pivot_wider: from long to wide

table2 %>%
    pivot_wider(names_from = key, values_from = value)

`separate` function in `tidyr` pkg

separate: from 1 column to 2+ column
can sep based on digits or characters.

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

`unite` function in `tidyr` pkg

unite: from 2+ column to 1 column

table6 %>% 
  unite("year", century, year, sep = "")

Transform

Packages in Tidyverse

`dplyr`: `join` multiple datasets

Relational datasets in `nycflights13`

`join` multiple datasets

R pkg nycflights13 provide multiple relational datasets:

flights connects to planes via a single variable, tailnum.
flights connects to airlines through the carrier variable.
flights connects to airports in two ways: via the origin and dest variables.
flights connects to weather via origin (the location), and year, month, day and hour (the time)

`inner_join`

inner_join(x,y,by="key")

`*_join`

left_join, right_join, full_join

Data wrangling with `dplyr`

slice(): pick rows using indexes
filter(): keep rows satisfying your condition
select(): select variables obtain a tibble
pull(): grab a column as a vector
relocate(): relocate a variable
arrange(): reorder rows
rename(): rename a variable
mutate(): add columns
group_by()%>%summarize(): summary statistics for different groups
count(): count the frequency
distinct(): keep unique rows
functions within mutate(): across(), if_else(), case_when()
functions for selecting variables: starts_with(), ends_with(), contains(), matches(), everything()

`case_when`

Practice

library("palmerpenguins")

Keep the chinstrap and gentoo penguins, living in Dream and Biscoe Islands.
Get first 100 observation
Only keep columns from species to flipper_length_mm, and sex and year
Rename sex as gender
Move gender right after island, move numeric variables after factor variables
Add a new column to identify each observation
Transfer island as character
Add a new variable called bill_ratio which is the ratio of bill length to bill depth
Obtain the mean and standard deviation of body mass of different species
For different species, obtain the mean of variables ending with mm
Provide the distribution of different species of penguins living in different island across time

Practice

To penguins, add a new column size_bin that contains:

“large” if body mass is greater than 4500 g
“medium” if body mass is greater than 3000 g, and less than or equal to 4500 g
“small” if body mass is less than or equal to 3000 g

Deal with different types of variables

stringr for strings
forcats for factors
lubridate for dates and times

`stringr` for strings

Cheatsheets for stringr

Useful functions in `stringr`

If your raw data has numbers as variable names, you may consider to add characters in front of it.

table4 %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

str_c: join multiple strings to a single string

str_c(c("x", "y", "z"), collapse = ", ")

[1] "x, y, z"

str_c("x",1:3)

[1] "x1" "x2" "x3"

Useful functions in `stringr`

To obtain nice variable names. - Can use janitor::clean_names - Or str_to_upper, str_to_lower, str_to_title

colnames(penguins)

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

colnames(penguins)%>%
  str_to_title()

[1] "Species"           "Island"            "Bill_length_mm"   
[4] "Bill_depth_mm"     "Flipper_length_mm" "Body_mass_g"      
[7] "Sex"               "Year"

Useful functions in `stringr`

str_detect(): Return TRUE/FALSE for strings satisfying the pattern
pattern: regular expressions (CheatSheets)
str_replace, str_replace_all(multiple replacements)
Text mining: Useful if your have text in the survey, and you want to extract important features from text

x <- c("apple", "banana", "pear")
str_detect(x, "[aeiou]$")

[1]  TRUE  TRUE FALSE

str_replace(x, "[aeiou]", "-")

[1] "-pple"  "b-nana" "p-ar"

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))

[1] "one house"    "two cars"     "three people"

`forcats` for factors

fct_infreq: order levels by frequency
fct_rev: reverse the order of levels
fct_reorder,fct_reorder2: order according to other variables
fct_relevel: reorder manually
fct_collapse: combine levels
Useful for visualization in ggplot.

`fct_infreq` and `fct_rev`

p1=penguins %>%
  ggplot(aes(species)) + geom_bar()
p2=penguins %>%
  mutate(species=species %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(species)) + geom_bar()
plot_grid(p1,p2)

Examples for `fct_reorder`

library(cowplot)
peng_summary=penguins %>%
  group_by(species)%>%
  summarise(
    bill_length_mean=mean(bill_length_mm, na.rm=T),
    bill_depth_mean=mean(bill_depth_mm, na.rm=T))
p1=ggplot(peng_summary,aes(bill_length_mean,species)) + 
  geom_point()
p2=ggplot(peng_summary,aes(bill_length_mean,fct_reorder(species,bill_length_mean))) + geom_point()
plot_grid(p1,p2)

Examples for `fct_reorder2`

Reorder the factor by the y values associated with the largest x values
Easier to read: colours line up with the legend

p1=ggplot(penguins,aes(bill_length_mm,flipper_length_mm,color=species)) + 
  geom_line()
p2=ggplot(penguins,aes(bill_length_mm,flipper_length_mm,color=fct_reorder2(species,bill_length_mm,flipper_length_mm))) + 
  geom_line() + guides(color=guide_legend(title="species"))
plot_grid(p1,p2)

`fct_recode()`

penguins %>%
  count(species)

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

penguins %>%
  mutate(species = fct_recode(species,
                              "Adeliea" = "Adelie",
                               "ChinChin" = "Chinstrap",
                               "Gentooman" = "Gentoo")) %>%
  count(species)

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adeliea     152
2 ChinChin     68
3 Gentooman   124

`fct_collapse`

Can use to collapse lots of levels

penguins %>% 
  mutate(species = fct_collapse(species,
                                "Adelie"="Adelie",
                                "Other"=c("Chinstrap","Gentoo")))%>%
  count(species)

# A tibble: 2 × 2
  species     n
  <fct>   <int>
1 Adelie    152
2 Other     192

`fct_lump`

volcano <- read_csv(here::here("slides", "data/volcano.csv"))
volcano %>% count(country) %>% nrow()

[1] 89

volcano %>%
  mutate(country=fct_lump(country,n=5))%>%
  count(country)

# A tibble: 6 × 2
  country           n
  <fct>         <int>
1 Chile            43
2 Indonesia        95
3 Japan            92
4 Russia           79
5 United States    99
6 Other           550

`lubridate` for dates and times

library("lubridate")
library(nycflights13)

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day))%>%
  ggplot(aes(departure))+
  geom_freqpoly()

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day))%>%
  filter(departure < ymd(20131002)) %>% 
  ggplot(aes(departure))+
  geom_freqpoly()

year(), month()
mday() (day of the month), yday() (day of the year), wday() (day of the week)
hour(), minute(), second()

Data Wrangling

Welcome

Topics

Data Cleaning using tidyverse

Packages in Tidyverse

Data Wrangling

Focus on data wrangling

Data import

Data import using readr

Extract the certain type of data

function parse in pkg readr

function parse in pkg readr

function parse in pkg readr

Other packages for data importing

janitor package can help with cleaning names

janitor package

Tidy data

Tidy data

Tidy data

Tidy data

Tidy data

Tidy data

Tidy data

pivot_longer function in tidyr pkg

pivot_wider function in tidyr pkg

separate function in tidyr pkg

unite function in tidyr pkg

Transform

Packages in Tidyverse

dplyr: join multiple datasets

Relational datasets in nycflights13

join multiple datasets

inner_join

*_join

Data wrangling with dplyr

case_when

Practice

Practice

Deal with different types of variables

Deal with different types of variables

stringr for strings

stringr for strings

Useful functions in stringr

Useful functions in stringr

Useful functions in stringr

forcats for factors

forcats for factors

fct_infreq and fct_rev

Examples for fct_reorder

Examples for fct_reorder2

fct_recode()

fct_collapse

fct_lump

lubridate for dates and times

lubridate for dates and times

Reference

Data import using `readr`

function `parse` in pkg `readr`

function `parse` in pkg `readr`

function `parse` in pkg `readr`

`janitor` package can help with cleaning names

`janitor` package

`pivot_longer` function in `tidyr` pkg

`pivot_wider` function in `tidyr` pkg

`separate` function in `tidyr` pkg

`unite` function in `tidyr` pkg

`dplyr`: `join` multiple datasets

Relational datasets in `nycflights13`

`join` multiple datasets

`inner_join`

`*_join`

Data wrangling with `dplyr`

`case_when`

`stringr` for strings

`stringr` for strings

Useful functions in `stringr`

Useful functions in `stringr`

Useful functions in `stringr`

`forcats` for factors

`forcats` for factors

`fct_infreq` and `fct_rev`

Examples for `fct_reorder`

Examples for `fct_reorder2`

`fct_recode()`

`fct_collapse`

`fct_lump`

`lubridate` for dates and times

`lubridate` for dates and times