Data Wrangling

STA 210 - Summer 2022

Yunran Chen

Welcome

Topics

  • Data Cleaning/Wrangling

  • Course Evaluation (10-15min)

Data Cleaning using tidyverse

  • From raw data to understanding, insight, and knowledge

  • Workflow for real-world data analysis

Packages in Tidyverse

Data Wrangling

Focus on data wrangling

  • Data import (readr, tibble)

  • Tidy data (tidyr)

  • Wrangle (dplyr, stringr, lubridate, janitor)

library(tidyverse)
library(cowplot)

Data import

Data import using readr

  • read_csv, …

Extract certain types of data

readr::parse_*: parse out just the characters/numbers/dates from a string

parse_* functions in the readr package

parse_number("$100")
[1] 100
parse_number("20%")
[1] 20
parse_number("It cost $123.45")
[1] 123.45
# Used in America
parse_number("$123,456,789")
[1] 123456789
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
[1] 123456789
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
[1] 123456789

parse_* functions in the readr package

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

x1
[1] "El Ni\xf1o was particularly bad this year"
x2
[1] "\x82\xb1\x82\xf1\x82ɂ\xbf\x82\xcd"
parse_character(x1, locale = locale(encoding = "Latin1"))
[1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
[1] "こんにちは"

parse_* functions in the readr package

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
[1] "2015-01-01"
parse_date("01/02/15", "%m/%d/%y")
[1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
[1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
[1] "2001-02-15"
parse_datetime("2010-10-01T2010")
[1] "2010-10-01 20:10:00 UTC"
library(hms)
parse_time("01:10 am")
01:10:00

Other packages for data importing

  • Package haven: SPSS, Stata, SAS file

  • Package readxl: Excel file .xls, .xlsx

  • Package jsonlite/htmltab: json, html

  • Use as_tibble to coerce a data frame to a tibble

The janitor package can help with cleaning names

clean_names(), remove_empty_cols(), remove_empty_rows()
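
A minimal sketch of what clean_names() does, using a made-up tibble (the data here is hypothetical, for illustration only):

```r
library(janitor)
library(tibble)

# Hypothetical tibble with messy column names
messy <- tibble(`First Name` = c("a", "b"), `2nd Col` = 1:2)

# clean_names() converts names to snake_case and prefixes
# names that start with a number
clean_names(messy)
# names become: first_name, x2nd_col
```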


Tidy data

Tidy data


pivot_longer function in tidyr pkg

  • pivot_longer: from wide to long
table4 %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

pivot_wider function in tidyr pkg

  • pivot_wider: from long to wide
table2 %>%
    pivot_wider(names_from = key, values_from = value)

separate function in tidyr pkg

  • separate: split 1 column into 2+ columns
  • can separate based on a character position or a separator pattern.
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

unite function in tidyr pkg

  • unite: combine 2+ columns into 1 column
table6 %>% 
  unite("year", century, year, sep = "")

Transform

Packages in Tidyverse

dplyr: join multiple datasets

Relational datasets in nycflights13

join multiple datasets

The R package nycflights13 provides multiple relational datasets:

  • flights connects to planes via a single variable, tailnum.

  • flights connects to airlines through the carrier variable.

  • flights connects to airports in two ways: via the origin and dest variables.

  • flights connects to weather via origin (the location), and year, month, day and hour (the time).

inner_join

inner_join(x, y, by = "key")

*_join

left_join, right_join, full_join
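
As a sketch of how the join variants differ, consider two toy tables (hypothetical data, mimicking the flights/airlines key structure):

```r
library(dplyr)
library(tibble)

# Hypothetical relational tables for illustration
flights_mini  <- tibble(carrier = c("AA", "UA", "ZZ"), dep_delay = c(5, -2, 10))
airlines_mini <- tibble(carrier = c("AA", "UA"), name = c("American", "United"))

# inner_join keeps only rows whose key appears in both tables
inner_join(flights_mini, airlines_mini, by = "carrier")  # 2 rows

# left_join keeps every row of the left table; unmatched keys get NA
left_join(flights_mini, airlines_mini, by = "carrier")   # 3 rows, "ZZ" has NA name
```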

Data wrangling with dplyr

  • slice(): pick rows using indexes
  • filter(): keep rows satisfying your condition
  • select(): select variables, returning a tibble
  • pull(): grab a column as a vector
  • relocate(): relocate a variable
  • arrange(): reorder rows
  • rename(): rename a variable
  • mutate(): add columns
  • group_by()%>%summarize(): summary statistics for different groups
  • count(): count the frequency
  • distinct(): keep unique rows
  • functions within mutate(): across(), if_else(), case_when()
  • functions for selecting variables: starts_with(), ends_with(), contains(), matches(), everything()
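
A short pipeline on a made-up tibble (hypothetical data for illustration) shows how these verbs chain together:

```r
library(dplyr)
library(tibble)

# Hypothetical data
df <- tibble(group = c("a", "a", "b", "b"),
             value = c(1, 3, 2, 6))

df %>%
  filter(value > 1) %>%            # keep rows with value above 1
  mutate(double = value * 2) %>%   # add a derived column
  group_by(group) %>%
  summarize(mean_value = mean(value))  # per-group summary
# group "a" has mean 3; group "b" has mean 4
```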

case_when
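
case_when() evaluates its conditions in order and returns the right-hand side of the first condition that is TRUE. A minimal sketch:

```r
library(dplyr)

x <- c(-3, 0, 7)

case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  TRUE   ~ "positive"   # fallback for everything else
)
# "negative" "zero" "positive"
```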

Practice

library("palmerpenguins")
  • Keep the chinstrap and gentoo penguins living on the Dream and Biscoe islands
  • Get the first 100 observations
  • Only keep the columns from species to flipper_length_mm, plus sex and year
  • Rename sex as gender
  • Move gender right after island; move numeric variables after factor variables
  • Add a new column to identify each observation
  • Convert island to character
  • Add a new variable called bill_ratio, the ratio of bill length to bill depth
  • Obtain the mean and standard deviation of body mass for each species
  • For each species, obtain the mean of the variables ending with mm
  • Provide the distribution of the different species of penguins living on different islands across time

Practice

To penguins, add a new column size_bin that contains:

  • “large” if body mass is greater than 4500 g
  • “medium” if body mass is greater than 3000 g, and less than or equal to 4500 g
  • “small” if body mass is less than or equal to 3000 g

Deal with different types of variables

Deal with different types of variables

  • stringr for strings
  • forcats for factors
  • lubridate for dates and times

stringr for strings

stringr for strings

Useful functions in stringr

If your raw data has numbers as variable names, consider adding characters in front of them.

table4 %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

str_c: join multiple strings to a single string

str_c(c("x", "y", "z"), collapse = ", ")
[1] "x, y, z"
str_c("x",1:3) 
[1] "x1" "x2" "x3"

Useful functions in stringr

To obtain nice variable names:
  • Use janitor::clean_names
  • Or str_to_upper, str_to_lower, str_to_title

colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             
colnames(penguins)%>%
  str_to_title()
[1] "Species"           "Island"            "Bill_length_mm"   
[4] "Bill_depth_mm"     "Flipper_length_mm" "Body_mass_g"      
[7] "Sex"               "Year"             

Useful functions in stringr

  • str_detect(): returns TRUE/FALSE for strings matching the pattern
  • pattern: regular expressions (see the cheat sheets)
  • str_replace(), str_replace_all() (multiple replacements)
  • Text mining: useful if you have free text in a survey and want to extract important features from it
x <- c("apple", "banana", "pear")
str_detect(x, "[aeiou]$")
[1]  TRUE  TRUE FALSE
str_replace(x, "[aeiou]", "-")
[1] "-pple"  "b-nana" "p-ar"  
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
[1] "one house"    "two cars"     "three people"

forcats for factors

forcats for factors

  • fct_infreq: order levels by frequency
  • fct_rev: reverse the order of levels
  • fct_reorder, fct_reorder2: order according to other variables
  • fct_relevel: reorder manually
  • fct_collapse: combine levels
  • Useful for visualization in ggplot.

fct_infreq and fct_rev

p1 <- penguins %>%
  ggplot(aes(species)) + geom_bar()
p2 <- penguins %>%
  mutate(species = species %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(species)) + geom_bar()
plot_grid(p1, p2)

Examples for fct_reorder

library(cowplot)
peng_summary <- penguins %>%
  group_by(species) %>%
  summarise(
    bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
    bill_depth_mean = mean(bill_depth_mm, na.rm = TRUE))
p1 <- ggplot(peng_summary, aes(bill_length_mean, species)) +
  geom_point()
p2 <- ggplot(peng_summary, aes(bill_length_mean, fct_reorder(species, bill_length_mean))) +
  geom_point()
plot_grid(p1, p2)

Examples for fct_reorder2

  • Reorder the factor by the y values associated with the largest x values
  • Easier to read: colours line up with the legend
p1 <- ggplot(penguins, aes(bill_length_mm, flipper_length_mm, color = species)) +
  geom_line()
p2 <- ggplot(penguins, aes(bill_length_mm, flipper_length_mm,
                           color = fct_reorder2(species, bill_length_mm, flipper_length_mm))) +
  geom_line() + guides(color = guide_legend(title = "species"))
plot_grid(p1, p2)

fct_recode()

penguins %>%
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
penguins %>%
  mutate(species = fct_recode(species,
                              "Adeliea" = "Adelie",
                               "ChinChin" = "Chinstrap",
                               "Gentooman" = "Gentoo")) %>%
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adeliea     152
2 ChinChin     68
3 Gentooman   124

fct_collapse

Use it to collapse many levels into a smaller set

penguins %>% 
  mutate(species = fct_collapse(species,
                                "Adelie"="Adelie",
                                "Other"=c("Chinstrap","Gentoo")))%>%
  count(species)
# A tibble: 2 × 2
  species     n
  <fct>   <int>
1 Adelie    152
2 Other     192

fct_lump

volcano <- read_csv(here::here("slides", "data/volcano.csv"))
volcano %>% count(country) %>% nrow()
[1] 89
volcano %>%
  mutate(country=fct_lump(country,n=5))%>%
  count(country)
# A tibble: 6 × 2
  country           n
  <fct>         <int>
1 Chile            43
2 Indonesia        95
3 Japan            92
4 Russia           79
5 United States    99
6 Other           550

lubridate for dates and times

lubridate for dates and times

library("lubridate")
library(nycflights13)
flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute)) %>%
  ggplot(aes(departure)) +
  geom_freqpoly()
flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute)) %>%
  filter(departure < ymd(20131002)) %>%
  ggplot(aes(departure)) +
  geom_freqpoly()
  • year(), month()
  • mday() (day of the month), yday() (day of the year), wday() (day of the week)
  • hour(), minute(), second()
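
A quick sketch of these accessor functions applied to a made-up datetime:

```r
library(lubridate)

d <- ymd_hms("2013-10-01 14:30:05")

year(d)    # 2013
month(d)   # 10
mday(d)    # 1
yday(d)    # day of the year
wday(d)    # day of the week (Sunday = 1 by default)
hour(d)    # 14
minute(d)  # 30
second(d)  # 5
```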
