STA 210 - Summer 2022
Yunran Chen
Data Cleaning/Wrangling
Course Evaluation (10-15min)
Raw data to understanding, insight, and knowledge
Workflow for real-world data analysis

Data import (readr, tibble)
Tidy data (tidyr)
Wrangle (dplyr, stringr, lubridate,janitor)
readrread_csv, …readr::parse_*: parse the characters/numbers only

parse in pkg readr[1] 100
[1] 20
[1] 123.45
[1] 123456789
[1] 123456789
[1] 123456789
parse in pkg readrparse in pkg readr[1] "2015-01-01"
[1] "2015-01-02"
[1] "2015-02-01"
[1] "2001-02-15"
Package haven: SPSS, Stata, SAS file
Package readxl: Excel file .xls, .xlsx
Package jsonlite/htmltab: json, html
use as_tibble to coerce a data frame to a tibble
janitor package can help with cleaning namesclean_names,remove_empty_cols,remove_empty_rows
janitor packageclean_names,remove_empty_cols,remove_empty_rows
pivot_longer function in tidyr pkgpivot_longer: from wide to longpivot_wider function in tidyr pkgpivot_wider: from long to wideseparate function in tidyr pkgseparate: from 1 column to 2+ columnsep based on digits or characters.unite function in tidyr pkgunite: from 2+ column to 1 columndplyr: join multiple datasetsnycflights13join multiple datasetsR pkg nycflights13 provide multiple relational datasets:
flights connects to planes via a single variable, tailnum.
flights connects to airlines through the carrier variable.
flights connects to airports in two ways: via the origin and dest variables.
flights connects to weather via origin (the location), and year, month, day and hour (the time)
inner_join*_joinleft_join, right_join, full_join

dplyrslice(): pick rows using indexesfilter(): keep rows satisfying your conditionselect(): select variables obtain a tibblepull(): grab a column as a vectorrelocate(): relocate a variablearrange(): reorder rowsrename(): rename a variablemutate(): add columnsgroup_by()%>%summarize(): summary statistics for different groupscount(): count the frequencydistinct(): keep unique rowsmutate(): across(), if_else(), case_when()starts_with(), ends_with(), contains(), matches(), everything()case_whenspecies to flipper_length_mm, and sex and yearsex as gendergender right after island, move numeric variables after factor variablesisland as characterbill_ratio which is the ratio of bill length to bill depthmmTo penguins, add a new column size_bin that contains:
stringr for stringsforcats for factorslubridate for dates and timesstringr for stringsstringr for stringsstringrIf your raw data has numbers as variable names, you may consider to add characters in front of it.
stringrTo obtain nice variable names. - Can use janitor::clean_names - Or str_to_upper, str_to_lower, str_to_title
stringrstr_detect(): Return TRUE/FALSE for strings satisfying the patternstr_replace, str_replace_all(multiple replacements)forcats for factorsforcats for factorsfct_infreq: order levels by frequencyfct_rev: reverse the order of levelsfct_reorder,fct_reorder2: order according to other variablesfct_relevel: reorder manuallyfct_collapse: combine levelsggplot.fct_infreq and fct_revfct_reorderlibrary(cowplot)
peng_summary=penguins %>%
group_by(species)%>%
summarise(
bill_length_mean=mean(bill_length_mm, na.rm=T),
bill_depth_mean=mean(bill_depth_mm, na.rm=T))
p1=ggplot(peng_summary,aes(bill_length_mean,species)) +
geom_point()
p2=ggplot(peng_summary,aes(bill_length_mean,fct_reorder(species,bill_length_mean))) + geom_point()
plot_grid(p1,p2)fct_reorder2fct_recode()# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins %>%
mutate(species = fct_recode(species,
"Adeliea" = "Adelie",
"ChinChin" = "Chinstrap",
"Gentooman" = "Gentoo")) %>%
count(species)# A tibble: 3 × 2
species n
<fct> <int>
1 Adeliea 152
2 ChinChin 68
3 Gentooman 124
fct_collapseCan use to collapse lots of levels
fct_lumplubridate for dates and timeslubridate for dates and timesyear(), month()mday() (day of the month), yday() (day of the year), wday() (day of the week)hour(), minute(), second()