STA 210 - Summer 2022
Yunran Chen
Data Cleaning/Wrangling
Course Evaluation (10-15min)
Raw data to understanding, insight, and knowledge
Workflow for real-world data analysis
Data import (readr
, tibble
)
Tidy data (tidyr
)
Wrangle (dplyr
, stringr
, lubridate
,janitor
)
readr
read_csv
, …readr::parse_*
: parse the characters/numbers only
parse
in pkg readr
[1] 100
[1] 20
[1] 123.45
[1] 123456789
[1] 123456789
[1] 123456789
parse
in pkg readr
parse
in pkg readr
[1] "2015-01-01"
[1] "2015-01-02"
[1] "2015-02-01"
[1] "2001-02-15"
Package haven
: SPSS, Stata, SAS file
Package readxl
: Excel file .xls, .xlsx
Package jsonlite
/htmltab
: json, html
use as_tibble
to coerce a data frame to a tibble
janitor
package can help with cleaning namesclean_names
,remove_empty_cols
,remove_empty_rows
janitor
packageclean_names
,remove_empty_cols
,remove_empty_rows
pivot_longer
function in tidyr
pkgpivot_longer
: from wide to longpivot_wider
function in tidyr
pkgpivot_wider
: from long to wideseparate
function in tidyr
pkgseparate
: from 1 column to 2+ columnsep
based on digits or characters.unite
function in tidyr
pkgunite
: from 2+ column to 1 columndplyr
: join
multiple datasetsnycflights13
join
multiple datasetsR pkg nycflights13
provide multiple relational datasets:
flights
connects to planes
via a single variable, tailnum
.
flights
connects to airlines
through the carrier
variable.
flights
connects to airports
in two ways: via the origin
and dest
variables.
flights
connects to weather
via origin
(the location), and year
, month
, day
and hour
(the time)
inner_join
*_join
left_join
, right_join
, full_join
dplyr
slice()
: pick rows using indexesfilter()
: keep rows satisfying your conditionselect()
: select variables obtain a tibblepull()
: grab a column as a vectorrelocate()
: relocate a variablearrange()
: reorder rowsrename()
: rename a variablemutate()
: add columnsgroup_by()%>%summarize()
: summary statistics for different groupscount()
: count the frequencydistinct()
: keep unique rowsmutate()
: across()
, if_else()
, case_when()
starts_with()
, ends_with()
, contains()
, matches()
, everything()
case_when
species
to flipper_length_mm
, and sex
and year
sex
as gender
gender
right after island, move numeric variables after factor variablesisland
as characterbill_ratio
which is the ratio of bill length to bill depthmm
To penguins, add a new column size_bin that contains:
stringr
for stringsforcats
for factorslubridate
for dates and timesstringr
for stringsstringr
for stringsstringr
If your raw data has numbers as variable names, you may consider to add characters in front of it.
stringr
To obtain nice variable names. - Can use janitor::clean_names
- Or str_to_upper
, str_to_lower
, str_to_title
stringr
str_detect()
: Return TRUE/FALSE for strings satisfying the patternstr_replace
, str_replace_all
(multiple replacements)forcats
for factorsforcats
for factorsfct_infreq
: order levels by frequencyfct_rev
: reverse the order of levelsfct_reorder
,fct_reorder2
: order according to other variablesfct_relevel
: reorder manuallyfct_collapse
: combine levelsggplot
.fct_infreq
and fct_rev
fct_reorder
library(cowplot)
peng_summary=penguins %>%
group_by(species)%>%
summarise(
bill_length_mean=mean(bill_length_mm, na.rm=T),
bill_depth_mean=mean(bill_depth_mm, na.rm=T))
p1=ggplot(peng_summary,aes(bill_length_mean,species)) +
geom_point()
p2=ggplot(peng_summary,aes(bill_length_mean,fct_reorder(species,bill_length_mean))) + geom_point()
plot_grid(p1,p2)
fct_reorder2
fct_recode()
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins %>%
mutate(species = fct_recode(species,
"Adeliea" = "Adelie",
"ChinChin" = "Chinstrap",
"Gentooman" = "Gentoo")) %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adeliea 152
2 ChinChin 68
3 Gentooman 124
fct_collapse
Can use to collapse lots of levels
fct_lump
lubridate
for dates and timeslubridate
for dates and timesyear()
, month()
mday()
(day of the month), yday()
(day of the year), wday()
(day of the week)hour()
, minute()
, second()