library(pacman)
p_load(dplyr, tidyr, stringr, lubridate, janitor)Advanced Data Wrangling in R
Description
This class dives deeper into the world of R data wrangling, covering advanced techniques and their applications. We’ll explore complex functions and join operations essential for real-world data analysis.
Required packages for today
1 Review of Previous Lessons
![]() |
![]() |
Quick recap of essential data wrangling functions:
mutate()filter()select()slice(),slice_head(),slice_tail(),slice_sample()rename()arrange()distinct()count()glimpse()summarise()group_by
unite()separate()pivot_longer()pivot_wider()case_when()nest()unnest()
skim()
2 Complex Joins
In data analysis, combining data from different sources is often necessary. Here, we’ll use dplyr’s join functions to merge datasets based on common keys.
2.1 Example: Merging Weather and Crop Yield Data
2.2 left_join()
Includes all records from the left dataset and the matched records from the right dataset. If there is no match, the result is NA in the columns of the right dataset.
# Daily data
daily_weather_data <- tibble(
date = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "day"),
# create month column with name
month = lubridate::month(date, label = TRUE, abbr = TRUE),
precipitation = round(runif(n=184, min = 0, max = 25), digits = 0),
temperature = runif(n=184, min = 10, max = 30)
)
# Summarize daily weather by month
monthly_weather_data <- daily_weather_data %>%
group_by(month) %>%
summarise(total_precipitation = sum(precipitation),
avg_temperature = mean(temperature)) %>%
ungroup()
# Forage data
forage_data <- tibble(
date = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "month"),
# extract month from date
month = lubridate::month(date, label = TRUE, abbr = TRUE),
forage_yield = c(500, 1200, 3000, 4000, 2800, 1500, 0) )
# Merge data
left_joined_data <- left_join(monthly_weather_data, forage_data, by = "month")2.3 right_join()
Includes all records from the right dataset and the matched records from the left dataset. If there is no match, the result is NA in the columns of the left dataset.
# Merge data
right_joined_data <-
right_join(daily_weather_data, forage_data, by = "month")2.4 full_join()
Includes all records when there is a match in the keys of the left or right datasets. If there is no match, the result is NA in the columns of the dataset that does not have a match.
# Merge data
full_joined_data <-
full_join(daily_weather_data, forage_data, by = "date")If a key appears more than once in either table, the join can produce more rows than expected.
Filtering joins are also very useful:
2.5 inner_join()
keep only keys present in both tables
# inner_join:
inner_join(monthly_weather_data, forage_data, by = "month")# A tibble: 7 × 5
month total_precipitation avg_temperature date forage_yield
<ord> <dbl> <dbl> <date> <dbl>
1 Apr 365 18.1 2024-04-01 500
2 May 455 18.0 2024-05-01 1200
3 Jun 410 20.8 2024-06-01 3000
4 Jul 423 18.7 2024-07-01 4000
5 Aug 372 18.1 2024-08-01 2800
6 Sep 364 20.4 2024-09-01 1500
7 Oct 22 16.7 2024-10-01 0
2.6 semi_join()
keep rows from left that have a match in right (does NOT bring columns from right)
semi_join(monthly_weather_data, forage_data, by = "month")# A tibble: 7 × 3
month total_precipitation avg_temperature
<ord> <dbl> <dbl>
1 Apr 365 18.1
2 May 455 18.0
3 Jun 410 20.8
4 Jul 423 18.7
5 Aug 372 18.1
6 Sep 364 20.4
7 Oct 22 16.7
2.7 anti_join()
find rows in left with NO match in right (great for debugging)
anti_join(monthly_weather_data, forage_data, by = "month")# A tibble: 0 × 3
# ℹ 3 variables: month <ord>, total_precipitation <dbl>, avg_temperature <dbl>
anti_join(daily_weather_data, forage_data, by = "date")# A tibble: 177 × 4
date month precipitation temperature
<date> <ord> <dbl> <dbl>
1 2024-04-02 Apr 5 14.0
2 2024-04-03 Apr 19 20.3
3 2024-04-04 Apr 15 23.8
4 2024-04-05 Apr 3 14.3
5 2024-04-06 Apr 23 19.8
6 2024-04-07 Apr 1 11.5
7 2024-04-08 Apr 4 20.2
8 2024-04-09 Apr 21 11.7
9 2024-04-10 Apr 15 19.3
10 2024-04-11 Apr 17 25.8
# ℹ 167 more rows
3 across()
After merging, it’s crucial to check for missing values and duplicates:
3.1 everything()
it allows you to apply functions to all columns in a data frame
data_quality_summary <- full_joined_data %>%
summarise(across(.cols = everything(),
.fns = ~ sum(is.na(.)),
.names = "missing_{.col}"))
data_quality_summary# A tibble: 1 × 6
missing_date missing_month.x missing_precipitation missing_temperature
<int> <int> <int> <int>
1 0 0 0 0
# ℹ 2 more variables: missing_month.y <int>, missing_forage_yield <int>
3.2 where()
When you have many columns and want to apply the same transformation/summarization to all of them (or a subset), across() is the go-to tool.
# Example: summarize ALL numeric columns by month
daily_weather_data %>%
mutate(month = lubridate::month(date, label = TRUE, abbr = TRUE)) %>%
group_by(month) %>%
summarise(
across(where(is.numeric), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"),
.groups = "drop"
)# A tibble: 7 × 5
month precipitation_mean precipitation_sd temperature_mean temperature_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Apr 12.2 7.44 18.1 5.05
2 May 14.7 7.78 18.0 5.31
3 Jun 13.7 6.94 20.8 6.21
4 Jul 13.6 6.26 18.7 6.57
5 Aug 12 8.19 18.1 6.03
6 Sep 12.1 7.71 20.4 5.48
7 Oct 22 NA 16.7 NA
4 Missing data with tidyr
4.1 replace_na()
df <- tibble(
site = c("Elora", "Elora", "Ridgetown", "Ridgetown"),
year = c(2024, 2025, 2024, 2025),
yield = c(NA, 8.2, 7.9, NA)
)
df %>%
tidyr::replace_na(list(yield = 0))# A tibble: 4 × 3
site year yield
<chr> <dbl> <dbl>
1 Elora 2024 0
2 Elora 2025 8.2
3 Ridgetown 2024 7.9
4 Ridgetown 2025 0
4.2 drop_na()
df %>% tidyr::drop_na(yield)# A tibble: 2 × 3
site year yield
<chr> <dbl> <dbl>
1 Elora 2025 8.2
2 Ridgetown 2024 7.9
4.3 fill()
common with messy datasets
df2 <- tibble(
site = c("Elora", NA, NA, "Ridgetown", NA),
plot = 1:5,
value = rnorm(5)
)
df2 %>%
tidyr::fill(site, .direction = "down")# A tibble: 5 × 3
site plot value
<chr> <int> <dbl>
1 Elora 1 -1.38
2 Elora 2 0.854
3 Elora 3 -2.90
4 Ridgetown 4 -0.637
5 Ridgetown 5 0.734
5 Date Handling with lubridate

Using lubridate, we can extract various date components for analysis.
weather_data_dates <- daily_weather_data %>%
mutate(
year_month_day = format(date, "%Y_%m_%d"),
day_of_year = yday(date),
day_of_month = mday(date),
week_of_year = week(date),
month_name = month(date, label = TRUE, abbr = FALSE)
)
# Take a look
glimpse(weather_data_dates)Rows: 184
Columns: 9
$ date <date> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-0…
$ month <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, …
$ precipitation <dbl> 12, 5, 19, 15, 3, 23, 1, 4, 21, 15, 17, 6, 3, 16, 12, 2…
$ temperature <dbl> 15.18239, 13.95953, 20.30157, 23.81710, 14.26687, 19.84…
$ year_month_day <chr> "2024_04_01", "2024_04_02", "2024_04_03", "2024_04_04",…
$ day_of_year <dbl> 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104…
$ day_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ week_of_year <dbl> 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,…
$ month_name <ord> April, April, April, April, April, April, April, April,…
6 Data Cleaning with janitor

6.1 clean_names()
Sometimes, datasets come with inconsistent column names, which can cause issues in analysis.
messy_weather_data <- tibble(
`Date Recorded` = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "month"),
`Precipitation (mm)` = c(20, 40, 60, 80, 50, 30, 2),
`Temperature..(C,)` = c(15, 18, 25, 30, 22, 16, 12)
)
clean_weather_data <- messy_weather_data %>% clean_names()
clean_weather_data# A tibble: 7 × 3
date_recorded precipitation_mm temperature_c
<date> <dbl> <dbl>
1 2024-04-01 20 15
2 2024-05-01 40 18
3 2024-06-01 60 25
4 2024-07-01 80 30
5 2024-08-01 50 22
6 2024-09-01 30 16
7 2024-10-01 2 12
6.2 get_dupes()
it helps identify duplicate rows in a dataset
duplicate_rows <- full_joined_data %>% get_dupes()No variable names specified - using all columns.
No duplicate combinations found of: date, month.x, precipitation, temperature, month.y, forage_yield
6.3 compare_df_cols() 🧩
it helps to check if datasets align
janitor::compare_df_cols(daily_weather_data, monthly_weather_data) column_name daily_weather_data monthly_weather_data
1 avg_temperature <NA> numeric
2 date Date <NA>
3 month factor factor
4 precipitation numeric <NA>
5 temperature numeric <NA>
6 total_precipitation <NA> numeric
# this one compares by position too
janitor::compare_df_cols_same(daily_weather_data, monthly_weather_data) [1] TRUE
6.4 make_clean_names() 🏷
it’s the same idea than clean_names(), but returns a vector, which is useful when you’re renaming in a custom way.
names(df) <- janitor::make_clean_names(names(messy_weather_data))6.5 tabyl()
Simple frequency table: how many daily records per month?
daily_weather_data %>%
mutate(month = month(date, label = TRUE, abbr = TRUE)) %>%
tabyl(month) month n percent
Jan 0 0.000000000
Feb 0 0.000000000
Mar 0 0.000000000
Apr 30 0.163043478
May 31 0.168478261
Jun 30 0.163043478
Jul 31 0.168478261
Aug 31 0.168478261
Sep 30 0.163043478
Oct 1 0.005434783
Nov 0 0.000000000
Dec 0 0.000000000
6.6 adorn_*() functions
Enhance tabyl outputs with totals, percentages, and formatting.
# 2) Two-way tabyl: month x "rainy day?" (precipitation > 0)
daily_weather_data %>%
mutate(
month = month(date, label = TRUE, abbr = TRUE),
rainy_day = if_else(precipitation > 0, "Yes", "No")
) %>%
tabyl(month, rainy_day) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns() month No Yes Total
Jan - (0) - (0) 100.0% (0)
Feb - (0) - (0) 100.0% (0)
Mar - (0) - (0) 100.0% (0)
Apr 0.0% (0) 100.0% (30) 100.0% (30)
May 3.2% (1) 96.8% (30) 100.0% (31)
Jun 0.0% (0) 100.0% (30) 100.0% (30)
Jul 0.0% (0) 100.0% (31) 100.0% (31)
Aug 0.0% (0) 100.0% (31) 100.0% (31)
Sep 0.0% (0) 100.0% (30) 100.0% (30)
Oct 0.0% (0) 100.0% (1) 100.0% (1)
Nov - (0) - (0) 100.0% (0)
Dec - (0) - (0) 100.0% (0)
Total 0.5% (1) 99.5% (183) 100.0% (184)
6.7 tabyl() + case_when()
Create temperature bins (cold/mild/warm) by month
# Create temperature bins (cold/mild/warm) by month
daily_weather_data %>%
mutate(
month = month(date, label = TRUE, abbr = TRUE),
temp_bin = case_when(
temperature < 0 ~ "Below 0°C",
temperature < 10 ~ "0–10°C",
temperature < 20 ~ "10–20°C",
TRUE ~ "20°C+"
)
) %>%
tabyl(month, temp_bin) %>%
adorn_totals(where = c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns() month 10–20°C 20°C+ Total
Jan - (0) - (0) 100.0% (0)
Feb - (0) - (0) 100.0% (0)
Mar - (0) - (0) 100.0% (0)
Apr 66.7% (20) 33.3% (10) 100.0% (30)
May 71.0% (22) 29.0% (9) 100.0% (31)
Jun 36.7% (11) 63.3% (19) 100.0% (30)
Jul 58.1% (18) 41.9% (13) 100.0% (31)
Aug 67.7% (21) 32.3% (10) 100.0% (31)
Sep 50.0% (15) 50.0% (15) 100.0% (30)
Oct 100.0% (1) 0.0% (0) 100.0% (1)
Nov - (0) - (0) 100.0% (0)
Dec - (0) - (0) 100.0% (0)
Total 58.7% (108) 41.3% (76) 100.0% (184)
7 Final Thoughts and Resources
Data wrangling is a crucial step in data analysis, ensuring datasets are clean, structured, and ready for further exploration. By leveraging dplyr, tidyr, janitor, and lubridate, we can efficiently manage and transform our data to extract meaningful insights.
For further reading and practice, consider the following resources:
- R for Data Science by Hadley Wickham & Garrett Grolemund
- tidyverse documentation
- lubridate cheatsheet
- Data Wrangling with R - A comprehensive tutorial
Keep practicing and experimenting with different datasets to solidify your understanding. Happy coding!

