Advanced Data Wrangling in R

data wrangling
dplyr
tidyr
case study
Author

Dr. Adrian Correndo

Published

January 28, 2026

Description

This class dives deeper into the world of R data wrangling, covering advanced techniques and their applications. We’ll explore complex functions and join operations essential for real-world data analysis.

Required packages for today

library(pacman)
p_load(dplyr, tidyr, stringr, lubridate, janitor)

1 Review of Previous Lessons

Quick recap of essential data wrangling functions:

  • mutate()

  • filter()

  • select()

  • slice(), slice_head(), slice_tail(), slice_sample()

  • rename()

  • arrange()

  • distinct()

  • count()

  • glimpse()

  • summarise()

  • group_by

  • unite()
  • separate()
  • pivot_longer()
  • pivot_wider()
  • case_when()
  • nest()
  • unnest()
  • skim()

2 Complex Joins

In data analysis, combining data from different sources is often necessary. Here, we’ll use dplyr’s join functions to merge datasets based on common keys.

2.1 Example: Merging Weather and Crop Yield Data

2.2 left_join()

Includes all records from the left dataset and the matched records from the right dataset. If there is no match, the result is NA in the columns of the right dataset.

# Daily data
daily_weather_data <- tibble(
  date = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "day"),
  # create month column with name
  month = lubridate::month(date, label = TRUE, abbr = TRUE),
  precipitation = round(runif(n=184, min = 0, max = 25), digits = 0),
  temperature = runif(n=184, min = 10, max = 30)
)

# Summarize daily weather by month
monthly_weather_data <- daily_weather_data %>%
  group_by(month) %>%
  summarise(total_precipitation = sum(precipitation),
            avg_temperature = mean(temperature)) %>% 
  ungroup()

# Forage data
forage_data <- tibble(
  date = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "month"),
  # extract month from date
  month = lubridate::month(date, label = TRUE, abbr = TRUE),
  forage_yield = c(500, 1200, 3000, 4000, 2800, 1500, 0) )

# Merge data
left_joined_data <- left_join(monthly_weather_data, forage_data, by = "month")

2.3 right_join()

Includes all records from the right dataset and the matched records from the left dataset. If there is no match, the result is NA in the columns of the left dataset.

# Merge data
right_joined_data <- 
  right_join(daily_weather_data, forage_data, by = "month")

2.4 full_join()

Includes all records when there is a match in the keys of the left or right datasets. If there is no match, the result is NA in the columns of the dataset that does not have a match.

# Merge data
full_joined_data <- 
  full_join(daily_weather_data, forage_data, by = "date")

If a key appears more than once in either table, the join can produce more rows than expected.

Filtering joins are also very useful:

2.5 inner_join()

keep only keys present in both tables

# inner_join: 
inner_join(monthly_weather_data, forage_data, by = "month")
# A tibble: 7 × 5
  month total_precipitation avg_temperature date       forage_yield
  <ord>               <dbl>           <dbl> <date>            <dbl>
1 Apr                   365            18.1 2024-04-01          500
2 May                   455            18.0 2024-05-01         1200
3 Jun                   410            20.8 2024-06-01         3000
4 Jul                   423            18.7 2024-07-01         4000
5 Aug                   372            18.1 2024-08-01         2800
6 Sep                   364            20.4 2024-09-01         1500
7 Oct                    22            16.7 2024-10-01            0

2.6 semi_join()

keep rows from left that have a match in right (does NOT bring columns from right)

semi_join(monthly_weather_data, forage_data, by = "month")
# A tibble: 7 × 3
  month total_precipitation avg_temperature
  <ord>               <dbl>           <dbl>
1 Apr                   365            18.1
2 May                   455            18.0
3 Jun                   410            20.8
4 Jul                   423            18.7
5 Aug                   372            18.1
6 Sep                   364            20.4
7 Oct                    22            16.7

2.7 anti_join()

find rows in left with NO match in right (great for debugging)

anti_join(monthly_weather_data, forage_data, by = "month")
# A tibble: 0 × 3
# ℹ 3 variables: month <ord>, total_precipitation <dbl>, avg_temperature <dbl>
anti_join(daily_weather_data, forage_data, by = "date")
# A tibble: 177 × 4
   date       month precipitation temperature
   <date>     <ord>         <dbl>       <dbl>
 1 2024-04-02 Apr               5        14.0
 2 2024-04-03 Apr              19        20.3
 3 2024-04-04 Apr              15        23.8
 4 2024-04-05 Apr               3        14.3
 5 2024-04-06 Apr              23        19.8
 6 2024-04-07 Apr               1        11.5
 7 2024-04-08 Apr               4        20.2
 8 2024-04-09 Apr              21        11.7
 9 2024-04-10 Apr              15        19.3
10 2024-04-11 Apr              17        25.8
# ℹ 167 more rows

3 across()

After merging, it’s crucial to check for missing values and duplicates:

3.1 everything()

it allows you to apply functions to all columns in a data frame

data_quality_summary <- full_joined_data %>%
  summarise(across(.cols = everything(), 
                   .fns = ~ sum(is.na(.)), 
                   .names = "missing_{.col}"))

data_quality_summary
# A tibble: 1 × 6
  missing_date missing_month.x missing_precipitation missing_temperature
         <int>           <int>                 <int>               <int>
1            0               0                     0                   0
# ℹ 2 more variables: missing_month.y <int>, missing_forage_yield <int>

3.2 where()

When you have many columns and want to apply the same transformation/summarization to all of them (or a subset), across() is the go-to tool.

# Example: summarize ALL numeric columns by month
daily_weather_data %>%
  mutate(month = lubridate::month(date, label = TRUE, abbr = TRUE)) %>%
  group_by(month) %>%
  summarise(
    across(where(is.numeric), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"),
    .groups = "drop"
  )
# A tibble: 7 × 5
  month precipitation_mean precipitation_sd temperature_mean temperature_sd
  <ord>              <dbl>            <dbl>            <dbl>          <dbl>
1 Apr                 12.2             7.44             18.1           5.05
2 May                 14.7             7.78             18.0           5.31
3 Jun                 13.7             6.94             20.8           6.21
4 Jul                 13.6             6.26             18.7           6.57
5 Aug                 12               8.19             18.1           6.03
6 Sep                 12.1             7.71             20.4           5.48
7 Oct                 22              NA                16.7          NA   

4 Missing data with tidyr

4.1 replace_na()

df <- tibble(
  site = c("Elora", "Elora", "Ridgetown", "Ridgetown"),
  year = c(2024, 2025, 2024, 2025),
  yield = c(NA, 8.2, 7.9, NA)
)

df %>%
  tidyr::replace_na(list(yield = 0))
# A tibble: 4 × 3
  site       year yield
  <chr>     <dbl> <dbl>
1 Elora      2024   0  
2 Elora      2025   8.2
3 Ridgetown  2024   7.9
4 Ridgetown  2025   0  

4.2 drop_na()

df %>% tidyr::drop_na(yield)
# A tibble: 2 × 3
  site       year yield
  <chr>     <dbl> <dbl>
1 Elora      2025   8.2
2 Ridgetown  2024   7.9

4.3 fill()

common with messy datasets

df2 <- tibble(
  site = c("Elora", NA, NA, "Ridgetown", NA),
  plot = 1:5,
  value = rnorm(5)
)

df2 %>%
  tidyr::fill(site, .direction = "down")
# A tibble: 5 × 3
  site       plot  value
  <chr>     <int>  <dbl>
1 Elora         1 -1.38 
2 Elora         2  0.854
3 Elora         3 -2.90 
4 Ridgetown     4 -0.637
5 Ridgetown     5  0.734

5 Date Handling with lubridate

Using lubridate, we can extract various date components for analysis.

weather_data_dates <- daily_weather_data %>%
  mutate(
    year_month_day = format(date, "%Y_%m_%d"),
    day_of_year = yday(date),
    day_of_month = mday(date),
    week_of_year = week(date),
    month_name = month(date, label = TRUE, abbr = FALSE)
  )

# Take a look 
glimpse(weather_data_dates)
Rows: 184
Columns: 9
$ date           <date> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-0…
$ month          <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, …
$ precipitation  <dbl> 12, 5, 19, 15, 3, 23, 1, 4, 21, 15, 17, 6, 3, 16, 12, 2…
$ temperature    <dbl> 15.18239, 13.95953, 20.30157, 23.81710, 14.26687, 19.84…
$ year_month_day <chr> "2024_04_01", "2024_04_02", "2024_04_03", "2024_04_04",…
$ day_of_year    <dbl> 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104…
$ day_of_month   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ week_of_year   <dbl> 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,…
$ month_name     <ord> April, April, April, April, April, April, April, April,…

6 Data Cleaning with janitor

6.1 clean_names()

Sometimes, datasets come with inconsistent column names, which can cause issues in analysis.

messy_weather_data <- tibble(
  `Date Recorded` = seq(as.Date("2024-04-01"), as.Date("2024-10-01"), by = "month"),
  `Precipitation (mm)` = c(20, 40, 60, 80, 50, 30, 2),
  `Temperature..(C,)` = c(15, 18, 25, 30, 22, 16, 12)
)

clean_weather_data <- messy_weather_data %>% clean_names()
clean_weather_data
# A tibble: 7 × 3
  date_recorded precipitation_mm temperature_c
  <date>                   <dbl>         <dbl>
1 2024-04-01                  20            15
2 2024-05-01                  40            18
3 2024-06-01                  60            25
4 2024-07-01                  80            30
5 2024-08-01                  50            22
6 2024-09-01                  30            16
7 2024-10-01                   2            12

6.2 get_dupes()

it helps identify duplicate rows in a dataset

duplicate_rows <- full_joined_data %>% get_dupes()
No variable names specified - using all columns.
No duplicate combinations found of: date, month.x, precipitation, temperature, month.y, forage_yield

6.3 compare_df_cols() 🧩

it helps to check if datasets align

janitor::compare_df_cols(daily_weather_data, monthly_weather_data)
          column_name daily_weather_data monthly_weather_data
1     avg_temperature               <NA>              numeric
2                date               Date                 <NA>
3               month             factor               factor
4       precipitation            numeric                 <NA>
5         temperature            numeric                 <NA>
6 total_precipitation               <NA>              numeric
# this one compares by position too
janitor::compare_df_cols_same(daily_weather_data, monthly_weather_data) 
[1] TRUE

6.4 make_clean_names() 🏷

it’s the same idea than clean_names(), but returns a vector, which is useful when you’re renaming in a custom way.

names(df) <- janitor::make_clean_names(names(messy_weather_data))

6.5 tabyl()

Simple frequency table: how many daily records per month?

daily_weather_data %>%
  mutate(month = month(date, label = TRUE, abbr = TRUE)) %>%
  tabyl(month)
 month  n     percent
   Jan  0 0.000000000
   Feb  0 0.000000000
   Mar  0 0.000000000
   Apr 30 0.163043478
   May 31 0.168478261
   Jun 30 0.163043478
   Jul 31 0.168478261
   Aug 31 0.168478261
   Sep 30 0.163043478
   Oct  1 0.005434783
   Nov  0 0.000000000
   Dec  0 0.000000000

6.6 adorn_*() functions

Enhance tabyl outputs with totals, percentages, and formatting.

# 2) Two-way tabyl: month x "rainy day?" (precipitation > 0)
daily_weather_data %>%
  mutate(
    month = month(date, label = TRUE, abbr = TRUE),
    rainy_day = if_else(precipitation > 0, "Yes", "No")
  ) %>%
  tabyl(month, rainy_day) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
 month       No          Yes        Total
   Jan    - (0)      -   (0) 100.0%   (0)
   Feb    - (0)      -   (0) 100.0%   (0)
   Mar    - (0)      -   (0) 100.0%   (0)
   Apr 0.0% (0) 100.0%  (30) 100.0%  (30)
   May 3.2% (1)  96.8%  (30) 100.0%  (31)
   Jun 0.0% (0) 100.0%  (30) 100.0%  (30)
   Jul 0.0% (0) 100.0%  (31) 100.0%  (31)
   Aug 0.0% (0) 100.0%  (31) 100.0%  (31)
   Sep 0.0% (0) 100.0%  (30) 100.0%  (30)
   Oct 0.0% (0) 100.0%   (1) 100.0%   (1)
   Nov    - (0)      -   (0) 100.0%   (0)
   Dec    - (0)      -   (0) 100.0%   (0)
 Total 0.5% (1)  99.5% (183) 100.0% (184)

6.7 tabyl() + case_when()

Create temperature bins (cold/mild/warm) by month

# Create temperature bins (cold/mild/warm) by month
daily_weather_data %>%
  mutate(
    month = month(date, label = TRUE, abbr = TRUE),
    temp_bin = case_when(
      temperature < 0 ~ "Below 0°C",
      temperature < 10 ~ "0–10°C",
      temperature < 20 ~ "10–20°C",
      TRUE ~ "20°C+"
    )
  ) %>%
  tabyl(month, temp_bin) %>%
  adorn_totals(where = c("row", "col")) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
 month      10–20°C      20°C+        Total
   Jan      -   (0)     -  (0) 100.0%   (0)
   Feb      -   (0)     -  (0) 100.0%   (0)
   Mar      -   (0)     -  (0) 100.0%   (0)
   Apr  66.7%  (20) 33.3% (10) 100.0%  (30)
   May  71.0%  (22) 29.0%  (9) 100.0%  (31)
   Jun  36.7%  (11) 63.3% (19) 100.0%  (30)
   Jul  58.1%  (18) 41.9% (13) 100.0%  (31)
   Aug  67.7%  (21) 32.3% (10) 100.0%  (31)
   Sep  50.0%  (15) 50.0% (15) 100.0%  (30)
   Oct 100.0%   (1)  0.0%  (0) 100.0%   (1)
   Nov      -   (0)     -  (0) 100.0%   (0)
   Dec      -   (0)     -  (0) 100.0%   (0)
 Total  58.7% (108) 41.3% (76) 100.0% (184)

7 Final Thoughts and Resources

Data wrangling is a crucial step in data analysis, ensuring datasets are clean, structured, and ready for further exploration. By leveraging dplyr, tidyr, janitor, and lubridate, we can efficiently manage and transform our data to extract meaningful insights.

For further reading and practice, consider the following resources:

Keep practicing and experimenting with different datasets to solidify your understanding. Happy coding!