Reproducible Data Science with R III

R Packages Essentials

Adrian A. Correndo

2025-11-28

What is an R Package? 📦

What’s a Package? 📦

An R package is a collection of R functions, data, and documentation organized in a standardized format.

Key Components of an R Package

Functions: Reusable R code that performs specific tasks.
Data: Example datasets to illustrate package functions or concepts.
Documentation: Help files and vignettes explaining how to use the package.
Namespace: Defines which functions are available to the user.
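
As a rough illustration of these components, the sketch below inspects the stats and datasets packages, which ship with every R installation. The choice of packages and the object name exports are assumptions made only for this example.

```r
# Functions: call a package function without attaching the package
stats::median(c(1, 5, 9))

# Namespace: list what the package makes available to users
exports <- getNamespaceExports("stats")
"median" %in% exports

# Data: example datasets bundled with a package
head(datasets::mtcars, 3)

# Documentation: open the package help index (interactive sessions only)
if (interactive()) help(package = "stats")
```

The pkg::name form is handy for one-off calls; library(pkg) attaches the whole set of exported functions instead.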

Tip

Pro Tip: Packages help simplify code reuse and make complex tasks more accessible.

Why packages? 🚀

Benefits of Using R Packages 💡

Efficiency: Avoid rewriting code for common tasks.
Consistency: Standardized code structure and naming conventions.
Reproducibility: Ensures your work is easier to share and reproduce.

Example 1: mutate()

Packages like dplyr simplify tasks by providing clean, concise code for data manipulation.

1. Create a new column: total, which is the sum of two existing columns (var1 and var2).

Base R version

# Sample data
df <- data.frame(var1 = c(1, 2, 3), var2 = c(4, 5, 6))

# Adding a new column using base R
df$total <- df$var1 + df$var2

dplyr package (Tidyverse)

library(dplyr)

# Using mutate to add a new column
df <- df %>%
  mutate(total = var1 + var2)

Example 2: filter()

2. Filtering: get values of var1 greater than 2.

Base R version

# Filter rows using base R
filtered_df <- df[df$var1 > 2, ]

dplyr package (Tidyverse)

# Filter rows using dplyr
filtered_df <- filter(df, var1 > 2)

Example 3: select()

3. Select specific variables: get var1 and total.

Base R version

# Select columns using base R
selected_df <- df[, c("var1", "total")]

dplyr package (Tidyverse)

# Select columns using dplyr
selected_df <- select(df, var1, total)

Example 4: iteration

Scenario: You have a data frame with a column study and a column data, where data contains nested data frames for each study.
We want to fit a linear model to each nested data frame (predicting y by x) and store the models in a new column called model.

Dataset

library(tibble)
# Sample data frame with nested data
set.seed(123) # make the simulated y values reproducible
df <- tibble(
  study = c("Study 1", "Study 2"),
  data = list(
    data.frame(x = 1:10, y = rnorm(10)),
    data.frame(x = 1:10, y = rnorm(10))
  )
)

Base R version (for loop)

# Initialize an empty list to store models
models <- vector("list", length = nrow(df))

# Using a for loop to fit the model and store results
for (i in seq_len(nrow(df))) {
  models[[i]] <- lm(y ~ x, data = df$data[[i]])
}

# Add the list of models as a new column in the data frame
df$model <- models

purrr package (map() function)

library(dplyr)
library(purrr)

# Using map() to fit the model for each nested data frame
df <- df %>%
  mutate(model = map(data, ~ lm(y ~ x, data = .x)))
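
Once models are stored in a list-column, the same map() pattern can pull a value out of each one. The sketch below is self-contained and uses noise-free toy data (an assumption made here so the slopes are known in advance); the object name toy and the column name slope are hypothetical.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy nested data with known slopes (2 and -1), for illustration only
toy <- tibble(
  study = c("Study 1", "Study 2"),
  data = list(
    data.frame(x = 1:10, y = 2 * (1:10) + 1),
    data.frame(x = 1:10, y = -(1:10) + 3)
  )
)

# Fit one model per study, then extract each slope as a numeric column
toy <- toy %>%
  mutate(
    model = map(data, ~ lm(y ~ x, data = .x)),
    slope = map_dbl(model, ~ coef(.x)[["x"]])
  )

toy$slope
```

map() always returns a list, while map_dbl() returns a plain numeric vector, which is usually easier to work with in a data frame column.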

Where do packages come from ❓

Why is the source important? 📌

Understand the benefits and drawbacks of stable vs. development versions.
Understand the security standards and risks associated with each source.
Learn when to choose each source for your projects.

CRAN: Comprehensive R Archive Network 🗂️

The primary repository for R packages, known for its stability and rigor.

Pros 🌟

High stability and compatibility across systems
Strict quality standards and documentation requirements
Easy installation and dependency management

Cons ⚠️

Slower update cycles due to strict review processes
Limited flexibility for experimental or niche packages
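
Installing from CRAN is typically a single call to install.packages(). The sketch below wraps that call in a small helper, install_if_missing(), which is a hypothetical name introduced here and not part of any package; it assumes a CRAN mirror is configured for the cases where an installation is actually needed.

```r
# Hypothetical helper: install a CRAN package only when it is not yet available
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE)) # already installed, nothing to do
  }
  install.packages(pkg) # downloads from the configured CRAN mirror
  invisible(TRUE)
}

# "stats" ships with R, so this call does not trigger an installation
installed_now <- install_if_missing("stats")
```

Checking first avoids re-downloading packages every time a script runs.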

There are currently more than 23,000 packages on CRAN alone.

CRAN Task Views

Definition: A CRAN Task View is a curated collection of R packages focused on a specific topic or area of research.
Purpose: Organizes packages into categories, making it easier for users to find the right tools for their work.
Topics: Range from “Machine Learning” to “Ecology” to “Time Series Analysis” and beyond.
Benefits:
- Saves time for users by grouping relevant packages.
- Regularly updated by experts in each field.
- Offers a starting point for exploring specialized R tools.

Agriculture Task View

It compiles R packages useful for agricultural research and data analysis.

Content Includes:
- Packages for analyzing crop and livestock data.
- Tools for spatial analysis of agricultural data.
- Resources for agricultural economics and statistical models.
Examples of Included Packages:
- agricolae: Provides tools for analysis of experimental data in agriculture.
- agridat: A collection of agricultural datasets for research and teaching.
Why Use It?
- Simplifies the search for agricultural tools.
- Helps researchers quickly identify relevant resources for their work.
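
Task Views can also be browsed and installed from R itself through the ctv package. The sketch below assumes ctv is installed and an internet connection is available. The run_install switch is a hypothetical guard added here so that nothing is downloaded until it is turned on, since installing a whole view can pull in many packages.

```r
view_name <- "Agriculture" # Task View to work with
run_install <- FALSE       # set to TRUE to actually contact CRAN

if (run_install && requireNamespace("ctv", quietly = TRUE)) {
  ctv::available.views()        # list all Task Views on CRAN
  ctv::install.views(view_name) # install every package in the view
}
```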

Packages from GitHub

A platform that allows developers to share packages with fewer restrictions, encouraging innovation and collaboration.

Pros 🚀

Cutting-edge features and fast updates
Flexible release of experimental tools or features
Community-driven contributions and improvements

Cons ⚠️

Less stability
Potential for bugs and security risks
May require manual dependency management (e.g., build tools or non-CRAN dependencies)
Variable documentation quality
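
A common way to install a development version is remotes::install_github(), which takes an "owner/repository" string. The sketch below assumes the remotes package is installed; tidyverse/dplyr is used only as an example of a public repository, and install_now is a hypothetical guard so the code has no side effects by default.

```r
repo <- "tidyverse/dplyr" # "owner/repository" on GitHub, example only
install_now <- FALSE      # set to TRUE to actually download and build

if (install_now && requireNamespace("remotes", quietly = TRUE)) {
  remotes::install_github(repo) # installs the current development version
}
```

Building from source may need extra tools (for example, Rtools on Windows), which is one reason development versions are less plug-and-play than CRAN releases.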

Bioconductor

A platform designed for bioinformatics and computational biology, providing tools for genomic and biomedical data.

Pros 🚀

Tailored for Bioinformatics: Ideal for handling specialized data
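
Bioconductor packages are usually installed with the BiocManager package rather than install.packages() alone. The sketch below assumes BiocManager is already installed from CRAN; limma is used only as an example package, and install_now is a hypothetical guard so nothing is downloaded by default.

```r
bioc_pkg <- "limma"  # example Bioconductor package
install_now <- FALSE # set to TRUE to actually install

if (install_now && requireNamespace("BiocManager", quietly = TRUE)) {
  BiocManager::install(bioc_pkg) # installs from the Bioconductor repositories
}
```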

THANK YOU!

acorrend@uoguelph.ca
Adrian A. Correndo
Assistant Professor
Sustainable Cropping Systems
Department of Plant Agriculture
University of Guelph
Rm 226, Crop Science Bldg | Department of Plant Agriculture Ontario Agricultural College | University of Guelph | 50 Stone Rd E, Guelph, ON-N1G 2W1, Canada.
