Tidy Mixed Models in R

A simple guide to succeeding at the analysis of common mixed models in agriculture

Author

Adrian Correndo

Published

October 19, 2022

Summary
This workshop provides a workflow to analyze common types of mixed-model data in agriculture: (i) split-plots, and (ii) repeated measures. From exploring the data to creating a summary report with figures, we will cover how to write, test, and select from multiple candidate models at once using tidy principles, packages from the tidyverse|tidymodels framework, and the new Quarto features!

What are Mixed Models?

In simple words, the rationale behind mixed models is the simultaneous presence of components that model the EXPECTED VALUE (fixed effects) as well as the VARIANCE (random effects).

Specifically, today we are going to cover LINEAR MIXED MODELS, which make the following assumptions (Bolker et al. 2022):

  • The expected values of the responses are linear combinations of the fixed predictor variables and the random effects.

  • The conditional distribution of the response variable is Gaussian (equivalently, the errors are “Normal”).

  • The random effects are normally distributed. Typically, we assume that a random effect has an expected value equal to zero, but a positive variance.
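In compact matrix notation (a general sketch, not tied to any particular example), these assumptions can be written as:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\epsilon}, \qquad \mathbf{b} \sim N(\mathbf{0}, \mathbf{G}), \qquad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \mathbf{R}) \]

where \(\mathbf{X}\boldsymbol{\beta}\) carries the fixed effects (the expected value), \(\mathbf{Z}\mathbf{b}\) the random effects, and \(\mathbf{G}\) and \(\mathbf{R}\) are the variance-covariance matrices of the random effects and the errors, respectively.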

Note

We are going to employ the most used packages and/or functions for frequentist LMMs:

  • nlme: nlme::lme() provides REML or ML estimation. Allows multiple nested random effects, and provides structures for modeling heteroscedastic and/or correlated errors (spoiler for the repeated measures analysis). It ships with the standard R distribution as a recommended package.
  • lme4: lme4::lmer() provides REML or ML estimation. Allows multiple nested or crossed random effects, can compute profile confidence intervals, and can conduct parametric bootstrapping.

For more information about mixed models in R, please, check out the new CRAN Task View: Mixed, Multilevel, and Hierarchical Models in R.
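As a quick sketch of the difference in syntax (using placeholder names y, trt, and block on a hypothetical data frame df), the same random-intercept model could be specified with either function:

Show code
# Hypothetical data: response 'y', treatment 'trt', and 'block' in a data frame 'df'
library(nlme)
library(lme4)

# nlme::lme(): the random part goes in a separate 'random' argument
m_nlme <- nlme::lme(y ~ trt, random = ~ 1 | block, data = df, method = "REML")

# lme4::lmer(): the random part is written inside the formula
m_lme4 <- lme4::lmer(y ~ trt + (1 | block), data = df, REML = TRUE)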

Why TIDY?

Well, thanks to the tidyverse, the “tidy data” framework changed the way we code and work in R for data science. Tidy datasets are easy to manipulate, model, and visualize, and have a specific structure (Wickham 2014):

  • Each variable is a column,

  • Each observation is a row, and

  • Each value has its own cell.

Tidy-data structure. Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells. Source: (Wickham and Grolemund 2017).
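As a tiny made-up example, a yield table with one column per year is not tidy (values of the same variable are spread across columns); tidyr::pivot_longer() reshapes it to comply with the three rules:

Show code
library(tidyr)

# Made-up "messy" data: yield values spread across year columns
wide <- data.frame(plot = c("P1", "P2"),
                   `2020` = c(4.2, 5.1),
                   `2021` = c(4.8, 5.6),
                   check.names = FALSE)

# Tidy version: plot, year, and yield are columns; each observation is a row
long <- pivot_longer(wide, cols = c(`2020`, `2021`),
                     names_to = "year", values_to = "yield")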


Outline

After planning the experimental design, identifying the dependent and independent variable(s), conducting the experiment, and collecting the data… the expected path would be as follows:

Show code

flowchart LR
  A[Dig the data] --> B{Model Selection}
  B --> C[Significant\nEffects?]
  subgraph Inference
  C -->|Yes| D[Comparisons]
  end
  


This workflow was created using mermaid.

THE CODE

What do you need?

For a complete experience, I recommend you download and install the latest versions of R, RStudio, and Quarto.

What is Quarto?

We are going to explore the new features offered by Quarto documents (*.qmd).

Quarto is a refined version of (and successor to) R Markdown. It is an open-source scientific and technical publishing system built on Pandoc. It allows combining R, Python, Julia, and JavaScript (Observable JS) for the design of reports, applications, websites, blogs, scientific publications, and more…

00. PACKAGES

Show code
#| echo: true
#| collapse: true

#install.packages("easypackages") # To load and/or install the packs
library(easypackages)
packages("readxl") # To open/save excel files
packages('dplyr', "tidyr","purrr", "forcats", "stringr") # Data wrangling
packages("nlme", "lme4") # Mixed models libraries
packages("broom", "broom.mixed") # Managing models results
packages("performance") # Check assumptions and performance
packages("emmeans","multcomp","multcompView",
         "car", "multilevelmod") # Aov and mult comp
packages("ggplot2") # Figures
packages("agricolae") # Miscellaneous functions

Among many packages we are going to use today, there is one from the tidymodels family that was specially designed to convert statistical objects into tidy format: the broom package (Robinson, Hayes, and Couch 2022). Particularly for mixed models, we will also need its spinoff: the broom.mixed package (Bolker and Robinson 2022) .

The broom package offers three key functions to manipulate model outcomes:

  • glance() to report information about the entire model

  • tidy() to summarize information about model components, and

  • augment() to add information about observations to a dataset
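As a minimal sketch of how they behave (using a plain lm() fit on the built-in mtcars data, not our workshop data):

Show code
library(broom)

# A simple linear model on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)

broom::glance(fit)   # one-row summary of the whole model (R2, AIC, etc.)
broom::tidy(fit)     # one row per model term (estimate, SE, p-value)
broom::augment(fit)  # original data plus fitted values and residuals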

01. SPLIT-PLOT

The split-plot arrangement is frequently used in experiments involving several factors (factorials). Its main characteristic is that the arrangement follows a hierarchy: there are main plots (a.k.a. whole plots) assigned to the levels of one factor, and then “subplots” within each main plot that receive the levels of a second factor, and so on. Therefore, the main plot serves as a “block” for the subplots, and thus we are setting boundaries to (“restricting”) the randomization of the subplot factor.

Important
  • DESIGN: We intentionally call split-plot an “arrangement” because it can fit into multiple “designs” such as the completely randomized (CRD), randomized complete block (RCBD), or Latin-square (LSD) designs.

  • INFERENCE POWER: In a split-plot design the main-plot factor is tested against an error term with fewer degrees of freedom than the subplot error, and thus we have more inference power for the factor at the subplot level (and for the interaction). So consider this before it’s too late!

Create a split-plot design

The agricolae package (Mendiburu 2021) brings a set of very useful functions to generate different experimental designs, among them, the split-plot. Let’s see an example:

Show code
# Example with agricolae

# Define treatment levels
mainplot <- c(0, 50, 80, 110, 140)       # N rates (main plot factor)
subplot <- c("Var_1", "Var_2", "Var_3")  # Varieties (subplot factor)

# Produce
sp_design <- agricolae::design.split(
                        trt1 = mainplot, trt2 = subplot, 
                        design = "rcbd", r = 3, 
                        randomization = TRUE, seed = 4561)
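The object returned by design.split() also carries the randomized layout (the field book, stored in the $book element), which we can inspect or export:

Show code
# Take a look at the randomized field book
head(sp_design$book)

# Optionally, export it to share with field crews (file name is just an example)
# write.csv(sp_design$book, "01_split_plot_fieldbook.csv", row.names = FALSE)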

i. Data

The following is just a fake data set with a split-plot arrangement within an RCBD (BLOCK), where the main plot corresponds to nitrogen rate (NRATE, with 5 levels) and the subplot to wheat variety (VARIETY, with 3 levels).

Show code
# Read csv
#split_plot_data_0 <- read.csv(file = "../data/01_split_plot_data.csv", header = TRUE)

# File online? Read it directly from the GitHub repository:
url_split <- "https://raw.githubusercontent.com/adriancorrendo/tidymixedmodelsweb/master/data/01_split_plot_data.csv"

split_plot_data_0 <- read.csv(url_split)

# Data hierarchy
split_plot_data <- 
  split_plot_data_0 %>% 
  mutate(NRATE = factor(NRATE)) %>% 
  # Identify Main Plot
  #mutate(main = factor(BLOCK:NRATE)) %>% 
  dplyr::select(BLOCK, NRATE, VARIETY, YIELD)

ii. Dig the data

Now, let’s use several functions to explore the data.

a. glimpse()

For example, the glimpse() function from the dplyr package (Wickham et al. 2022) allows us to take a quick look at the columns in our data frame (it’s like a transposed version of print()).

Show code
# Glimpse from dplyr
dplyr::glimpse(split_plot_data)
Rows: 45
Columns: 4
$ BLOCK   <chr> "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B…
$ NRATE   <fct> 0, 0, 0, 50, 50, 50, 80, 80, 80, 110, 110, 110, 140, 140, 140,…
$ VARIETY <chr> "Var_1", "Var_2", "Var_3", "Var_1", "Var_2", "Var_3", "Var_1",…
$ YIELD   <int> 1938, 3946, 4628, 2038, 4346, 5949, 3837, 4287, 6765, 3466, 49…

b. skim()

Then, the skim() function from the skimr package (Waring et al. 2022) allows us to take a deeper look at all the variables (columns), creating a quick summary that reports missing values, completeness, distributions, and more.

Show code
# Skim from skimr
skimr::skim(split_plot_data)
Data summary
Name split_plot_data
Number of rows 45
Number of columns 4
_______________________
Column type frequency:
character 2
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
BLOCK 0 1 2 2 0 3 0
VARIETY 0 1 5 5 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
NRATE 0 1 FALSE 5 0: 9, 50: 9, 80: 9, 110: 9

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
YIELD 0 1 4444.02 1347.72 1938 3389 4458 5440 6950 ▅▆▇▆▅

c. ggplot()

Of course, we shouldn’t miss the chance to use ggplot2 for a better look.

Show code
# Boxplot
split_plot_data %>% 
  # Plot
ggplot() + 
  # Boxplots
  geom_boxplot(aes(x = NRATE, y = YIELD, fill = VARIETY))+
  geom_jitter(aes(x = NRATE, y = YIELD, fill = VARIETY))+
  # Plot by site
  facet_wrap(~VARIETY)+
  scale_y_continuous(limits = c(0,10000), breaks = seq(0,10000, by=2000))+
  # Change theme
  theme_bw()

iii. Models

For the analysis of split-plot designs we basically need to specify an error term that the model will not otherwise see: the MAIN-PLOT ERROR TERM (see Venables and Ripley (2002), pg. 283). By default, the only random term the model will identify is the one at the lowest level of the hierarchy (the residual). However, we need to specify that the main plot serves as a kind of block for the subplots.
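As a sketch (written with an intercept for simplicity, unlike the code below), the split-plot model within an RCBD can be expressed as:

\[ y_{ijk} = \mu + \rho_k + \alpha_i + \delta_{ik} + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \]

\[ \delta_{ik} \sim N(0, \sigma^2_{m}), \qquad \epsilon_{ijk} \sim N(0, \sigma^2_{e}) \]

where \(\rho_k\) is the block effect, \(\alpha_i\) the main-plot factor (NRATE), \(\delta_{ik}\) the main-plot error (BLOCK:NRATE, the term we must declare), \(\beta_j\) the subplot factor (VARIETY), \((\alpha\beta)_{ij}\) their interaction, and \(\epsilon_{ijk}\) the subplot error. In the code below, BLOCK itself is also treated as random.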

a. nlme::lme

Show code
# Model without split component
no_split <- nlme::lme(# Response variable
                 YIELD ~
                   # Fixed
                   0 + NRATE*VARIETY,
                   # Random intercept for BLOCK only (no main-plot error term)
                   random = ~1|BLOCK, 
                   # Data
                   data = split_plot_data,
                   # Method
                   method = "REML")

# Model with split component
split_nlme <- nlme::lme(# Response variable
                 YIELD ~
                   # Fixed (Removing intercept? Why?)
                   0 + NRATE*VARIETY,
                   # Random error of MAINPLOT (NRATE nested in BLOCK)
                   random = ~1|BLOCK/NRATE, 
                   # Data
                   data = split_plot_data,
                   # Method
                   method = "REML")

# Type 3 (when interaction is present)
car::Anova(split_nlme, type = 3)
Analysis of Deviance Table (Type III tests)

Response: YIELD
                Chisq Df Pr(>Chisq)    
NRATE         559.646  5  < 2.2e-16 ***
VARIETY        22.128  2  1.567e-05 ***
NRATE:VARIETY  18.117  8    0.02036 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
# Let's see the difference between models in terms of DFs
#summary(no_split)
#summary(split_nlme)
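One quick way to see that difference, without wading through the full summaries, is to print the default ANOVA tables of both fits: anova() on an lme object reports the numerator and denominator degrees of freedom (numDF, denDF) used for each term.

Show code
# NRATE is tested against the main-plot error in the split model,
# so its denominator DF drops relative to the no-split model
anova(no_split)
anova(split_nlme)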

b. lme4::lmer

Show code
split_lme4 <- lme4::lmer(# Response variable
                   YIELD ~ 
                   # Fixed (Removing intercept? Why?)
                   0+NRATE*VARIETY +
                   # Random
                   (1|BLOCK/NRATE), 
                   # Data
                   data=split_plot_data)

# Type 3
Anova(split_lme4, type = 3)
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: YIELD
                Chisq Df Pr(>Chisq)    
NRATE         559.646  5  < 2.2e-16 ***
VARIETY        22.128  2  1.567e-05 ***
NRATE:VARIETY  18.117  8    0.02036 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

iv. Check assumptions

Show code
# Single tests
performance::check_normality(split_lme4)
OK: residuals appear as normally distributed (p = 0.910).
Show code
performance::check_homogeneity(split_lme4)
OK: There is not clear evidence for different variances across groups (Bartlett Test, p = 0.396).
Show code
performance::check_autocorrelation(split_lme4)
OK: Residuals appear to be independent and not autocorrelated (p = 0.410).
Show code
performance::check_outliers(split_lme4)
OK: No outliers detected.
Show code
performance::check_collinearity(split_lme4)
# Check for Multicollinearity

High Correlation

          Term     VIF         VIF 95% CI Increased SE Tolerance
         NRATE  243.00 [ 168.92,  349.77]        15.59  4.12e-03
       VARIETY   75.00 [  52.26,  107.83]         8.66      0.01
 NRATE:VARIETY 2025.00 [1406.33, 2916.02]        45.00  4.94e-04
 Tolerance 95% CI
     [0.00, 0.01]
     [0.01, 0.02]
     [0.00, 0.00]
Show code
# Check Plots
performance::check_model(split_lme4)

v. Comparisons

Show code
# Estimate the comparisons (pairwise)
split_lm4_comparisons <-  
  split_lme4 %>% 
  # ~ specifies the level of comparison (marginal or interaction)
  # Since interaction was significant we specify ~ Interaction (Factor1*Factor2)
  emmeans(., ~ NRATE*VARIETY)

# Add letters
split_lm4_comparisons %>% 
  # Compact Letters Display (cld)
  cld(., 
      # Specify grouped comparisons by...
      #by = "NRATE", 
      # Order
      decreasing = TRUE, details=FALSE, reversed=TRUE, 
      # Specs
      alpha=0.05,  adjust = "tukey", Letters=LETTERS)
 NRATE VARIETY emmean  SE df lower.CL upper.CL .group  
 80    Var_3     6587 301 30     5630     7544  A      
 110   Var_3     6466 301 30     5509     7423  AB     
 50    Var_3     5904 301 30     4947     6861  ABC    
 140   Var_3     5359 301 30     4402     6316  ABCD   
 140   Var_2     5311 301 30     4354     6268  ABCD   
 110   Var_2     4948 301 30     3991     5905   BCDE  
 80    Var_2     4611 301 30     3654     5568    CDEF 
 0     Var_3     4501 301 30     3544     5458    CDEF 
 50    Var_2     4039 301 30     3082     4996     DEFG
 80    Var_1     3858 301 30     2901     4815     DEFG
 110   Var_1     3467 301 30     2510     4424      EFG
 0     Var_2     3186 301 30     2229     4143       FG
 140   Var_1     3101 301 30     2144     4058       FG
 50    Var_1     2787 301 30     1830     3744        G
 0     Var_1     2536 301 30     1579     3493        G

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 15 estimates 
P value adjustment: tukey method for comparing a family of 15 estimates 
significance level used: alpha = 0.05 
NOTE: Compact letter displays can be misleading
      because they show NON-findings rather than findings.
      Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead. 
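Following that note, we can also inspect the pairwise contrasts directly: pairs() on the emmeans object returns the Tukey-adjusted comparisons, while pwpm() and pwpp() offer matrix and plot views of the same p-values.

Show code
# Pairwise contrasts (Tukey adjustment is the default for emmGrid objects)
pairs(split_lm4_comparisons)

# Matrix / plot displays of the pairwise p-values
# pwpm(split_lm4_comparisons)
# pwpp(split_lm4_comparisons)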

02. REPEATED MEASURES

Now, we are going to reproduce the analysis I did for one of my papers (Correndo et al. 2021). In particular, we are going to reproduce Figure 2.

For this paper, we have data from 4 different locations. We tested levels of soil potassium fertility, hereinafter soil-test K (STK), in long-term experiments (2000-2009) where the treatments of interest were: (i) Control (unfertilized), (ii) NPS (fertilized with N, P, and S), and (iii) Pristine conditions (no agricultural history).

At each plot/sample, STK was measured at five consecutive soil depths (0-20, 20-40, 40-60, 60-80, and 80-100 cm). Thus, we took “repeated measurements” over space.

We were NOT interested in comparing locations since they had very different previous histories and crop rotations, so confounding effects may have obscured the inference. Therefore, site was not a factor under consideration, and all the analyses were fitted independently by site.

i. Data

Show code
# Read file
rm_data <- 
  read_excel("../data/02_repeated_measures_data.xlsx", col_names = TRUE) %>% 
  # Create PLOT column to identify subject (Exp. Unit for Rep. Measures)
  unite(PLOT, BLOCK,TREAT, sep = "_", remove=FALSE) %>%
  # OR
  # Identify Subplot
  ungroup() %>% 
  group_by(BLOCK, TREAT) %>% 
  # Create plot ID # Needed for Repeated Measures
  mutate(plot = cur_group_id(), .after = PLOT) %>% 
  ungroup() %>% 
  mutate(DEPTH = as.factor(DEPTH),
         depth = as.integer(DEPTH), # Needed for CorAR1
         BLOCK = factor(BLOCK),
         SITE = factor(SITE),
         TREAT = factor(TREAT),
         # Create a grouping variable (WHY?) # Needed for HetVar
         GROUP = case_when(TREAT == "Pristine" ~ "Pristine",
                           TRUE ~ "Agriculture")
         )

# File online? Try this...(remove "#")
# url_rm <- "https://raw.githubusercontent.com/adriancorrendo/tidymixedmodelsweb/master/data/02_split_plot_data.csv"

#rm_data <- read_excel(url_rm, col_names = TRUE)

ii. Dig the data

Now, let’s use several functions to explore the data.

a. glimpse()

First, the glimpse() function from dplyr

Show code
# Glimpse from dplyr
dplyr::glimpse(rm_data)
Rows: 180
Columns: 9
$ SITE  <fct> site_1, site_1, site_1, site_1, site_1, site_1, site_1, site_1, …
$ PLOT  <chr> "I_Control", "I_Control", "I_Control", "I_Control", "I_Control",…
$ plot  <int> 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 5, 5…
$ TREAT <fct> Control, Control, Control, Control, Control, Control, Control, C…
$ BLOCK <fct> I, I, I, I, I, II, II, II, II, II, III, III, III, III, III, I, I…
$ DEPTH <fct> 10, 30, 50, 70, 90, 10, 30, 50, 70, 90, 10, 30, 50, 70, 90, 10, …
$ STK   <dbl> 105, 83, 103, 110, 127, 119, 98, 106, 107, 109, 132, 100, 96, 10…
$ depth <int> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2…
$ GROUP <chr> "Agriculture", "Agriculture", "Agriculture", "Agriculture", "Agr…

b. skim()

Then, the skim() function from skimr

Show code
# Skim from skimr
skimr::skim(rm_data)
Data summary
Name rm_data
Number of rows 180
Number of columns 9
_______________________
Column type frequency:
character 2
factor 4
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PLOT 0 1 5 12 0 9 0
GROUP 0 1 8 11 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SITE 0 1 FALSE 4 sit: 45, sit: 45, sit: 45, sit: 45
TREAT 0 1 FALSE 3 Con: 60, NPS: 60, Pri: 60
BLOCK 0 1 FALSE 3 I: 60, II: 60, III: 60
DEPTH 0 1 FALSE 5 10: 36, 30: 36, 50: 36, 70: 36

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
plot 0 1 5.00 2.59 1 3.00 5 7.00 9 ▇▇▃▇▇
STK 0 1 181.84 63.09 74 138.75 174 221.25 406 ▅▇▅▂▁
depth 0 1 3.00 1.42 1 2.00 3 4.00 5 ▇▇▇▇▇

c. ggplot()

And let’s use ggplot2 for a better look

Show code
# Boxplot
rm_data %>% 
  dplyr::select(-depth) %>% 
  # Plot
ggplot() + 
  # Boxplots
  geom_boxplot(aes(x = reorder(DEPTH, desc(DEPTH)), y = STK, fill = TREAT))+
  # Axis labels
  labs(x = "Soil depth (cm)", y = "STK (g/m2)")+
  # Plot by site
  facet_wrap(~SITE)+
  # Flip axes
  coord_flip()+
  # Set scale type
  scale_x_discrete()+
  # Change theme
  tidybayes::theme_tidybayes()

iii. Candidate Models

I’m sorry for this, but the most important step is ALWAYS to write down the model.

a. Formulae

m0. Block Fixed

In a traditional approach, blocks are defined as fixed, affecting the mean (expected value). Yet there is no consensus about treating blocks as fixed or as random. For more information, read Dixon (2016).

Let’s define the model. For simplicity (and to avoid writing interaction terms), here we are going to consider that \(\tau_i\) is the “treatment”.

\[ y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij} \]

\[ \epsilon_{ij} \sim N(0, \sigma^2_{e}) \] where \(\mu\) represents the overall mean (if an intercept is used), \(\tau_i\) is the effect of treatment i over \(\mu\), \(\beta_j\) is the effect of block j over \(\mu\), and \(\epsilon_{ij}\) is the random error of each experimental unit.

Show code
# SIMPLEST MODEL
fit_block_fixed <- function(x){
  lm(# Response variable
     STK ~ 
       # Fixed
       TREAT + DEPTH + TREAT:DEPTH + BLOCK,
     # Data
     data = x)
  }

m1. Block Random

An alternative approach is to consider a MIXED MODEL, where blocks are treated as “random”. Basically, we add a term to the model that is expected to show a “null” overall effect on the mean of the variable of interest but introduces “noise”. By convention, a random effect is assumed to have an expected value equal to zero but a positive variance, as follows: \[ y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij} \] \[ \beta_j \sim N(0, \sigma^2_{b} )\] \[ \epsilon_{ij} \sim N(0, \sigma^2_{e} )\] As before, \(\mu\) represents the overall mean (if an intercept is used), \(\tau_i\) is the effect of treatment i over \(\mu\), \(\beta_j\) is the “random” effect of block j over \(\mu\), and \(\epsilon_{ij}\) is the random error of each experimental unit.

So what’s the difference? Simply specifying this component: \[ \beta_j \sim N(0, \sigma^2_b) \], which serves to model the variance.

How do we write that?

Show code
# RANDOM BLOCK
fit_block_random <- function(x){
  nlme::lme(
    # Fixed
    STK ~ TREAT + DEPTH + TREAT:DEPTH,
    # Random
    random = ~1|BLOCK,
    # Data
    data = x)
  }

Models w/ correlated ERRORS

Up to this point it all sounds easy. However, we are (potentially) missing a key component. All measurements across DEPTH were taken from the same “subjects” (experimental units/plots), so we do have “repeated measures” over space. Thus, it is highly likely that we need to adjust the error correlation and covariance structure. Let’s explore some options…

m2. m1 + CompSymm

Compound symmetry is the simplest covariance structure, where we include within-subject correlated errors. It is basically the same thing we do when including BLOCKS as random: we are telling the model that the observations taken on the same subject (plot), across depths, “share” something in common (a constant correlation among their errors).

Show code
# RANDOM BLOCK w/compound symmetry error correlation structure
fit_corsymm <- function(x){
  lme(# Response Variable
      STK ~
        # Fixed
        TREAT + DEPTH + TREAT:DEPTH,
        # Random
        random = ~1|BLOCK,
        # Identify subject where repeated measure happens
        # Plots nested within blocks.
        correlation = corCompSymm(form = ~ DEPTH |BLOCK/PLOT), 
     # Data   
     data=x) }

m3. m1 + CorAR1

The autoregressive structure of first order (corAR1) considers correlations that depend on the “distance”. Thus, the error correlation is expected to be highest between adjacent depths (e.g. 0-20 and 20-40 cm) and to decrease systematically with distance. For example, the correlation between depth 1 and depth 2 would be \(\rho^{depth_2-depth_1}\), decreasing progressively until the correlation between depth 5 and depth 1, which equals \(\rho^{depth_5-depth_1}\).
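For our five equally spaced depths this implies, as a sketch, a within-plot correlation matrix of the form:

\[ \mathbf{R} = \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 & \rho^4 \\ \rho & 1 & \rho & \rho^2 & \rho^3 \\ \rho^2 & \rho & 1 & \rho & \rho^2 \\ \rho^3 & \rho^2 & \rho & 1 & \rho \\ \rho^4 & \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}, \qquad |\rho| < 1 \]

so the correlation decays as \(\rho^{|d_1 - d_2|}\) with the distance between depth positions, at the cost of a single extra parameter.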

Danger

An important detail here is that the corAR1 structure is only applicable to evenly spaced intervals!

Show code
# RANDOM BLOCK w/ auto-regressive of 1st order as error correlation structure
fit_ar1 <- function(x){lme(STK ~ TREAT + DEPTH + TREAT:DEPTH,
                       random = ~1|BLOCK,
                       correlation=corAR1(form=~depth|BLOCK/PLOT),
                       data=x)}

m4. m2 + HetVar

Did you know that we can “relax” the assumption of homogeneity of variance? Oftentimes we have data that show different variability depending on the level of a given factor or variable.

In the STK dataset, we observed that the “Pristine” treatment (the condition with no agricultural history) presents higher variability than the Control and NPS treatments, probably linked to its higher STK values. Variance is modeled by adding a “weight”. This “weight” could be a function of a continuous variable (e.g. fertilizer rate?) or, as in our case, of a “categorical” variable.
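In symbols, what varIdent() adds below is simply a residual variance per group (a sketch of the heterogeneous-variance component):

\[ \epsilon_{ij} \sim N(0, \sigma^2_{g}), \qquad g \in \{\text{Agriculture}, \text{Pristine}\} \]

that is, one residual standard deviation is estimated for the agricultural plots (Control and NPS) and a separate one for the pristine plots.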

Show code
# RANDOM BLOCK w/compound symmetry error correlation structure + Heterogeneous Variance
fit_corsymm_hetvar <- function(x){
  lme(# Response variable
      STK ~ 
        # Fixed
        TREAT + DEPTH + TREAT:DEPTH,
        # Random  
        random = ~1|BLOCK,
        # Correlation
        correlation = corCompSymm(form = ~ depth |BLOCK/PLOT),
        # Variance
        weights = varComb(varIdent(form= ~1|GROUP)),
      # Data
      data=x) }

b. Fit

Run the candidate models

Show code
STK_models <- 
  rm_data %>% 
  # Let's group data to run multiple locations|datasets at once
  group_by(SITE) %>% 
  # Store the data per location using nested arrangement
  nest() %>% 
  # BLOCK as FIXED 
  mutate(model_0 = map(data, fit_block_fixed)) %>% 
  # BLOCK as RANDOM
  mutate(model_1 = map(data, fit_block_random)) %>% 
  # COMPOUND SYMMETRY
  mutate(model_2 = map(data, fit_corsymm)) %>% 
  # AUTO-REGRESSIVE ORDER 1
  mutate(model_3 = map(data, fit_ar1)) %>% 
  # COMPOUND SYMMETRY + HETEROSKEDASTIC
  mutate(model_4 = map(data,  fit_corsymm_hetvar) ) %>%
    
  # Data wrangling
  pivot_longer(cols = c(model_0:model_4), # show alternative 'contains' model
               names_to = "model_id",
               values_to = "model") %>% 
  # Map over model column
  mutate(results = map(model, broom.mixed::augment )) %>% 
  # Performance
  mutate(performance = map(model, broom.mixed::glance )) %>% 
  # Extract AIC
  mutate(AIC = map(performance, ~.x$AIC)) %>% 
  # Extract coefficients
  mutate(coef = map(model, ~coef(.x))) %>% 
  # Visual-check plots
  mutate(checks = map(model, ~performance::check_model(.))) %>% 
  ungroup()
Could not compute standard errors from random effects for diagnostic plot.
Homogeneity of variance could not be computed. Cannot extract residual variance from objects of class 'lme'.
(The two messages above are printed by performance::check_model() once per nlme::lme fit, so they repeat for every site-by-model combination.)

c. Check

Checking assumptions is always important. To learn more about data exploration, tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, among others, I highly recommend reading Zuur, Ieno, and Elphick (2010).

Show code
# Extracting by site
site_1_models <- STK_models %>% dplyr::filter(SITE == "site_1")
site_2_models <- STK_models %>% dplyr::filter(SITE == "site_3")
site_3_models <- STK_models %>% dplyr::filter(SITE == "site_4")
site_4_models <- STK_models %>% dplyr::filter(SITE == "site_2")
Show code
(site_1_models %>% dplyr::filter(model_id == "model_0"))$checks[[1]]

Show code
(site_2_models %>% dplyr::filter(model_id == "model_0"))$checks[[1]]

Show code
(site_3_models %>% dplyr::filter(model_id == "model_0"))$checks[[1]]

Show code
(site_4_models %>% dplyr::filter(model_id == "model_0"))$checks[[1]]

d. Selection

Compare models performance

Show code
# Visual model selection
best_STK_models <- 
  STK_models %>% 
  group_by(SITE) %>% 
  # Use case_when to identify the best model
  mutate(best_model = 
           case_when(AIC == min(as.numeric(AIC)) ~ "Yes",
                     TRUE ~ "No")) %>% 
  ungroup()

# Plot
best_STK_models %>% 
  ggplot()+
  geom_point(aes(x = model_id, y = as.numeric(AIC), 
                 color = best_model, shape = best_model), 
             size = 3)+
  facet_wrap(~SITE)

Show code
# Final models
selected_models <- best_STK_models %>% dplyr::filter(best_model == "Yes")

e. ANOVA

Estimate the effects of factors under study (and their interaction)

Show code
models_effects <- 
  selected_models %>%
  # Type 3 Sum of Squares (Partial SS, when interactions are present)
  mutate(ANOVA = map(model, ~Anova(., type = 3)) )

# Extract ANOVAS
models_effects$ANOVA[[1]]
Analysis of Deviance Table (Type III tests)

Response: STK
              Chisq Df Pr(>Chisq)    
(Intercept) 326.261  1  < 2.2e-16 ***
TREAT        39.994  2  2.067e-09 ***
DEPTH        12.637  4   0.013191 *  
TREAT:DEPTH  21.827  8   0.005247 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
models_effects$ANOVA[[2]]
Analysis of Deviance Table (Type III tests)

Response: STK
              Chisq Df Pr(>Chisq)    
(Intercept) 794.721  1  < 2.2e-16 ***
TREAT        21.394  2  2.261e-05 ***
DEPTH        67.224  4  8.743e-14 ***
TREAT:DEPTH  52.957  8  1.099e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

iv. Means comparison

Show code
# MULTCOMPARISON
# emmeans and cld multcomp
# We need to specify the term (here the significant interaction) over which to perform the comparisons
mult_comp <- 
  models_effects %>% 
  # Comparisons estimates (emmeans)
  mutate(mc_estimates = map(model, ~emmeans(., ~ TREAT*DEPTH))) %>% 
  # Assign letters and p-value adjustment (multcomp)
  mutate(mc_letters = 
           map(mc_estimates, 
               ~as.data.frame( 
                 # By specifies a strata or level to assign the letters
                 cld(., by = "DEPTH", decreasing = TRUE, details=FALSE,
                     reversed=TRUE, alpha=0.05,  adjust = "tukey", Letters=LETTERS))))

v. Plot

Now, we are going to reproduce Figure 2 of the paper.

Show code
# Create data frame for plot
plot_df <- mult_comp %>% 
  dplyr::select(SITE, mc_letters) %>% 
  unnest(mc_letters)

# Define your own colors
my_colors <- c("#ffaa00", "#7E5AA0", "#5c9c8c")

# Create the plot
STK_plot <-
  plot_df %>% 
  # We need to re-express DEPTH from factor to character, and then to numeric
  mutate(DEPTH = as.numeric(as.character(DEPTH)))  %>% 
  # Re-order levels of the factors
  mutate(TREAT = fct_relevel(TREAT,"Control", "NPS", "Pristine")) %>% 
  mutate(SITE = fct_relevel(SITE,"site_1", "site_2", "site_3", "site_4")) %>% 
  # Create plot
  ggplot()+
  # 01. LAYOUT
  ## Subplots
  facet_wrap(~SITE, nrow = 2)+
  ## Axis titles
  labs(x = "Soil depth (cm)", y = bquote(~NH[4]*'OAc-K (g' ~m^-2*')'))+
  # 02. GEOMETRIES
  ## i. Points
  geom_point(aes(x = DEPTH, y = emmean,
                 fill= TREAT,
                 shape = TREAT),
             size = 3, col = "black")+
  ## Adjust shape aesthetics
  scale_shape_manual(name="Fertilizer Scenario", values=c(24,23,21),
                     guide="legend")+
  scale_colour_manual(name="Fertilizer Scenario",
                    values = my_colors,
                    guide='legend')+
  scale_fill_manual(name="Fertilizer Scenario",
                    values = my_colors,
                    guide='legend')+
  ## ii. Add error bar
  geom_errorbar(width = 0.25, aes(x = DEPTH, color = TREAT, 
                                 ymin = emmean-2*SE, ymax = emmean+2*SE))+
  ## iii. Add line
  geom_line(size = 0.5,aes(x = DEPTH, y = emmean, color = TREAT))+
  # 03. ADJUST XY AXIS
  ## Reverse the scale
  scale_x_reverse(breaks=seq(0, 100, 20), limits = c(100,0))+
  coord_flip()+
  # 04. THEME
  theme_bw()+
  theme(strip.text = element_text(size = rel(1.25)),
        strip.background = element_blank(),
        # Grid
        panel.grid = element_blank(),
        # Axis
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25), color = "black"),
        # Legend
        legend.position = "top", legend.title = element_blank(),
        legend.text = element_text(size = rel(1.25))        )

i. Figure with caption

Show code
STK_plot

Figure 2. Soil profiles of STK (\(g~m^{-2}\)) under three different conditions: pristine soils (green circles), under grain cropping from 2000 to 2009 with no fertilizers added (Control, orange triangles), and under grain cropping from 2000 to 2009 with N, P, plus S fertilization (NPS, purple diamonds). Overlapping error bars indicate absence of significant differences between scenarios by soil depths combinations (Tukey’s HSD, p < 0.05).

03. ADDING REFERENCES

i. Citations

Adding references with Quarto has become quite easy. Let’s see…

This is a citation of the Tidyverse (Wickham et al. 2019) package.

  • Using a “*.bib” file: Figures were produced using the ggplot2 package (Wickham 2016). Model assumptions were checked with the performance package (Lüdecke et al. 2021).

  • Using Footnotes: For example, this is some text with a Footnote1, this is a second Footnote2

  • Using the visual editor: this option introduced by Quarto is simply awesome! 🤩. Let’s see: Insert -> Citation, or “Ctrl + Shift + F8”. With this option we can look for citations online via DOI, Crossref, etc., and insert them into our document (and into our *.bib file).

BIBLIOGRAPHY

Bolker, Ben, Julia Piaskowski, Emi Tanaka, Phillip Alday, and Wolfgang Viechtbauer. 2022. CRAN Task View: Mixed, Multilevel, and Hierarchical Models in R. Version 2022-10-18. https://cran.r-project.org/view=MixedModels.
Bolker, Ben, and David Robinson. 2022. “Broom.mixed: Tidying Methods for Mixed Models.” https://CRAN.R-project.org/package=broom.mixed.
Correndo, Adrian A., Gerardo Rubio, Fernando O. García, and Ignacio A. Ciampitti. 2021. “Subsoil-Potassium Depletion Accounts for the Nutrient Budget in High-Potassium Agricultural Soils.” Scientific Reports 11 (1). https://doi.org/10.1038/s41598-021-90297-1.
Dixon, Philip. 2016. “Should Blocks Be Fixed or Random?” Conference on Applied Statistics in Agriculture, May 1-3, Kansas State University, 23–39. https://doi.org/10.4148/2475-7772.1474.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, and Dominique Makowski. 2021. “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software 6 (60): 3139. https://doi.org/10.21105/joss.03139.
Mendiburu, Felipe de. 2021. “Agricolae: Statistical Procedures for Agricultural Research.” https://CRAN.R-project.org/package=agricolae.
Robinson, David, Alex Hayes, and Simon Couch. 2022. “Broom: Convert Statistical Objects into Tidy Tibbles.” https://CRAN.R-project.org/package=broom.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. “Skimr: Compact and Flexible Summaries of Data.” https://CRAN.R-project.org/package=skimr.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. “Dplyr: A Grammar of Data Manipulation.” https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. Paperback; O’Reilly Media. http://r4ds.had.co.nz/.
Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14. https://doi.org/10.1111/j.2041-210X.2009.00001.x.

Footnotes

  1. Citation for Footnote 1

  2. Citation for Footnote 2