Regression Trees in R

Author

Dr. Adrian Correndo

Published

March 25, 2026

Abstract
In this lesson, we introduce regression trees for predicting two agronomic responses: EONR and yield at EONR. We use two common R implementations: rpart() and ctree(). Both methods build a tree by recursively splitting the predictor space into more homogeneous groups, but they do not choose splits in exactly the same way, so their final trees can differ.

1 🌳 Overview

In this lesson, we introduce regression trees for predicting two agronomic responses:

  • yield_eonr_Mg_ha
  • eonr_kgha

We will use two common R implementations:

  • rpart() from rpart
  • ctree() from partykit

Both methods build a tree by recursively splitting the predictor space into more homogeneous groups. However, they do not choose splits in exactly the same way, so their final trees can differ.

🌽 What is the goal here?

A regression tree predicts a continuous response by recursively splitting the data into smaller and more homogeneous groups.

In agronomy, this is useful when responses are driven by:

  • thresholds
  • nonlinear responses
  • interactions among soil, weather, and management

In this lesson, we use trees to predict EONR and yield at EONR.
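To make the threshold idea concrete, here is a minimal sketch on toy data (not the lesson dataset): the response jumps at x = 4, a pattern that a straight line smooths over but that a tree captures with a single split.

```r
# Toy illustration: a step-shaped response that a tree recovers
# with one split near the true threshold (x = 4)
library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- ifelse(toy$x < 4, 2, 8) + rnorm(200, sd = 0.5)

fit_toy <- rpart(y ~ x, data = toy, method = "anova")

# Split point of the first (primary) split: close to 4
fit_toy$splits[1, "index"]
```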

3 References used in this lesson

For prediction-performance metrics, this lesson uses the metrica package and its regression-metrics vignette.

  • Correndo, A. A., Moro Rosso, L. H., Ciampitti, I. A., Agostinho de Souza, M. L., & Pearson, J. W. (2022). metrica: an R package to evaluate prediction performance of regression and classification point-forecast models. Journal of Open Source Software, 7(79), 4655. https://doi.org/10.21105/joss.04655
  • Metrica vignette on available regression metrics: https://adriancorrendo.github.io/metrica/articles/available_metrics_regression.html

4 Learning goals

By the end of this lesson, you should be able to:

  1. Explain what a regression tree is.
  2. Fit a regression tree in R with rpart().
  3. Fit a regression tree in R with ctree().
  4. Interpret the tree structure agronomically.
  5. Compare predictions from tree models using a test set.
  6. Recognize the main strengths and limitations of a single regression tree.

5 📦 Load packages

Show code
library(pacman)
p_load(
  dplyr, # data manipulation
  ggplot2, # data visualization
  tidyr, # data tidying
  forcats, # categorical variable manipulation
  metrica, # prediction performance metrics
  rpart, # regression trees
  rpart.plot, # tree plotting for rpart
  partykit # conditional regression trees
)

6 🌽 Load or generate the dataset

This lesson assumes you already created the synthetic dataset corn_eonr.

If it is not in your environment, source the script that generates it.

Show code
source("synthetic_corn_eonr_dataset.R")
# A tibble: 1 Γ— 7
      n yield_min yield_mean yield_max eonr_min eonr_mean eonr_max
  <int>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1  3000       5.5       7.03      12.7        0      29.6     142.
# A tibble: 3 Γ— 7
  previous_crop     n mean_yield mean_eonr p10_eonr p50_eonr p90_eonr
  <fct>         <int>      <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
1 alfalfa         374       7.22   0.00445      0        0        0  
2 soybean        1565       7.01  18.1          0       12.8     46.1
3 corn           1061       7.01  57.1         27.2     55.7     89.8
Show code
glimpse(corn_eonr)
Rows: 3,000
Columns: 24
$ crop                   <fct> corn, corn, corn, corn, corn, corn, corn, corn,…
$ site                   <fct> S01, S01, S01, S01, S01, S01, S01, S01, S01, S0…
$ year                   <int> 2001, 2001, 2001, 2002, 2002, 2002, 2003, 2003,…
$ topo_pos               <fct> high, medium, low, high, medium, low, high, med…
$ previous_crop          <fct> soybean, soybean, soybean, corn, alfalfa, corn,…
$ soil_texture           <fct> medium, medium, medium, medium, medium, medium,…
$ soil_depth_cm          <dbl> 73.95485, 76.54538, 94.83163, 74.77698, 70.0491…
$ whc_mm                 <dbl> 114.4738, 123.4785, 142.7356, 100.2769, 127.160…
$ om_pct                 <dbl> 3.991988, 3.967986, 4.199946, 3.229937, 3.76092…
$ ph                     <dbl> 6.440524, 6.150376, 6.210251, 6.187937, 6.38307…
$ drainage_class         <fct> moderate, moderate, moderate, moderate, moderat…
$ slope_pct              <dbl> 9.358450, 6.496899, 6.761809, 6.584873, 6.46126…
$ planting_doy           <dbl> 116, 108, 115, 126, 119, 127, 123, 124, 123, 12…
$ spring_rain_mm         <dbl> 154.5662, 106.9352, 160.4594, 259.3582, 282.665…
$ summer_rain_mm         <dbl> 420.3239, 338.8147, 384.4102, 242.3768, 230.724…
$ gdd                    <dbl> 1392.796, 1397.189, 1333.349, 1316.239, 1327.53…
$ heat_stress_days       <dbl> 11, 12, 13, 7, 6, 4, 9, 8, 9, 11, 9, 11, 8, 9, …
$ water_deficit_mm       <dbl> 44.49274329, 39.85067403, 16.53869214, 46.77271…
$ excess_water_index     <dbl> 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.…
$ spring_nitrate_ppm     <dbl> 8.782210, 7.871453, 8.997586, 4.746425, 13.0066…
$ n_mineralization_index <dbl> 5.218859, 4.358232, 6.086249, 3.983716, 5.86645…
$ n_loss_risk_index      <dbl> 1.976456, 1.741376, 2.254018, 2.991177, 2.60394…
$ yield_eonr_Mg_ha       <dbl> 6.205054, 6.772010, 5.500000, 5.500000, 5.50000…
$ eonr_kgha              <dbl> 0.000000, 7.205527, 0.000000, 46.381194, 0.0000…

7 🗂️ The data

Each row represents one combination of:

  • site
  • year
  • topographic position

The current crop is always corn, but previous crop varies among:

  • alfalfa
  • soybean
  • corn

We will use the following predictors:

Show code
predictor_names <- c(
  "topo_pos", "previous_crop", "soil_texture", "soil_depth_cm", "whc_mm",
  "om_pct", "ph", "drainage_class", "slope_pct", "planting_doy",
  "spring_rain_mm", "summer_rain_mm", "gdd", "heat_stress_days",
  "water_deficit_mm", "excess_water_index", "spring_nitrate_ppm",
  "n_mineralization_index", "n_loss_risk_index"
)

corn_eonr %>%
  select(all_of(c(predictor_names, "yield_eonr_Mg_ha", "eonr_kgha"))) %>%
  glimpse()
Rows: 3,000
Columns: 21
$ topo_pos               <fct> high, medium, low, high, medium, low, high, med…
$ previous_crop          <fct> soybean, soybean, soybean, corn, alfalfa, corn,…
$ soil_texture           <fct> medium, medium, medium, medium, medium, medium,…
$ soil_depth_cm          <dbl> 73.95485, 76.54538, 94.83163, 74.77698, 70.0491…
$ whc_mm                 <dbl> 114.4738, 123.4785, 142.7356, 100.2769, 127.160…
$ om_pct                 <dbl> 3.991988, 3.967986, 4.199946, 3.229937, 3.76092…
$ ph                     <dbl> 6.440524, 6.150376, 6.210251, 6.187937, 6.38307…
$ drainage_class         <fct> moderate, moderate, moderate, moderate, moderat…
$ slope_pct              <dbl> 9.358450, 6.496899, 6.761809, 6.584873, 6.46126…
$ planting_doy           <dbl> 116, 108, 115, 126, 119, 127, 123, 124, 123, 12…
$ spring_rain_mm         <dbl> 154.5662, 106.9352, 160.4594, 259.3582, 282.665…
$ summer_rain_mm         <dbl> 420.3239, 338.8147, 384.4102, 242.3768, 230.724…
$ gdd                    <dbl> 1392.796, 1397.189, 1333.349, 1316.239, 1327.53…
$ heat_stress_days       <dbl> 11, 12, 13, 7, 6, 4, 9, 8, 9, 11, 9, 11, 8, 9, …
$ water_deficit_mm       <dbl> 44.49274329, 39.85067403, 16.53869214, 46.77271…
$ excess_water_index     <dbl> 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.…
$ spring_nitrate_ppm     <dbl> 8.782210, 7.871453, 8.997586, 4.746425, 13.0066…
$ n_mineralization_index <dbl> 5.218859, 4.358232, 6.086249, 3.983716, 5.86645…
$ n_loss_risk_index      <dbl> 1.976456, 1.741376, 2.254018, 2.991177, 2.60394…
$ yield_eonr_Mg_ha       <dbl> 6.205054, 6.772010, 5.500000, 5.500000, 5.50000…
$ eonr_kgha              <dbl> 0.000000, 7.205527, 0.000000, 46.381194, 0.0000…

8 🔎 Quick exploratory plots

8.1 EONR by previous crop

Show code
corn_eonr %>%
  ggplot(aes(x = previous_crop, y = eonr_kgha)) +
  geom_boxplot() +
  labs(
    title = "EONR by previous crop",
    x = "Previous crop",
    y = "EONR (kg N/ha)"
  )

8.2 Yield at EONR by topographic position

Show code
corn_eonr %>%
  ggplot(aes(x = topo_pos, y = yield_eonr_Mg_ha)) +
  geom_boxplot() +
  labs(
    title = "Yield at EONR by topographic position",
    x = "Topographic position",
    y = "Yield at EONR (Mg/ha)"
  )

8.3 Relationship between the two responses

Show code
corn_eonr %>%
  ggplot(aes(x = yield_eonr_Mg_ha, y = eonr_kgha)) +
  geom_point(alpha = 0.5, 
             #aes(color = site)
             ) +
  labs(
    title = "Yield at EONR and EONR are related, but not identical",
    x = "Yield at EONR (Mg/ha)",
    y = "EONR (kg N/ha)"
  )

⚠️ A note on the train/test split

Here we use a simple random split because it is easy to understand for a first lesson.

However, this dataset has repeated structure across sites, years, and topographic positions. In real predictive work, grouped resampling by site, year, or site-year may provide a more honest evaluation than a random row-wise split.
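A grouped split can be sketched in a few lines of base R. The stand-in data frame below is hypothetical; with the lesson data you would apply the same idea to corn_eonr and its site (or year) column.

```r
# Leave-sites-out split: hold out entire sites so the test set
# contains no rows from sites seen during training
set.seed(6530)
dat <- data.frame(
  site = rep(paste0("S", 1:10), each = 30),  # 10 sites, 30 rows each
  y    = rnorm(300)
)

test_sites <- sample(unique(dat$site), size = 2)  # hold out 2 whole sites
train_g <- dat[!dat$site %in% test_sites, ]
test_g  <- dat[ dat$site %in% test_sites, ]

# Sanity check: no site appears in both sets
intersect(train_g$site, test_g$site)  # character(0)
```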

9 ✂️ Train/test split

As in most supervised learning workflows, we fit the model on a training set and evaluate its performance on a test set. We will use an 80:20 split: 80% of the data for training and 20% for testing. Remember that other cross-validation schemes (e.g., leave-one-group-out) may be more appropriate for this dataset, but we keep things simple since this is our first lesson on regression trees.

Show code
set.seed(6530)

n <- nrow(corn_eonr)
train_id <- sample(seq_len(n), size = round(0.8 * n))

# Here we take the 80% of the data for training 
train_df <- corn_eonr[train_id, ] 
# And here we take the remaining 20% for testing
test_df  <- corn_eonr[-train_id, ]

dim(train_df)
[1] 2400   24
Show code
dim(test_df)
[1] 600  24

🌳 How does a regression tree work?

A regression tree starts with all observations in a single node.

Then it asks a sequence of binary questions such as:

  • Is previous_crop alfalfa?
  • Is spring_nitrate_ppm less than a threshold?
  • Is water_deficit_mm above a threshold?

Each split tries to create subgroups with more similar response values than the parent group.
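What "more similar response values" means can be made concrete: for a continuous response, a candidate split is scored by how much it reduces the sum of squared errors (SSE) around the node means. A base-R sketch on toy numbers:

```r
# Score a candidate split: SSE of the parent node minus the combined
# SSE of the two child nodes it would create
sse <- function(y) sum((y - mean(y))^2)

y <- c(1, 2, 2, 3, 9, 10, 10, 11)  # toy response with two clear groups
x <- 1:8                           # toy predictor

parent <- sse(y)                          # 132
child  <- sse(y[x <= 4]) + sse(y[x > 4])  # 2 + 2 = 4
parent - child                            # reduction achieved: 128
```

For an anova tree, every candidate split is evaluated this way, and the one with the largest reduction wins at each node.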

10 🌳 Regression trees with rpart()

11 🧪 What does rpart() do?

rpart() grows a regression tree by searching for splits that reduce within-node variability in the response.

For a continuous response, it tries to create child nodes where the response is more homogeneous than in the parent node.

11.1 🌽 Model for EONR

Show code
rpart_eonr <- 
  rpart(
  formula = eonr_kgha ~ 
    topo_pos + previous_crop + soil_texture + soil_depth_cm + whc_mm +
    om_pct + ph + drainage_class + slope_pct + planting_doy +
    spring_rain_mm + summer_rain_mm + gdd + heat_stress_days +
    water_deficit_mm + excess_water_index + spring_nitrate_ppm +
    n_mineralization_index + n_loss_risk_index,
  data = train_df,
  method = "anova",
  control = rpart.control(
    cp = 0.001,
    minsplit = 30,
    maxdepth = 5
  )
)

rpart_eonr
n= 2400 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 2400 2058038.000  29.733080  
   2) previous_crop=alfalfa,soybean 1548  563219.800  14.621180  
     4) spring_nitrate_ppm>=7.598789 891   97720.590   5.220692  
       8) spring_nitrate_ppm>=10.33134 451   10792.880   1.256361 *
       9) spring_nitrate_ppm< 10.33134 440   72574.760   9.284131  
        18) water_deficit_mm>=152.4532 228   10010.240   3.687688 *
        19) water_deficit_mm< 152.4532 212   47743.570  15.302950  
          38) gdd< 1406.37 61    4717.100   5.653791 *
          39) gdd>=1406.37 151   35052.630  19.200950 *
     5) spring_nitrate_ppm< 7.598789 657  279982.200  27.369780  
      10) n_loss_risk_index< 3.812091 570  185857.600  23.772950  
        20) water_deficit_mm>=131.5821 316   73278.240  18.219890  
          40) spring_nitrate_ppm>=5.671195 182   29088.720  12.687690 *
          41) spring_nitrate_ppm< 5.671195 134   31053.940  25.733770 *
        21) water_deficit_mm< 131.5821 254   90712.120  30.681490  
          42) gdd< 1409.037 80   19299.940  18.304540 *
          43) gdd>=1409.037 174   53522.540  36.372040 *
      11) n_loss_risk_index>=3.812091 87   38436.920  50.935180  
        22) spring_nitrate_ppm>=3.377385 69   21970.810  46.041170  
          44) water_deficit_mm>=52.41622 25    3418.610  35.270470 *
          45) water_deficit_mm< 52.41622 44   14004.150  52.160890 *
        23) spring_nitrate_ppm< 3.377385 18    8478.361  69.695530 *
   3) previous_crop=corn 852  498998.100  57.189920  
     6) n_loss_risk_index< 3.208351 602  284237.500  50.348060  
      12) spring_nitrate_ppm>=5.823104 141   57298.600  35.672710  
        24) water_deficit_mm>=257.2334 50   13017.640  24.515670  
          48) n_mineralization_index>=6.003867 15    1958.918  11.542450 *
          49) n_mineralization_index< 6.003867 35    7452.198  30.075620 *
        25) water_deficit_mm< 257.2334 91   34637.200  41.802960  
          50) gdd< 1446.907 25    4592.789  23.343420 *
          51) gdd>=1446.907 66   18298.700  48.795210 *
      13) spring_nitrate_ppm< 5.823104 461  187284.500  54.836610  
        26) water_deficit_mm>=201.0774 215   64659.030  47.563630  
          52) spring_nitrate_ppm>=3.185544 107   23156.310  38.807740 *
          53) spring_nitrate_ppm< 3.185544 108   25172.210  56.238460 *
        27) water_deficit_mm< 201.0774 246  101313.200  61.193080  
          54) gdd< 1463.647 120   31478.090  50.259610 *
          55) gdd>=1463.647 126   41828.430  71.605910 *
     7) n_loss_risk_index>=3.208351 250  118722.200  73.665130  
      14) water_deficit_mm>=70.3831 140   42115.720  64.011320  
        28) spring_nitrate_ppm>=3.535302 50   14409.580  53.282400  
          56) ph>=6.62271 21    4113.880  43.087850 *
          57) ph< 6.62271 29    6532.758  60.664660 *
        29) spring_nitrate_ppm< 3.535302 90   18753.160  69.971840  
          58) n_loss_risk_index< 3.785434 72   13624.230  67.400480 *
          59) n_loss_risk_index>=3.785434 18    2748.664  80.257240 *
      15) water_deficit_mm< 70.3831 110   46953.200  85.951810  
        30) spring_nitrate_ppm>=3.167923 39   10990.730  73.333290 *
        31) spring_nitrate_ppm< 3.167923 71   26341.580  92.883100  
          62) gdd< 1428.991 33    9558.439  81.711000 *
          63) gdd>=1428.991 38    9087.264 102.585200 *

11.2 🌳 Plot the rpart() tree

Show code
rpart.plot(
  rpart_eonr,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  tweak = 1.0,
  box.palette = "RdYlGn",
)

11.3 🔍 Variable importance from rpart()

Show code
rpart_eonr$variable.importance %>%
  sort(decreasing = TRUE) %>% 
  as.data.frame()
                                 .
previous_crop          1021558.977
spring_nitrate_ppm      793458.021
n_loss_risk_index       212476.606
gdd                     168571.957
spring_rain_mm          147527.672
n_mineralization_index  142437.193
water_deficit_mm        108960.397
om_pct                   38431.518
planting_doy             26936.049
summer_rain_mm           21520.542
heat_stress_days         20586.957
whc_mm                   11087.435
ph                       10886.443
slope_pct                 6354.733
soil_depth_cm             5492.259
drainage_class            4140.046
soil_texture              1990.203
topo_pos                  1399.251

11.4 📉 Predictions and performance on the test set

Show code
test_df <- test_df %>%
  mutate(pred_rpart_eonr = predict(rpart_eonr, newdata = test_df))

# Using metrica for prediction metrics
metrica::metrics_summary(
  data = test_df,
  obs = eonr_kgha,
  pred = pred_rpart_eonr,
  type = "regression",
  metrics = c("RMSE", "MAE", "R2")
) %>%
  mutate(model = "rpart_eonr") %>%
  select(model, everything())
       model Metric      Score
1 rpart_eonr     R2  0.7362053
2 rpart_eonr    MAE 10.8485675
3 rpart_eonr   RMSE 15.2023901

11.5 📈 Observed vs predicted for rpart()

Show code
# Basic ggplot for observed vs predicted
test_df %>% 
  ggplot(aes(x = eonr_kgha, y = pred_rpart_eonr)) +
  geom_point(alpha = 0.5, shape = 21, size = 3,
             fill = "forestgreen", color = "grey15") +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(0, 150), breaks = seq(0,200, by=50)) +
  scale_y_continuous(limits = c(0, 150), breaks = seq(0,200, by=50)) +
  labs(
    title = "Observed vs predicted EONR with rpart()",
    x = "Observed EONR (kg N/ha)",
    y = "Predicted EONR (kg N/ha)"
  )+
  theme_bw()

Show code
# Use metrica's scatter_plot for a more standardized visualization
# You can print metrics on the plot and draw a regression trendline
metrica::scatter_plot(data = test_df,
                      obs = eonr_kgha,
                      pred = pred_rpart_eonr,
                      print_metrics = TRUE,
                      metrics_list = c("RMSE", "MAE", "R2")
                      )+
  scale_x_continuous(limits = c(0, 200), breaks = seq(0,200, by=50)) +
  scale_y_continuous(limits = c(0, 200), breaks = seq(0,200, by=50)) +
  labs(title = "Observed vs predicted EONR with rpart()",
       x = "Observed EONR (kg N/ha)",
       y = "Predicted EONR (kg N/ha)"  )

12 ✂️ Tree complexity and pruning in rpart()

A fully grown tree often overfits the training data. One strength of rpart() is that it provides a complexity parameter table that helps us inspect pruning candidates.

Show code
printcp(rpart_eonr)

Regression tree:
rpart(formula = eonr_kgha ~ topo_pos + previous_crop + soil_texture + 
    soil_depth_cm + whc_mm + om_pct + ph + drainage_class + slope_pct + 
    planting_doy + spring_rain_mm + summer_rain_mm + gdd + heat_stress_days + 
    water_deficit_mm + excess_water_index + spring_nitrate_ppm + 
    n_mineralization_index + n_loss_risk_index, data = train_df, 
    method = "anova", control = rpart.control(cp = 0.001, minsplit = 30, 
        maxdepth = 5))

Variables actually used in tree construction:
[1] gdd                    n_loss_risk_index      n_mineralization_index
[4] ph                     previous_crop          spring_nitrate_ppm    
[7] water_deficit_mm      

Root node error: 2058038/2400 = 857.52

n= 2400 

          CP nsplit rel error  xerror      xstd
1  0.4838688      0   1.00000 1.00092 0.0267026
2  0.0901426      1   0.51613 0.51742 0.0180543
3  0.0466650      2   0.42599 0.43129 0.0159070
4  0.0270586      3   0.37932 0.38879 0.0142195
5  0.0192681      4   0.35226 0.37012 0.0134314
6  0.0144085      5   0.33300 0.35255 0.0126548
7  0.0119820      6   0.31859 0.34497 0.0125944
8  0.0106253      8   0.29462 0.33139 0.0120931
9  0.0086926      9   0.28400 0.32184 0.0117016
10 0.0079350     10   0.27531 0.31258 0.0115212
11 0.0070878     11   0.26737 0.30059 0.0111161
12 0.0063826     13   0.25320 0.28839 0.0108511
13 0.0051966     14   0.24681 0.28211 0.0106986
14 0.0046748     16   0.23642 0.27523 0.0103915
15 0.0043503     17   0.23175 0.27450 0.0103303
16 0.0038812     18   0.22740 0.26999 0.0101847
17 0.0038745     19   0.22351 0.26730 0.0101284
18 0.0037394     20   0.21964 0.26599 0.0101066
19 0.0022099     21   0.21590 0.25727 0.0097475
20 0.0018284     22   0.21369 0.25804 0.0098590
21 0.0017524     23   0.21186 0.25693 0.0098234
22 0.0011566     24   0.21011 0.25750 0.0098522
23 0.0010000     25   0.20895 0.25594 0.0097979
Show code
plotcp(rpart_eonr)

A common strategy is to prune the tree at the cp value that minimizes the cross-validated error (xerror) in the complexity table; below we extract that value and prune with it.

Show code
cp_best <- rpart_eonr$cptable[which.min(rpart_eonr$cptable[, "xerror"]), "CP"]

rpart_eonr_pruned <- prune(rpart_eonr, cp = cp_best)

rpart.plot(
  rpart_eonr_pruned,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  tweak = 1.0,
  box.palette = "RdYlGn"
)
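which.min(xerror) selects the most complex tree with the lowest cross-validated error. A more conservative alternative is the one-standard-error rule: take the simplest tree whose xerror lies within one xstd of that minimum. A sketch of such a helper (the name cp_one_se is ours), which you could use in place of cp_best:

```r
# One-SE rule: largest cp (hence simplest tree) whose cross-validated
# error is within one standard error of the minimum xerror
cp_one_se <- function(fit) {
  tab    <- fit$cptable
  i_min  <- which.min(tab[, "xerror"])
  cutoff <- tab[i_min, "xerror"] + tab[i_min, "xstd"]
  ok     <- which(tab[, "xerror"] <= cutoff)
  tab[min(ok), "CP"]  # rows are ordered by decreasing cp
}

# e.g. prune(rpart_eonr, cp = cp_one_se(rpart_eonr))
```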

Show code
test_df <- test_df %>%
  mutate(pred_rpart_eonr_pruned = predict(rpart_eonr_pruned, newdata = test_df))

bind_rows(
  metrica::metrics_summary(
    data = test_df,
    obs = eonr_kgha,
    pred = pred_rpart_eonr,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "rpart_eonr"),
  metrica::metrics_summary(
    data = test_df,
    obs = eonr_kgha,
    pred = pred_rpart_eonr_pruned,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "rpart_eonr_pruned")
) %>%
  select(model, everything())
              model Metric      Score
1        rpart_eonr     R2  0.7362053
2        rpart_eonr    MAE 10.8485675
3        rpart_eonr   RMSE 15.2023901
4 rpart_eonr_pruned     R2  0.7362053
5 rpart_eonr_pruned    MAE 10.8485675
6 rpart_eonr_pruned   RMSE 15.2023901
Note that in this run the cp minimizing xerror is the smallest cp in the table (0.001), so pruning at cp_best leaves the tree unchanged and the test-set metrics are identical.

13 🌿 Regression trees with ctree()

14 🧪 What does ctree() do?

ctree() fits a conditional inference tree. Instead of choosing splits by the impurity-reduction search used in rpart(), it uses a statistical testing framework: at each node it tests the association between each predictor and the response, and it splits only when the multiplicity-adjusted p-value falls below a threshold.

This often produces different-looking, more conservative trees than rpart(), even when both are fitted to the same response and predictors.
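The flavor of that selection step can be sketched in base R: test each predictor's association with the response and keep only those with small multiplicity-adjusted p-values. This is only an illustration; partykit uses permutation-test statistics, not cor.test().

```r
# Mimic the variable-selection step: association tests with a
# Bonferroni adjustment, as a stand-in for ctree's permutation tests
set.seed(42)
n  <- 200
df <- data.frame(signal = rnorm(n), noise = rnorm(n))
df$y <- 2 * df$signal + rnorm(n)

p_vals <- c(
  signal = cor.test(df$signal, df$y)$p.value,
  noise  = cor.test(df$noise,  df$y)$p.value
)
p_adj <- p.adjust(p_vals, method = "bonferroni")

# The informative predictor passes a 0.05 threshold easily; the pure
# noise predictor usually does not, so it would not be split on
p_adj < 0.05
```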

14.1 Model for EONR

Show code
ctree_eonr <- ctree(
  eonr_kgha ~
    topo_pos + previous_crop + soil_texture + soil_depth_cm + whc_mm +
    om_pct + ph + drainage_class + slope_pct + planting_doy +
    spring_rain_mm + summer_rain_mm + gdd + heat_stress_days +
    water_deficit_mm + excess_water_index + spring_nitrate_ppm +
    n_mineralization_index + n_loss_risk_index,
  data = train_df,
  control = ctree_control(
    minsplit = 30,
    maxdepth = 3
  )
)

ctree_eonr

Model formula:
eonr_kgha ~ topo_pos + previous_crop + soil_texture + soil_depth_cm + 
    whc_mm + om_pct + ph + drainage_class + slope_pct + planting_doy + 
    spring_rain_mm + summer_rain_mm + gdd + heat_stress_days + 
    water_deficit_mm + excess_water_index + spring_nitrate_ppm + 
    n_mineralization_index + n_loss_risk_index

Fitted party:
[1] root
|   [2] previous_crop in alfalfa, soybean
|   |   [3] spring_nitrate_ppm <= 7.5976
|   |   |   [4] water_deficit_mm <= 94.99674: 37.081 (n = 278, err = 138454.2)
|   |   |   [5] water_deficit_mm > 94.99674: 20.246 (n = 379, err = 96077.3)
|   |   [6] spring_nitrate_ppm > 7.5976
|   |   |   [7] spring_nitrate_ppm <= 10.33118: 9.284 (n = 440, err = 72574.8)
|   |   |   [8] spring_nitrate_ppm > 10.33118: 1.256 (n = 451, err = 10792.9)
|   [9] previous_crop in corn
|   |   [10] n_loss_risk_index <= 3.20745
|   |   |   [11] spring_nitrate_ppm <= 5.81194: 54.837 (n = 461, err = 187284.5)
|   |   |   [12] spring_nitrate_ppm > 5.81194: 35.673 (n = 141, err = 57298.6)
|   |   [13] n_loss_risk_index > 3.20745
|   |   |   [14] water_deficit_mm <= 69.99636: 85.952 (n = 110, err = 46953.2)
|   |   |   [15] water_deficit_mm > 69.99636: 64.011 (n = 140, err = 42115.7)

Number of inner nodes:    7
Number of terminal nodes: 8

14.2 🌿 Plot the ctree() tree

Show code
plot(ctree_eonr)

14.3 Predictions and performance on the test set

Show code
test_df <- test_df %>%
  mutate(pred_ctree_eonr = predict(ctree_eonr, newdata = test_df))

bind_rows(
  metrica::metrics_summary(
    data = test_df,
    obs = eonr_kgha,
    pred = pred_rpart_eonr,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "rpart_eonr"),
  metrica::metrics_summary(
    data = test_df,
    obs = eonr_kgha,
    pred = pred_rpart_eonr_pruned,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "rpart_eonr_pruned"),
  metrica::metrics_summary(
    data = test_df,
    obs = eonr_kgha,
    pred = pred_ctree_eonr,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "ctree_eonr")
) %>%
  select(model, everything())
              model Metric      Score
1        rpart_eonr     R2  0.7362053
2        rpart_eonr    MAE 10.8485675
3        rpart_eonr   RMSE 15.2023901
4 rpart_eonr_pruned     R2  0.7362053
5 rpart_eonr_pruned    MAE 10.8485675
6 rpart_eonr_pruned   RMSE 15.2023901
7        ctree_eonr     R2  0.6385247
8        ctree_eonr    MAE 13.0487819
9        ctree_eonr   RMSE 17.7822996

14.4 Observed vs predicted for ctree()

Show code
test_df %>%
  ggplot(aes(x = eonr_kgha, y = pred_ctree_eonr)) +
  geom_point(alpha = 0.5, shape = 21, size = 3,
             fill = "orange", color = "grey15") +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  scale_x_continuous(limits = c(0, 150), breaks = seq(0,200, by=50)) +
  scale_y_continuous(limits = c(0, 150), breaks = seq(0,200, by=50)) +
  labs(
    title = "Observed vs predicted EONR with ctree()",
    x = "Observed EONR (kg N/ha)",
    y = "Predicted EONR (kg N/ha)"
  )+
  theme_bw()

Show code
# Use metrica's scatter_plot for a more standardized visualization
metrica::scatter_plot(data = test_df,
                      obs = eonr_kgha,
                      pred = pred_ctree_eonr,
                      print_metrics = TRUE,
                      metrics_list = c("RMSE", "MAE", "R2")
                      )+
    labs(title = "Observed vs predicted EONR with ctree()",
       x = "Observed EONR (kg N/ha)",
       y = "Predicted EONR (kg N/ha)"  )

15 🌽 Fit a second tree for yield at EONR

Now we repeat the process for a second continuous response.

15.1 rpart() for yield at EONR

Show code
rpart_yield <- rpart(
  yield_eonr_Mg_ha ~ 
    topo_pos + previous_crop + soil_texture + soil_depth_cm + whc_mm +
    om_pct + ph + drainage_class + slope_pct + planting_doy +
    spring_rain_mm + summer_rain_mm + gdd + heat_stress_days +
    water_deficit_mm + excess_water_index + spring_nitrate_ppm +
    n_mineralization_index + n_loss_risk_index,
  data = train_df,
  method = "anova",
  control = rpart.control(
    cp = 0.002,
    minsplit = 30,
    maxdepth = 3
  )
)

# Plot the rpart() tree for yield at EONR
rpart.plot(
  rpart_yield,
  type = 2,
  extra = 101,
  fallen.leaves = TRUE,
  tweak = 1.0,
  box.palette = "RdYlGn"
)

15.2 ctree() for yield at EONR

Show code
ctree_yield <- ctree(
  yield_eonr_Mg_ha ~ 
    topo_pos + previous_crop + soil_texture + soil_depth_cm + whc_mm +
    om_pct + ph + drainage_class + slope_pct + planting_doy +
    spring_rain_mm + summer_rain_mm + gdd + heat_stress_days +
    water_deficit_mm + excess_water_index + spring_nitrate_ppm +
    n_mineralization_index + n_loss_risk_index,
  data = train_df,
  control = ctree_control(
    minsplit = 30,
    maxdepth = 3
  )
)

plot(ctree_yield)

15.3 Test-set comparison for yield at EONR

Show code
test_df <- test_df %>%
  mutate(
    pred_rpart_yield = predict(rpart_yield, newdata = test_df),
    pred_ctree_yield = predict(ctree_yield, newdata = test_df)
  )

# Merge performance metrics for both models into a single table for comparison
bind_rows(
  metrica::metrics_summary(
    data = test_df,
    obs = yield_eonr_Mg_ha,
    pred = pred_rpart_yield,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "rpart_yield"),
  metrica::metrics_summary(
    data = test_df,
    obs = yield_eonr_Mg_ha,
    pred = pred_ctree_yield,
    type = "regression",
    metrics = c("RMSE", "MAE", "R2")
  ) %>% mutate(model = "ctree_yield")
) %>%
  select(model, everything())
        model Metric     Score
1 rpart_yield     R2 0.7574397
2 rpart_yield    MAE 0.6242395
3 rpart_yield   RMSE 0.9642375
4 ctree_yield     R2 0.7574633
5 ctree_yield    MAE 0.6243926
6 ctree_yield   RMSE 0.9641893

16 📏 About the performance metrics

In this lesson, prediction performance is summarized with RMSE, MAE, and R2 using the metrica package.

These metrics are described in the package vignette on available regression metrics. That reference is useful because it explains not only the formulas, but also the practical interpretation of each metric when comparing predictive models.
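To see exactly what these numbers summarize, here they are computed by hand on toy observed/predicted vectors (R2 here is the squared Pearson correlation; see the metrica vignette for the exact definitions it uses).

```r
# RMSE, MAE, and R2 computed directly on toy vectors
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 41)

rmse <- sqrt(mean((obs - pred)^2))  # typical size of an error
mae  <- mean(abs(obs - pred))       # average absolute error
r2   <- cor(obs, pred)^2            # squared Pearson correlation

round(c(RMSE = rmse, MAE = mae, R2 = r2), 3)
# RMSE = 2.121, MAE = 2.000, R2 = 0.974
```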

17 🔍 Interpreting regression trees

A regression tree is not just a predictive tool. It is also a way to summarize how the algorithm partitions the data.

When reading a tree, ask:

  1. Which predictor appears first?
  2. Are the first splits agronomically sensible?
  3. Do some splits reflect thresholds?
  4. Are the terminal nodes easy to explain?
  5. Does the tree reveal interactions that a linear model would need to specify explicitly?

For example, a split on previous_crop followed by spring_nitrate_ppm would make agronomic sense for EONR because those variables are directly related to N supply.

Likewise, a split on water_deficit_mm or whc_mm would make sense for yield because those variables affect crop water availability.
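One practical way to answer these questions is to turn a terminal node into an explicit rule. rpart provides path.rpart() for this. The sketch below uses the built-in mtcars data; with the lesson model you would pass rpart_eonr and a node number from its printed output.

```r
# Print the chain of splits that defines a terminal node as a rule
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10))

# Terminal nodes are flagged "<leaf>" in the tree frame
leaves <- as.numeric(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# Splits from the root down to the first terminal node
path.rpart(fit, nodes = leaves[1])
```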

18 ⚖️ Strengths and limitations of a single tree

18.1 Strengths

  • Easy to visualize
  • Handles nonlinear relationships naturally
  • Captures interactions automatically
  • Works with mixed predictor types
  • Produces rules that are often intuitive

18.2 Limitations

  • A single tree is usually unstable
  • Small changes in the data can change the tree structure
  • Prediction accuracy is often lower than ensemble methods
  • Trees can overfit if not constrained or pruned
  • Variable importance from one tree should not be overinterpreted as causality
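The instability point is easy to demonstrate: refit the same tree specification on two bootstrap resamples and compare the root split. With correlated predictors (toy data below), the chosen root variable can flip between resamples.

```r
# Fit the same tree spec on two bootstrap resamples and compare the
# predictor chosen at the root; with correlated predictors it can flip
library(rpart)

set.seed(3)
n  <- 120
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.4)  # strongly correlated competitor
d  <- data.frame(y = x1 + rnorm(n), x1 = x1, x2 = x2)

root_var <- function(data) {
  fit <- rpart(y ~ x1 + x2, data = data, method = "anova")
  as.character(fit$frame$var[1])  # predictor used at the root split
}

b1 <- d[sample(n, replace = TRUE), ]
b2 <- d[sample(n, replace = TRUE), ]
c(root_var(b1), root_var(b2))
```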

19 ✅ Final remarks

Single regression trees are a good starting point for machine learning because they are easy to interpret. They help us see how recursive partitioning works before moving to more powerful ensemble methods such as random forests or boosting.

In practice, a single tree is often more useful for interpretation than for maximum predictive performance.

20 📝 Exercises

  1. Fit a shallower rpart() tree by reducing maxdepth to 3. What changes?
  2. Increase minsplit from 30 to 80. Does the tree become easier to interpret?
  3. Compare trees fitted with and without previous_crop.
  4. Compare trees fitted with and without spring_nitrate_ppm.
  5. Which response seems easier to predict: yield_eonr_Mg_ha or eonr_kgha?
  6. Which method gave the better test-set performance in your run: rpart() or ctree()?