Reproducible Data Science with R

Adrian A. Correndo

2025-11-24

Welcome 👋

  • Goal: Gain foundational knowledge of data science and understand how it can improve agricultural practices.
  • Let’s dive into it with an emphasis on reproducibility and data literacy.

Tip

  • Remember: Questions & discussions are encouraged! 💬

Objectives for Today 📌

  • Define core concepts:
    • Data Science,
    • Data Literacy,
    • Reproducibility.
  • Understand the role of reproducible data science in agriculture.
  • Explore challenges and opportunities.

Core concepts

What is Data Science in Agriculture? 🌱

  • Applying data engineering, analysis, statistics, and machine learning to solve agricultural problems.
  • Examples: Precision agriculture, yield forecasting, environmental monitoring.

Key Definitions 📖

  • Data Science: Extracting insights from data using algorithms and statistical methods.

  • Data Literacy: Skills to read, interpret, and analyze data.

  • Reproducibility: Ensuring analyses can be recreated by others.

Note

Why does reproducibility matter?

  • Trustworthy results,

  • transparency, &

  • collaboration in research.
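
To make "can be recreated by others" concrete, here is a minimal sketch in base R. The function name `simulate_yield` and the yield values are hypothetical, chosen only for illustration. It shows that fixing the random seed makes a simulation return exactly the same numbers on every run.

```r
# Hypothetical example: simulate plot yields twice with the same seed.
simulate_yield <- function(seed, n = 10) {
  set.seed(seed)                   # fix the random number generator state
  rnorm(n, mean = 10, sd = 1.5)    # simulated yields (Mg/ha), illustrative values
}

run_1 <- simulate_yield(seed = 123)
run_2 <- simulate_yield(seed = 123)

identical(run_1, run_2)   # TRUE: same seed, same draws
```

Without `set.seed()`, each run would draw different numbers, and a collaborator could not recover the same result.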

Challenges in Data Literacy 🌍

  • Diverse data sources (weather, soil, crop data)
  • Standardization issues across datasets
  • Data skills gap among ag professionals
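
As an illustration of the standardization issue, the sketch below harmonizes two hypothetical yield tables that use different column names and units. The data frames, field IDs, and values are invented. The conversion factor assumes corn at 56 lb per bushel.

```r
# Two hypothetical sources reporting the same variable with different conventions.
source_a <- data.frame(
  field_id    = c("F1", "F2"),
  yield_kg_ha = c(9500, 11200)
)
source_b <- data.frame(
  FieldID     = c("F3", "F4"),
  yield_bu_ac = c(160, 185)        # corn, bushels per acre
)

# Assumed conversion for corn (56 lb bushel): 1 bu/ac is about 62.77 kg/ha.
bu_ac_to_kg_ha <- 62.77

# Rename columns and convert units so both sources share one schema.
harmonized_b <- data.frame(
  field_id    = source_b$FieldID,
  yield_kg_ha = source_b$yield_bu_ac * bu_ac_to_kg_ha
)

combined <- rbind(source_a, harmonized_b)
combined$yield_mg_ha <- combined$yield_kg_ha / 1000   # kg/ha to Mg/ha
combined
```

Keeping the conversion in a script, rather than in a manual spreadsheet step, documents the assumption and lets others check it.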

Why does it matter?

  • Data literacy is the #1 skill gap in the job market:

    • Academia,
    • Industry,
    • Government, NGOs, etc.

  • Is there a REPRODUCIBILITY CRISIS in science?

  • A Nature survey of ~1,600 researchers found that

    • more than 70% have tried and failed to reproduce another scientist’s experiments,

    • more than 50% have failed to reproduce their own experiments.

    • Main causes: selective reporting, weak statistics, unavailable code or data, etc.

GOOD NEWS IS…

Why Reproducibility in Agriculture?

  • Agriculture research relies heavily on environmental data, often variable and complex.

  • We have complex challenges 🗒️

    • Variability due to environmental factors, soil types, and weather patterns.
    • Complex datasets involving long-term studies, geographical variability.
  • Opportunities ✅

    • Reproducibility helps stakeholders make reliable, data-driven decisions.
    • Ensures scientific findings are reliable and valid.
    • Facilitates collaboration, accountability, and efficiency among researchers and practitioners.

Challenges in Ag-research

REPRODUCIBILITY 💻

  • Limited capability to reproduce analyses & results

  • DATA are rarely shared, CODE even less often
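
One low-effort way to share both data and computing context is sketched below in base R. The folder name, file names, and trial values are hypothetical. A temporary directory is used here only so the example runs anywhere.

```r
# Hypothetical trial data; names and values are illustrative only.
trial <- data.frame(
  plot        = 1:4,
  treatment   = c("control", "control", "fertilized", "fertilized"),
  yield_mg_ha = c(8.1, 7.9, 10.4, 10.9)
)

out_dir <- file.path(tempdir(), "shared_analysis")   # assumed output folder
dir.create(out_dir, showWarnings = FALSE)

# 1. Data in an open, plain-text format.
write.csv(trial, file.path(out_dir, "trial_data.csv"), row.names = FALSE)

# 2. A record of the R version and attached packages used for the analysis.
writeLines(capture.output(sessionInfo()),
           file.path(out_dir, "session_info.txt"))

# Anyone with the folder can reload the same data and compare.
reloaded <- read.csv(file.path(out_dir, "trial_data.csv"))
all.equal(trial, reloaded)
```

In a real project the folder would live in a shared repository next to the analysis script itself.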

ACCESSIBILITY 📲

  • Yet we are not translating enough science into flexible and transparent decision tools.

“But it all starts with …”

EDUCATION 🎓

  • Limited curriculum in applied data science

Discussion Prompt 💬


i. Where do you think improved data literacy & reproducibility could impact agriculture the most?


Tip

  • Consider areas like resource management, market predictions, and farm management.


ii. What practical challenges do you face (or might you face) in implementing them?


What is R?

What is R? 🧮

  • R is a programming language and environment primarily for statistical analysis, data visualization, and data science.
  • Known for its extensive statistical libraries, data manipulation capabilities, and graphics.
  • Widely used in fields like data science, bioinformatics, agriculture, and social sciences.
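
For a first taste of R's statistical tooling, the sketch below fits a simple linear regression with base R. The data are simulated, and the variable names `n_rate` and `yield` and the true slope of 0.02 are assumptions made for illustration.

```r
# Simulated data (illustrative, not real measurements).
set.seed(42)
n_rate <- rep(c(0, 50, 100, 150, 200), each = 4)                  # kg N/ha
yield  <- 6 + 0.02 * n_rate + rnorm(length(n_rate), sd = 0.5)     # Mg/ha

fit <- lm(yield ~ n_rate)   # simple linear regression

coef(fit)                   # intercept and slope estimates
summary(fit)$r.squared      # proportion of variance explained
```

The estimated slope should land close to the 0.02 used to generate the data, which is a quick sanity check on the model.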


Brief History of R 📜

  • Origin: Developed in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
  • Inspiration: R is an implementation of the S language, designed at Bell Laboratories for data analysis.
  • Open Source: Released as free, open-source software, leading to a large community of users and contributors.
  • Popularity: Today, R is one of the top programming languages for statistical analysis and data science.

Alternatives to R

R vs. Excel for Data Wrangling 📊

  • Excel: Known for ease of use, popular among business and finance professionals.
    • Pros: Intuitive, good for small datasets and quick analysis.
    • Cons: Limited in handling large datasets, lacks reproducibility.
  • R: Provides powerful data manipulation packages (e.g., dplyr, tidyr).
    • Pros: Handles large datasets efficiently, supports complex transformations, fully reproducible.
    • Cons: Requires programming knowledge, steeper learning curve than Excel.

Tip

  • R scales well and is ideal for projects requiring automation, reproducibility, and handling large datasets.
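
The kind of scripted wrangling described above might look like the following sketch, which assumes the dplyr package is installed. The data frame `plots` and its values are hypothetical. Every step is written down, so the summary can be regenerated whenever the data change.

```r
library(dplyr)

# Hypothetical plot-level data; values are illustrative only.
plots <- data.frame(
  site        = c("A", "A", "A", "B", "B", "B"),
  treatment   = c("control", "fertilized", "fertilized",
                  "control", "control", "fertilized"),
  yield_mg_ha = c(8.0, 10.5, 10.9, 7.2, 7.6, 9.8)
)

site_summary <- plots %>%
  filter(!is.na(yield_mg_ha)) %>%          # drop missing yields, if any
  group_by(site, treatment) %>%            # one group per site x treatment
  summarise(
    n          = n(),                      # plots per group
    mean_yield = mean(yield_mg_ha),        # group mean (Mg/ha)
    .groups    = "drop"
  ) %>%
  arrange(site, treatment)

site_summary
```

The equivalent spreadsheet workflow (filter, pivot, copy, paste) leaves no record of the steps taken, which is the reproducibility gap the comparison points to.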

R vs. SAS for Statistical Analysis 📉

  • SAS: A powerful statistical software suite used widely in industries such as healthcare and finance.
    • Pros: Robust for regulatory environments, highly standardized.
    • Cons: Proprietary and costly, limited community contributions.
  • R: Offers a vast array of statistical packages and flexibility in method implementation.
    • Pros: Free and open-source, customizable, strong community support.
    • Cons: Requires more coding and configuration for regulatory standards.

Note

  • Comparison: R is often chosen for research and academia due to its flexibility and customization, while SAS remains strong in industries needing strict compliance and control.

R vs. Python vs. Julia 🔍

  • R, Python, and Julia are popular languages in data science and research.
  • Each language has unique strengths, ideal use cases, and licensing considerations.

R: Strengths and Use Cases 🧮

  • Designed for Statistics: R is optimized for statistical analysis, making it ideal for research and academia.
  • Visualization: Excellent data visualization libraries like ggplot2.
  • Licensing: R itself is licensed under the GPL; many packages are also GPL, while some use MIT or BSD.

Ideal Use Cases:

  • Data analysis, visualization, and complex statistical modeling.
  • Research and academia where open-source, reproducible code is needed.
  • Licensing in Production: GPL may restrict proprietary use; check package licenses carefully.
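
A quick way to start checking package licenses is sketched below using base R's `installed.packages()`. The `mentions_gpl` flag is a hypothetical screening rule, not a legal assessment. Note that packages shipped with R report their license as "Part of R ...", so they are not flagged even though R itself is GPL.

```r
# License information for every installed package.
ip <- installed.packages()

pkgs <- data.frame(
  package = unname(ip[, "Package"]),
  version = unname(ip[, "Version"]),
  license = unname(ip[, "License"]),
  stringsAsFactors = FALSE,
  row.names = NULL
)

# Hypothetical screening rule: flag any license string that mentions "GPL"
# (this also matches LGPL and AGPL) for a closer manual review.
pkgs$mentions_gpl <- grepl("GPL", pkgs$license, fixed = TRUE)

head(pkgs[order(pkgs$package), ], 10)   # first few packages, alphabetically
table(pkgs$mentions_gpl)                # how many packages were flagged
```

The output depends on what is installed locally, so it is best treated as a starting list for review rather than a final answer.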

Python: Strengths and Use Cases 🐍

  • General-Purpose Language: Python is popular for both data science and software development.
  • Machine Learning & AI: Extensive libraries for ML and AI, such as scikit-learn, TensorFlow.
  • Licensing: PSFL (Python Software Foundation License), highly permissive for proprietary use.

Ideal Use Cases:

  • End-to-end development, from data wrangling to ML and web development.
  • Production-ready ML and AI applications.
  • Licensing in Production: Permissive licenses allow closed-source use, making Python production-friendly.

Julia: Strengths and Use Cases 🚀

  • High Performance: Designed for scientific computing, close to the speed of C/C++.
  • Ease of Use: Combines high-level syntax with low-level performance.
  • Licensing: MIT license, very permissive for commercial and open-source use.

Ideal Use Cases:

  • Large-scale simulations, optimization problems, high-performance computing.
  • Licensing in Production: MIT license allows proprietary use without restrictions.

Comparison Summary I 📊

Note

  • R: Open-source, powerful for data science, statistical analysis, and visualizations.
  • Excel: User-friendly, ideal for simple tasks, but limited for complex data wrangling.
  • SAS: Industry-standard for statistical analysis with regulatory requirements, but costly and less flexible than R.


| Feature | R | Python | Julia |
|---|---|---|---|
| Primary Strength | Statistics & Visualization | General-purpose, ML, AI | High-performance computing |
| Performance | Moderate | Moderate | High |
| Licensing | GPL (core), MIT, BSD (some) | PSFL, highly permissive | MIT, highly permissive |
| Production Use | Limited by GPL | Very friendly for proprietary | Very friendly for proprietary |


Comparison Summary II 📊

Choosing the right tool depends on:

  • your project’s requirements,
  • team structure & skills, and
  • licensing needs for research vs. production.

Tip

  • R: Best for statistical analysis and visualization, but the GPL license may restrict use in proprietary products.
  • Python: Strong in ML and AI with highly permissive licensing, making it ideal for production.
  • Julia: Excellent for high-performance computing, with permissive licensing suitable for proprietary use.

THANK YOU!

acorrend@uoguelph.ca

Adrian A. Correndo
Assistant Professor
Sustainable Cropping Systems
Department of Plant Agriculture
University of Guelph

Rm 226, Crop Science Bldg | Department of Plant Agriculture

Ontario Agricultural College | University of Guelph | 50 Stone Rd E, Guelph, ON-N1G 2W1, Canada.

