library(pacman)
p_load(dplyr, purrr, stringr, tibble) # data wrangling
p_load(readr, readxl, openxlsx) # to read filesReading tabular data in R
1 Introduction π
In real projects, a big part of βdata scienceβ is simply getting data into R reliably and reproducibly. CSV (comma separated values) and Excel are the most common formats youβll see in agriculture, industry, and research workflows.
This short note covers:
- π how to read CSV (readr).
- π how to read Excel (readxl, openxlsx).
- ποΈ how to read many files in a folder.
- π§Ύ how to read many sheets in an excel file.
- π§Ή common cleaning options like skipping rows, header rows, and missing values
2 Libraries π¦
2.1 Quick comparison of main packages
π CSV / delimited text
- Package: readr
- β
Strengths: fast, consistent parsing, clear messages, good type control (col_types)
- β οΈ Notes: great default for tidyverse-style workflows
π Excel (.xlsx)
- Package: readxl
- β
Strengths: simple reading of Excel, minimal dependencies
- β οΈ Notes: read-only (does not write Excel)
π Excel (.xlsx)
- Package: openxlsx
- β
Strengths: read + write, more control over excels/sheets
- β οΈ Notes: great when you must write .xlsx outputs
Rule of thumb: - π CSV: start with readr::read_csv() - π Excel reading: start with readxl::read_excel() - π Excel reading + writing or more control: use openxlsx
3 Reading CSV files π
3.1 Basic read
df <- readr::read_csv("data/my_file.csv")Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df# A tibble: 289 Γ 6
gen trial env yield year loc
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 G16 T2 E01 590 2001 Mieso
2 G17 T2 E01 554 2001 Mieso
3 G18 T2 E01 586 2001 Mieso
4 G19 T2 E01 738 2001 Mieso
5 G20 T2 E01 489 2001 Mieso
6 G21 T2 E01 684 2001 Mieso
7 G22 T2 E01 555 2001 Mieso
8 G23 T2 E01 1025 2001 Mieso
9 G24 T2 E01 493 2001 Mieso
10 G25 T2 E01 691 2001 Mieso
# βΉ 279 more rows
3.2 Common options
missing values, quieter output
df <- readr::read_csv(
file = "data/my_file.csv",
na = c("", "NA", "N/A", "na", ".")
)Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3.3 Control column types π―
Recommended for reproducibility.
df <- readr::read_csv(file = "data/my_file.csv") |>
mutate(loc = as.character(loc),
year = as.integer(year),
yield = as.numeric(yield))Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3.4 Skip lines at the top βοΈ
For example for metadata, notes, extra header rowsβ¦
df <- readr::read_csv(
file = "data/my_file.csv",
skip = 1
)Rows: 288 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): G16, T2, E01, Mieso
dbl (2): 590, 2001
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3.5 Skip blank lines π³οΈ
Often helpful with messy CSV exports
df <- readr::read_csv(
file = "data/my_file.csv",
skip_empty_rows = TRUE
)Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
4 Reading Excel files (.xlsx) π
4.1 Basic read (first sheet by default)
df <- readxl::read_excel(path="data/my_excel.xlsx")
head(df)# A tibble: 6 Γ 6
gen trial env yield year loc
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 G16 T2 E01 590 2001 Mieso
2 G17 T2 E01 554 2001 Mieso
3 G18 T2 E01 586 2001 Mieso
4 G19 T2 E01 738 2001 Mieso
5 G20 T2 E01 489 2001 Mieso
6 G21 T2 E01 684 2001 Mieso
4.2 Read a specific sheet
by name or position
df1 <- readxl::read_excel(path="data/my_excel.xlsx", sheet = "Sheet1")
df2 <- readxl::read_excel(path="data/my_excel.xlsx", sheet = 2)4.3 Skip rows βοΈ
For example, title row + empty row
df <- readxl::read_excel(
path = "data/my_excel.xlsx",
sheet = 1,
skip = 2
)4.4 Diagnose π
For example, βempty first rowβ or βheadersβ not where expectedβ
df_rows <- readxl::read_excel(
path = "data/my_excel.xlsx",
sheet = 1,
n_max = 6, # number of rows of data
col_names = FALSE # remove headers row
)New names:
β’ `` -> `...1`
β’ `` -> `...2`
β’ `` -> `...3`
β’ `` -> `...4`
β’ `` -> `...5`
β’ `` -> `...6`
df_rows# A tibble: 6 Γ 6
...1 ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 gen trial env yield year loc
2 G16 T2 E01 590 2001 Mieso
3 G17 T2 E01 554 2001 Mieso
4 G18 T2 E01 586 2001 Mieso
5 G19 T2 E01 738 2001 Mieso
6 G20 T2 E01 489 2001 Mieso
Then set skip based on what you see (commonly skip = 1 or skip = 2) and re-read.
4.5 Reading Excel with openxlsx π
Nice when you also write excel files later
df_sr1 <- openxlsx::read.xlsx(
xlsxFile = "data/my_excel.xlsx",
sheet = 1,
startRow = 1 # speficies where headers are located
)
head(df_sr1) gen trial env yield year loc
1 G16 T2 E01 590 2001 Mieso
2 G17 T2 E01 554 2001 Mieso
3 G18 T2 E01 586 2001 Mieso
4 G19 T2 E01 738 2001 Mieso
5 G20 T2 E01 489 2001 Mieso
6 G21 T2 E01 684 2001 Mieso
4.5.1 Skip rows by moving startRow βοΈ:
df_sr3 <- openxlsx::read.xlsx(
xlsxFile = "data/my_excel.xlsx",
sheet = 1,
startRow = 3
)
slice_head(df_sr3) G17 T2 E01 554 2001 Mieso
1 G18 T2 E01 586 2001 Mieso
5 Reading multiple CSV files ποΈ
This is extremely common when we have one file per site, year, trial, etc.
5.1 Read + stack π§±
This is call row-bind
folder <- "data"
files <- list.files(path = folder, pattern = "\\.csv$", full.names = TRUE)
all_data <- files |>
setNames(basename(files)) |> # here we name by file basename
# Here we read and stack using map_dfr
purrr::map_dfr(
\(f) readr::read_csv(f), # here we read each file
.id = "source_file" # this creates a new column with the file name
)Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(all_data)# A tibble: 6 Γ 7
source_file gen trial env yield year loc
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 my_file_02.csv G16 T2 E01 590 2001 Mieso
2 my_file_02.csv G17 T2 E01 554 2001 Mieso
3 my_file_02.csv G18 T2 E01 586 2001 Mieso
4 my_file_02.csv G19 T2 E01 738 2001 Mieso
5 my_file_02.csv G20 T2 E01 489 2001 Mieso
6 my_file_02.csv G21 T2 E01 684 2001 Mieso
5.2 Add variables from filename π·οΈ
For example year, or site
all_data <- files |>
purrr::map_dfr(
# here we read each file
\(f) readr::read_csv(f) |>
dplyr::mutate(source_file = basename(f)) ) |> # add filename as column
dplyr::mutate(year = stringr::str_extract(source_file, "\\d{4}")) Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# here we extract a 4-digit year from the filename
all_data %>%
group_by(source_file) %>%
slice_head()# A tibble: 2 Γ 7
# Groups: source_file [2]
gen trial env yield year loc source_file
<chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 G16 T2 E01 590 <NA> Mieso my_file.csv
2 G16 T2 E01 590 <NA> Mieso my_file_02.csv
5.3 Keep them as a named list π
csv_list <- files |>
setNames(basename(files)) |>
purrr::map(\(f) readr::read_csv(f))Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 289 Columns: 6
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (4): gen, trial, env, loc
dbl (2): yield, year
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(csv_list)[1] "my_file_02.csv" "my_file.csv"
6 Reading multiple Excel files ποΈπ
folder <- "data"
files <- list.files(path = folder, pattern = "\\.xlsx$", full.names = TRUE)
all_xlsx <- files |>
purrr::map_dfr(
\(f) readxl::read_excel(f, sheet = 1) |> # read first sheet from each
dplyr::mutate(source_file = basename(f)) # add filename as column
)
all_xlsx %>%
group_by(source_file) %>%
slice_head()# A tibble: 2 Γ 7
# Groups: source_file [2]
gen trial env yield year loc source_file
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 G16 T2 E01 590 2001 Mieso my_excel.xlsx
2 G16 T2 E01 590 2001 Mieso my_file_02.xlsx
7 Reading multiple sheets π§Ύπ§Ύπ§Ύ
7.1 With readxl
path <- "data/my_excel.xlsx"
sheets <- readxl::excel_sheets(path)
all_sheets <- sheets |>
setNames(sheets) |>
purrr::map_dfr(
\(sh) readxl::read_excel(path, sheet = sh) |>
dplyr::mutate(sheet = sh)
)
all_sheets %>%
group_by(sheet) %>%
slice_head()# A tibble: 2 Γ 7
# Groups: sheet [2]
gen trial env yield year loc sheet
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 G16 T2 E01 590 2001 Mieso Sheet1
2 G16 T2 E01 590 2001 Mieso Sheet2
7.2 With openxlsx π οΈ
path <- "data/my_excel.xlsx"
sheets <- openxlsx::getSheetNames(path)
all_sheets <- sheets |>
setNames(sheets) |>
purrr::map_dfr(
\(sh) openxlsx::read.xlsx(path, sheet = sh) |>
dplyr::mutate(sheet = sh)
)
all_sheets %>%
group_by(sheet) %>%
slice_head()# A tibble: 2 Γ 7
# Groups: sheet [2]
gen trial env yield year loc sheet
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 G16 T2 E01 590 2001 Mieso Sheet1
2 G16 T2 E01 590 2001 Mieso Sheet2
7.3 Footnotes in Excel π
If a sheet has footnotes at the bottom, you can often remove them after import by filtering rows:
df <- readxl::read_excel(path="data/my_excel.xlsx") |>
dplyr::filter(!is.na(yield)) # example: keep rows that have yield8 Summary
- Use readr::read_csv() for most CSV workflows (fast + consistent).
- Use readxl::read_excel() for quick Excel imports.
- Use openxlsx when you need more Excel control or want to write .xlsx outputs.
- For multiple files or sheets, combine list.files()/excel_sheets() with purrr::map_dfr() to create one clean dataset.