

0.0 Script preparation

0.1 Working directory

To start off with, have a think about where you want to work from. BeeBDC and bdc can create quite a few files, so setting this up well from the start is a good idea. If you are worried about running out of storage, this could also be on an external hard drive, but you can always change that later. Defining your RootPath at the top of your script, only once, should make your life easier.

Choose the path to the root folder in which all other folders can be found.

RootPath <- paste0("/your/path/here")
# Create the working directory in the RootPath if it doesn't exist already
if (!dir.exists(paste0(RootPath, "/Data_acquisition_workflow"))) {
    dir.create(paste0(RootPath, "/Data_acquisition_workflow"), recursive = TRUE)
}
# Set the working directory
setwd(paste0(RootPath, "/Data_acquisition_workflow"))

0.2 Install packages (if needed)

Is this your first time using the sf or terra packages?

The first time that you use terra or sf on a new computer you may need to install some dependencies. Try to install the terra and sf packages first but then come back here if that doesn’t work.
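Before trying the OS-specific notes below, the standard CRAN install is worth attempting first:

# Try the standard CRAN install first; return to the OS-specific notes below if it fails
install.packages(c("terra", "sf"))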

Windows:

On Windows, you need to first install Rtools to get a C++ compiler that R can use. You need a recent version of Rtools42 (rtools42-5355-5357).

MacOS:

On macOS, you can use MacPorts or Homebrew.

With MacPorts you can do

sudo port install R-terra

With Homebrew, you need to first install GDAL:

brew install pkg-config

brew install gdal

Followed by (note the additional configuration argument needed for Homebrew)

# Install terra
install.packages("terra", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")
# install sf
install.packages("sf", type = "source", configure.args = "--with-proj-lib=$(brew --prefix)/lib/")

library(terra)
library(sf)

If you have sf and terra installed, you can now install BeeBDC.
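For example, a standard CRAN install:

# Install and load BeeBDC
install.packages("BeeBDC")
library(BeeBDC)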

BeeBDC also has a few optional packages that are required only for a subset of its functions. You don't need to install these now if you don't want to; you can do so later.

You can optionally install BiocManager, devtools, ComplexHeatmap, rnaturalearthhires, and taxadb now, or do it later as you wish.

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = "http://cran.us.r-project.org")

BiocManager::install("ComplexHeatmap")
# Install remotes if needed
if (!require("remotes", quietly = TRUE)) install.packages("remotes", repos = "http://cran.us.r-project.org")
# Download and then load rnaturalearthhires
remotes::install_github("ropensci/rnaturalearthhires")
install.packages("rnaturalearthhires", repos = "https://ropensci.r-universe.dev",
    type = "source")
library(rnaturalearthhires)

Set up the directories used by BeeBDC. These directories include where the data, figures, reports, etc. will be saved. The RDoc needs to be a path RELATIVE to the RootPath; i.e., the part of the file path that continues on from the RootPath.

BeeBDC::dirMaker(RootPath = RootPath, RDoc = "vignettes/BeeBDC_main.Rmd") %>%
    # Add paths created by this function to the environment()
    list2env(envir = parent.env(environment()))
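As an optional check (not part of the original workflow), you can confirm that the path objects used in later sections, such as DataPath and OutPath_Intermediate, were created:

# Optional: print two of the paths that dirMaker() should have created
DataPath
OutPath_Intermediate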

0.3 Load packages

Let’s go ahead and load our packages before we start!

lapply(c("ComplexHeatmap", "magrittr"), library, character.only = TRUE)
## Loading required package: grid
## 
## Attaching package: 'grid'
## The following object is masked from 'package:terra':
## 
##     depth
## ========================================
## ComplexHeatmap version 2.24.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
## 
## Attaching package: 'ComplexHeatmap'
## The following object is masked from 'package:terra':
## 
##     draw
## The following object is masked from 'package:R.utils':
## 
##     draw

1.0 Data merge

Attention:
Although each line of code has been validated, the next section is display-only to save time when knitting the R Markdown document. If you are not merging data (section 1.0) or preparing the data (section 2.0), feel free to skip to Section 3.0 Initial flags.

1.1 Download ALA data

If you’re interested in using data from the Atlas of Living Australia (ALA) or one of the other Atlas repositories, you can use BeeBDC::atlasDownloader() below to access and download those files and their metadata easily enough. You may, however, need to ensure that you have an account that you can link to the download, especially for the sake of obtaining a DOI.

To make an account with ALA in order to download your data visit this link — https://auth.ala.org.au/userdetails/registration/createAccount

  BeeBDC::atlasDownloader(path = DataPath,
           userEmail = "your@email.edu.au",
           atlas = "ALA",
           ALA_taxon = "Apiformes")

1.2 Import and merge ALA, SCAN, iDigBio, and GBIF data

If you are planning on combining data from ALA, SCAN, iDigBio, and/or GBIF, BeeBDC::repoMerge() is a handy function that should help you do this and extract all of the metadata and citations that you need. Remember that the main workflow only needs a nice Darwin Core-formatted dataset with which to work!

Supply the path to where the data are; the save_type is either “csv_files” or “R_file”.

  DataImp <- BeeBDC::repoMerge(path = DataPath, 
                  occ_paths = BeeBDC::repoFinder(path = DataPath),
                  save_type = "R_file")

If there is an error in finding a file, run repoFinder() by itself to troubleshoot. For example:

            #BeeBDC::repoFinder(path = DataPath)
            #OUTPUT:
            #$ALA_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/ALA_galah_path/galah_download_2022-09-15/data.csv"
  
            #$GBIF_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0000165-220831081235567/occurrence.txt"
            #[2] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436695-210914110416597/occurrence.txt"
            #[3] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436697-210914110416597/occurrence.txt"
            #[4] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436704-210914110416597/occurrence.txt"
            #[5] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436732-210914110416597/occurrence.txt"
            #[6] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436733-210914110416597/occurrence.txt"
            #[7] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/GBIF_webDL_30Aug2022/0436734-210914110416597/occurrence.txt"
                    
            #$iDigBio_data
            #[1] "F:/BeeDataCleaning2022/BeeDataCleaning/BeeDataCleaning/BeeData/iDigBio_webDL_30Aug2022/5aa5abe1-62e0-4d8c-bebf-4ac13bd9e56f/occurrence_raw.csv"
  
            #$SCAN_data
            #character(0)
            #Failing because SCAN_data seems to be missing. Downloaded separately from OneDrive

Load in the most-recent version of these data if needed. This will return a list with:

  1. The occurrence dataset with attributes (.$Data_WebDL)

  2. The appended eml file (.$eml_files)

    DataImp <- BeeBDC::importOccurrences(path = DataPath,
                           fileName = "BeeData_")

1.3 Import USGS Data

The USGS Bee Lab makes a large and excellent bee dataset publicly available. You can download, format, and integrate their data using the BeeBDC::USGS_formatter() function, as we did for our 2023 paper.

The BeeBDC::USGS_formatter() will find, import, format, and create metadata for the USGS dataset. The pubDate must be in day-month-year format.

  USGS_data <- BeeBDC::USGS_formatter(path = DataPath, pubDate = "19-11-2022")

1.4 Formatted Source Importer

Use this importer to find files that have been formatted and need to be added to the larger data file.

The attributes file must contain “attribute” in its name, and the occurrence file must not.

  Complete_data <- BeeBDC::formattedCombiner(path = DataPath, 
                                strings = c("USGS_[a-zA-Z_]+[0-9]{4}-[0-9]{2}-[0-9]{2}"), 
                                  # This should be the list-format with eml attached
                                existingOccurrences = DataImp$Data_WebDL,
                                existingEMLs = DataImp$eml_files) 

In the catalogNumber column, remove the text matching “.*specimennumber:”; what comes after it should be the USGS number used to match duplicates.

  Complete_data$Data_WebDL <- Complete_data$Data_WebDL %>%
    dplyr::mutate(catalogNumber = stringr::str_replace(catalogNumber,
                                                       pattern = ".*\\| specimennumber:",
                                                       replacement = ""))
  

1.5 Save data

Choose the format in which to save the data from section 1.x.

  BeeBDC::dataSaver(path = DataPath,# The main path to look for data in
       save_type = "CSV_file", # "R_file" OR "CSV_file"
       occurrences = Complete_data$Data_WebDL, # The existing datasheet
       eml_files = Complete_data$eml_files, # The existing EML files
       file_prefix = "Fin_") # The prefix for the fileNames
rm(Complete_data, DataImp)

2.0 Data preparation

The data preparation section of the script relates mostly to integrating and correcting bee occurrence datasets, so many users working on other taxa may skip it.

2.1 Standardise datasets

You may either use:

    1. the bdc import method (works well with general datasets), or
    2. the jbd import method (works well with the above data merge).

a. bdc import

The bdc import is NOT fully supported here and is provided only as an example; please use section 2.1b below instead. The code reads in the bdc metadata and standardises the dataset to the bdc format.

        bdc_metadata <- readr::read_csv(paste(DataPath, "out_file", "bdc_integration.csv", sep = "/"))
        # ?issue — datasetName is a darwinCore field already!
        # Standardise the dataset to bdc
        db_standardized <- bdc::bdc_standardize_datasets(
          metadata = bdc_metadata,
          format = "csv",
          overwrite = TRUE,
          save_database = TRUE)
        # read in configuration description file of the column header info
        config_description <- readr::read_csv(paste(DataPath, "Output", "bdc_configDesc.csv",
                                                    sep = "/"), 
                                              show_col_types = FALSE, trim_ws = TRUE)
  

b. jbd import

Find the path, read in the file, and add the database_id column.

  occPath <- BeeBDC::fileFinder(path = DataPath, fileName = "Fin_BeeData_combined_")


  db_standardized <- readr::read_csv(occPath, 
                                       # Use the basic ColTypeR function to determine types
                                     col_types = BeeBDC::ColTypeR(), trim_ws = TRUE) %>%
    dplyr::mutate(database_id = paste("Dorey_data_", 1:nrow(.), sep = ""),
                  .before = family)

2.2 Paige dataset

Paige Chesshire’s cleaned American dataset — https://doi.org/10.1111/ecog.06584

Import data

If you haven’t figured it out by now, don’t worry about the column name warning — not all columns occur here.

  PaigeNAm <- readr::read_csv(paste(DataPath, "Paige_data", "NorAmer_highQual_only_ALLfamilies.csv",
                                    sep = "/"), col_types = BeeBDC::ColTypeR()) %>%
     # Change the column name from Source to dataSource to match the rest of the data.
    dplyr::rename(dataSource = Source) %>%
     # EXTRACT WAS HERE
      # add a NEW database_id column
    dplyr::mutate(
      database_id = paste0("Paige_data_", 1:nrow(.)),
      .before = scientificName)

Attention:
It is recommended to run the below code on the full bee dataset with more than 16GB RAM. Robert ran this on a laptop with 16GB RAM and an Intel(R) Core(TM) i7-8550U processor (4 cores and 8 threads) — it struggled.

Merge Paige’s data with downloaded data

  db_standardized <- BeeBDC::PaigeIntegrater(
      db_standardized = db_standardized,
      PaigeNAm = PaigeNAm,
        # This is a list of columns by which to match Paige's data to the most-recent download with. 
        # Each vector will be matched individually
      columnStrings = list(
        c("decimalLatitude", "decimalLongitude", 
          "recordNumber", "recordedBy", "individualCount", "samplingProtocol",
          "associatedTaxa", "sex", "catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"),         # Iteration 1
        c("catalogNumber", "institutionCode", "otherCatalogNumbers",
          "recordId", "occurrenceID", "collectionID"), # Iteration 2
        c("decimalLatitude", "decimalLongitude", 
          "recordedBy", "genus", "specificEpithet"),# Iteration 3
        c("id", "decimalLatitude", "decimalLongitude"),# Iteration 4
        c("recordedBy", "genus", "specificEpithet", "locality"), # Iteration 5
        c("recordedBy", "institutionCode", "genus", 
          "specificEpithet","locality"),# Iteration 6
        c("occurrenceID","decimalLatitude", "decimalLongitude"),# Iteration 7
        c("catalogNumber","decimalLatitude", "decimalLongitude"),# Iteration 8
        c("catalogNumber", "locality") # Iteration 9
      ) )

Remove spent data.

  rm(PaigeNAm)

2.3 USGS

The USGS dataset also partially occurs on GBIF from BISON. However, the occurrence codes are in a silly place… We will correct these here to help identify duplicates later.

    db_standardized <- db_standardized %>%
          # Remove the discoverlife html if it is from USGS
      dplyr::mutate(occurrenceID = dplyr::if_else(
        stringr::str_detect(occurrenceID, "USGS_DRO"),
        stringr::str_remove(occurrenceID, "http://www\\.discoverlife\\.org/mp/20l\\?id="),
        occurrenceID)) %>%
          # Use otherCatalogNumbers when occurrenceID is empty AND when USGS_DRO is detected there
      dplyr::mutate(
        occurrenceID = dplyr::if_else(
          stringr::str_detect(otherCatalogNumbers, "USGS_DRO") & is.na(occurrenceID),
          otherCatalogNumbers, occurrenceID)) %>%
           # Make sure that no eventIDs have snuck into the occurrenceID columns 
           # For USGS_DRO, codes with <6 digits are event ids
      dplyr::mutate(
        occurrenceID = dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO", negate = TRUE),
             # Keep occurrenceID if it's NOT USGS_DRO
           occurrenceID, 
             # If it IS USGS_DRO and it has >= 6 digits, keep it; else, NA
          dplyr::if_else(stringr::str_detect(occurrenceID, "USGS_DRO[0-9]{6,10}"),
                         occurrenceID, NA_character_)),
        catalogNumber = dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO", negate = TRUE),
             # Keep catalogNumber if it's NOT USGS_DRO
          catalogNumber, 
             # If it IS USGS_DRO and it has >= 6 digits, keep it; else, NA
          dplyr::if_else(stringr::str_detect(catalogNumber, "USGS_DRO[0-9]{6,10}"),
                         catalogNumber, NA_character_)))
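As an optional check (not part of the original workflow), you could count how many USGS_DRO codes remain in each column after this cleanup:

# Optional: count remaining USGS_DRO codes in occurrenceID and catalogNumber
sum(stringr::str_detect(db_standardized$occurrenceID, "USGS_DRO"), na.rm = TRUE)
sum(stringr::str_detect(db_standardized$catalogNumber, "USGS_DRO"), na.rm = TRUE)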

2.4 Additional datasets

Import additional and potentially private datasets.

Note: functions for private datasets are provided, but the data themselves are not integrated here until those datasets become freely available.

There will be some warnings where a few rows may not be formatted correctly or where dates fail to parse. This is normal.

a. EPEL

Guzman, L. M., Kelly, T. & Elle, E. A data set for pollinator diversity and their interactions with plants in the Pacific Northwest. Ecology, e3927 (2022). https://doi.org/10.1002/ecy.3927

EPEL_Data <- BeeBDC::readr_BeeBDC(dataset = "EPEL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/bee_data_canada.csv",
                      outFile = "jbd_EPEL_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                      
b. Allan Smith-Pardo

Data from Allan Smith-Pardo

ASP_Data <- BeeBDC::readr_BeeBDC(dataset = "ASP",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Allan_Smith-Pardo_Dorey_ready2.csv",
                      outFile = "jbd_ASP_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                      
c. Minckley

Data from Robert Minckley

BMin_Data <- BeeBDC::readr_BeeBDC(dataset = "BMin",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Bob_Minckley_6_1_22_ScanRecent-mod_Dorey.csv",
                        outFile = "jbd_BMin_data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                        
d. BMont

Delphia, C. M. Bumble bees of Montana. https://www.mtent.org/projects/Bumble_Bees/bombus_species.html. (2022)

BMont_Data <- BeeBDC::readr_BeeBDC(dataset = "BMont",
                                 path = paste0(DataPath, "/Additional_Datasets"),
                          inFile = "/InputDatasets/Bombus_Montana_dorey.csv",
                          outFile = "jbd_BMont_data.csv",
                          dataLicense = "https://creativecommons.org/licenses/by-sa/4.0/")
                          
e. Ecd

Ecdysis. Ecdysis: a portal for live-data arthropod collections, https://ecdysis.org/index.php (2022).

Ecd_Data <- BeeBDC::readr_BeeBDC(dataset = "Ecd",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Ecdysis_occs.csv",
                      outFile = "jbd_Ecd_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                      
f. Gai

Gaiarsa, M. P., Kremen, C. & Ponisio, L. C. Pollinator interaction flexibility across scales affects patch colonization and occupancy. Nature Ecology & Evolution 5, 787-793 (2021). https://doi.org/10.1038/s41559-021-01434-y

Gai_Data <- BeeBDC::readr_BeeBDC(dataset = "Gai",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/upload_to_scan_Gaiarsa et al_Dorey.csv",
                      outFile = "jbd_Gai_data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                      
g. CAES

From the Connecticut Agricultural Experiment Station.

Zarrillo, T. A., Stoner, K. A. & Ascher, J. S. Biodiversity of bees (Hymenoptera: Apoidea: Anthophila) in Connecticut (USA). Zootaxa (Accepted).

Ecdysis. Occurrence dataset (ID: 16fca9c2-f622-4cb1-aef0-3635a7be5aeb). https://ecdysis.org/content/dwca/CAES-CAES_DwC-A.zip. (2023)

CAES_Data <- BeeBDC::readr_BeeBDC(dataset = "CAES",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/CT_BEE_DATA_FROM_PBI.xlsx",
                        outFile = "jbd_CT_Data.csv",
                        sheet = "Sheet1",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
h. GeoL
GeoL_Data <- BeeBDC::readr_BeeBDC(dataset = "GeoL",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Geolocate and BELS_certain and accurate.xlsx",
                        outFile = "jbd_GeoL_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
i. EaCO
EaCO_Data <- BeeBDC::readr_BeeBDC(dataset = "EaCO",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "/InputDatasets/Eastern Colorado bee 2017 sampling.xlsx",
                        outFile = "jbd_EaCo_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
j. FSCA

Florida State Collection of Arthropods

FSCA_Data <- BeeBDC::readr_BeeBDC(dataset = "FSCA",
                                path = paste0(DataPath, "/Additional_Datasets"),
                        inFile = "InputDatasets/fsca_9_15_22_occurrences.csv",
                        outFile = "jbd_FSCA_Data.csv",
                        dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
k. Texas SMC

Published or unpublished data from the Texas literature that are not in an online database, usually copied into a spreadsheet from document format or from a very differently formatted spreadsheet. Unpublished or partially published data were obtained with the express permission of the lead author.

SMC_Data <- BeeBDC::readr_BeeBDC(dataset = "SMC",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/TXbeeLitOccs_31Oct22.csv", 
                      outFile = "jbd_SMC_Data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
                      
l. Texas Bal

Data with GPS coordinates (accidentally missing from the records on Dryad) from Ballare, K. M., Neff, J. L., Ruppel, R. & Jha, S. Multi-scalar drivers of biodiversity: local management mediates wild bee community response to regional urbanization. Ecological Applications 29, e01869 (2019), https://doi.org/10.1002/eap.1869. Kim is okay with these data being made public as long as her paper is referenced. - Elinor Lichtenberg

Bal_Data <- BeeBDC::readr_BeeBDC(dataset = "Bal",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Beedata_ballare.xlsx", 
                      outFile = "jbd_Bal_Data.csv",
                      sheet = "animal_data",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
m. Palouse Lic

Elinor Lichtenberg’s canola data: Lichtenberg, E. M., Milosavljević, I., Campbell, A. J. & Crowder, D. W. Differential effects of soil conservation practices on arthropods and crop yields. Journal of Applied Entomology, (2023) https://doi.org/10.1111/jen.13188. These are the data I will be putting on SCAN. - Elinor Lichtenberg

Lic_Data <- BeeBDC::readr_BeeBDC(dataset = "Lic",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Lichtenberg_canola_records.csv", 
                      outFile = "jbd_Lic_Data.csv",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
n. Arm

Data from Armando Falcon-Brindis from the University of Kentucky.

Arm_Data <- BeeBDC::readr_BeeBDC(dataset = "Arm",
                               path = paste0(DataPath, "/Additional_Datasets"),
                      inFile = "/InputDatasets/Bee database Armando_Final.xlsx",
                      outFile = "jbd_Arm_Data.csv",
                      sheet = "Sheet1",
                      dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
o. Dor

From several papers:

  1. Dorey, J. B., Fagan-Jeffries, E. P., Stevens, M. I., & Schwarz, M. P. (2020). Morphometric comparisons and novel observations of diurnal and low-light-foraging bees. Journal of Hymenoptera Research, 79, 117–144. doi:https://doi.org/10.3897/jhr.79.57308
  2. Dorey, J. B. (2021). Missing for almost 100 years: the rare and potentially threatened bee Pharohylaeus lactiferus (Hymenoptera, Colletidae). Journal of Hymenoptera Research, 81, 165-180. doi: https://doi.org/10.3897/jhr.81.59365
  3. Dorey, J. B., Schwarz, M. P., & Stevens, M. I. (2019). Review of the bee genus Homalictus Cockerell (Hymenoptera: Halictidae) from Fiji with description of nine new species. Zootaxa, 4674(1), 1–46. doi:https://doi.org/10.11646/zootaxa.4674.1.1
  Dor_Data <- BeeBDC::readr_BeeBDC(dataset = "Dor",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/DoreyData.csv",
                    outFile = "jbd_Dor_Data.csv",
                    dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/")
p. VicWam

These data are originally from the Victorian Museum and Western Australian Museum in Australia. However, in their current form they are from Dorey et al. 2021.

  1. PADIL. (2020). PaDIL. https://www.PADIL.gov.au/
  2. Houston, T. F. (2000). Native bees on wildflowers in Western Australia. Western Australian Insect Study Society.
  3. Dorey, J. B., Rebola, C. M., Davies, O. K., Prendergast, K. S., Parslow, B. A., Hogendoorn, K., . . . Caddy-Retalic, S. (2021). Continental risk assessment for understudied taxa post catastrophic wildfire indicates severe impacts on the Australian bee fauna. Global Change Biology, 27(24), 6551-6567. doi:https://doi.org/10.1111/gcb.15879
 VicWam_Data <- BeeBDC::readr_BeeBDC(dataset = "VicWam",
                    path = paste0(DataPath, "/Additional_Datasets"),
                    inFile = "/InputDatasets/Combined_Vic_WAM_databases.xlsx",
                    outFile = "jbd_VicWam_Data.csv",
                    dataLicense = "https://creativecommons.org/licenses/by-nc-sa/4.0/",
                    sheet = "Combined")

2.5 Merge all

Remove these spent datasets.

  rm(EPEL_Data, ASP_Data, BMin_Data, BMont_Data, Ecd_Data, Gai_Data, CAES_Data, 
  GeoL_Data, EaCO_Data, FSCA_Data, SMC_Data, Bal_Data, Lic_Data, Arm_Data, Dor_Data,
  VicWam_Data)

Read in and merge all. More datasets are supported by readr_BeeBDC() than are implemented here; these represent datasets that will be made publicly available in the future. See ‘?readr_BeeBDC()’ for details.

db_standardized <- db_standardized %>%
  dplyr::bind_rows(
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_ASP_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EPEL_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMin_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_BMont_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Ecd_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Gai_data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_CT_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_GeoL_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_EaCo_Data.csv"), col_types = BeeBDC::ColTypeR()), 
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_SMC_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Bal_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Lic_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Arm_Data.csv"), col_types = BeeBDC::ColTypeR()),
    readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                           "/jbd_Dor_Data.csv"), col_types = BeeBDC::ColTypeR()),
readr::read_csv(paste0(DataPath, "/Additional_Datasets", 
                       "/jbd_VicWam_Data.csv"), col_types = BeeBDC::ColTypeR())) %>% 
    # END bind_rows
  suppressWarnings(classes = "warning") # End suppressWarnings — due to col_types
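As an optional sanity check (not part of the original workflow), you can tally the number of records contributed by each data source after the merge:

# Optional: count records per dataSource
db_standardized %>%
  dplyr::count(dataSource, sort = TRUE)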

2.6 Match database_id

If you have prior runs whose database_ids you would like to match to the current run, you may use the below script to attempt that matching.

Read in a prior run of choice.

  priorRun <- BeeBDC::fileFinder(path = DataPath,
                          file = "01_prefilter_database_9Aug22.csv") %>%
    readr::read_csv(file = ., col_types = BeeBDC::ColTypeR())

This function will attempt to find the database_ids from prior runs.

  db_standardized <- BeeBDC::idMatchR(
  currentData = db_standardized,
  priorData = priorRun,
    # First matches will be given preference over later ones
  matchBy = tibble::lst(c("gbifID", "dataSource"),
                        c("catalogNumber", "institutionCode", "dataSource", "decimalLatitude",
                          "decimalLongitude"),
                        c("occurrenceID", "dataSource","decimalLatitude","decimalLongitude"),
                        c("recordId", "dataSource","decimalLatitude","decimalLongitude"),
                        c("id", "dataSource","decimalLatitude","decimalLongitude"),
                        # Because INHS was entered as its own dataset but is now included in the GBIF download...
                        c("catalogNumber", "institutionCode", "dataSource",
                          "decimalLatitude","decimalLongitude")),
    # You can exclude datasets from the prior run by matching their prefixes (the text before the first underscore):
  excludeDataset = c("ASP", "BMin", "BMont", "CAES", "EaCO", "Ecd", "EcoS",
                     "Gai", "KP", "EPEL", "CAES", "EaCO", "FSCA", "SMC", "Lic", "Arm",
                     "VicWam"))

 # Remove redundant files
rm(priorRun)

Save the dataset.

  db_standardized %>%
    readr::write_excel_csv(.,
                     paste(OutPath_Intermediate, "00_prefilter_database.csv",
                           sep = "/"))