Harmonise taxonomy of occurrence data

Uses BeeBDC-formatted taxonomy data to harmonise occurrence records and flag those that do not match the taxonomy. harmoniseR() prefers to use the names_clean column that is generated by bdc::bdc_clean_names(). While this is not required, you may find better results by running that function on your dataset first. It is possible to download a taxonomy file for other taxa using taxadbToBeeBDC(), which can download taxonomies from ITIS, GBIF, and more. You could also match the format of the beesTaxonomy() file.

Usage

harmoniseR(
  data = NULL,
  path = NULL,
  taxonomy = BeeBDC::beesTaxonomy(),
  speciesColumn = "scientificName",
  rm_names_clean = TRUE,
  checkVerbatim = FALSE,
  stepSize = 1e+06,
  mc.cores = 1
)

Arguments

data: A data frame or tibble. Occurrence records as input.
path: A directory as character. The path to a folder where the output can be saved.
taxonomy: A data frame or tibble. The taxonomy file to use. Default = beesTaxonomy(); for other taxa, first see taxadbToBeeBDC().
speciesColumn: Character. The name of the column containing species names. Default = "scientificName".
rm_names_clean: Logical. If TRUE then the names_clean column will be removed at the end of this function to help reduce confusion about this column later. Default = TRUE.
checkVerbatim: Logical. If TRUE then the verbatimScientificName column will also be checked for species matches. This matching will ONLY be done after harmoniseR has failed for the other name columns. NOTE: this column is not first run through bdc::bdc_clean_names. Default = FALSE.
stepSize: Numeric. The number of occurrences to process in each chunk. Default = 1000000.
mc.cores: Numeric. If > 1, the function will run in parallel via mclapply with the number of cores specified. If = 1, it will run as a serial loop. NOTE: Windows machines must use a value of 1 (see ?parallel::mclapply). Additionally, be aware that each thread can use large chunks of memory. Default = 1.
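
The interaction between stepSize and mc.cores can be pictured with a small base-R sketch. This is only an illustration of chunked processing under stated assumptions, not the code that harmoniseR() runs internally: the occurrences table and the processChunk() worker are hypothetical, and lapply() stands in for the serial loop.

```r
# Toy stand-in for an occurrence table (hypothetical data).
occurrences <- data.frame(
  id = 1:10,
  scientificName = rep(c("Apis mellifera", "Bombus terrestris"), 5),
  stringsAsFactors = FALSE
)

# Hypothetical per-chunk worker; harmoniseR() does far more than this.
processChunk <- function(chunk) {
  chunk$nameLength <- nchar(chunk$scientificName)
  chunk
}

stepSize <- 4  # rows per chunk (the harmoniseR() default is 1e+06)
mc.cores <- 1  # must stay at 1 on Windows

# Assign each row to a chunk of at most stepSize rows.
chunkId <- ceiling(seq_len(nrow(occurrences)) / stepSize)
chunks <- split(occurrences, chunkId)

# Process the chunks in parallel or serially, depending on mc.cores.
if (mc.cores > 1) {
  results <- parallel::mclapply(chunks, processChunk, mc.cores = mc.cores)
} else {
  results <- lapply(chunks, processChunk)
}

# Stitch the processed chunks back into one table.
processed <- do.call(rbind, results)
rownames(processed) <- NULL
```

With parallel processing, several chunks can be held in memory at once, so a smaller stepSize may help when running with mc.cores > 1 on a memory-limited machine.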

Value

The occurrences are returned with updated taxonomy columns, including: scientificName, species, family, subfamily, genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship. A new column, .invalidName, is also added; it is FALSE when the occurrence's name did not match the supplied taxonomy and TRUE when it did.
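
Because the flag is FALSE for records that failed to match, it is easy to read backwards. The sketch below uses hypothetical toy tables (not BeeBDC data) and plain exact matching with %in%, which is a deliberate simplification of what harmoniseR() does, purely to show how the .invalidName convention can be used to split records.

```r
# Hypothetical taxonomy and occurrence tables (not BeeBDC data).
taxonomy <- data.frame(
  validName = c("Apis mellifera", "Bombus terrestris"),
  stringsAsFactors = FALSE
)
occurrences <- data.frame(
  scientificName = c("Apis mellifera", "Bombus terrestris", "Apis melifera"),
  stringsAsFactors = FALSE
)

# Mirror the documented convention: TRUE = matched, FALSE = no match.
# Exact matching is an assumption made for this illustration only.
occurrences$.invalidName <- occurrences$scientificName %in% taxonomy$validName

# Records whose names did not match the taxonomy (flag is FALSE).
unmatched <- occurrences[!occurrences$.invalidName, , drop = FALSE]

# Records whose names matched the taxonomy (flag is TRUE).
matched <- occurrences[occurrences$.invalidName, , drop = FALSE]
```

Here the misspelt "Apis melifera" is the only record that ends up in unmatched.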

Examples

# load in the test dataset
system.file("extdata", "testTaxonomy.rda", package="BeeBDC") |> load()

# See also
?BeeBDC::taxadbToBeeBDC()

beesRaw_out <- BeeBDC::harmoniseR(
  # The path to a folder where the output can be saved
  path = tempdir(),
  # The formatted taxonomy file
  taxonomy = testTaxonomy,
  # The occurrence records to harmonise
  data = BeeBDC::beesFlagged,
  # The column containing the species names
  speciesColumn = "scientificName")
#>  - Formatting taxonomy for matching...
#> 
#>  - Harmonise the occurrence data with unambiguous names...
#> 
#>  - Attempting to harmonise the occurrence data with ambiguous names...
#>  - Formatting merged datasets...
#> Removing the names_clean column...
#>  - We matched valid names to 96 of 100 occurrence records. This leaves a total of 4 unmatched occurrence records.
#> 
#> harmoniseR:
#> 4
#> records were flagged.
#> The column, '.invalidName' was added to the database.
#> 
#>  - We updated the following columns: scientificName, species, family, subfamily, genus, subgenus, specificEpithet, infraspecificEpithet, and scientificNameAuthorship. The previous scientificName column was converted to verbatimScientificName
#>  - Completed in 0.21 secs
table(beesRaw_out$.invalidName, useNA = "always")
#> 
#> FALSE  TRUE  <NA> 
#>     4    96     0