--- title: "Standardise an Occurrence dataset" author: "Dax Kellie & Martin Westgate" date: '2025-06-06' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Standardise an Occurrence dataset} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In Living Atlases like the Atlas of Living Australia (ALA), the default type of data is *occurrence* data, where a record refers to the presence/absence of an organism or taxon in a particular place at a specific time. This is a relatively simple data structure, where it is assumed that each observation or record is independent of each other. This simplicity also allows occurrence-based data to be easily aggregated. Here, we'll go through the steps to standardise and build an occurrence dataset using galaxias. # The dataset The data we'll use are of bird observations from 4 different sites. As these are *occurrence* data, this dataset contains evidence of the presence of certain bird species (`species`) at particular locations (`lat`, `lon`) at specific times (`date`). It also contains additional information about the landscape type, and sex and age class of birds. ```{r} #| warning: false #| message: false library(galaxias) library(dplyr) library(readr) obs <- read_csv("dummy-dataset-sb.csv", show_col_types = FALSE) |> janitor::clean_names() obs |> gt::gt() |> gt::opt_interactive(page_size_default = 5) ``` # Standardise to Darwin Core We can use `suggest_workflow()` to determine what we need to do to standardise this dataset. ```{r} obs |> suggest_workflow() ``` Calling `suggest_workflow()` tells us that one column in the dataset matches Darwin Core terms (`sex`), and we are missing all the minimum required Darwin Core terms. We're also given a suggested workflow consisting of a series of piped `set_` functions for renaming, modifying, or adding missing columns (`set_` functions are specialised wrappers around `dplyr::mutate()`, with additional functionality to support using Darwin Core Standard). Let's start by renaming existing columns to align with Darwin Core terms. `set_` functions will automatically check to make sure each column is correctly formatted. ```{r} obs_dwc <- obs |> set_scientific_name(scientificName = species) |> set_coordinates(decimalLatitude = lat, decimalLongitude = lon) |> set_datetime(eventDate = lubridate::ymd(date)) # specify year-month-day format ``` One thing that is still missing are the required taxonomic terms `kingdom` and `family` (noting that you could add other taxonomic terms as well if you wish). These aren't present in our dataset, so we'll have to add them. This is a fairly trivial exercise for most biologists, and we'll add them in text here; but it would be possible to look this up with `galah::search_taxa()` as well. ```{r} obs_dwc <- obs_dwc |> set_taxonomy(kingdom = "Animalia", phylum = "Chordata", class = "Aves", family = case_when(stringr::str_detect(scientificName, "^Acanthiza") ~ "Acanthizidae", stringr::str_detect(scientificName, "^Artamus") ~ "Artamidae", stringr::str_detect(scientificName, "^Climacteris") ~ "Climacteridae", stringr::str_detect(scientificName, "^Malurus") ~ "Maluridae", stringr::str_detect(scientificName, "^Ptilotula|^Melithreptus") ~ "Meliphagidae", stringr::str_detect(scientificName, "^Pardalotus") ~ "Pardalotidae")) ``` Calling `suggest_workflow()` again accounts for our progress and shows us what still needs to be done. 
Calling `suggest_workflow()` again accounts for our progress and shows us what still needs to be done. Here, we can see that we're still missing several of the minimum required terms.

```{r}
obs_dwc |>
  suggest_workflow()
```

Here's a rundown of the columns we need to add:

* `occurrenceID`: A unique identifier for each record, which ensures that we can identify specific records for future updates or corrections. We can use `composite_id()`, `sequential_id()`, or `random_id()` to add a unique ID to each row.
* `basisOfRecord`: The type of record (e.g. human observation, specimen from a museum collection, machine observation). See a list of acceptable values with `corella::basisOfRecord_values()`.
* `geodeticDatum`: The geographic coordinate reference system (CRS), which is a framework for representing spatial data (for example, the CRS of Google Maps is "WGS84").
* `coordinateUncertaintyInMeters`: The radius of uncertainty (in metres) around your observation's coordinates, which you may be able to infer from your data collection method.

As suggested, let's add these columns using `set_occurrences()` and `set_coordinates()`. We can also add the suggested function `set_individual_traits()`, which will automatically identify the matching column name `sex` and check the column's format.

```{r}
obs_dwc <- obs_dwc |>
  set_occurrences(
    occurrenceID = composite_id(sequential_id(), site, landscape),
    basisOfRecord = "humanObservation"
  ) |>
  set_coordinates(
    geodeticDatum = "WGS84",
    coordinateUncertaintyInMeters = 30
    # coordinateUncertaintyInMeters = with_uncertainty(method = "phone")
  ) |>
  set_individual_traits()
```

Running `suggest_workflow()` once more confirms that our dataset is ready to be used in a Darwin Core Archive!

```{r}
obs_dwc |>
  suggest_workflow()
```

To submit our dataset, we'll select only the columns that match Darwin Core terms...

```{r}
obs_dwc <- obs_dwc |>
  select(any_of(occurrence_terms())) # select any matching terms

obs_dwc |>
  gt::gt() |>
  gt::opt_interactive(page_size_default = 5)
```

...and save this as a file named `occurrences.csv` in a folder named `data-publish`. It's important to follow this naming convention because galaxias automatically looks for particular files and directories in later steps.

```{r}
#| eval: false
# Save in ./data-publish
use_data_occurrences(obs_dwc)
```

All done! See the [Quick start guide](quick_start_guide.html) for instructions on building a Darwin Core Archive.
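As a preview of that final step: once a metadata statement has also been added to `data-publish`, building the archive should be a single call. The sketch below is unevaluated and assumes your version of galaxias provides `build_archive()`; see the Quick start guide for the authoritative workflow.

```{r}
#| eval: false
# Assumed final step: bundle the contents of ./data-publish
# (occurrences.csv plus metadata) into a Darwin Core Archive zip file.
# `build_archive()` is an assumption; check your galaxias version.
build_archive()
```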