---
title: "Automatically Cleaning Laboratory Results in R using the 'lab2clean' package"
author: "Ahmed Zayed, Ilias Sarikakis, Arne Janssens, Pavlos Mamouris"
output:
  rmarkdown::html_document:
    theme: default
    toc: true
    toc_depth: 2
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Automatically Cleaning Laboratory Results in R using the 'lab2clean' package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---
```{r setup, include=FALSE}
library(knitr)
library(fansi) # safer ANSI -> HTML
knit_hooks$set(output = function(x, options) {
  # Only touch true console text in HTML docs
  if (!knitr::is_html_output()) return(x)
  # Skip anything that is already HTML or is emitted as 'asis'
  if (isTRUE(options$results == "asis") || grepl("^\\s*<", x)) return(x)
  # Assumed step: convert ANSI escape sequences to HTML with fansi
  x <- fansi::to_html(x)
  x <- gsub("\n", "<br>\n", x) # line breaks for console output only
  x
})
# keep your table setup
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE)
library(kableExtra)
knit_print.data.frame <- function(x, ...) {
knitr::asis_output(
kableExtra::kbl(x, ...) |>
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive")
)
)
}
library(printr)
```
## 1. Introduction
Repurposing clinical laboratory data from everyday clinical use for secondary research is a significant challenge. Preprocessing and cleaning these data takes substantial time and expertise, and all-in-one tools tailored to this need are lacking, so we developed our algorithm `lab2clean` as an open-source R package. The `lab2clean` package automates and standardizes the intricate process of cleaning clinical laboratory results. By focusing on the data quality of laboratory result values and units, we aim to give researchers a straightforward, plug-and-play tool that helps them unlock the potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development.
The `lab2clean` package contains four key functions. Two functions that clean and validate result values (Version 1.0) are described in detail in Zayed et al. (2024) [https://doi.org/10.1186/s12911-024-02652-7]: the `clean_lab_result()` function cleans and standardizes the laboratory results, and the `validate_lab_result()` function performs validation to ensure the plausibility of these results. The other two functions, which standardize and harmonize result units (added in Version 2.0), are described in detail in Zayed et al. (2025) [https://doi.org/10.1016/j.ijmedinf.2025.106131]: the `standardize_lab_unit()` function cleans and standardizes the formats of laboratory units of measurement according to the Unified Code for Units of Measure (UCUM), and the `harmonize_lab_unit()` function harmonizes the units found in a laboratory data set to reference units following either SI or Conventional units, converting the numeric result values accordingly.
This vignette aims to explain the theoretical background, usage, and customization of these functions.
## 2. Setup
### Installing and loading the `lab2clean` package
You can install and load the `lab2clean` package directly in R.
```{r install package}
#install.packages("lab2clean")
```
After installation, load the package:
```{r read library,}
library(lab2clean)
```
## 3. Function 1: Clean and Standardize results
The `clean_lab_result()` function has five arguments:
* `lab_data` : A dataset containing laboratory data.
* `raw_result` : The column in `lab_data` that contains raw result values to be cleaned.
* `locale` : A string representing the locale for the laboratory data. Defaults to "NO".
* `report` : Whether a report is written in the console. Defaults to `TRUE`.
* `n_records` : If you are loading a grouped list of distinct results, you can assign `n_records` to the column that contains the frequency of each distinct result. Defaults to `NA`.
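If your laboratory data are stored with one row per record, a grouped list of distinct results can be prepared before calling `clean_lab_result()`. The following is a minimal base R sketch of one way to do this. The `raw_records` data frame and its values are made up for illustration, and the column names `raw_result` and `frequency` are chosen only to match the demonstration dataset.

```r
# Hypothetical record-level data: one row per reported result
raw_records <- data.frame(
  raw_result = c("5.2", "5.2", "Positive", "< 0.5", "Positive", "5.2"),
  stringsAsFactors = FALSE
)

# Count how often each distinct raw result occurs
counts <- table(raw_records$raw_result)

# Grouped list: one row per distinct result plus its frequency
grouped_results <- data.frame(
  raw_result = names(counts),
  frequency = as.integer(counts),
  stringsAsFactors = FALSE
)

grouped_results
```

A table of this shape could then be passed on with `raw_result = "raw_result"` and `n_records = "frequency"`.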
Let us demonstrate the `clean_lab_result()` function using `Function_1_dummy` and inspect the first six rows:
```{r Function_1_dummy, }
data("Function_1_dummy", package = "lab2clean")
head(Function_1_dummy,6)
```
This dataset, for demonstration purposes, contains two columns: `raw_result` and `frequency`. The `raw_result` column holds raw laboratory results, and `frequency` indicates how often each result appeared. Let us explore the `report` and `n_records` arguments:
```{r function with report, results='markup'}
cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = TRUE, n_records = "frequency")
```
The `report` argument prints a detailed report of how the cleaning process was carried out, together with some descriptive insights. The `n_records` argument adds percentages to each of the reported steps. For simplicity, we will use `report = FALSE` in the rest of this tutorial:
```{r function with report 1, }
cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = FALSE)
cleaned_results
```
This function creates three new columns:
1- `clean_result`: The cleaned version of the `raw_result` column. For example, "?" is converted to `NA`, "3.14159 * 10^30" to "3.142", and "+++" to "3+".
2- `scale_type`: Categorizes the cleaned results into specific types such as Quantitative (Qn), Ordinal (Ord), or Nominal (Nom), with further subcategories for nuanced differences. Within the Quantitative scale, for example, simple numeric results (Qn.1) are distinguished from inequalities (Qn.2), range results (Qn.3), and titer results (Qn.4).
3- `cleaning_comments`: Provides insights on how the results were cleaned.
The description above is generic. It is useful to look in more detail at how some specific raw results are cleaned:
* `Locale` variable:
The `clean_lab_result()` function has an argument named `locale`. It addresses the variations in number formats, with different decimal and thousand separators, that arise from the locale-specific settings used internationally. We chose to standardize these locale-specific formats so that the cleaned results follow the English (US) convention. If the user does not specify the locale of the dataset, the default is `NO`, which means not specified. For example, rows 71 and 72 have a `locale_check` in the `cleaning_comments`, and their results are 1.015 and 1,060 respectively. This means that either the "US" or the "DE" locale should be specified to interpret these values unambiguously. Specifying the locale as `US` or `DE` gives different values, as follows:
```{r locale, warning=FALSE, message=FALSE}
Function_1_dummy_subset <- Function_1_dummy[c(71,72),, drop = FALSE]
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "US")
cleaned_results
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "DE")
cleaned_results
```
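To see why these two values are ambiguous, the following base R sketch reads the same strings under the US and the German (DE) separator conventions. The helper `parse_with_locale()` is hypothetical and written only for this illustration; it is not the routine used inside `clean_lab_result()`.

```r
# Illustrative helper (not part of lab2clean): read a number string under a
# given convention for decimal and thousand separators.
parse_with_locale <- function(x, locale = c("US", "DE")) {
  locale <- match.arg(locale)
  if (locale == "US") {
    # US: "," groups thousands, "." marks decimals
    x <- gsub(",", "", x, fixed = TRUE)
  } else {
    # DE: "." groups thousands, "," marks decimals
    x <- gsub(".", "", x, fixed = TRUE)
    x <- gsub(",", ".", x, fixed = TRUE)
  }
  as.numeric(x)
}

ambiguous <- c("1.015", "1,060")
us_values <- parse_with_locale(ambiguous, "US")  # 1.015 and 1060
de_values <- parse_with_locale(ambiguous, "DE")  # 1015 and 1.06
```

The same string differs by a factor of 1000 between the two conventions, which is why the locale has to be stated explicitly for such values.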
* `Language` in `common words`:
The `clean_lab_result()` function supports 19 distinct languages for frequently used terms such as "high", "low", "positive", and "negative". For example, the word `Pøsitivo` is included in the common words and will be cleaned as `Pos`.
Let us see how this data table works in our function:
```{r common words, warning=FALSE, message=FALSE}
data("common_words", package = "lab2clean")
common_words
```
As seen in this data, there are 19 languages for 8 common words. If the word is positive or negative, the result is cleaned to `Pos` or `Neg`, unless it is accompanied by a number, in which case the word is removed and a flag is added to the `cleaning_comments`. For example, the result `Négatif 0.3` is cleaned as `0.3` and the result `33 Normal` is cleaned as `33`. Finally, if the result contains the word "Sample" or "Specimen", a comment is added stating that the `test was not performed`.
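The rule for words with and without an accompanying number can be sketched in base R as follows. This is a simplified illustration under stated assumptions: `word_map` is a two-entry stand-in rather than the `common_words` table, and `clean_with_words()` is a hypothetical helper, not the package's implementation.

```r
# Illustrative stand-in for a word list (English only)
word_map <- c(positive = "Pos", negative = "Neg")

clean_with_words <- function(x) {
  has_number <- grepl("[0-9]", x)
  # When a number is present, keep only digits and the decimal point
  number_part <- gsub("[^0-9.]", "", x)
  # Otherwise look up the lower-cased word in the stand-in list
  word_part <- unname(word_map[tolower(trimws(x))])
  data.frame(
    raw_result = x,
    clean_result = ifelse(has_number, number_part, word_part),
    # Flag results where a word was removed from around a number
    flag = has_number & grepl("[[:alpha:]]", x),
    stringsAsFactors = FALSE
  )
}

word_demo <- clean_with_words(c("Positive", "negative", "Negative 0.3"))
word_demo
```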
* `Flag` creation:
In addition to the common words, when there is a space between a numeric value and a minus character, this also creates a flag. For example, result `- 5` is cleaned as `5` with a flag, but the result `-5` is cleaned as `-5`, and no flag is created because we can assume it was a negative value.
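The distinction between a detached and an attached minus sign can be expressed with a regular expression. The sketch below is a hypothetical base R helper written for illustration only and is not taken from the package source.

```r
# Illustrative helper (not the lab2clean implementation)
clean_minus <- function(x) {
  # A minus sign followed by whitespace and then a digit is treated as detached
  spaced_minus <- grepl("^-\\s+[0-9]", x)
  data.frame(
    raw_result = x,
    # Drop a detached minus sign; keep an attached one as a negative sign
    clean_result = ifelse(spaced_minus, sub("^-\\s+", "", x), x),
    flag = spaced_minus,
    stringsAsFactors = FALSE
  )
}

minus_demo <- clean_minus(c("- 5", "-5"))
minus_demo
```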
## 4. Function 2: Validate results
The `validate_lab_result()` function has seven arguments:
* `lab_data` : A data frame containing laboratory data.
* `result_value` : The column in `lab_data` with quantitative result values for validation.
* `result_unit` : The column in `lab_data` with result units in a UCUM-valid format.
* `loinc_code` : The column in `lab_data` indicating the LOINC code of the laboratory test.
* `patient_id` : The column in `lab_data` indicating the identifier of the tested patient.
* `lab_datetime` : The column in `lab_data` with the date or datetime of the laboratory test.
* `report` : Whether a report is written in the console. Defaults to `TRUE`.
Let us check how our package validates the results using the `validate_lab_result()` function. Consider the `Function_2_dummy` data, which contains 86,863 rows, and inspect its first 6 rows:
```{r Function_2_dummy dataset, warning=FALSE, message=FALSE}
data("Function_2_dummy", package = "lab2clean")
head(Function_2_dummy, 6)
```
Let us apply the `validate_lab_result()` and see its functionality:
```{r apply validate_lab_result, warning=FALSE, message=FALSE}
validate_results <- validate_lab_result(Function_2_dummy,
                                        result_value = "result_value",
                                        result_unit = "result_unit",
                                        loinc_code = "loinc_code",
                                        patient_id = "patient_id",
                                        lab_datetime = "lab_datetime1")
```
The `validate_lab_result()` function generates a `flag` column, with different checks:
```{r flag column creation, warning=FALSE, message=FALSE}
head(validate_results, 6)
levels(factor(validate_results$flag))
```
We can now subset specific patients to explain the flags:
```{r flag explain by subseting patients, warning=FALSE, message=FALSE}
subset_patients <- validate_results[validate_results$patient_id %in% c("14236258", "10000003", "14499007"), ]
subset_patients
```
* Patient 14236258 has both `delta_flag_8_90d` and `delta_flag_7d`, which are calculated with the lower and upper percentiles set to 0.0005 and 0.9995 respectively. While the delta check is effective in identifying potentially erroneous result values, we acknowledge that it may also flag clinically relevant changes. Therefore, it is crucial that users interpret these flagged results in conjunction with the patient's clinical context.
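The idea of a percentile-based delta check can be illustrated with a small base R sketch. Everything in it is assumed for illustration: the simulated `reference_deltas`, the example `results`, and the use of a single pooled distribution. It does not reproduce how `lab2clean` derives its limits for each test and time window.

```r
set.seed(1)
# Hypothetical differences between consecutive results of the same test
reference_deltas <- rnorm(10000, mean = 0, sd = 5)

# Limits at the lower and upper percentiles 0.0005 and 0.9995
limits <- quantile(reference_deltas, probs = c(0.0005, 0.9995), names = FALSE)

# Consecutive results for one hypothetical patient
results <- c(90, 92, 140, 91)
deltas <- diff(results)

# A delta outside the limits is flagged
delta_flag <- deltas < limits[1] | deltas > limits[2]
data.frame(previous = head(results, -1), current = results[-1],
           delta = deltas, delta_flag = delta_flag)
```

In this made-up series the jump to 140 and the return to 91 are both flagged, which shows how a single outlying value can produce two consecutive delta flags.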
Let us also explain two tables that we used for the validation function. Let us begin with the reportable interval table.
```{r reportable_interval, warning=FALSE, message=FALSE}
data("reportable_interval", package = "lab2clean")
reportable_interval_subset <- reportable_interval[reportable_interval$interval_loinc_code == "2160-0", ]
reportable_interval_subset
```
* Patient 14499007 has a flag named `low_unreportable`. As we can see, for the LOINC code "2160-0" the result was 0.0, which is outside the reportable range (0.0001, 120). Similarly, patient 17726236 has a `high_unreportable` flag.
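The reportable interval check amounts to comparing each result with the lower and upper limits. A minimal base R sketch follows, using the interval quoted above for "2160-0" and invented result values. Whether the package treats the limits themselves as reportable is not stated here, so the strict comparisons are an assumption.

```r
# Interval quoted above for LOINC code 2160-0; the result values are invented
low_limit <- 0.0001
high_limit <- 120
result_value <- c(0.0, 0.9, 150)

# Assumption: values strictly below or above the limits are unreportable
interval_flag <- ifelse(
  result_value < low_limit, "low_unreportable",
  ifelse(result_value > high_limit, "high_unreportable", NA_character_)
)
data.frame(result_value, interval_flag)
```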
Logic rules ensure that related test results are consistent:
```{r logic_rules, warning=FALSE, message=FALSE}
data("logic_rules", package = "lab2clean")
logic_rules_subset <- logic_rules[logic_rules$rule_id == 3, ]
logic_rules_subset
```
* Patient 10000003 has both `logic_flag` and `duplicate`. The `duplicate` flag means that this patient has a duplicate row, whereas the `logic_flag` should be interpreted as follows. For the LOINC code "2093-3", which is total cholesterol, the logic rules table requires "2093-3" > "2085-9" + "13457-7", or equivalently total cholesterol > HDL cholesterol + LDL cholesterol. For patient 10000003, LDL ("13457-7") equals 100.0, HDL ("2085-9") equals 130.0, and total cholesterol ("2093-3") equals 230. The rule is therefore not satisfied, because 230 > 100 + 130, i.e. 230 > 230, is false, and a logic flag is created.
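The arithmetic behind this logic flag can be checked directly in base R. The variable names below are chosen for this illustration, and the values are the ones quoted above for patient 10000003.

```r
# Values quoted above for patient 10000003
total_cholesterol <- 230  # LOINC 2093-3
hdl_cholesterol <- 130    # LOINC 2085-9
ldl_cholesterol <- 100    # LOINC 13457-7

# The rule requires total cholesterol to be strictly greater than HDL + LDL
rule_satisfied <- total_cholesterol > hdl_cholesterol + ldl_cholesterol

# A logic flag is raised when the rule is not satisfied
logic_flag <- !rule_satisfied
logic_flag
```

Because the comparison is strict, a total that exactly equals the sum of its components is also flagged.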
## 5. Function 3: Clean and Standardize units of measurement
The `standardize_lab_unit()` function has four arguments:
* `lab_data` : A dataset containing laboratory data.
* `raw_unit` : The column in `lab_data` that contains raw units to be cleaned.
* `report` : Whether a report is written in the console. Defaults to `TRUE`.
* `n_records` : If you are loading a grouped list of distinct units, you can assign `n_records` to the column that contains the frequency of each distinct unit. Defaults to `NA`.
Let us check how our package standardizes the units of measurement using the `standardize_lab_unit()` function. Consider the `Function_3_dummy` data, which contains 32 rows, and inspect its first 6 rows:
```{r Function_3_dummy dataset, warning=FALSE, message=FALSE}
data("Function_3_dummy", package = "lab2clean")
head(Function_3_dummy, 6)
```
This dataset, for demonstration purposes, contains three columns: `unit_raw`, `n_records`, and `note`. The `unit_raw` column holds raw laboratory units as reported in the database, `n_records` indicates how often each unit appeared, and `note` details the different cases handled by our function.
```{r apply standardize_lab_unit, warning=FALSE, message=FALSE}
standardized_units <- standardize_lab_unit(Function_3_dummy, raw_unit = "unit_raw", n_records = "n_records")
```
This function creates two new columns:
```{r standardized_units_head, warning=FALSE, message=FALSE}
head(standardized_units, 10)
```
1- `ucum_code`: Cleaned and standardized units according to UCUM syntax.
2- `cleaning_comments`: Comments about the cleaning process for each unit.
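One part of unit standardization is mapping locally written units to their case-sensitive UCUM form. The base R sketch below shows the idea with a tiny hypothetical lookup table. `unit_lookup` is an assumption made for this example and does not reproduce the package's reference data or its full set of cleaning steps.

```r
# Hypothetical lookup from normalized raw strings to UCUM codes
unit_lookup <- c(
  "mg/dl"  = "mg/dL",
  "mmol/l" = "mmol/L",
  "10^9/l" = "10*9/L"  # UCUM writes powers of ten with "*"
)

raw_unit <- c("MG/DL", " mmol/l", "10^9/L", "unknown unit")

# Normalize case and surrounding whitespace before the lookup
key <- tolower(trimws(raw_unit))
ucum_code <- unname(unit_lookup[key])

data.frame(raw_unit, ucum_code, stringsAsFactors = FALSE)
```

Strings that are not in the lookup come back as `NA`, which marks them as candidates for manual review.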
## 6. Function 4: Harmonize results to reference units
The `harmonize_lab_unit()` function has six arguments:
* `lab_data` : A data frame containing laboratory data.
* `result_value` : The column in `lab_data` with quantitative result values to be harmonized.
* `result_unit` : The column in `lab_data` with result units in a UCUM-valid format.
* `loinc_code` : The column in `lab_data` indicating the LOINC code of the laboratory test.
* `preferred_unit_system` : A string representing the preference of the user for the unit system used for standardization. Defaults to "SI", the other option is "Conventional".
* `report` : Whether a report is written in the console. Defaults to `TRUE`.
Let us demonstrate the `harmonize_lab_unit()` function using `Function_4_dummy` and inspect the first six rows:
```{r Function_4_dummy, }
data("Function_4_dummy", package = "lab2clean")
head(Function_4_dummy,6)
```
This dataset, for demonstration purposes, contains three columns: `loinc_code`, `result_value`, and `result_unit`.
```{r apply harmonize_lab_unit, warning=FALSE, message=FALSE}
harmonized_units <- harmonize_lab_unit(Function_4_dummy,
                                       loinc_code = "loinc_code",
                                       result_value = "result_value",
                                       result_unit = "result_unit")
```
This function creates six different columns:
```{r harmonized_units_head, warning=FALSE, message=FALSE}
head(harmonized_units, 6)
```
1- `harmonized_unit`: Harmonized units according to the preferred unit system.
2- `OMOP_concept_id`: The concept id of the harmonized unit, necessary for databases standardized to the OMOP Common Data Model.
3- `new_value`: The result value after the conversion.
4- `new_loinc_code`: The unit conversion can lead to a LOINC code different from the reported one in two cases:
* If the reported unit did not match the property of the given LOINC code, for example "mmol/L" with a LOINC code of mass concentration property. In this case "loinc_unitsystem_mismatch" is added to the cleaning comments.
* If a mass<>molar conversion was executed.
```{r harmonized_units new_loinc_code, warning=FALSE, message=FALSE}
harmonized_units[which(harmonized_units$loinc_code != harmonized_units$new_loinc_code), ]
```
5- `property_group_id`: the code of the LOINC group (parent group ID / Group ID).
6- `cleaning_comments`: Comments about the harmonization and conversion process for each lab result, with two main cases:
* Success: harmonized with the same value or with a converted new value:
    - No conversion in case of similar or equivalent source and reference units.
    - Conversion, with the method stated as either regular or mass<>molar conversion.
* Failure: not harmonized, with a detailed reason for each failure case.
```{r harmonized_units cleaning_comments, warning=FALSE, message=FALSE}
levels(factor(harmonized_units$cleaning_comments))
```
The `harmonize_lab_unit()` function has an argument named `preferred_unit_system`.
* `preferred_unit_system`:
Depending on the user's preference, the reference units may change from SI units (usually molar concentration) to the conventional units commonly used in practice (usually mass concentration) through mass<>molar conversions. For LOINC codes that have no mass<>molar equivalent, the conventional and SI units are considered the same. For some LOINC codes, the molar concentration is the one used conventionally. The following examples show how the output differs between the two `preferred_unit_system` options:
```{r preferred_unit_system, warning=FALSE, message=FALSE}
Function_4_dummy_subset <- Function_4_dummy[c(27, 15, 38, 45),, drop = FALSE]
harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
                                       loinc_code = "loinc_code",
                                       result_value = "result_value",
                                       result_unit = "result_unit",
                                       report = FALSE,
                                       preferred_unit_system = "SI")
harmonized_units
harmonized_units <- harmonize_lab_unit(Function_4_dummy_subset,
                                       loinc_code = "loinc_code",
                                       result_value = "result_value",
                                       result_unit = "result_unit",
                                       report = FALSE,
                                       preferred_unit_system = "Conventional")
harmonized_units
```
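As a worked example of what a mass<>molar conversion involves, the base R sketch below converts a glucose concentration between mg/dL and mmol/L. Glucose and its approximate molar mass of 180.16 g/mol are chosen here purely for illustration, and the helper functions are hypothetical, not the conversion routines used by `harmonize_lab_unit()`.

```r
# Glucose is used as an example; 180.16 g/mol is its approximate molar mass
molar_mass_g_per_mol <- 180.16

# Mass concentration (mg/dL) to molar concentration (mmol/L)
mass_to_molar <- function(mg_per_dl, molar_mass) {
  mg_per_l <- mg_per_dl * 10  # 1 dL = 0.1 L
  mg_per_l / molar_mass       # mg/L divided by g/mol gives mmol/L
}

# Molar concentration (mmol/L) back to mass concentration (mg/dL)
molar_to_mass <- function(mmol_per_l, molar_mass) {
  mmol_per_l * molar_mass / 10
}

glucose_mmol_l <- mass_to_molar(90, molar_mass_g_per_mol)
glucose_mg_dl <- molar_to_mass(glucose_mmol_l, molar_mass_g_per_mol)
round(glucose_mmol_l, 2)
```

Because the conversion factor depends on the molar mass of the analyte, it has to be defined per test rather than per unit pair.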
## 7. Customization
We fully acknowledge the importance of customization to accommodate diverse user needs and tailor the functions to specific datasets. To this end, the data in `common_words`, `logic_rules`, `reportable_interval`, `RWD_units_to_UCUM_V2`, `annotable_strings`, and `loinc_reference_unit_v1` are not hard-coded within the function scripts but are instead provided as separate data files in the "data" folder of the package. This approach allows users to benefit from the default data we have included, which reflects our best knowledge, while also providing the flexibility to append or modify the data as needed.
For example, users can easily customize the `common_words` RData file by adding phrases that are used across different languages and laboratory settings. This allows the `clean_lab_result()` function to better accommodate the specific linguistic and contextual nuances of their datasets. Similarly, users can adjust the `logic_rules` and `reportable_interval` data files for the `validate_lab_result()` function to reflect the unique requirements or standards of their research or clinical environment.
Additionally, users can extend the `RWD_units_to_UCUM_V2` data file by adding locally used units or strings (especially those with non-English letters or abbreviations) together with their UCUM-valid equivalents, thereby customizing the output of the `standardize_lab_unit()` function. Similarly, the `annotable_strings` data file can be extended by adding non-English strings for analytes that are used locally in units.
Finally, the `harmonize_lab_unit()` function can be customized by adding reference units for LOINC codes that are not covered in `loinc_reference_unit_v1`, or even by editing the reference units of existing LOINC codes (though this is not recommended).
By providing these customizable data files, we aim to ensure that the `lab2clean` package is not only powerful but also adaptable to the varied needs of the research and clinical communities.