The R-package refineR implements the recently published, state-of-the-art indirect method, refineR (Ammer et al. 2021, description of refineR v1.0) which was developed for the estimation of reference intervals using Real-World Data (RWD) [https://doi.org/10.1038/s41598-021-95301-2]. It takes routine measurements of diagnostic tests, containing pathological and non-pathological samples as input and uses sophisticated statistical methods to derive a model describing the distribution of the non-pathological samples. This distribution can then be used to derive reference intervals. The R-package offers various convenience functions to estimate the model, compute reference intervals, as well as to print and plot the estimated results. Additional guidance on the usage of the refineR algorithm is given in Ammer et al. 2023 [https://doi.org/10.1093/jalm/jfac101].
The R-package refineR comes with five simulated datasets that mimick routine measurements of biomarkers observed in laboratory practice. These datasets describe a mixed distribution containing non-pathological and pathological values. The datasets differ in number of samples, pathological fraction and underlying non-pathological distribution. A detailed description of the different testcases can be found in the package documentation.
Prefiltering, cleaning, and possible partitioning of the extracted data from the laboratory information system is not included in the refineR package and has to be carried out in advance. The simulated datasets used here for showcasing the application reflect already preprocessed datasets.
For demonstrating the application of the functions provided by the refineR package, testcase 4 is used as an example dataset:
# load refineR package and load data
library(refineR)
head(testcase4)
## [1] 17.76419 20.34425 42.40541 23.01746 23.57971 45.21698
For evaluating your own dataset, you may import the data from a .CSV file:
# open help of read.csv function to get familiar with its parameters
#?read.csv
# set the file path and parameters according to your input file and import dataset
#mydata <- read.csv(file = "file path to mydata.csv", header = TRUE, sep = ",", dec = ".")
#head(mydata)
# extract the column containing the numeric test results
#mydata2 <- mydata[, "column with test results"]
# example how to run refineR estimation
#fit <- findRI(Data = mydata2)
The main function of refineR is findRI() that takes an input data set and estimates the model parameters lambda, mu and sigma of a Box-Cox transformed normal distribution that best explains the non-pathological distribution. The optimization is carried out via a multi-level grid search to minimize the cost function (negative log-likelihood with regularization). For a detailed description of the method (v 1.0), please refer to Ammer et al., 2021. The findRI() function takes the following main function arguments:
To run the model estimation using the default parameters just the (pre-processed) input data is required (Data):
# run refineR estimation and print resulting RWDRI object
fit <- findRI(Data = testcase4)
print(fit)
##
## Reference Intervals
## ------------------------------------------------
## lower limit [ 2.5% perc]: 9.62
## upper limit [97.5% perc]: 49.5
##
## Model Parameters
## ------------------------------------------------
## method: refineR (v1.6.2)
## model: BoxCox
## N data: 100000
## rounded: no
## point est: fullDataEst
## lambda: 0.0595
## mu: 3.41
## sigma: 0.502
## shift: 0
## cost: -36.6
## NP fraction: 0.909
The print() method comprises an overview of the model estimation results. First, it shows the estimated lower and upper reference limit, per default the 2.5% and 97.5% percentiles. Second, it depicts information about the used refineR version, the applied model, as well as the number of of data points (N data) and whether the input data was rounded or not (rounded). Further, it shows the estimated model parameters (lambda, mu, sigma, shift), the estimated costs and the estimated fraction of non-pathological values (NP fraction).
To calculate the estimated reference intervals, the function getRI() can be used. Per default, again the 2.5% and 97.5% percentiles are computed using the estimated model parameters.
# compute reference intervals using the estimated model parameters
getRI(fit)
## Percentile PointEst CILow CIHigh
## 1 0.025 9.615576 NA NA
## 2 0.975 49.464639 NA NA
We can now also take a look at the result by using the plot() function.
# plot the estimated model
plot(fit)
The plot shows the raw input data (gray histogram) as well as the estimated model for the non-pathological distribution (green curve) and the estimated reference intervals (per default again the 2.5% and 97.5% percentiles) as green dashed lines.
As mentioned, the findRI() function can take additional function arguments. To calculate confidence intervals for the estimation, bootstrapping is required. The number of bootstrap iterations can be set by the argument NBootstrap (default: 0). For demonstration purposes and due to the increased computation time, we used a small number of iterations (NBootstrap = 20). However, for use in real-world analysis, we recommend to use NBootstrap >= 200. When using bootstrapping, the print() function then also gives the estimated confidence intervals for the reference limits (with a default confidence level of 95%).
# run refineR estimation with 20 bootstrap iterations
fit.bs <- findRI(Data = testcase4, NBootstrap = 20)
print(fit.bs)
##
## Reference Intervals
## ------------------------------------------------
## lower limit [ 2.5% perc]: 9.62 (9.16; 9.86)
## upper limit [97.5% perc]: 49.5 (48.4; 50.6)
##
## Model Parameters
## ------------------------------------------------
## method: refineR (v1.6.2)
## model: BoxCox
## N data: 100000
## N bootstrap: 20
## rounded: no
## point est: fullDataEst
## lambda: 0.0595
## mu: 3.41
## sigma: 0.502
## shift: 0
## cost: -36.6
## NP fraction: 0.909
In addition to the one-parameter Box-Cox transformation, the refineR algorithm also offers the option to use the two-parameter (modified) Box-Cox transformation. This model takes an additional shift parameter to better model skewed distributions that are shifted away from zero.
# run refineR estimation with alternative model (two-parameter (modified) Box-Cox transformation)
fit.mbc <- findRI(Data = testcase4, model = "modBoxCox")
print(fit.mbc)
##
## Reference Intervals
## ------------------------------------------------
## lower limit [ 2.5% perc]: 10
## upper limit [97.5% perc]: 49.3
##
## Model Parameters
## ------------------------------------------------
## method: refineR (v1.6.2)
## model: modBoxCox
## N data: 100000
## rounded: no
## point est: fullDataEst
## lambda: 0.187
## mu: 3.89
## sigma: 0.858
## shift: 3.67
## cost: -36.9
## NP fraction: 0.901
The getRI() as well as the print() function can take additional function arguments as well:
To print and compute certain percentiles of the estimated model, set
the RIperc argument, e.g. to
c(0.025, 0.5, 0.975)
.
# compute 2.5%, 50% (median), 97.5% percentiles for the estimated model
getRI(fit, RIperc = c(0.025, 0.5, 0.975))
## Percentile PointEst CILow CIHigh
## 1 0.025 9.615576 NA NA
## 2 0.500 22.248646 NA NA
## 3 0.975 49.464639 NA NA
# print 2.5%, 50% (median), 97.5% percentiles and estimated model parameters
print(fit, RIperc = c(0.025, 0.5, 0.975))
##
## Reference Intervals
## ------------------------------------------------
## lower limit [ 2.5% perc]: 9.62
## median [50.0% perc]: 22.2
## upper limit [97.5% perc]: 49.5
##
## Model Parameters
## ------------------------------------------------
## method: refineR (v1.6.2)
## model: BoxCox
## N data: 100000
## rounded: no
## point est: fullDataEst
## lambda: 0.0595
## mu: 3.41
## sigma: 0.502
## shift: 0
## cost: -36.6
## NP fraction: 0.909
To also compute confidence intervals for the estimated reference limits after using bootstrapping, you can specify the confidence level by setting CIprop (default: 0.95). Further, you can specify if you want to use the full data estimate as point estimate or the median or mean from the bootstrap samples by setting pointEst to fullDataEst (Default), medianBS, or meanBS. We would recommend to use the median of all bootstrap samples (medianBS) here.
# compute percentiles for estimated model with bootstrapping using the median as point estimate
getRI(fit.bs, RIperc = c(0.025, 0.975), pointEst = "medianBS")
## Percentile PointEst CILow CIHigh
## 1 0.025 9.611764 9.158019 9.864153
## 2 0.975 49.156895 48.358840 50.550111
# print percentiles for estimated model with bootstrapping using the median as point estimate and estimated model parameters
print(fit.bs, RIperc = c(0.025, 0.975), pointEst = "medianBS")
##
## Reference Intervals
## ------------------------------------------------
## lower limit [ 2.5% perc]: 9.61 (9.16; 9.86)
## upper limit [97.5% perc]: 49.2 (48.4; 50.6)
##
## Model Parameters
## ------------------------------------------------
## method: refineR (v1.6.2)
## model: BoxCox
## N data: 100000
## N bootstrap: 20
## rounded: no
## point est: medianBS
## lambda: 0.07
## mu: 3.47
## sigma: 0.517
## shift: 0
## cost: -33.7
## NP fraction: 0.904
The plot() function can take additional function arguments as well:
When plotting a model estimated with bootstrapping, the confidence intervals are shown per default as light green region. Further, you can specify if you want to show the estimate for the full data set (fullDataEst), or the median (medianBS) or mean (meanBS) of all bootstrap results by setting the argument pointEst. Additionally, you can adjust for example the x-limit, the x-label and the title:
# plot estimated model with bootstrapping with adjusted function arguments
plot(fit.bs, RIperc = c(0.025, 0.5, 0.975), pointEst = "medianBS", xlim = c(0, 100), xlab = "Concentration [U/L]",
title = "Testcase 4")
If you also want to show the estimated “pathological distribution”, set the showPathol argument to TRUE. This then adds a red curve to the plot, representing the difference between the raw histogram and the estimated model of the non-pathological distribution, i.e. interpretation should be taken with care.
If you don’t want to show the estimated reference limits, you can disable this, by setting showValue to FALSE.
# plot estimated model with bootstrapping showing the difference between raw input data and estimated model
# (i.e. 'pathological distribution'), wihtout showing the estimated reference limits
plot(fit.bs, showPathol = TRUE, showValue = FALSE, pointEst = "medianBS",
title = "Testcase 4 with pathological distribution")
Ammer, T., Schuetzenmeister, A., Prokosch, HU., Rauh, M., Rank, C.M., Zierk, J. refineR: A Novel Algorithm for Reference Interval Estimation from Real-World Data. Scientific Reports 11, 16023 (2021). https://doi.org/10.1038/s41598-021-95301-2.
Ammer, T., Schuetzenmeister, A., Rank, C.M., Doyle, K. Estimation of Reference Intervals from Routine Data Using the refineR Algorithm — A Practical Guide. The Journal of Applied Laboratory Medicine, 8(1):84-91 (2023). https://doi.org/10.1093/jalm/jfac101.