---
title: "mixgb: Multiple Imputation Through XGBoost"
author: "Yongshi Deng"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{mixgb: Multiple Imputation Through XGBoost}
  %\VignetteEngine{knitr::rmarkdown}
---

## Introduction

The **mixgb** package provides a scalable approach to imputation for large data using XGBoost, subsampling, and predictive mean matching. It leverages XGBoost, an efficient implementation of gradient-boosted trees, to automatically capture complex interactions and non-linear relationships. Subsampling and predictive mean matching are incorporated to reduce bias and to preserve realistic imputation variability. The package accommodates a wide range of variable types and offers flexible control over subsampling and predictive matching settings.

We also recommend our package **vismi** ([Visualisation Tools for Multiple Imputation][vismi-url]), which offers a comprehensive set of diagnostics for assessing the quality of multiply imputed data.

[vismi-url]: https://agnesdeng.github.io/vismi/

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Impute missing values with `mixgb`

We first load the `mixgb` package and the `newborn` dataset, which contains 16 variables of various types (integer/numeric/factor/ordinal factor). There are 9 variables with missing values.

```{r}
library(mixgb)
str(newborn)
colSums(is.na(newborn))
```

To impute this dataset, we use the default settings. By default, the number of imputed datasets is set to `m = 5`. The data do not need to be converted to a `dgCMatrix` or one-hot encoded format, as these transformations are handled automatically by the package. Supported variable types include numeric, integer, factor, and ordinal factor.

```{r, eval = FALSE}
# use mixgb with default settings
imp_list <- mixgb(data = newborn, m = 5)
```

### Customise imputation settings

We can also customise imputation settings:

- The number of imputed datasets: `m`.
- The number of imputation iterations: `maxit`.
- XGBoost hyperparameters and verbosity settings: `xgb.params`, `nrounds`, `early_stopping_rounds`, `print_every_n` and `verbose`.
- The subsampling ratio: by default, `subsample = 0.7`. Users can change this value through the `xgb.params` argument.
- Predictive mean matching settings: `pmm.type`, `pmm.k` and `pmm.link`.
- Whether ordinal factors should be converted to integers, which may speed up imputation: `ordinalAsInteger`.
- Initial imputation methods for different types of variables: `initial.num`, `initial.int` and `initial.fac`.
- Whether to save models for imputing new data: `save.models` and `save.vars`.

```{r, eval = FALSE}
set.seed(2026)

# Use mixgb with chosen settings
params <- list(
  max_depth = 5,
  subsample = 0.9,
  nthread = 2,
  tree_method = "hist"
)

imp_list <- mixgb(
  data = newborn,
  m = 10,
  maxit = 2,
  ordinalAsInteger = FALSE,
  pmm.type = "auto",
  pmm.k = 5,
  pmm.link = "prob",
  initial.num = "normal",
  initial.int = "mode",
  initial.fac = "mode",
  save.models = FALSE,
  save.vars = NULL,
  xgb.params = params,
  nrounds = 200,
  early_stopping_rounds = 10,
  print_every_n = 10L,
  verbose = 0
)
```

### Tune hyperparameters

Imputation performance can be influenced by the choice of hyperparameters. While tuning a large number of hyperparameters may seem daunting, the search space can often be substantially reduced because many of them are correlated. **mixgb** provides the function `mixgb_cv()` to tune the number of boosting rounds (`nrounds`). As XGBoost does not define a default value for `nrounds`, users must specify this parameter explicitly. The default setting in `mixgb()` is `nrounds = 100`; however, we recommend using `mixgb_cv()` to find an appropriate value first.

```{r}
params <- list(max_depth = 3, subsample = 0.7, nthread = 2)
cv.results <- mixgb_cv(data = newborn, nrounds = 100, xgb.params = params, verbose = FALSE)
cv.results$evaluation.log
cv.results$response
cv.results$best.nrounds
```

By default, `mixgb_cv()` randomly selects an incomplete variable as the response and fits an XGBoost model using the remaining variables as predictors, based on the complete cases of the dataset. As a result, repeated runs of `mixgb_cv()` may yield different results. Users may instead explicitly specify the response variable and the set of covariates via the `response` and `select_features` arguments, respectively.

```{r}
cv.results <- mixgb_cv(
  data = newborn,
  nfold = 10,
  nrounds = 100,
  early_stopping_rounds = 1,
  response = "head_circumference_cm",
  select_features = c(
    "age_months", "sex", "race_ethinicity", "recumbent_length_cm",
    "first_subscapular_skinfold_mm", "second_subscapular_skinfold_mm",
    "first_triceps_skinfold_mm", "second_triceps_skinfold_mm", "weight_kg"
  ),
  xgb.params = params,
  verbose = FALSE
)

cv.results$best.nrounds
```

We can then set `nrounds = cv.results$best.nrounds` in `mixgb()` to generate five imputed datasets.

```{r, eval = FALSE}
imp_list <- mixgb(data = newborn, m = 5, nrounds = cv.results$best.nrounds)
```

## Inspect multiply imputed values

Older versions of the **mixgb** package included a few visual diagnostic functions. These have now been removed from **mixgb**. We recommend our standalone package **vismi** ([Visualisation Tools for Multiple Imputation][vismi-url]), which provides a comprehensive set of visual diagnostics for evaluating multiply imputed data.

For more details, please visit [https://agnesdeng.github.io/vismi/](https://agnesdeng.github.io/vismi/) and [https://github.com/agnesdeng/vismi](https://github.com/agnesdeng/vismi).
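
For a quick numerical check alongside visual diagnostics, the imputed values of a variable can be compared with its observed values in each imputed dataset. The following is a minimal base R sketch, assuming that the imputed datasets are stored in a list whose elements have the same rows and columns as the original data. The helper `summarise_imputed()` and the toy data are hypothetical and only for illustration; they are not part of **mixgb** or **vismi**, and the toy "imputations" are random draws from the observed values rather than output from `mixgb()`.

```{r}
# Hypothetical helper (not part of mixgb or vismi): compare the observed
# values of one numeric variable with the values imputed in each dataset.
summarise_imputed <- function(original, imp_list, var) {
  missing_rows <- is.na(original[[var]])
  observed <- original[[var]][!missing_rows]
  # One row per imputed dataset: mean and SD of the values filled in for `var`
  imputed <- t(vapply(imp_list, function(imp) {
    values <- imp[[var]][missing_rows]
    c(mean = mean(values), sd = sd(values))
  }, numeric(2)))
  rownames(imputed) <- paste0("imputed_", seq_along(imp_list))
  rbind(observed = c(mean = mean(observed), sd = sd(observed)), imputed)
}

# Toy data standing in for an incomplete dataset and m = 3 imputed copies.
# Missing values are filled with random draws from the observed values,
# purely so that the example runs without calling mixgb().
set.seed(2026)
toy <- data.frame(x = rnorm(50), y = rnorm(50, mean = 10, sd = 2))
toy$y[sample(50, 10)] <- NA

toy_imp_list <- lapply(1:3, function(i) {
  imp <- toy
  observed_y <- toy$y[!is.na(toy$y)]
  imp$y[is.na(toy$y)] <- sample(observed_y, sum(is.na(toy$y)), replace = TRUE)
  imp
})

summarise_imputed(toy, toy_imp_list, "y")
```

With real output, the same helper could be called as `summarise_imputed(newborn, imp_list, "head_circumference_cm")`, provided `imp_list` was created by `mixgb()` as above and the chosen variable is numeric. Imputed means or standard deviations that differ greatly from the observed ones are not necessarily wrong, but they are worth a closer look with the visual diagnostics in **vismi**.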