---
title: "A sparse-group boosing Tutorial in R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A sparse-group boosing Tutorial in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8} 
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

`sgboost` is an implementation of the sparse-group boosting [^1] to be used in conjunction with the R-package `mboost`. In the fitting process, a formula object defining group base-learners and individual base-learners is used.
Regularization is based on the degrees of freedom of an individual base-learners
$df(\lambda)$ and group base-learners $df(\lambda^{(g)})$,
such that $df(\lambda) = \alpha$ and $df(\lambda^{(g)}) = 1- \alpha$. 
Sparse-group boosting serves as an alternative method to sparse-group lasso, employing boosted Ridge regression.


# Installation

You can install the released version of sgboost from [CRAN](https://CRAN.R-project.org) with:

```{r, echo = TRUE, eval = FALSE}
install.packages("sgboost")
```

You can install the development version of sgboost from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("FabianObster/sgboost")
```

Vignettes are not included in the package by default. If you want to
include vignettes, then use this modified command:

```{r, eval = FALSE}
remotes::install_github(
  "FabianObster/sgboost",
  build_vignettes = TRUE, dependencies = TRUE
)
```

```{r, warning=FALSE, message=FALSE}
library(mboost)
library(sgboost)
library(dplyr)
library(ggplot2)
```
# Data setup
We use the same simulated data and model structure as in the vignette of `sparsegl` (McDonald, Liang , Heinsfeld, 2023)[^2].  Randomly generate a $n\times p$ design matrix `X`.Randomly generate an $n \times p$ design matrix X. For the real-valued vector y, the following two settings are being used:

* Linear Regression model: $y = X\beta^* + \epsilon$.
* Logistic regression model: $y = (y_1, y_2, \cdots, y_n)$, where $y_i \sim \text{Bernoulli}\left(\frac{1}{1 + \exp(-x_i^\top \beta^*)}\right)$, $i = 1, 2, \cdots, n,$

where the coefficient vector $\beta^*$ is specified as below, and the white noise
$\epsilon$ follows a standard normal distribution. 

```{r, warning=FALSE,message=FALSE, eval=TRUE}
set.seed(10)
n <- 100
p <- 200
X <- matrix(data = rnorm(n * p, mean = 0, sd = 1), nrow = n, ncol = p)
beta_star <- c(
  rep(5, 5), c(5, -5, 2, 0, 0), rep(-5, 5),
  c(2, -3, 8, 0, 0), rep(0, (p - 20))
)
groups <- rep(1:(p / 5), each = 5)

# Linear regression model
eps <- rnorm(n, mean = 0, sd = 1)
y <- X %*% beta_star + eps

# Logistic regression model
pr <- 1 / (1 + exp(-X %*% beta_star))
y_binary <- as.factor(rbinom(n, 1, pr))

# Input data.frames
df <- X %>%
  as.data.frame() %>%
  mutate_all(function(x) {
    as.numeric(scale(x))
  }) %>%
  mutate(y = as.numeric(y), y_binary = y_binary)
group_df <- data.frame(
  group_name = groups,
  variable_name = head(colnames(df), -2)
)
```
# Optimization Problem
The sparse-group-boosting problem is formulated as the sum of mean squared error (linear regression) or logistic loss (logistic regression) within each boosting iteration:

* Linear regression: 
    + Group base-learner: \newline $$\min_{g \leq G}\min_{\beta^{(g)} \in \mathbb{R}^p}\left(\frac{1}{2n} \rVert u - X^{(g)}\beta^{(g)}\rVert_2^2 + \lambda^{(g)} \rVert\beta^{(g)}\rVert_2^2 \right)$$
    + Individual base-learner: \newline  $$\min_{g \leq G, j \leq p_g}\min_{\beta^{(g)}_j\in\mathbb{R}}\left(\frac{1}{2n} \rVert u - X^{(g)}_j\beta^{(g)}_j\rVert_2^2 + \lambda^{(g)}_j \rVert\beta^{(g)}_j\rVert_2^2 \right)$$

* Logistic regression: 
    + Group base-learner \newline
$$\min_{g \in G}\min_{\beta^{(g)}\in\mathbb{R}}\left(\frac{1}{2n}\sum_{i=1}^n \log\left(1 + \exp\left(-y_i(x^{(g)})_i^\top\beta^{(g)}\right)\right) + \lambda^{(g)} \rVert\beta^{(g)}\rVert_2^2 \right) \qquad$$
    + Individual base-learners
$$\min_{g \in G, j \in p_g}\min_{\beta^{(g)}_j\in\mathbb{R}}\left(\frac{1}{2n}\sum_{i=1}^n \log\left(1 + \exp\left(-y_i(x^{(g)})_{ij}^\top\beta^{(g)}_j\right)\right) + \lambda^{(g)}_j \rVert\beta^{(g)}_j\rVert_2^2 \right) \qquad$$
where 
* $df(\lambda^{(g)}_j) = \alpha$
* $df(\lambda^{g}) = 1-\alpha$
* $df(\lambda) = tr(2H_\lambda-H_\lambda^TH_\lambda)$
* $H$ is the Ridge Hat matrix of a base-learner
* $X^{(g)}$ is the submatrix of $X$ with columns corresponding
to the features in group $g$.

* $\beta^{(g)}$ is the corresponding coefficients of
the features in group $g$.

* $p_g$ is the number of predictors in group $g$.

* $\alpha$ adjusts the weight between group and individual base-learners.

* $\lambda^{(g)}, \lambda^{(g)}_j$ Ridge penalty parameters.
* $u$ is the negative gradient vector from the previous boosting iteration.

# Workflow

To estimate, tune and interpret a sparse-group boosting model, the following 
workflow is advised:

## Define model

We start by creating the formula that describes the sparse-group boosting 
optimization problem, as stated above. We pass three central parameters to `create_formula`.

* `alpha`: Mixing parameter between zero and one used for the convex combination 
of the degrees of freedom.
* `group_df`: data.frame containing the group structure to be used.
* `group_name`: The name of the column in `group_df` indicates the group structure
* `var_name`: The name of the column in `group_df` indicating the name of 
the variables in the modelling data.frame to be used. Note that not all variables present
on the modelling data.frame have to be in group_df. These could be ID, timestamp variables or 
additional information one does not want to use in the model. 
* `outcome_name`: Name of the outcome variable in the modelling data.frame
* `intercept`: Should an intercept be used for all base-learners? If the data is
not centered, one should include use `intercept = TRUE`. Note that in this case,
individual base-learners will be groups of size two, and the group size of group
base-learners increases by one. 

```{r, eval=TRUE}
sgb_formula_linear <- create_formula(
  alpha = 0.4, group_df = group_df, outcome_name = "y", intercept = FALSE,
  group_name = "group_name", var_name = "variable_name",
)
sgb_formula_binary <- create_formula(
  alpha = 0.4, group_df = group_df, outcome_name = "y_binary", intercept = FALSE,
  group_name = "group_name", var_name = "variable_name",
)
```

## Model fitting and tuning

Pass the formula to `mboost` and use the arguments as seems appropriate. 
The main hyperparameters are `nu`and `mstop`. For model tuning, the `mboost` function
`cvrisk` can be used and plotted. `cvrisk` may be slow to run. One can run it in
parallel to speed up.

```{r, eval=TRUE}
sgb_model_linear <- mboost(
  formula = sgb_formula_linear, data = df,
  control = boost_control(nu = 1, mstop = 600)
)
# cv_sgb_model_linear <- cvrisk(sgb_model_linear,
#                               folds = cv(model.weights(sgb_model_linear),
#                                          type = 'kfold', B = 10))
sgb_model_binary <- mboost(
  formula = sgb_formula_binary, data = df, family = Binomial(),
  control = boost_control(nu = 1, mstop = 600)
)
# cv_sgb_model_binary <- cvrisk(sgb_model_binary,
#                               folds = cv(model.weights(sgb_model_linear),
#                                          type = 'kfold', B = 10))
# mstop(cv_sgb_model_linear)
# mstop(cv_sgb_model_binary)
# plot(cv_sgb_model_linear)
# plot(cv_sgb_model_binary)
sgb_model_linear <- sgb_model_linear[320]
sgb_model_binary <- sgb_model_binary[540]
```
In this example, the lowest out-of-sample risk is obtained at 320 (linear) and 
540 (logistic) boosting iterations.

## Interpreting and plotting a sparse-group boosting model

`sgboost` has useful functions to understand sparse-group boosting models and 
reflects that final model estimates of a specific variable in the dataset can be
attributed to group base-learners as well as individual baseleraners depending
on the boosting iteration

### Variable importance

A good starting point for understanding a sparse-group boosting model is the 
variable importance. In the context of boosting the variable importance can be 
defined as the relative contribution of each predictor to the overall reduction
of the loss function (negative log likelihood).
`get_varimp` returns the variable importance of each base-learner/predictor 
selected in throughout the boosting process. Note that in the case of the selection
of a group and an individual variable - call it $x_1$ - from the same group
-$x_1, x_2, ... x_p$ -, both base-learners (predictors) will have an associated
variable importance as defined before. This allows us to differentiate between 
the individual contribution of $x_1$ its own variable and the contribution of the 
group $x_1$ is part of as a group construct. It is not possible to compute the 
aggregated variable importance of $x_1$ as it is not clear how much 
$x_1$ contributes to the group. However, the aggregated coefficients can be 
computed using `get_coef`. Also, the aggregated importance of all groups vs. all
individual variables are returned in a separate data.frame. With `plot_varimp`
one can visualize the variable importance as a barplot. One can indicate the maximum 
number of predictors to be printed through `n_predictors` or the minimal 
variable importance that a predictor has to have though `prop`. Through both 
parameters the number of printed entries can be reduced. Note that
in this case, the relative importance of groups in the legend is based only on the
plotted variables and not the ones removed. To add information about the direction
of effect sizes, one could add arrows behind the bars[^3]. To do this for the
groups, one can use the aggregated coefficients from `get_coef`.

```{r, eval=TRUE, fig.align='center', fig.height=5, fig.width=7}
get_varimp(sgb_model = sgb_model_linear)$varimp %>% slice(1:10)
get_varimp(sgb_model = sgb_model_linear)$group_importance
plot_varimp(sgb_model = sgb_model_linear, n_predictors = 15)
```

### Resulting model coefficients

The resulting coefficients can be retrieved through `get_coef`. 
In a sparse-group boosting models, a variable in a dataset can be selected as an
individual variable or as a group. Therefore, there can be two associated effect sizes
for the same variable. This function aggregates both and returns them in a data.frame
sorted by the effect size `'effect'`.

```{r, eval=TRUE, fig.align='center', fig.height=5, fig.width=7}
get_coef(sgb_model = sgb_model_linear)$aggregate %>% slice(1:10)
get_coef(sgb_model = sgb_model_linear)$raw %>% slice(1:10)
```

### Plotting effect sizes and importance 

With `plot_effects` we can plot the effect sizes of the sparse-group boosting model 
in relation to the relative importance to get an overall picture of the model. 
Through the parameter `'plot_type'` one can choose the type of visualization.
`'radar'` refers to a radar plot using polar coordinates.
Here the angle is relative to the cumulative relative importance of predictors and
the radius is proportional to the effect size. `'clock'` does the same as `'radar'` 
but uses clock coordinates instead of polar coordinates. `'scatter'` uses the
effect size as y-coordinate and the cumulative relative importance as x-axis in 
a classical Scatter plot.
```{r, eval=TRUE, fig.align='center', fig.height=5, fig.width=7}
plot_effects(sgb_model = sgb_model_binary, n_predictors = 10, base_size = 10)
plot_effects(sgb_model = sgb_model_binary, n_predictors = 10, plot_type = "clock", base_size = 10)
plot_effects(sgb_model = sgb_model_binary, n_predictors = 10, plot_type = "scatter", base_size = 10)
```

### Coefficient path

`plot_path` calls `get_coef_path` to retrieve the aggregated coefficients from a
`mboost` object for each boosting iteration and plots it, while indicating if a
coefficient was updated by an individual variable or group.
```{r, eval=TRUE, fig.align='center', fig.height=5, fig.width=7}
plot_path(sgb_model = sgb_model_linear[100])
plot_path(sgb_model = sgb_model_binary[100])
```

[^1]: Obster F & Heumann C, (2024). _Sparse-group boosting -- Unbiased group and variable selection._
<https://doi.org/10.48550/arXiv.2206.06344>
[^2]: McDonald D, Liang X, Heinsfeld A, Cohen A, Yang Y, Zou H, Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay K, Simon N, Qian J  & Yang J. (2022).  _Getting started with sparsegl._ <https://CRAN.R-project.org/package=sparsegl>.
[^3]: Obster F., Bohle H. & Pechan P.M. (2024). _The financial well-being of fruit farmers in Chile and Tunisia depends more on social and geographical factors than on climate change._ Commun Earth Environ 5, 16. <https://doi.org/10.1038/s43247-023-01128-2>.