--- title: "An Introduction to `adelie`" author: - James Yang, Trevor Hastie and Balasubramanian Narasimhan date: "`r format(Sys.time(), '%B %d, %Y')`" bibliography: assets/grpnet_refs.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{An Introduction to `adelie`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "##", fig.width = 7, fig.height=6 ) ``` # __Introduction to Group Lasso and Elastic Net__ The `adelie` package provides extremely efficient procedures for fitting the entire group lasso and group elastic net regularization path for GLMs, multinomial, the Cox model and multi-task Gaussian models. `adelie` is similar to the R package `glmnet` in scope of models, and in computational speed. In fact, if there are no groups, `adelie` fits the same class of models as in `glmnet`, with similar computational speed. The R package `adelie` is built from the same C++ code as used in the corresponding [adelie Python package](https://github.com/JamesYang007/adelie). See @yang2024adelie for details. The theory and algorithms in this implementation are inspired by @glmnet, @coxnet, @strongrules and @block. In this notebook, we give a brief overview of the group elastic net problem that `adelie` solves, and describe some of the features of this implementation. Then we develop a detailed example for fitting pairwise interactions, mimicking the `glinternet` R package (@glinternet) ## __Single-Response Group Elastic Net__ The single-response group elastic net problem is given by $$ \begin{align*} \mathrm{minimize}_{\beta, \beta_0} \quad& \ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left( \alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2 \right) \\\text{subject to}\quad& \eta = \eta^0 + \beta_0 \mathbf{1} + X \beta \end{align*} $$ where * $\ell(\cdot)$ is the loss function defined by the GLM, * $\eta^0$ is a fixed offset vector, * $\beta_0$ is the intercept, * $\beta$ is the coefficient vector, * $X$ is the feature matrix, * $\lambda \geq 0$ is the regularization parameter, * $G$ is the number of groups, * $\beta_g$ are the subvector of coefficients for the $g$th group, * $\omega_g \geq 0$ is the penalty factor per group, and * $\alpha \in [0,1]$ is the elastic net parameter. As an example, the Gaussian GLM (`glm.gaussian()`) defines the loss function as $$ \begin{align*} \ell(\eta) &= \sum\limits_{i=1}^n w_i \left( -y_i \eta_i + \frac{\eta_i^2}{2} \right) \end{align*} $$ where $w \geq 0$ is the observation weight vector, $y$ is the response vector, and $\eta$ is the linear prediction vector as in the optimization problem above. This is equivalent to the weighted sum-of-squared errors $$ \begin{align*} \mbox{RSS} &= \frac12\sum\limits_{i=1}^n w_i \left( y_i -\eta_i \right)^2, \end{align*} $$ since the term in $\sum w_iy_i^2$ is a constant for the optimization. By default we set the weights $w_i=1/n$, and if supplied they are renormalized to sum to 1. Specifically for the Gaussian GLM, we employ a specialized optimizer based on coordinate descent to solve the group elastic net problem. The Gaussian GLM is written in "exponential family" form, with $\eta_i$ the natural parameter. Other GLMs such as binomial (`glm.binomial()`) and Poisson (`glm.poisson()`) have similar expressions for the negative log-likelihood. 
For other general GLMs as well as the Cox model (`glm.cox()`), we use a proximal Newton method, which leads to an _iteratively reweighted least squares_ (IRLS) algorithm. That is, we iteratively perform a quadratic approximation to $\ell(\cdot)$, which yields a sequence of Gaussian GLM group elastic net problems that we solve using our special solver based on coordinate descent.

## __Multi-Response Group Elastic Net__

The multi-response group elastic net problem is given by
$$
\begin{align*}
\mathrm{minimize}_{B,\; \beta_0} \quad& \ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left( \alpha \|B_g\|_F + \frac{1-\alpha}{2} \|B_g\|_F^2 \right)
\\\text{subject to}\quad& \eta = \eta^0 + \mathbf{1}\beta_0^\top + X B
\end{align*}
$$
The model arises in "multitask" learning --- `glm.multigaussian()` --- as well as multinomial (multiclass) logistic regression --- `glm.multinomial()`. This way, we have possibly different linear predictions for each of $K$ responses, and hence $\eta$ is $n\times K$. The coefficient "matrices" for each group $B_g$ are penalized using a Frobenius norm. Note that if an intercept is included in the model, an intercept is added for each response.

We use an alternative but equivalent representation in the `adelie` software, by "flattening" the coefficient matrices. Specifically we solve
$$
\begin{align*}
\mathrm{minimize}_{\beta,\; \beta_0} \quad& \ell(\eta) + \lambda \sum\limits_{g=1}^G \omega_g \left( \alpha \|\beta_g\|_2 + \frac{1-\alpha}{2} \|\beta_g\|_2^2 \right)
\\\text{subject to}\quad& \mathrm{vec}(\eta^\top) = \mathrm{vec}(\eta^{0\top}) + (\mathbf{1} \otimes I_K) \beta_0 + (X \otimes I_K) \beta
\end{align*}
$$
where $\mathrm{vec}(\cdot)$ is the operator that flattens a column-major matrix into a vector, and $A \otimes B$ is the Kronecker product operator. Hence $\beta \equiv \mathrm{vec}(B^\top)$.

As indicated above, the multi-response group elastic net problem is technically of the same form as the single-response group elastic net problem. In fact, `adelie` reuses the single-response solver for multi-response problems by modifying the inputs appropriately (e.g. using `matrix.kronecker_eye()` to represent $X \otimes I_K$). For the MultiGaussian family we wrap the specialized single-response Gaussian solver; for general multi-response GLMs we wrap the single-response GLM solver.

# __Quickstart__

In this section, we cover the basic usage of `adelie` using very simple examples.

```{r}
library(adelie)
```

## __Gaussian Group Elastic Net__

Before using `adelie`, we assume that the user has a feature matrix `X` and a response vector `y`. For simplicity, we assume for now that `X` is a dense matrix, although we will later see that `X` can be substituted with a custom matrix as well. For demonstration, we will randomly generate our data.

```{r}
n = 100  # number of observations
p = 1000 # number of features
set.seed(5) # seed
X = matrix(rnorm(n*p),n,p)
y = X[,1:10] %*% rnorm(10) + rnorm(n) * sqrt(10) # makes SNR = 1
```

### __Lasso__

The most basic call to `grpnet` simply supplies an `X` matrix and a `glm` object. For example, `glm.gaussian(y=y)` returns a Gaussian `glm` object with response variable `y`, which corresponds to the usual least squares loss function. By default, `grpnet` will fit the lasso by setting each feature as its own group. `adelie` is written to treat groups of size `1` in a more optimized manner, so it is a competitive lasso solver that has computation times similar to those of the `glmnet` package.
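To illustrate this point, here is a small sketch (using the data just simulated): under the default, each feature is its own group, so explicitly supplying `p` singleton groups via `groups = 1:p` should reproduce the same fit. The `all.equal()` comparison of predictions is our own sanity check, not part of the package's documented workflow.

```{r lassogroups}
# Sketch: the default (each feature its own group) should match an explicit
# specification with p singleton groups.
fit_default <- grpnet(X, glm.gaussian(y = y))
fit_single  <- grpnet(X, glm.gaussian(y = y), groups = 1:p)
# Predictions along the path at a couple of lambda values should agree.
all.equal(predict(fit_default, newx = X, lambda = c(1.5, 1)),
          predict(fit_single,  newx = X, lambda = c(1.5, 1)))
```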
Like `glmnet`, by default `grpnet` generates a sequence of 100 values of the regularization parameter $\lambda$, evenly spaced on the log scale, if the user does not provide a path. The largest value of $\lambda$ is the smallest value such that all coefficient groups with positive penalty factor $\omega_g$ are zero in the solution. Since `grpnet` is a path-solver, it will warm-start at the next $\lambda$ using the current solution on the path. __For this reason, we recommend that users supply a sufficiently fine grid of__ $\lambda$ __or use the default path!__

```{r}
fit = grpnet(X=X, glm=glm.gaussian(y=y))
print(fit)
```

The `print()` method gives a nice summary of the fit. Note that the solver finished early in this example. By default, the solver terminates early if the fraction of deviance explained reaches $90\%$, as a simple heuristic to avoid overfitting. This threshold can be controlled via the argument `adev_tol`.

The output of `grpnet` is an object of class "grpnet", which contains some simple descriptors of the model, as well as a _state_ object that represents the state of the optimizer. For most use cases, users do not need to inspect the internals of a state object, but rather can extract the useful information via the methods `print()`, `plot()`, `predict()` and `coef()`.

```{r}
plot(fit)
```

Here we plot the coefficient profiles as a function of $-\log(\lambda)$. By default `grpnet` standardizes the features internally, and these coefficients are on the scale of the standardized features. One can avoid standardization via `standardize = FALSE` in the call to `grpnet()`.

We can make predictions at new values for the features --- here we use a subset of the original:

```{r}
pred = predict(fit, newx = X[1:5,],lambda = c(1.5, 1))
pred
```

If you used `standardize = TRUE`, the default, you do not have to standardize `newx`; `adelie` will do it for you automatically.

Finally, we would like to choose a good value for $\lambda$, and for that we use cross-validation. We include a `progress_bar` argument, which can be helpful for big problems. The `plot` method for a `cv.grpnet` object displays the average cross-validated deviance of the model, with approximate standard error bars for the average.

```{r}
fitcv = cv.grpnet(X,glm.gaussian(y),progress_bar = TRUE)
plot(fitcv)
```

Two vertical dashed lines are included: one at the value of $\lambda$ achieving the minimum cross-validated deviance, and another at a larger (more conservative) value of $\lambda$, such that the mean error is within one standard error of the minimum. One can print the fitted `cv.grpnet` object:

```{r}
fitcv
```

One can also predict directly from it:

```{r}
pred = predict(fitcv, newx = X)
pred = predict(fitcv, newx = X, lambda = "lambda.min")
```

In the first case the prediction is at the default value of $\lambda$, which is `lambda = "lambda.1se"`. Alternatively, any numeric values of $\lambda$ can be supplied.

### __Group Lasso__

So far this has been a simple Gaussian lasso fit, and the results would be identical to those produced by `glmnet`. To fit a group lasso, the user simply needs to supply a `groups` argument that defines the starting column index of $X$ for each group. For example, if there are `4` features with two (contiguous) groups of sizes `3` and `1`, respectively, then `groups = c(1, 4)`. For demonstration, we take the same data as before and group every `10` features. We then fit the group lasso using the same function.
```{r}
fitg = grpnet(
    X=X,
    glm=glm.gaussian(y=y),
    groups=seq(from = 1, to = p, by=10)
    )
print(fitg)
plot(fitg)
```

Notice in the printout the active set (df) increases in steps of 10, as expected. Also the coefficient plot colors all coefficients for a group with the same color.

As before, we can use cross-validation to select the tuning parameter.

```{r}
fitgcv = cv.grpnet(
    X,
    glm.gaussian(y),
    groups=seq(from = 1, to = p, by=10),
    progress_bar = TRUE)
plot(fitgcv)
```

## __GLM Group Elastic Net__

In the previous section, we covered how to solve penalized least squares regression. Nearly all of the content remains the same for fitting penalized GLM regression. The only difference is in the choice of the `glm` object. For brevity, we only discuss logistic regression as our non-trivial GLM example; however, the following discussion applies to any GLM.

Let's modify our example and generate a binomial response.

```{r}
eta = X[,1:5] %*% rnorm(5) / sqrt(5)
mu = 1 / (1 + exp(-eta))
y = rbinom(n, size = 1, prob = mu)
```

To solve the group elastic net problem using the logistic loss, we simply provide the binomial GLM object. For simplicity, we fit the lasso problem below.

```{r}
fitb = grpnet(X, glm.binomial(y))
plot(fitb)
```

Cross-validation works as before:

```{r}
fitbcv = cv.grpnet(X,glm.binomial(y))
plot(fitbcv)
```

We can make predictions from the fitted objects as before.

```{r}
predict(fitb, newx = X[1:5,],lambda = c(0.13, 0.07))
```

For GLMs, the predictions are by default on the link ($\eta$) scale. However, one can also make predictions on the _inverse link_ or _response_ scale ($\mu$):

```{r}
predict(fitb, newx = X[1:5,],lambda = c(0.13, 0.07),type="response")
```

In this case we get the estimated probability of class 1. We can specify coefficient groups just as before.

## __Multi-Response GLM Elastic Net__

`grpnet` is also able to fit multi-response GLM elastic nets. Currently, we only support MultiGaussian (least squares) and Multinomial GLMs.

The following code shows an example of fitting a multinomial regression problem. We use the `glm.multinomial()` family, which expects a matrix response $Y$ with $K$ columns (the number of classes). Each row of $Y$ is a proportion vector which sums to 1. In the example below, $Y$ is an indicator matrix, with a single 1 in each row, the rest being zeros. If instead the data were grouped, and each row originally represented the number out of $n_i$ in category $k$, we would first divide each of the counts by $n_i$, turning them into proportions. Then we would supply weights $w_i=n_i$, normalized to sum to 1.0 overall.

```{r}
n = 300 # number of observations
p = 100 # number of features
K = 4   # number of classes
set.seed(7)
X = matrix(rnorm(n*p),n,p)
eta = X[, 1:5] %*% matrix(rnorm(K*5),5,K)/sqrt(5)
probs = exp(eta)
probs = probs/rowSums(probs)
Y = t(apply(probs,1,function(prob)rmultinom(1, 1, prob)))
Y[1:5,]
```

The fitting proceeds as before. We fit a group lasso directly, with groups of 5 chosen as before.

```{r fitm}
grps = seq(from=1, to=p, by = 5)
fitm = grpnet(X, glm.multinomial(Y), groups=grps)
```

```{r out.lines = 10}
print(fitm)
```

```{r plotfitm}
plot(fitm)
```

The coefficient for each feature is a vector (one element per class), so rather than show multiple coefficient profiles, the plot shows the progress of the 2-norm of this vector as a function of $\lambda$. Once again, because of the grouping (into groups of 5 here), the groups are colored the same.
In this case, the first group has all the action, so appears much earlier than the rest.

```{r}
fitmcv = cv.grpnet(X,glm.multinomial(Y),groups=grps)
plot(fitmcv)
```

Another important difference from the single-response case is that the user must be aware of the shape of the returned coefficients, as returned by `coef(fitm)` or `coef(fitmcv)`.

```{r}
names(coef(fitm))
```

We see that `coef()` returns a list. For multi-response GLMs, an intercept is included by default for each response, and hence the `intercepts` component is an $L \times K$ matrix, where $L$ is the number of $\lambda$s in the path. The `betas` component will be an $L \times pK$ sparse matrix, where each successive block of $K$ columns corresponds to the coefficients associated with one feature across all responses.

Fortunately the `predict()` method understands this structure, and behaves as intended.

```{r}
predict(fitmcv,newx = X[1:3,])
```

Here by default the prediction is made at `lambda = "lambda.1se"`. Typically for the multinomial we would be more interested in the predicted probabilities:

```{r}
predict(fitmcv,newx = X[1:3,], lambda="lambda.min", type = "response")
```

Each of the rows sums to 1.

## __Cox Models__

Our final example will be for censored survival data, and we will fit a Cox proportional hazards model. We create an example with a $500 \times 100$ _sparse_ $X$ matrix. In this example we have right censoring: the subjects exit at the times in `y`, and the binary `status` is 1 if the exit is a "death", else 0 if censored.

```{r}
set.seed(9)
n <- 500
p <- 100
X <- matrix(rnorm(n*p), n, p)
X[sample.int(n * p, size = 0.5 * n * p)] <- 0
X_sparse <- Matrix::Matrix(X, sparse = TRUE)
nzc <- p / 4
beta <- rnorm(nzc)
fx <- X[, seq(nzc)] %*% beta / 3
hx <- exp(fx)
y <- rexp(n,hx)
status <- rbinom(n = n, prob = 0.3, size = 1)
```

We now fit an elastic-net model using the `glm.cox()` family, with every set of 5 coefficients in a group.

```{r}
groups = seq(from = 1, to = p, by = 5)
fitcv <- cv.grpnet(X_sparse, glm.cox(stop = y, status = status),
                   alpha = 0.5, groups = groups)
par(mfrow = c(1,2))
plot(fitcv)
plot(fitcv$grpnet.fit)
```

The family `glm.cox()` accommodates start-stop data and strata as well. It handles ties by default using the "Efron" method. See `?glm.cox` for further details.

# __Matrix__

In group elastic net problems, the matrix object plays a crucial role in the performance of the solver. It becomes apparent in our optimization algorithm (and our benchmark analysis) that most of the runtime lies in interacting with the matrix object, e.g. computing inner-products. Hence, a highly performant matrix object implementing a select set of methods that the solver requires will yield tremendous speed gains overall. In addition, we have found numerous examples where a matrix class admits some special structure that can be exploited for further speed and memory gains. One simple example is a large sparse matrix, which cannot fit in memory as a dense matrix. Another example is genomic datasets, which are not only sparse, but take on only 3 possible integer values (see [Examples], to come), and generally have over 160 billion entries with 30\% non-zero entries.

For these reasons, we found it fruitful to abstract out the matrix class, and `adelie` provides a few examples. We discuss below some ways a user may interact with these classes.

## __Some Useful Matrix Methods__

We will go through a few matrix methods that will be useful in group lasso modeling.

### Dense Matrix

The simplest example of such a matrix is a dense matrix.
Let us first construct a dense matrix and wrap it using `matrix.dense`.

```{r dense}
n = 4
p = 2
set.seed(0)
X_dense = matrix(rnorm(n*p),n,p)
print(X_dense)
Xd = matrix.dense(X_dense)
print(Xd)
```

If this is all we are going to pass to `grpnet()`, then we could have saved ourselves the trouble, because this conversion is done anyway in the call to `grpnet()`. But we will see that it can still become useful when we combine matrices.

### Sparse Matrix

Similarly, we can wrap a sparse matrix using `matrix.sparse`. We will make a simple sparse matrix using the function `Matrix::sparseMatrix()`. We will use the _matrix market_ format, where we supply `i, j, x` triples.

```{r sparse}
d = data.frame(
    i = c(2, 1, 4, 1, 2, 4),
    j = c(1, 2, 2, 3, 3, 3),
    x = c(0.184, 0.330, 0.738, 0.576, -0.305, 0.390)
    )
print(d)
require(Matrix)
X_sparse = with(d, sparseMatrix(i, j, x=x, dims=c(4,3)))
print(X_sparse)
Xs = matrix.sparse(X_sparse)
print(Xs)
```

Likewise, if this were all we were going to pass to `grpnet`, we again could have saved the trouble, because it would convert an S4 sparse matrix automatically.

### Concatenated Matrices

Suppose we have both a dense matrix and a sparse matrix. We can concatenate matrix classes together in `adelie`. This enables the software to use the most efficient methods for each when performing matrix multiplies.

```{r concat}
Xds = matrix.concatenate(list(Xd, Xs))
print(Xds)
Xds$rows; Xds$cols
```

By default we concatenated by columns (as in `cbind()`). We could alternatively concatenate by rows, using the `axis=1` argument, but that would fail here since the dimensions are not compatible.

We currently do not have code that allows one to view these matrix objects; that may come in future versions of the package. Each matrix object has `attributes`, which typically contain the original matrix, or matrices if it was concatenated.

### Mixed variables and one-hot encoding

Sometimes we have a mixture of categorical and quantitative variables. In this case we typically expand the categorical variables using _one-hot_ encoding (a.k.a. _dummy variables_). Consider the simple example where we have `X_dense` as above, along with a factor with 3 levels.

```{r mixed}
X_mixed = cbind(X_dense, c(1,0,2,1))
print(X_mixed)
levels = c(1,1,3)
```

Along with this matrix we have an integer vector of levels, which says how many levels there are in each column, with quantitative variables coded as 1. Notice that in `adelie` we represent the values of a factor starting at `0`.

```{r onehot}
Xoh = matrix.one_hot(X_mixed, levels = levels)
Xoh$cols
```

We would like to see what is inside `Xoh`. Since we don't have a print method yet, we will use a bit of trickery to do that, by multiplying it by the identity. Since only row-sparse matrices are allowed (they are meant to be adelie coefficient matrices), we will make one.

```{r inoh}
eye = as(as(diag(5), "generalMatrix"), "RsparseMatrix")
Xoh$sp_tmul(eye)
```

### Standardization

We typically wish to standardize the features before fitting an `adelie` model, with a few exceptions. If some columns represent the _one-hot_ vectors for a factor variable, we generally do not standardize them. Let's use that case as an example. Each matrix class has its own standardize method.

```{r std}
Xohs = matrix.standardize(Xoh)
Xohs$sp_tmul(eye)
```

The standardization information is stored in the attributes of the matrix object.
```{r attrs}
print(attributes(Xohs))
```

# Extended Examples

We now go through some applications where we use `adelie` and its group lasso facilities as a building block. Currently just one example, but more to follow.

## Learning Interactions via Hierarchical Group-Lasso Regularization

In regression settings, we may want to include pairwise interaction terms amongst a subset of features to capture some non-linearity. Moreover, we would like to perform feature selection on the interaction terms as well. However, to achieve an interpretable model, we would like to also impose a hierarchy such that interaction terms are only included in the model if the main effects are included. Michael Lim and Trevor Hastie provide a formulation of this problem using group lasso where the group structure imposes the hierarchy and the group lasso penalty allows for feature selection. For further details see their R package `glinternet` and the reference:
[Learning interactions via hierarchical group-lasso regularization](https://hastie.su.domains/Papers/glinternet_jcgs.pdf)

We implement their approach in `adelie`, and provide two advantages:

1. We can provide interaction models for _all_ the `adelie` response families. The `glinternet` package handles only two cases --- Gaussian and binomial.
2. A computation speedup by a factor of about 25.

We first demonstrate the functions that are provided in the `adelie` package for fitting these models. Then we develop our approach from scratch using `grpnet` as the engine, which helps to explain how the model works.

### Simulation Setup

We will work under a simulation setting. We draw $n$ independent samples $Z_i \in \mathbb{R}^d$ where the continuous features are sampled from a standard normal and the discrete features are sampled uniformly. Remember that factors start at 0, and the maximum value is levels - 1.

```{r datasetup}
require(adelie)
n = 1000
d_cont = 10 # number of continuous features
d_disc = 10 # number of discrete features
set.seed(3) # random seed
Z_cont = matrix(rnorm(n*d_cont),n,d_cont)
levels = sample(2:10,d_disc,replace=TRUE)
Z_disc = matrix(0,n,d_disc)
for(i in seq(d_disc))Z_disc[,i] = sample(0:(levels[i]-1),n,replace=TRUE)
```

It is customary to first center and scale the continuous features so that they have mean 0 and standard deviation 1.

```{r standardize}
sd0 = function(x)sd(x)*sqrt(1-1/length(x)) # SD formula with division by n rather than n-1
Z_cont_means = apply(Z_cont,2,mean)
Z_cont_stds = apply(Z_cont,2,sd0)
Z_cont = scale(Z_cont, Z_cont_means,Z_cont_stds)
```

We now combine these matrices to form one big matrix, and a vector of levels that describes them all.

```{r comb}
Z = cbind(Z_cont,Z_disc)
levels = c(rep(1,d_cont),levels)
print(levels)
```

We construct a response which has main effects for one continuous feature and one discrete feature, plus their interaction.

```{r makey}
xmat = model.matrix(~Z_cont[,1]*factor(Z_disc[,2]))
nc = ncol(xmat)
print(nc)
set.seed(4)
beta = rnorm(nc)
y = xmat%*%beta+rnorm(n)*2.5
```

### Tools in adelie for fitting a glinternet model

We now demonstrate the tools we have constructed to fit interaction models. In the next section, we describe a manual construction, in order to get some insight into what is _under the hood_. We provide a function `glintnet`, deliberately omitting the "er" in the middle to avoid confusion. In the following we consider a model with main effects, and possible interactions between the first variable, and all the others.
```{r fig.width=7, fig.height=5}
set.seed(2)
fit = glintnet(Z,glm.gaussian(y),levels=levels,intr_keys = 1)
cvfit = cv.glintnet(Z,glm.gaussian(y),levels=levels,intr_keys = 1)
par(mfrow=c(1,2))
plot(fit)
plot(cvfit)
```

The generating model has 12 non-zero coefficients (including the intercept), and the more conservative `lambda.1se` appears to have found these. We have some tools for poking in and seeing what was fit. The true model has main effects for variables 1 (continuous) and 12 (a 6-level factor), plus their interaction.

```{r}
predict(cvfit,type="nonzero")
```

There is a main effect for variable 12 (6 coefficients), and an interaction term between variable 12 and variable 1. This interaction term actually accounts for 12 coefficients --- 6 for the interaction, and 6 again for the main effect. The algorithm uses the _overlap group lasso_ to ensure that when interactions are present, so are the main effects. Details can be found in the reference.

Although the model has found the correct interactions, digging in to see the coefficients is less helpful. The following printout is truncated.

```{r}
coef(cvfit)
```

The coefficient matrix is quite wide, and seeing what goes with what is a little cumbersome. We will provide helpful tools in the future to make these more manageable.

Perhaps we made it too easy, by specifying that the only interactions could be with variable 1 (`intr_keys=1`). Now we repeat the process giving no hints to `glintnet`.

```{r fig.width=7, fig.height=5}
set.seed(2)
fit2 = glintnet(Z,glm.gaussian(y),levels=levels)
cvfit2 = cv.glintnet(Z,glm.gaussian(y),levels=levels)
par(mfrow=c(1,2))
plot(fit2)
plot(cvfit2)
```

And the nonzero variables:

```{r}
predict(cvfit2,type="nonzero")
```

It still found them! We do not expect this to happen all the time, but it is encouraging that it found the true signals here.

The structure of `glintnet` and `cv.glintnet` mirrors that of `grpnet` and `cv.grpnet` --- they have a few more arguments to specify the interactions, if needed. Importantly, they have all the methods such as `print()`, `coef()`, and `predict()`.

```{r}
predict(cvfit, newx=Z[1:3,])
predict(cvfit, newx=Z[1:3,],lambda="lambda.min")
print(cvfit)
```

The print method for a `glintnet` fit adds some detail:

```{r}
print(fit)
```

### Details: manual construction of interaction model

Here we go through details that are encapsulated in the `glintnet` function. Readers not interested in the innards should skip this section.

The model starts off with a _one-hot_ encoded version of `Z`, which it then augments with additional blocks of interaction terms. We are now in a position to construct the full feature matrix to fit a group lasso model. As a demonstration, suppose we (correctly) believe that there is a true interaction term containing the first continuous feature, but we do not know the other feature. We therefore wish to construct interactions between the first continuous feature and all other features. It is easy to specify this pairing, as shown below via `intr_keys` and `intr_values`. The following code constructs the interaction matrix `X_int`.

```{r interact}
X_int = matrix.interaction(Z,intr_keys = 1, intr_values=list(NULL),levels=levels)
```

This function creates, for each element of `intr_keys`, _Hadamard_ products with each of the variables listed in `intr_values`:

- For two quantitative features, this just makes one new feature, but includes in addition both of the individual features --- so a 3-column matrix.
- For a quantitative and a discrete variable, it does the same, but with the quantitative feature multiplied into each of the columns of the one-hot matrix for the factor. The one-hot matrix is also included, so if the factor feature has 5 levels, the result here is a 10-column matrix.
- For two discrete features we get the Hadamard product of their two one-hot matrices.

In linear model terminology, these matrices are exactly what is needed to fit a main effect + interaction model for the pair of variables. Read the reference for further understanding of this particular construction.

To put all groups of features on the same relative "scale", we must further center and scale all interaction terms between two continuous features. Then, it can be shown that an interaction block between two discrete features induces a (Frobenius) norm of 1, a discrete and a continuous feature induce a norm of $\sqrt{2}$, and two continuous features induce a norm of $\sqrt{3}$. These values will be used as penalty factors later when we call the group lasso solver. We first compute the necessary centers and scales.

```{r}
pairs = attributes(X_int)[["_pairs"]] # base 0
pair_levels = apply(pairs,1,function(i)levels[i+1]) # result is transposed: one column per pair
is_cont_cont = apply(pair_levels,2,prod) == 1
cont_cont_pairs = pairs[is_cont_cont,]
cont_cont = Z[,cont_cont_pairs[,1]+1]*Z[,cont_cont_pairs[,2]+1] # base 0
centers = rep(0,X_int$cols)
scales = rep(1,X_int$cols)
cont_cont_indices = X_int$groups[is_cont_cont]+2 # base 0
centers[cont_cont_indices+1] = apply(cont_cont,2,mean) # base 0 index, so add 1
scales[cont_cont_indices+1] = apply(cont_cont,2,sd) # base 0 index, so add 1
```

So far we have just focused on the interaction terms. Glinternet requires additional main-effect terms in case the model chooses to ignore the interactions. Now, we construct the full feature matrix including the one-hot encoded main effects as well as the standardized version of the interaction terms using the centers and scales defined above. Note that we have already standardized the quantitative features in the original matrix `Z`.

```{r}
X_one_hot = matrix.one_hot(Z, levels)
X_int_std = matrix.standardize(X_int, centers=centers, scales=scales)
X = matrix.concatenate(list(X_one_hot, X_int_std))
X$cols
```

Before calling `grpnet` we must prepare the grouping and penalty factor information. Once again, we must remember the 0-base issue.

```{r}
groups = c(
    as.vector(X_one_hot$groups),
    X_one_hot$cols + as.vector(X_int$groups)) + 1 # the +1 is the base 0 issue
is_cont_disc = apply(pair_levels-1,2,function(x)xor(x[1],x[2]))
penalty_int = rep(1, length(X_int$groups))
penalty_int[is_cont_cont] = sqrt(3)
penalty_int[is_cont_disc] = sqrt(2)
penalty = c(rep(1, length(X_one_hot$groups)), penalty_int)
```

Finally we can call the group-lasso solver with our prepared inputs.

```{r fig.width=7, fig.height=5}
set.seed(2)
fit = grpnet(X,glm.gaussian(y),groups=groups,penalty=penalty, standardize=FALSE)
cv.fit = cv.grpnet(X,glm.gaussian(y),groups=groups,penalty=penalty, standardize=FALSE)
par(mfrow=c(1,2))
plot(fit)
plot(cv.fit)
```

## References