ktweedie
: Kernel-based Tweedie compound Poisson gamma
model using high-dimensional covariates for the analyses of
zero-inflated response variables. ================
ktweedie
is a package that fits nonparametric Tweedie
compound Poisson gamma models in the reproducing kernel Hilbert space.
The package is based on two algorithms, the ktweedie
for
kernel-based Tweedie model and the sktweedie
for sparse
kernel-based Tweedie model. The ktweedie
supports a wide
range of kernel functions implemented in the R
package
kernlab
and the optimization algorithm is a
Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm with bisection line
search. The package includes cross-validation functions for
one-dimensional tuning of the kernel regularization parameter and for two-dimensional joint tuning of one
kernel parameter and . The sktweedie
uses variable
weights to achieve variable selection. It is a meta-algorithm that
alternatively updates the kernel parameters and a set of variable
weights.
The ktweedie
solves the problem
where is the index parameter, is the dispersion parameter, is an kernel matrix computed according to the user-specified kernel function , whose entries are are calculated based on the -dimensional predictors from subjects . In the kernel-based Tweedie model, the mean parameter for the -th observation is modelled by
The sktweedie
solves
where involves variable weights .
install.packages("ktweedie")
::install_github("ly129/ktweedie") devtools
First we load the ktweedie
package:
library(ktweedie)
The package includes a toy data for demonstration purpose. The predictor matrix x
is
generated from standard normal distribution and y
is
generated according to
where . That said, only the first two predictors are associated with the response.
data(dat)
<- dat$x
x <- dat$y y
An input matrix x
and an output vector y
are now loaded. The ktd_estimate()
function can be used to
fit a basic ktweedie
model. The regularization parameter
lam1
can be a vector, which will be solved in a decreasing
order with warm start.
<- ktd_estimate(x = x,
fit.ktd y = y,
kern = rbfdot(sigma = 0.1),
lam1 = c(0.01, 0.1, 1))
str(fit.ktd$estimates)
#> List of 3
#> $ lambda 1 :List of 3
#> ..$ fn : num 110
#> ..$ coefficient: num [1:30, 1] 0.5558 -0.062 -0.0381 0.0523 -0.0251 ...
#> ..$ convergence: int 0
#> $ lambda 0.1 :List of 3
#> ..$ fn : num 51
#> ..$ coefficient: num [1:30, 1] 1.662 -0.235 -0.177 0.867 -0.143 ...
#> ..$ convergence: int 0
#> $ lambda 0.01:List of 3
#> ..$ fn : num 39.2
#> ..$ coefficient: num [1:30, 1] 7.692 -0.49 -0.841 4.624 -0.696 ...
#> ..$ convergence: int 0
fit.ktd$estimates
stores the estimated coefficients and
the final objective function value. The estimated kernel-based model
coefficients for the -th lam1
can be accessed by the index
l
: fit.ktd$estimates[[l]]$coefficient
.
The function can also be used to fit the sktweedie
model
by setting the argument sparsity
to TRUE
, and
specifying the regularization coefficient in the argument lam2
.
<- ktd_estimate(x = x,
fit.sktd y = y,
kern = rbfdot(sigma = 0.1),
lam1 = 5,
sparsity = TRUE,
lam2 = 1)
And we can access the fitted coefficients in a similar manner to the
fit.ktd
. Additionally, the fitted variable weights can be accessed by
$estimates[[1]]$weight
fit.sktd#> [,1]
#> [1,] 1.0000000
#> [2,] 0.4462078
#> [3,] 0.0000000
#> [4,] 0.0000000
#> [5,] 0.0000000
Variables with weights close to 0 can be viewed as noise variables.
The ktweedie
and sktweedie
algorithms
require careful tuning of one to multiple hyperparameters, depending on
the choice of kernel functions. For the ktweedie
, we
recommend either a one-dimensional tuning for lam1
() or a two-dimensional random search for
lam1
and the kernel parameter using cross-validation.
Tuning is achieved by optimizing a user-specified criterion, including
log likelihood loss = "LL"
, mean absolute difference
loss = "MAD"
and root mean squared error
loss = "RMSE"
. Using the Laplacian kernel as an
example.
laplacedot(sigma = 1)
#> Laplace kernel function.
#> Hyperparameter : sigma = 1
The one-dimensional search for the optimal lam1
, can be
achieved with the ktd_cv()
function from a user-specified
vector of candidate values:
<- ktd_cv(x = x,
ktd.cv1d y = y,
kern = laplacedot(sigma = 0.1),
lambda = c(0.0001, 0.001, 0.01, 0.1, 1),
nfolds = 5,
loss = "LL")
ktd.cv1d#> $LL
#> 1 0.1 0.01 0.001 1e-04
#> -82.30040 -60.33054 -55.68177 -55.68835 -65.38823
#>
#> $Best_lambda
#> [1] 0.01
The two-dimensional joint search for the optimal lam1
and sigma
requires ktd_cv2d()
. In the example
below, a total of ncoefs = 10
pairs of candidate
lam1
and sigma
values are randomly sampled
(uniformly on the log scale) within the ranges
lambda = c(1e-5, 1e0)
and
sigma = c(1e-5, 1e0)
, respectively. Then the
nfolds = 5
-fold cross-validation is performed to select the
pair that generates the optimal cross-validation
loss = "MAD"
.
<- ktd_cv2d(x = x,
ktd.cv2d y = y,
kernfunc = laplacedot,
lambda = c(1e-5, 1e0),
sigma = c(1e-5, 1e0),
nfolds = 5,
ncoefs = 10,
loss = "MAD")
ktd.cv2d#> $MAD
#> Lambda=0.000435692, Sigma=0.174196 Lambda=0.00855899, Sigma=0.00201436
#> 354.1993 431.4734
#> Lambda=0.00518177, Sigma=0.000749782 Lambda=7.25693e-05, Sigma=0.0620986
#> 469.7289 327.0395
#> Lambda=0.0513091, Sigma=0.000344321 Lambda=0.0108477, Sigma=0.000277883
#> 626.3884 589.4097
#> Lambda=9.72691e-05, Sigma=2.19179e-05 Lambda=0.0682224, Sigma=0.000455657
#> 433.5755 624.1514
#> Lambda=0.000228745, Sigma=0.0247239 Lambda=0.166265, Sigma=0.00695988
#> 332.0113 544.0900
#>
#> $Best_lambda
#> [1] 7.25693e-05
#>
#> $Best_sigma
#> [1] 0.0620986
Then the model is fitted using the hyperparameter(s) selected by the
ktd_cv()
or ktd_cv2d()
. In the example below,
the selected lam1
and sigma
values are
accessed by ktd.cv2d$Best_lambda
and
ktd.cv2d$Best_sigma
, which are then be fed into the
ktd_estimate()
to perform final model fitting.
<- ktd_estimate(x = x,
ktd.fit y = y,
kern = laplacedot(sigma = ktd.cv2d$Best_sigma),
lam1 = ktd.cv2d$Best_lambda)
str(ktd.fit$estimates)
#> List of 1
#> $ lambda 7.25693e-05:List of 3
#> ..$ fn : num 36.6
#> ..$ coefficient: num [1:30, 1] 24.82 -9.63 -17.4 44.79 3.7 ...
#> ..$ convergence: int 0
For the sktweedie
, only the Gaussian radial basis
function (RBF) kernel rbfdot()
is supported. We recommend
using the same set of tuned parameters as if a ktweedie
model is fitted and tuning lam2
manually:
<- ktd_cv2d(x = x,
sktd.cv2d y = y,
kernfunc = rbfdot,
lambda = c(1e-3, 1e0),
sigma = c(1e-3, 1e0),
nfolds = 5,
ncoefs = 10,
loss = "LL")
<- ktd_estimate(x = x,
sktd.fit y = y,
kern = rbfdot(sigma = sktd.cv2d$Best_sigma),
lam1 = sktd.cv2d$Best_lambda,
sparsity = TRUE,
lam2 = 1,
ftol = 1e-3,
partol = 1e-3,
innerpartol = 1e-5)
The function ktd_predict()
can identify necessary
information stored in ktd.fit$data
and
sktd.fit$data
to make predictions at the user-specified
newdata
. If the argument newdata
is
unspecified, the prediction will be made at the original x
used in model training and fitting.
<- ktd_predict(ktd.fit, type = "response")
ktd.pred head(ktd.pred$prediction)
#> [,1]
#> [1,] 6.448220e+02
#> [2,] 1.750695e-03
#> [3,] 9.215399e-02
#> [4,] 4.713962e+00
#> [5,] 1.678452e-01
#> [6,] 1.650646e+00
If newdata
with the same dimension as x
is
provided, the prediction will be made at the new data points.
# Use a subset of the original x as newdata.
<- x[1:6, ]
newdata <- ktd_predict(ktd.fit,
ktd.pred.new newdata = newdata,
type = "response")
<- ktd_predict(sktd.fit,
sktd.pred.new newdata = newdata,
type = "response")
data.frame(ktweedie = ktd.pred.new$prediction,
sktweedie = sktd.pred.new$prediction)
#> ktweedie sktweedie
#> 1 6.448220e+02 421.931421
#> 2 1.750695e-03 22.543092
#> 3 9.215399e-02 23.415272
#> 4 4.713962e+00 1.642355
#> 5 1.678452e-01 12.034229
#> 6 1.650646e+00 122.187222
In practice, the variable selection results of the
sktweedie
is more meaningful. An effective way to fit the
sktweedie
is to start with an arbitrarily big
lam2
that sets all weights to zero and gradually decrease
its value. Note that a warning message is generated for the first
lam2
, suggesting that all weights are set to zero.
<- 10
nlam2 <- 20 * 0.8^(1:nlam2 - 1)
lam2.seq <- matrix(NA, nrow = nlam2, ncol = ncol(x))
wts for (i in 1:nlam2) {
<- ktd_estimate(x = x,
sktd.tmp y = y,
kern = rbfdot(sigma = sktd.cv2d$Best_sigma),
lam1 = sktd.cv2d$Best_lambda,
sparsity = TRUE,
lam2 = lam2.seq[i],
ftol = 1e-3,
partol = 1e-3,
innerpartol = 1e-5)
<- sktd.tmp$estimates[[1]]$weight
wt.tmp if (is.null(wt.tmp)) wts[i, ] <- 0 else wts[i, ] <- wt.tmp
}#> WARNING: All weights are zero in weight update iteration:
#> [1] 2
# plot the solution path with graphics::matplot()
matplot(y = wts,
x = lam2.seq,
type = "l",
log = "x",
ylab = "Weights",
xlab = expression(paste(lambda)),
lwd = 2)
legend("topright",
title = "w index",
legend = 1:5,
lty = 1:5,
col = 1:6,
lwd = 2)