--- title: "BayesChange Tutorial" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{BayesChange Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Here we provide a brief tutorial of the `BayesChange` package. The `BayesChange` package contains two main functions: one that performs change points detection on time series and epidemic diffusions and one that perform clustering of time series and epidemic diffusions with common change points. Here we briefly show how to implement these. ```{r setup} library(BayesChange) ``` ## Detecting change points The function `detect_cp` provide a method for detecting change points, it is based on the work @MM2014 and on @corradin2024 Depending on the structure of the data, `detect_cp` might perform change points detection on univariate time series or multivariate time series. We import dataset `eu_inflation` that contains the standardized monthly inflation rates from the Harmonized Index of Consumer Prices (HICP) for the 12 COICOP expenditure categories across European Union countries. The data span the period from February~1997 to December~2024 resulting in a matrix of $12$ rows and $355$ columns. ```{r} data("eu_inflation") ``` Now we can run the function `detect_cp`, as arguments of the function we need to specify the number of iterations, the number of burn-in steps and a list with the the autoregressive coefficient `phi` for the likelihood of the data, the parameters `a`, `b`, `c` for the priors and the probability `q` of performing a split at each step. Since we deal with time series we need also to specify `kernel = "ts"`. ```{r} out <- detect_cp(data = eu_inflation[1,], n_iterations = 2000, n_burnin = 500, q = 0.5, params = list(prior_var_phi = 0.1, prior_delta_c = 1, prior_delta_d = 1), kernel = "ts") ``` With the methods `print` and `summary` we can get information about the algorithm. ```{r} print(out) summary(out) ``` In order to get a point estimate of the change points we can use the method `posterior_estimate` that uses the method *salso* by @Dahl02102022 to get the final latent order and then detect the change points. ```{r} cp_est <- posterior_estimate(out, loss = "binder") cumsum(table(cp_est))[-length(table(cp_est))] + 1 ``` The package also provides a method for plotting the change points. ```{r} plot(out, loss = "binder") ``` We can assess convergence of the latent order posterior chain, for example, by inspecting the traceplot of its log-likelihood with `coda::traceplot`. ```{r} coda::traceplot(out$lkl_MCMC, ylab = "Log-Likelihood") ``` If we instead consider a matrix of data, `detect_cp` automatically performs a multivariate change points detection method. We define the parameters. ```{r} params_multi <- list(m_0 = rep(0,3), k_0 = 1, nu_0 = 10, S_0 = diag(0.1,3,3), prior_var_phi = 0.1, prior_delta_c = 1, prior_delta_d = 1) ``` Arguments `k_0`, `nu_0`, `phi_0`, `m_0`, `prior_delta_c`, `prior_delta_d` and `prior_var_phi` correspond to the parameters of the prior distributions for the multivariate likelihood. ```{r} out <- detect_cp(data = eu_inflation[1:3,], n_iterations = 2000, n_burnin = 500, q = 0.5, params = params_multi, kernel = "ts") table(posterior_estimate(out, loss = "binder")) ``` ```{r} plot(out, loss = "binder", plot_freq = TRUE) ``` Function `detect_cp` can also be used to detect change points on survival functions. We consider the synthetic dataset `epi_synthetic` ```{r} data("epi_synthetic") ``` To run `detect_cp` on epidemiological data we need to set `kernel = "epi"`. Moreover, besides the usual parameters, we need to set the number of Monte Carlo replications `M` for the approximation of the integrated likelihood and the recovery rate `xi`. `a0` and `b0` are optional and correspond to the parameters of the gamma distribution for the integration of the likelihood. ```{r} params_epi <- list(M = 250, xi = 1/8, a0 = 4, b0 = 10, I0_var = 0.1) out <- detect_cp(data = epi_synthetic, n_iterations = 2000, n_burnin = 500, q = 0.25, params = params_epi, kernel = "epi") print(out) ``` Also here, with function `plot` we can plot the survival function and the position of the change points. ```{r} plot(out) ``` ## Clustering time dependent data with common change points `BayesChange` contains another function, `clust_cp`, that cluster respectively univariate and multivariate time series and survival functions with common change points. Details about this methods can be found in @CORRADIN202226 In `clust_cp` the argument `kernel` must be specified, if data are time series then `kernel = "ts"` must be set. Then the algorithm automatically detects if data are univariate or multivariate. We consider for this example dataset `stock_uni` that contains the daily mean stock prices for the 50 largest companies (by market capitalization) in the Standard&Poor’s 500 Index from January 1, 2020 to January 1, 2022. ```{r} data("stock_uni") ``` Arguments that need to be specified in `clust_cp` are the number of iterations `n_iterations`, the number of elements in the normalisation constant `B`, the split-and-merge step `L` performed when a new partition is proposed and a list with the parameters of the algorithm, the likelihood and the priors.. ```{r} params_uni <- list(a = 1, b = 1, c = 1, phi = 0.1) out <- clust_cp(data = stock_uni[1:5,], n_iterations = 2000, n_burnin = 500, L = 1, q = 0.5, B = 1000, params = params_uni, kernel = "ts") posterior_estimate(out, loss = "binder") ``` Method `plot` for clustering univariate time series represents the data colored according to the assigned cluster. ```{r} plot(out, loss = "binder") ``` Method `plot_psm` shows the posterior similarity matrix of the clustering. Selecting `reorder = TRUE` we can choose to order the matrix depending on the clustering obtained. ```{r} plot_psm(out, reorder = TRUE) ``` If time series are multivariate, data must be an array, where each element is a multivariate time series represented by a matrix. Each row of the matrix is a component of the time series. Here we use dataset `stock_multi` that contains for each company the daily opening and closing stock prices. ```{r} data("stock_multi") ``` ```{r} params_multi <- list(m_0 = rep(0,2), k_0 = 1, nu_0 = 10, S_0 = diag(1,2,2), phi = 0.1) out <- clust_cp(data = stock_multi[,,1:5], n_iterations = 2500, n_burnin = 500, L = 1, B = 1000, params = params_multi, kernel = "ts") posterior_estimate(out, loss = "binder") ``` ```{r} plot(out, loss = "binder") ``` Finally, if we set `kernel = "epi"`, `clust_cp` cluster survival functions with common change points. Also here details can be found in @CORRADIN202226 Data are a matrix where each row is the number of infected at each time. Inside this package is included the dataset `epi_synthetic_multi` with multivariate synthetic epidemic diffusions. ```{r, eval = FALSE} data("epi_synthetic_multi") params_epi <- list(M = 100, xi = 1/8, alpha_SM = 1, a0 = 4, b0 = 10, I0_var = 0.1, avg_blk = 2) out <- clust_cp(epi_synthetic_multi[,10:150], n_iterations = 2000, n_burnin = 500, L = 1, B = 1000, params = params_epi, kernel = "epi") posterior_estimate(out, loss = "binder") plot(out, loss = "binder") ```