--- title: "Node support values using treestructure" author: "Vinicius Franceschi and Fabricia F. Nascimento" date: "`r Sys.Date()`" output: bookdown::html_vignette2: #rmarkdown::html_vignette #bookdown::pdf_book: toc: TRUE bibliography: bib/tutorials.bib pkgdown: as_is: true fontsize: 12pt vignette: > %\VignetteIndexEntry{Node support values using treestructure} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 11, warning = FALSE, message = FALSE ) ``` # Introductions This tutorial uses the public data available for Ebola [here](https://github.com/ebov/space-time) to demonstrate the use of node support values (e.g. bootstrap and posterior probability) to avoid designating population structure in badly supported clades. We will use their [time-scaled phylogenetic tree](https://github.com/ebov/space-time/blob/master/Data/Makona_1610_cds_ig.GLM.MCC.tree) estimated with [BEAST](https://beast.community). First, we need to load the R package we will use in this tutorial. ```{r message=FALSE} library(treeio) library(ggtree) library(treestructure) ``` Now, we will load the time-scaled phylogenetic tree with posterior probability support values: ```{r} #get the dated tree by first downloading it from the URL below tree_url <- "https://raw.githubusercontent.com/ebov/space-time/master/Data/Makona_1610_cds_ig.GLM.MCC.tree" tmp_file <- tempfile(fileext = ".tree") #Download BEAST tree to a tmp file download.file(tree_url, tmp_file, mode = "wb") #read the downloaded tree beast_tree <- read.beast(tmp_file) ``` Then we need to convert the tree to a `phylo` object and add the posterior probability information from the Bayesian tree to the `phylo` object tree. If we use the `ape:read.nexus` to read the tree that was estimated with BEAST, we won't get the posterior probability associated to the clades. We now convert the dated tree to `phylo` object. ```{r} dated_tre <- as.phylo(beast_tree) ``` Now we add the posterior probability from the BEAST tree to the dated_tree object. ```{r} # Get number of tips n_tips <- length(dated_tre$tip.label) # Get BEAST tree as tibble (it will include node numbers and posterior probabilities) tree_data <- as_tibble(beast_tree) #get posterior probability posterior <- as.numeric(tree_data$posterior[(n_tips + 1):nrow(tree_data)]) #add the posterior probability information to the `phylo` object dated_tre$node.label <- posterior ``` ## Assign clusters without using node support Firstly, we will assign clusters without using node support values. Note that the parameter `nodeSupportValues` is set to FALSE. ```{r eval=FALSE} trestruct_res_nobt <- trestruct(dated_tre, minCladeSize = 30, nodeSupportValues = FALSE, level = 0.01) ``` Because `treestructure` will take several minutes to run, we can load the results: ```{r} trestruct_res_nobt <- readRDS( system.file('trestruct_res_nobt.rds', package='treestructure') ) plot(trestruct_res_nobt, use_ggtree = T) + ggtree::geom_tippoint() ``` The `treestructure` analyses resulted in 23 clusters. ## Assign clusters using branch support Now, we will assign clusters using the information on branch support. As we are analyzing a dated tree estimated with BEAST, the branch support is the posterior probability. We designate clusters that have at least 0.95 posterior probability. This is achieved by setting to 95 the parameter _nodeSupportThreshold_ in the **trestruct** function. Note that now the parameter `nodeSupportValues` is set to TRUE, which tells the algorithm that the node support values is provided with the `phylo` object as `node.label`. You can also provide the `nodeSupportValues` as a vector with length equal to the number of internal nodes in the tree. ```{r eval=FALSE} trestruct_res <- trestruct(dated_tre, minCladeSize = 30, nodeSupportValues = TRUE, nodeSupportThreshold = 95, level = 0.01) ``` Because it will take a minute or so to run `treestructure`, we can load the result instead. ```{r} trestruct_res <- readRDS( system.file('trestruct_res.rds', package='treestructure') ) plot(trestruct_res, use_ggtree = T) + ggtree::geom_tippoint() ``` Now we have only 6 well-supported clusters with differences in coalescent patterns. Note that this might change if you use a higher or lower value for the *nodeSupportThreshold* in the **trestruct** function. ## Using the CH-index The CH-index will compare different values for the parameter `levels` in the `treestructure` algorithm. We will need to specify `level` = NULL and specify the lower and upper bound to optimize the parameter `level`. Note that we will also take into consideration the use of branch support. ```{r eval = FALSE} trestruct_chindex <- trestruct(dated_tre, minCladeSize = 30, nodeSupportValues = TRUE, nodeSupportThreshold = 95, level = NULL, levellb = 0.0001, levelub = 0.001) ``` Because it will also take a minute or so to run `treestructure`, we can load the result instead. ```{r} trestruct_chindex <- readRDS( system.file('trestruct_chindex.rds', package='treestructure') ) plot(trestruct_chindex, use_ggtree = T) + ggtree::geom_tippoint() ``` Now we have only 2 well-supported clusters with differences in coalescent patterns after optimizing the significance level. In this example, instead of using level = 0.01 (as in the default ), the optimization of level by using the CH-index identified level as 0.00073. ### Final considerations The function `trestruct` has a parameter named `nsim` that refers to the number of simulations for computing the null distribution of the test statistics [@Volz2020]. The `nsim` parameter in the `trestruct` function is is set to the default value of 10,000, which is a good value if the level parameter is set to 0.01 (which is the default value). If decreasing the value of the level parameter, the value of `nsim` should be increased. In the above example using CH-index, the algorithm identified a significance level of 0.00073. However, the analyses were carried out using the default `nsim` of 10,000 simulations. We have future plans to automatically detect the adequate number of simulations (`nsim` value) given the sizes of clades being compared and the threshold p-value used in the test. As treestructure is implemented in the moment, you should run a sensitive analyses on your own dataset, if you use a significance level smaller than 0.01. For example, if you use a significance `level` of 0.001 and a `nsim` > 10,000, then you can run your analysis twice and compare the results obtained in both runs, which should be very similar. If results are very different, you should increase the value of `nsim`. # References