This vignette aims to give an introduction on how to use the
iimi
package for plant virus diagnostics and how to
visualize the coverage profile for the sample mapping. We also included
a tutorial on creating unreliable regions.
First, let’s install necessary packages. You may skip this step if you have installed the packages before.
# install iimi
install.packages(c("iimi", "httr"))
# install Biostrings
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("Biostrings") BiocManager
We will load necessary packages before we start any analysis.
library(iimi)
library(Biostrings)
library(httr)
To get started creating coverage profiles and feature-extracted data
frames, we need mapping results. We used Bowtie2 to map the samples
(paired-end or single-end) against the official Virtool
virus data base (ver. 1.4.0). We recommend Bowtie2 or minimap2 since
we have tried both and they yield similar result with minimap2 having a
slight decrease. We let both software to report all alignments
(-a
mode for Bowtie2, --secondary=yes
for
minimap2). You can also use other mapping tools. Make sure that you
convert your mapping results to indexed and sorted BAM files using
Samtools.
We provide three example BAM files to demonstrate how to use iimi. These files are sourced from the dataset used in the VirHunter paper (Sukhorukov et al. 2022), which we also utilized for external validation in our manuscript (Ning et al. 2024). You can download these files directly from the Recherche Data Gouv website (Candresse, Marais-Colombel, and Brault 2022). We recommend storing all BAM files in a single folder for ease of access. Here, we provide a short tutorial to guide you through using iimi functions to make predictions on the real data.
Let’s convert the indexed and sorted BAM file(s) into coverage profiles and feature-extracted data frame.
We will use the coverage profiles to visualize the mapping information. The feature-extracted data frame will be used in the model training and testing process.
Note that both training and testing data need to go through the conversion step. In our example, we stored the conversion for both the testing and training datasets in the same object. You can do the conversion separately for your data.
Important: the example code does not work unless the path to the folder that stores your BAM files is provided.
If you already have coverage profiles in run-length encoding (RLE) format, go to section 2.2.
<- list.files(
path_to_bamfiles path = "path/to/your/BAM/files/folder",
pattern = "bam$",
full.names = TRUE,
include.dirs = TRUE
)
You may skip this step if you already have converted them to RLE format.
<- convert_bam_to_rle(bam_file = "path_to_bamfiles") example_cov
In this section, you have the option to use the provided unreliable
regions. We recommend to enable the profiling and filtering step as it
eliminates false peaks. Examples will be provided in section 4. If you
wish to disable this mode, simply set
unreliable_region_enabled = FALSE
. If you wish to enable
this mode, you do not need extra codes. If you wish to use your own
unreliable regions, please refer to section 3 and input
unreliable_region_df
with the unreliable regions that you
customized. If you would like to use the provided mappability profile,
you do not need extra codes. Here, we enabled the mappability profile
and nucleotide filtering mode and uses the provided mappability
profile.
<-
df convert_rle_to_df(covs = example_cov, unreliable_region_df = unreliable_regions)
To make predictions, use the converted mapping result of the
sample(s) that you wish to detect as the input, newdata
.
Make sure you have converted the indexed and sorted BAM files into
feature-extracted data frame from the section above.
After preparing your test sample, you can choose to test the data
using our provided training model or the model you trained using
train_iimi()
. The tutorial of training your own model is
provided in the next section.
Note: if you wish to customize unreliable regions, please go to 3.3.
If you wish to use provided training model, only input your data to
newdata
and choose a method of your wish using
predict_iimi()
.
There are three methods that you may choose from: xgb
,
en
, and rf
, which stand for pre-trained
XGBoost, elastic net, and random forest models. The example below uses
the pre-trained XGBoost model.
<- predict_iimi(newdata = df, method = "xgb") prediction_default
The detection of your plant sample(s) is finished. The prediction is
TRUE
if virus infected the sample, FALSE
if
virus did not infect the sample.
If you would like to train your own model and use this model to test your data, you can use the codes below to train a new model with your own data.
Ideally, the number of the samples used to train the model should be
bigger than 100. However, we are only providing a tutorial on how to use
the train_iimi()
function, only two samples are used to
train the model since example_cov()
only contains three
in-house data’s coverage information.
First, we need to prepare our training data. We are using a 80/20 random split to split the three samples. This means that two samples are used as the training data, and one sample is used as the testing data. If you are training your own data, treat the training data as your data that you want to train the model on; treat the testing data as your data that you would like to have a prediction on.
If you have separate training and testing data, you may only need to convert your data from mapping results (BAM files) to fetaure-extracted data frame and name your data as the following variables.
Here are some definitions/explanation of the objects to input in
train_iimi()
:
train_x
: the feature-extracted data frame of plant
samples that you would like to train iimi model on. Make sure that you
have mapped the samples to the virus database and converted the mapping
result to sorted and indexes BAM files.
train_y
: the known truth or labels for your
train_x
data. Please label the data to make sure that it
has a detection label for virus segments as well.
test_x
: the feature-extracted data frame of plant
samples that you would like to predict using your trained iimi model.
Make sure that you have mapped the samples to the virus database and
converted the mapping result to sorted and indexes BAM files.
# set seed
set.seed(123)
# spliting into 80-20 train and test data set with the three plant samples
<- sample(levels(as.factor(df$sample_id)),
train_names length(unique(df$sample_id)) * 0.8)
# trian data
# train_x is the feature-extracted data frame of your train data
= df[df$sample_id %in% train_names,]
train_x
# train_y is the known truth or labels for your train_x data, indicating the presence of specific viruses in the samples
= c()
train_y
for (ii in 1:nrow(train_x)) {
= append(train_y, example_diag[train_x$seg_id[ii],
train_y $sample_id[ii]])
train_x
}
# test data
# test_x is the feature-extracted data frame of the data you would like to predict
# here we used the sample that is not in the training set for demonstration purpose
= df[df$sample_id %in% train_names == F,] test_x
Then, we plug in the variables into the train_iimi
function with the default XGBoost model.
<- train_iimi(train_x = train_x, train_y = train_y) fit
Now, we have a trained model using the toy data.
Then, the process to detect which viruses infect the plant sample(s) is the same as previously described, except we are using a customized trained model.
<-
prediction_customized predict_iimi(newdata = test_x,
trained_model = fit)
The detection of the plant sample(s) is finished. The interpretation is the same as above.
Note: if you would like to create your own unreliable regions, please customize them first, then extract features to build a data frame from section 2.2.2. using customized unreliable regions.
If you would like to create your own mappability profile and high
nucleotide content regions besides from using your own training model,
you may use create_mappability_profile()
and
high_nucleotide_regions()
. Both functions’ output is a data
frame with the start and end position of the unmappable region, the
virus that the region is on, and the category that it belongs to.
Mappability profile is a profile of areas on a virus genome that can be mapped to (1) other viruses or (2) host genome. We choose Arabidopsis Thaliana as our host genome.
High nucleotide content regions is a profile of areas on a virus genome that has (1) high GC content and/or (2) high A nucleotide percentage.
Including these two profiles into iimi ensures that there are no false peaks like the ones described in the previous section.
Here is a short tutorial to make mappability profile.
First, split each of the virus segment from the virus database into a sliding window series with window size of your choice and with step size 1. The default value for window size is 75. You may choose any window size you want.
Then, map one virus segment with each other, until you finish mapping it to all virus segments in the virus database. Also map the virus segment with a host genome of your choice. We chose to use Arabidopsis Thaliana.
After mapping, sort and index the resulted BAM files from the mapping step.
Next, it is time to assemble the mappability profile:
# if you would like to keep unmappable regions that can be mapped to other viruses or the host genome separate into two data frames, you may use the following code:
# input your own path that you would want to store regions on a virus that can be mapped to another virus
# you can customize the name of this type of mappability profile
<-
mappability_profile_virus create_mappability_profile("path/to/bam/files/folder/virus", category = "Unmappable region (virus)")
# input your own path that you would want to store regions on a virus that can be mapped to the host genome
# you can customize the name of this type of mappability profile
<-
mappability_profile_host create_mappability_profile("path/to/bam/files/folder/host", category = "Unmappable region (host)")
# if you would like to keep everything in the same data frame, you may use the following code:
<-
mappability_profile create_mappability_profile("path/to/bam/files/folder/of/both/types/", category = "Unmappable region")
Creating the high nucleotide content regions is much easier than the
mappability profile. We only need to use
create_high_nucleotide_content()
function.
Here is an example:
<-
high_nucleotide_regions create_high_nucleotide_content(gc = 0.6, a = 0.45)
The default threshold for GC content is 60% and is 45% for A%. The thresholds are customizable.
You have created the mappability profile and the high nucleotide regions and can use them to convert your training and testing data to feature-extracted data frames. Please refer to section 2.2.2. to see how to do so.
Next, we can visualize the coverage profile by using the
plot_cov()
function.
plot_cov()
: plots the coverage profile of the plant
sample and the percentage of A nucleotides and GC content for a sliding
window of k-mer with the step as
<- par(mfrow = c(1, 2))
oldpar
## if you wish to plot all segments of one sample, you can try:
# plot_cov(covs = example_cov["S1"])
## if you wish to plot all segments from all samples, you can try:
# plot_cov(covs = example_cov)
## if you wish to plot certain segments from one sample, you can try:
= c("42jtlrir", "m0kacxse")
segs = list()
covs_selected $`305S` <-
covs_selected$`305S`[segs]
example_cov
## if you have many segments that you would want to plot, you can try the following code with the numbers changed
## to find the index of your desired segments:
$S1 <-
covs_selected$S1[names(example_cov$S1)[c(1,72)]]
example_cov
par(mar = c(2, 4, 1, 1))
layout(matrix(c(1, 1, 2, 3, 3, 4), nrow = 3))
plot_cov(covs = covs_selected)
par(oldpar)
This gives us a general idea of what the potential viruses are.