---
title: "Using An Export Manager"
author: "James P. Gilbert"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Using An Export Manager}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# Introduction

OHDSI studies often have very specific requirements around exposing patient details: exported results must conform to a well-defined data model, and small cell counts may need to be censored before results are shared. An export manager validates results against such a specification at the time of export.

# Creating the export manager for a package

A table specification must be defined for each table to be exported.
Crucially, the data types, column names, primary keys and related settings are always validated at the time of export.
You cannot export data that does not conform to this model, so make sure that the model matches the results schema
that the data are being imported into.
It is assumed that package developers will include this specification in the unit tests of their package (see the test sketch after the code block below).

```{r, eval=FALSE}
library(ResultModelManager)

tableSpecification <- dplyr::tibble(
  tableName = c(
    "my_table", "my_table", "my_table", "my_table", "my_table", "my_table", "my_table",
    "my_andromeda_table", "my_andromeda_table", "my_andromeda_table"
  ),
  columnName = c(
    "database_id", "target_cohort_id", "comparator_cohort_id", "target_count", "comparator_count", "rr", "p_value",
    "database_id", "covariate_id", "value"
  ),
  primaryKey = c(
    "yes", "yes", "no", "no", "no", "no", "no",
    "yes", "yes", "no"
  ),
  minCellCount = c(
    "no", "no", "no", "yes", "yes", "no", "no",
    "no", "no", "no"
  ),
  dataType = c(
    "varchar(255)", "int", "int", "int", "int", "float", "float",
    "varchar(255)", "bigint", "float"
  )
)

# Using a separate export folder per database is a good principle to follow
exportDir <- "output_folder/example_cdm"
exportManager <- createResultExportManager(
  tableSpecification = tableSpecification,
  exportDir = exportDir,
  databaseId = "example_cdm"
)
```
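As a minimal sketch of such a unit test, assuming the `testthat` framework and using `generateSqlSchema()` from `ResultModelManager` to confirm the specification can produce valid DDL:

```{r, eval=FALSE}
# A minimal test sketch, assuming testthat; generateSqlSchema() will error
# if the specification is malformed
testthat::test_that("table specification is valid", {
  testthat::expect_s3_class(tableSpecification, "data.frame")
  ddl <- ResultModelManager::generateSqlSchema(schemaDefinition = tableSpecification)
  testthat::expect_type(ddl, "character")
})
```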


# Saving large results sets with a batch operation
As data sets can easily exceed system memory, operations should be performed in batches, either via the export manager's
exposed functions with a callback, or by exporting from an Andromeda object.

## Setup

First we will connect to a test database and create some test data:

```{r, eval=FALSE}
connection <- DatabaseConnector::connect(server = ":memory:", dbms = "sqlite")
schema <- "main"

# Some made up counts
data <- data.frame(
  target_cohort_id = 1:100,
  comparator_cohort_id = 101:200,
  target_count = stats::rpois(100, lambda = 10),
  target_time = stats::rpois(100, 100000),
  comparator_count = stats::rpois(100, lambda = 5),
  comparator_time = stats::rpois(100, 100000)
)

DatabaseConnector::insertTable(connection, data = data, tableName = "result_table", databaseSchema = schema)
```

## Exporting a database query result

```{r, eval=FALSE}
sql <- "SELECT * FROM @schema.result_table"
exportManager$exportQuery(connection = connection, sql = sql, exportTableName = "my_table", schema = schema)
```

It is vital to ensure that the returned result set conforms to your data model, including the specified primary key columns.
Otherwise, export validation will fail; this is intended to prevent errors in the exported csv files.

If you look at the file `output_folder/example_cdm/my_table.csv` you will notice that the `database_id` field is populated.
You should not add this field in your SQL, as it is completed automatically for each database.
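For illustration, you can inspect the exported file directly; the path follows from the `exportDir` set above:

```{r, eval=FALSE}
# The database_id column is added automatically at export time
head(read.csv(file.path(exportDir, "my_table.csv")))
```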

Note that this result set is incomplete - we're not exporting fields that would be computed using an R function, just
the values returned by the SQL query.

### Performing R operations
In order to perform R operations (for example, computing a rate ratio or p-value that would be difficult to compute in
SQL), it is recommended that this is done inside a callback function passed to the `exportQuery` method.
Modifying the above to include a rate ratio calculation using the `rateratio.test` package:

```{r, eval=FALSE}
library(rateratio.test)

transformation <- function(rows, pos) {
  # rows is a batch (data frame); rateratio.test is not vectorized,
  # so compute the rate ratio and p-value row by row
  rows$rr <- NA
  rows$p_value <- NA
  for (i in seq_len(nrow(rows))) {
    rrResult <- rateratio.test(
      x = c(rows$target_count[i], rows$comparator_count[i]),
      n = c(rows$target_time[i], rows$comparator_time[i]),
      RR = 1,
      conf.level = 0.95
    )
    rows$rr[i] <- rrResult$estimate[1]
    rows$p_value[i] <- rrResult$p.value
  }
  return(rows)
}

exportManager$exportQuery(connection,
  sql,
  "my_table",
  transformFunction = transformation,
  transformFunctionArgs = list(),
  append = FALSE,
  schema = schema
)
```
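The `transformFunctionArgs` list can be used to forward additional arguments to the callback. As a minimal sketch, assuming the list entries are matched to the callback's extra parameters by name (`confLevel` below is our own argument name, not part of the package):

```{r, eval=FALSE}
# Hypothetical variant of the callback above; confLevel is supplied by us
# through transformFunctionArgs rather than hard-coded
transformationWithLevel <- function(rows, pos, confLevel) {
  results <- mapply(function(x1, x2, n1, n2) {
    rr <- rateratio.test(x = c(x1, x2), n = c(n1, n2), RR = 1, conf.level = confLevel)
    c(rr$estimate[1], rr$p.value)
  }, rows$target_count, rows$comparator_count, rows$target_time, rows$comparator_time)
  rows$rr <- results[1, ]
  rows$p_value <- results[2, ]
  return(rows)
}

exportManager$exportQuery(connection,
  sql,
  "my_table",
  transformFunction = transformationWithLevel,
  transformFunctionArgs = list(confLevel = 0.99),
  append = FALSE,
  schema = schema
)
```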

## Exporting an Andromeda result in batch
It is generally inadvisable to collect an entire Andromeda table into the R session before saving to disk.
Instead, it is best practice to use batch operations as follows:

```{r eval=FALSE}
andr <- Andromeda::andromeda()
andr$my_andromeda_table <- data.frame(covariate_id = 1:1e4, value = stats::runif(1e4))

first <- TRUE
writeBatch <- function(batch) {
  # Overwrite the file on the first batch, append on all subsequent batches
  exportManager$exportDataFrame(batch, "my_andromeda_table", append = !first)
  first <<- FALSE
  # we don't want to return anything, just write the result to disk
  return(invisible(NULL))
}

Andromeda::batchApply(andr$my_andromeda_table, writeBatch)
```
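
When you are done, close any open resources:

```{r, eval=FALSE}
# Clean up: close the Andromeda object and disconnect from the test database
close(andr)
DatabaseConnector::disconnect(connection)
```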

# Creating a results manifest file
Export manifests contain a SHA-256 hash of each exported file.
This can be useful to check whether a file was modified or corrupted before inclusion.
To write the manifest for the files within an export directory:

```{r, eval=FALSE}
exportManager$writeManifest(packageName = "analytics_package", packageVersion = utils::packageVersion("analytics_package"))
```