---
title: "undidR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{undidR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(undidR)
```

## Introduction

The **undidR** package implements difference-in-differences with unpoolable data (UNDID); a framework that enables the estimation of the average treatment effect on the treated (ATT) when the data from different silos is not poolable. UNDID allows for staggered or common adoption and the inclusion of covariates.

In addition, **undidR** also implements the randomization inference (RI) procedure for difference-in-differences described in [MacKinnon and Webb (2020)](https://doi.org/10.1016/j.jeconom.2020.04.024) to calculate RI p-values.

Below is an overview of the **undidR** framework:

```{r schematic, echo=FALSE, fig.align='center', out.width='90%', fig.cap="Schematic of the UNDID framework."}
knitr::include_graphics("figures/undidR_schematic.png")
```

The following sections detail some examples of implementing **undidR** at each of its three stages for both staggered and common adoption scenarios.

## 1. Stage One: Initialize

When calling `create_init_csv()` silo names must be specified along with their corresponding treatment times. Consequently, the `silos_names` vector must be the same length as the `treatment_times` vector.

All dates must be entered in the same date format. To see valid date formats within the **undidR** package call `undid_date_formats()`.

Covariates may be specified when calling either `create_init_csv()` or when calling `create_diff_df()` via the `covariates` parameter.

The choice of weights is also set during the initialization stage. The options of weights are one of: `"none"`, `"diff"`, `"att"`, or `"both"`. Each of these options describes levels at which the weights are applied.
The `"diff"` option uses weights based off of the number of observations (treated and untreated) associated with each contrast (difference) at each silo. The weights are used when computing the subaggregate ATTs from the differences.
The counts of these observations are recorded during stage two and kept in the `n` column of the `diff_df` CSV files. Likewise, the number of observations after treatment time are stored in the `n_t` column.
The "`att`" weighting option uses the number of post-treatment observations from treated silos associated with each subaggregate ATT as weights when computing the aggregate ATT from the subaggregate ATTs.

### 1.1 Common Treatment Time (Stage One)

```{r}
# First, an initializing CSV is created detailing the silos
# and their treatment times. Control silos (here, 73 and 46)
# should be labelled with "control".
init <- create_init_csv(silo_names = c("73", "46", "71", "58"),
                        start_times = "1989",
                        end_times = "2000",
                        treatment_times = c("control", "control",
                                            "1991", "1991"))
init

# After the initializing CSV file is created, `create_diff_df()`
# can be called. This creates the empty differences data frame which
# will then be filled out at each individual silo for its respective portion.
init_filepath <- normalizePath(file.path(tempdir(), "init.csv"),
                               winslash = "/", mustWork = FALSE)
empty_diff_df <- create_diff_df(init_filepath, date_format = "yyyy",
                                freq = "yearly", weights = "both")
empty_diff_df
```

### 1.2 Staggered Adoption (Stage One)

```{r}
# The initializing CSV for staggered adoption is created in the same way.
# When `create_diff_df()` is run, it will automatically detect whether or not
# the initial setup is for a common adoption or staggered adoption scenario.
init <- create_init_csv(silo_names = c("73", "46", "54", "23", "86", "32",
                                       "71", "58", "64", "59", "85", "57"),
                        start_times = "1989",
                        end_times = "2000",
                        treatment_times = c(rep("control", 6),
                                            "1991", "1993", "1996", "1997",
                                            "1997", "1998"))
init

# Creating the empty differences data frame and associated CSV file is
# the same for the case of staggered adoption as it is for common adoption.
init_filepath <- normalizePath(file.path(tempdir(), "init.csv"),
                               winslash = "/", mustWork = FALSE)
empty_diff_df <- create_diff_df(init_filepath, date_format = "yyyy",
                                freq = "yearly", weights = "both",
                                covariates = c("asian", "black", "male"))
head(empty_diff_df, 4)
```

## 2. Stage Two: Silos

The second stage function, `undid_stage_two()`, creates two CSV files. The first is the filled portion of the differences data frame for the respective silo. The second captures the mean (and the mean residualized by the specified covariates) of the outcome variable from the `start_time` to the `end_time` in intervals of `freq`.

These are returned from `undid_stage_two()` as a list of two data frames which can be accessed by the suffixes of `$diff_df` and `$trends_data`, respectively.

In order to accommodate silos that might have very stringent data sharing policies, there is an option of `anonymize_weights` (defaults to `FALSE`) during the second stage. If selected, it will round the counts
in the `n` column (in the trends data and diff matrix) as well as the `n_t` column to the closest value of `anonymize_size` (which defaults to 5).

The `undid_stage_two()` looks for covariates based on how they are spelled in the `empty_diff_df.csv` file. This means that silos may have to rename their covariate columns.

### 2.1 Common Treatment Time (Stage Two)

```{r}
# When calling `undid_stage_two()`, ensure that the `time_column` of
# the `silo_df` contains only character values, i.e. date strings.
silo_data <- silo71
silo_data$year <- as.character(silo_data$year)
empty_diff_filepath <- system.file("extdata/common", "empty_diff_df.csv",
                                   package = "undidR")
stage2 <- undid_stage_two(empty_diff_filepath, silo_name = "71",
                          silo_df = silo_data, time_column = "year",
                          outcome_column = "coll", silo_date_format = "yyyy")
head(stage2$diff_df, 4)
head(stage2$trends_data, 4)
```

### 2.2 Staggered Adoption (Stage Two)

```{r}
# Here we can see that calling `undid_stage_two()` for staggered adoption
# is no different than calling `undid_stage_two()` for common adoption.
silo_data <- silo71
silo_data$year <- as.character(silo_data$year)
empty_diff_filepath <- system.file("extdata/staggered", "empty_diff_df.csv",
                                    package = "undidR")
stage2 <- undid_stage_two(empty_diff_filepath, silo_name = "71",
                          silo_df = silo_data, time_column = "year",
                          outcome_column = "coll", silo_date_format = "yyyy")
head(stage2$diff_df, 4)
head(stage2$trends_data, 4)
```

## 3. Stage Three: Analysis

The third stage of **undidR** produces the aggregate ATT estimate, its standard errors, and its p-values, as well as group level ATT estimates for staggered adoption.

In the case of staggered adoption these group level ATTs can either be grouped by silo (`agg = "silo"`), by treatment time (`agg = "g"`), by treatment time for every time period after treatment has started (`agg = "gt"`),
or, the `"gt"` aggregation can further be separated by silo with `agg = "sgt"`. There is also an option to aggregate by time since treatment with `agg = "time"`. 

`undid_stage_three()` returns an object with the class `UnDiDObj` which has four S3 methods: `summary()`, `print()`, `coef()`, and `plot()`.
`summary()` and `plot()` are likely the most useful.

With the `plot()` method for `UnDiDObj`, you can specify the `event` parameter as `event = TRUE` in order to produce an event study plot. You can specify the confidence intervals on the event study plot with
`ci` (defaults to 0.95) and the window for which you want to observe the event study plot can be restricted by setting `event_window = c(start, end)` where `start` and `end` are numeric values describing the periods before and after treatment time.
The `plot()` method for also inherits standard parameters normally used in `plot()`.

Further, you can access the diff matrix itself that is used to compute subaggregate ATTs and the aggregate ATT with `UnDiDObj$diff`. Likewise, you can access the trends data with `UnDiDObj$trends`.

### 3.1 Common Treatment Time (Stage Three)

```{r, fig.width=6, fig.height=4, out.width = "70%", dpi=300, fig.align="center"}
# `undid_stage_three()`, given a `dir_path`, will search that folder
# for all CSV files that begin with "filled_diff_df_" and stitch
# them together in order to compute the group level ATTs, aggregate ATT
# and associated standard errors and p-values.
dir_path <- system.file("extdata/common", package = "undidR")
results <- undid_stage_three(dir_path, covariates = FALSE, nperm = 399)
summary(results)
plot(results)

```

### 3.2 Staggered Adoption (Stage Three)

```{r, fig.width=6, fig.height=4, out.width = "70%", dpi=300, fig.align="center"}
# When calling `undid_stage_three()` for staggered adoption it is
# important to specify the aggregation method, `agg`.
dir_path <- system.file("extdata/staggered", package = "undidR")
results <- undid_stage_three(dir_path, agg = "silo", covariates = TRUE,
                             nperm = 399)
head(results$diff, 4)

head(results$trends, 4)

summary(results)

plot(results)
```

```{r, fig.width=6, fig.height=4, out.width = "70%", dpi=300, fig.align="center"}
plot(results, event = TRUE)
```

## References

You can access citations by calling `citation("undidR")`.

```{r}
citation("undidR")
```

You can also call `print(citation("undidR"), bibtex = TRUE)`.

```{r}
print(citation("undidR"), bibtex = TRUE)
```

Karim, S., Webb, M., Austin, N., and Strumpf, E. 2024. Difference-in-Differences with Unpoolable Data. [https://arxiv.org/abs/2403.15910](https://arxiv.org/abs/2403.15910)

MacKinnon, J. and Webb, M. 2020. Randomization inference for difference-in-differences with few treated clusters. Journal of Econometrics. [https://doi.org/10.1016/j.jeconom.2020.04.024](https://doi.org/10.1016/j.jeconom.2020.04.024)