---
title: "cocoon"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{cocoon}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(rlang)
```

```{r setup-real, echo = FALSE, message = FALSE}
library(cocoon)
```

```{r setup-show, eval = FALSE}
library(cocoon)
```

One of the many useful features of the [`{papaja}`](https://github.com/crsh/papaja) package is the [`apa_print()`](https://frederikaust.com/papaja_man/reporting.html#statistical-models-and-tests)
function, which takes a statistical object and formats the output to print the statistical information inline in R Markdown documents. The `apa_print()` function is an easy way to extract and format this statistical information for documents following APA style. However, APA style has some rather strange quirks, and users may want some flexibility in how their statistics are formatted. Moreover, `apa_print()` uses LaTeX syntax, which works great for PDFs but generates images for mathematical symbols when outputting to Word documents. 

The `{cocoon}` package uses APA style as the default, but allows more flexible formatting such as including the leading 0 before numbers with maximum values of 1. All functions accept a `type` argument that specifies either `"md"` for Markdown (default) or `"latex"` for LaTeX. This package can format statistical objects, statistical values, and numbers more generally.


## Formatting statistical objects

Running a statistical test in R typically returns a list with lots of information about the test, often including things like statistical test values and p-values. The aim of the `format_stats()` function is to extract and format statistics for a suite of commonly used statistical objects (correlation and t-tests).

### Correlations

The `format_stats()` function can input objects returned by the `cor.test()` or `correlation::correlation()` function and detects whether the object is from a Pearson, Kendall, or Spearman correlation. It then reports and formats the appropriate correlation coefficient and p-value.

Let's start by creating a few different correlations.

```{r}
mpg_disp_corr_pearson <- cor.test(mtcars$mpg, mtcars$disp, method = "pearson")
mpg_disp_corr_spearman <- cor.test(mtcars$mpg, mtcars$disp, method = "spearman", exact = FALSE)
mpg_disp_corr_kendall <- cor.test(mtcars$mpg, mtcars$disp, method = "kendall", exact = FALSE)
```

For Pearson correlations, we get the correlation coefficient and the confidence intervals. Since Spearman and Kendall correlations are non-parametric, confidence intervals are not returned. Confidence intervals can be omitted from Pearson correlations by setting `full = FALSE`.

| Code | Output |
|-------|-----|
| `format_stats(mpg_disp_corr_pearson)` | `r format_stats(mpg_disp_corr_pearson)` |
| `format_stats(mpg_disp_corr_pearson, full = FALSE)` | `r format_stats(mpg_disp_corr_pearson, full = FALSE)` |
| `format_stats(mpg_disp_corr_spearman)` | `r format_stats(mpg_disp_corr_spearman)` |
| `format_stats(mpg_disp_corr_kendall)` | `r format_stats(mpg_disp_corr_kendall)` |

Format the number of digits of coefficients with `digits` and digits of p-values with `pdigits`. Include the leading zeros for coefficients and p-values with `pzero = TRUE`. Remove italics with  `italics = FALSE`.

| Code | Output |
|---------|------|
| `format_stats(mpg_disp_corr_pearson)` | `r format_stats(mpg_disp_corr_pearson)` |
| `format_stats(mpg_disp_corr_pearson, digits = 1, pdigits = 2)` | `r format_stats(mpg_disp_corr_pearson, digits = 1, pdigits = 2)` |
| `format_stats(mpg_disp_corr_pearson, pzero = TRUE)` | `r format_stats(mpg_disp_corr_pearson, pzero = TRUE)` |
| `format_stats(mpg_disp_corr_pearson, italics = FALSE)` | `r format_stats(mpg_disp_corr_pearson, italics = FALSE)` |
| `format_stats(mpg_disp_corr_spearman, italics = FALSE)` | `r format_stats(mpg_disp_corr_spearman, italics = FALSE)` |
| `format_stats(mpg_disp_corr_kendall, italics = FALSE)` | `r format_stats(mpg_disp_corr_kendall, italics = FALSE)` |


### T-tests

The `format_stats()` function can also input objects returned by the `t.test()` or `wilcox.test()` functions and detect whether the object is from a Student's or Wilcoxon t-test, including one-sample, independent-sample, and paired-sample versions. It then reports and formats the mean value (or mean difference), confidence intervals for mean value/difference, appropriate test statistic, degrees of freedom (for parametric tests), and p-value.

Let's start by creating a few different t-tests

```{r}
ttest_gear_carb <- t.test(mtcars$gear, mtcars$carb)
ttest_gear_carb_paired <- t.test(mtcars$gear, mtcars$carb, paired = TRUE)
ttest_gear_carb_onesample <- t.test(mtcars$gear, mu = 4)
wtest_gear_carb <- wilcox.test(mtcars$gear, mtcars$carb, exact = FALSE)
wtest_gear_carb_paired <- wilcox.test(mtcars$gear, mtcars$carb, paired = TRUE, exact = FALSE)
wtest_gear_carb_onesample <- wilcox.test(mtcars$gear, mu = 4, exact = FALSE)
```

For Student's t-tests, we get the mean value or difference and the confidence intervals. Means and confidence intervals can be omitted by setting `full = FALSE`.

| Code | Output |
|-------|-----|
| `format_stats(ttest_gear_carb)` | `r format_stats(ttest_gear_carb)` |
| `format_stats(ttest_gear_carb_paired)` | `r format_stats(ttest_gear_carb_paired)` |
| `format_stats(ttest_gear_carb_onesample)` | `r format_stats(ttest_gear_carb_onesample)` |
| `format_stats(ttest_gear_carb_onesample, full = FALSE)` | `r format_stats(ttest_gear_carb_onesample, full = FALSE)` |
| `format_stats(wtest_gear_carb)` | `r format_stats(wtest_gear_carb)` |
| `format_stats(wtest_gear_carb_paired)` | `r format_stats(wtest_gear_carb_paired)` |
| `format_stats(wtest_gear_carb_onesample)` | `r format_stats(wtest_gear_carb_onesample)` |

Format the number of digits of coefficients with `digits` and digits of p-values with `pdigits`. Include the leading zeros for coefficients and p-values with `pzero = TRUE`. Remove italics with  `italics = FALSE`.

| Code | Output |
|---------|------|
| `format_stats(ttest_gear_carb)` | `r format_stats(ttest_gear_carb)` |
| `format_stats(ttest_gear_carb, digits = 2, pdigits = 2)` | `r format_stats(ttest_gear_carb, digits = 2, pdigits = 2)` |
| `format_stats(ttest_gear_carb, pzero = TRUE)` | `r format_stats(ttest_gear_carb, pzero = TRUE)` |
| `format_stats(ttest_gear_carb, italics = FALSE)` | `r format_stats(ttest_gear_carb, italics = FALSE)` |
| `format_stats(wtest_gear_carb)` | `r format_stats(wtest_gear_carb)` |
| `format_stats(wtest_gear_carb, italics = FALSE)` | `r format_stats(wtest_gear_carb, italics = FALSE)` |

### ANOVAs

The `format_stats()` function can also input objects returned by the `aov()` function. It then reports and formats the F statistic, degrees of freedom, and p-value.

Let's start by creating an ANOVA

```{r}
aov_mpg_cyl_hp <- aov(mpg ~ cyl * hp, data = mtcars)
summary(aov_mpg_cyl_hp)
```

To use `format_stats()` on ANOVAs, you must pass the `aov` object and a character string describing which term to extract. Apply `summary()` to your `aov` object and copy the text of the term you want to extract. Then you can format the number of digits of coefficients with `digits` and digits of p-values with `pdigits`. Include the leading zeros for coefficients and p-values with `pzero = TRUE`. Remove italics with  `italics = FALSE`. With `dfs`, format degrees of freedom as parenthetical (`par`) or subscripts (`sub`) or remove them (`none`).

| Code | Output |
|-------------|------|
| `format_stats(aov_mpg_cyl_hp, term = "cyl")` | `r format_stats(aov_mpg_cyl_hp, term = "cyl")` |
| `format_stats(aov_mpg_cyl_hp, term = "cyl:hp")` | `r format_stats(aov_mpg_cyl_hp, term = "cyl:hp")` |
| `format_stats(aov_mpg_cyl_hp, term = "cyl", digits = 2, pdigits = 2)` | `r format_stats(aov_mpg_cyl_hp, term = "cyl", digits = 2, pdigits = 2)` |
| `format_stats(aov_mpg_cyl_hp, term = "cyl", pzero = TRUE)` | `r format_stats(aov_mpg_cyl_hp, term = "cyl", pzero = TRUE)` |
| `format_stats(aov_mpg_cyl_hp, term = "cyl", italics = FALSE)` | `r format_stats(aov_mpg_cyl_hp, term = "cyl", italics = FALSE)` |
| `format_stats(aov_mpg_cyl_hp, term = "cyl", dfs = "sub")` | `r format_stats(aov_mpg_cyl_hp, term = "cyl", dfs = "sub")` |


### Linear models

The `format_stats()` function can also input objects returned by the `lm()`, `glm()`, `lme4::lmer()`, `lmerTest::lmer()`, and `lme4::glmer()` functions. It can report overall model statistics (e.g., R-squared, AIC) for `lm()` and `glm()` and term-specific statistics (e.g., coefficients, p-values) for all models.

Let's start by creating some models:

```{r}
lm_mpg_cyl_hp <- lm(mpg ~ cyl * hp, data = mtcars)
summary(lm_mpg_cyl_hp)
glm_am_cyl_hp <- glm(am ~ cyl * hp, data = mtcars, family = binomial)
summary(glm_am_cyl_hp)
lmer_mpg_cyl_hp <- lme4::lmer(mpg ~ hp + (1 | cyl), data = mtcars)
summary(lmer_mpg_cyl_hp)
glmer_am_cyl_hp <- lme4::glmer(am ~ hp + (1 | cyl), data = mtcars, family = binomial)
summary(glmer_am_cyl_hp)
lmer_mpg_cyl_hp2 <- lmerTest::lmer(mpg ~ hp + (1 | cyl), data = mtcars)
summary(lmer_mpg_cyl_hp2)
```

To extract overall model statistics from `format_stats()`, pass the `lm` or `glm` object but omit any terms. 

| Code | Output |
|----------|--------|
| `format_stats(lm_mpg_cyl_hp)` | `r format_stats(lm_mpg_cyl_hp)` |
| `format_stats(lm_mpg_cyl_hp, full = FALSE)` | `r format_stats(lm_mpg_cyl_hp, full = FALSE)` |
| `format_stats(glm_am_cyl_hp)` | `r format_stats(glm_am_cyl_hp)` |
| `format_stats(glm_am_cyl_hp, full = FALSE)` | `r format_stats(glm_am_cyl_hp, full = FALSE)` |


To extract term-specific statistics, pass the object and a character string describing which term to extract. Apply `summary()` to your `lm`, `glm`, `lmer`, or `glmer` object and copy the text of the term you want to extract. Then you can format the number of digits of coefficients with `digits` and digits of p-values with `pdigits`. Include the leading zeros for coefficients and p-values with `pzero = TRUE`. Remove italics with  `italics = FALSE`. With `dfs`, format degrees of freedom as parenthetical (`par`) or subscripts (`sub`) or remove them (`none`).

| Code | Output |
|----------|-------|
| `format_stats(lm_mpg_cyl_hp, term = "cyl")` | `r format_stats(lm_mpg_cyl_hp, term = "cyl")` |
| `format_stats(lm_mpg_cyl_hp, term = "cyl:hp")` | `r format_stats(lm_mpg_cyl_hp, term = "cyl:hp")` |
| `format_stats(glm_am_cyl_hp, term = "cyl")` | `r format_stats(glm_am_cyl_hp, term = "cyl")` |
| `format_stats(lmer_mpg_cyl_hp, term = "hp")` | `r format_stats(lmer_mpg_cyl_hp, term = "hp")` |
| `format_stats(glmer_am_cyl_hp, term = "hp")` | `r format_stats(glmer_am_cyl_hp, term = "hp")` |
| `format_stats(lmer_mpg_cyl_hp2, term = "hp")` | `r format_stats(lmer_mpg_cyl_hp2, term = "hp")` |
| `format_stats(lm_mpg_cyl_hp, term = "cyl", digits = 2, pdigits = 2)` | `r format_stats(lm_mpg_cyl_hp, term = "cyl", digits = 2, pdigits = 2)` |
| `format_stats(lm_mpg_cyl_hp, term = "cyl", pzero = TRUE)` | `r format_stats(lm_mpg_cyl_hp, term = "cyl", pzero = TRUE)` |
| `format_stats(lm_mpg_cyl_hp, term = "cyl", italics = FALSE)` | `r format_stats(lm_mpg_cyl_hp, term = "cyl", italics = FALSE)` |
| `format_stats(lm_mpg_cyl_hp, term = "cyl", dfs = "sub")` | `r format_stats(lm_mpg_cyl_hp, term = "cyl", dfs = "sub")` |


### Bayes factors

The `format_stats()` function can also extract and format Bayes factors from a `BFBayesFactor` object from the [`{BayesFactor}`](https://cran.r-project.org/package=BayesFactor) package. Bayes factors are not as standardized in how they are formatted. One issue is that Bayes factors can be referenced from either the alternative hypothesis (H~1~) or the null hypothesis (H~0~). Also, as a ratio, digits after the decimal are more important below 1 than above 1.

To respond to the digits issue, the `format_stats()` function controls digits for Bayes factors less than 1 (`digits1`) separately from those greater than 1 (`digits2`). In fact, the defaults are different for these two arguments. Further, Bayes factors can be very large or very small when evidence strongly favors one hypothesis over another. Therefore, the `cutoff` argument set a threshold above which (or below 1/cutoff) the returned value is truncated (e.g., BF > 1000).

```{r}
bf_corr <- BayesFactor::correlationBF(mtcars$mpg, mtcars$disp)
bf_ttest <- BayesFactor::ttestBF(mtcars$vs, mtcars$am)
bf_lm <- BayesFactor::lmBF(mpg ~ am, data = mtcars)
```


| Code | Output |
|--------|------|
| `format_stats(bf_lm)`     |  `r format_stats(bf_lm)`      |
| `format_stats(bf_lm, digits1 = 2)`     | `r format_stats(bf_lm, digits1 = 2)`       |
| `format_stats(bf_corr)`     | `r format_stats(bf_corr)`       |
| `format_stats(bf_corr, cutoff = 1000)`     | `r format_stats(bf_corr, cutoff = 1000)`       |
| `format_stats(bf_ttest)`     | `r format_stats(bf_ttest)`       |
| `format_stats(bf_ttest, digits2 = 3)`     | `r format_stats(bf_ttest, digits2 = 3)`       |
| `format_stats(bf_ttest, cutoff = 3)`     | `r format_stats(bf_ttest, cutoff = 3)`       |

The default label for Bayes factors is _BF_~10~. The text of the label can be changed with the `label` argument, where setting `label = ""` omits the label. Italics can be removed with `italics = FALSE`, and the subscript can be set to 01  (`subscript = "01"`) or removed (`subscript = ""`).

| Code | Output |
|--------|-----|
| `format_stats(bf_lm)`     |  `r format_stats(bf_lm)`      |
| `format_stats(bf_lm, italics = FALSE)`     | `r format_stats(bf_lm, italics = FALSE)`       |
| `format_stats(bf_lm, subscript = "")`     | `r format_stats(bf_lm, subscript = "")`       |
| `format_stats(bf_lm, label = "Bayes factor", italics = FALSE, subscript = "")`     | `r format_stats(bf_lm, label = "Bayes factor", italics = FALSE, subscript = "")`       |
| `format_stats(bf_lm, label = "")`     | `r format_stats(bf_lm, label = "")`       |


## Formatting statistical values

### Central tendency and error

#### Data vectors
Often, we need to include simple descriptive statistics in our documents, such as measures of central tendency and error. `{cocoon}` includes a suite of functions that can calculate different summary measures of central tendency (mean or median) and error (confidence interval, standard error, standard deviation, interquartile range) from a numeric data vector. With the base function `format_summary()`, you can specify central tendency with the `summary` argument and error with the `error` argument. For instance, `format_summary(vec, summary = "mean", error = "se")` calculates mean and standard error.

`{cocoon}` includes a number of wrapper functions that cover common measures of central tendency and error including:

* `format_meanci()`
* `format_meanse()`
* `format_meansd()`
* `format_medianiqr()`

And if you don't want to include error, use `format_mean()` or `format_median()`.

| Code | Output |
|------|--------|
| `format_summary(mtcars$mpg, error = "ci")` | `r format_summary(mtcars$mpg, error = "ci")` |
| `format_meanci(mtcars$mpg)` | `r format_meanci(mtcars$mpg)` |
| `format_medianiqr(mtcars$mpg)` | `r format_medianiqr(mtcars$mpg)` |
| `format_mean(mtcars$mpg)` | `r format_mean(mtcars$mpg)` |

#### Pre-calculated summaries

In addition to calculating values directly from the vectors, these functions can format already-calculated measures.  So if you already have your mean and error calculated, just pass the vector of central tendency, lower error limit, and upper error limit to the `values` argument to format them. For instance, `format_meanci(values = c(12.5, 11.2, 13.7))` produces `r format_meanci(values = c(12.5, 11.2, 13.7))`. Make sure you pass the arguments in this order, as the function checks whether the send argument is less than or equal to the first and the third is greater than or equal to the first.

#### Formatting output
These functions can control many aspects of formatting for the values and labels of summary statistics. Digits after the decimal are controlled with `digits` (default is 1). The `tendlabel` argument defines whether the default abbreviation is used ("M" or "Mdn"), the full word ("Mean" or "Median"), or no label is provided.  Each of these can be italicized  or not with the `italics` argument, subscripts can be included with the `subscript` argument, and units added with the `units` argument. 

| Code | Output |
|------|--------|
| `format_mean(mtcars$mpg)` | `r format_mean(mtcars$mpg)` |
| `format_mean(mtcars$mpg, tendlabel = "word")` | `r format_mean(mtcars$mpg, tendlabel = "word")` |
| `format_mean(mtcars$mpg, tendlabel = "none")` | `r format_mean(mtcars$mpg, tendlabel = "none")` |
| `format_mean(mtcars$mpg, italics = FALSE)` | `r format_mean(mtcars$mpg, italics = FALSE)` |
| `format_mean(mtcars$mpg, subscript = "A")` | `r format_mean(mtcars$mpg, subscript = "A")` |
| `format_mean(mtcars$mpg, units = "m")` | `r format_mean(mtcars$mpg, units = "m")` |

Error can be displayed in a number of different ways. Setting the `display` argument to `"limits"` (default) includes upper and lower limits in brackets. If intervals rather than limits are preferred, they can be appended after the mean/median with ± using `"pm"` or in parentheses with `"par"`. Error is not displayed if `display = "none"`. The presence of the error label is controlled by the logical argument `errorlabel`. When set to `FALSE`, no error label is included. For confidence intervals, the `cilevel` argument takes a numeric scalar from 0-1 to define the confidence level.

| Code | Output |
|--------|------|
| `format_meanci(mtcars$mpg)` | `r format_meanci(mtcars$mpg)` |
| `format_meanci(mtcars$mpg, display = "pm")` | `r format_meanci(mtcars$mpg, display = "pm")` |
| `format_meanci(mtcars$mpg, display = "par")` | `r format_meanci(mtcars$mpg, display = "par")` |
| `format_meanci(mtcars$mpg, display = "none")` | `r format_meanci(mtcars$mpg, display = "none")` |
| `format_meanci(mtcars$mpg, errorlabel = FALSE)` | `r format_meanci(mtcars$mpg, errorlabel = FALSE)` |
| `format_meanci(mtcars$mpg, cilevel = 0.90)` | `r format_meanci(mtcars$mpg, cilevel = 0.90)` |


### P-values

P-values are pretty easy to format with `format_p()`. The `digits` argument controls the number of digits after the decimal, and if the value is lower, `p <` is used. Unfortunately, APA style involves lopping off the leading zero in p-values, but setting `pzero  = TRUE` turns off this silly setting. The p-value label is controlled by `label`, where the user can specify the exact label text. By default, this is a lower case, italicized _p_. Non-italicized can be defined with `italics = FALSE`. P-value labels can be omitted by setting `label = ""`.

| Code | Output |
|------|--------|
| `format_p(0.001)`     |  `r format_p(0.001)`      |
| `format_p(0.001, digits = 2)`     |  `r format_p(0.001, digits = 2)`      |
| `format_p(0.321, digits = 2)`     |  `r format_p(0.321, digits = 2)`      |
| `format_p(0.001, pzero = TRUE)`     |  `r format_p(0.001, pzero = TRUE)`      |
| `format_p(0.001, label = "P")`     |  `r format_p(0.001, label = "P")`      |
| `format_p(0.001, italics = FALSE)`     |  `r format_p(0.001, italics = FALSE)`      |
| `format_p(0.001, label = "")`     |  `r format_p(0.001, label = "")`      |


### Bayes factors

Though the `format_stats()` function extracts and formats Bayes factors from the [`{BayesFactor}`](https://cran.r-project.org/package=BayesFactor) package, sometimes you may have Bayes factors from other sources. The `format_bf()` function formats Bayes factors from numeric values (either single scalar elements or vectors).

| Code | Output |
|--------|------|
| `format_bf(4321)`     |  `r format_bf(4321)`      |
| `format_bf(4321, digits1 = 2)`     | `r format_bf(4321, digits1 = 2)`       |
| `format_bf(4321, cutoff = 1000)`     | `r format_bf(4321, cutoff = 1000)`       |
| `format_bf(0.04321)`     | `r format_bf(0.04321)`       |
| `format_bf(0.04321, digits2 = 3)`     | `r format_bf(0.04321, digits2 = 3)`       |
| `format_bf(0.04321, cutoff = 10)`     | `r format_bf(0.04321, cutoff = 10)`       |
| `format_bf(4321, italics = FALSE)`     | `r format_bf(4321, italics = FALSE)`       |
| `format_bf(4321, subscript = "")`     | `r format_bf(4321, subscript = "")`       |
| `format_bf(4321, label = "Bayes factor", italics = FALSE, subscript = "")`     | `r format_bf(4321, label = "Bayes factor", italics = FALSE, subscript = "")`       |
| `format_bf(4321, label = "")`     | `r format_bf(4321, label = "")`       |
| `format_bf(c(4321, 0.04321))`     |  `r format_bf(c(4321, 0.04321))`      |


## Formatting numbers

In addition to formatting specific statistics, this package can format numbers more generally. The `format_num()` function controls general formatting of numbers of digits with `digits` and the presence of the leading zero with `pzero`.

| Code | Output |
|---------|------|
| `format_num(0.1234)` | `r format_num(0.1234)` |
| `format_num(0.1234, digits = 2)` | `r format_num(0.1234, digits = 2)` |
| `format_num(0.1234, pzero = FALSE)` | `r format_num(0.1234, pzero = FALSE)` |

For large or small values, using scientific notation may be a more useful way to format the numbers. The `format_scientific()` function converts to scientific notation, again offering control of the number of `digits` as well as whether output `type` is Markdown or LaTeX.

| Code | Output |
|---------|------|
| `format_scientific(1234)` | `r format_scientific(1234)` |
| `format_scientific(0.0000001234)` | `r format_scientific(0.0000001234)` |
| `format_scientific(0.0000001234, digits = 2)` | `r format_scientific(0.0000001234, digits = 2)` |