This vignette shows how to import and prepare any dataset for use
with finalfit. The demonstration uses the
boot::melanoma dataset. Use ?boot::melanoma to see the
help page with a description of the data. I will use
library(tidyverse) methods. First I’ll
write_csv() the data just to demonstrate reading it back in.
Note the various options in read_csv(), including
providing column names, variable types and a missing data identifier.
library(readr)
# Save example
write_csv(boot::melanoma, "boot.csv")
# Read data
melanoma = read_csv("boot.csv")
#> Rows: 205 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (7): time, status, sex, age, year, thickness, ulcer
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that the output shows how the columns/variables have been parsed. For
full details see ?readr::cols().
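As a minimal sketch of the read_csv() options mentioned above, the call below sets the type of each column explicitly and declares a missing data identifier. The small CSV, the temporary file and the use of "-99" as a missing code are assumptions made purely for illustration; they are not part of the melanoma data.

```r
library(readr)

# Hypothetical three-row CSV written to a temporary file (illustration only)
path = tempfile(fileext = ".csv")
writeLines(c("time,status,sex,thickness",
             "10,3,1,6.76",
             "30,3,1,0.65",
             "35,2,1,-99"), path)

# Specify the type of each column and treat "-99" as missing
example = read_csv(
  path,
  col_types = cols(
    time      = col_integer(),
    status    = col_integer(),
    sex       = col_factor(levels = c("1", "0")),
    thickness = col_double()
  ),
  na = c("", "NA", "-99")
)

spec(example)   # Column specification that was actually applied
```

Note that the na argument applies to every column, so a missing code should be chosen that cannot occur as a genuine value anywhere in the file.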
Commonly used column specifications include:
col_integer()
col_double()
col_factor()
col_character()
col_logical()
col_date()
col_time()
col_datetime()
ff_glimpse() provides a convenient overview of all data
in a tibble or data frame. It is particularly important that factors are
correctly specified. Hence, ff_glimpse() separates
variables into continuous and categorical. As expected, no factors are
yet specified in the melanoma dataset.
library(finalfit)
ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#> data frame with 0 columns and 205 rows
If you wish to see the variables in the order in which they appear in
the data frame or tibble, missing_glimpse() or
tibble::glimpse() are useful.
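For comparison, here is a minimal sketch of both functions. It is run on the raw boot::melanoma data frame so that it is self-contained, and it assumes the boot, finalfit and tibble packages are installed. Both functions keep the variables in their original column order.

```r
library(finalfit)

# Raw data straight from the boot package, before any recoding
raw = boot::melanoma

# One row per variable, in data frame order, with missing data counts
missing_summary = missing_glimpse(raw)
missing_summary

# Compact listing of every column with its type and first few values
tibble::glimpse(raw)
```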
Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.
library(dplyr)
melanoma %>% 
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  ) -> melanoma
ff_glimpse(melanoma)
#> $Continuous
#>               label var_type   n missing_n missing_percent   mean     sd    min
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1   10.0
#> status       status    <dbl> 205         0             0.0    1.8    0.6    1.0
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5    0.0
#> age             age    <dbl> 205         0             0.0   52.5   16.7    4.0
#> year           year    <dbl> 205         0             0.0 1969.9    2.6 1962.0
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0    0.1
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5    0.0
#>           quartile_25 median quartile_75    max
#> time           1525.0 2005.0      3042.0 5565.0
#> status            1.0    2.0         2.0    3.0
#> sex               0.0    0.0         1.0    1.0
#> age              42.0   54.0        65.0   95.0
#> year           1968.0 1970.0      1972.0 1977.0
#> thickness         1.0    1.9         3.6   17.4
#> ulcer             0.0    0.0         1.0    1.0
#> 
#> $Categorical
#>                label var_type   n missing_n missing_percent levels_n
#> status.factor Status    <fct> 205         0             0.0        3
#> sex.factor       Sex    <fct> 205         0             0.0        2
#> ulcer.factor   Ulcer    <fct> 205         0             0.0        2
#>                                                                             levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes", "(Missing)"
#> sex.factor                                           "Male", "Female", "(Missing)"
#> ulcer.factor                                      "Present", "Absent", "(Missing)"
#>               levels_count   levels_percent
#> status.factor  57, 134, 14 27.8, 65.4,  6.8
#> sex.factor         79, 126           39, 61
#> ulcer.factor       90, 115           44, 56
Everything looks good and you are ready to start analysis.
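One way to confirm that the labels were attached to the intended codes is to cross-tabulate an original variable against its new factor. This is a minimal sketch in base R and is not part of the original workflow. It rebuilds the sex factor from boot::melanoma so that it runs on its own.

```r
# Rebuild one of the factors from the raw data (same coding as above)
raw = boot::melanoma
sex.factor = factor(raw$sex, levels = c(1, 0), labels = c("Male", "Female"))

# Rows are the original codes, columns are the new labels;
# each row should have all of its counts in a single column
check = table(code = raw$sex, label = sex.factor, useNA = "ifany")
check

# The first level listed is the one R treats as the reference level
# under its default (treatment) contrasts
levels(sex.factor)[1]
```

The same check can be repeated for status and ulcer by swapping in the corresponding levels and labels.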