Longitudinal data goes by many names, such as panel data or cross-sectional time series. We define longitudinal data as:
Information from the same individuals, recorded at multiple points in time.
To explore and model longitudinal data, it is important to understand which variables represent the individual component, which represent the time component, and how together they identify an individual moving through time. Identifying the individual and time components can sometimes be a challenge, so this vignette walks through how to do it.
The tools and workflows in brolgar are designed to work with a special tidy time series data frame called a tsibble. We can define our longitudinal data in terms of a time series to gain access to some really useful tools. To do so, we need to identify three components:

- The key: the variable(s) identifying the individual or series
- The index: the variable measuring time
- The regularity of the index: whether measurements are evenly spaced
Together, time index and key uniquely identify an observation with repeated measurements
The term key is used a lot in brolgar, so it is an
important idea to internalise:
The key is the identifier of your individuals or series
Why care about defining longitudinal data as a time series? Once we account for this time series structure inherent in longitudinal data, we gain access to a suite of nice tools that simplify and accelerate how we work with time series data.
brolgar is built on top of the powerful tsibble package by Earo Wang. If you would like to learn more, see the official package documentation or read the paper.
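As a concrete illustration of key and index, here is a minimal sketch using only base R (the toy id, wave, and score columns are invented for this example): the key identifies *who*, the index identifies *when*, and together they pin down a single observation.

```r
# Toy longitudinal data: two individuals ("A", "B") measured at three waves.
df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B"),  # key: identifies the individual
  wave  = c(1, 2, 3, 1, 2, 3),              # index: identifies the time point
  score = c(10, 12, 15, 8, 9, 11)           # a measurement
)

# Together, key + index must uniquely identify each observation --
# a requirement that as_tsibble() checks when building a tsibble.
any(duplicated(df[, c("id", "wave")]))
#> [1] FALSE
```

If this returned TRUE, the key and index would not uniquely identify rows, and converting to a tsibble would fail.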
To convert longitudinal data into a “time series tibble”, a tsibble, we need to consider which variables identify:

- the individual (the key)
- the time component (the index)
- the regularity of the time component

Together, time index and key uniquely identify an observation with repeated measurements

The vignette now walks through some examples of converting longitudinal data into a tsibble.
Let’s look at the wages data analysed in Singer & Willett (2003). This data contains measurements of hourly wages by years in the workforce, with education and race as covariates. The population measured was male high-school dropouts, aged between 14 and 17 years when first measured. Below are the first 10 rows of the data.
library(brolgar)
suppressPackageStartupMessages(library(dplyr))
slice(wages, 1:10) %>% knitr::kable()

| id | ln_wages | xp | ged | xp_since_ged | black | hispanic | high_grade | unemploy_rate |
|---|---|---|---|---|---|---|---|---|
| 31 | 1.491 | 0.015 | 1 | 0.015 | 0 | 1 | 8 | 3.21 | 
| 31 | 1.433 | 0.715 | 1 | 0.715 | 0 | 1 | 8 | 3.21 | 
| 31 | 1.469 | 1.734 | 1 | 1.734 | 0 | 1 | 8 | 3.21 | 
| 31 | 1.749 | 2.773 | 1 | 2.773 | 0 | 1 | 8 | 3.30 | 
| 31 | 1.931 | 3.927 | 1 | 3.927 | 0 | 1 | 8 | 2.89 | 
| 31 | 1.709 | 4.946 | 1 | 4.946 | 0 | 1 | 8 | 2.49 | 
| 31 | 2.086 | 5.965 | 1 | 5.965 | 0 | 1 | 8 | 2.60 | 
| 31 | 2.129 | 6.984 | 1 | 6.984 | 0 | 1 | 8 | 4.80 | 
| 36 | 1.982 | 0.315 | 1 | 0.315 | 0 | 0 | 9 | 4.89 | 
| 36 | 1.798 | 0.983 | 1 | 0.983 | 0 | 0 | 9 | 7.40 | 
To create a tsibble of the data we ask, “which variables
identify…”:
Together, time index and key uniquely identify an observation with repeated measurements
From this, we can say that:

- id is the key: the subject id, from 1-888.
- xp is the index: the experience in years an individual has.

We can use this information to create a tsibble of this data using as_tsibble:

as_tsibble(x = wages,
           key = id,
           index = xp,
           regular = FALSE)
#> # A tsibble: 6,402 x 9 [!]
#> # Key:       id [888]
#>       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
#>    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
#>  1    31     1.49 0.015     1        0.015     0        1          8
#>  2    31     1.43 0.715     1        0.715     0        1          8
#>  3    31     1.47 1.73      1        1.73      0        1          8
#>  4    31     1.75 2.77      1        2.77      0        1          8
#>  5    31     1.93 3.93      1        3.93      0        1          8
#>  6    31     1.71 4.95      1        4.95      0        1          8
#>  7    31     2.09 5.96      1        5.96      0        1          8
#>  8    31     2.13 6.98      1        6.98      0        1          8
#>  9    36     1.98 0.315     1        0.315     0        0          9
#> 10    36     1.80 0.983     1        0.983     0        0          9
#> # ℹ 6,392 more rows
#> # ℹ 1 more variable: unemploy_rate <dbl>

Note that regular = FALSE, since we have an irregular time series.
Note the following information printed at the top of wages:

# A tsibble: 6,402 x 9 [!]
# Key:       id [888]
...

This says:
- The ! at the top means that there is no regular spacing between series.
- The “key” variable is then listed: id, of which there are 888.
The heights data is a little simpler than the wages data, and contains the average male heights in 144 countries from 1810-1989, with a smaller number of countries from 1500-1800.
It contains four variables: country, continent, year, and height_cm (the average height in centimetres).
To create a tsibble of the data we ask, “which variables
identify…”:
In this case:

- the key is country
- the index is year
- the index is not regular, since measurements are unevenly spaced
This data is already a tsibble object; it was created with the following code:
as_tsibble(x = heights,
           key = country,
           index = year,
           regular = FALSE)
#> # A tsibble: 1,490 x 4 [!]
#> # Key:       country [144]
#>    country     continent  year height_cm
#>    <chr>       <chr>     <dbl>     <dbl>
#>  1 Afghanistan Asia       1870      168.
#>  2 Afghanistan Asia       1880      166.
#>  3 Afghanistan Asia       1930      167.
#>  4 Afghanistan Asia       1990      167.
#>  5 Afghanistan Asia       2000      161.
#>  6 Albania     Europe     1880      170.
#>  7 Albania     Europe     1890      170.
#>  8 Albania     Europe     1900      169.
#>  9 Albania     Europe     2000      168.
#> 10 Algeria     Africa     1910      169.
#> # ℹ 1,480 more rows

The gapminder R package contains a subset of the gapminder study (link). This contains data on life expectancy, GDP per capita, and population by country.
library(gapminder)
gapminder
#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 1,694 more rows

Let’s identify the key, the index, and the regularity of the index. This is in fact very similar to the heights dataset:

- the key is country
- the index is year
To identify whether the year is regular, we can do a bit of data exploration using index_summary():
gapminder %>% 
  group_by(country) %>% 
  index_summary(year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1952    1966    1980    1980    1993    2007

This shows us that the year is measured every five years, so now we know that this is a regular longitudinal dataset, which can be encoded like so:
as_tsibble(gapminder,
           key = country,
           index = year,
           regular = TRUE)
#> # A tsibble: 1,704 x 6 [5Y]
#> # Key:       country [142]
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ℹ 1,694 more rows

The PISA study measures school students around the world on a series of math, reading, and science scores. A subset of the data looks like so:
pisa
#> # A tibble: 433 × 11
#>    country  year math_mean math_min math_max read_mean read_min read_max
#>    <fct>   <int>     <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>
#>  1 ALB      2000      395.     27.4     722.      354.  59.7        640.
#>  2 ALB      2009      377.     79.6     706.      385.  17.0        662.
#>  3 ALB      2012      395.     62.4     688.      394.   0.0834     742.
#>  4 ALB      2015      412.    122.      711.      405.  93.6        825.
#>  5 ALB      2018      437.     96.5     789.      405. 152.         693.
#>  6 ARE      2009      421.     57.8     768.      431.  48.1        772.
#>  7 ARE      2012      434.    138.      862.      442.  75.5        785.
#>  8 ARE      2015      427.     91.8     793.      432.  54.4        827.
#>  9 ARE      2018      437.     87.6     865.      431.  84.0        814.
#> 10 ARG      2000      385.     16.0     675.      417.  84.2        761.
#> # ℹ 423 more rows
#> # ℹ 3 more variables: science_mean <dbl>, science_min <dbl>, science_max <dbl>

Let’s identify the key and the index.
Here it looks like the key is country (the scores are summarised at the country level), and the index is year. We can assess the regularity of the year like so:
index_regular(pisa, year)
#> [1] TRUE
index_summary(pisa, year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2000    2004    2009    2009    2014    2018

We can now convert this into a tsibble:
pisa_ts <- as_tsibble(pisa,
                      key = country,
                      index = year,
                      regular = TRUE)
pisa_ts
#> # A tsibble: 433 x 11 [3Y]
#> # Key:       country [100]
#>    country  year math_mean math_min math_max read_mean read_min read_max
#>    <fct>   <int>     <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>
#>  1 ALB      2000      395.     27.4     722.      354.  59.7        640.
#>  2 ALB      2009      377.     79.6     706.      385.  17.0        662.
#>  3 ALB      2012      395.     62.4     688.      394.   0.0834     742.
#>  4 ALB      2015      412.    122.      711.      405.  93.6        825.
#>  5 ALB      2018      437.     96.5     789.      405. 152.         693.
#>  6 ARE      2009      421.     57.8     768.      431.  48.1        772.
#>  7 ARE      2012      434.    138.      862.      442.  75.5        785.
#>  8 ARE      2015      427.     91.8     793.      432.  54.4        827.
#>  9 ARE      2018      437.     87.6     865.      431.  84.0        814.
#> 10 ARG      2000      385.     16.0     675.      417.  84.2        761.
#> # ℹ 423 more rows
#> # ℹ 3 more variables: science_mean <dbl>, science_min <dbl>, science_max <dbl>

This idea of longitudinal data is core to brolgar. Understanding what longitudinal data is, and how it can be linked to a time series representation, helps us understand our data structure and gives us access to more flexible tools. Other vignettes in the package further show why the time series tsibble is useful.
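To recap the regular/irregular decisions made throughout this vignette, here is a base-R sketch (is_regular_index is a hypothetical helper written for this example, not part of brolgar; in practice index_regular() and index_summary() do this work on a tsibble):

```r
# A strict check: a series is regular if all gaps between successive
# (sorted, unique) index values are equal.
is_regular_index <- function(index) {
  gaps <- diff(sort(unique(index)))
  length(unique(gaps)) == 1
}

is_regular_index(c(1952, 1957, 1962, 1967))  # gapminder-style 5-year gaps
#> [1] TRUE
is_regular_index(c(1870, 1880, 1930, 1990))  # heights-style uneven gaps
#> [1] FALSE
```

Note that tsibble itself is a little more permissive than this strict check: it treats observations that fall on multiples of a common interval as regular, which is roughly why pisa prints [3Y] even though not every third year is observed.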