In this document, we provide a tutorial on conducting simple comparative research analyses with R This document accompanies the Discover Social Science and Health paper Investigating change across time in prevalence or association: the challenges of cross-study comparative research and possible solutions by Bann et al. (2022). A similar tutorial using Stata is provided on the Open Science Foundation website (https://osf.io/d569x/) and on GitHub (https://ljwright.github.io/cross-cohort-tutorial/stata_syntax.html). Both tutorials cover almost identical material, but the HTML files are produced using different software (RMarkdown for the R tutorial, Jupyter Lab for the Stata tutorial). This explains the different look of the two tutorial files. The current tutorial was produced with R version 4.1.1, but should work with (at least) R versions 3.6.0 and above.
To demonstrate the concepts discussed in the paper, we will use a simple simulated dataset, which contains data from 40,000 individuals across four cohorts (Boomers, Gen X, Millennials, and Gen Z) each surveyed at the same age. The dataset can be downloaded from the Open Science Foundation website (https://osf.io/d569x/). The code in this document is also be used for conducting analyses with repeat cross-sectional data, but we limit ourselves to this example dataset to make the code easier to follow along.
library(labelled)
load("cohort_data.Rdata")
look_for(df)
## pos variable label col_type values
## 1 id Person ID int
## 2 cohort Cohort fct Boomers
## Gen X
## Millennials
## Gen Z
## 3 cohort_id Person ID (Cohort) int
## 4 sex Sex fct Male
## Female
## 5 social_class Father's Occupational Class fct Manual
## Non-Manual
## 6 education Highest Qualification fct A-Level or Below
## Degree or Above
## 7 bmi Body Mass Index dbl
## 8 depression Depression Diagnosis fct No Depression
## Diagnosed
## 9 observed Survey Participation Indicator dbl
The dataset contains observations collected at age 25 across the four
cohorts. There are two variables that we will use as outcome variables
in the analyses: the continuous variable, bmi
, and the
binary variable, depression
. There are two variables that
we will analyse as exposure variables: the binary variable
sex
and the binary variable social_class
. We
will use these data to explore changes in social- and sex-inequalities
across cohorts. We will produce cohort specific descriptive statistics,
estimate regression models in both individual cohorts and in pooled
analyses, and perform meta-analyses of model results. The dataset also
contains a study participation indicator, observed
, that we
will use to show how missingness can bias results.
We’ll primarily use the tidyverse
package to prepare,
analyse, and present the data. We’ll load other packages when they are
required, in order to make clearer which packages functions belong to.
(Above we loaded the labelled
package to use the
look_for()
function.)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Before starting the analysis, it is worth discussing the
tidyverse
in further detail.
tidyverse
The tidyverse
is an “opinionated” package of packages
that share a common design philosophy (https://www.tidyverse.org/). Popular
tidyverse
packages include ggplot2
,
dplyr
, tidyr
, stringr
and
purrr
. Above you’ll see that the command
library(tidyverse)
loaded several packages. These are known
as the “core” tidyverse packages. When installing the
tidyverse
, more packages are installed than the core
packages (for instance, broom
and glue
), but
these need to be loaded separately before use.
Because the tidyverse
packages share a design
philosophy, they work very well together (https://design.tidyverse.org/). Other developers can
also create packages using the same philosophy, which means they can fit
into the tidyverse
ecosystem. This solves some of the
coordination issues that can make using open-source software
difficult.
Two central elements of the tidyverse
are the use of
tidy data and the %>%
(“pipe”) operator. Tidy
data are data in rectangular (data frame) format where each row is an
observation and each column is a variable (Wickham 2014). What is tidy can depend on
context, but in general, *wide* data - where each row contains repeat
observations from different timepoints - are not tidy, whereas
long data - datasets with individual rows for each person time
combination - are.
The pipe operator allows R users to chain functions together, piping the object on the left hand side to the first argument of the function on the right hand side. This makes code much easier to write and to read. Compare:
paste(mean(1:10), "is the mean of 1 to 10.")
## [1] "5.5 is the mean of 1 to 10."
1:10 %>%
mean() %>%
paste("is the mean of 1 to 10")
## [1] "5.5 is the mean of 1 to 10"
The pipe can be read sequentially (i.e. do this, then this, then
this, …, etc). Users can also pipe objects into different arguments of a
function using the .
placeholder.1
paste("The mean of 1 to 10 is", mean(1:10))
## [1] "The mean of 1 to 10 is 5.5"
1:10 %>%
mean() %>%
paste("The mean of 1 to 10 is", .)
## [1] "The mean of 1 to 10 is 5.5"
There are a lot of excellent tutorials on the tidyverse
online. See, for instance, see the R
for Data Science book, the R
for Data Science Online Learning Community, and the #rstats and #tidytuesday Twitter
hashtags.
To produce descriptive statistics by cohort, we will use the
gtsummary
package. The gtsummary
package (http://www.danieldsjoberg.com/gtsummary/) uses
tidyverse
syntax but is not part of the
tidyverse
. To produce a table of descriptive statistics by
cohort
, we pass our data to the tbl_summary()
function with the option by = cohort
. We don’t need
descriptive statistics for either of the ID columns (id
,
cohort_id
), so we drop these using the
select()
function (from dplyr
) before piping
in to tbl_summary()
.
library(gtsummary)
df %>%
select(-id, -cohort_id) %>%
tbl_summary(by = cohort)
Characteristic | Boomers, N = 10,0001 | Gen X, N = 10,0001 | Millennials, N = 10,0001 | Gen Z, N = 10,0001 |
---|---|---|---|---|
Sex | ||||
Male | 5,047 (50%) | 4,994 (50%) | 5,080 (51%) | 4,965 (50%) |
Female | 4,953 (50%) | 5,006 (50%) | 4,920 (49%) | 5,035 (50%) |
Father's Occupational Class | ||||
Manual | 5,936 (59%) | 5,033 (50%) | 4,007 (40%) | 3,504 (35%) |
Non-Manual | 4,064 (41%) | 4,967 (50%) | 5,993 (60%) | 6,496 (65%) |
Highest Qualification | ||||
A-Level or Below | 9,285 (93%) | 9,001 (90%) | 6,226 (62%) | 4,042 (40%) |
Degree or Above | 715 (7.1%) | 999 (10.0%) | 3,774 (38%) | 5,958 (60%) |
Body Mass Index | 26.9 (24.8, 29.1) | 27.4 (25.1, 29.6) | 30.2 (27.2, 33.2) | 33.0 (29.8, 36.3) |
Depression Diagnosis | ||||
No Depression | 9,357 (94%) | 9,217 (92%) | 8,540 (85%) | 7,881 (79%) |
Diagnosed | 643 (6.4%) | 783 (7.8%) | 1,460 (15%) | 2,119 (21%) |
Survey Participation Indicator | 7,679 (77%) | 6,490 (65%) | 7,752 (78%) | 6,406 (64%) |
1 n (%); Median (IQR) |
Notice that the variable labels in the table appear already
formatted. gtsummary
uses the attribute labels
from the individual columns. (The labels are also displayed by the
look_for()
function above.) attributes
are
metadata attached to objects. For example, the names
attribute in a data.frame
defines the individual column
names.2
In the above table, we see that there has been an increase across cohorts in: the proportion of individuals from non-manual backgrounds, the average level of BMI, the prevalence of diagnosed depression, and the number of individuals with degree level education. There are also differences in the level of missingness by cohort. We will come back to this in a later section.
gtsummary
includes many options to format and change the
tbl_summary()
table, including using different summary
statistics and exporting to Microsoft Word. See the gtsummary
website or YouTube for more
detail. Several other packages also exist for creating descriptive
statistics in R (see this
blog for a summary).
Descriptive tables are helpful, but are limited compared with
visualisations in helping you and your readers to understand the data.
Well designed plots can convey a lot of information concisely, including
enabling readers to make quick comparisons across multiple data points,
such as the distribution of a variable or difference in an estimate by
cohort. Thus, in this guide, we use ggplot2
to produce
several plots to present results. Below we create kernel density plots
of BMI by cohort. This plot allows us to visually examine differences in
the distribution of BMI over time.
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
ggplot(df) +
aes(x = bmi) +
facet_wrap(~ cohort) +
geom_density(data = rename(df, cohort_f = cohort),
aes(group = cohort_f),
color = "grey70", fill = "grey70", alpha = 0.4) +
geom_density(color = cbbPalette[6], fill = cbbPalette[6], alpha = 0.7) +
theme_bw() +
labs(x = "BMI", y = "Density")
ggplot2
syntax is a little confusing at first, but worth
persevering with as it is extraordinarily powerful. ggplot2
uses a concept called the layered grammar of graphics, which is a
consistent way of describing plots. In the grammar of graphics,
different types of plots (scatter plots, line graphs, density plots,
etc.) are just specific instances of a general language. This means that
when you understand the language, you can create different plots very
simply. Compare this to creating plots in Stata, which requires
knowledge of the idiosyncratic features of a particular plot.
The idea behind ggplot2
is as follows:
ggplot2
takes data, which are mapped to aesthetics
(aes()
; x-location, y-location, color, shape, etc.). These
aesthetics are represented visually with geometries
(geom_*()
), that can be displayed across small multiples -
facets (facet_grid()
, facet_wrap()
). Some
geometries require statistical transformations to be produced - for
instance, bar plots require counts to be made from data. The geometries
are projected onto a set of coordinates (Cartesian, polar, etc.) and
finally non-data elements of the plot (title, axis colour) can be
adjusted with themes. The layered aspect of ggplot2
is that
plot elements (e.g., additional geometries) are added on top of the
existing plot.
In the above plot, we took our data frame, df
, and
mapped the variable bmi
to the x-axis. We then added
faceting so that individual cohort data would be mapped to separate
plots. Next we added density geometries (geom_density()
)
which transform the BMI data to produce kernel density estimates and
plot these to the y-axis. We added two layers of density curves. In the
first layer, we used the df
object again (specifying the
data
argument within geom_density()
ensures
the correct data is used 3), but renamed the column
cohort
as cohort_f
and mapped this to the
group
aesthetic so that we would see the density curve for
each cohort in every facet.4 In the second geom_density()
layer we used the initial df
object specified in
ggplot()
to add the cohort-specific densities (in blue) to
the top of the plot. The argument alpha
sets the
transparency of the density curves. Note, we used hex codes from a
color-blind friendly palette (cbbPalette
)
to shade the outline (color
) and interior
(fill
) of the upper density curves. We used
theme_bw()
to control the overall “look” of the plot
(e.g. produces grey panel title backgrounds) and labs
to
set the legend labels (in this case, the x-axis and y-axis labels).
A second way of presenting the BMI data is by using scatter plots,
boxplots, and violin plots. Below we use the layered grammar of
ggplot2
to add each of geometries to a single plot
simultaneously. Note, geom_jitter()
adds a small about of
random noise to the x and y positions of the points to ensure there is
less overlap 5 and coord_flip()
inverts the x
and y-axes. The plot concisely shows the increased average and
variability in BMI across cohorts.
ggplot(df) +
aes(x = cohort, y = bmi) +
geom_jitter(alpha = 0.05, color = "grey30") +
geom_boxplot(alpha = 0, color = cbbPalette[6]) +
geom_violin(alpha = 0, color = cbbPalette[6]) +
coord_flip() +
theme_minimal() +
labs(x = NULL, y = "BMI")
There is a lot of information on ggplot2
online. See,
for example, Hadley Wickham et al.’s ggplot2
book, Keiran
Healy’s Data Visualization: a Practical
Introduction, the #TidyTuesday
Twitter hashtag, and Gina Reynolds’ ggplot2
flipbooks. Gina Reynolds’ flipbooks are a particularly good place to
start as they elegantly convey the sequential, layered procedure for
constructing ggplot2
plots.
In this section we examine the association between depression, BMI and sex and socio-economic position using individual cohort data. For BMI, we estimate OLS models, and for depression, we estimate logistic regression models. In both cases, we present results in absolute and relative effect forms as when comparing across cohorts, the direction of these effects are not necessarily the same (see main paper). To estimate relative effects for BMI, we need to standardize (mean = 0, SD = 1) our data in each cohort, which we do in the next section. In that section, we also prepare our dataset so that it follows principles set out by Vable et al. (2021) - in particular, that binary variables are named such that the top category is clear from variable name.
Below we use several functions from dplyr
(part of the
tidyverse
) to prepare the dataset. We assign the prepared
dataset to the object df_reg
. mutate()
adds
new columns to the data, select()
adds or removes columns
(-var
removes the column var
).
sex
, social _class
and depression
were stored as factors in df
, so we first convert these to
numeric (first category = 1) when creating our binary numeric variables.
group_by()
groups the dataset according to a variable (in
this case cohort
) so that operations are carried out within
groups of observations rather than the full dataset. In this case, we
standardize BMI in each cohort separately.
set_variable_labels()
is from the labelled
package and adds the label attribute to specified columns. These
variable labels will be used when we produce tables of regression
results with gtsummary
.
df_reg <- df %>%
mutate(female = as.numeric(sex) - 1,
manual_sep = 2 - as.numeric(social_class),
depressed = as.numeric(depression) - 1) %>%
group_by(cohort) %>%
mutate(bmi_std = (bmi - mean(bmi)) / sd(bmi)) %>%
ungroup() %>%
select(-sex, -social_class, -education, - depression) %>%
set_variable_labels(female = "Female", manual_sep = "Manual SEP")
glimpse(df_reg)
## Rows: 40,000
## Columns: 9
## $ id <int> 12, 19, 22, 33, 34, 40, 47, 51, 54, 75, 81, 86, 92, 97, 113…
## $ cohort <fct> Boomers, Boomers, Boomers, Boomers, Boomers, Boomers, Boome…
## $ cohort_id <int> 12, 19, 22, 33, 34, 40, 47, 51, 54, 75, 81, 86, 92, 97, 113…
## $ bmi <dbl> 28.36981, 24.44526, 26.21268, 26.98448, 27.27541, 26.46252,…
## $ observed <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,…
## $ female <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ manual_sep <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ depressed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bmi_std <dbl> 0.45480234, -0.78078408, -0.22434043, 0.01864942, 0.1102453…
In the main paper, we noted that care must be taken to harmonise data across cohorts. One option discussed was creating ridit scores to place ordered categorical variables onto 0-1 scales. Below we provide code for creating ridit scores, but do not use these further in the analyses.
library(ridittools)
df %>%
group_by(cohort) %>%
mutate(sep_ridit = toridit(table(social_class))[social_class]) %>%
distinct(cohort, social_class, sep_ridit)
## # A tibble: 8 × 3
## # Groups: cohort [4]
## cohort social_class sep_ridit
## <fct> <fct> <table[1d]>
## 1 Boomers Non-Manual 0.79680
## 2 Boomers Manual 0.29680
## 3 Gen X Non-Manual 0.75165
## 4 Gen X Manual 0.25165
## 5 Millennials Non-Manual 0.70035
## 6 Millennials Manual 0.20035
## 7 Gen Z Non-Manual 0.67520
## 8 Gen Z Manual 0.17520
toridit()
requires a vector of counts (i.e. as outputted
by table()
) and produces a vector of ridit scores. Where
names are provided to the input vector, these are also provided in the
output, thus above we use social_class
to subset the vector
produced by toridit()
. distinct()
(from
dplyr
) returns a data frame with the unique combinations of
variables specified in the function call.
To run a linear regression model, we use the function
lm()
. lm()
takes a formula as its first
argument and a data frame as its second argument. Here we regress
bmi
on female
and manual_sep
and
their interaction using the data from Boomers (filter()
from dplyr()
returns the observations which meet a
condition or set of conditions - in this case where
cohort == "Boomers"
). female*manual_sep
concisely specifies that we want the interaction between female and
manual SEP along with their main effects (for the interaction term only
use female:manual_sep
).6
lm(bmi ~ female*manual_sep,
filter(df_reg, cohort == "Boomers"))
##
## Call:
## lm(formula = bmi ~ female * manual_sep, data = filter(df_reg,
## cohort == "Boomers"))
##
## Coefficients:
## (Intercept) female manual_sep female:manual_sep
## 25.0280 0.8506 2.6853 -0.4037
We could write this command 7 more times so that we had a model for
each cohort for both bmi
(absolute effects) and
bmi_std
(relative effects). To make this more concise, we
can instead write a loop, so that we iterate over cohort and BMI
variable. Below we create a function get_lm()
which takes
two string inputs, cohort
and bmi_var
, and
uses these to run a linear regression model. To produce the correct
model formula, we use the glue()
function from
glue
(part of the tidyverse
) to create a
string which replaces bmi_var
with its value.
library(glue)
get_lm <- function(cohort, bmi_var){
df_mod <- filter(df_reg, cohort == !!cohort)
glue("{bmi_var} ~ female*manual_sep") %>%
as.formula() %>%
lm(df_mod)
}
get_lm("Boomers", "bmi")
##
## Call:
## lm(formula = ., data = df_mod)
##
## Coefficients:
## (Intercept) female manual_sep female:manual_sep
## 25.0280 0.8506 2.6853 -0.4037
To iterate over the cohort and BMI variables, we can first create a
data frame using expand_grid()
that contains the unique
combinations of cohort
and bmi_var
that we
need. Next we can add a new column to this dataset iterating
get_lm()
over these values using the map2()
function. map2()
is from the map*()
family of
functions (from purrr
, part of the tidyverse
).
map2()
takes two objects and a function as inputs and
repeats the function over the values of the two objects returning a list
with the objects passed to the first two arguments of the function.
res_lm <- expand_grid(cohort = unique(df$cohort),
bmi_var = c("bmi", "bmi_std")) %>%
mutate(mod = map2(cohort, bmi_var, get_lm))
res_lm
## # A tibble: 8 × 3
## cohort bmi_var mod
## <fct> <chr> <list>
## 1 Boomers bmi <lm>
## 2 Boomers bmi_std <lm>
## 3 Gen X bmi <lm>
## 4 Gen X bmi_std <lm>
## 5 Millennials bmi <lm>
## 6 Millennials bmi_std <lm>
## 7 Gen Z bmi <lm>
## 8 Gen Z bmi_std <lm>
There are different ways of specifying functions and inputs in the
map*()
functions and different variants of
map*()
that return other types of objects other than lists.
We return to this below. Notice that the mod
column in
res_lm
is a list (known as a list-column) - this means that
data frames are extremely flexible in what data they can store (e.g. a
column of ggplot
objects).7
Note, if we had decided to run robustness checks using different measures of SEP (e.g. ridit scores), changing the above code to loop over SEP variable definition too is a relatively trivial matter.8
get_sep <- function(cohort, bmi_var, sep_var){
df_mod <- filter(df_reg, cohort == !!cohort)
glue("{bmi_var} ~ female*{sep_var}") %>%
as.formula() %>%
lm(df_reg)
}
expand_grid(cohort = unique(df$cohort),
bmi_var = c("bmi", "bmi_std"),
sep_var = c("manual_sep", "sep_ridit")) %>%
mutate(mod = pmap(list(cohort, bmi_var, sep_var), get_sep))
See the purrr
website for more details on pmap()
.
We can produce regression tables using the
tbl_regression()
function from gtsummary
.
tbl_regression
takes a model object as its first argument.
Again, we can use the map*()
family of functions to create
a table for each model in res_lm
- here we use
map()
rather than map2()
because we only
iterate over one input, the column mod
. We can merge
regression tables produced by tbl_regression()
using the
tbl_merge()
function. The argument tab_spanner
specifies the spanning column headers post-merge (here we use the cohort
names with markdown syntax, **
, that specifies these
should be bold). To reduce the width of the table, here
we just create a table for absolute effects
(filter(bmi_var == "bmi")
). pull(tbl)
extracts
a column from a data frame.
res_lm %>%
mutate(tbl = map(mod, tbl_regression)) %>%
filter(bmi_var == "bmi") %>%
pull(tbl) %>%
tbl_merge(tab_spanner = glue("**{unique(res_lm$cohort)}**"))
Characteristic | Boomers | Gen X | Millennials | Gen Z | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | |
Female | 0.85 | 0.67, 1.0 | <0.001 | 0.45 | 0.28, 0.63 | <0.001 | 1.1 | 0.91, 1.3 | <0.001 | 1.3 | 1.1, 1.5 | <0.001 |
Manual SEP | 2.7 | 2.5, 2.8 | <0.001 | 2.3 | 2.1, 2.5 | <0.001 | 4.7 | 4.5, 4.9 | <0.001 | 5.7 | 5.4, 5.9 | <0.001 |
Female * Manual SEP | -0.40 | -0.64, -0.17 | <0.001 | -0.03 | -0.27, 0.22 | 0.8 | 0.49 | 0.21, 0.77 | <0.001 | -0.17 | -0.49, 0.14 | 0.3 |
1 CI = Confidence Interval |
As with tbl_summary()
, there are a lot of options for
formatting the table, including dropping p-values and having beta values
and confidence intervals in a single column. gtsummary
also
contains a function (theme_gtsummary_journal()
) that allow
immediate formatting to a particular journal style (JAMA, etc.). See the
gtsummary
website for more information.
To plot the model results, we first need to extract the estimates
from the model objects. We can do this using the package
broom
(part of the tidyverse
).
broom
contains three main functions: tidy
,
glance
, and augment
. tidy
returns
a data frame which contains the main estimates (beta coefficients,
standard errors, p-values, etc.); glance
returns model
information and fit statistics (sample size, R2, etc.); and
augment
returns observation level information (fitted
values and residuals). Below we use map()
to
tidy
the model objects and then ggplot()
to
display the results. In map()
, we use lambda notation to
specify the function. .x
is a placeholder for the current
value of the object being iterated over. (In map2()
the
placeholders .x
and .y
can be used.)
map()
returns a list, so we use unnest()
(from
tidyr
) to expand the column, which gives us one row per
model-coefficient combination.
library(broom)
res_lm %>%
mutate(tidy = map(mod, ~ tidy(.x, conf.int = TRUE))) %>%
unnest(tidy) %>%
filter(term != "(Intercept)") %>%
ggplot() +
aes(x = term, y = estimate,
ymin = conf.low, ymax = conf.high,
color = cohort, shape = cohort) +
facet_grid(~ bmi_var, scales = "free_x") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange(position = position_dodge(0.6)) +
coord_flip() +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = NULL, y = "Marginal Effect (+ 95% CI)",
color = NULL, shape = NULL)
In the above plot, we specify scales = "free_x"
in
facet_grid()
to have different x-axis scales in the
individual facets. We add the argument
position = position_dodge(0.6)
to
geom_pointrange()
to spread out the individual cohort
estimates for a given coefficient (otherwise they would overlap). Much
more could be done to “prettify” the plot - for instance, changing the
facet titles by adding mutate()
steps prior to
ggplot()
.
Observe in the above plot that there are some differences in the
comparative size of absolute and relative effects, with the
female
coefficient larger in absolute terms (left panel) in
Boomers than Millennials but smaller in relative terms (right panel)
across the same two cohorts.
Below we produce logistic regression models estimating the
association between depression, sex and manual SEP. To run logistic
regression models, we use the run glm()
(for generalized
linear model) with the option family = bionomial
. Again,
for concision, we use map()
to run the model in each cohort
separately. The function distinct()
creates a data frame
with the unique values of cohort
from df_lm
.
Note, the bang-bang, !!
, operator in the
filter()
function is used to ensure that
!!cohort
is replaced with the value of cohort
from the environment (in this case, the input to the
get_glm()
function call) and not the value of
cohort
from the df_reg
data frame itself.
get_glm <- function(cohort){
df_mod <- filter(df_reg, cohort == !!cohort)
glm(depressed ~ female*manual_sep,
family = binomial,
data = df_mod)
}
res_glm <- distinct(df_reg, cohort) %>%
mutate(mod = map(cohort, get_glm))
To produce, regression tables, we can again use the
tbl_regression()
function. By default the function presents
results in log odds form, so we include the argument
exponentiate = TRUE
to provide odds ratios. Note, we pass
this argument to map()
directy as map passes extra
arguments on to the function being iterate over.
res_glm %>%
mutate(tbl = map(mod, tbl_regression, exponentiate = TRUE)) %>%
pull(tbl) %>%
tbl_merge(tab_spanner = glue("**{res_glm$cohort}**"))
Characteristic | Boomers | Gen X | Millennials | Gen Z | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
OR1 | 95% CI1 | p-value | OR1 | 95% CI1 | p-value | OR1 | 95% CI1 | p-value | OR1 | 95% CI1 | p-value | |
Female | 2.65 | 1.91, 3.73 | <0.001 | 1.83 | 1.43, 2.34 | <0.001 | 1.70 | 1.46, 1.98 | <0.001 | 1.53 | 1.35, 1.74 | <0.001 |
Manual SEP | 2.59 | 1.90, 3.59 | <0.001 | 2.04 | 1.60, 2.60 | <0.001 | 1.55 | 1.30, 1.83 | <0.001 | 1.51 | 1.31, 1.75 | <0.001 |
Female * Manual SEP | 0.62 | 0.42, 0.90 | 0.014 | 0.76 | 0.56, 1.04 | 0.084 | 0.83 | 0.66, 1.04 | 0.11 | 0.85 | 0.70, 1.04 | 0.12 |
1 OR = Odds Ratio, CI = Confidence Interval |
To plot the results, we can again use the tidy()
function from broom
. tidy()
also contains an
exponentiate
argument to output odds ratios. In the
ggplot2
call, we also add the function
scale_y_log10()
which rescales the y axis to log form.
res_glm %>%
mutate(tidy = map(mod, ~ tidy(.x, conf.int = TRUE, exponentiate = TRUE))) %>%
select(-mod) %>%
unnest(tidy) %>%
filter(term != "(Intercept)") %>%
ggplot() +
aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
color = cohort, shape = cohort) +
geom_hline(yintercept = 1, linetype = "dashed") +
geom_pointrange(position = position_dodge(0.6)) +
theme_bw() +
theme(legend.position = "bottom") +
scale_y_log10() +
labs(x = "Coefficient", y = "Marginal Effect (+ 95% CI)",
color = NULL, shape = NULL)
The above regression produces marginal effects on the relative scale.
To extract marginal effects on the absolute (i.e. difference in the
probability of depression) scale, we can use the margins()
function from the margins
package. margins
is
based on Stata’s margins command (package website
here). Providing margins()
with a model object and a
data frame of observations on which to calculate effects produces a set
of marginal effect estimates for each observation and model variable.
Using tidy()
on the resulting object produces a data frame
of average marginal effects with standard errors, p-values, etc. More
complex specifications are possible.
Below, we calculate marginal effects on the probability scale for the
variables female
and manual_sep
where the
comparator is a male of high SEP (female = 0
,
manual_sep = 0
). We then plot these results using
ggplot2
.
library(margins)
get_mrg <- function(mod){
new_data <- tibble(female = 0, manual_sep = 0)
margins(mod, new_data) %>%
tidy(conf.int = TRUE)
}
res_glm %>%
mutate(mrg = map(mod, get_mrg)) %>%
select(-mod) %>%
unnest(mrg) %>%
ggplot() +
aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
color = cohort, shape = cohort) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange(position = position_dodge(0.6)) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = "Coefficient", y = "Marginal Effect (+ 95% CI)",
color = NULL, shape = NULL)
Comparing this plot with the previous plot, we see some differences across the two scales. First, on the absolute scale, there is little difference by sex, while on the relative scale, there is a clear increase over time. Second, on the absolute scale, there is a clear increase in the differences in depression by SEP over time, but on the relative scale the opposite is true.
To pool results across cohorts, we can use meta-analytic methods.
Below, we use the rma()
function from metafor
to produce fixed effects (method = "FE"
) and random effects
(method = "REML"
) estimates for the bivariate association
between manual SEP and (standardized) BMI. First, we create a data frame
of individual cohort results (res_sep
), then use
rma()
, tidy()
, and ggplot2
to
produce and display pooled estimates.
get_sep <- function(cohort){
lm(bmi_std ~ manual_sep,
filter(df_reg, cohort == !!cohort)) %>%
tidy() %>%
filter(term == "manual_sep")
}
res_sep <- distinct(df_reg, cohort) %>%
mutate(map_dfr(cohort, get_sep))
library(metafor)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: metadat
##
## Loading the 'metafor' package (version 3.8-1). For an
## introduction to the package please type: help(metafor)
get_meta <- function(method){
rma(estimate, sei = std.error, data = res_sep,
slab = cohort, method = method) %>%
tidy(conf.int = TRUE, include_studies = TRUE)
}
res_meta <- tibble(method = c("FE", "REML")) %>%
mutate(res = map(method, get_meta)) %>%
unnest(res) %>%
distinct(across(-method), .keep_all = TRUE) %>%
mutate(term = case_when(term == "overall" & method == "FE" ~ "FE Model",
term == "overall" & method == "REML" ~ "RE Model",
term != "overall" ~ term),
type = ifelse(type == "study", "Study", "Overall") %>%
fct_rev())
ggplot(res_meta) +
aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high) +
facet_grid(type ~ ., scales = "free_y", switch = "y", space = "free_y") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange() +
scale_y_continuous(expand = c(.1, .1)) +
coord_flip(clip = "off") +
theme_minimal() +
theme(strip.placement = "outside",
strip.text.y.left = element_text(angle = 0),
plot.margin = margin(r = 25)) +
labs(x = NULL, y = "Effect Size (+ 95% CI)")
To carry out a pooled regression analysis to examine cross-cohort
differences in a single model, we can add a further interaction term
with cohort
to our model formula.
pool_mod <- lm(bmi ~ cohort*female*manual_sep, df_reg)
pool_mod %>%
tidy(conf.int = TRUE)
## # A tibble: 16 × 7
## term estim…¹ std.e…² stati…³ p.value conf.low conf.…⁴
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 25.0 0.0746 336. 0 2.49e+1 25.2
## 2 cohortGen X 0.937 0.101 9.30 1.49e- 20 7.39e-1 1.13
## 3 cohortMillennials 2.69 0.0963 27.9 3.45e-170 2.50e+0 2.88
## 4 cohortGen Z 5.46 0.0954 57.2 0 5.27e+0 5.65
## 5 female 0.851 0.106 8.06 7.91e- 16 6.44e-1 1.06
## 6 manual_sep 2.69 0.0965 27.8 1.10e-168 2.50e+0 2.87
## 7 cohortGen X:female -0.396 0.142 -2.78 5.38e- 3 -6.75e-1 -0.117
## 8 cohortMillennials:female 0.242 0.137 1.77 7.68e- 2 -2.60e-2 0.510
## 9 cohortGen Z:female 0.454 0.135 3.38 7.34e- 4 1.91e-1 0.718
## 10 cohortGen X:manual_sep -0.375 0.136 -2.76 5.74e- 3 -6.40e-1 -0.109
## 11 cohortMillennials:manual_… 2.02 0.136 14.8 1.56e- 49 1.75e+0 2.29
## 12 cohortGen Z:manual_sep 2.97 0.139 21.4 4.13e-101 2.70e+0 3.24
## 13 female:manual_sep -0.404 0.137 -2.95 3.22e- 3 -6.72e-1 -0.135
## 14 cohortGen X:female:manual… 0.377 0.192 1.96 4.97e- 2 5.49e-4 0.753
## 15 cohortMillennials:female:… 0.891 0.194 4.59 4.40e- 6 5.11e-1 1.27
## 16 cohortGen Z:female:manual… 0.230 0.197 1.17 2.43e- 1 -1.56e-1 0.615
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.high
Presenting these results in a table is simple enough:
tbl_regression(pool_mod)
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
Cohort | |||
Boomers | — | — | |
Gen X | 0.94 | 0.74, 1.1 | <0.001 |
Millennials | 2.7 | 2.5, 2.9 | <0.001 |
Gen Z | 5.5 | 5.3, 5.6 | <0.001 |
Female | 0.85 | 0.64, 1.1 | <0.001 |
Manual SEP | 2.7 | 2.5, 2.9 | <0.001 |
Cohort * Female | |||
Gen X * Female | -0.40 | -0.68, -0.12 | 0.005 |
Millennials * Female | 0.24 | -0.03, 0.51 | 0.077 |
Gen Z * Female | 0.45 | 0.19, 0.72 | <0.001 |
Cohort * Manual SEP | |||
Gen X * Manual SEP | -0.37 | -0.64, -0.11 | 0.006 |
Millennials * Manual SEP | 2.0 | 1.8, 2.3 | <0.001 |
Gen Z * Manual SEP | 3.0 | 2.7, 3.2 | <0.001 |
Female * Manual SEP | -0.40 | -0.67, -0.14 | 0.003 |
Cohort * Female * Manual SEP | |||
Gen X * Female * Manual SEP | 0.38 | 0.00, 0.75 | 0.050 |
Millennials * Female * Manual SEP | 0.89 | 0.51, 1.3 | <0.001 |
Gen Z * Female * Manual SEP | 0.23 | -0.16, 0.62 | 0.2 |
1 CI = Confidence Interval |
But presenting the results nicely in a plot requires a bit of data
cleaning. Below, we use several functions from the
tidyverse
to do this. First, to focus on cohort interaction
terms, we filter()
the observations where term
begins with cohort. Next, we use separate()
to split the
term
column into three new columns containing the specific
variables in term
. Next, we create a new column,
type
, which contains the cleaned variable name based on the
contents of two_way
and three_way
.
(case_when()
is like an sequentialifelse()
function where the value is set to that after “~
” if the
statement is true. We use uncount()
to create a set of rows
for Boomers (the reference category). uncount()
creates a
new data frame where rows are repeated n times, where
n can differ across row (in this case, we double the number of
rows where cohort
contains "cohortGen X"
. The
across()
function in mutate allows you to repeat a function
across multiple columns (in this case the columns from
estimate
to conf.high
). As with
map()
, it is possible to use lambda notation within
across()
.
res_pool <- pool_mod %>%
tidy(conf.int = TRUE) %>%
filter(str_detect(term, "^cohort")) %>%
separate(term, c("cohort", "two_way", "three_way"),
sep = "\\:", fill = "right") %>%
mutate(type = case_when(is.na(two_way) ~ "Cohort Fixed Effect",
!is.na(three_way) ~ "Female x Manual SEP",
two_way == "female" ~ "Female",
two_way == "manual_sep" ~ "Manual SEP")) %>%
uncount(ifelse(str_detect(cohort, "cohortGen X"), 2, 1), .id = "row") %>%
mutate(across(estimate:conf.high,
~ ifelse(row == 2, 0, .x)),
cohort = str_replace(cohort, "^cohort", ""),
cohort = ifelse(cohort == "Gen X" & row == 2, "Boomers", cohort) %>%
factor(levels(df$cohort)))
ggplot(res_pool) +
aes(x = type, y = estimate, ymin = conf.low, ymax = conf.high,
color = cohort, shape = cohort) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange(position = position_dodge(0.6)) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = "Coefficient (ref. category: Boomers)",
y = "Marginal Effect (+ 95% CI)",
color = NULL, shape = NULL)
In the simulated data, we included an indicator of missingness
(observed
). The degree of missingness differs by sex and
SEP to varying degrees across cohort. The observed data
(observed == 1
) are therefore non-representative of the
full population and the degree of this non-representativeness differs by
cohort. Below, we show how not accounting for missingness can lead to
biased, misleading results.
get_miss <- function(data){
lm(bmi ~ cohort, data) %>%
tidy(conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(term = str_replace(term, "cohort", "") %>%
factor(levels(df$cohort))) %>%
select(term, estimate, conf.low, conf.high)
}
bind_rows(All = get_miss(df_reg),
Observed = get_miss(filter(df_reg, observed == 1)),
.id = "Sample") %>%
ggplot() +
aes(x = term, y = estimate, ymin = conf.low,
ymax = conf.high, color = Sample) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_pointrange(position = position_dodge(0.4)) +
theme_minimal() +
theme(legend.position = c(.1, .9)) +
labs(x = NULL, y = "Difference in BMI (compared to Boomers)")
In the full sample, there are clear increases in BMI across successive cohorts, but in the observed sample, the difference between Boomers and Gen X is close to zero and statistically insignificant. Not accounting for missing data (e.g. by not using non-response weights) could lead to incorrect inferences.
Several methods can be used to account for missing data. The type of
missing data shown here is unit non-response (i.e. missing participants,
rather than missing items). This could be accounted for by deriving
non-response weights. Note, to do this would require some information
about those who are missing from the dataset - which could, for
instance, be inferred from comparing the observed data with that
expected from other contemporary representative datasets (e.g. the
Census). Once derived, weights can be added to the lm()
and
glm()
functions via the weights
argument.
To account for item missingness, imputation procedures such as
multiple imputation by chained equations may be used. The most popular
package for multiple imputation in R is mice
. Describing
this package is beyond the scope of this tutorial. For more information,
see Stef van Buuren’s free online companion textbook, Flexible Imputation of
Missing Data.
Another approach for accounting for missing data is estimation via
Full Information Maximum Likelihood (FIML). FIML can be used with the
package lavaan
. Again, describing this package is beyond
the scope of this paper. For more information, see the package’s companion website.
Relatively more complex cross-cohort analyses may incorporate repeat
measure, longitudinal data. There are several modelling options
available where data contains multiple observations per individual,
including mixed effects (alternatively known as multilevel modelling,
growth curve modelling, and random effects models) and the fixed effects
“within” panel estimator. The main package for mixed effects regression
in R is lme4
, the main functions of which are
lmer
for linear mixed effects models and glmer
for generalized linear mixed effects models. The main packages for panel
effects estimation is plm
. plm
can be used
with the broom
packages. lme4
requires the
(similar) broom.mixed
package. The syntax for specifying
lme4
and plm
models is similar to that using
lm()
and glm()
.9
When using lme4
to produce growth curve models, splines
can be used to flexibly model growth trajectories. The
splines
package includes several functions to create
splines (including the ns()
and bs()
functions
for natural cubic splines and B-splines, respectively). To extract model
marginal effects from lme4
when using splines, the
margins
or multcomp
packages can be used.
A further option for modelling longitudinal data is estimating latent trajectory models such as latent growth curve analysis (LGCA) models or growth mixture models (GMM). For further information on these, including exemplar R code, see Herle et al. (2020).
R v 4.1 and above contain a native pipe operator
(|>
). Here we use the the tidyverse
pipe
for compatibility with earlier versions of R, and further as the native
pipe has a less simple syntax for piping objects into different
arguments of a function. There is a small performance loss using the
tidyverse
pipe, which isn’t noticeable in most use cases.↩︎
To view the attributes attached to an object, use the
str()
(structure) function or attributes()
. To
access or overwrite a specific attribute, use
attr(obj_name, attribute_name)
. Some frequently used
attributes, such as names
and class
, also have
their own functions (names()
and class()
).↩︎
Note, ggplot2
still inherits the
aes(x = bmi)
mapping but does this mapping with the new
data.↩︎
In other words, this process masked two steps. Removing
the column cohort
from df
ensures that the
whole dataset is included in every facet as cohort
is not
available to assign data to different facets. Adding
group = cohort_f
specifies we want a density plot for each
cohort_f
.↩︎
geom_point()
can be used to render the
points faithfully.↩︎
Another alternative for including interaction terms is
to explicitly create new variables which multiply variables together.
However, this should generally be avoided because when using
postestimation commands, such as those calculating marginal effects, the
relevant function won’t “know” that the new variable is dependent on the
other variables, whereas this is explicit when var1*var2
or
var1:var2
are used.↩︎
A data frame is just a list with the class
attribute "data.frame"
and where each element of the list
is the same length.↩︎
Note, the following won’t run as we have not created the
sep_ridit
variable in df_reg
.↩︎
Specifying random effects is required when writing model
formula for lme4
. The syntax for this is very simple.
y ~ x + (1 + x | id)
would include random intercepts and
slopes (for x
) in a regression of y
on
x
.↩︎