Comparative Research Analysis Example

Introduction

In this document, we provide a tutorial on conducting simple comparative research analyses with R This document accompanies the Discover Social Science and Health paper Investigating change across time in prevalence or association: the challenges of cross-study comparative research and possible solutions by Bann et al. (2022). A similar tutorial using Stata is provided on the Open Science Foundation website (https://osf.io/d569x/) and on GitHub (https://ljwright.github.io/cross-cohort-tutorial/stata_syntax.html). Both tutorials cover almost identical material, but the HTML files are produced using different software (RMarkdown for the R tutorial, Jupyter Lab for the Stata tutorial). This explains the different look of the two tutorial files. The current tutorial was produced with R version 4.1.1, but should work with (at least) R versions 3.6.0 and above.

To demonstrate the concepts discussed in the paper, we will use a simple simulated dataset, which contains data from 40,000 individuals across four cohorts (Boomers, Gen X, Millennials, and Gen Z) each surveyed at the same age. The dataset can be downloaded from the Open Science Foundation website (https://osf.io/d569x/). The code in this document is also be used for conducting analyses with repeat cross-sectional data, but we limit ourselves to this example dataset to make the code easier to follow along.

library(labelled)
load("cohort_data.Rdata")

look_for(df)

##  pos variable     label                          col_type values          
##  1   id           Person ID                      int                      
##  2   cohort       Cohort                         fct      Boomers         
##                                                           Gen X           
##                                                           Millennials     
##                                                           Gen Z           
##  3   cohort_id    Person ID (Cohort)             int                      
##  4   sex          Sex                            fct      Male            
##                                                           Female          
##  5   social_class Father's Occupational Class    fct      Manual          
##                                                           Non-Manual      
##  6   education    Highest Qualification          fct      A-Level or Below
##                                                           Degree or Above 
##  7   bmi          Body Mass Index                dbl                      
##  8   depression   Depression Diagnosis           fct      No Depression   
##                                                           Diagnosed       
##  9   observed     Survey Participation Indicator dbl

The dataset contains observations collected at age 25 across the four cohorts. There are two variables that we will use as outcome variables in the analyses: the continuous variable, bmi, and the binary variable, depression. There are two variables that we will analyse as exposure variables: the binary variable sex and the binary variable social_class. We will use these data to explore changes in social- and sex-inequalities across cohorts. We will produce cohort specific descriptive statistics, estimate regression models in both individual cohorts and in pooled analyses, and perform meta-analyses of model results. The dataset also contains a study participation indicator, observed, that we will use to show how missingness can bias results.

We’ll primarily use the tidyverse package to prepare, analyse, and present the data. We’ll load other packages when they are required, in order to make clearer which packages functions belong to. (Above we loaded the labelled package to use the look_for() function.)

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Before starting the analysis, it is worth discussing the tidyverse in further detail.

The `tidyverse`

The tidyverse is an “opinionated” package of packages that share a common design philosophy (https://www.tidyverse.org/). Popular tidyverse packages include ggplot2, dplyr, tidyr, stringr and purrr. Above you’ll see that the command library(tidyverse) loaded several packages. These are known as the “core” tidyverse packages. When installing the tidyverse, more packages are installed than the core packages (for instance, broom and glue), but these need to be loaded separately before use.

Because the tidyverse packages share a design philosophy, they work very well together (https://design.tidyverse.org/). Other developers can also create packages using the same philosophy, which means they can fit into the tidyverse ecosystem. This solves some of the coordination issues that can make using open-source software difficult.

Two central elements of the tidyverse are the use of tidy data and the %>% (“pipe”) operator. Tidy data are data in rectangular (data frame) format where each row is an observation and each column is a variable (Wickham 2014). What is tidy can depend on context, but in general, *wide* data - where each row contains repeat observations from different timepoints - are not tidy, whereas long data - datasets with individual rows for each person time combination - are.

The pipe operator allows R users to chain functions together, piping the object on the left hand side to the first argument of the function on the right hand side. This makes code much easier to write and to read. Compare:

paste(mean(1:10), "is the mean of 1 to 10.")

## [1] "5.5 is the mean of 1 to 10."

1:10 %>%
  mean() %>%
  paste("is the mean of 1 to 10")

## [1] "5.5 is the mean of 1 to 10"

The pipe can be read sequentially (i.e. do this, then this, then this, …, etc). Users can also pipe objects into different arguments of a function using the . placeholder.¹

paste("The mean of 1 to 10 is", mean(1:10))

## [1] "The mean of 1 to 10 is 5.5"

1:10 %>%
  mean() %>%
  paste("The mean of 1 to 10 is", .)

## [1] "The mean of 1 to 10 is 5.5"

There are a lot of excellent tutorials on the tidyverse online. See, for instance, see the R for Data Science book, the R for Data Science Online Learning Community, and the #rstats and #tidytuesday Twitter hashtags.

Descriptive Statistics

To produce descriptive statistics by cohort, we will use the gtsummary package. The gtsummary package (http://www.danieldsjoberg.com/gtsummary/) uses tidyverse syntax but is not part of the tidyverse . To produce a table of descriptive statistics by cohort, we pass our data to the tbl_summary() function with the option by = cohort. We don’t need descriptive statistics for either of the ID columns (id, cohort_id), so we drop these using the select() function (from dplyr) before piping in to tbl_summary().

library(gtsummary)
df %>%
  select(-id, -cohort_id) %>%
  tbl_summary(by = cohort)

Characteristic	Boomers, N = 10,000¹	Gen X, N = 10,000¹	Millennials, N = 10,000¹	Gen Z, N = 10,000¹
Sex
Male	5,047 (50%)	4,994 (50%)	5,080 (51%)	4,965 (50%)
Female	4,953 (50%)	5,006 (50%)	4,920 (49%)	5,035 (50%)
Father's Occupational Class
Manual	5,936 (59%)	5,033 (50%)	4,007 (40%)	3,504 (35%)
Non-Manual	4,064 (41%)	4,967 (50%)	5,993 (60%)	6,496 (65%)
Highest Qualification
A-Level or Below	9,285 (93%)	9,001 (90%)	6,226 (62%)	4,042 (40%)
Degree or Above	715 (7.1%)	999 (10.0%)	3,774 (38%)	5,958 (60%)
Body Mass Index	26.9 (24.8, 29.1)	27.4 (25.1, 29.6)	30.2 (27.2, 33.2)	33.0 (29.8, 36.3)
Depression Diagnosis
No Depression	9,357 (94%)	9,217 (92%)	8,540 (85%)	7,881 (79%)
Diagnosed	643 (6.4%)	783 (7.8%)	1,460 (15%)	2,119 (21%)
Survey Participation Indicator	7,679 (77%)	6,490 (65%)	7,752 (78%)	6,406 (64%)
¹ n (%); Median (IQR)

Notice that the variable labels in the table appear already formatted. gtsummary uses the attribute labels from the individual columns. (The labels are also displayed by the look_for() function above.) attributes are metadata attached to objects. For example, the names attribute in a data.frame defines the individual column names.²

In the above table, we see that there has been an increase across cohorts in: the proportion of individuals from non-manual backgrounds, the average level of BMI, the prevalence of diagnosed depression, and the number of individuals with degree level education. There are also differences in the level of missingness by cohort. We will come back to this in a later section.

gtsummary includes many options to format and change the tbl_summary() table, including using different summary statistics and exporting to Microsoft Word. See the gtsummary website or YouTube for more detail. Several other packages also exist for creating descriptive statistics in R (see this blog for a summary).

Descriptive tables are helpful, but are limited compared with visualisations in helping you and your readers to understand the data. Well designed plots can convey a lot of information concisely, including enabling readers to make quick comparisons across multiple data points, such as the distribution of a variable or difference in an estimate by cohort. Thus, in this guide, we use ggplot2 to produce several plots to present results. Below we create kernel density plots of BMI by cohort. This plot allows us to visually examine differences in the distribution of BMI over time.

cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
                "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(df) +
  aes(x = bmi) +
  facet_wrap(~ cohort) +
  geom_density(data = rename(df, cohort_f = cohort), 
               aes(group = cohort_f),
               color = "grey70", fill = "grey70", alpha = 0.4) +
  geom_density(color = cbbPalette[6], fill = cbbPalette[6], alpha = 0.7) +
  theme_bw() +
  labs(x = "BMI", y = "Density")

ggplot2 syntax is a little confusing at first, but worth persevering with as it is extraordinarily powerful. ggplot2 uses a concept called the layered grammar of graphics, which is a consistent way of describing plots. In the grammar of graphics, different types of plots (scatter plots, line graphs, density plots, etc.) are just specific instances of a general language. This means that when you understand the language, you can create different plots very simply. Compare this to creating plots in Stata, which requires knowledge of the idiosyncratic features of a particular plot.

The idea behind ggplot2 is as follows: ggplot2 takes data, which are mapped to aesthetics (aes(); x-location, y-location, color, shape, etc.). These aesthetics are represented visually with geometries (geom_*()), that can be displayed across small multiples - facets (facet_grid(), facet_wrap()). Some geometries require statistical transformations to be produced - for instance, bar plots require counts to be made from data. The geometries are projected onto a set of coordinates (Cartesian, polar, etc.) and finally non-data elements of the plot (title, axis colour) can be adjusted with themes. The layered aspect of ggplot2 is that plot elements (e.g., additional geometries) are added on top of the existing plot.

In the above plot, we took our data frame, df, and mapped the variable bmi to the x-axis. We then added faceting so that individual cohort data would be mapped to separate plots. Next we added density geometries (geom_density()) which transform the BMI data to produce kernel density estimates and plot these to the y-axis. We added two layers of density curves. In the first layer, we used the df object again (specifying the data argument within geom_density() ensures the correct data is used ³), but renamed the column cohort as cohort_f and mapped this to the group aesthetic so that we would see the density curve for each cohort in every facet.⁴ In the second geom_density() layer we used the initial df object specified in ggplot() to add the cohort-specific densities (in blue) to the top of the plot. The argument alpha sets the transparency of the density curves. Note, we used hex codes from a color-blind friendly palette (cbbPalette) to shade the outline (color) and interior (fill) of the upper density curves. We used theme_bw() to control the overall “look” of the plot (e.g. produces grey panel title backgrounds) and labs to set the legend labels (in this case, the x-axis and y-axis labels).

A second way of presenting the BMI data is by using scatter plots, boxplots, and violin plots. Below we use the layered grammar of ggplot2 to add each of geometries to a single plot simultaneously. Note, geom_jitter() adds a small about of random noise to the x and y positions of the points to ensure there is less overlap ⁵ and coord_flip() inverts the x and y-axes. The plot concisely shows the increased average and variability in BMI across cohorts.

ggplot(df) +
  aes(x = cohort, y = bmi) +
  geom_jitter(alpha = 0.05, color = "grey30") +
  geom_boxplot(alpha = 0, color = cbbPalette[6]) +
  geom_violin(alpha = 0, color = cbbPalette[6]) +
  coord_flip() +
  theme_minimal() +
  labs(x = NULL, y = "BMI")

There is a lot of information on ggplot2 online. See, for example, Hadley Wickham et al.’s ggplot2 book, Keiran Healy’s Data Visualization: a Practical Introduction, the #TidyTuesday Twitter hashtag, and Gina Reynolds’ ggplot2 flipbooks. Gina Reynolds’ flipbooks are a particularly good place to start as they elegantly convey the sequential, layered procedure for constructing ggplot2 plots.

Study-Specific Regressions

In this section we examine the association between depression, BMI and sex and socio-economic position using individual cohort data. For BMI, we estimate OLS models, and for depression, we estimate logistic regression models. In both cases, we present results in absolute and relative effect forms as when comparing across cohorts, the direction of these effects are not necessarily the same (see main paper). To estimate relative effects for BMI, we need to standardize (mean = 0, SD = 1) our data in each cohort, which we do in the next section. In that section, we also prepare our dataset so that it follows principles set out by Vable et al. (2021) - in particular, that binary variables are named such that the top category is clear from variable name.

Data Preparation

Below we use several functions from dplyr (part of the tidyverse) to prepare the dataset. We assign the prepared dataset to the object df_reg. mutate() adds new columns to the data, select() adds or removes columns (-var removes the column var). sex, social _class and depression were stored as factors in df, so we first convert these to numeric (first category = 1) when creating our binary numeric variables. group_by() groups the dataset according to a variable (in this case cohort) so that operations are carried out within groups of observations rather than the full dataset. In this case, we standardize BMI in each cohort separately. set_variable_labels() is from the labelled package and adds the label attribute to specified columns. These variable labels will be used when we produce tables of regression results with gtsummary.

df_reg <- df %>%
  mutate(female = as.numeric(sex) - 1,
         manual_sep = 2 - as.numeric(social_class),
         depressed = as.numeric(depression) - 1) %>%
  group_by(cohort) %>%
  mutate(bmi_std = (bmi - mean(bmi)) / sd(bmi)) %>%
  ungroup() %>%
  select(-sex, -social_class, -education, - depression) %>%
  set_variable_labels(female = "Female", manual_sep = "Manual SEP")

glimpse(df_reg)

## Rows: 40,000
## Columns: 9
## $ id         <int> 12, 19, 22, 33, 34, 40, 47, 51, 54, 75, 81, 86, 92, 97, 113…
## $ cohort     <fct> Boomers, Boomers, Boomers, Boomers, Boomers, Boomers, Boome…
## $ cohort_id  <int> 12, 19, 22, 33, 34, 40, 47, 51, 54, 75, 81, 86, 92, 97, 113…
## $ bmi        <dbl> 28.36981, 24.44526, 26.21268, 26.98448, 27.27541, 26.46252,…
## $ observed   <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,…
## $ female     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ manual_sep <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ depressed  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bmi_std    <dbl> 0.45480234, -0.78078408, -0.22434043, 0.01864942, 0.1102453…

In the main paper, we noted that care must be taken to harmonise data across cohorts. One option discussed was creating ridit scores to place ordered categorical variables onto 0-1 scales. Below we provide code for creating ridit scores, but do not use these further in the analyses.

library(ridittools)

df %>%
  group_by(cohort) %>%
  mutate(sep_ridit = toridit(table(social_class))[social_class]) %>%
  distinct(cohort, social_class, sep_ridit)

## # A tibble: 8 × 3
## # Groups:   cohort [4]
##   cohort      social_class sep_ridit  
##   <fct>       <fct>        <table[1d]>
## 1 Boomers     Non-Manual   0.79680    
## 2 Boomers     Manual       0.29680    
## 3 Gen X       Non-Manual   0.75165    
## 4 Gen X       Manual       0.25165    
## 5 Millennials Non-Manual   0.70035    
## 6 Millennials Manual       0.20035    
## 7 Gen Z       Non-Manual   0.67520    
## 8 Gen Z       Manual       0.17520

toridit() requires a vector of counts (i.e. as outputted by table()) and produces a vector of ridit scores. Where names are provided to the input vector, these are also provided in the output, thus above we use social_class to subset the vector produced by toridit(). distinct() (from dplyr) returns a data frame with the unique combinations of variables specified in the function call.

BMI OLS Models

To run a linear regression model, we use the function lm(). lm() takes a formula as its first argument and a data frame as its second argument. Here we regress bmi on female and manual_sep and their interaction using the data from Boomers (filter() from dplyr() returns the observations which meet a condition or set of conditions - in this case where cohort == "Boomers"). female*manual_sep concisely specifies that we want the interaction between female and manual SEP along with their main effects (for the interaction term only use female:manual_sep).⁶

lm(bmi ~ female*manual_sep,
   filter(df_reg, cohort == "Boomers"))

## 
## Call:
## lm(formula = bmi ~ female * manual_sep, data = filter(df_reg, 
##     cohort == "Boomers"))
## 
## Coefficients:
##       (Intercept)             female         manual_sep  female:manual_sep  
##           25.0280             0.8506             2.6853            -0.4037

We could write this command 7 more times so that we had a model for each cohort for both bmi (absolute effects) and bmi_std (relative effects). To make this more concise, we can instead write a loop, so that we iterate over cohort and BMI variable. Below we create a function get_lm() which takes two string inputs, cohort and bmi_var, and uses these to run a linear regression model. To produce the correct model formula, we use the glue() function from glue (part of the tidyverse) to create a string which replaces bmi_var with its value.

library(glue)

get_lm <- function(cohort, bmi_var){
  df_mod <- filter(df_reg, cohort == !!cohort)
  
  glue("{bmi_var} ~ female*manual_sep") %>%
    as.formula() %>%
    lm(df_mod)
}

get_lm("Boomers", "bmi")

## 
## Call:
## lm(formula = ., data = df_mod)
## 
## Coefficients:
##       (Intercept)             female         manual_sep  female:manual_sep  
##           25.0280             0.8506             2.6853            -0.4037

To iterate over the cohort and BMI variables, we can first create a data frame using expand_grid() that contains the unique combinations of cohort and bmi_var that we need. Next we can add a new column to this dataset iterating get_lm() over these values using the map2() function. map2() is from the map*() family of functions (from purrr, part of the tidyverse). map2() takes two objects and a function as inputs and repeats the function over the values of the two objects returning a list with the objects passed to the first two arguments of the function.

res_lm <- expand_grid(cohort = unique(df$cohort),
                      bmi_var = c("bmi", "bmi_std")) %>%
  mutate(mod = map2(cohort, bmi_var, get_lm))

res_lm

## # A tibble: 8 × 3
##   cohort      bmi_var mod   
##   <fct>       <chr>   <list>
## 1 Boomers     bmi     <lm>  
## 2 Boomers     bmi_std <lm>  
## 3 Gen X       bmi     <lm>  
## 4 Gen X       bmi_std <lm>  
## 5 Millennials bmi     <lm>  
## 6 Millennials bmi_std <lm>  
## 7 Gen Z       bmi     <lm>  
## 8 Gen Z       bmi_std <lm>

There are different ways of specifying functions and inputs in the map*() functions and different variants of map*() that return other types of objects other than lists. We return to this below. Notice that the mod column in res_lm is a list (known as a list-column) - this means that data frames are extremely flexible in what data they can store (e.g. a column of ggplot objects).⁷

Note, if we had decided to run robustness checks using different measures of SEP (e.g. ridit scores), changing the above code to loop over SEP variable definition too is a relatively trivial matter.⁸

get_sep <- function(cohort, bmi_var, sep_var){
  df_mod <- filter(df_reg, cohort == !!cohort)
  
  glue("{bmi_var} ~ female*{sep_var}") %>%
    as.formula() %>%
    lm(df_reg)
}

expand_grid(cohort = unique(df$cohort),
            bmi_var = c("bmi", "bmi_std"),
            sep_var = c("manual_sep", "sep_ridit")) %>%
  mutate(mod = pmap(list(cohort, bmi_var, sep_var), get_sep))

See the purrr website for more details on pmap().

Regression Tables

We can produce regression tables using the tbl_regression() function from gtsummary. tbl_regression takes a model object as its first argument. Again, we can use the map*() family of functions to create a table for each model in res_lm - here we use map() rather than map2() because we only iterate over one input, the column mod. We can merge regression tables produced by tbl_regression() using the tbl_merge() function. The argument tab_spanner specifies the spanning column headers post-merge (here we use the cohort names with markdown syntax, ** , that specifies these should be bold). To reduce the width of the table, here we just create a table for absolute effects (filter(bmi_var == "bmi")). pull(tbl) extracts a column from a data frame.

res_lm %>%
  mutate(tbl = map(mod, tbl_regression)) %>%
  filter(bmi_var == "bmi") %>%
  pull(tbl) %>%
  tbl_merge(tab_spanner = glue("**{unique(res_lm$cohort)}**"))

Characteristic	Boomers			Gen X			Millennials			Gen Z
Characteristic	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value	Beta	95% CI¹	p-value
Female	0.85	0.67, 1.0	<0.001	0.45	0.28, 0.63	<0.001	1.1	0.91, 1.3	<0.001	1.3	1.1, 1.5	<0.001
Manual SEP	2.7	2.5, 2.8	<0.001	2.3	2.1, 2.5	<0.001	4.7	4.5, 4.9	<0.001	5.7	5.4, 5.9	<0.001
Female * Manual SEP	-0.40	-0.64, -0.17	<0.001	-0.03	-0.27, 0.22	0.8	0.49	0.21, 0.77	<0.001	-0.17	-0.49, 0.14	0.3
¹ CI = Confidence Interval

As with tbl_summary(), there are a lot of options for formatting the table, including dropping p-values and having beta values and confidence intervals in a single column. gtsummary also contains a function (theme_gtsummary_journal()) that allow immediate formatting to a particular journal style (JAMA, etc.). See the gtsummary website for more information.

Plotting Results

To plot the model results, we first need to extract the estimates from the model objects. We can do this using the package broom (part of the tidyverse). broom contains three main functions: tidy, glance, and augment. tidy returns a data frame which contains the main estimates (beta coefficients, standard errors, p-values, etc.); glance returns model information and fit statistics (sample size, R², etc.); and augment returns observation level information (fitted values and residuals). Below we use map() to tidy the model objects and then ggplot() to display the results. In map(), we use lambda notation to specify the function. .x is a placeholder for the current value of the object being iterated over. (In map2() the placeholders .x and .y can be used.) map() returns a list, so we use unnest() (from tidyr) to expand the column, which gives us one row per model-coefficient combination.

library(broom)

res_lm %>%
  mutate(tidy = map(mod, ~ tidy(.x, conf.int = TRUE))) %>%
  unnest(tidy) %>%
  filter(term != "(Intercept)") %>%
  ggplot() +
  aes(x = term, y = estimate, 
      ymin = conf.low, ymax = conf.high,
      color = cohort, shape = cohort) +
  facet_grid(~ bmi_var, scales = "free_x") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(position = position_dodge(0.6)) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = NULL, y = "Marginal Effect (+ 95% CI)",
       color = NULL, shape = NULL)

In the above plot, we specify scales = "free_x" in facet_grid() to have different x-axis scales in the individual facets. We add the argument position = position_dodge(0.6) to geom_pointrange() to spread out the individual cohort estimates for a given coefficient (otherwise they would overlap). Much more could be done to “prettify” the plot - for instance, changing the facet titles by adding mutate() steps prior to ggplot().

Observe in the above plot that there are some differences in the comparative size of absolute and relative effects, with the female coefficient larger in absolute terms (left panel) in Boomers than Millennials but smaller in relative terms (right panel) across the same two cohorts.

Depression Logistic Regression Models

Below we produce logistic regression models estimating the association between depression, sex and manual SEP. To run logistic regression models, we use the run glm() (for generalized linear model) with the option family = bionomial. Again, for concision, we use map() to run the model in each cohort separately. The function distinct() creates a data frame with the unique values of cohort from df_lm. Note, the bang-bang, !!, operator in the filter() function is used to ensure that !!cohort is replaced with the value of cohort from the environment (in this case, the input to the get_glm() function call) and not the value of cohort from the df_reg data frame itself.

get_glm <- function(cohort){
  df_mod <- filter(df_reg, cohort == !!cohort)
  
  glm(depressed ~ female*manual_sep, 
      family = binomial, 
      data = df_mod)
}

res_glm <- distinct(df_reg, cohort) %>%
  mutate(mod = map(cohort, get_glm))

Regression Tables

To produce, regression tables, we can again use the tbl_regression() function. By default the function presents results in log odds form, so we include the argument exponentiate = TRUE to provide odds ratios. Note, we pass this argument to map() directy as map passes extra arguments on to the function being iterate over.

res_glm %>%
  mutate(tbl = map(mod, tbl_regression, exponentiate = TRUE)) %>%
  pull(tbl) %>%
  tbl_merge(tab_spanner = glue("**{res_glm$cohort}**"))

Characteristic	Boomers			Gen X			Millennials			Gen Z
Characteristic	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value	OR¹	95% CI¹	p-value
Female	2.65	1.91, 3.73	<0.001	1.83	1.43, 2.34	<0.001	1.70	1.46, 1.98	<0.001	1.53	1.35, 1.74	<0.001
Manual SEP	2.59	1.90, 3.59	<0.001	2.04	1.60, 2.60	<0.001	1.55	1.30, 1.83	<0.001	1.51	1.31, 1.75	<0.001
Female * Manual SEP	0.62	0.42, 0.90	0.014	0.76	0.56, 1.04	0.084	0.83	0.66, 1.04	0.11	0.85	0.70, 1.04	0.12
¹ OR = Odds Ratio, CI = Confidence Interval

Plotting Results

To plot the results, we can again use the tidy() function from broom. tidy() also contains an exponentiate argument to output odds ratios. In the ggplot2 call, we also add the function scale_y_log10() which rescales the y axis to log form.

res_glm %>%
  mutate(tidy = map(mod, ~ tidy(.x, conf.int = TRUE, exponentiate = TRUE))) %>%
  select(-mod) %>%
  unnest(tidy) %>%
  filter(term != "(Intercept)") %>%
  ggplot() +
  aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
      color = cohort, shape = cohort) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_pointrange(position = position_dodge(0.6)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  scale_y_log10() +
  labs(x = "Coefficient", y = "Marginal Effect (+ 95% CI)",
       color = NULL, shape = NULL)

Marginal Effects on Absolute Scale

The above regression produces marginal effects on the relative scale. To extract marginal effects on the absolute (i.e. difference in the probability of depression) scale, we can use the margins() function from the margins package. margins is based on Stata’s margins command (package website here). Providing margins() with a model object and a data frame of observations on which to calculate effects produces a set of marginal effect estimates for each observation and model variable. Using tidy() on the resulting object produces a data frame of average marginal effects with standard errors, p-values, etc. More complex specifications are possible.

Below, we calculate marginal effects on the probability scale for the variables female and manual_sep where the comparator is a male of high SEP (female = 0, manual_sep = 0). We then plot these results using ggplot2.

library(margins)

get_mrg <- function(mod){
  new_data <- tibble(female = 0, manual_sep = 0)
  
  margins(mod, new_data) %>%
    tidy(conf.int = TRUE)
}

res_glm %>%
  mutate(mrg = map(mod, get_mrg)) %>%
  select(-mod) %>%
  unnest(mrg) %>%
  ggplot() +
  aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
      color = cohort, shape = cohort) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(position = position_dodge(0.6)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Coefficient", y = "Marginal Effect (+ 95% CI)",
       color = NULL, shape = NULL)

Comparing this plot with the previous plot, we see some differences across the two scales. First, on the absolute scale, there is little difference by sex, while on the relative scale, there is a clear increase over time. Second, on the absolute scale, there is a clear increase in the differences in depression by SEP over time, but on the relative scale the opposite is true.

Meta-Analysis

To pool results across cohorts, we can use meta-analytic methods. Below, we use the rma() function from metafor to produce fixed effects (method = "FE") and random effects (method = "REML") estimates for the bivariate association between manual SEP and (standardized) BMI. First, we create a data frame of individual cohort results (res_sep), then use rma(), tidy(), and ggplot2 to produce and display pooled estimates.

get_sep <- function(cohort){
  lm(bmi_std ~ manual_sep, 
     filter(df_reg, cohort == !!cohort)) %>%
    tidy() %>%
    filter(term == "manual_sep")
}

res_sep <- distinct(df_reg, cohort) %>%
  mutate(map_dfr(cohort, get_sep))

library(metafor)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loading required package: metadat

## 
## Loading the 'metafor' package (version 3.8-1). For an
## introduction to the package please type: help(metafor)

get_meta <- function(method){
  rma(estimate, sei = std.error, data = res_sep,
      slab = cohort, method = method) %>%
    tidy(conf.int = TRUE, include_studies = TRUE)
}

res_meta <- tibble(method = c("FE", "REML")) %>%
  mutate(res = map(method, get_meta)) %>%
  unnest(res) %>%
  distinct(across(-method), .keep_all = TRUE) %>%
  mutate(term = case_when(term == "overall" & method == "FE" ~ "FE Model",
                          term == "overall" & method == "REML" ~ "RE Model",
                          term != "overall" ~ term),
         type = ifelse(type == "study", "Study", "Overall") %>%
           fct_rev())

ggplot(res_meta) +
  aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high) +
  facet_grid(type ~ ., scales = "free_y", switch = "y", space = "free_y") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange() +
  scale_y_continuous(expand = c(.1, .1)) +
  coord_flip(clip = "off") +
  theme_minimal() +
  theme(strip.placement = "outside",
        strip.text.y.left = element_text(angle = 0),
        plot.margin = margin(r = 25)) +
  labs(x = NULL, y = "Effect Size (+ 95% CI)")

Pooled Cohort Regressions

To carry out a pooled regression analysis to examine cross-cohort differences in a single model, we can add a further interaction term with cohort to our model formula.

pool_mod <- lm(bmi ~ cohort*female*manual_sep, df_reg) 

pool_mod %>%
  tidy(conf.int = TRUE)

## # A tibble: 16 × 7
##    term                       estim…¹ std.e…² stati…³   p.value conf.low conf.…⁴
##    <chr>                        <dbl>   <dbl>   <dbl>     <dbl>    <dbl>   <dbl>
##  1 (Intercept)                 25.0    0.0746  336.   0          2.49e+1  25.2  
##  2 cohortGen X                  0.937  0.101     9.30 1.49e- 20  7.39e-1   1.13 
##  3 cohortMillennials            2.69   0.0963   27.9  3.45e-170  2.50e+0   2.88 
##  4 cohortGen Z                  5.46   0.0954   57.2  0          5.27e+0   5.65 
##  5 female                       0.851  0.106     8.06 7.91e- 16  6.44e-1   1.06 
##  6 manual_sep                   2.69   0.0965   27.8  1.10e-168  2.50e+0   2.87 
##  7 cohortGen X:female          -0.396  0.142    -2.78 5.38e-  3 -6.75e-1  -0.117
##  8 cohortMillennials:female     0.242  0.137     1.77 7.68e-  2 -2.60e-2   0.510
##  9 cohortGen Z:female           0.454  0.135     3.38 7.34e-  4  1.91e-1   0.718
## 10 cohortGen X:manual_sep      -0.375  0.136    -2.76 5.74e-  3 -6.40e-1  -0.109
## 11 cohortMillennials:manual_…   2.02   0.136    14.8  1.56e- 49  1.75e+0   2.29 
## 12 cohortGen Z:manual_sep       2.97   0.139    21.4  4.13e-101  2.70e+0   3.24 
## 13 female:manual_sep           -0.404  0.137    -2.95 3.22e-  3 -6.72e-1  -0.135
## 14 cohortGen X:female:manual…   0.377  0.192     1.96 4.97e-  2  5.49e-4   0.753
## 15 cohortMillennials:female:…   0.891  0.194     4.59 4.40e-  6  5.11e-1   1.27 
## 16 cohortGen Z:female:manual…   0.230  0.197     1.17 2.43e-  1 -1.56e-1   0.615
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## #   ⁴conf.high

Presenting these results in a table is simple enough:

tbl_regression(pool_mod)

Characteristic	Beta	95% CI¹	p-value
Cohort
Boomers	—	—
Gen X	0.94	0.74, 1.1	<0.001
Millennials	2.7	2.5, 2.9	<0.001
Gen Z	5.5	5.3, 5.6	<0.001
Female	0.85	0.64, 1.1	<0.001
Manual SEP	2.7	2.5, 2.9	<0.001
Cohort * Female
Gen X * Female	-0.40	-0.68, -0.12	0.005
Millennials * Female	0.24	-0.03, 0.51	0.077
Gen Z * Female	0.45	0.19, 0.72	<0.001
Cohort * Manual SEP
Gen X * Manual SEP	-0.37	-0.64, -0.11	0.006
Millennials * Manual SEP	2.0	1.8, 2.3	<0.001
Gen Z * Manual SEP	3.0	2.7, 3.2	<0.001
Female * Manual SEP	-0.40	-0.67, -0.14	0.003
Cohort * Female * Manual SEP
Gen X * Female * Manual SEP	0.38	0.00, 0.75	0.050
Millennials * Female * Manual SEP	0.89	0.51, 1.3	<0.001
Gen Z * Female * Manual SEP	0.23	-0.16, 0.62	0.2
¹ CI = Confidence Interval

But presenting the results nicely in a plot requires a bit of data cleaning. Below, we use several functions from the tidyverseto do this. First, to focus on cohort interaction terms, we filter() the observations where term begins with cohort. Next, we use separate() to split the term column into three new columns containing the specific variables in term. Next, we create a new column, type, which contains the cleaned variable name based on the contents of two_way and three_way. (case_when() is like an sequentialifelse() function where the value is set to that after “~” if the statement is true. We use uncount() to create a set of rows for Boomers (the reference category). uncount() creates a new data frame where rows are repeated n times, where n can differ across row (in this case, we double the number of rows where cohort contains "cohortGen X" . The across() function in mutate allows you to repeat a function across multiple columns (in this case the columns from estimate to conf.high). As with map(), it is possible to use lambda notation within across().

res_pool <- pool_mod %>%
  tidy(conf.int = TRUE) %>%
  filter(str_detect(term, "^cohort")) %>%
  separate(term, c("cohort", "two_way", "three_way"), 
           sep = "\\:", fill = "right") %>%
  mutate(type = case_when(is.na(two_way) ~ "Cohort Fixed Effect",
                          !is.na(three_way) ~ "Female x Manual SEP",
                          two_way == "female" ~ "Female",
                          two_way == "manual_sep" ~ "Manual SEP")) %>%
  uncount(ifelse(str_detect(cohort, "cohortGen X"), 2, 1), .id = "row") %>%
  mutate(across(estimate:conf.high, 
                ~ ifelse(row == 2, 0, .x)),
         cohort = str_replace(cohort, "^cohort", ""),
         cohort = ifelse(cohort == "Gen X" & row == 2, "Boomers", cohort) %>%
           factor(levels(df$cohort)))

ggplot(res_pool) +
  aes(x = type, y = estimate, ymin = conf.low, ymax = conf.high,
      color = cohort, shape = cohort) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(position = position_dodge(0.6)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(x = "Coefficient (ref. category: Boomers)",
       y = "Marginal Effect (+ 95% CI)",
       color = NULL, shape = NULL)

Missing Data

In the simulated data, we included an indicator of missingness (observed). The degree of missingness differs by sex and SEP to varying degrees across cohort. The observed data (observed == 1) are therefore non-representative of the full population and the degree of this non-representativeness differs by cohort. Below, we show how not accounting for missingness can lead to biased, misleading results.

get_miss <- function(data){
  lm(bmi ~ cohort, data) %>%
    tidy(conf.int = TRUE) %>%
    filter(term != "(Intercept)") %>%
    mutate(term = str_replace(term, "cohort", "") %>%
             factor(levels(df$cohort))) %>%
    select(term, estimate, conf.low, conf.high)
}

bind_rows(All = get_miss(df_reg),
          Observed = get_miss(filter(df_reg, observed == 1)),
          .id = "Sample") %>%
  ggplot() +
  aes(x = term, y = estimate, ymin = conf.low, 
      ymax = conf.high, color = Sample) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange(position = position_dodge(0.4)) +
  theme_minimal() +
  theme(legend.position = c(.1, .9)) +
  labs(x = NULL, y = "Difference in BMI (compared to Boomers)")

In the full sample, there are clear increases in BMI across successive cohorts, but in the observed sample, the difference between Boomers and Gen X is close to zero and statistically insignificant. Not accounting for missing data (e.g. by not using non-response weights) could lead to incorrect inferences.

Several methods can be used to account for missing data. The type of missing data shown here is unit non-response (i.e. missing participants, rather than missing items). This could be accounted for by deriving non-response weights. Note, to do this would require some information about those who are missing from the dataset - which could, for instance, be inferred from comparing the observed data with that expected from other contemporary representative datasets (e.g. the Census). Once derived, weights can be added to the lm() and glm() functions via the weights argument.

To account for item missingness, imputation procedures such as multiple imputation by chained equations may be used. The most popular package for multiple imputation in R is mice. Describing this package is beyond the scope of this tutorial. For more information, see Stef van Buuren’s free online companion textbook, Flexible Imputation of Missing Data.

Another approach for accounting for missing data is estimation via Full Information Maximum Likelihood (FIML). FIML can be used with the package lavaan. Again, describing this package is beyond the scope of this paper. For more information, see the package’s companion website.

Modelling Longitudinal Data

Relatively more complex cross-cohort analyses may incorporate repeat measure, longitudinal data. There are several modelling options available where data contains multiple observations per individual, including mixed effects (alternatively known as multilevel modelling, growth curve modelling, and random effects models) and the fixed effects “within” panel estimator. The main package for mixed effects regression in R is lme4, the main functions of which are lmer for linear mixed effects models and glmer for generalized linear mixed effects models. The main packages for panel effects estimation is plm. plm can be used with the broom packages. lme4 requires the (similar) broom.mixed package. The syntax for specifying lme4 and plm models is similar to that using lm() and glm().⁹

When using lme4 to produce growth curve models, splines can be used to flexibly model growth trajectories. The splines package includes several functions to create splines (including the ns() and bs() functions for natural cubic splines and B-splines, respectively). To extract model marginal effects from lme4 when using splines, the margins or multcomp packages can be used.

A further option for modelling longitudinal data is estimating latent trajectory models such as latent growth curve analysis (LGCA) models or growth mixture models (GMM). For further information on these, including exemplar R code, see Herle et al. (2020).

References

Herle, Moritz, Nadia Micali, Mohamed Abdulkadir, Ruth Loos, Rachel Bryant-Waugh, Christopher Hübel, Cynthia M. Bulik, and Bianca L. De Stavola. 2020. “Identifying Typical Trajectories in Longitudinal Data: Modelling Strategies and Interpretations.” European Journal of Epidemiology 35 (3): 205–22. https://doi.org/10.1007/s10654-020-00615-6.

Vable, Anusha M, Scott F Diehl, and M Maria Glymour. 2021. “Code Review as a Simple Trick to Enhance Reproducibility, Accelerate Learning, and Improve the Quality of Your Team’s Research.” American Journal of Epidemiology, April. https://doi.org/10.1093/aje/kwab092.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

R v 4.1 and above contain a native pipe operator (|>). Here we use the the tidyverse pipe for compatibility with earlier versions of R, and further as the native pipe has a less simple syntax for piping objects into different arguments of a function. There is a small performance loss using the tidyverse pipe, which isn’t noticeable in most use cases.↩︎
To view the attributes attached to an object, use the str() (structure) function or attributes(). To access or overwrite a specific attribute, use attr(obj_name, attribute_name). Some frequently used attributes, such as names and class, also have their own functions (names() and class()).↩︎
Note, ggplot2 still inherits the aes(x = bmi) mapping but does this mapping with the new data.↩︎
In other words, this process masked two steps. Removing the column cohort from df ensures that the whole dataset is included in every facet as cohort is not available to assign data to different facets. Adding group = cohort_f specifies we want a density plot for each cohort_f.↩︎
geom_point() can be used to render the points faithfully.↩︎
Another alternative for including interaction terms is to explicitly create new variables which multiply variables together. However, this should generally be avoided because when using postestimation commands, such as those calculating marginal effects, the relevant function won’t “know” that the new variable is dependent on the other variables, whereas this is explicit when var1*var2 or var1:var2 are used.↩︎
A data frame is just a list with the class attribute "data.frame" and where each element of the list is the same length.↩︎
Note, the following won’t run as we have not created the sep_ridit variable in df_reg.↩︎
Specifying random effects is required when writing model formula for lme4. The syntax for this is very simple. y ~ x + (1 + x | id) would include random intercepts and slopes (for x) in a regression of y on x.↩︎

Comparative Research Analysis Example

February 2023

Introduction

The tidyverse

Descriptive Statistics

Study-Specific Regressions

Data Preparation

BMI OLS Models

Regression Tables

Plotting Results

Depression Logistic Regression Models

Regression Tables

Plotting Results

Marginal Effects on Absolute Scale

Meta-Analysis

Pooled Cohort Regressions

Missing Data

Modelling Longitudinal Data

References

The `tidyverse`