Comparative Research Analysis Example Syntax¶

February 2023¶

Introduction¶

In this document, we provide a tutorial on conducting simple comparative research analyses with Stata. This document accompanies the Discover Social Science and Health paper Investigating change across time in prevalence or association: the challenges of cross-study comparative research and possible solutions by Bann et al. (2022). A similar tutorial using R is provided on the Open Science Foundation website (https://osf.io/d569x/) and on GitHub (https://ljwright.github.io/cross-cohort-tutorial/r_syntax.html). Both tutorials cover almost identical material, but the HTML files are produced using different software (RMarkdown for the R tutorial, Jupyter Lab for the Stata tutorial). This explains the different look of the two tutorial files. The current tutorial was produced with Stata version 16, but should work on Stata versions 15 and higher.

To demonstrate the concepts discussed in the paper, we will use a simple simulated dataset that contains data from 40,000 individuals across four cohorts (Boomers, Gen X, Millennials, and Gen Z) each surveyed at the same age. The dataset can be downloaded from the Open Science Foundation website (https://osf.io/d569x/). The code in this document can also be used for conducting analyses with repeat cross-sectional data, but we limit ourselves to this example dataset to make the code easier to follow along.

In [1]:
use "cohort_data.dta", clear
set linesize 130
describe



Contains data from cohort_data.dta
  obs:        40,000                          
 vars:             9                          11 Aug 2021 10:33
----------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------
id              long    %12.0g                Person ID
cohort          long    %12.0g     cohort     Cohort
cohort_id       long    %12.0g                Person ID (Cohort)
sex             long    %12.0g     sex        Sex
social_class    long    %12.0g     social_class
                                              Father's Occupational Class
education       long    %12.0g     education
                                              Highest Qualification
bmi             double  %10.0g                Body Mass Index
depression      long    %12.0g     depression
                                              Depression Diagnosis
observed        double  %10.0g                Survey Participation Indicator
----------------------------------------------------------------------------------------------------------------------------------
Sorted by: 

The dataset contains observations collected at age 25 across the four cohorts. There are two variables that we will use as outcome variables in the analyses: the continuous variable, bmi, and the binary variable, depression. There are two variables that we will use as exposures: the binary variable sex and the binary variable social_class. We will use these data to explore changes in social- and sex-inequalities across cohorts. We will produce cohort specific descriptive statistics, estimate regression models in individual cohorts and in pooled analyses, and we will perform meta-analyses of model results. The dataset also contains a study participation indicator, observed. We will use this to show how missingness can bias results.

Below are the labels for the categorical variables. Note the value labels have the same name as the variables they are attached to. This should be treated as coincidental. In Stata, variables (which just contain numbers) are separate objects to value labels (which tell Stata how to format a variable when it is printed to the console window).

In [2]:
labelbook
----------------------------------------------------------------------------------------------------------------------------------
value label cohort 
----------------------------------------------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,4]                     string length:  [5,11]
           N:  4                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   Boomers
           2   Gen X
           3   Millennials
           4   Gen Z

   variables:  cohort


----------------------------------------------------------------------------------------------------------------------------------
value label depression 
----------------------------------------------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,2]                     string length:  [9,13]
           N:  2                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   No Depression
           2   Diagnosed

   variables:  depression


----------------------------------------------------------------------------------------------------------------------------------
value label education 
----------------------------------------------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,2]                     string length:  [15,16]
           N:  2                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   A-Level or Below
           2   Degree or Above

   variables:  education


----------------------------------------------------------------------------------------------------------------------------------
value label sex 
----------------------------------------------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,2]                     string length:  [4,6]
           N:  2                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   Male
           2   Female

   variables:  sex


----------------------------------------------------------------------------------------------------------------------------------
value label social_class 
----------------------------------------------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,2]                     string length:  [6,10]
           N:  2                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   Manual
           2   Non-Manual

   variables:  social_class

Descriptive Statistics¶

To produce a table of descriptive statistics, we will use the user-written command table1_mc. If not previously installed, pleased do so using the command: ssc install table1_mc. table1_mc is capable of producing several descriptive statistics for binary, categorical and continuous variables. Below we create descriptives for each cohort (by(cohort)) with formatted digits (%4.0f). Variable and value labels displayed by default. The table can be outputted to Excel using the saving() option or to Word using the table1_mc_dta2docx command.

In [3]:
// ssc install table1_mc

table1_mc, by(cohort) ///
vars( ///
sex cat %4.0f \ ///  
social_class cat %4.0f \ ///
education cat %4.0f \ ///
bmi contn %4.0f \ ///
depression cat %4.0f \ ///
observed cat %4.0f \ ///
)
  +--------------------------------------------------------------------------------------------+
  | factor                              N_1      N_2      N_3      N_4   m_1   m_2   m_3   m_4 |
  |--------------------------------------------------------------------------------------------|
  | Sex                              10,000   10,000   10,000   10,000     0     0     0     0 |
  |--------------------------------------------------------------------------------------------|
  | Father's Occupational Class      10,000   10,000   10,000   10,000     0     0     0     0 |
  |--------------------------------------------------------------------------------------------|
  | Highest Qualification            10,000   10,000   10,000   10,000     0     0     0     0 |
  |--------------------------------------------------------------------------------------------|
  | Body Mass Index                  10,000   10,000   10,000   10,000     0     0     0     0 |
  |--------------------------------------------------------------------------------------------|
  | Depression Diagnosis             10,000   10,000   10,000   10,000     0     0     0     0 |
  |--------------------------------------------------------------------------------------------|
  | Survey Participation Indicator   10,000   10,000   10,000   10,000     0     0     0     0 |
  +--------------------------------------------------------------------------------------------+
   N_ ... #records used below,   m_ ... #records not used
 
  +---------------------------------------------------------------------------------------------------------------------+
  |                                                     Boomers       Gen X         Millennials   Gen Z         p-value |
  |---------------------------------------------------------------------------------------------------------------------|
  |                                                     N=10,000      N=10,000      N=10,000      N=10,000              |
  |---------------------------------------------------------------------------------------------------------------------|
  | Sex                              Male               5,047 (50%)   4,994 (50%)   5,080 (51%)   4,965 (50%)    0.36   |
  |                                  Female             4,953 (50%)   5,006 (50%)   4,920 (49%)   5,035 (50%)           |
  |---------------------------------------------------------------------------------------------------------------------|
  | Father's Occupational Class      Manual             5,936 (59%)   5,033 (50%)   4,007 (40%)   3,504 (35%)   <0.001  |
  |                                  Non-Manual         4,064 (41%)   4,967 (50%)   5,993 (60%)   6,496 (65%)           |
  |---------------------------------------------------------------------------------------------------------------------|
  | Highest Qualification            A-Level or Below   9,285 (93%)   9,001 (90%)   6,226 (62%)   4,042 (40%)   <0.001  |
  |                                  Degree or Above    715 ( 7%)     999 (10%)     3,774 (38%)   5,958 (60%)           |
  |---------------------------------------------------------------------------------------------------------------------|
  | Body Mass Index                                     27 (3)        27 (3)        30 (4)        33 (5)        <0.001  |
  |---------------------------------------------------------------------------------------------------------------------|
  | Depression Diagnosis             No Depression      9,357 (94%)   9,217 (92%)   8,540 (85%)   7,881 (79%)   <0.001  |
  |                                  Diagnosed          643 ( 6%)     783 ( 8%)     1,460 (15%)   2,119 (21%)           |
  |---------------------------------------------------------------------------------------------------------------------|
  | Survey Participation Indicator   0                  2,321 (23%)   3,510 (35%)   2,248 (22%)   3,594 (36%)   <0.001  |
  |                                  1                  7,679 (77%)   6,490 (65%)   7,752 (78%)   6,406 (64%)           |
  +---------------------------------------------------------------------------------------------------------------------+
Data are presented as mean (SD) for continuous measures, and n (%) for categorical measures.
 

In the above table, we see that there has been an increase across cohorts in: the proportion of individuals from non-manual backgrounds, the average level of BMI, the prevalence of diagnosed depression, and the number of individuals with degree level education. There are also differences in the level of missingness by cohort. We will come back to this in a later section.

table1_mc has many other options to format and change the table, including using different summary statistics. See the help file (help table1_mc) for more information.

Descriptive tables are helpful, but limited compared with visualisations in helping readers understand the dataset. Well designed plots can convey a lot of information concisely and can enable readers to make comparisons across multiple data in an efficient way, for instance examining the change in the distribution of a variable over time. Below we create kernel density plots of BMI by cohort.

In [4]:
kdensity bmi if cohort == 1, ///
    addplot(kdensity bmi if cohort == 2 || ///
            kdensity bmi if cohort == 3 || ///
            kdensity bmi if cohort == 4) ///
    legend( label(1 "Boomers") ///
            label(2 "Gen X") ///
            label(3 "Millennials") ///
            label(4 "Gen Z"))

The code above is straightforward, and while further formatting and plot-tweaking can be achieved, Stata's graphical capabilities are relatively limited. We believe the R package ggplot2 is far superior to Stata as it allows for the creation of more types of plot, is more flexible in the formatting that can be achieved, and uses a consistent syntax that (once learned) makes creating plots of different types a breeze. See the accompanying R file for more information on ggplot2 (https://osf.io/d569x/).

Study-Specific Regressions¶

In this section we examine the association between depression, BMI and sex and socio-economic position using individual cohort data. For BMI, we estimate OLS models, and for depression, we estimate logistic regression models. In both cases, we present results in absolute and relative effect forms as when comparing across cohorts, the direction of these effects are not necessarily the same (see main paper). To estimate relative effects for BMI, we need to standardize (mean = 0, SD = 1) our data in each cohort. We do this in the next section in which we also prepare our dataset so that it follows principles set out by Vable et al. (2021)1 - in particular, we create binary variables named such that the top category is clear from variable name (i.e. female rather than sex).

Data Preparation¶

The code to prepare the data is below. We create three binary variables - female, manual_sep, and depressed - that we will use in the regression models. To standardize BMI in each cohort, we use the by prefix with the egen command to create variables that contain the mean and SD of BMI in a given cohort.

In [5]:
gen female = sex - 1,
gen manual_sep = 2 - social_class,
gen depressed = depression - 1

by cohort, sort: egen mean_bmi = mean(bmi)
by cohort, sort: egen sd_bmi = sd(bmi)
gen bmi_std = (bmi - mean_bmi) / sd_bmi

label variable female "Female"
label variable manual_sep "Manual SEP"
label variable depressed "Depressed"
label variable bmi_std "Standardized BMI"

Next we run some unit tests to ensure we have prepared the variables correctly.

In [6]:
by cohort, sort: sum bmi bmi_std
----------------------------------------------------------------------------------------------------------------------------------
-> cohort = Boomers

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         bmi |     10,000    26.92524    3.176265       13.8   40.15921
     bmi_std |     10,000    2.27e-07           1  -4.132287    4.16652

----------------------------------------------------------------------------------------------------------------------------------
-> cohort = Gen X

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         bmi |     10,000    27.34874    3.355648   14.72619   39.79839
     bmi_std |     10,000   -1.71e-07           1  -3.761585    3.71006

----------------------------------------------------------------------------------------------------------------------------------
-> cohort = Millennials

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         bmi |     10,000    30.23923    4.324106   15.08092   45.53218
     bmi_std |     10,000   -5.03e-08           1  -3.505536   3.536675

----------------------------------------------------------------------------------------------------------------------------------
-> cohort = Gen Z

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         bmi |     10,000    33.09719    4.677421   17.53237       50.4
     bmi_std |     10,000   -2.95e-07           1  -3.327651    3.69922

In [7]:
tab sex female
           |        Female
       Sex |         0          1 |     Total
-----------+----------------------+----------
      Male |    20,086          0 |    20,086 
    Female |         0     19,914 |    19,914 
-----------+----------------------+----------
     Total |    20,086     19,914 |    40,000 
In [8]:
tab social_class manual_sep
  Father's |
Occupation |      Manual SEP
  al Class |         0          1 |     Total
-----------+----------------------+----------
    Manual |         0     18,480 |    18,480 
Non-Manual |    21,520          0 |    21,520 
-----------+----------------------+----------
     Total |    21,520     18,480 |    40,000 
In [9]:
tab depression depressed
   Depression |       Depressed
    Diagnosis |         0          1 |     Total
--------------+----------------------+----------
No Depression |    34,995          0 |    34,995 
    Diagnosed |         0      5,005 |     5,005 
--------------+----------------------+----------
        Total |    34,995      5,005 |    40,000 

All good!

In the main paper, we noted that attempts should be made to harmonise data across cohorts. One option discussed was creating ridit scores that place ordered categorical variables onto 0-1 scales. Below we create social class ridit scores using the user-written wridit command, but do not use this data further in the analyses. (wridit requires installation of the egenmore package.)

In [10]:
// ssc install egenmore
// ssc install wridit
by cohort, sort: egen sep_ridit = ridit(social_class)
label variable sep_ridit "Social Class Ridit"

BMI OLS Models¶

To run a linear regression model, we use the command regress. Here we regress bmi on female and manual_sep and their interaction using the data from Boomers (if cohort == 1).2 c.female##c.manual_sep concisely specifies that we want the interaction between female and manual SEP along with their main effects (c.female#c.manual_sep would specify the interaction term only).3

Note, the c. operator is used to specify that the variables are continuous. The i. operator is used to specify categorical variables. Either would have worked in this situation as female and manual_sep are binary and on a 0/1 scale. We use the c. operator here as it makes the results easier to read. (While in general, the c. operator is not required to specify continuous variables as these are assumed by default, it is required when the # interaction term operator is used because Stata assumes the use of categorical (i.) variables in this situation.)

In [11]:
regress bmi c.female##c.manual_sep if cohort == 1
      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(3, 9996)      =    622.60
       Model |  15881.6907         3  5293.89691   Prob > F        =    0.0000
    Residual |  84994.8243     9,996  8.50288359   R-squared       =    0.1574
-------------+----------------------------------   Adj R-squared   =    0.1572
       Total |  100876.515     9,999  10.0886604   Root MSE        =     2.916

---------------------------------------------------------------------------------------
                  bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
               female |   .8506442   .0914823     9.30   0.000     .6713205    1.029968
           manual_sep |   2.685329   .0836673    32.10   0.000     2.521324    2.849333
                      |
c.female#c.manual_sep |  -.4036906   .1187431    -3.40   0.001    -.6364509   -.1709302
                      |
                _cons |   25.02798   .0646241   387.29   0.000     24.90131    25.15466
---------------------------------------------------------------------------------------

We need to repeat this model for each cohort and for standardized BMI. We could write this command 7 more times to do this but to make the code more concise, we can instead use a loop. We will use the loop to iterate over each combination of cohort and BMI variable. The syntax to do this is below. We prepend the command with quietly so that output is not printed. We save the individual regression results using estimates store command so that they can be called upon again at a later point.

In [12]:
quietly forval cohort_num = 1/4{
    foreach dep_var in bmi bmi_std{
        regress `dep_var' c.female##c.manual_sep if cohort == `cohort_num'
        local cohort_name: label cohort `cohort_num'
        estimates store `dep_var'_`cohort_num', title(`cohort_name')
    }
}
In [13]:
estimates dir
----------------------------------------------------------
        name | command      depvar       npar  title 
-------------+--------------------------------------------
       bmi_1 | regress      bmi             4  Boomers
   bmi_std_1 | regress      bmi_std         4  Boomers
       bmi_2 | regress      bmi             4  Gen X
   bmi_std_2 | regress      bmi_std         4  Gen X
       bmi_3 | regress      bmi             4  Millennials
   bmi_std_3 | regress      bmi_std         4  Millennials
       bmi_4 | regress      bmi             4  Gen Z
   bmi_std_4 | regress      bmi_std         4  Gen Z
----------------------------------------------------------

The code above contains two loopss, one nested in the the other. Each loop iterates over a set of values, passing these values to a local macro in turn and running the code within the curly brackets ({}) at each iteration. We gave the name cohort_num to the local macro for the outer loop, and we gave the name dep_var to the local macro for the inner loop. cohort_num takes the values 1, 2, 3, and 4, in turn, while dep_var takes the values bmi and bmi_std.

We use these local macros in the code that is run by the loops. We tell Stata that we want to use the macro using the back-tick/apostrophe (` ') symbols. When Stata sees these symbols, it replaces the name of the local macro with its contents and then it runs the command. A local macro is just some text associated with a name. There are many ways of creating local macros. local a: label b x is an extended macro function that creates a local macro a which contains the label attached to the number x in value label b. (We use this function to extract the cohort name.)

In the first iteration, cohort_num = 1 and dep_var = bmi so:

  • regress dep_var i.female##i.manual_sep if cohort == `cohort_num'` becomes regress bmi i.female##i.manual_sep if cohort == 1
  • local cohort_name: label cohort `cohort_num' becomes local cohort_name: label cohort 1 (= "Boomers")
  • estimates store `dep_var'_`cohort_num', title(`cohort_name') becomes estimates stores bmi_1, title("Boomers").

Including the title option in estimates store is not necessary, but is helpful for identifying particular estimates. We will use the titles when creating tables of the regression results.

Macros and loops are extremely useful components of the Stata programming language as they allow for the concise expression of repetitive code. They can save time, reduce the number of mistakes, and make spotting errors much easier. For more detailed guides on programming in Stata, see German Rodriguez's Stata Tutorial and Liam Wright's Advanced Stata workshop slides.

To print the individual results, we can use estimates replay. Below we show the results of the standardized BMI (bmi_std) regression for Gen X (cohort == 2).

In [14]:
estimates replay bmi_std_2
----------------------------------------------------------------------------------------------------------------------------------
Model bmi_std_2 (Gen X)
----------------------------------------------------------------------------------------------------------------------------------

      Source |       SS           df       MS      Number of obs   =    10,000
-------------+----------------------------------   F(3, 9996)      =    459.65
       Model |  1212.14414         3  404.048047   Prob > F        =    0.0000
    Residual |  8786.85575     9,996   .87903719   R-squared       =    0.1212
-------------+----------------------------------   Adj R-squared   =    0.1210
       Total |   9998.9999     9,999   .99999999   Root MSE        =    .93757

---------------------------------------------------------------------------------------
              bmi_std |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
               female |   .1354372   .0266072     5.09   0.000     .0832818    .1875926
           manual_sep |   .6886068   .0265366    25.95   0.000     .6365897    .7406238
                      |
c.female#c.manual_sep |  -.0079618   .0375044    -0.21   0.832    -.0814779    .0655543
                      |
                _cons |  -.4123822   .0188841   -21.84   0.000    -.4493987   -.3753656
---------------------------------------------------------------------------------------

Regression Tables¶

We can produce regression tables using the estout command. estout allows for the display of multiple regressions models in a single table. By default, estout will print the most recently produced results (bmi_std_4). Passing the names of stored estimates will print the stored models instead.

estout includes lots of options for formatting the table. Below we show some of these options using the results from the standardized BMI models. For more on estout see the help files and the IDRE website.

In [15]:
estout bmi_std_1 bmi_std_2 bmi_std_3 bmi_std_4, ///
    cells(b(star fmt(3) label(B)) ///
          ci(par fmt(2) label(95% CI)))   /// To display coefficients and CIs
    label varlabels(_cons Constant)  /// To label the columns and the constant
    stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
    noomitted nobaselevels legend // To add a legend and drop reference categories
------------------------------------------------------------------------------------
                          Boomers           Gen X     Millennials           Gen Z   
                         B/95% CI        B/95% CI        B/95% CI        B/95% CI   
------------------------------------------------------------------------------------
Female                      0.268***        0.135***        0.253***        0.279***
                      [0.21,0.32]     [0.08,0.19]     [0.21,0.29]     [0.24,0.32]   
Manual SEP                  0.845***        0.689***        1.088***        1.209***
                      [0.79,0.90]     [0.64,0.74]     [1.04,1.13]     [1.16,1.26]   
Female # Manual SEP        -0.127***       -0.008           0.113***       -0.037   
                     [-0.20,-0.05]    [-0.08,0.07]     [0.05,0.18]    [-0.10,0.03]   
Constant                   -0.597***       -0.412***       -0.583***       -0.558***
                     [-0.64,-0.56]    [-0.45,-0.38]    [-0.61,-0.55]    [-0.59,-0.53]   
------------------------------------------------------------------------------------
R-squared                   0.157           0.121           0.337           0.339   
BIC                         26702           27122           24309           24279   
N                           10000           10000           10000           10000   
------------------------------------------------------------------------------------
* p<0.05, ** p<0.01, *** p<0.001

estout also contains functionality to export tables to external files in a variety of formats (e.g. tab separated, LaTeX, etc.). outreg2 and esttab are similar packages for producing publication-ready tables and also allow export to different file types (such as Microsoft Word).

Several packages also exist in R for making publication-quality tables (such as the excellent gt, gtsummary and flextable packages). The Stata package regsave can be used to save regression results as .dta files. These can then be imported into R (for instance, the haven package). This process can also be used to plot results in R using ggplot2.

Note, the estout command works by using data stored in the ereturn list. An unfortunately not-universally-known feature of Stata language is results of commands are stored behind-the-scenes in return lists. Data can be extracted from the return lists and used in following commands.

For instance, the summarize command stores descriptive statistics in the return list.

In [16]:
summarize bmi
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         bmi |     40,000     29.4026    4.653675       13.8       50.4
In [17]:
return list
scalars:
                r(sum) =  1176103.920781312
                r(max) =  50.39999999999999
                r(min) =  13.8
                 r(sd) =  4.653674920684698
                r(Var) =  21.65669026740973
               r(mean) =  29.4025980195328
              r(sum_w) =  40000
                  r(N) =  40000
In [18]:
display r(max) - r(min) // Results can be used in commands
36.6

Stata has three useful return lists:

  • return list - for standard commands (e.g. describe, summarise)
  • ereturn list - for estimation results (e.g. regress)
  • creturn list - for system information and other useful data (e.g. the time, pi, the alphabet)

For more information, see German Rodriguez's Stata Tutorial and Liam Wright's Advanced Stata workshop slides.

Plotting Results¶

To plot model results, we can use the Benn Jann's coefplot command (ssc install coefplot).

In [19]:
// ssc install coefplot
coefplot (bmi_1, label("Boomers")) ///
         (bmi_2, label("Gen X")) ///
         (bmi_3, label("Millennials")) ///
         (bmi_4, label("Gen Z")), bylabel("Absolute") || ///
         (bmi_std_1, label("Boomers")) ///
         (bmi_std_2, label("Gen X")) ///
         (bmi_std_3, label("Millennials")) ///
         (bmi_std_4, label("Gen Z")), bylabel("Relative Effects") ||, ///
         drop(_cons) xline(0) byopts(xrescale) /// Drop constant and add line through origin.
         coeflabels(, interaction(" x ")) // To display x rather than # for interaction terms

Observe in the above plot that there are some differences in the comparative size of absolute and relative effects. For instance, the female coefficient is smaller in absolute terms (left panel) in Boomers than Millennials but larger in relative terms (right panel) across the same two cohorts.

The coefplot command includes several options to format the plot. See the help file (help coefplot) and the extensive coefplot website for more detail. While coefplot is an excellent command, we still recommend ggplot2 for greater control of plot presentation.

Depression Logistic Regression Models¶

Below we produce logistic regression models estimating the association between depression, sex and manual SEP. To run logistic regression models, we use the logit command. Again we create a loop to do this, but as we only have a single outcome variable (depressed), we do not need to use a nested loop.

In [20]:
quietly forval cohort_num = 1/4{
    logit depressed c.female##c.manual_sep if cohort == `cohort_num'
    local cohort_name: label cohort `cohort_num'
    estimates store dep_`cohort_num', title(`cohort_name')
    }

To print the results of a particular model, we can again use estimates replay. By default logit will print results on the log odds scales. Including the option or prints odds ratios instead.

In [21]:
estimates replay dep_3, or
----------------------------------------------------------------------------------------------------------------------------------
Model dep_3 (Millennials)
----------------------------------------------------------------------------------------------------------------------------------

Logistic regression                             Number of obs     =     10,000
                                                LR chi2(3)        =      96.04
                                                Prob > chi2       =     0.0000
Log likelihood =  -4109.056                     Pseudo R2         =     0.0116

---------------------------------------------------------------------------------------
            depressed | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
               female |   1.697377   .1333788     6.73   0.000     1.455095        1.98
           manual_sep |    1.54607   .1343602     5.01   0.000     1.303935    1.833168
                      |
c.female#c.manual_sep |     .83158   .0959467    -1.60   0.110     .6632746    1.042593
                      |
                _cons |   .1115175   .0067321   -36.34   0.000     .0990735    .1255245
---------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Regression Tables¶

We can again use estout to print the results in a single well-formatted table. The option eform can be used to display exponentiated results.

In [22]:
estout dep_1 dep_2 dep_3 dep_4, ///
    cells(b(star fmt(3) label(B)) ///
          ci(par fmt(2) label(95% CI)))   /// To display coefficients and CIs
    label varlabels(_cons Constant)  /// To label the columns and the constant
    stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
    noomitted nobaselevels legend /// To add a legend and drop reference categories
    eform
------------------------------------------------------------------------------------
                          Boomers           Gen X     Millennials           Gen Z   
                         B/95% CI        B/95% CI        B/95% CI        B/95% CI   
------------------------------------------------------------------------------------
Depressed                                                                           
Female                      2.654***        1.826***        1.697***        1.532***
                      [1.90,3.70]     [1.43,2.33]     [1.46,1.98]     [1.35,1.74]   
Manual SEP                  2.585***        2.035***        1.546***        1.514***
                      [1.88,3.55]     [1.60,2.59]     [1.30,1.83]     [1.31,1.75]   
Female # Manual SEP         0.618*          0.761           0.832           0.855   
                      [0.42,0.91]     [0.56,1.04]     [0.66,1.04]     [0.70,1.04]   
Constant                    0.025***        0.044***        0.112***        0.189***
                      [0.02,0.03]     [0.04,0.05]     [0.10,0.13]     [0.17,0.21]   
------------------------------------------------------------------------------------
R-squared                                                                           
BIC                          4696            5441            8255           10268   
N                           10000           10000           10000           10000   
------------------------------------------------------------------------------------
* p<0.05, ** p<0.01, *** p<0.001

Plotting Results¶

As with estout, we add the option eform to display results on the odds-ratio scale.

In [23]:
coefplot (dep_1, label("Boomers")) ///
         (dep_2, label("Gen X")) ///
         (dep_3, label("Millennials")) ///
         (dep_4, label("Gen Z")), ///
         drop(_cons) xline(1) eform /// Drop constant and add line through origin
         coeflabels(, interaction(" x ")) // To display x rather than # for interaction terms

Marginal Effects on Absolute Scale¶

The above results show effects on the relative scale. To extract marginal effects on the absolute scale (i.e. difference in the probability of depression), we can use the margins command.

Below, we calculate marginal effects as probabilities for the variables female and manual_sep (dydx(female manual_sep)) where the comparator is a male of high SEP (at(female = 0 manual_sep = 0)). We post these results to the ereturn list so that we can use estimates store to store the results for plotting with coefplot.

In [24]:
quietly forval cohort_num = 1/4{
    logit depressed c.female##c.manual_sep if cohort == `cohort_num'
    margins, at(female = 0 manual_sep = 0) dydx(female manual_sep) post
    estimates store margins_`cohort_num'
}
In [25]:
coefplot (margins_1, label("Boomers")) ///
         (margins_2, label("Gen X")) ///
         (margins_3, label("Millennials")) ///
         (margins_4, label("Gen Z"))

Comparing this plot with the previous plot, we see some differences across the two scales. First, on the absolute scale, there is little difference by sex, while on the relative scale, there is a clear increase over time. Second, on the absolute scale, there is a clear increase in the differences in depression by SEP over time, but on the relative scale the opposite is true.

Meta-Analysis¶

To pool results across cohorts, we can use meta-analysis. Below, we use the meta command to pool estimates of the bivariate association between manual SEP and (standardized) BMI.

meta set, the meta command we will use, uses estimates and standard errors stored as variables (i.e. columns) in the dataset. Therefore to use this command, we first need to get effect estimate from each cohort (regress bmi_std manual_sep if cohort == `cohort_num') and store these in the dataset. We create three columns - cohort_lbl, beta and se - to store the effect estimates (+ cohort name). Then we run a regression using a loop which harnesses the local macro c as a counter to put estimates in each of the first four rows in turn (in selects specific rows). _b[] and _se[] are inbuilt variables that allow users to extract regression estimates directly (beta coefficients and standard errors, respectively).

We wrap the code in quietly so that the output is suppressed.

In [26]:
quietly{
    
    generate cohort_lbl = ""
    generate beta = .
    generate se = .

    forval cohort_num = 1/4{
        regress bmi_std manual_sep if cohort == `cohort_num'
        local cohort_name: label cohort `cohort_num'

        replace cohort_lbl = "`cohort_name'" in `cohort_num'
        replace beta = _b[manual_sep] in `cohort_num'
        replace se = _se[manual_sep] in `cohort_num'    
        }
     
     }
In [27]:
list cohort_lbl beta se in 1/4 // Check results are in correct rows and as expected.
     +-----------------------------------+
     |  cohort_lbl       beta         se |
     |-----------------------------------|
  1. |     Boomers   .7811322   .0188026 |
  2. |       Gen X   .6838048    .018796 |
  3. | Millennials   1.143039   .0169052 |
  4. |       Gen Z   1.187434   .0172727 |
     +-----------------------------------+

Next we perform a meta-analysis using the meta set command. Random-effect meta-analysis (REML estimator) is performed by default.

In [28]:
meta set beta se, studylabel(cohort_lbl)
(39,996 missing values generated)

Meta-analysis setting information

 Study information
    No. of studies:  4
       Study label:  cohort_lbl
        Study size:  N/A

       Effect size
              Type:  Generic
             Label:  Effect Size
          Variable:  beta

         Precision
         Std. Err.:  se
                CI:  [_meta_cil, _meta_ciu]
          CI level:  95%

  Model and method
             Model:  Random-effects
            Method:  REML

To display the results in a table, use meta summarize

In [29]:
meta summarize
  Effect-size label:  Effect Size
        Effect size:  beta
          Std. Err.:  se
        Study label:  cohort_lbl

Meta-analysis summary                     Number of studies =      4
Random-effects model                      Heterogeneity:
Method: REML                                          tau2 =  0.0640
                                                    I2 (%) =   99.50
                                                        H2 =  200.57

--------------------------------------------------------------------
            Study |    Effect Size    [95% Conf. Interval]  % Weight
------------------+-------------------------------------------------
          Boomers |          0.781       0.744       0.818     24.99
            Gen X |          0.684       0.647       0.721     24.99
      Millennials |          1.143       1.110       1.176     25.01
            Gen Z |          1.187       1.154       1.221     25.01
------------------+-------------------------------------------------
            theta |          0.949       0.700       1.198
--------------------------------------------------------------------
Test of theta = 0: z = 7.48                      Prob > |z| = 0.0000
Test of homogeneity: Q = chi2(3) = 595.97          Prob > Q = 0.0000

A forest plot can also be produced using the meta forestplot command.

In [30]:
meta forestplot
  Effect-size label:  Effect Size
        Effect size:  beta
          Std. Err.:  se
        Study label:  cohort_lbl

To perform a fixed effects meta-analysis, we can add the fixed option to meta set.

In [31]:
quietly meta set beta se, studylabel(cohort_lbl) fixed
meta summarize


  Effect-size label:  Effect Size
        Effect size:  beta
          Std. Err.:  se
        Study label:  cohort_lbl

Meta-analysis summary                     Number of studies =      4
Fixed-effects model                       Heterogeneity:
Method: Inverse-variance                            I2 (%) =   99.50
                                                        H2 =  198.66

--------------------------------------------------------------------
            Study |    Effect Size    [95% Conf. Interval]  % Weight
------------------+-------------------------------------------------
          Boomers |          0.781       0.744       0.818     22.61
            Gen X |          0.684       0.647       0.721     22.63
      Millennials |          1.143       1.110       1.176     27.97
            Gen Z |          1.187       1.154       1.221     26.79
------------------+-------------------------------------------------
            theta |          0.969       0.952       0.987
--------------------------------------------------------------------
Test of theta = 0: z = 108.40                    Prob > |z| = 0.0000
Test of homogeneity: Q = chi2(3) = 595.97          Prob > Q = 0.0000
In [32]:
meta forestplot
  Effect-size label:  Effect Size
        Effect size:  beta
          Std. Err.:  se
        Study label:  cohort_lbl

Comparing the random-effects and fixed-effects results, similar pooled effect estimates are obtained with each method (theta = 0.95 and 0.97, respectively), but with considerably larger confidence intervals when using the random-effects estimator, as is to be expected.

In [33]:
drop cohort_lbl beta se

Pooled Cohort Regressions¶

To carry out a pooled regression analysis that examines cross-cohort differences within a single model, we can add a further interaction term (i.cohort##...) to the regression command. Here we look at cross-cohort differences in association with (absolute) BMI.

In [34]:
regress bmi i.cohort##c.female##c.manual_sep
      Source |       SS           df       MS      Number of obs   =    40,000
-------------+----------------------------------   F(15, 39984)    =   2436.14
       Model |  413642.291        15  27576.1527   Prob > F        =    0.0000
    Residual |  452603.663    39,984  11.3196194   R-squared       =    0.4775
-------------+----------------------------------   Adj R-squared   =    0.4773
       Total |  866245.954    39,999  21.6566903   Root MSE        =    3.3645

----------------------------------------------------------------------------------------------
                         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------------------+----------------------------------------------------------------
                      cohort |
                      Gen X  |   .9369439   .1007565     9.30   0.000     .7394588    1.134429
                Millennials  |   2.692338   .0963498    27.94   0.000      2.50349    2.881186
                      Gen Z  |   5.460237   .0953962    57.24   0.000     5.273259    5.647216
                             |
                      female |   .8506442   .1055528     8.06   0.000     .6437582     1.05753
                             |
             cohort#c.female |
                      Gen X  |  -.3961648   .1423298    -2.78   0.005    -.6751345   -.1171951
                Millennials  |   .2420068   .1367414     1.77   0.077    -.0260096    .5100232
                      Gen Z  |    .454465   .1345858     3.38   0.001     .1906738    .7182563
                             |
                  manual_sep |   2.685329   .0965358    27.82   0.000     2.496116    2.874541
                             |
         cohort#c.manual_sep |
                      Gen X  |  -.3746071   .1355995    -2.76   0.006    -.6403853   -.1088289
                Millennials  |   2.020185   .1363508    14.82   0.000     1.752934    2.287436
                      Gen Z  |   2.971245   .1387889    21.41   0.000     2.699215    3.243274
                             |
       c.female#c.manual_sep |  -.4036906   .1370065    -2.95   0.003    -.6722265   -.1351547
                             |
cohort#c.female#c.manual_sep |
                      Gen X  |   .3769735   .1920513     1.96   0.050     .0005486    .7533985
                Millennials  |   .8908207   .1939876     4.59   0.000     .5106004    1.271041
                      Gen Z  |   .2296222    .196636     1.17   0.243    -.1557889    .6150334
                             |
                       _cons |   25.02798   .0745636   335.66   0.000     24.88184    25.17413
----------------------------------------------------------------------------------------------

Displaying these results in a presentation-ready table is simple enough:

In [35]:
estout, ///
    cells(b(star fmt(3) label(B)) ///
          ci(par fmt(2) label(95% CI)))   /// To display coefficients and CIs
    varlabels(_cons Constant)  /// To label the constant
    stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
    noomitted nobaselevels legend varwidth(30) // To add a legend and drop reference categories
----------------------------------------------
                                          .   
                                   B/95% CI   
----------------------------------------------
2.cohort                              0.937***
                                [0.74,1.13]   
3.cohort                              2.692***
                                [2.50,2.88]   
4.cohort                              5.460***
                                [5.27,5.65]   
female                                0.851***
                                [0.64,1.06]   
2.cohort#c.female                    -0.396** 
                               [-0.68,-0.12]   
3.cohort#c.female                     0.242   
                               [-0.03,0.51]   
4.cohort#c.female                     0.454***
                                [0.19,0.72]   
manual_sep                            2.685***
                                [2.50,2.87]   
2.cohort#c.manual_sep                -0.375** 
                               [-0.64,-0.11]   
3.cohort#c.manual_sep                 2.020***
                                [1.75,2.29]   
4.cohort#c.manual_sep                 2.971***
                                [2.70,3.24]   
c.female#c.manual_sep                -0.404** 
                               [-0.67,-0.14]   
2.cohort#c.female#c.manual_sep        0.377*  
                                [0.00,0.75]   
3.cohort#c.female#c.manual_sep        0.891***
                                [0.51,1.27]   
4.cohort#c.female#c.manual_sep        0.230   
                               [-0.16,0.62]   
Constant                             25.028***
                               [24.88,25.17]   
----------------------------------------------
R-squared                             0.478   
BIC                                  210730   
N                                     40000   
----------------------------------------------
* p<0.05, ** p<0.01, *** p<0.001

Presenting the results with coefplot requires the use of a few more options:

In [36]:
coefplot, ///
    xline(0) drop(_cons) omitted baselevels ///
    order(*cohort* . female sep) /// Order variables
    headings(1.cohort = "{bf:Cohort}" /// Add headings to variable groups
             1.cohort#c.female = "{bf:Cohort x Female}" ///
             1.cohort#c.manual_sep = "{bf:Cohort x Manual SEP}" ///
             1.cohort#c.female#c.manual_sep = "{bf:Cohort x Female x Manual SEP}" ///
             female = "{bf:Main Effects}") ///
    coeflabels(, interaction(" x "))

We can use the test or testparm commands to assess whether groups of the variables are jointly statistical significant. For instance, to test whether there are differences by cohort in the association between female and bmi, we use:

In [37]:
testparm i.cohort#c.female
 ( 1)  2.cohort#c.female = 0
 ( 2)  3.cohort#c.female = 0
 ( 3)  4.cohort#c.female = 0

       F(  3, 39984) =   16.17
            Prob > F =    0.0000

The null hypothesis is that all coefficients are equal to zero. We see that p-value for the test is very small (< 0.001), so we can reject the null of no difference by cohort.

Missing Data¶

In the simulated data, we included an indicator of missingness (observed). The degree of missingness differs by sex and SEP to varying degrees across cohort. The observed data (observed == 1) are therefore non-representative of the full population and the degree of this non-representativeness differs by cohort. Below, we show how not accounting for missingness can lead to biased, misleading results.

In [38]:
quietly{
    regress bmi i.cohort
    estimates store all
    regress bmi i.cohort if observed == 1
    estimates store obs
}
In [39]:
coefplot (all, label("Full Population")) ///
         (obs, label("Observed Individuals")), ///
         drop(_cons ) xline(0) baselevels /// Drop constant and add line through origin
         coeflabels(, interaction(" x "))

In the full sample, there are clear increases in BMI across successive cohorts, but in the observed sample, the difference between Boomers and Gen X is close to zero and statistically insignificant. Not accounting for missing data (e.g. by not using non-response weights) could lead to incorrect inferences.

Several methods can be used to account for missing data. The imagined type of missing data shown here is unit non-response (i.e. missing participants, rather than missing items). This could be accounted for by deriving non-response weights. Note, to do this would require some information about those who are missing from the dataset - this could, for instance, be inferred by comparing the observed data with that expected from other contemporary representative datasets (e.g. the Census). Once derived, weights can be added to the regress and logit commands with the [pweight = weight_name] argument.

To account for item missingness, imputation procedures such as multiple imputation by chained equations may be used. The main command for multiple imputation in Stata is mi impute. For more information, see the reference manual and the IDRE website.

Another approach for accounting for missing data is estimation via Full Information Maximum Likelihood (FIML). FIML can be used with the sem command. Describing this command is beyond the scope of this paper. For more information, see the reference manual.

Modelling Longitudinal Data¶

Relatively more complex cross-cohort analyses may incorporate repeat measure, longitudinal data. There are several modelling options available where data contains multiple observations per individual, including mixed effects (alternatively known as multilevel modelling, growth curve modelling, and random effects models) and the fixed effects “within” panel estimator. The main command for estimating mixed effects models in Stata is mixed. For fixed effects ("within") or random effects ("between") panel data models, the xtreg command can alternatively be used. Mixed effects models can also be specified in a structural equation modelling framework with the sem or gsem commands.

To our knowledge, Stata contains no or limited options for estimating latent class-type growth models, such as growth mixture models or latent growth curve analysis models. For further information on these, including exemplar R code, see Herle et al. (2020).4

Footnotes¶

1 Vable, Anusha M, Scott F Diehl, and M Maria Glymour. 2021. “Code Review as a Simple Trick to Enhance Reproducibility, Accelerate Learning, and Improve the Quality of Your Team’s Research.” American Journal of Epidemiology, April. https://doi.org/10.1093/aje/kwab092.

2 If you have forgotten the number a particular cohort corresponds to, it is also possible to specify an if condition using the value label directly (e.g. 'if "Boomer":cohort'). Alternatively numbers can be prepended to value labels using the 'numlabel, add' command or looked up using 'label list'

3 Another alternative for including interaction terms is to explicitly create new variables which multiply variables together. However, this should generally be avoided because when using postestimation commands, such as those calculating marginal effects, the relevant function won’t “know” that the new variable is dependent on the other variables, whereas this is explicit when 'var1##var2' or 'var1#var2' are used

4 Herle, Moritz, Nadia Micali, Mohamed Abdulkadir, Ruth Loos, Rachel Bryant-Waugh, Christopher Hübel, Cynthia M. Bulik, and Bianca L. De Stavola. 2020. “Identifying Typical Trajectories in Longitudinal Data: Modelling Strategies and Interpretations.” European Journal of Epidemiology 35 (3): 205–22. https://doi.org/10.1007/s10654-020-00615-6.