In this document, we provide a tutorial on conducting simple comparative research analyses with Stata. This document accompanies the Discover Social Science and Health paper Investigating change across time in prevalence or association: the challenges of cross-study comparative research and possible solutions by Bann et al. (2022). A similar tutorial using R is provided on the Open Science Foundation website (https://osf.io/d569x/) and on GitHub (https://ljwright.github.io/cross-cohort-tutorial/r_syntax.html). Both tutorials cover almost identical material, but the HTML files are produced using different software (RMarkdown for the R tutorial, Jupyter Lab for the Stata tutorial). This explains the different look of the two tutorial files. The current tutorial was produced with Stata version 16, but should work on Stata versions 15 and higher.
To demonstrate the concepts discussed in the paper, we will use a simple simulated dataset that contains data from 40,000 individuals across four cohorts (Boomers, Gen X, Millennials, and Gen Z) each surveyed at the same age. The dataset can be downloaded from the Open Science Foundation website (https://osf.io/d569x/). The code in this document can also be used for conducting analyses with repeat cross-sectional data, but we limit ourselves to this example dataset to make the code easier to follow along.
use "cohort_data.dta", clear
set linesize 130
describe
Contains data from cohort_data.dta obs: 40,000 vars: 9 11 Aug 2021 10:33 ---------------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ---------------------------------------------------------------------------------------------------------------------------------- id long %12.0g Person ID cohort long %12.0g cohort Cohort cohort_id long %12.0g Person ID (Cohort) sex long %12.0g sex Sex social_class long %12.0g social_class Father's Occupational Class education long %12.0g education Highest Qualification bmi double %10.0g Body Mass Index depression long %12.0g depression Depression Diagnosis observed double %10.0g Survey Participation Indicator ---------------------------------------------------------------------------------------------------------------------------------- Sorted by:
The dataset contains observations collected at age 25 across the four cohorts. There are two variables that we will use as outcome variables in the analyses: the continuous variable, bmi
, and the binary variable, depression
. There are two variables that we will use as exposures: the binary variable sex
and the binary variable social_class
. We will use these data to explore changes in social- and sex-inequalities across cohorts. We will produce cohort specific descriptive statistics, estimate regression models in individual cohorts and in pooled analyses, and we will perform meta-analyses of model results. The dataset also contains a study participation indicator, observed
. We will use this to show how missingness can bias results.
Below are the labels for the categorical variables. Note the value labels have the same name as the variables they are attached to. This should be treated as coincidental. In Stata, variables (which just contain numbers) are separate objects to value labels (which tell Stata how to format a variable when it is printed to the console window).
labelbook
---------------------------------------------------------------------------------------------------------------------------------- value label cohort ---------------------------------------------------------------------------------------------------------------------------------- values labels range: [1,4] string length: [5,11] N: 4 unique at full length: yes gaps: no unique at length 12: yes missing .*: 0 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 Boomers 2 Gen X 3 Millennials 4 Gen Z variables: cohort ---------------------------------------------------------------------------------------------------------------------------------- value label depression ---------------------------------------------------------------------------------------------------------------------------------- values labels range: [1,2] string length: [9,13] N: 2 unique at full length: yes gaps: no unique at length 12: yes missing .*: 0 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 No Depression 2 Diagnosed variables: depression ---------------------------------------------------------------------------------------------------------------------------------- value label education ---------------------------------------------------------------------------------------------------------------------------------- values labels range: [1,2] string length: [15,16] N: 2 unique at full length: yes gaps: no unique at length 12: yes missing .*: 0 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 A-Level or Below 2 Degree or Above variables: education ---------------------------------------------------------------------------------------------------------------------------------- value label sex ---------------------------------------------------------------------------------------------------------------------------------- values labels range: [1,2] string length: [4,6] N: 2 unique at full length: yes gaps: no unique at length 12: yes missing .*: 0 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 Male 2 Female variables: sex ---------------------------------------------------------------------------------------------------------------------------------- value label social_class ---------------------------------------------------------------------------------------------------------------------------------- values labels range: [1,2] string length: [6,10] N: 2 unique at full length: yes gaps: no unique at length 12: yes missing .*: 0 null string: no leading/trailing blanks: no numeric -> numeric: no definition 1 Manual 2 Non-Manual variables: social_class
To produce a table of descriptive statistics, we will use the user-written command table1_mc
. If not previously installed, pleased do so using the command: ssc install table1_mc
. table1_mc
is capable of producing several descriptive statistics for binary, categorical and continuous variables. Below we create descriptives for each cohort (by(cohort)
) with formatted digits (%4.0f
). Variable and value labels displayed by default. The table can be outputted to Excel using the saving()
option or to Word using the table1_mc_dta2docx
command.
// ssc install table1_mc
table1_mc, by(cohort) ///
vars( ///
sex cat %4.0f \ ///
social_class cat %4.0f \ ///
education cat %4.0f \ ///
bmi contn %4.0f \ ///
depression cat %4.0f \ ///
observed cat %4.0f \ ///
)
+--------------------------------------------------------------------------------------------+ | factor N_1 N_2 N_3 N_4 m_1 m_2 m_3 m_4 | |--------------------------------------------------------------------------------------------| | Sex 10,000 10,000 10,000 10,000 0 0 0 0 | |--------------------------------------------------------------------------------------------| | Father's Occupational Class 10,000 10,000 10,000 10,000 0 0 0 0 | |--------------------------------------------------------------------------------------------| | Highest Qualification 10,000 10,000 10,000 10,000 0 0 0 0 | |--------------------------------------------------------------------------------------------| | Body Mass Index 10,000 10,000 10,000 10,000 0 0 0 0 | |--------------------------------------------------------------------------------------------| | Depression Diagnosis 10,000 10,000 10,000 10,000 0 0 0 0 | |--------------------------------------------------------------------------------------------| | Survey Participation Indicator 10,000 10,000 10,000 10,000 0 0 0 0 | +--------------------------------------------------------------------------------------------+ N_ ... #records used below, m_ ... #records not used +---------------------------------------------------------------------------------------------------------------------+ | Boomers Gen X Millennials Gen Z p-value | |---------------------------------------------------------------------------------------------------------------------| | N=10,000 N=10,000 N=10,000 N=10,000 | |---------------------------------------------------------------------------------------------------------------------| | Sex Male 5,047 (50%) 4,994 (50%) 5,080 (51%) 4,965 (50%) 0.36 | | Female 4,953 (50%) 5,006 (50%) 4,920 (49%) 5,035 (50%) | |---------------------------------------------------------------------------------------------------------------------| | Father's Occupational Class Manual 5,936 (59%) 5,033 (50%) 4,007 (40%) 3,504 (35%) <0.001 | | Non-Manual 4,064 (41%) 4,967 (50%) 5,993 (60%) 6,496 (65%) | |---------------------------------------------------------------------------------------------------------------------| | Highest Qualification A-Level or Below 9,285 (93%) 9,001 (90%) 6,226 (62%) 4,042 (40%) <0.001 | | Degree or Above 715 ( 7%) 999 (10%) 3,774 (38%) 5,958 (60%) | |---------------------------------------------------------------------------------------------------------------------| | Body Mass Index 27 (3) 27 (3) 30 (4) 33 (5) <0.001 | |---------------------------------------------------------------------------------------------------------------------| | Depression Diagnosis No Depression 9,357 (94%) 9,217 (92%) 8,540 (85%) 7,881 (79%) <0.001 | | Diagnosed 643 ( 6%) 783 ( 8%) 1,460 (15%) 2,119 (21%) | |---------------------------------------------------------------------------------------------------------------------| | Survey Participation Indicator 0 2,321 (23%) 3,510 (35%) 2,248 (22%) 3,594 (36%) <0.001 | | 1 7,679 (77%) 6,490 (65%) 7,752 (78%) 6,406 (64%) | +---------------------------------------------------------------------------------------------------------------------+ Data are presented as mean (SD) for continuous measures, and n (%) for categorical measures.
In the above table, we see that there has been an increase across cohorts in: the proportion of individuals from non-manual backgrounds, the average level of BMI, the prevalence of diagnosed depression, and the number of individuals with degree level education. There are also differences in the level of missingness by cohort. We will come back to this in a later section.
table1_mc
has many other options to format and change the table, including using different summary statistics. See the help file (help table1_mc
) for more information.
Descriptive tables are helpful, but limited compared with visualisations in helping readers understand the dataset. Well designed plots can convey a lot of information concisely and can enable readers to make comparisons across multiple data in an efficient way, for instance examining the change in the distribution of a variable over time. Below we create kernel density plots of BMI by cohort.
kdensity bmi if cohort == 1, ///
addplot(kdensity bmi if cohort == 2 || ///
kdensity bmi if cohort == 3 || ///
kdensity bmi if cohort == 4) ///
legend( label(1 "Boomers") ///
label(2 "Gen X") ///
label(3 "Millennials") ///
label(4 "Gen Z"))
The code above is straightforward, and while further formatting and plot-tweaking can be achieved, Stata's graphical capabilities are relatively limited. We believe the R package ggplot2
is far superior to Stata as it allows for the creation of more types of plot, is more flexible in the formatting that can be achieved, and uses a consistent syntax that (once learned) makes creating plots of different types a breeze. See the accompanying R file for more information on ggplot2
(https://osf.io/d569x/).
In this section we examine the association between depression, BMI and sex and socio-economic position using individual cohort data. For BMI, we estimate OLS models, and for depression, we estimate logistic regression models. In both cases, we present results in absolute and relative effect forms as when comparing across cohorts, the direction of these effects are not necessarily the same (see main paper). To estimate relative effects for BMI, we need to standardize (mean = 0, SD = 1) our data in each cohort. We do this in the next section in which we also prepare our dataset so that it follows principles set out by Vable et al. (2021)1 - in particular, we create binary variables named such that the top category is clear from variable name (i.e. female
rather than sex
).
The code to prepare the data is below. We create three binary variables - female
, manual_sep
, and depressed
- that we will use in the regression models. To standardize BMI in each cohort, we use the by
prefix with the egen
command to create variables that contain the mean and SD of BMI in a given cohort.
gen female = sex - 1,
gen manual_sep = 2 - social_class,
gen depressed = depression - 1
by cohort, sort: egen mean_bmi = mean(bmi)
by cohort, sort: egen sd_bmi = sd(bmi)
gen bmi_std = (bmi - mean_bmi) / sd_bmi
label variable female "Female"
label variable manual_sep "Manual SEP"
label variable depressed "Depressed"
label variable bmi_std "Standardized BMI"
Next we run some unit tests to ensure we have prepared the variables correctly.
by cohort, sort: sum bmi bmi_std
---------------------------------------------------------------------------------------------------------------------------------- -> cohort = Boomers Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- bmi | 10,000 26.92524 3.176265 13.8 40.15921 bmi_std | 10,000 2.27e-07 1 -4.132287 4.16652 ---------------------------------------------------------------------------------------------------------------------------------- -> cohort = Gen X Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- bmi | 10,000 27.34874 3.355648 14.72619 39.79839 bmi_std | 10,000 -1.71e-07 1 -3.761585 3.71006 ---------------------------------------------------------------------------------------------------------------------------------- -> cohort = Millennials Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- bmi | 10,000 30.23923 4.324106 15.08092 45.53218 bmi_std | 10,000 -5.03e-08 1 -3.505536 3.536675 ---------------------------------------------------------------------------------------------------------------------------------- -> cohort = Gen Z Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- bmi | 10,000 33.09719 4.677421 17.53237 50.4 bmi_std | 10,000 -2.95e-07 1 -3.327651 3.69922
tab sex female
| Female Sex | 0 1 | Total -----------+----------------------+---------- Male | 20,086 0 | 20,086 Female | 0 19,914 | 19,914 -----------+----------------------+---------- Total | 20,086 19,914 | 40,000
tab social_class manual_sep
Father's | Occupation | Manual SEP al Class | 0 1 | Total -----------+----------------------+---------- Manual | 0 18,480 | 18,480 Non-Manual | 21,520 0 | 21,520 -----------+----------------------+---------- Total | 21,520 18,480 | 40,000
tab depression depressed
Depression | Depressed Diagnosis | 0 1 | Total --------------+----------------------+---------- No Depression | 34,995 0 | 34,995 Diagnosed | 0 5,005 | 5,005 --------------+----------------------+---------- Total | 34,995 5,005 | 40,000
All good!
In the main paper, we noted that attempts should be made to harmonise data across cohorts. One option discussed was creating ridit scores that place ordered categorical variables onto 0-1 scales. Below we create social class ridit scores using the user-written wridit
command, but do not use this data further in the analyses. (wridit
requires installation of the egenmore
package.)
// ssc install egenmore
// ssc install wridit
by cohort, sort: egen sep_ridit = ridit(social_class)
label variable sep_ridit "Social Class Ridit"
To run a linear regression model, we use the command regress
. Here we regress bmi
on female
and manual_sep
and their interaction using the data from Boomers (if cohort == 1
).2 c.female##c.manual_sep
concisely specifies that we want the interaction between female and manual SEP along with their main effects (c.female#c.manual_sep
would specify the interaction term only).3
Note, the c.
operator is used to specify that the variables are continuous. The i.
operator is used to specify categorical variables. Either would have worked in this situation as female
and manual_sep
are binary and on a 0/1 scale. We use the c.
operator here as it makes the results easier to read. (While in general, the c.
operator is not required to specify continuous variables as these are assumed by default, it is required when the #
interaction term operator is used because Stata assumes the use of categorical (i.
) variables in this situation.)
regress bmi c.female##c.manual_sep if cohort == 1
Source | SS df MS Number of obs = 10,000 -------------+---------------------------------- F(3, 9996) = 622.60 Model | 15881.6907 3 5293.89691 Prob > F = 0.0000 Residual | 84994.8243 9,996 8.50288359 R-squared = 0.1574 -------------+---------------------------------- Adj R-squared = 0.1572 Total | 100876.515 9,999 10.0886604 Root MSE = 2.916 --------------------------------------------------------------------------------------- bmi | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------------+---------------------------------------------------------------- female | .8506442 .0914823 9.30 0.000 .6713205 1.029968 manual_sep | 2.685329 .0836673 32.10 0.000 2.521324 2.849333 | c.female#c.manual_sep | -.4036906 .1187431 -3.40 0.001 -.6364509 -.1709302 | _cons | 25.02798 .0646241 387.29 0.000 24.90131 25.15466 ---------------------------------------------------------------------------------------
We need to repeat this model for each cohort and for standardized BMI. We could write this command 7 more times to do this but to make the code more concise, we can instead use a loop. We will use the loop to iterate over each combination of cohort and BMI variable. The syntax to do this is below. We prepend the command with quietly
so that output is not printed. We save the individual regression results using estimates store
command so that they can be called upon again at a later point.
quietly forval cohort_num = 1/4{
foreach dep_var in bmi bmi_std{
regress `dep_var' c.female##c.manual_sep if cohort == `cohort_num'
local cohort_name: label cohort `cohort_num'
estimates store `dep_var'_`cohort_num', title(`cohort_name')
}
}
estimates dir
---------------------------------------------------------- name | command depvar npar title -------------+-------------------------------------------- bmi_1 | regress bmi 4 Boomers bmi_std_1 | regress bmi_std 4 Boomers bmi_2 | regress bmi 4 Gen X bmi_std_2 | regress bmi_std 4 Gen X bmi_3 | regress bmi 4 Millennials bmi_std_3 | regress bmi_std 4 Millennials bmi_4 | regress bmi 4 Gen Z bmi_std_4 | regress bmi_std 4 Gen Z ----------------------------------------------------------
The code above contains two loopss, one nested in the the other. Each loop iterates over a set of values, passing these values to a local macro
in turn and running the code within the curly brackets ({}
) at each iteration. We gave the name cohort_num
to the local macro
for the outer loop, and we gave the name dep_var
to the local macro
for the inner loop. cohort_num
takes the values 1, 2, 3, and 4, in turn, while dep_var
takes the values bmi
and bmi_std
.
We use these local macros
in the code that is run by the loops. We tell Stata that we want to use the macro using the back-tick/apostrophe (` '
) symbols. When Stata sees these symbols, it replaces the name of the local macro
with its contents and then it runs the command. A local macro
is just some text associated with a name. There are many ways of creating local macros
. local a: label b x
is an extended macro function that creates a local macro a
which contains the label attached to the number x
in value label b
. (We use this function to extract the cohort name.)
In the first iteration, cohort_num = 1
and dep_var = bmi
so:
regress dep_var i.female##i.manual_sep if cohort == `cohort_num'`
becomes regress bmi i.female##i.manual_sep if cohort == 1
local cohort_name: label cohort `cohort_num'
becomes local cohort_name: label cohort 1
(= "Boomers")estimates store `dep_var'_`cohort_num', title(`cohort_name')
becomes estimates stores bmi_1, title("Boomers")
.Including the title
option in estimates store
is not necessary, but is helpful for identifying particular estimates. We will use the titles when creating tables of the regression results.
Macros and loops are extremely useful components of the Stata programming language as they allow for the concise expression of repetitive code. They can save time, reduce the number of mistakes, and make spotting errors much easier. For more detailed guides on programming in Stata, see German Rodriguez's Stata Tutorial and Liam Wright's Advanced Stata workshop slides.
To print the individual results, we can use estimates replay
. Below we show the results of the standardized BMI (bmi_std
) regression for Gen X (cohort == 2
).
estimates replay bmi_std_2
---------------------------------------------------------------------------------------------------------------------------------- Model bmi_std_2 (Gen X) ---------------------------------------------------------------------------------------------------------------------------------- Source | SS df MS Number of obs = 10,000 -------------+---------------------------------- F(3, 9996) = 459.65 Model | 1212.14414 3 404.048047 Prob > F = 0.0000 Residual | 8786.85575 9,996 .87903719 R-squared = 0.1212 -------------+---------------------------------- Adj R-squared = 0.1210 Total | 9998.9999 9,999 .99999999 Root MSE = .93757 --------------------------------------------------------------------------------------- bmi_std | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------------+---------------------------------------------------------------- female | .1354372 .0266072 5.09 0.000 .0832818 .1875926 manual_sep | .6886068 .0265366 25.95 0.000 .6365897 .7406238 | c.female#c.manual_sep | -.0079618 .0375044 -0.21 0.832 -.0814779 .0655543 | _cons | -.4123822 .0188841 -21.84 0.000 -.4493987 -.3753656 ---------------------------------------------------------------------------------------
We can produce regression tables using the estout
command. estout
allows for the display of multiple regressions models in a single table. By default, estout
will print the most recently produced results (bmi_std_4
). Passing the names of stored estimates will print the stored models instead.
estout
includes lots of options for formatting the table. Below we show some of these options using the results from the standardized BMI models. For more on estout
see the help files and the IDRE website.
estout bmi_std_1 bmi_std_2 bmi_std_3 bmi_std_4, ///
cells(b(star fmt(3) label(B)) ///
ci(par fmt(2) label(95% CI))) /// To display coefficients and CIs
label varlabels(_cons Constant) /// To label the columns and the constant
stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
noomitted nobaselevels legend // To add a legend and drop reference categories
------------------------------------------------------------------------------------ Boomers Gen X Millennials Gen Z B/95% CI B/95% CI B/95% CI B/95% CI ------------------------------------------------------------------------------------ Female 0.268*** 0.135*** 0.253*** 0.279*** [0.21,0.32] [0.08,0.19] [0.21,0.29] [0.24,0.32] Manual SEP 0.845*** 0.689*** 1.088*** 1.209*** [0.79,0.90] [0.64,0.74] [1.04,1.13] [1.16,1.26] Female # Manual SEP -0.127*** -0.008 0.113*** -0.037 [-0.20,-0.05] [-0.08,0.07] [0.05,0.18] [-0.10,0.03] Constant -0.597*** -0.412*** -0.583*** -0.558*** [-0.64,-0.56] [-0.45,-0.38] [-0.61,-0.55] [-0.59,-0.53] ------------------------------------------------------------------------------------ R-squared 0.157 0.121 0.337 0.339 BIC 26702 27122 24309 24279 N 10000 10000 10000 10000 ------------------------------------------------------------------------------------ * p<0.05, ** p<0.01, *** p<0.001
estout
also contains functionality to export tables to external files in a variety of formats (e.g. tab separated, LaTeX, etc.). outreg2
and esttab
are similar packages for producing publication-ready tables and also allow export to different file types (such as Microsoft Word).
Several packages also exist in R for making publication-quality tables (such as the excellent gt
, gtsummary
and flextable
packages). The Stata package regsave
can be used to save regression results as .dta
files. These can then be imported into R (for instance, the haven
package). This process can also be used to plot results in R using ggplot2
.
Note, the estout
command works by using data stored in the ereturn list
. An unfortunately not-universally-known feature of Stata language is results of commands are stored behind-the-scenes in return lists. Data can be extracted from the return lists and used in following commands.
For instance, the summarize
command stores descriptive statistics in the return list
.
summarize bmi
Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- bmi | 40,000 29.4026 4.653675 13.8 50.4
return list
scalars: r(sum) = 1176103.920781312 r(max) = 50.39999999999999 r(min) = 13.8 r(sd) = 4.653674920684698 r(Var) = 21.65669026740973 r(mean) = 29.4025980195328 r(sum_w) = 40000 r(N) = 40000
display r(max) - r(min) // Results can be used in commands
36.6
Stata has three useful return lists:
return list
- for standard commands (e.g. describe
, summarise
)ereturn list
- for estimation results (e.g. regress
)creturn list
- for system information and other useful data (e.g. the time, pi, the alphabet)For more information, see German Rodriguez's Stata Tutorial and Liam Wright's Advanced Stata workshop slides.
To plot model results, we can use the Benn Jann's coefplot
command (ssc install coefplot
).
// ssc install coefplot
coefplot (bmi_1, label("Boomers")) ///
(bmi_2, label("Gen X")) ///
(bmi_3, label("Millennials")) ///
(bmi_4, label("Gen Z")), bylabel("Absolute") || ///
(bmi_std_1, label("Boomers")) ///
(bmi_std_2, label("Gen X")) ///
(bmi_std_3, label("Millennials")) ///
(bmi_std_4, label("Gen Z")), bylabel("Relative Effects") ||, ///
drop(_cons) xline(0) byopts(xrescale) /// Drop constant and add line through origin.
coeflabels(, interaction(" x ")) // To display x rather than # for interaction terms
Observe in the above plot that there are some differences in the comparative size of absolute and relative effects. For instance, the female
coefficient is smaller in absolute terms (left panel) in Boomers than Millennials but larger in relative terms (right panel) across the same two cohorts.
The coefplot
command includes several options to format the plot. See the help file (help coefplot
) and the extensive coefplot
website for more detail. While coefplot
is an excellent command, we still recommend ggplot2
for greater control of plot presentation.
Below we produce logistic regression models estimating the association between depression, sex and manual SEP. To run logistic regression models, we use the logit
command. Again we create a loop to do this, but as we only have a single outcome variable (depressed
), we do not need to use a nested loop.
quietly forval cohort_num = 1/4{
logit depressed c.female##c.manual_sep if cohort == `cohort_num'
local cohort_name: label cohort `cohort_num'
estimates store dep_`cohort_num', title(`cohort_name')
}
To print the results of a particular model, we can again use estimates replay
. By default logit
will print results on the log odds scales. Including the option or
prints odds ratios instead.
estimates replay dep_3, or
---------------------------------------------------------------------------------------------------------------------------------- Model dep_3 (Millennials) ---------------------------------------------------------------------------------------------------------------------------------- Logistic regression Number of obs = 10,000 LR chi2(3) = 96.04 Prob > chi2 = 0.0000 Log likelihood = -4109.056 Pseudo R2 = 0.0116 --------------------------------------------------------------------------------------- depressed | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] ----------------------+---------------------------------------------------------------- female | 1.697377 .1333788 6.73 0.000 1.455095 1.98 manual_sep | 1.54607 .1343602 5.01 0.000 1.303935 1.833168 | c.female#c.manual_sep | .83158 .0959467 -1.60 0.110 .6632746 1.042593 | _cons | .1115175 .0067321 -36.34 0.000 .0990735 .1255245 --------------------------------------------------------------------------------------- Note: _cons estimates baseline odds.
We can again use estout
to print the results in a single well-formatted table. The option eform
can be used to display exponentiated results.
estout dep_1 dep_2 dep_3 dep_4, ///
cells(b(star fmt(3) label(B)) ///
ci(par fmt(2) label(95% CI))) /// To display coefficients and CIs
label varlabels(_cons Constant) /// To label the columns and the constant
stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
noomitted nobaselevels legend /// To add a legend and drop reference categories
eform
------------------------------------------------------------------------------------ Boomers Gen X Millennials Gen Z B/95% CI B/95% CI B/95% CI B/95% CI ------------------------------------------------------------------------------------ Depressed Female 2.654*** 1.826*** 1.697*** 1.532*** [1.90,3.70] [1.43,2.33] [1.46,1.98] [1.35,1.74] Manual SEP 2.585*** 2.035*** 1.546*** 1.514*** [1.88,3.55] [1.60,2.59] [1.30,1.83] [1.31,1.75] Female # Manual SEP 0.618* 0.761 0.832 0.855 [0.42,0.91] [0.56,1.04] [0.66,1.04] [0.70,1.04] Constant 0.025*** 0.044*** 0.112*** 0.189*** [0.02,0.03] [0.04,0.05] [0.10,0.13] [0.17,0.21] ------------------------------------------------------------------------------------ R-squared BIC 4696 5441 8255 10268 N 10000 10000 10000 10000 ------------------------------------------------------------------------------------ * p<0.05, ** p<0.01, *** p<0.001
As with estout
, we add the option eform
to display results on the odds-ratio scale.
coefplot (dep_1, label("Boomers")) ///
(dep_2, label("Gen X")) ///
(dep_3, label("Millennials")) ///
(dep_4, label("Gen Z")), ///
drop(_cons) xline(1) eform /// Drop constant and add line through origin
coeflabels(, interaction(" x ")) // To display x rather than # for interaction terms
The above results show effects on the relative scale. To extract marginal effects on the absolute scale (i.e. difference in the probability of depression), we can use the margins
command.
Below, we calculate marginal effects as probabilities for the variables female
and manual_sep
(dydx(female manual_sep)
) where the comparator is a male of high SEP (at(female = 0 manual_sep = 0)
). We post
these results to the ereturn list
so that we can use estimates store
to store the results for plotting with coefplot
.
quietly forval cohort_num = 1/4{
logit depressed c.female##c.manual_sep if cohort == `cohort_num'
margins, at(female = 0 manual_sep = 0) dydx(female manual_sep) post
estimates store margins_`cohort_num'
}
coefplot (margins_1, label("Boomers")) ///
(margins_2, label("Gen X")) ///
(margins_3, label("Millennials")) ///
(margins_4, label("Gen Z"))
Comparing this plot with the previous plot, we see some differences across the two scales. First, on the absolute scale, there is little difference by sex, while on the relative scale, there is a clear increase over time. Second, on the absolute scale, there is a clear increase in the differences in depression by SEP over time, but on the relative scale the opposite is true.
To pool results across cohorts, we can use meta-analysis. Below, we use the meta
command to pool estimates of the bivariate association between manual SEP and (standardized) BMI.
meta set
, the meta
command we will use, uses estimates and standard errors stored as variables (i.e. columns) in the dataset. Therefore to use this command, we first need to get effect estimate from each cohort (regress bmi_std manual_sep if cohort == `cohort_num'
) and store these in the dataset. We create three columns - cohort_lbl
, beta
and se
- to store the effect estimates (+ cohort name). Then we run a regression using a loop which harnesses the local macro c
as a counter to put estimates in each of the first four rows in turn (in
selects specific rows). _b[]
and _se[]
are inbuilt variables that allow users to extract regression estimates directly (beta coefficients and standard errors, respectively).
We wrap the code in quietly
so that the output is suppressed.
quietly{
generate cohort_lbl = ""
generate beta = .
generate se = .
forval cohort_num = 1/4{
regress bmi_std manual_sep if cohort == `cohort_num'
local cohort_name: label cohort `cohort_num'
replace cohort_lbl = "`cohort_name'" in `cohort_num'
replace beta = _b[manual_sep] in `cohort_num'
replace se = _se[manual_sep] in `cohort_num'
}
}
list cohort_lbl beta se in 1/4 // Check results are in correct rows and as expected.
+-----------------------------------+ | cohort_lbl beta se | |-----------------------------------| 1. | Boomers .7811322 .0188026 | 2. | Gen X .6838048 .018796 | 3. | Millennials 1.143039 .0169052 | 4. | Gen Z 1.187434 .0172727 | +-----------------------------------+
Next we perform a meta-analysis using the meta set
command. Random-effect meta-analysis (REML estimator) is performed by default.
meta set beta se, studylabel(cohort_lbl)
(39,996 missing values generated) Meta-analysis setting information Study information No. of studies: 4 Study label: cohort_lbl Study size: N/A Effect size Type: Generic Label: Effect Size Variable: beta Precision Std. Err.: se CI: [_meta_cil, _meta_ciu] CI level: 95% Model and method Model: Random-effects Method: REML
To display the results in a table, use meta summarize
meta summarize
Effect-size label: Effect Size Effect size: beta Std. Err.: se Study label: cohort_lbl Meta-analysis summary Number of studies = 4 Random-effects model Heterogeneity: Method: REML tau2 = 0.0640 I2 (%) = 99.50 H2 = 200.57 -------------------------------------------------------------------- Study | Effect Size [95% Conf. Interval] % Weight ------------------+------------------------------------------------- Boomers | 0.781 0.744 0.818 24.99 Gen X | 0.684 0.647 0.721 24.99 Millennials | 1.143 1.110 1.176 25.01 Gen Z | 1.187 1.154 1.221 25.01 ------------------+------------------------------------------------- theta | 0.949 0.700 1.198 -------------------------------------------------------------------- Test of theta = 0: z = 7.48 Prob > |z| = 0.0000 Test of homogeneity: Q = chi2(3) = 595.97 Prob > Q = 0.0000
A forest plot can also be produced using the meta forestplot
command.
meta forestplot
Effect-size label: Effect Size Effect size: beta Std. Err.: se Study label: cohort_lbl
To perform a fixed effects meta-analysis, we can add the fixed
option to meta set
.
quietly meta set beta se, studylabel(cohort_lbl) fixed
meta summarize
Effect-size label: Effect Size Effect size: beta Std. Err.: se Study label: cohort_lbl Meta-analysis summary Number of studies = 4 Fixed-effects model Heterogeneity: Method: Inverse-variance I2 (%) = 99.50 H2 = 198.66 -------------------------------------------------------------------- Study | Effect Size [95% Conf. Interval] % Weight ------------------+------------------------------------------------- Boomers | 0.781 0.744 0.818 22.61 Gen X | 0.684 0.647 0.721 22.63 Millennials | 1.143 1.110 1.176 27.97 Gen Z | 1.187 1.154 1.221 26.79 ------------------+------------------------------------------------- theta | 0.969 0.952 0.987 -------------------------------------------------------------------- Test of theta = 0: z = 108.40 Prob > |z| = 0.0000 Test of homogeneity: Q = chi2(3) = 595.97 Prob > Q = 0.0000
meta forestplot
Effect-size label: Effect Size Effect size: beta Std. Err.: se Study label: cohort_lbl
Comparing the random-effects and fixed-effects results, similar pooled effect estimates are obtained with each method (theta
= 0.95 and 0.97, respectively), but with considerably larger confidence intervals when using the random-effects estimator, as is to be expected.
drop cohort_lbl beta se
To carry out a pooled regression analysis that examines cross-cohort differences within a single model, we can add a further interaction term (i.cohort##...
) to the regression command. Here we look at cross-cohort differences in association with (absolute) BMI.
regress bmi i.cohort##c.female##c.manual_sep
Source | SS df MS Number of obs = 40,000 -------------+---------------------------------- F(15, 39984) = 2436.14 Model | 413642.291 15 27576.1527 Prob > F = 0.0000 Residual | 452603.663 39,984 11.3196194 R-squared = 0.4775 -------------+---------------------------------- Adj R-squared = 0.4773 Total | 866245.954 39,999 21.6566903 Root MSE = 3.3645 ---------------------------------------------------------------------------------------------- bmi | Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------------------------+---------------------------------------------------------------- cohort | Gen X | .9369439 .1007565 9.30 0.000 .7394588 1.134429 Millennials | 2.692338 .0963498 27.94 0.000 2.50349 2.881186 Gen Z | 5.460237 .0953962 57.24 0.000 5.273259 5.647216 | female | .8506442 .1055528 8.06 0.000 .6437582 1.05753 | cohort#c.female | Gen X | -.3961648 .1423298 -2.78 0.005 -.6751345 -.1171951 Millennials | .2420068 .1367414 1.77 0.077 -.0260096 .5100232 Gen Z | .454465 .1345858 3.38 0.001 .1906738 .7182563 | manual_sep | 2.685329 .0965358 27.82 0.000 2.496116 2.874541 | cohort#c.manual_sep | Gen X | -.3746071 .1355995 -2.76 0.006 -.6403853 -.1088289 Millennials | 2.020185 .1363508 14.82 0.000 1.752934 2.287436 Gen Z | 2.971245 .1387889 21.41 0.000 2.699215 3.243274 | c.female#c.manual_sep | -.4036906 .1370065 -2.95 0.003 -.6722265 -.1351547 | cohort#c.female#c.manual_sep | Gen X | .3769735 .1920513 1.96 0.050 .0005486 .7533985 Millennials | .8908207 .1939876 4.59 0.000 .5106004 1.271041 Gen Z | .2296222 .196636 1.17 0.243 -.1557889 .6150334 | _cons | 25.02798 .0745636 335.66 0.000 24.88184 25.17413 ----------------------------------------------------------------------------------------------
Displaying these results in a presentation-ready table is simple enough:
estout, ///
cells(b(star fmt(3) label(B)) ///
ci(par fmt(2) label(95% CI))) /// To display coefficients and CIs
varlabels(_cons Constant) /// To label the constant
stats(r2 bic N, fmt(3 0 0) label(R-squared BIC N)) /// To add fit statistics
noomitted nobaselevels legend varwidth(30) // To add a legend and drop reference categories
---------------------------------------------- . B/95% CI ---------------------------------------------- 2.cohort 0.937*** [0.74,1.13] 3.cohort 2.692*** [2.50,2.88] 4.cohort 5.460*** [5.27,5.65] female 0.851*** [0.64,1.06] 2.cohort#c.female -0.396** [-0.68,-0.12] 3.cohort#c.female 0.242 [-0.03,0.51] 4.cohort#c.female 0.454*** [0.19,0.72] manual_sep 2.685*** [2.50,2.87] 2.cohort#c.manual_sep -0.375** [-0.64,-0.11] 3.cohort#c.manual_sep 2.020*** [1.75,2.29] 4.cohort#c.manual_sep 2.971*** [2.70,3.24] c.female#c.manual_sep -0.404** [-0.67,-0.14] 2.cohort#c.female#c.manual_sep 0.377* [0.00,0.75] 3.cohort#c.female#c.manual_sep 0.891*** [0.51,1.27] 4.cohort#c.female#c.manual_sep 0.230 [-0.16,0.62] Constant 25.028*** [24.88,25.17] ---------------------------------------------- R-squared 0.478 BIC 210730 N 40000 ---------------------------------------------- * p<0.05, ** p<0.01, *** p<0.001
Presenting the results with coefplot
requires the use of a few more options:
coefplot, ///
xline(0) drop(_cons) omitted baselevels ///
order(*cohort* . female sep) /// Order variables
headings(1.cohort = "{bf:Cohort}" /// Add headings to variable groups
1.cohort#c.female = "{bf:Cohort x Female}" ///
1.cohort#c.manual_sep = "{bf:Cohort x Manual SEP}" ///
1.cohort#c.female#c.manual_sep = "{bf:Cohort x Female x Manual SEP}" ///
female = "{bf:Main Effects}") ///
coeflabels(, interaction(" x "))
We can use the test
or testparm
commands to assess whether groups of the variables are jointly statistical significant. For instance, to test whether there are differences by cohort in the association between female
and bmi
, we use:
testparm i.cohort#c.female
( 1) 2.cohort#c.female = 0 ( 2) 3.cohort#c.female = 0 ( 3) 4.cohort#c.female = 0 F( 3, 39984) = 16.17 Prob > F = 0.0000
The null hypothesis is that all coefficients are equal to zero. We see that p-value for the test is very small (< 0.001), so we can reject the null of no difference by cohort.
In the simulated data, we included an indicator of missingness (observed
). The degree of missingness differs by sex and SEP to varying degrees across cohort. The observed data (observed == 1
) are therefore non-representative of the full population and the degree of this non-representativeness differs by cohort. Below, we show how not accounting for missingness can lead to biased, misleading results.
quietly{
regress bmi i.cohort
estimates store all
regress bmi i.cohort if observed == 1
estimates store obs
}
coefplot (all, label("Full Population")) ///
(obs, label("Observed Individuals")), ///
drop(_cons ) xline(0) baselevels /// Drop constant and add line through origin
coeflabels(, interaction(" x "))
In the full sample, there are clear increases in BMI across successive cohorts, but in the observed sample, the difference between Boomers and Gen X is close to zero and statistically insignificant. Not accounting for missing data (e.g. by not using non-response weights) could lead to incorrect inferences.
Several methods can be used to account for missing data. The imagined type of missing data shown here is unit non-response (i.e. missing participants, rather than missing items). This could be accounted for by deriving non-response weights. Note, to do this would require some information about those who are missing from the dataset - this could, for instance, be inferred by comparing the observed data with that expected from other contemporary representative datasets (e.g. the Census). Once derived, weights can be added to the regress
and logit
commands with the [pweight = weight_name]
argument.
To account for item missingness, imputation procedures such as multiple imputation by chained equations may be used. The main command for multiple imputation in Stata is mi impute
. For more information, see the reference manual and the IDRE website.
Another approach for accounting for missing data is estimation via Full Information Maximum Likelihood (FIML). FIML can be used with the sem
command. Describing this command is beyond the scope of this paper. For more information, see the reference manual.
Relatively more complex cross-cohort analyses may incorporate repeat measure, longitudinal data. There are several modelling options available where data contains multiple observations per individual, including mixed effects (alternatively known as multilevel modelling, growth curve modelling, and random effects models) and the fixed effects “within” panel estimator. The main command for estimating mixed effects models in Stata is mixed
. For fixed effects ("within") or random effects ("between") panel data models, the xtreg
command can alternatively be used. Mixed effects models can also be specified in a structural equation modelling framework with the sem
or gsem
commands.
To our knowledge, Stata contains no or limited options for estimating latent class-type growth models, such as growth mixture models or latent growth curve analysis models. For further information on these, including exemplar R code, see Herle et al. (2020).4
1 Vable, Anusha M, Scott F Diehl, and M Maria Glymour. 2021. “Code Review as a Simple Trick to Enhance Reproducibility, Accelerate Learning, and Improve the Quality of Your Team’s Research.” American Journal of Epidemiology, April. https://doi.org/10.1093/aje/kwab092.
2 If you have forgotten the number a particular cohort corresponds to, it is also possible to specify an if condition using the value label directly (e.g. 'if "Boomer":cohort'). Alternatively numbers can be prepended to value labels using the 'numlabel, add' command or looked up using 'label list'
3 Another alternative for including interaction terms is to explicitly create new variables which multiply variables together. However, this should generally be avoided because when using postestimation commands, such as those calculating marginal effects, the relevant function won’t “know” that the new variable is dependent on the other variables, whereas this is explicit when 'var1##var2' or 'var1#var2' are used
4 Herle, Moritz, Nadia Micali, Mohamed Abdulkadir, Ruth Loos, Rachel Bryant-Waugh, Christopher Hübel, Cynthia M. Bulik, and Bianca L. De Stavola. 2020. “Identifying Typical Trajectories in Longitudinal Data: Modelling Strategies and Interpretations.” European Journal of Epidemiology 35 (3): 205–22. https://doi.org/10.1007/s10654-020-00615-6.