Chapter 3 Data sets

3.1 General presentation of the data used in our examples

Four data sets have been simulated, each containing 7 variables:

  • 2 baseline confounders (denoted \(L(0)\) in the DAGs):
    • L0_male, a binary variable indicating the sex of the participant (1 for men, 0 for women);
    • L0_soc_env, a binary variable indicating whether the participant has been exposed to a deprived social environment;
  • 1 exposure of interest (denoted \(A\) in the DAGs):
    • A0_PM2.5, a binary variable indicating whether the participant has been exposed to a high level of \(\text{PM}_{2.5}\);
  • 1 confounder of the mediator-outcome relationship (denoted \(L(1)\) in the DAGs):
    • L1, a binary variable indicating if the participant is overweight (1 for being overweight, 0 for not being overweight);
  • 1 mediator of interest (denoted \(M\) in the DAGs):
    • M_diabetes, a binary variable indicating if the participant has type 2 diabetes;
  • 2 outcomes (denoted \(Y\) in the DAGs):
    • Y2_death, a binary variable indicating the occurrence of death before 60 years of age (1 if dead, 0 if alive);
    • Y2_qol, a quantitative variable corresponding to a quality of life measurement.
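
As a quick sanity check after loading one of the data sets, the variable list above can be verified programmatically. The sketch below is only illustrative: the helper check_variables() and the toy data frame are hypothetical, and the check assumes that every binary variable is coded 0/1, which the list states explicitly only for L0_male, L1 and Y2_death.

```r
# Variable names taken from the list above
expected_vars <- c("L0_male", "L0_soc_env", "A0_PM2.5", "L1",
                   "M_diabetes", "Y2_death", "Y2_qol")
# Y2_qol is the only quantitative variable
binary_vars <- setdiff(expected_vars, "Y2_qol")

# Hypothetical helper: stops with an error if a variable is missing
# or if a binary variable takes a value other than 0 or 1 (assumed coding)
check_variables <- function(df) {
  missing_vars <- setdiff(expected_vars, names(df))
  if (length(missing_vars) > 0) {
    stop("Missing variables: ", paste(missing_vars, collapse = ", "))
  }
  is_binary <- vapply(df[binary_vars],
                      function(x) all(x %in% c(0, 1)),
                      logical(1))
  if (!all(is_binary)) {
    stop("Not coded 0/1: ", paste(binary_vars[!is_binary], collapse = ", "))
  }
  invisible(TRUE)
}

# Toy rows with made-up values, only to show the expected layout
toy <- data.frame(
  L0_male    = c(1, 0, 1),
  L0_soc_env = c(0, 1, 1),
  A0_PM2.5   = c(0, 1, 1),
  L1         = c(1, 0, 0),
  M_diabetes = c(0, 0, 1),
  Y2_death   = c(0, 1, 0),
  Y2_qol     = c(71.3, 54.8, 62.0)
)
check_variables(toy)
```

With a real file, the same call would follow something like df1 <- read.csv("df1.csv"), assuming the file sits in the working directory.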

3.2 Data generating mechanisms

The 4 data generating mechanisms used to simulate the data sets are described in chapter 4 of the Expanse “Mediation analysis” report:

  • The first two data sets are simulated from a causal model where confounders of the mediator-outcome relationship (\(L(1)\)) are not affected by the exposure \(A\) (Figure 3.1):
    • The data set df1.csv is simulated from the statistical model \(\mathcal{M}_1\), which does not contain any \(A \ast M\) interaction effect on the outcome \(Y\).
    • The data set df1_int.csv is simulated from the statistical model \(\mathcal{M}_{1 \ast}\), which contains an \(A \ast M\) interaction effect on the outcome \(Y\).

Figure 3.1: Causal model 1

  • The next two data sets are simulated from a causal model where confounders of the mediator-outcome relationship (\(L(1)\)) are affected by the exposure \(A\) (Figure 3.2):
    • The data set df2.csv is simulated from the statistical model \(\mathcal{M}_2\), which does not contain any \(A \ast M\) interaction effect on the outcome \(Y\).
    • The data set df2_int.csv is simulated from the statistical model \(\mathcal{M}_{2 \ast}\), which contains an \(A \ast M\) interaction effect on the outcome \(Y\).

Figure 3.2: Causal model 2
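
To make the difference between the two causal models and the role of the \(A \ast M\) interaction concrete, here is a minimal toy simulation. It is not the code used to produce df1.csv, df1_int.csv, df2.csv and df2_int.csv: the function simulate_toy(), the logistic links and every coefficient are hypothetical choices made only for illustration.

```r
# Toy simulation: all coefficients and functional forms are hypothetical
simulate_toy <- function(n, A_affects_L1 = FALSE, AM_interaction = FALSE,
                         seed = 1) {
  set.seed(seed)
  # Baseline confounders L(0)
  L0_male    <- rbinom(n, 1, 0.5)
  L0_soc_env <- rbinom(n, 1, 0.65)
  # Exposure A depends on L(0)
  A <- rbinom(n, 1, plogis(-1 + 0.3 * L0_male + 0.5 * L0_soc_env))
  # L(1): the A term is switched on only under causal model 2
  L1 <- rbinom(n, 1, plogis(-0.8 + 0.3 * L0_male + 0.3 * L0_soc_env +
                              0.8 * A_affects_L1 * A))
  # Mediator M depends on L(0), A and L(1)
  M <- rbinom(n, 1, plogis(-1.5 + 0.2 * L0_male + 0.3 * L0_soc_env +
                             0.5 * A + 0.7 * L1))
  # Outcomes: the A * M product term is switched on only in the
  # interaction variants of the statistical model
  Y_death <- rbinom(n, 1, plogis(-2 + 0.2 * L0_male + 0.3 * L0_soc_env +
                                   0.4 * A + 0.6 * M + 0.5 * L1 +
                                   0.3 * AM_interaction * A * M))
  Y_qol <- rnorm(n,
                 mean = 70 - 1 * L0_male - 3 * L0_soc_env - 3 * A - 8 * M -
                   5 * L1 - 4 * AM_interaction * A * M,
                 sd = 10)
  data.frame(L0_male, L0_soc_env, A0_PM2.5 = A, L1,
             M_diabetes = M, Y2_death = Y_death, Y2_qol = Y_qol)
}

d1 <- simulate_toy(5000, A_affects_L1 = FALSE)  # structure of causal model 1
d2 <- simulate_toy(5000, A_affects_L1 = TRUE)   # structure of causal model 2

# Difference in the proportion of L1 = 1 between exposed and unexposed
l1_gap <- function(d) {
  mean(d$L1[d$A0_PM2.5 == 1]) - mean(d$L1[d$A0_PM2.5 == 0])
}
c(model_1 = l1_gap(d1), model_2 = l1_gap(d2))
```

In this toy setting the gap in L1 between exposed and unexposed participants is markedly larger under the second structure, where \(A\) has a direct effect on \(L(1)\). Under the first structure any remaining gap comes only from the shared baseline confounders.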

The R functions used to simulate these 4 data sets are given in Appendix A.

Appendix B describes how the true values of the causal estimands given in Table 2 of the Expanse “Mediation analysis” report were calculated. Those true values are the theoretical values expected under the causal and statistical models \(\mathcal{M}_1\), \(\mathcal{M}_{1 \ast}\), \(\mathcal{M}_2\) and \(\mathcal{M}_{2 \ast}\). Estimates obtained from the data sets df1.csv, df1_int.csv, df2.csv, and df2_int.csv will differ slightly from the true values because of sampling variability.
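
The effect of sampling variability can be illustrated with a small sketch. The true risk and the sample size below are hypothetical values, not those of the simulated data sets: each replicate draws a new sample from the same known distribution, and the estimated risk lands close to, but not exactly on, the true value.

```r
set.seed(42)
true_risk <- 0.20   # hypothetical true value of a risk
n <- 10000          # hypothetical sample size

# Five independent samples from the same data generating mechanism
estimates <- replicate(5, mean(rbinom(n, 1, true_risk)))

# Deviation of each estimate from the true value
round(estimates - true_risk, 4)
```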