Chapter 3 Applying the CMAverse

If we use the CMAverse package with the df data set, we can estimate the Average Total Effect (\(ATE\)) and the Controlled direct effects (setting the mediator to 0) (\(CDE(M=0)\)).

\[\begin{align*} ATE &= \mathbb{E}(Y_{A=1}) - \mathbb{E}(Y_{A=0}) \\ CDE(M=0) &= \mathbb{E}(Y_{A=1,M=0}) - \mathbb{E}(Y_{A=0,M=0}) \end{align*}\]

In case of intermediate confounders affected by the exposure, the estimation based on regression coefficients cannot be used to estimate CDE and other direct and indirect effects. We need to use a g-computation, an IPTW estimator or a double-robust estimator.

3.1 Estimation by “parametric” g-computation

For example, we can estimate the two estimands ATE and CDE(M=0) on a risk difference scale as described below. In order to get estimates on the risk difference scale, the yreg argument needs to be set to "linear".

In order to estimate CDE(M=0) by parametric g-computation, we need:

a model of the outcome (note that the exposure*mediator interaction is correctly included)
a model of each of the intermediate confounder (2 models in our example).

Using those 3 models, we can simulated counterfactual values (under \(\{A=1,M=0\}\) and \(\{A=0,M=0\}\)) exactly as what we did in the introduction chapter to simulate the data set df.

In the results, the model of the mediator is not needed for the CDE, but it is needed to estimate the interventional (stochastic) direct and indirect effects.

Note that for the model of the intermediate confounder occupation, physical activity (phys) was not included as a predictor whereas it was present in the corresponding equation of the data-generating system: Can this “misspecification” result in some bias? (I don’t know the answer, it might be interesting to test on simulations).

library(CMAverse)
set.seed(1234)
gformula_death_RD <- cmest(data = df, 
                           model = "gformula", # for parametric g-computation
                           outcome = "death", # outcome variable
                           exposure = "edu", # exposure variable
                           mediator = "smoking", # mediator
                           basec = c("sex",     
                                     "low_par_edu"), # baseline confounders
                           postc = c("phys", 
                                     "occupation"), # intermediate confounders 
                           EMint = TRUE, # exposures*mediator interaction
                           mreg = list("logistic"), # g(M=1|L1,A,L0)
                           yreg = "linear",# Qbar.L2 = P(Y=1|M,L1,A,L0) 
                           postcreg = list("logistic", "logistic"), 
                           astar = 0,
                           a = 1,
                           mval = list(0), # do(M=0) to estimate CDE(M=0)
                           estimation = "imputation", # if model= gformula
                           inference = "bootstrap",
                           boot.ci.type = "per", # percentiles, other option: "bca"
                           nboot = 2) # use a large number of bootstrap samples

##   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%

summary(gformula_death_RD)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = death ~ edu + smoking + edu * smoking + sex + low_par_edu + 
##     phys + occupation, family = gaussian(), data = getCall(x$reg.output$yreg)$data, 
##     weights = getCall(x$reg.output$yreg)$weights)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0006107  0.0128884  -0.047   0.9622    
## edu          0.0585572  0.0128876   4.544 5.59e-06 ***
## smoking      0.1122164  0.0130604   8.592  < 2e-16 ***
## sex          0.0633703  0.0084086   7.536 5.25e-14 ***
## low_par_edu  0.0652140  0.0094112   6.929 4.49e-12 ***
## phys        -0.0168476  0.0084636  -1.991   0.0466 *  
## occupation   0.0413486  0.0097641   4.235 2.31e-05 ***
## edu:smoking  0.1484815  0.0170875   8.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1671299)
## 
##     Null deviance: 1905.1  on 9999  degrees of freedom
## Residual deviance: 1670.0  on 9992  degrees of freedom
## AIC: 10499
## 
## Number of Fisher Scoring iterations: 2
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = smoking ~ edu + sex + low_par_edu + phys + occupation, 
##     family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.85996    0.06143 -13.998   <2e-16 ***
## edu          0.70047    0.04402  15.911   <2e-16 ***
## sex          0.62499    0.04365  14.318   <2e-16 ***
## low_par_edu  0.41552    0.04772   8.707   <2e-16 ***
## phys        -0.38703    0.04418  -8.761   <2e-16 ***
## occupation   0.59325    0.04946  11.993   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13578  on 9999  degrees of freedom
## Residual deviance: 12685  on 9994  degrees of freedom
## AIC: 12697
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Regressions for mediator-outcome confounders affected by the exposure: 
## 
## Call:
## glm(formula = phys ~ edu + sex + low_par_edu, family = binomial(), 
##     data = getCall(x$reg.output$postcreg[[1L]])$data, weights = getCall(x$reg.output$postcreg[[1L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.38815    0.04765   8.145 3.78e-16 ***
## edu         -0.34545    0.04205  -8.215  < 2e-16 ***
## sex          0.43714    0.04124  10.599  < 2e-16 ***
## low_par_edu -0.19185    0.04689  -4.092 4.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13729  on 9999  degrees of freedom
## Residual deviance: 13519  on 9996  degrees of freedom
## AIC: 13527
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## 
## Call:
## glm(formula = occupation ~ edu + sex + low_par_edu, family = binomial(), 
##     data = getCall(x$reg.output$postcreg[[2L]])$data, weights = getCall(x$reg.output$postcreg[[2L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.37498    0.05083   7.378 1.61e-13 ***
## edu          0.85924    0.04734  18.150  < 2e-16 ***
## sex          0.36173    0.04783   7.562 3.97e-14 ***
## low_par_edu  0.09333    0.05217   1.789   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11377  on 9999  degrees of freedom
## Residual deviance: 10977  on 9996  degrees of freedom
## AIC: 10985
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Effect decomposition on the mean difference scale via the g-formula approach
##  
## Direct counterfactual imputation estimation with 
##  bootstrap standard errors, percentile confidence intervals and p-values 
##  
##                Estimate Std.error   95% CIL 95% CIU  P.val    
## cde           0.0715898 0.0102149 0.0648080   0.079 <2e-16 ***
## rpnde         0.1415395 0.0124769 0.1284949   0.145 <2e-16 ***
## rtnde         0.1735670 0.0166478 0.1519051   0.174 <2e-16 ***
## rpnie         0.0242051 0.0002929 0.0222740   0.023 <2e-16 ***
## rtnie         0.0562325 0.0038779 0.0460777   0.051 <2e-16 ***
## te            0.1977720 0.0163549 0.1745726   0.197 <2e-16 ***
## rintref       0.0699497 0.0022620 0.0636869   0.067 <2e-16 ***
## rintmed       0.0320275 0.0041709 0.0234102   0.029 <2e-16 ***
## cde(prop)     0.3619816 0.0210889 0.3711416   0.399 <2e-16 ***
## rintref(prop) 0.3536883 0.0188551 0.3395706   0.365 <2e-16 ***
## rintmed(prop) 0.1619413 0.0100660 0.1340541   0.148 <2e-16 ***
## rpnie(prop)   0.1223887 0.0122998 0.1133772   0.130 <2e-16 ***
## rpm           0.2843301 0.0022338 0.2609549   0.264 <2e-16 ***
## rint          0.5156296 0.0087890 0.4871483   0.499 <2e-16 ***
## rpe           0.6380184 0.0210889 0.6005255   0.629 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (cde: controlled direct effect; rpnde: randomized analogue of pure natural direct effect; rtnde: randomized analogue of total natural direct effect; rpnie: randomized analogue of pure natural indirect effect; rtnie: randomized analogue of total natural indirect effect; te: total effect; rintref: randomized analogue of reference interaction; rintmed: randomized analogue of mediated interaction; cde(prop): proportion cde; rintref(prop): proportion rintref; rintmed(prop): proportion rintmed; rpnie(prop): proportion rpnie; rpm: randomized analogue of overall proportion mediated; rint: randomized analogue of overall proportion attributable to interaction; rpe: randomized analogue of overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $mval
## $mval[[1]]
## [1] 0

The ATE = 0.1978 and the CDE(M=0) = 0.0716.

3.2 Estimation by MSM estimated by IPTW

The CDE(M=0) can also by estimated using Marginal structural models (MSM) estimated by IPTW.

In order to get estimations by IPTW, the model argument needs to be set to msm.

The cmest function will estimate:

a MSM for the outcome (depending only on the exposure and the mediator). Confounding is handled by weighting. This MSM can be used to estimate the controlled direct effect.

\[\begin{align*} \mathbb{E}(Y_{A=a,M=m}) &= \beta_0 + \beta_{A_{educ}} \times a + \beta_{M_{smoking}} \times m + \beta_{A \star M} \times (a \times m) \\ CDE(M=m) &= \mathbb{E}(Y_{A=1,M=m}) - \mathbb{E}(Y_{A=0,M=m}) = \beta_{A_{educ}} + \beta_{A \star M} \times m \end{align*}\]

the weights \(sw = sw_A \times sw_M\) is needed to handle confounding in the MSM of the outcome, where \(sw_A\) is a weight balancing parents of the exposure \(A\) and \(sw_M\) is a weight balancing parents of the mediator. To calculate those weights, we need to specify the numerator and denominator for the mediator’s weight with the wmnomreg and wmdenomreg arguments. The denominator for the exposure’s weight is specified with the ereg argument.

\[\begin{align*} sw = sw_A \times sw_M = \frac{P(A_i)}{P(A_i \mid L(0))} \times \frac{P(M_i \mid A)}{P(M_i \mid L(0),A,L(1))} \end{align*}\]

An MSM of the mediator is also estimated (depending only on the exposure, confounding is handled by weighting). This MSM is not useful to estimate the CDE, but is needed to estimated the “interventional” or “stochastic” natural direct and indirect effects.

set.seed(1234)
iptw_death_RD <- cmest(data = df, 
                       model = "msm", # using MSM estimated by IPTW
                       outcome = "death", # outcome variable
                       exposure = "edu", # exposure variable
                       mediator = "smoking", # mediator
                       basec = c("sex",     
                                 "low_par_edu"), # baseline confounders
                       postc = c("phys", 
                                 "occupation"), # intermediate confounders 
                       EMint = TRUE, # exposures*mediator interaction
                       ereg = "logistic", # exposure regression model g(A=1|L(0))
                       mreg = list("logistic"), # g(M=1|L1,A,L0)
                       yreg = "linear",# Qbar.L2 = P(Y=1|M,L1,A,L0) 
                       postcreg = list("logistic", "logistic"), # Qbar.L1 = P(L1=1|A,L0)
                       wmnomreg = list("logistic"), #g(M=1|A) wgt nominator
                       wmdenomreg = list("logistic"), # g(M=1|L1,A,L(0)) wgt denominator
                       astar = 0,
                       a = 1,
                       mval = list(0), # do(M=0) to estimate CDE_m
                       estimation = "imputation", # if model= gformula
                       inference = "bootstrap",
                       boot.ci.type = "per", # for percentile, other option: "bca"
                       nboot = 2) # we should use a large number of bootstrap samples

##   |                                                                              |                                                                      |   0%

##   |                                                                              |===================================                                   |  50%

##   |                                                                              |======================================================================| 100%

summary(iptw_death_RD)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = death ~ edu + smoking + edu * smoking, family = gaussian(), 
##     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.082933   0.008865   9.355  < 2e-16 ***
## edu         0.070447   0.012775   5.514 3.59e-08 ***
## smoking     0.119740   0.012960   9.239  < 2e-16 ***
## edu:smoking 0.142064   0.017184   8.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1690611)
## 
##     Null deviance: 1880.9  on 9999  degrees of freedom
## Residual deviance: 1689.9  on 9996  degrees of freedom
## AIC: 10969
## 
## Number of Fisher Scoring iterations: 2
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = smoking ~ edu, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.11692    0.03149  -3.713 0.000205 ***
## edu          0.78869    0.04174  18.895  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13576  on 9999  degrees of freedom
## Residual deviance: 13214  on 9998  degrees of freedom
## AIC: 13197
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Mediator regressions for weighting (denominator): 
## 
## Call:
## glm(formula = smoking ~ edu + sex + low_par_edu + phys + occupation, 
##     family = binomial(), data = getCall(x$reg.output$wmdenomreg[[1L]])$data, 
##     weights = getCall(x$reg.output$wmdenomreg[[1L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.85996    0.06143 -13.998   <2e-16 ***
## edu          0.70047    0.04402  15.911   <2e-16 ***
## sex          0.62499    0.04365  14.318   <2e-16 ***
## low_par_edu  0.41552    0.04772   8.707   <2e-16 ***
## phys        -0.38703    0.04418  -8.761   <2e-16 ***
## occupation   0.59325    0.04946  11.993   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13578  on 9999  degrees of freedom
## Residual deviance: 12685  on 9994  degrees of freedom
## AIC: 12697
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Mediator regressions for weighting (nominator): 
## 
## Call:
## glm(formula = smoking ~ edu, family = binomial(), data = getCall(x$reg.output$wmnomreg[[1L]])$data, 
##     weights = getCall(x$reg.output$wmnomreg[[1L]])$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.13313    0.03151  -4.225 2.39e-05 ***
## edu          0.81446    0.04178  19.493  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13578  on 9999  degrees of freedom
## Residual deviance: 13192  on 9998  degrees of freedom
## AIC: 13196
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Exposure regression for weighting: 
## 
## Call:
## glm(formula = edu ~ sex + low_par_edu, family = binomial(), data = getCall(x$reg.output$ereg)$data, 
##     weights = getCall(x$reg.output$ereg)$weights)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02910    0.04186   0.695    0.487    
## sex         -0.23178    0.04154  -5.580 2.41e-08 ***
## low_par_edu  0.63793    0.04599  13.871  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13497  on 9999  degrees of freedom
## Residual deviance: 13285  on 9997  degrees of freedom
## AIC: 13291
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Effect decomposition on the mean difference scale via the marginal structural model
##  
## Direct counterfactual imputation estimation with 
##  bootstrap standard errors, percentile confidence intervals and p-values 
##  
##               Estimate Std.error  95% CIL 95% CIU  P.val    
## cde           0.070447  0.007889 0.067239   0.078 <2e-16 ***
## rpnde         0.137886  0.005815 0.137662   0.145 <2e-16 ***
## rtnde         0.164690  0.003025 0.168127   0.172 <2e-16 ***
## rpnie         0.022592  0.002277 0.020573   0.024 <2e-16 ***
## rtnie         0.049395  0.000512 0.050350   0.051 <2e-16 ***
## te            0.187281  0.005303 0.188700   0.196 <2e-16 ***
## rintref       0.067440  0.002075 0.067636   0.070 <2e-16 ***
## rintmed       0.026803  0.002789 0.026718   0.030 <2e-16 ***
## cde(prop)     0.376155  0.030641 0.356284   0.397 <2e-16 ***
## rintref(prop) 0.360098  0.020701 0.345418   0.373 <2e-16 ***
## rintmed(prop) 0.143118  0.018617 0.136462   0.161 <2e-16 ***
## rpnie(prop)   0.120629  0.008677 0.109013   0.121 <2e-16 ***
## rpm           0.263747  0.009939 0.257133   0.270 <2e-16 ***
## rint          0.503216  0.039318 0.481879   0.535 <2e-16 ***
## rpe           0.623845  0.030641 0.602550   0.644 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (cde: controlled direct effect; rpnde: randomized analogue of pure natural direct effect; rtnde: randomized analogue of total natural direct effect; rpnie: randomized analogue of pure natural indirect effect; rtnie: randomized analogue of total natural indirect effect; te: total effect; rintref: randomized analogue of reference interaction; rintmed: randomized analogue of mediated interaction; cde(prop): proportion cde; rintref(prop): proportion rintref; rintmed(prop): proportion rintmed; rpnie(prop): proportion rpnie; rpm: randomized analogue of overall proportion mediated; rint: randomized analogue of overall proportion attributable to interaction; rpe: randomized analogue of overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $mval
## $mval[[1]]
## [1] 0

Applying an IPTW estimator using the CMAverse, the ATE = 0.1873 and the CDE(M=0) = 0.0704.