Chapter 8 Marginal structural models

Marginal structural models (MSM) are parametric models that are used to summarize the relationship between the counterfactual outcome (\(Y_a\) or \(Y_{am}\) for example) and the exposure(s) \(A\) and mediators \(M\). It also possible to summarize the relationship according to a subset of the baseline confounders if it is relevant for the scientific question.

To illustrate the application of MSMs, we will first use the data simulated from the Causal model 1 (with an \(A \ast M\) interaction effect on the outcome) where the exposure \(A\) doesn’t affect the confounder \(L(1)\) between the mediator and the outcome.

8.1 MSM for the Average Total Effect (ATE)

8.1.1 Expressing the ATE using coefficients of an MSM

For a continuous or binary outcome, we can use the following MSM to summarize the relationship between the counterfactual outcome (\(Y_a\)) and the exposure(s) \(A\): \[\begin{equation} \mathbb{E}(Y_a) = \alpha_0 + \alpha_A a \tag{8.1} \end{equation}\]

The Average Total Effect \(\text{ATE} = \mathbb{E}(Y_{A=1}) - \mathbb{E}(Y_{A=0})\) can then be expressed using the coefficients of this MSM (8.1): \[\begin{equation*} \text{ATE} := \left(\alpha_0 + \alpha_A \times 1 \right) - \left(\alpha_0 + \alpha_A \times 0 \right) = \alpha_A \end{equation*}\]

In this example, the coefficient \(\alpha_A\) corresponds to the \(\text{ATE}\).

Such a model is not very useful for a binary exposure. It would be much more useful for higher-dimensional exposures, for example with a continuous exposure, where the relationship between all the possible continuous values of the exposure \(A=a\) and the corresponding outcomes \(Y_a\) is summarized (and arbitrarily simplified) by a single line and the slope coefficient \(\alpha_A\).

It is also possible to define MSMs adjusted for a subset \(V\) of the baseline confounders (\(V \subset L(0)\)). Such MSMs can be useful to estimate conditional effects. For example it is possible to define an average “conditional” total effect \(\text{ATE}|L(0)\) (instead of the marginal ATE defined above), projecting the counterfactual outcomes on a parametric model such as the following:

\[\begin{equation} \mathbb{E}(Y_a \mid L(0)) = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_\text{soc.env.} L_\text{soc.env.}(0) \tag{8.2} \end{equation}\]

so that \(\text{ATE} \mid L(0) = \mathbb{E}(Y_{A=1} \mid L(0)) - \mathbb{E}(Y_{A=0} \mid L(0)) = \alpha_A\) using the coefficient from the MSM (8.2).

MSMs are also very useful to study interactions, or effect modification of the exposure \(A\) by a baseline confounder. For example, in order to study the average total effect according to sex, we can use the following MSM:

\[\begin{equation} \mathbb{E}(Y_a \mid L_\text{male}(0)) = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_{A \ast L_\text{male}} \left(a \times L_\text{male}(0)\right) \tag{8.3} \end{equation}\]

and express the average total effect in each strata of sex using the coefficients of the MSM (8.3):

\[\begin{align*} \{\text{ATE} \mid L_\text{male}(0) = 0\} &:= \mathbb{E}(Y_1 \mid L_\text{male}(0) = 0) - \mathbb{E}(Y_0 \mid L_\text{male}(0) = 0) = \alpha_A \\ \{\text{ATE} \mid L_\text{male}(0) = 1\} &:= \mathbb{E}(Y_1 \mid L_\text{male}(0) = 1) - \mathbb{E}(Y_0 \mid L_\text{male}(0) = 1) = \alpha_A + \alpha_{A \ast L_\text{male}} \end{align*}\]

Because MSMs are models of unobserved counterfactual outcomes, estimators of the MSM coefficients are necessary. We will describe two possible approaches : estimation by IPTW or by G-computation.

In both approaches, 95% confidence intervals can be computed by bootstrap.

8.1.2 Estimation of the MSM coefficients by IPTW

MSM coefficients can be easily estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.

For example, in order to fit the MSM (8.3) described above, we can use a linear regression of the (observed) outcome \(Y\) on the exposure and sex, weigthed by individual weights \(w_i\) or \(sw_i\): \[\begin{equation} \mathbb{E}\left[Y \mid L_\text{male}(0)\right] = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_{A \ast L_\text{male}} \left(a \times L_\text{male}(0)\right) \end{equation}\]

where \(w_i=\frac{1}{P(A=a_i \mid L(0)=l(0)_i)}\) or \(sw_i=\frac{P(A=a_i \mid L_\text{male}(0))}{P(A=a_i \mid L(0)=l(0)_i)}\).

As in chapter 7, the “no-unmeasured confounding” assumption is addressed by the application of weights \(w_i\) or \(sw_i\), which balance confounders \(L(0)\) relative to the exposure \(A\).

rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## 1. Denominator of the weight
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
           family = "binomial", data = df1_int)

# 1b. Predict each individual's probability of being exposed to her own exposure
# predict the probabilities P(A0_PM2.5 = 1) & P(A0_PM2.5 = 0)
pred.g1.L <- predict(g.A.L, type="response")
pred.g0.L <- 1 - pred.g1.L
# the predicted probability of the observed treatment P(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- pred.g1.L[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- pred.g0.L[df1_int$A0_PM2.5==0]

## 2. Numerator of the weight
# The numerator of the weight can be 1 for simple weights,
# or g(A=a_i|V) to obtain stabilized weights which put less weight to individuals
# with less observation. Stabilized weights enable a weaker positivity assumption.

# 2a. Estimate g(A=a_i | sex) (numerator of the stabilized weight)
g.A.sex <- glm(A0_PM2.5 ~ L0_male,
               family = "binomial", data = df1_int)

# 2b. Predict each individual's probability of being exposed to her own exposure
# predict the probabilities P(A0_PM2.5 = 1 | sex) & P(A0_PM2.5 = 0 | sex)
pred.g1.sex <- predict(g.A.sex, type="response")
pred.g0.sex <- 1 - pred.g1.sex
# the predicted probability of the observed treatment P(A = a_i | sex) is :
gAi.sex <- rep(NA, nrow(df1_int))
gAi.sex[df1_int$A0_PM2.5==1] <- pred.g1.sex[df1_int$A0_PM2.5==1]
gAi.sex[df1_int$A0_PM2.5==0] <- pred.g0.sex[df1_int$A0_PM2.5==0]

## 3. Define individual weights:
# We can use simple weights w = 1 / g(A=a_i | L(0))
w <- 1 / gAi.L
# Or alternatively, we can use stabilized weights : 
# sw = g(A=a_i | sex) / g(A=a_i | L(0))
sw <- gAi.sex / gAi.L

# we can see that stabilized weights have less extreme values
par(mfcol = c(1,2))
boxplot(w ~ df1_int$A0_PM2.5)
boxplot(sw ~ df1_int$A0_PM2.5)
par(mfcol = c(1,1))

## applying these weights creates a pseudo-population were the baseline
## confounders are balanced, relative to the exposure:
## before applying weights to the individuals:
table(df1_int$L0_male, df1_int$A0_PM2.5, deparse.level = 2)
#                df1_int$A0_PM2.5
# df1_int$L0_male    0    1
#               0 4527  463
#               1 4349  661
prop.table(table(df1_int$L0_male, df1_int$A0_PM2.5, deparse.level = 2),
           margin = 2)
#                          df1_int$A0_PM2.5
# df1_int$L0_male         0         1
#               0 0.5100270 0.4119217
#               1 0.4899730 0.5880783   # sex is unbalanced: 49% versus 59%

## after applying weights to the individuals:
library(questionr) # The questionr package enables to describe weighted populations
wtd.table(x = df1_int$L0_male, y = df1_int$A0_PM2.5,
          weights = w)
#          0        1
# 0 4989.425 4918.462
# 1 5010.862 5057.676
prop.table(wtd.table(x = df1_int$L0_male, y = df1_int$A0_PM2.5,
                     weights = w), margin = 2)
#           0         1
# 0 0.4989282 0.4930227
# 1 0.5010718 0.5069773 # sex is balanced in the weighted population

## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
# (for relative risk or rate ratios, we can apply a Poisson family; 
#  for OR, we can apply a binomial family)
msm1 <- glm(Y_death ~ A0_PM2.5 + L0_male + A0_PM2.5*L0_male,
           weights = w,
           family = "gaussian",
           data = df1_int)
coef(msm1)
# (Intercept)         A0_PM2.5          L0_male A0_PM2.5:L0_male
#  0.17573472       0.05883228       0.04598911       0.03077614

msm2 <- glm(Y_death ~ A0_PM2.5 + L0_male + A0_PM2.5*L0_male,
            weights = sw,
            family = "gaussian",
            data = df1_int)
coef(msm2)
# (Intercept)         A0_PM2.5          L0_male A0_PM2.5:L0_male
#  0.17573472       0.05883228       0.04598911       0.03077614

## 5. Estimate the ATE stratified by sex
# According to MSM1 (with simple weights)
ATE.msm1.male0 <- coef(msm1)["A0_PM2.5"]
# 0.05883228
ATE.msm1.male1 <- coef(msm1)["A0_PM2.5"] + coef(msm1)["A0_PM2.5:L0_male"]
# 0.08960842

# According to the MSM2 (with stabilized weights)
ATE.msm2.male0 <- coef(msm2)["A0_PM2.5"]
# 0.05883228
ATE.msm2.male1 <- coef(msm2)["A0_PM2.5"] + coef(msm2)["A0_PM2.5:L0_male"]
# 0.08960842
# The results are the same because there is no violation of the positivity assumption
# In case of positivity violation, stabilized weights would give more accurate estimates

The ATE estimates of death probability using an MSM estimated by IPTW are respectively +5.9% in women and +9.0% in men.

95% confidence intervals can be calculated by bootstrap.

Note: Using the true data generating model used to simulate the illustrative datasets, the “true” value of the ATE stratified by sex can be calculated:

  • the “true” \((ATE \mid L_\text{male}(0) = 0) = 0.0688\) in women,

  • the “true” \((ATE \mid L_\text{male}(0) = 1) = 0.0703\) in men.

8.1.3 Estimation of the MSM coefficients by G-computation (imputation)

We can also use a G-computation (sometimes described as an imputation) approach to estimate the coefficients of an MSM.

The following steps can be applied:

  1. Fit a (logistic or a linear) regression to estimate \(\overline{Q} = \mathbb{E}(Y \mid A, L(0))\). Don’t forget to add an \(A*L_{\text{male}}(0)\) interaction term, because we aim to estimate the effect of the exposure \(A\), modified by sex.

  2. Use this estimate to predict an outcome for each subject under the counterfactual scenarios \(\hat{\overline{Q}}(A=0)_i\) and \(\hat{\overline{Q}}(A=1)_i\), by evaluating the regression fit \(\overline{Q}\) at \(A=0\) and \(A=1\) respectively

  3. Duplicate the initial dataset in a single long dataset in which:

  • the first half of the long dataset corresponds to the first counterfactual scenario with \(A=0\) for all individuals and an additional column for the predicted counterfactual outcome \(\hat{\overline{Q}}(A=0)\) ;
  • the second half of the long dataset corresponds to the second counterfactual scenario with \(A=1\) for all individuals and \(\hat{\overline{Q}}(A=1)\) for the predicted counterfactual column.
  1. Fit the MSM \(\mathbb{E}\left[Y_a \mid L_\text{male}(0)\right]\) using the long dataset.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## 1. Estimate Qbar
Q.tot.death <- glm(Y_death ~ A0_PM2.5 + L0_male + L0_soc_env +
                     A0_PM2.5:L0_male, # don't forget the interaction term
                   family = "binomial", data = df1_int)
# The final result would be sligthly different if we apply a binomial family.

## 2. Predict an outcome for each subject, in 2 counterfactual scenarios
##    setting A=0 and A=1
# prepare data sets used to predict the outcome under the counterfactual
# scenarios setting A=0 and A=1
data.A1 <- data.A0 <- df1_int
data.A1$A0_PM2.5 <- 1
data.A0$A0_PM2.5 <- 0

# predict values under the same name in the corresponding counterfactual dataset
data.A1$Ya.death.pred <- predict(Q.tot.death, newdata = data.A1, type = "response")
data.A0$Ya.death.pred <- predict(Q.tot.death, newdata = data.A0, type = "response")

## 3. Append both counterfactual datasets in a single long dataset
# (the number of row is twice the initial number of row because there are 2
# counterfactual scenarios)
data.2scenarios <- rbind(data.A0, data.A1)

## 4. fit the MSM: E(Y_a|sex)
# a GLM with gaussian family can be applied to estimate risk differences
MSM.ATE.gcomp <- glm(Ya.death.pred ~ A0_PM2.5 + L0_male + A0_PM2.5:L0_male,
                     family = "gaussian",
                     data = data.2scenarios)
coef(MSM.ATE.gcomp)
# (Intercept)         A0_PM2.5          L0_male A0_PM2.5:L0_male
#  0.17571658       0.06108129       0.04602303       0.02516386

## 5. Estimate the ATE stratified by sex
# According to MSM.ATE.gcomp
ATE.MSM.gcomp.male0 <- coef(MSM.ATE.gcomp)["A0_PM2.5"]
# 0.06108129
ATE.MSM.gcomp.male1 <- (coef(MSM.ATE.gcomp)["A0_PM2.5"] +
                          coef(MSM.ATE.gcomp)["A0_PM2.5:L0_male"])
# 0.08624515

8.2 MSM for Controlled Direct Effects

8.2.1 Expressing the CDE using coefficients of an MSM

The controlled direct effect is defined as \(\text{CDE}_m= \mathbb{E}(Y_{am}) - \mathbb{E}(Y_{a^*m})\).

Using the following MSM \[\begin{equation} \mathbb{E}(Y_{am}) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m \tag{8.4} \end{equation}\] the controlled direct effect (keeping the mediator constant to the value \(M=m\)) can be expressed using the coefficients of the MSM (8.4): \[\begin{align*} \text{CDE}_m &= (\alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m) - (\alpha_0 + \alpha_A a^* + \alpha_M m + \alpha_{A \ast M} a^* \times m) \\ \text{CDE}_m &= \alpha_A(a - a^* ) + \alpha_{A \ast M} \times (a - a^*) \times m \end{align*}\] For a binary exposure \(A\), we have \(\text{CDE}_m=\alpha_A + \alpha_{A \ast M} \times m\).

8.2.2 Estimation of the MSM coefficients by IPTW

MSM coefficients can be easily estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.

In order to fit the MSM (8.4), we can use a linear regression of the (observed) outcome \(Y\) on the exposure and mediator, weighted by individual stabilized weights \(sw_i\) ((VanderWeele 2009)): \[\begin{equation} \mathbb{E}\left(Y \mid A,M\right) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m \end{equation}\]

where \(sw_i\) is the product of two weights \(sw_i = sw_{A,i} \times sw_{M,i}\),

\(sw_{A,i}=\frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)}\) and \(sw_{M,i}=\frac{P(M=m_i \mid A=a_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i}\).

The “no-unmeasured confounding” assumption is addressed by the application of weights \(sw_i\), which balances confounders \(L(0)\) relative to the exposure-outcome \(A-Y\) relationship, and balance the set of confounders \(\{L(0),A,L(1)\}\) relative to the mediator-outcome \(M-Y\) relationship.

8.2.2.1 Example 1, with no intermediate confounder affected by the exposure

### MSM of CDE, estimated by IPTW ----------------------------------------------
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
             family = "binomial", data = df1_int)
# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df1_int$A0_PM2.5==0]

# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df1_int))
gAi[df1_int$A0_PM2.5==1] <- predict(g.A, type="response")[df1_int$A0_PM2.5==1]
gAi[df1_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df1_int$A0_PM2.5==0]

# 1e. Calculate the weight for the exposure A: sw_{A,i}
sw_Ai <- gAi / gAi.L

## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
             family = "binomial", data = df1_int)
# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df1_int))
gMi.L[df1_int$M_diabetes==1] <- predict(g.M.L, type="response")[df1_int$M_diabetes==1]
gMi.L[df1_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df1_int$M_diabetes==0]

# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df1_int)
# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df1_int))
gMi.A[df1_int$M_diabetes==1] <- predict(g.M.A, type="response")[df1_int$M_diabetes==1]
gMi.A[df1_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df1_int$M_diabetes==0]
# 2e. Calculate the weight for the mediator M: sw_{M,i}
sw_Mi <- gMi.A / gMi.L

## 3. Define the individual stabilized weight for the CDE_m
sw_cde <- sw_Ai * sw_Mi

## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
msm_cde <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5*M_diabetes,
               weights = sw_cde,
               family = "gaussian",
               data = df1_int)
coef(msm_cde)
# (Intercept)            A0_PM2.5          M_diabetes A0_PM2.5:M_diabetes 
#  0.17891689          0.06798282          0.06729724         -0.00495314

## 5. Estimate CDE for m=0 and for m=1 using the MSM's coefficients
CDE_mis0 <- coef(msm_cde)["A0_PM2.5"]
# 0.06798282
CDE_mis1 <- coef(msm_cde)["A0_PM2.5"] + coef(msm_cde)["A0_PM2.5:M_diabetes"]
# 0.06302968

In this example, our estimates of the controlled direct effects are \(CDE_{M=0} = 6.8\%\) and \(CDE_{M=1} = 6.3\%\). Confidence intervals can be calculated by bootstrap.

Importantly, this approach for the estimation of the controlled direct effect \(\text{CDE}_m\) by IPTW is also valid if the exposure \(A\) affects the intermediate confounder \(L(1)\) (as with the Causal model 2).

8.2.2.2 Example 2, with intermediate confounder affected by the exposure

Below, we estimate the CDE by hand, and using the ltmle package and the CMAverse package

### MSM of CDE, estimated by IPTW
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")

## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
             family = "binomial", data = df2_int)
# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df2_int))
gAi.L[df2_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df2_int$A0_PM2.5==1]
gAi.L[df2_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df2_int$A0_PM2.5==0]

# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df2_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df2_int))
gAi[df2_int$A0_PM2.5==1] <- predict(g.A, type="response")[df2_int$A0_PM2.5==1]
gAi[df2_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df2_int$A0_PM2.5==0]

# 1e. Calculate the weight for the exposure A: sw_{A,i}
sw_Ai <- gAi / gAi.L

## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
             family = "binomial", data = df2_int)
# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df2_int))
gMi.L[df2_int$M_diabetes==1] <- predict(g.M.L, type="response")[df2_int$M_diabetes==1]
gMi.L[df2_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df2_int$M_diabetes==0]

# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df2_int)
# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df2_int))
gMi.A[df2_int$M_diabetes==1] <- predict(g.M.A, type="response")[df2_int$M_diabetes==1]
gMi.A[df2_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df2_int$M_diabetes==0]
# 2e. Calculate the weight for the mediator M: sw_{M,i}
sw_Mi <- gMi.A / gMi.L

## 3. Define the individual stabilized weight for the CDE_m
sw_cde <- sw_Ai * sw_Mi

## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
msm_cde <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5*M_diabetes,
               weights = sw_cde,
               family = "gaussian",
               data = df2_int)
coef(msm_cde)
# (Intercept)            A0_PM2.5          M_diabetes A0_PM2.5:M_diabetes
#  0.17932146          0.06012920          0.07239661          0.03017266

## 5. Estimate CDE for m=0 and for m=1 using the MSM's coefficients
CDE_mis0 <- coef(msm_cde)["A0_PM2.5"]
# 0.0601292
CDE_mis1 <- coef(msm_cde)["A0_PM2.5"] + coef(msm_cde)["A0_PM2.5:M_diabetes"]
# 0.09030186


### Estimation by IPTW using ltmle package
library(ltmle)
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
           Y_death="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
                    A0_PM2.5 * M_diabetes")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env",
           "M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1")
data_binary <- subset(df2_int, select = c(L0_male, L0_soc_env,
                                          A0_PM2.5, L1,
                                          M_diabetes, Y_death))
CDE_ltmle_M0_death <- ltmle(data = data_binary,
                            Anodes = c("A0_PM2.5", "M_diabetes"),
                            Lnodes = c("L1"), # intermediate confounders +/- baseline
                            Ynodes = c("Y_death"),
                            survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                            Qform = Qform,
                            gform = gform,
                            abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
                                        c(0,0)), # counterfactual intervention do(A=0,M=0)
                            SL.library = NULL,
                            estimate.time = FALSE, # estimate computation time?
                            iptw.only = TRUE, # for only IPTW estimations (no TMLE)
                            gcomp = FALSE,
                            variance.method = "iptw") # IPTW influence curve
# CDE with M=0
summary(CDE_ltmle_M0_death, estimator = "iptw")$effect.measures$ATE
# $estimate
# [1] 0.0601292
# $CI
#            2.5%      97.5%
# [1,] 0.02424819 0.09601021

CDE_ltmle_M1_death <- ltmle(data = data_binary,
                            Anodes = c("A0_PM2.5", "M_diabetes"),
                            Lnodes = c("L1"), # intermediate confounders +/- baseline
                            Ynodes = c("Y_death"),
                            survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                            Qform = Qform,
                            gform = gform,
                            abar = list(c(1,1), # counterfactual intervention do(A=1,M=0)
                                        c(0,1)), # counterfactual intervention do(A=0,M=0)
                            SL.library = NULL,
                            estimate.time = FALSE, # estimate computation time?
                            iptw.only = TRUE, # for only IPTW estimations (no TMLE)
                            gcomp = FALSE,
                            variance.method = "iptw") # IPTW influence curve
# CDE with M=1
summary(CDE_ltmle_M1_death, estimator = "iptw")$effect.measures$ATE
# $estimate
# [1] 0.09030186
# $CI
#            2.5%     97.5%
# [1,] 0.04084792 0.1397558

## we obtain the same results as the IPTW estimation by hand


### Estimation by IPTW using ltmle package
library(CMAverse)
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")
cmdag(outcome = "Y_death", exposure = "A0_PM2.5", mediator = "M_diabetes",
      basec = c("L0_male", "L0_soc_env"), postc = "L1", node = TRUE, text_col = "white")
# In this setting, we can use the marginal structural model and the $g$-formula approach. The results are shown below.

## The Marginal Structural Model
res_msm_RD <- cmest(data = df2_int,
                    model = "msm",
                    outcome = "Y_death",
                    exposure = "A0_PM2.5",
                    mediator = "M_diabetes",
                    basec = c("L0_male", "L0_soc_env"),
                    postc = "L1",
                    EMint = TRUE, # E*M interaction
                    ereg = "logistic",
                    yreg = "linear", # MSM is a linear regression (to get RD)
                    mreg = list("logistic"),
                    wmnomreg = list("logistic"),
                    wmdenomreg = list("logistic"),
                    astar = 0, #E(Y_{A=0,M=1})
                    a = 1,  #E(Y_{A=1,M=1})
                    mval = list(1), # for the CDE, set mediator to M=1
                    estimation = "imputation",
                    inference = "bootstrap",
                    nboot = 2)

summary(res_msm_RD)
# Causal Mediation Analysis
# # Outcome regression:
# Call:
#   glm(formula = Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes, 
#       family = gaussian(), data = getCall(x$reg.output$yreg)$data, 
#       weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)      # MSM coef
#   (Intercept)         0.179321   0.005242  34.210  < 2e-16 ***  # are the
#   A0_PM2.5            0.060129   0.017253   3.485 0.000494 ***  # same as
#   M_diabetes          0.072397   0.009242   7.834 5.21e-15 ***  # calculation
#   A0_PM2.5:M_diabetes 0.030173   0.025895   1.165 0.243960      # by hand
#
# # Mediator regressions: 
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
#       weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
#   (Intercept) -0.73432    0.02268 -32.384  < 2e-16 ***
#   A0_PM2.5     0.51531    0.06422   8.024 1.02e-15 ***
# 
# # Mediator regressions for weighting (denominator): 
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env + 
#         L1, family = binomial(), data = getCall(x$reg.output$wmdenomreg[[1L]])$data, 
#       weights = getCall(x$reg.output$wmdenomreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
#   (Intercept) -1.36249    0.04783 -28.488  < 2e-16 ***
#   A0_PM2.5     0.30994    0.06668   4.648 3.35e-06 ***
#   L0_male      0.24661    0.04369   5.644 1.66e-08 ***
#   L0_soc_env   0.30628    0.04650   6.587 4.50e-11 ***
#   L1           0.86045    0.04493  19.152  < 2e-16 ***
# 
# # Mediator regressions for weighting (nominator): 
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), data = getCall(x$reg.output$wmnomreg[[1L]])$data, 
#       weights = getCall(x$reg.output$wmnomreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
#   (Intercept) -0.74205    0.02271 -32.680   <2e-16 ***
#   A0_PM2.5     0.55288    0.06408   8.628   <2e-16 ***
# 
# # Exposure regression for weighting: 
# Call:
#   glm(formula = A0_PM2.5 ~ L0_male + L0_soc_env, family = binomial(), 
#       data = getCall(x$reg.output$ereg)$data, weights = getCall(x$reg.output$ereg)$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)    
#   (Intercept) -2.73244    0.07425 -36.799  < 2e-16 ***
#   L0_male      0.40580    0.06447   6.294 3.09e-10 ***
#   L0_soc_env   0.64060    0.07350   8.716  < 2e-16 ***
# 
# # Effect decomposition on the mean difference scale via the marginal structural model
# Direct counterfactual imputation estimation with 
# bootstrap standard errors, percentile confidence intervals and p-values 
#                   Estimate  Std.error    95% CIL 95% CIU  P.val    
#   cde            9.030e-02  1.601e-02  6.094e-02   0.082 <2e-16 *** # same result
#   rpnde          7.003e-02  1.600e-02  5.838e-02   0.080 <2e-16 ***
#   rtnde          7.374e-02  1.597e-02  5.888e-02   0.080 <2e-16 ***
#   rpnie          8.903e-03  9.159e-04  8.284e-03   0.010 <2e-16 ***
#   rtnie          1.261e-02  9.523e-04  8.738e-03   0.010 <2e-16 ***
#   te             8.265e-02  1.505e-02  6.839e-02   0.089 <2e-16 ***
#   rintref       -2.027e-02  7.341e-06 -2.571e-03  -0.003 <2e-16 ***
#   rintmed        3.710e-03  3.634e-05  4.542e-04   0.001 <2e-16 ***
#   cde(prop)      1.093e+00  2.940e-02  8.907e-01   0.930 <2e-16 ***
#   rintref(prop) -2.452e-01  6.289e-03 -3.752e-02  -0.029 <2e-16 ***
#   rintmed(prop)  4.489e-02  1.662e-03  5.140e-03   0.007 <2e-16 ***
#   rpnie(prop)    1.077e-01  3.403e-02  9.376e-02   0.139 <2e-16 ***
#   rpm            1.526e-01  3.569e-02  9.890e-02   0.147 <2e-16 ***
#   rint          -2.004e-01  4.627e-03 -3.014e-02  -0.024 <2e-16 ***
#   rpe           -9.263e-02  2.940e-02  6.983e-02   0.109 <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
#  (cde: controlled direct effect; 
#  rpnde: randomized analogue of pure natural direct effect; 
#  rtnde: randomized analogue of total natural direct effect; 
#  rpnie: randomized analogue of pure natural indirect effect; 
#  rtnie: randomized analogue of total natural indirect effect; 
#  te: total effect; rintref: randomized analogue of reference interaction; 
#  rintmed: randomized analogue of mediated interaction; 
#  cde(prop): proportion cde; 
#  rintref(prop): proportion rintref; 
#  rintmed(prop): proportion rintmed; 
#  rpnie(prop): proportion rpnie; 
#  rpm: randomized analogue of overall proportion mediated; 
#  rint: randomized analogue of overall proportion attributable to interaction; 
#  rpe: randomized analogue of overall proportion eliminated)

8.2.3 Estimation of the MSM coefficients by G-computation

As for the ATE, we can use G-computation to estimate the coefficients of the MSM to estimate Controlled Direct Effects.

Note that the algorithm described below is correct only if the exposure \(A\) doesn’t affect the intermediate confounders \(L(1)\) of the \(M \rightarrow Y\) relationship (such as the data simulated from the Causal model 1 in our examples). In that case, the following steps can be applied:

  1. Fit a (logistic or a linear) regression to estimate \(\overline{Q} = \mathbb{E}(Y \mid A,M, L(0),L(1))\)

  2. Use this estimate to predict an outcome for each subject under the counterfactual scenarios \(\hat{\overline{Q}}(A=0,M=m,L(0),L(1))_i\) and \(\hat{\overline{Q}}(A=1,M=m,L(0),L(1))_i\), by evaluating the regression fit \(\overline{Q}\) at \((A=0,M=m)\) and \((A=1,M=m)\) respectively. If we want to set the level of the mediator to \(M=0\) and \(M=1\), this would give 4 counterfactual scenarios \(\text{do}(A=0,M=0)\), \(\text{do}(A=1,M=0)\), \(\text{do}(A=0,M=1)\) and \(\text{do}(A=1,M=1)\).

  3. Duplicate the initial dataset for each scenario in a single long dataset in which:

  • the 1st part of the long dataset corresponds to the first counterfactual scenario with \((A=0,M=0)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=0,M=0}\mid L(0),L(1)) = \hat{\overline{Q}}(A=0,M=0,L(0),L(1))\) ;
  • the 2d part of the long dataset corresponds to the second counterfactual scenario with \((A=1,M=0)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=1,M=0}\mid L(0),L(1)) = \hat{\overline{Q}}(A=1,M=0,L(0),L(1))\) ;
  • the 3d part of the long dataset corresponds to the second counterfactual scenario with \((A=0,M=1)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=0,M=1}\mid L(0),L(1)) = \hat{\overline{Q}}(A=0,M=1,L(0),L(1))\) ;
  • the 4th part of the long dataset corresponds to the second counterfactual scenario with \((A=1,M=1)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=1,M=1}\mid L(0),L(1)) = \hat{\overline{Q}}(A=1,M=1,L(0),L(1))\) ;
  1. Fit the MSM \(\mathbb{E}(Y_{am}) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m\) using the long dataset.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

### MSM of CDE, estimated by G-computation -------------------------------------
## 1. Estimate Qbar(A,M,L0,L1)
Q.cde.death <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes + 
                     L0_male + L0_soc_env + L1,
                   family = "gaussian", data = df1_int)
# The final result would be sligthly different if we applied a binomial family
# The Gaussian family corresponds to the true generating model in this example.

## 2. Predict an outcome for each subject, in each counterfactual scenario
# Prepare data sets that will be used to predict the outcome under the counterfactual
# 4 counterfactual scenarios setting (A=0,M=0), (A=1,M=0), (A=0,M=1) and (A=1,M=1)
data.A0M0 <- data.A1M0 <- data.A0M1 <- data.A1M1 <- df1_int
data.A0M0$A0_PM2.5 <- 0
data.A0M0$M_diabetes <- 0

data.A1M0$A0_PM2.5 <- 1
data.A1M0$M_diabetes <- 0

data.A0M1$A0_PM2.5 <- 0
data.A0M1$M_diabetes <- 1

data.A1M1$A0_PM2.5 <- 1
data.A1M1$M_diabetes <- 1

# predict values under the same name in the corresponding counterfactual dataset
data.A0M0$Yam.death.pred <- predict(Q.cde.death, newdata = data.A0M0, type = "response")
data.A1M0$Yam.death.pred <- predict(Q.cde.death, newdata = data.A1M0, type = "response")
data.A0M1$Yam.death.pred <- predict(Q.cde.death, newdata = data.A0M1, type = "response")
data.A1M1$Yam.death.pred <- predict(Q.cde.death, newdata = data.A1M1, type = "response")

## 3. Append the 4 counterfactual datasets in a single long dataset
# number of row is 4 times the initial value (we have 4 counterfactual scenarios)
data.4scenarios <- rbind(data.A0M0, data.A1M0,data.A0M1,data.A1M1)

## 4. fit the MSM: E(Y_am) = alpha_0 + alpha_A a + alpha_M m + alpha_AM a:m
MSM.CDE.gcomp <- glm(Yam.death.pred ~ A0_PM2.5 +  M_diabetes + A0_PM2.5:M_diabetes,
                     family = "gaussian", # gaussian family for risk differences
                     data = data.4scenarios)
coef(MSM.CDE.gcomp)
# (Intercept)            A0_PM2.5          M_diabetes A0_PM2.5:M_diabetes 
#  0.17968603          0.06000138          0.06757214          0.01918153

## 5. Estimate the CDE(M=m)
# CDE(M=0) = E(Y_{A=1,M=0}) - E(Y_{A=0,M=0})
CDE_mis0_gcomp <- coef(MSM.CDE.gcomp)["A0_PM2.5"]
# 0.06000138

# CDE(M=1) = E(Y_{A=1,M=1}) - E(Y_{A=0,M=1})
CDE_mis1_gcomp <- (coef(MSM.CDE.gcomp)["A0_PM2.5"] +
                     coef(MSM.CDE.gcomp)["A0_PM2.5:M_diabetes"])
# 0.07918291

# Note: Applying a binomial family for the first Qbar model would result in two
# sligthly different values of the CDE(M=m)
# => 0.05934409 in setting M=0
# => 0.07537874 in setting M=1

If the exposure \(A\) affects the intermediate confounder \(L(1)\), as in the Causal model 2, the steps 1) and 2) of the algorithm should follow the method described in paragraph 6.2.1 or 6.2.2.

Here is an example applying G-computation by iterative conditional expectation:

rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")

### MSM of CDE, estimated by G-computation (by ICE) ----------------------------
## 1a) Regress the outcome on L0, A, L1 and M (and the A*M interaction if appropriate)
Y.death.model <- glm(Y_death ~ L0_male + L0_soc_env + A0_PM2.5 + L1 +
                       M_diabetes + A0_PM2.5:M_diabetes,
                     family = "binomial", data = df2_int)

## 1b) Generate predicted values by evaluating the regression  
##     under the 4 counterfactual scenarios
data.A0M0 <- data.A1M0 <- data.A0M1 <- data.A1M1 <- df2_int
data.A0M0$A0_PM2.5 <- 0
data.A0M0$M_diabetes <- 0

data.A1M0$A0_PM2.5 <- 1
data.A1M0$M_diabetes <- 0

data.A0M1$A0_PM2.5 <- 0
data.A0M1$M_diabetes <- 1

data.A1M1$A0_PM2.5 <- 1
data.A1M1$M_diabetes <- 1

Q.Y.death.A0M0 <- predict(Y.death.model, newdata = data.A0M0, type = "response")
Q.Y.death.A1M0 <- predict(Y.death.model, newdata = data.A1M0, type = "response")
Q.Y.death.A0M1 <- predict(Y.death.model, newdata = data.A0M1, type = "response")
Q.Y.death.A1M1 <- predict(Y.death.model, newdata = data.A1M1, type = "response")

## 2a) Regress the predicted values conditional on the observed exposure A
##    and baseline confounders L(0)
L1.death.A0M0.model <- glm(Q.Y.death.A0M0 ~ L0_male + L0_soc_env + A0_PM2.5,
                           family = "quasibinomial", data = df2_int)
L1.death.A1M0.model <- glm(Q.Y.death.A1M0 ~ L0_male + L0_soc_env + A0_PM2.5,
                           family = "quasibinomial", data = df2_int)
L1.death.A0M1.model <- glm(Q.Y.death.A0M1 ~ L0_male + L0_soc_env + A0_PM2.5,
                           family = "quasibinomial", data = df2_int)
L1.death.A1M1.model <- glm(Q.Y.death.A1M1 ~ L0_male + L0_soc_env + A0_PM2.5,
                           family = "quasibinomial", data = df2_int)


## 2b) generate predicted values by evaluating the regression at exposure
##    of interest: {A=0,M=0}, {A=1,M=0}, {A=0,M=1}, {A=1,M=1}
data.A0M0$Yam.death.pred <- predict(L1.death.A0M0.model,
                                    newdata = data.A0M0, type = "response")
data.A1M0$Yam.death.pred <- predict(L1.death.A1M0.model,
                                    newdata = data.A1M0, type = "response")
data.A0M1$Yam.death.pred <- predict(L1.death.A0M1.model,
                                    newdata = data.A0M1, type = "response")
data.A1M1$Yam.death.pred <- predict(L1.death.A1M1.model,
                                    newdata = data.A1M1, type = "response")


## 3. Append the 4 counterfactual datasets in a single long dataset
# number of row is 4 times the initial value (we have 4 counterfactual scenarios)
data.4scenarios <- rbind(data.A0M0, data.A1M0,data.A0M1,data.A1M1)

## 4. fit the MSM: E(Y_am) = alpha_0 + alpha_A a + alpha_M m + alpha_AM a:m
MSM.CDE.gcomp <- glm(Yam.death.pred ~ A0_PM2.5 +  M_diabetes + A0_PM2.5:M_diabetes,
                     family = "gaussian", # gaussian family for risk differences
                     data = data.4scenarios)
coef(MSM.CDE.gcomp)
# (Intercept)            A0_PM2.5          M_diabetes A0_PM2.5:M_diabetes 
#  0.17974947          0.06342833          0.07366466          0.02469485

## 5. Estimate the CDE(M=m)
# CDE(M=0) = E(Y_{A=1,M=0}) - E(Y_{A=0,M=0})
CDE_mis0_gcomp_ice <- coef(MSM.CDE.gcomp)["A0_PM2.5"]
# 0.06342833

# CDE(M=1) = E(Y_{A=1,M=1}) - E(Y_{A=0,M=1})
CDE_mis1_gcomp_ice <- (coef(MSM.CDE.gcomp)["A0_PM2.5"] +
                         coef(MSM.CDE.gcomp)["A0_PM2.5:M_diabetes"])
# 0.08812318

The example above (MSM estimation using G-computation by ICE) corresponds to the algorithm applied by the ltmle package:

library(ltmle)
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
           Y_death="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
                    A0_PM2.5 * M_diabetes")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env",
           "M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1")
# in this example of g-computation, the propensity scores 'gform' will not be used

data_binary <- subset(df2_int, select = c(L0_male, L0_soc_env,
                                          A0_PM2.5, L1,
                                          M_diabetes, Y_death))
CDE_ltmle_M0 <- ltmle(data = data_binary,
                      Anodes = c("A0_PM2.5", "M_diabetes"),
                      Lnodes = c("L1"), # intermediate confounders +/- baseline
                      Ynodes = c("Y_death"),
                      survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                      Qform = Qform,
                      gform = gform,
                      abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
                                  c(0,0)), # counterfactual intervention do(A=0,M=0)
                      SL.library = NULL, # calls glm() instead of SuperLearner
                      estimate.time = FALSE, # estimate computation time?
                      gcomp = TRUE, # to apply g-computation
                      variance.method = "ic")
# CDE with M=0
summary(CDE_ltmle_M0)$effect.measures$ATE$estimate
#   Parameter Estimate:  0.06342833

CDE_ltmle_M1 <- ltmle(data = data_binary,
                      Anodes = c("A0_PM2.5", "M_diabetes"),
                      Lnodes = c("L1"), # intermediate confounders +/- baseline
                      Ynodes = c("Y_death"),
                      survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                      Qform = Qform,
                      gform = gform,
                      abar = list(c(1,1), # counterfactual intervention do(A=1,M=0)
                                  c(0,1)), # counterfactual intervention do(A=0,M=0)
                      SL.library = NULL, # calls glm() instead of SuperLearner
                      estimate.time = FALSE, # estimate computation time?
                      gcomp = TRUE, # to apply g-computation
                      variance.method = "ic")
# CDE with M=1
summary(CDE_ltmle_M1)$effect.measures$ATE$estimate
#   Parameter Estimate:  0.08812318

8.3 MSM for Natural Direct and Indirect Effects

8.3.1 Expressing the NDE and NIE using coefficients of 2 MSMs

The (Pure) Natural Direct Effect is defined by \(\text{PNDE} = \mathbb{E}(Y_{a,M_{a^*}}) - \mathbb{E}(Y_{a^*,M_{a^*}})\) and the (Total) Natural Indirect Effect is defined by \(\text{TNIE} = \mathbb{E}(Y_{a,M_a}) - \mathbb{E}(Y_{a,M_{a^*}})\).

VanderWeele suggested using 2 MSMs conditional on baseline confounders \(L(0)\) in order to estimate natural direct and indirect effects (VanderWeele 2009):

  1. a model of the counterfactual values of the outcome \(\mathbb{E}(Y_{a,m} \mid l(0))=h^{-1}(a,m,l(0),l(1))\), where \(h\) is a link function. For example: \[\begin{equation} \mathbb{E}(Y_{a,m} \mid l(0),l(1)) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{AM} a \times m + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \tag{8.5} \end{equation}\] (where \(h\) is the identity function, so that the model can be used to express risk differences)

  2. a model of the counterfactual values of the mediator \(\mathbb{E}(M_{a} \mid L(0))=g^{-1}(a,l(0),l(1))\), where \(g\) is a link function. For example with a binary mediator: \[\begin{equation} \mathbb{E}(M_a \mid l(0),l(1)) = g^{-1}\left[\beta_0 + \beta_A a + \beta_{L(0)} l(0) + \beta_{L(1)} l(1) \right] \tag{8.6} \end{equation}\] (where \(g\) is the logit function because the mediator is binary).

VanderWeele shows that if the function \(h\) is linear in \(m\) (no quadratic terms in \(m\), nor transformations such as \(\log(m)\) or \(\sqrt{m}\), etc) and the exposure \(A\) does not affect the intermediate confounder \(L(1)\), then \[\begin{equation*} \mathbb{E}(Y_{a,M_{a^*}}) = h^{-1}\left[a,g^{-1}\left(a^*,l(0),l(1)\right),l(0),l(1)\right] \end{equation*}\]

Using the 2 MSMs, we can express the Natural Direct and Indirect Effects conditional on baseline confounders \(L(0)\). In our example: \[\begin{align*} \text{PNDE} \mid L(0),L(1) &= \mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1)) - \mathbb{E}(Y_{a^*,M_{a^*}} \mid L(0),L(1)) \\ &= \{\alpha_0 + \alpha_A a + [\alpha_M + \alpha_{AM}a] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ & \quad \quad - \{ \alpha_0 + \alpha_A a^*+ [\alpha_M + \alpha_{AM}a^*] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1)\} \\ &= (a - a^*) \times [\alpha_A + \alpha_{AM} \times g^{-1}(a^*,l(0),l(1))] \end{align*}\]

\[\begin{align*} \text{TNIE} \mid L(0),L(1) &= \mathbb{E}(Y_{a,M_{a}} \mid L(0),L(1)) - \mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1)) \\ &= \{\alpha_0 + \alpha_A a + [\alpha_M + \alpha_{AM}a] \times g^{-1}(a,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ & \quad \quad - \{ \alpha_0 + \alpha_A a+ [\alpha_M + \alpha_{AM}a] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ &= \left[ g^{-1}(a,l(0),l(1)) - g^{-1}(a^*,l(0),l(1)) \right](\alpha_M + \alpha_{AM} a) \end{align*}\]

Marginal Natural Direct and Indirect effect can then be obtained: \[\begin{align*} \text{PNDE} &= \sum_{l(0),l(1)} \left[ \text{PNDE} \mid L(0) = l(0), L(1)=l(1) \right] \times P(L(0)=l(0),L(1)=l(1)) \\ \quad \text{TNIE} &= \sum_{l(0),l(1)} \left[ \text{TNIE} \mid L(0) = l(0), L(1)=l(1) \right] \times P(L(0)=l(0),L(1)=l(1)) \end{align*}\]

8.3.2 Estimation of the 2 MSMs coefficients by IPTW for NDE and NIE

As previously, MSM coefficients can be estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.

In order to fit the 1st MSM (8.5), we can use a linear regression of the (observed) outcome \(Y\) on the exposure and mediator, adjusted for \(L(0)\), weighted by individual stabilized weights \(sw_{msm1,i}\) (VanderWeele 2009): \[\begin{equation*} \mathbb{E}\left(Y \mid A,M,L(0)\right) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{AM} a \times m + \alpha_{L(0)} L(0) \end{equation*}\]

where \(sw_{msm1,i}\) is the product of two weights \(sw_{msm1,i} = sw_{A,i} \times sw_{M,i}\), \[\begin{align*} sw_{A,i} =& \frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)} \quad \text{or} \quad sw_{A,i} = \frac{P(A=a_i \mid L(0)=l(0)_i)}{P(A=a_i \mid L(0)=l(0)_i)} = 1 \\ sw_{M,i} =& \frac{P(M=m_i \mid A=a_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i} \\ & \quad \text{or} \quad sw_{M,i} = \frac{P(M=m_i \mid A=a_i,L(0)=l(0)_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i} \\ \end{align*}\]

In order to fit the 2nd MSM (8.6), we can use a logistic regression of the (observed) mediator \(M\) on the exposure, adjusted for \(L(0)\), weighted by individual stabilized weights \(sw_{msm2,i}\): \[\begin{equation*} \text{logit} \mathbb{E}(M \mid a,l(0)) = \beta_0 + \beta_A a + \beta_{L(0)} l(0) \end{equation*}\]

\[\begin{equation*} \text{where} \quad sw_{msm2,i} = \frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)} \end{equation*}\]

### MSM of NDE & NIE, estimated by IPTW ----------------------------------------
## 1. Stabilized weight for the MSM1
# 1a. sw_Ai = g(A=a_i | L(0)) / g(A=a_i | L(0)) = 1
sw_Ai <- rep(1, nrow(df1_int))

# 1b. sw_Mi = g(M=m_i | A,L(0)) / g(M=m_i | A,L(0),L(1))
g.M.AL0 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env,
               family = "binomial", data = df1_int)
g.Mis1.AL0 <- predict(g.M.AL0, type = "response")
sw_M.num <- rep(NA, nrow(df1_int))
sw_M.num[df1_int$M_diabetes==1] <- g.Mis1.AL0[df1_int$M_diabetes==1]
sw_M.num[df1_int$M_diabetes==0] <- (1 - g.Mis1.AL0[df1_int$M_diabetes==0])

g.M.AL0L1 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env + L1,
                 family = "binomial", data = df1_int)
g.Mis1.AL0L1 <- predict(g.M.AL0L1, type = "response")
sw_M.denom <- rep(NA, nrow(df1_int))
sw_M.denom[df1_int$M_diabetes==1] <- g.Mis1.AL0L1[df1_int$M_diabetes==1]
sw_M.denom[df1_int$M_diabetes==0] <- (1 - g.Mis1.AL0L1[df1_int$M_diabetes==0])

sw_msm1 <- sw_Ai * sw_M.num / sw_M.denom

## 2. Estimate coefficients of the MSM1
MSM1 <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes +
              L0_male + L0_soc_env,
            weights = sw_msm1,
            family = "gaussian",
            data = df1_int)
coef(MSM1)
# (Intercept)     A0_PM2.5  M_diabetes    L0_male L0_soc_env  A0_PM2.5:M_diabetes
#  0.12033221   0.06381257  0.06691712 0.04671886 0.05521263           0.01652446

## 3. Stabilized weight for the MSM2
# 3a. sw_A = g(A=a_i) / g(A=a_i | L(0))
# numerator
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
g.Ais1 <- predict(g.A, type = "response")
sw_msm2.num <- rep(NA, nrow(df1_int))
sw_msm2.num[df1_int$A0_PM2.5==1] <- g.Ais1[df1_int$A0_PM2.5==1]
sw_msm2.num[df1_int$A0_PM2.5==0] <- (1 - g.Ais1[df1_int$A0_PM2.5==0])

# denominator
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
              family = "binomial", data = df1_int)
g.Ais1.L0 <- predict(g.A.L0, type = "response")
sw_msm2.denom <- rep(NA, nrow(df1_int))
sw_msm2.denom[df1_int$A0_PM2.5==1] <- g.Ais1.L0[df1_int$A0_PM2.5==1]
sw_msm2.denom[df1_int$A0_PM2.5==0] <- (1 - g.Ais1.L0[df1_int$A0_PM2.5==0])

# stabilized weight
sw_msm2 <- sw_msm2.num / sw_msm2.denom

## 3. Estimate coefficients of the MSM2
MSM2 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env,
            weights = sw_msm2,
            family = "binomial",
            data = df1_int)
coef(MSM2)
# (Intercept)    A0_PM2.5     L0_male   L0_soc_env
#  -1.2723106   0.5883720   0.2566129    0.3270087

## 4. Estimate PNDE conditional on L(0), and the marginal value of PNDE
# a = 1 and a* = 0
# PNDE|L(0) = (a - a*)[alpha_A + alpha_AM.g^-1(a^*,l(0))]
g.minus1.A0 <- plogis(coef(MSM2)["(Intercept)"] + coef(MSM2)["A0_PM2.5"] * 0 +
                      coef(MSM2)["L0_male"] * df1_int$L0_male +
                      coef(MSM2)["L0_soc_env"] * df1_int$L0_soc_env)

# PNDE conditional on L(0)
PNDE_L0 <- (1 - 0) * (coef(MSM1)["A0_PM2.5"] +
                        coef(MSM1)["A0_PM2.5:M_diabetes"] * g.minus1.A0)
# marginal PNDE
PNDE <- mean(PNDE_L0)
# [1] 0.06850657

## 4. Estimate TNIE conditional on L(0), and the marginal value of TNIE
# TNIE|L(0) = [g^-1(a,l(0)) - g^-1(a^*,l(0))] * (alpha_M + alpha_AM * a)
g.minus1.A1 <- plogis(coef(MSM2)["(Intercept)"] + coef(MSM2)["A0_PM2.5"] * 1 +
                        coef(MSM2)["L0_male"] * df1_int$L0_male +
                        coef(MSM2)["L0_soc_env"] * df1_int$L0_soc_env)

# TNIE conditional on L(0)
TNIE_L0 <- (g.minus1.A1 - g.minus1.A0) * (coef(MSM1)["M_diabetes"] +
                                            coef(MSM1)["A0_PM2.5:M_diabetes"] * 1)
# marginal PNDE
TNIE <- mean(TNIE_L0)
# [1] 0.01096799

In this example, the estimation of the PNDE is 6.9% and the estimation of the TNIE is 1.1%. Confidence intervals can be calculated by bootstrap.

8.4 MSM estimated by the CMAverse package

8.4.1 Calculation by hand, using the “mediation formula”

The counterfactual quantity \(\mathbb{E}(Y_{a,M_{a^\prime}})\) used to defined the Natural Direct and Indirect effects can be expressed by the mediation formula (conditional on \(\{L(0),L(1)\}\), or the population average): \[\begin{align*} \mathbb{E}\left( Y_{a,M_{a^\prime}} \mid L(0),L(1) \right) &= \sum_m \mathbb{E}\left( Y_{a,m} \mid L(0),L(1) \right) \times \mathbb{P}\left(M_{a^\prime} = m \mid L(0),L(1)\right) \\ \text{ or } & \\ \mathbb{E}\left( Y_{a,M_{a^\prime}} \right) &= \sum_m \mathbb{E}\left( Y_{a,m} \right) \times \mathbb{P}(M_{a^\prime} = m) \\ \end{align*}\]

In order to estimate the \(\text{PNDE} = \mathbb{E}(Y_{1,M_0}) - \mathbb{E}(Y_{0,M_0})\) and the \(\text{TNIE} = \mathbb{E}(Y_{1,M_1}) - \mathbb{E}(Y_{1,M_0})\) we can estimate 2 Marginal Structural Models:

  • a model of the counterfactual mediator (where \(g\) is a link function), for example: \[\begin{equation*} g \left[ \mathbb{P}(M_{a^\prime} = 1)\right] = \beta_0 + \beta_A a^\prime \end{equation*}\]
  • a model of the counterfactual outcome (where \(h\) is a link function), for example: \[\begin{equation*} h \left[ \mathbb{E}(Y_{a,m})\right] = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m) \end{equation*}\]

Below, we will estimate the coefficients of the 2 MSMs by Inverse Probability of Treatment Weighting (IPTW).

rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1,
             family = "binomial", data = df1_int)
summary(g.A.L)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -2.74236    0.07729 -35.484  < 2e-16 ***
#   L0_male      0.40610    0.06448   6.298 3.01e-10 ***
#   L0_soc_env   0.64079    0.07350   8.718  < 2e-16 ***
#   L1           0.03257    0.06968   0.467     0.64

# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df1_int$A0_PM2.5==0]

# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df1_int))
gAi[df1_int$A0_PM2.5==1] <- predict(g.A, type="response")[df1_int$A0_PM2.5==1]
gAi[df1_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df1_int$A0_PM2.5==0]

# 1e. Calculate sw_{A,i}
sw_Ai <- gAi / gAi.L

## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
             family = "binomial", data = df1_int)
summary(g.M.L)
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -1.37880    0.04872 -28.303  < 2e-16 ***
#   L0_male      0.25861    0.04437   5.829 5.57e-09 ***
#   L0_soc_env   0.33050    0.04744   6.967 3.23e-12 ***
#   A0_PM2.5     0.56260    0.06555   8.583  < 2e-16 ***
#   L1           0.33462    0.04744   7.054 1.74e-12 ***

# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df1_int))
gMi.L[df1_int$M_diabetes==1] <- predict(g.M.L, type="response")[df1_int$M_diabetes==1]
gMi.L[df1_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df1_int$M_diabetes==0]

# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df1_int)
summary(g.M.A)
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -0.93236    0.02358 -39.544   <2e-16 ***
#   A0_PM2.5     0.62388    0.06481   9.627   <2e-16 ***

# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df1_int))
gMi.A[df1_int$M_diabetes==1] <- predict(g.M.A, type="response")[df1_int$M_diabetes==1]
gMi.A[df1_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df1_int$M_diabetes==0]
# 2e. Calculate sw_{M,i}
sw_Mi <- gMi.A / gMi.L

## 3. Estimate marginal model for the counterfactual outcome Y_{am}
model.Yam.death <- glm(Y_death ~ A0_PM2.5 * M_diabetes,
                       weights = sw_Ai * sw_Mi,
                       family = "gaussian",
                       data = df1_int)
summary(model.Yam.death)
# Coefficients:
#                        Estimate Std. Error t value Pr(>|t|)
#   (Intercept)          0.178967   0.005049  35.445  < 2e-16 ***
#   A0_PM2.5             0.067151   0.016679   4.026 5.72e-05 ***
#   M_diabetes           0.067321   0.009508   7.080 1.54e-12 ***
#   A0_PM2.5:M_diabetes -0.004491   0.026027  -0.173    0.863

## 4) Estimate marginal model for the counterfactual mediator P(M_a* = 1)
model.Ma <- glm(M_diabetes ~ A0_PM2.5,
                weights = sw_Ai,
                family = "binomial",
                data = df1_int)
summary(model.Ma)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -0.92405    0.02353 -39.264   <2e-16 ***
#   A0_PM2.5     0.58413    0.06501   8.986   <2e-16 ***

## 5) Estimate population average counterfactuals using the "mediation formula"
##    E(Y_{a,M_a*}) = sum_m E(Y_{am}) * P(M_a^* = m)
E.Y0M0 <- ((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a0}) * P(M_a^* = 0)
             coef(model.Yam.death)["A0_PM2.5"] * 0 +
             coef(model.Yam.death)["M_diabetes"] * 0 +
             coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * 0) *
             (1 - plogis(coef(model.Ma)["(Intercept)"] +
                           coef(model.Ma)["A0_PM2.5"] * 0))) +
  ((coef(model.Yam.death)["(Intercept)"] +  # sum_m E(Y_{a1}) * P(M_a^* = 1)
      coef(model.Yam.death)["A0_PM2.5"] * 0 +
      coef(model.Yam.death)["M_diabetes"] * 1 +
      coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * 1) *
     (plogis(coef(model.Ma)["(Intercept)"] +
               coef(model.Ma)["A0_PM2.5"] * 0)))

E.Y1M0 <- ((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a,M=0}) * P(M_0 = 0)
              coef(model.Yam.death)["A0_PM2.5"] * 1 +
              coef(model.Yam.death)["M_diabetes"] * 0 +
              coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 0) *
             (1 - plogis(coef(model.Ma)["(Intercept)"] +
                           coef(model.Ma)["A0_PM2.5"] * 0))) +
  ((coef(model.Yam.death)["(Intercept)"] +  # sum_m E(Y_{a,M=1}) * P(M_0 = 1)
      coef(model.Yam.death)["A0_PM2.5"] * 1 +
      coef(model.Yam.death)["M_diabetes"] * 1 +
      coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 1) *
     (plogis(coef(model.Ma)["(Intercept)"] +
               coef(model.Ma)["A0_PM2.5"] * 0)))

E.Y1M1 <- ((coef(model.Yam.death)["(Intercept)"] + #  E(Y_{a,M=0}) * P(M_1 = 0)
              coef(model.Yam.death)["A0_PM2.5"] * 1 +
              coef(model.Yam.death)["M_diabetes"] * 0 +
              coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 0) *
             (1 - plogis(coef(model.Ma)["(Intercept)"] +
                           coef(model.Ma)["A0_PM2.5"] * 1))) +
  ((coef(model.Yam.death)["(Intercept)"] +  #  E(Y_{a,M=1}) * P(M_1 = 1)
      coef(model.Yam.death)["A0_PM2.5"] * 1 +
      coef(model.Yam.death)["M_diabetes"] * 1 +
      coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 1) *
     (plogis(coef(model.Ma)["(Intercept)"] +
               coef(model.Ma)["A0_PM2.5"] * 1)))

PNDE.death <- E.Y1M0 - E.Y0M0
# 0.06587481
TNIE.death <- E.Y1M1 - E.Y1M0
# 0.008274351

In the example above, we directly used the estimated coefficients of the MSMs to calculate the quantities \(\mathbb{E}(Y_{a,M_{a^\prime}})\) for the PNDE and the TNIE. When using the CMAverse, individual values for the counterfactual mediator are simulated as shown below. This strategy can be useful if the mediator is continuous of high-dimensional.

## 6) Estimate population average counterfactuals using simulated
##    values of the mediator
PNDE.death.sim <- rep(NA,5)
TNIE.death.sim <- rep(NA,5)

set.seed(1234)
for(k in 1:15){
  M0 <- rbinom(n = nrow(df1_int), size = 1,
               prob = plogis(coef(model.Ma)["(Intercept)"] +
                               coef(model.Ma)["A0_PM2.5"] * 0))
  M1 <- rbinom(n = nrow(df1_int), size = 1,
               prob = plogis(coef(model.Ma)["(Intercept)"] +
                               coef(model.Ma)["A0_PM2.5"] * 1))
  E.Y0M0 <- mean((coef(model.Yam.death)["(Intercept)"] +
                    coef(model.Yam.death)["A0_PM2.5"] * 0 +
                    coef(model.Yam.death)["M_diabetes"] * M0 +
                    coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * M0))
  E.Y1M0 <- mean((coef(model.Yam.death)["(Intercept)"] +
                    coef(model.Yam.death)["A0_PM2.5"] * 1 +
                    coef(model.Yam.death)["M_diabetes"] * M0 +
                    coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * M0))
  E.Y1M1 <- mean((coef(model.Yam.death)["(Intercept)"] +
                coef(model.Yam.death)["A0_PM2.5"] * 1 +
                coef(model.Yam.death)["M_diabetes"] * M1 +
                coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * M1))
  PNDE.death.sim[k] <- E.Y1M0  - E.Y0M0
  TNIE.death.sim[k] <- E.Y1M1 - E.Y1M0
}
mean(PNDE.death.sim)
# [1] 0.06587304
mean(TNIE.death.sim)
# [1] 0.00824998

8.4.2 Estimation of NDE and NIE using CMAverse (by IPTW or gcomputation)

We can use the CMAverse package to estimate the coefficients of the 2 MSMs required to estimate Natural Direct and Indirect Effects.

For estimation by IPTW, we have to specify the following arguments model = msm (to estimate the MSM of the outcome and the MSM of the mediator), and that the counterfactual mediator values are simulated (estimation = "imputation"). Because the coefficients of the MSMs are estimated by IPTW, we have to specify

  • the link functions for the MSM of the outcome (yreg) and the MSM of the mediator (mreg)
  • and the link function to calculate the weights ereg for \(\mathbb{P}(A=1|L(0))\), wmnomreg for the numerator of the mediator’s weight and wmdenomreg for the denominator of the mediator’s weight.
## Using the CMAverse to estimate MSM estimated by IPTW
set.seed(1234) # note: there is randomness, even with estimation by IPTW
               #       => counterfactuals are imputed (estimation = "imputation")
res_msm_df1 <- cmest(data = df1_int,
                     model = "msm",
                     outcome = "Y_death",
                     exposure = "A0_PM2.5",
                     mediator = "M_diabetes",
                     basec = c("L0_male", "L0_soc_env","L1"),
                     postc = NULL,
                     EMint = TRUE, # E*M interaction
                     ereg = "logistic", # exposure regression model g(A=1|L(0))
                     yreg = "linear", # to get risk difference
                     mreg = list("logistic"), # mediation model g(M=1|L1,A,L0)
                     wmnomreg = list("logistic"), #g(M=1|A) wgt nominator
                     wmdenomreg = list("logistic"), # g(M=1|L1,A,L(0)) wgt denom
                     astar = 0, #E(Y_{A=0,M=1})
                     a = 1,  #E(Y_{A=1,M=1})
                     mval = list(0), # mediator value at which the variable is controlled
                     estimation = "imputation",
                     inference = "bootstrap",
                     nboot = 2)
summary(res_msm_df1)
# Causal Mediation Analysis
#
# # Outcome regression:
# Call:
#   glm(formula = Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes,
#       family = gaussian(), data = getCall(x$reg.output$yreg)$data,
#       weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
#                        Estimate Std. Error t value Pr(>|t|)
#   (Intercept)          0.178967   0.005049  35.445  < 2e-16 ***
#   A0_PM2.5             0.067151   0.016679   4.026 5.72e-05 ***
#   M_diabetes           0.067321   0.009508   7.080 1.54e-12 ***
#   A0_PM2.5:M_diabetes -0.004491   0.026027  -0.173    0.863
#
# # Mediator regressions:
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), 
#       data = getCall(x$reg.output$mreg[[1L]])$data,
#       weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -0.92405    0.02353 -39.264   <2e-16 ***
#   A0_PM2.5     0.58413    0.06501   8.986   <2e-16 ***
#
# # Mediator regressions for weighting (denominator):
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env +
#         L1, family = binomial(), data = getCall(x$reg.output$wmdenomreg[[1L]])$data,
#       weights = getCall(x$reg.output$wmdenomreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -1.37880    0.04872 -28.303  < 2e-16 ***
#   A0_PM2.5     0.56260    0.06555   8.583  < 2e-16 ***
#   L0_male      0.25861    0.04437   5.829 5.57e-09 ***
#   L0_soc_env   0.33050    0.04744   6.967 3.23e-12 ***
#   L1           0.33462    0.04744   7.054 1.74e-12 ***
#
# # Mediator regressions for weighting (nominator):
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), 
#       data = getCall(x$reg.output$wmnomreg[[1L]])$data,
#       weights = getCall(x$reg.output$wmnomreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -0.93236    0.02358 -39.544   <2e-16 ***
#   A0_PM2.5     0.62388    0.06481   9.627   <2e-16 ***
#
# # Exposure regression for weighting:
# Call:
#   glm(formula = A0_PM2.5 ~ L0_male + L0_soc_env + L1, family = binomial(),
#       data = getCall(x$reg.output$ereg)$data, 
#       weights = getCall(x$reg.output$ereg)$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -2.74236    0.07729 -35.484  < 2e-16 ***
#   L0_male      0.40610    0.06448   6.298 3.01e-10 ***
#   L0_soc_env   0.64079    0.07350   8.718  < 2e-16 ***
#   L1           0.03257    0.06968   0.467     0.64
#
# # Effect decomposition on the mean difference scale via the marginal structural model
# Direct counterfactual imputation estimation with
# bootstrap standard errors, percentile confidence intervals and p-values
#
#                  Estimate  Std.error    95% CIL 95% CIU  P.val
#   cde           0.0671509  0.0157808  0.0761499   0.097 <2e-16 ***
#   pnde          0.0658727  0.0169713  0.0643983   0.087 <2e-16 *** <- PNDE
#   tnde          0.0652889  0.0172698  0.0594343   0.083 <2e-16 ***
#   pnie          0.0087514  0.0013389  0.0070189   0.009 <2e-16 ***
#   tnie          0.0081675  0.0010405  0.0024559   0.004 <2e-16 *** <- TNDE
#   te            0.0740402  0.0159309  0.0682520   0.090 <2e-16 ***
#   intref       -0.0012782  0.0011906 -0.0117517  -0.010 <2e-16 ***
#   intmed       -0.0005839  0.0002984 -0.0049640  -0.005 <2e-16 ***
#   cde(prop)     0.9069522  0.0222813  1.0860362   1.116 <2e-16 ***
#   intref(prop) -0.0172642  0.0439631 -0.1726807  -0.114 <2e-16 ***
#   intmed(prop) -0.0078856  0.0162851 -0.0729153  -0.051 <2e-16 ***
#   pnie(prop)    0.1181976  0.0379668  0.0786163   0.130 <2e-16 ***
#   pm            0.1103120  0.0216818  0.0275800   0.057 <2e-16 ***
#   int          -0.0251498  0.0602482 -0.2455960  -0.165 <2e-16 ***
#   pe            0.0930478  0.0222813 -0.1159712  -0.086 <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (cde: controlled direct effect;
#  pnde: pure natural direct effect;
#  tnde: total natural direct effect;
#  pnie: pure natural indirect effect;
#  tnie: total natural indirect effect;
#  te: total effect;
#  intref: reference interaction;
#  intmed: mediated interaction;
#  cde(prop): proportion cde;
#  intref(prop): proportion intref;
#  intmed(prop): proportion intmed;
#  pnie(prop): proportion pnie;
#  pm: overall proportion mediated;
#  int: overall proportion attributable to interaction;
#  pe: overall proportion eliminated)

The CMAverse can also estimate the coefficients of the 2 MSM by (parametric) g-computation. For this estimation, we have to specify the following arguments model = gformula, and that the counterfactual outcome and mediator values are simulated (estimation = "imputation"). We don’t need to specify the ereg, wmnomreg and wmdenomreg arguments which were used to calculate weights in the previous approach.

Note that the estimated models of the outcome and the mediator are conditional on baseline confounders \(L(0)\) and \(L(1)\).

## Using the CMAverse to estimate MSM estimated by g-comp
set.seed(1234) # note: there is randomness, even with estimation by IPTW
#       => counterfactuals are imputed (estimation = "imputation")
res_msm_df1.qol <- cmest(data = df1_int,
                     model = "gformula",
                     outcome = "Y_qol",
                     exposure = "A0_PM2.5",
                     mediator = "M_diabetes",
                     basec = c("L0_male", "L0_soc_env","L1"),
                     postc = NULL,
                     EMint = TRUE, # E*M interaction
                     # ereg = "logistic", # # not needed with gcomp
                     yreg = "linear", # to get risk difference
                     mreg = list("logistic"), # mediation model g(M=1|L1,A,L0)
                     # wmnomreg = list("logistic"), # not needed with gcomp
                     # wmdenomreg = list("logistic"), ## not needed with gcomp
                     astar = 0, #E(Y_{A=0,M=1})
                     a = 1,  #E(Y_{A=1,M=1})
                     mval = list(0), # mediator value at which the variable is controlled
                     estimation = "imputation",
                     inference = "bootstrap",
                     nboot = 2)
summary(res_msm_df1.qol)
# Causal Mediation Analysis
#
# # Outcome regression:
# Call:
#   glm(formula = Y_qol ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes +
#         L0_male + L0_soc_env + L1, family = gaussian(), data = getCall(x$reg.output$yreg)$data,
#       weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)
#   (Intercept)          74.7669     0.2139 349.557  < 2e-16 ***
#   A0_PM2.5             -3.7153     0.4160  -8.931  < 2e-16 ***
#   M_diabetes           -8.6317     0.2385 -36.197  < 2e-16 ***
#   L0_male              -0.7235     0.2018  -3.586 0.000337 ***
#   L0_soc_env           -2.8899     0.2112 -13.684  < 2e-16 ***
#   L1                   -3.4280     0.2212 -15.494  < 2e-16 ***
#   A0_PM2.5:M_diabetes  -5.6154     0.6514  -8.621  < 2e-16 ***
#
# # Mediator regressions:
# Call:
#   glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env +
#         L1, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data,
#       weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
#   (Intercept) -1.37880    0.04872 -28.303  < 2e-16 ***
#   A0_PM2.5     0.56260    0.06555   8.583  < 2e-16 ***
#   L0_male      0.25861    0.04437   5.829 5.57e-09 ***
#   L0_soc_env   0.33050    0.04744   6.967 3.23e-12 ***
#   L1           0.33462    0.04744   7.054 1.74e-12 ***
#
# # Effect decomposition on the mean difference scale via the g-formula approach
#
# Direct counterfactual imputation estimation with
# bootstrap standard errors, percentile confidence intervals and p-values
#
#                 Estimate Std.error   95% CIL 95% CIU  P.val
#   cde          -3.715265  0.114707 -3.084527  -2.930 <2e-16 ***
#   pnde         -5.103390  0.008919 -4.820137  -4.808 <2e-16 ***
#   tnde         -5.659875  0.006095 -5.579563  -5.571 <2e-16 ***
#   pnie         -0.855400  0.040149 -1.006846  -0.953 <2e-16 ***
#   tnie         -1.411886  0.025135 -1.758083  -1.724 <2e-16 ***
#   te           -6.515276  0.034054 -6.578220  -6.532 <2e-16 ***
#   intref       -1.388125  0.105788 -1.877736  -1.736 <2e-16 ***
#   intmed       -0.556485  0.015014 -0.771408  -0.751 <2e-16 ***
#   cde(prop)     0.570239  0.015115  0.448589   0.469 <2e-16 ***
#   intref(prop)  0.213057  0.017570  0.263846   0.287 <2e-16 ***
#   intmed(prop)  0.085412  0.002894  0.114201   0.118 <2e-16 ***
#   pnie(prop)    0.131292  0.005348  0.145871   0.153 <2e-16 ***
#   pm            0.216704  0.002454  0.263960   0.267 <2e-16 ***
#   int           0.298469  0.020463  0.378048   0.406 <2e-16 ***
#   pe            0.429761  0.015115  0.531104   0.551 <2e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (cde: controlled direct effect;
#  pnde: pure natural direct effect;
#  tnde: total natural direct effect;
#  pnie: pure natural indirect effect;
#  tnie: total natural indirect effect;
#  te: total effect; intref:
#  reference interaction;
#  intmed: mediated interaction;
#  cde(prop): proportion cde;
#  intref(prop): proportion intref;
#  intmed(prop): proportion intmed;
#  pnie(prop): proportion pnie;
#  pm: overall proportion mediated;
#  int: overall proportion attributable to interaction;
#  pe: overall proportion eliminated)

8.5 “Unified” approach for estimating Natural Direct and Indirect effects

(Lange, Vansteelandt, and Bekaert 2012) suggested to use a “unified” approach for estimating Natural Direct and Indirect effects (when they are identifiable), based on the following Marginal Structural Model: \[\begin{equation*} h\left[\mathbb{E}(Y_{a,M_{a^*}})\right] = \gamma_0 + \gamma_1 a + \gamma_2 a^* + \gamma_3 (a \times a^*) \text{ where } h \text{ is a link function} \end{equation*}\]

So that we can express the PNDE as: \[\begin{align*} \text{PNDE} &= \mathbb{E}(Y_{1,M_{0}}) - \mathbb{E}(Y_{0,M_{0}}) \\ \text{PNDE} &= h^{-1}\left(\gamma_0 + \gamma_1 \right) - h^{-1}\left(\gamma_0 \right) \\ \text{PNDE} &= \gamma_1 \text{ if the link function is the identity function} \end{align*}\]

and we can express the TNIE as: \[\begin{align*} \text{TNIE} &= \mathbb{E}(Y_{1,M_{1}}) - \mathbb{E}(Y_{1,M_{0}}) \\ \text{TNIE} &= h^{-1}\left(\gamma_0 + \gamma_1 + \gamma_2 + \gamma_3 \right) - h^{-1}\left(\gamma_0 + \gamma_1 \right) \\ \text{TNIE} &= \gamma_2 + \gamma_3 \text{ if the link function is the identity function} \end{align*}\]

Note that this model can also be extended to study effect modifications by baseline confounders.

8.5.1 Estimation of the MSM by IPTW

8.5.1.1 Manual calculation

The estimation of the “unified” MSM for Natural Direct and Indirect effects relies on the following steps:

  1. Estimate suitable models for the exposure conditional on baseline confounders (\(L(0)\) and \(L(1)\)), using the original data set, in order to compute weights for the exposure \(A\)
  2. Estimate a suitable model for the mediator conditional on baseline confounders (\(L(0)\) and \(L(1)\)), using the original data set, in order to compute weights for the mediator \(M\)
  3. Construct a new data set by repeating the original data set twice, and including an additional variable \(A^*\), equal to the original exposure in the 1st data set, and equal to \(1-A\) in the 2nd data set (for multicategorical exposures, see supplementary material of (Lange, Vansteelandt, and Bekaert 2012)).
  4. Compute weights by applying the fitted models from steps 1 and 2 \[\begin{align*} w_i &= \frac{1}{P(A=A_i\mid L(0)=L(0)_i,L(1)=L(1)_i)} \times \frac{P(M=M_i \mid A = A^*_i,L(0)=L(0)_i,L(1)=L(1)_i)}{P(M=M_i \mid A = A_i,L(0)=L(0)_i,L(1)=L(1)_i)} \\ \text{ or } sw_i &= \frac{P(A=A_i)}{P(A=A_i\mid L(0)=L(0)_i,L(1)=L(1)_i)} \times \frac{P(M=M_i \mid A = A^*_i,L(0)=L(0)_i,L(1)=L(1)_i)}{P(M=M_i \mid A = A_i,L(0)=L(0)_i,L(1)=L(1)_i)} \\ \end{align*}\]
  5. Estimate the unified MSM as a weighted model of the outcome including only \(A\) and \(A^*\) (and their interaction), weighted by simple weights or stabilized weights. Use the estimated coefficients of the MSM to calculate the PNDE and TNIE
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## Step 1. Estimate a suitable model for the exposure conditional on confounders,
##         using the original data set
# for A weight numerator
g.A <- glm(A0_PM2.5 ~ 1,
           family = "binomial", data = df1_int)

# for A weight denominator
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # L1 not necessary in our example
              family = "binomial", data = df1_int)

## Step 2. Estimate a suitable model for the mediator conditional on confounders,
##         using the original data set
df1_int$A0_PM2.5_temp <- df1_int$A0_PM2.5
g.M.AL0 <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5_temp + L1,
               family = "binomial", data = df1_int)

## Step 3. Construct a new data set by repeating the original data set twice,
##         and including an additional variable A*, equal to the original exposure
##         in the 1st data set, and equal to 1-A in the 2nd data set
##         (note for multicategorical exposure, see supplementary material of Lange et al)

# create identifier
df1_int$id <- 1:nrow(df1_int)

# dupplicate the original data set
df1_int.1 <- df1_int.2 <- df1_int

# Add an A* variable
df1_int.1$A0_PM2.5_star <- df1_int$A0_PM2.5 # A* = A in the 1st data set
df1_int.2$A0_PM2.5_star <- 1 - df1_int$A0_PM2.5 # A* = 1 - A in the 2d data set

# stack the 2 new datasets
df1_int.new <- rbind(df1_int.1,df1_int.2)
# sort the data set by the id-variable to use geeglm
# (necessary to use geepack and the geeglm() later)
df1_int.new <- df1_int.new[order(df1_int.new$id), ]

head(df1_int.new, 12)
#       L0_male L0_soc_env A0_PM2.5 L1 M_diabetes Y_death Y_qol A0_PM2.5_temp id A0_PM2.5_star
# 1           0          1        0  1          0       0  93.4             0  1             0
# 10001       0          1        0  1          0       0  93.4             0  1             1
# 2           1          1        1  1          0       1  64.0             1  2             1
# 10002       1          1        1  1          0       1  64.0             1  2             0
# 3           1          1        0  0          0       0  75.6             0  3             0
# 10003       1          1        0  0          0       0  75.6             0  3             1
# 4           1          0        0  0          0       0  89.8             0  4             0
# 10004       1          0        0  0          0       0  89.8             0  4             1
# 5           1          1        0  0          0       0  77.2             0  5             0
# 10005       1          1        0  0          0       0  77.2             0  5             1
# 6           1          1        0  0          1       0  73.9             0  6             0
# 10006       1          1        0  0          1       0  73.9             0  6             1


## Step 4. Compute weights by applying the fitted models from steps 1 and 2
# weights for A
g.Ais1 <- predict(g.A, newdata = df1_int.new, type = "response")
sw_A.num <- rep(NA, nrow(df1_int.new))
sw_A.num[df1_int.new$A0_PM2.5==1] <- g.Ais1[df1_int.new$A0_PM2.5==1]
sw_A.num[df1_int.new$A0_PM2.5==0] <- (1 - g.Ais1[df1_int.new$A0_PM2.5==0])

g.Ais1.L0 <- predict(g.A.L0, newdata = df1_int.new, type = "response")
sw_A.denom <- rep(NA, nrow(df1_int.new))
sw_A.denom[df1_int.new$A0_PM2.5==1] <- g.Ais1.L0[df1_int.new$A0_PM2.5==1]
sw_A.denom[df1_int.new$A0_PM2.5==0] <- (1 - g.Ais1.L0[df1_int.new$A0_PM2.5==0])

w_A <- 1 / sw_A.denom
sw_A <- sw_A.num / sw_A.denom

# weights for M|A*
df1_int.new$A0_PM2.5_temp <- df1_int.new$A0_PM2.5_star
g.Mis1.Astar <- predict(g.M.AL0, newdata = df1_int.new, type = "response")
sw_M.Astar <- rep(NA, nrow(df1_int.new))
sw_M.Astar[df1_int.new$M_diabetes==1] <- g.Mis1.Astar[df1_int.new$M_diabetes==1]
sw_M.Astar[df1_int.new$M_diabetes==0] <- 1 - g.Mis1.Astar[df1_int.new$M_diabetes==0]

# weight for M|A
df1_int.new$A0_PM2.5_temp <- df1_int.new$A0_PM2.5
g.Mis1.A <- predict(g.M.AL0, newdata = df1_int.new, type = "response")
sw_M.A <- rep(NA, nrow(df1_int.new))
sw_M.A[df1_int.new$M_diabetes==1] <- g.Mis1.A[df1_int.new$M_diabetes==1]
sw_M.A[df1_int.new$M_diabetes==0] <- 1 - g.Mis1.A[df1_int.new$M_diabetes==0]

# non-stabilized weight
w <- w_A * (sw_M.Astar / sw_M.A)
# stabilized weight
sw <- sw_A * (sw_M.Astar / sw_M.A)

boxplot(data.frame(w, sw))

## Step 5. Fit a suitable model to the outcome including only A and A* (and their
##         interaction) using a weighted regression

# we will use a GEE model in order to get "robust" standard errors
library(geepack)
# note: be careful that individuals in the "df1_int.new" data set
#       and in the "w" vector have the same order

## for a risk difference in death
MSM.model.death <- geeglm(Y_death ~ A0_PM2.5 * A0_PM2.5_star,
                          weights = sw,
                          family = "gaussian",
                          id = df1_int.new$id,
                          data = df1_int.new,
                          scale.fix = TRUE)
summary(MSM.model.death)
# Coefficients:
#                           Estimate   Std.err    Wald Pr(>|W|)
#   (Intercept)             0.198840  0.004251 2187.61  < 2e-16 ***
#   A0_PM2.5                0.065950  0.014539   20.58  5.7e-06 ***
#   A0_PM2.5_star           0.008534  0.001242   47.25  6.2e-12 ***
#   A0_PM2.5:A0_PM2.5_star -0.000406  0.003762    0.01     0.91

## Calculate PNDE and PNIE using the MSM's coefficients
unif.PNDE.death <- coef(MSM.model.death)["A0_PM2.5"]
# 0.066
unif.TNIE.death <- (coef(MSM.model.death)["A0_PM2.5_star"] +
                      coef(MSM.model.death)["A0_PM2.5:A0_PM2.5_star"])
# 0.00813

## For Quality of life
MSM.model.qol <- geeglm(Y_qol ~ A0_PM2.5 * A0_PM2.5_star,
                        weights = sw,
                        family = "gaussian",
                        id = df1_int.new$id,
                        data = df1_int.new,
                        scale.fix = TRUE)
summary(MSM.model.qol)
# Coefficients:
#                          Estimate Std.err     Wald Pr(>|W|)
#   (Intercept)             69.0853  0.1174 346240.4  < 2e-16 ***
#   A0_PM2.5                -5.3771  0.4010    179.8  < 2e-16 ***
#   A0_PM2.5_star           -1.0769  0.0311   1197.2  < 2e-16 ***
#   A0_PM2.5:A0_PM2.5_star  -0.6947  0.0909     58.4  2.2e-14 ***

## Calculate PNDE and PNIE using the MSM's coefficients
unif.PNDE.qol <- coef(MSM.model.qol)["A0_PM2.5"]
# -5.38
unif.TNIE.qol <- (coef(MSM.model.qol)["A0_PM2.5_star"] +
                    coef(MSM.model.qol)["A0_PM2.5:A0_PM2.5_star"])
# -1.77

# 95% confidence intervals can also be computed by bootstrap

8.5.1.2 Using the medflex package

We can used the medflex package to apply this approach, as described below.

The “unified” MSM can be defined as a model of the expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1))\) conditional on the baseline confounders \(L(0)\) and \(L(1)\), or as a model the marginal expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}}\).

For estimation by IPTW, the data is expanded using the neWeight function, and the MSM is estimated using the neModel function.Population average estimations are obtained using a weighted regression, where the weights are calculated using a model of the exposure, stated at the xFit argument inside the neModel function.

Standard errors can be computed by bootstrap or as robust “Sandwich” standard errors.

library(medflex)
# this package can used to estimate Natural direct and indirect effects when they are
# identifiable

### estimate PNDE and TNDE, conditional on L0 and L1 
# ---------------------------------------------------------------------------- #
# get back to the original data set
df1_int <- subset(df1_int, select = -c(A0_PM2.5_temp, id))

## 1) Expand the dataset and calculate weights using the neWeight() function
g.M <- glm(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
             L0_male + L0_soc_env + L1, # then add baseline confounders
           family = "binomial", data = df1_int)

# expend the data set and add a second column A* = 1 - A
# and calculate the weights = sw_M.Astar / sw_M.A
exp.Data <- neWeight(g.M)

# it is also possible to use the neWeight function directly:
# exp.Data <- neWeight(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
#                        L0_male + L0_soc_env + L1, # then add baseline confounders
#                      family = "binomial", data = df1_int)

class(exp.Data)
# [1] "data.frame" "expData"    "weightData"
head(exp.Data)
#   id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1  1         0         0       0          1  1          0       0  93.4
# 2  1         0         1       0          1  1          0       0  93.4
# 3  2         1         1       1          1  1          0       1  64.0
# 4  2         1         0       1          1  1          0       1  64.0
# 5  3         0         0       1          1  0          0       0  75.6
# 6  3         0         1       1          1  0          0       0  75.6

# the new variables A0_PM2.50 and A0_PM2.51 correspond to A and A*

# check the weights:
w <- weights(exp.Data)
boxplot(w)

# Note that we estimated the same weights by hand previously:
head(data.frame(w, sw_M.Astar / sw_M.A), 6)
#       w sw_M.Astar.sw_M.A
# 1 1.000             1.000
# 2 0.801             0.801
# 3 1.000             1.000
# 4 1.293             1.293
# 5 1.000             1.000
# 6 0.809             0.809
plot(w, sw_M.Astar / sw_M.A)
abline(0,1)

## 2) Fit the natural effect model using the neModel() function
##    (adjusted for baseline confounders)

## for death
set.seed(1234)
neMod.death <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
                       family = "gaussian", # to estimate risk difference
                       expData = exp.Data,
                       se = c("bootstrap"), # or "robust"
                       nBoot = 10, # use >= 1000 samples, if se = bootstrap
                       )
summary(neMod.death)
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)            0.10982    0.00807   13.61  < 2e-16 ***
#   A0_PM2.501             0.06576    0.01642    4.01  6.2e-05 ***
#   A0_PM2.511             0.00836    0.00214    3.91  9.1e-05 ***
#   L0_male                0.05041    0.00741    6.80  1.0e-11 ***
#   L0_soc_env             0.06113    0.00609   10.03  < 2e-16 ***
#   L1                     0.08341    0.01281    6.51  7.5e-11 ***
#   A0_PM2.501:A0_PM2.511  0.00238    0.00446    0.53     0.59

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.death.L0 <- coef(neMod.death)["A0_PM2.501"]
# 0.0658
medflex.TNIE.death.L0 <- (coef(neMod.death)["A0_PM2.511"] +
                            coef(neMod.death)["A0_PM2.501:A0_PM2.511"])
# 0.0107

# 95% CI of the TNIE (which combines 2 coefficients) ?
# cov(b1,b2) = var(b1) + var(b2) + 2 * cov(b1,b2)
c(medflex.TNIE.death.L0 -
    qnorm(0.975) * sqrt(var(neMod.death$bootRes$t[,3]) + # column of A0_PM2.511
                          var(neMod.death$bootRes$t[,7]) + # column of A0_PM2.501:A0_PM2.511
                          2 * var(neMod.death$bootRes$t[,c(3,7)])[1,2]),
  medflex.TNIE.death.L0 +
    qnorm(0.975) * sqrt(var(neMod.death$bootRes$t[,3]) +
                          var(neMod.death$bootRes$t[,7]) +
                          2 * var(neMod.death$bootRes$t[,c(3,7)])[1,2]))
# A0_PM2.511 A0_PM2.511
#     0.0027     0.0188

## for quality of life
neMod.qol <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
                     family = "gaussian", # to estimate risk difference
                     expData = exp.Data,
                     se = c("robust") # or "bootstrap"
                     # nBoot = 1000, # if se = bootstrap
                     )
summary(neMod.qol)
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)             73.187      0.226  324.09  < 2e-16 ***
#   A0_PM2.501              -5.336      0.338  -15.77  < 2e-16 ***
#   A0_PM2.511              -1.069      0.136   -7.87  3.4e-15 ***
#   L0_male                 -1.231      0.223   -5.53  3.3e-08 ***
#   L0_soc_env              -3.546      0.230  -15.41  < 2e-16 ***
#   L1                      -4.101      0.243  -16.87  < 2e-16 ***
#   A0_PM2.501:A0_PM2.511   -0.765      0.126   -6.09  1.1e-09 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.qol.L0 <- coef(neMod.qol)["A0_PM2.501"]
# -5.34
medflex.TNIE.qol.L0 <- (coef(neMod.qol)["A0_PM2.511"] +
                            coef(neMod.qol)["A0_PM2.501:A0_PM2.511"])
# -1.83

# 95% CI of the TNIE (which combines 2 coefficients) ?
# cov(b1,b2) = var(b1) + var(b2) + 2cov(b1,b2)
c(medflex.TNIE.qol.L0 -
    qnorm(0.975) * sqrt(neMod.qol$vcov["A0_PM2.511","A0_PM2.511"] +
                          neMod.qol$vcov["A0_PM2.501:A0_PM2.511","A0_PM2.501:A0_PM2.511"] +
                          2 * neMod.qol$vcov["A0_PM2.511","A0_PM2.501:A0_PM2.511"]),
  medflex.TNIE.qol.L0 +
    qnorm(0.975) * sqrt(neMod.qol$vcov["A0_PM2.511","A0_PM2.511"] +
                          neMod.qol$vcov["A0_PM2.501:A0_PM2.511","A0_PM2.501:A0_PM2.511"] +
                          2 * neMod.qol$vcov["A0_PM2.511","A0_PM2.501:A0_PM2.511"]))
# A0_PM2.511 A0_PM2.511
#     -2.30      -1.37

# plot results
plot(neMod.qol)

### estimate marginal PNDE and TNDE (population average) 
# ---------------------------------------------------------------------------- #

## 1) Expand the dataset and calculate weights using the neWeight() function

## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
              family = "binomial", data = df1_int)

## Estimate a model of the mediator
g.M <- glm(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
             L0_male + L0_soc_env + L1, # then add baseline confounders
           family = "binomial", data = df1_int)

# expend the data set and add a second column A* = 1 - A
# and calculate the weights = sw_M.Astar / sw_M.A
exp.Data <- neWeight(g.M)

## 2) Fit the natural effect model using the neModel() function
## for death
neMod.death.pop <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
                       family = "gaussian", # to estimate risk difference
                       expData = exp.Data,
                       xFit = g.A.L0, # model of the exposure
                       se = c("robust")) # or "bootstrap"
summary(neMod.death.pop)
# Parameter estimates:
#   Estimate Std. Error z value Pr(>|z|)
#   (Intercept)            0.198840   0.004247   46.82  < 2e-16 ***
#   A0_PM2.501             0.065950   0.014346    4.60  4.3e-06 ***
#   A0_PM2.511             0.008534   0.001619    5.27  1.4e-07 ***
#   A0_PM2.501:A0_PM2.511 -0.000406   0.003744   -0.11     0.91

## estimate the population average PNDE and TNDE,
medflex.PNDE.death <- coef(neMod.death.pop)["A0_PM2.501"]
# 0.066
medflex.TNIE.death <- (coef(neMod.death.pop)["A0_PM2.511"] +
                          coef(neMod.death.pop)["A0_PM2.501:A0_PM2.511"])
# 0.00813
# the results are the same than results calculated by hand

## for quality of life
neMod.qol.pop <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
                         family = "gaussian", # to estimate risk difference
                         expData = exp.Data,
                         xFit = g.A.L0, # model of the exposure
                         se = c("robust")) # or "bootstrap"
summary(neMod.qol.pop)
# Parameter estimates:
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)             69.085      0.117  590.36  < 2e-16 ***
#   A0_PM2.501              -5.377      0.350  -15.38  < 2e-16 ***
#   A0_PM2.511              -1.077      0.137   -7.88  3.3e-15 ***
#   A0_PM2.501:A0_PM2.511   -0.695      0.120   -5.80  6.7e-09 ***

## estimate the population average PNDE and TNDE,
medflex.PNDE.qol <- coef(neMod.qol.pop)["A0_PM2.501"]
# -5.38
medflex.TNIE.qol <- (coef(neMod.qol.pop)["A0_PM2.511"] +
                       coef(neMod.qol.pop)["A0_PM2.501:A0_PM2.511"])
# -1.77
# the results are the same than results calculated by hand

8.5.2 Estimation of the MSM by g-computation

8.5.2.1 Manual calculation

The “unified” MSM can also be estimated by g-computation. Note that for the estimation of the population average effects, calculation relies on a weighted regression, as with the IPTW approach.

We can apply the following steps:

  1. Estimate a suitable model for the outcome conditional on the exposure, mediator and confounders, using the original data set
  2. Construct a new data set by repeating the original data set twice, and including an additional variable \(A^*\), equal to the original exposure in both data sets, and change the value of \(A\) from \(A\) to \(1-A\) in the 2nd data set. Then impute counterfactuals \(\mathbb{E}(Y_{a,M_a*}|L(0),L(1))\)
  3. Estimate the unified MSM as a model of the outcome including \(A\) and \(A^*\) (and their interaction), adjusted for baseline confounders \(L(0)\) and \(L(1)\). In order to obtain population average model (unconditional on baseline confounders), coefficients can be estimated using a weighted regression (with simple weights or stabilized weights of the exposure). Use the estimated coefficients of the MSM to calculate the PNDE and TNIE.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")

## Step 1. Estimate a suitable model for the outcome conditional on the exposure,
##         mediator and confounders, using the original data set
Q.Y.death <- glm(Y_death ~ L0_male + L0_soc_env + L1  + A0_PM2.5 * M_diabetes,
                 family = "binomial",
                 data = df1_int)

Q.Y.qol <- glm(Y_qol ~ L0_male + L0_soc_env + L1 + A0_PM2.5 * M_diabetes,
               family = "gaussian",
               data = df1_int)

## Step 2. Expand data and impute counterfactuals E(Y_{a,M_a*}|L(0),L(1))
## Expand data
# create identifier
df1_int$id <- 1:nrow(df1_int)

# dupplicate the original data set
df1_int.1 <- df1_int.2 <- df1_int

# Add an A* variable where A* = A
df1_int.1$A0_PM2.5_star <- df1_int$A0_PM2.5
df1_int.2$A0_PM2.5_star <- df1_int$A0_PM2.5

# for the 2nd data set, change the exposure A = 1 - A
df1_int.2$A0_PM2.5 <- 1 - df1_int$A0_PM2.5

# stack the 2 new datasets
df1_int.new <- rbind(df1_int.1,df1_int.2)
# sort the data set by the id-variable to use geeglm
# (necessary to use geepack and the geeglm() later)
df1_int.new <- df1_int.new[order(df1_int.new$id), ]

head(df1_int.new, 12)
#       L0_male L0_soc_env A0_PM2.5 L1 M_diabetes Y_death Y_qol A0_PM2.5_temp id A0_PM2.5_star
# 1           0          1        0  1          0       0  93.4             0  1             0
# 10001       0          1        1  1          0       0  93.4             0  1             0
# 2           1          1        1  1          0       1  64.0             1  2             1
# 10002       1          1        0  1          0       1  64.0             1  2             1
# 3           1          1        0  0          0       0  75.6             0  3             0
# 10003       1          1        1  0          0       0  75.6             0  3             0
# 4           1          0        0  0          0       0  89.8             0  4             0
# 10004       1          0        1  0          0       0  89.8             0  4             0
# 5           1          1        0  0          0       0  77.2             0  5             0
# 10005       1          1        1  0          0       0  77.2             0  5             0
# 6           1          1        0  0          1       0  73.9             0  6             0
# 10006       1          1        1  0          1       0  73.9             0  6             0

## Impute counterfactuals
# death - we predict the probability P(Y=1|A,M,L(0),L(1))
Y.pred.death <- predict(Q.Y.death, newdata = df1_int.new, type = "response")

# qol - we can predict the mean E(Y|A,M,L(0),L(1))
Y.pred.qol <- predict(Q.Y.qol, newdata = df1_int.new, type = "response")

## Step 3. Fit a suitable model to the outcome incluing only A and A* (and their
##         interaction)
library(geepack)
# for death (conditional)
MSM.model.death.1 <- geeglm(Y.pred.death ~ A0_PM2.5 * A0_PM2.5_star +
                              L0_male + L0_soc_env + L1,
                            family = "gaussian",
                            id = df1_int.new$id,
                            data = df1_int.new,
                            scale.fix = TRUE)
summary(MSM.model.death.1)
#                          Estimate  Std.err   Wald Pr(>|W|)
#   (Intercept)            0.104153 0.000611  29104  < 2e-16 ***
#   A0_PM2.5               0.063543 0.000138 210638  < 2e-16 ***
#   A0_PM2.5_star          0.007032 0.001112     40  2.5e-10 ***
#   L0_male                0.054291 0.000706   5922  < 2e-16 ***
#   L0_soc_env             0.065692 0.000692   9004  < 2e-16 ***
#   L1                     0.086446 0.000860  10100  < 2e-16 ***
#   A0_PM2.5:A0_PM2.5_star 0.004918 0.000397    154  < 2e-16 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.death.L0 <- coef(MSM.model.death.1)["A0_PM2.5"]
# 0.0635
TNIE.death.L0 <- (coef(MSM.model.death.1)["A0_PM2.5_star"] +
                            coef(MSM.model.death.1)["A0_PM2.5:A0_PM2.5_star"])
# 0.012

# For quality of life (conditional)
MSM.model.qol.1 <- geeglm(Y.pred.qol ~ A0_PM2.5 * A0_PM2.5_star +
                              L0_male + L0_soc_env + L1,
                            family = "gaussian",
                            id = df1_int.new$id,
                            data = df1_int.new,
                            scale.fix = TRUE)
summary(MSM.model.qol.1)
#                          Estimate Std.err     Wald Pr(>|W|)
#   (Intercept)             73.3371  0.0949 597466.3  < 2e-16 ***
#   A0_PM2.5                -5.3013  0.0268  39033.2  < 2e-16 ***
#   A0_PM2.5_star           -1.0535  0.1353     60.6  6.9e-15 ***
#   L0_male                 -1.3289  0.1038    164.0  < 2e-16 ***
#   L0_soc_env              -3.6467  0.1061   1180.6  < 2e-16 ***
#   L1                      -4.2325  0.1168   1312.8  < 2e-16 ***
#   A0_PM2.5:A0_PM2.5_star  -0.7920  0.0870     82.9  < 2e-16 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.qol.L0 <- coef(MSM.model.qol.1)["A0_PM2.5"]
# -5.3
TNIE.qol.L0 <- (coef(MSM.model.qol.1)["A0_PM2.5_star"] +
                  coef(MSM.model.qol.1)["A0_PM2.5:A0_PM2.5_star"])
# -1.85

## For population average estimations
## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
              family = "binomial", data = df1_int)
g.Ais1.L0 <- predict(g.A.L0, newdata = df1_int.new, type = "response")

w <- rep(NA, nrow(df1_int.new))
w[df1_int.new$A0_PM2.5_star == 1] <- 1 / g.Ais1.L0[df1_int.new$A0_PM2.5_star == 1]
w[df1_int.new$A0_PM2.5_star == 0] <- 1 / (1 - g.Ais1.L0[df1_int.new$A0_PM2.5_star == 0])

# for death (population average), we use the weighted regression
MSM.model.death.2 <- geeglm(Y.pred.death ~ A0_PM2.5 * A0_PM2.5_star,
                            family = "gaussian",
                            weights = w,
                            id = df1_int.new$id,
                            data = df1_int.new,
                            scale.fix = TRUE)
summary(MSM.model.death.2)
#                          Estimate  Std.err    Wald Pr(>|W|)
#   (Intercept)            0.198882 0.000644  95250.3  < 2e-16 ***
#   A0_PM2.5               0.063867 0.000138 213552.3  < 2e-16 ***
#   A0_PM2.5_star          0.009213 0.002009     21.0  4.5e-06 ***
#   A0_PM2.5:A0_PM2.5_star 0.002330 0.000445     27.4  1.7e-07 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.death <- coef(MSM.model.death.2)["A0_PM2.5"]
# 0.0639
TNIE.death <- (coef(MSM.model.death.2)["A0_PM2.5_star"] +
                 coef(MSM.model.death.2)["A0_PM2.5:A0_PM2.5_star"])
# 0.0115

# For quality of life (population average), we use the weighted regression
MSM.model.qol.2 <- geeglm(Y.pred.qol ~ A0_PM2.5 * A0_PM2.5_star,
                          family = "gaussian",
                          weights = w,
                          id = df1_int.new$id,
                          data = df1_int.new,
                          scale.fix = TRUE)
summary(MSM.model.qol.2)
#                          Estimate Std.err     Wald Pr(>|W|)
#   (Intercept)             69.0849  0.0492 1.97e+06  < 2e-16 ***
#   A0_PM2.5                -5.3108  0.0269 3.88e+04  < 2e-16 ***
#   A0_PM2.5_star           -1.1690  0.1624 5.18e+01  6.1e-13 ***
#   A0_PM2.5:A0_PM2.5_star  -0.7395  0.0906 6.66e+01  3.3e-16 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.qol.L0 <- coef(MSM.model.qol.2)["A0_PM2.5"]
# -5.31
TNIE.qol.L0 <- (coef(MSM.model.qol.2)["A0_PM2.5_star"] +
                  coef(MSM.model.qol.2)["A0_PM2.5:A0_PM2.5_star"])
# -1.91

8.5.2.2 Using the medflex package

We can used the medflex package to apply this approach, as described below.

The “unified” MSM can be defined as a model of the expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1))\) conditional on the baseline confounders \(L(0)\) and \(L(1)\), or as a model the marginal expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}}\).

For estimation by g-computation, the data is expanded using the neImpute function, and the MSM is estimated using the neModel function. Population average estimations are obtained using a weighted regression, where the weights are calcuated using a model of the exposure, stated at the xFit argument inside the neModel function.

Standard errors can be computed by bootstrap or as robust “Sandwich” standard errors.

library(medflex)

### estimate PNDE and TNDE, conditional on L0 and L1
# ---------------------------------------------------------------------------- #
# get back to the original data set
df1_int <- subset(df1_int, select = -c(id))
df1_int$A0_PM2.5 <- factor(df1_int$A0_PM2.5)

## 1) Fit a model for the outcome
Q.Y.death <- glm(Y_death ~ A0_PM2.5 + # start with the exposure
                   M_diabetes + A0_PM2.5:M_diabetes + # then the mediator
                   L0_male + L0_soc_env + L1, # then baseline confounders
                 family = "binomial",
                 data = df1_int)

Q.Y.qol <- glm(Y_qol ~ A0_PM2.5 + # start with the exposure
                 M_diabetes + A0_PM2.5:M_diabetes + # then the mediator
                 L0_male + L0_soc_env + L1, # then baseline confounders
               family = "gaussian",
               data = df1_int)


## 2) expend the data set and add a second column A* = 1 - A
exp.Data.death <- neImpute(Q.Y.death)
head(exp.Data.death)
#   id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1  1         0         0       0          1  1          0   0.222  93.4
# 2  1         1         0       0          1  1          0   0.292  93.4
# 3  2         1         1       1          1  1          0   0.356  64.0
# 4  2         0         1       1          1  1          0   0.277  64.0
# 5  3         0         0       1          1  0          0   0.197  75.6
# 6  3         1         0       1          1  0          0   0.261  75.6

exp.Data.qol <- neImpute(Q.Y.qol)
head(exp.Data.qol)
#   id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1  1         0         0       0          1  1          0       0  68.4
# 2  1         1         0       0          1  1          0       0  64.7
# 3  2         1         1       1          1  1          0       1  64.0
# 4  2         0         1       1          1  1          0       1  67.7
# 5  3         0         0       1          1  0          0       0  71.2
# 6  3         1         0       1          1  0          0       0  67.4

# the predicted outcomes by medflex are the same than our previous predictions
plot(exp.Data.death$Y_death, Y.pred.death)
abline(0,1)

plot(exp.Data.qol$Y_qol, Y.pred.qol)
abline(0,1)

## 3) Fit the natural effect model using the neModel() function
##    (adjusted for baseline confounders)

## for death
neMod.death <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
                       family = "gaussian", # to estimate risk difference
                       expData = exp.Data.death,
                       se = c("robust") # or "bootstrap"
                       )
summary(neMod.death)
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)            0.10415    0.00829   12.56  < 2e-16 ***
#   A0_PM2.501             0.06354    0.01373    4.63  3.7e-06 ***
#   A0_PM2.511             0.00703    0.00175    4.01  6.1e-05 ***
#   L0_male                0.05429    0.00867    6.26  3.8e-10 ***
#   L0_soc_env             0.06569    0.00882    7.45  9.3e-14 ***
#   L1                     0.08645    0.00999    8.66  < 2e-16 ***
#   A0_PM2.501:A0_PM2.511  0.00492    0.00391    1.26     0.21

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.death.L0 <- coef(neMod.death)["A0_PM2.501"]
# 0.0635
medflex.TNIE.death.L0 <- (coef(neMod.death)["A0_PM2.511"] +
                            coef(neMod.death)["A0_PM2.501:A0_PM2.511"])
# 0.012

## for quality of life
neMod.qol <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
                     family = "gaussian", # to estimate risk difference
                     expData = exp.Data.qol,
                     se = c("robust")) # or "bootstrap"
summary(neMod.qol)
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)             73.337      0.230  319.17  < 2e-16 ***
#   A0_PM2.501              -5.301      0.340  -15.61  < 2e-16 ***
#   A0_PM2.511              -1.054      0.139   -7.58  3.6e-14 ***
#   L0_male                 -1.329      0.227   -5.85  5.0e-09 ***
#   L0_soc_env              -3.647      0.235  -15.54  < 2e-16 ***
#   L1                      -4.233      0.249  -17.01  < 2e-16 ***
#   A0_PM2.501:A0_PM2.511   -0.792      0.127   -6.21  5.1e-10 ***

## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.qol.L0 <- coef(neMod.qol)["A0_PM2.501"]
# -5.3
medflex.TNIE.qol.L0 <- (coef(neMod.qol)["A0_PM2.511"] +
                          coef(neMod.qol)["A0_PM2.501:A0_PM2.511"])
# -1.85


### estimate marginal PNDE and TNDE (population average) 
# ---------------------------------------------------------------------------- #
## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
              family = "binomial", data = df1_int)

## 3) Fit the natural effect model using the neModel() function
## for death
neMod.death.pop <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
                           family = "gaussian", # to estimate risk difference
                           expData = exp.Data.death,
                           xFit = g.A.L0, # model of the exposure
                           se = c("robust")) # or "bootstrap"
summary(neMod.death.pop)
# Parameter estimates:
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)            0.19888    0.00425   46.83  < 2e-16 ***
#   A0_PM2.501             0.06387    0.01378    4.64  3.6e-06 ***
#   A0_PM2.511             0.00921    0.00169    5.45  5.1e-08 ***
#   A0_PM2.501:A0_PM2.511  0.00233    0.00367    0.64     0.53

## estimate the population average PNDE and TNDE,
medflex.PNDE.death <- coef(neMod.death.pop)["A0_PM2.501"]
# 0.0639
medflex.TNIE.death <- (coef(neMod.death.pop)["A0_PM2.511"] +
                         coef(neMod.death.pop)["A0_PM2.501:A0_PM2.511"])
# 0.0115
# the results are the same than results calculated by hand

## for quality of life
neMod.qol.pop <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
                         family = "gaussian", # to estimate risk difference
                         expData = exp.Data.qol,
                         xFit = g.A.L0, # model of the exposure
                         se = c("robust")) # or "bootstrap"
summary(neMod.qol.pop)
# Parameter estimates:
#                         Estimate Std. Error z value Pr(>|z|)
#   (Intercept)             69.085      0.117  590.50  < 2e-16 ***
#   A0_PM2.501              -5.311      0.339  -15.65  < 2e-16 ***
#   A0_PM2.511              -1.169      0.145   -8.09  6.1e-16 ***
#   A0_PM2.501:A0_PM2.511   -0.740      0.125   -5.92  3.2e-09 ***

## estimate the population average PNDE and TNDE,
medflex.PNDE.qol <- coef(neMod.qol.pop)["A0_PM2.501"]
# -5.31
medflex.TNIE.qol <- (coef(neMod.qol.pop)["A0_PM2.511"] +
                       coef(neMod.qol.pop)["A0_PM2.501:A0_PM2.511"])
# -1.91
# the results are the same than results calculated by hand

References

Lange, Theis, Stijn Vansteelandt, and Maarten Bekaert. 2012. “A Simple Unified Approach for Estimating Natural Direct and Indirect Effects.” American Journal of Epidemiology 176 (3): 190–95.
VanderWeele, Tyler J. 2009. “Marginal Structural Models for the Estimation of Direct and Indirect Effects.” Epidemiology 20: 18–26.