Chapter 8 Marginal structural models
Marginal structural models (MSM) are parametric models that are used to summarize the relationship between the counterfactual outcome (\(Y_a\) or \(Y_{am}\) for example) and the exposure(s) \(A\) and mediators \(M\). It also possible to summarize the relationship according to a subset of the baseline confounders if it is relevant for the scientific question.
To illustrate the application of MSMs, we will first use the data simulated from the Causal model 1 (with an \(A \ast M\) interaction effect on the outcome) where the exposure \(A\) doesn’t affect the confounder \(L(1)\) between the mediator and the outcome.
8.1 MSM for the Average Total Effect (ATE)
8.1.1 Expressing the ATE using coefficients of an MSM
For a continuous or binary outcome, we can use the following MSM to summarize the relationship between the counterfactual outcome (\(Y_a\)) and the exposure(s) \(A\): \[\begin{equation} \mathbb{E}(Y_a) = \alpha_0 + \alpha_A a \tag{8.1} \end{equation}\]
The Average Total Effect \(\text{ATE} = \mathbb{E}(Y_{A=1}) - \mathbb{E}(Y_{A=0})\) can then be expressed using the coefficients of this MSM (8.1): \[\begin{equation*} \text{ATE} := \left(\alpha_0 + \alpha_A \times 1 \right) - \left(\alpha_0 + \alpha_A \times 0 \right) = \alpha_A \end{equation*}\]
In this example, the coefficient \(\alpha_A\) corresponds to the \(\text{ATE}\).
Such a model is not very useful for a binary exposure. It would be much more useful for higher-dimensional exposures, for example with a continuous exposure, where the relationship between all the possible continuous values of the exposure \(A=a\) and the corresponding outcomes \(Y_a\) is summarized (and arbitrarily simplified) by a single line and the slope coefficient \(\alpha_A\).
It is also possible to define MSMs adjusted for a subset \(V\) of the baseline confounders (\(V \subset L(0)\)). Such MSMs can be useful to estimate conditional effects. For example it is possible to define an average “conditional” total effect \(\text{ATE}|L(0)\) (instead of the marginal ATE defined above), projecting the counterfactual outcomes on a parametric model such as the following:
\[\begin{equation} \mathbb{E}(Y_a \mid L(0)) = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_\text{soc.env.} L_\text{soc.env.}(0) \tag{8.2} \end{equation}\]
so that \(\text{ATE} \mid L(0) = \mathbb{E}(Y_{A=1} \mid L(0)) - \mathbb{E}(Y_{A=0} \mid L(0)) = \alpha_A\) using the coefficient from the MSM (8.2).
MSMs are also very useful to study interactions, or effect modification of the exposure \(A\) by a baseline confounder. For example, in order to study the average total effect according to sex, we can use the following MSM:
\[\begin{equation} \mathbb{E}(Y_a \mid L_\text{male}(0)) = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_{A \ast L_\text{male}} \left(a \times L_\text{male}(0)\right) \tag{8.3} \end{equation}\]
and express the average total effect in each strata of sex using the coefficients of the MSM (8.3):
\[\begin{align*} \{\text{ATE} \mid L_\text{male}(0) = 0\} &:= \mathbb{E}(Y_1 \mid L_\text{male}(0) = 0) - \mathbb{E}(Y_0 \mid L_\text{male}(0) = 0) = \alpha_A \\ \{\text{ATE} \mid L_\text{male}(0) = 1\} &:= \mathbb{E}(Y_1 \mid L_\text{male}(0) = 1) - \mathbb{E}(Y_0 \mid L_\text{male}(0) = 1) = \alpha_A + \alpha_{A \ast L_\text{male}} \end{align*}\]
Because MSMs are models of unobserved counterfactual outcomes, estimators of the MSM coefficients are necessary. We will describe two possible approaches : estimation by IPTW or by G-computation.
In both approaches, 95% confidence intervals can be computed by bootstrap.
8.1.2 Estimation of the MSM coefficients by IPTW
MSM coefficients can be easily estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.
For example, in order to fit the MSM (8.3) described above, we can use a linear regression of the (observed) outcome \(Y\) on the exposure and sex, weigthed by individual weights \(w_i\) or \(sw_i\): \[\begin{equation} \mathbb{E}\left[Y \mid L_\text{male}(0)\right] = \alpha_0 + \alpha_A a + \alpha_\text{male} L_\text{male}(0) + \alpha_{A \ast L_\text{male}} \left(a \times L_\text{male}(0)\right) \end{equation}\]
where \(w_i=\frac{1}{P(A=a_i \mid L(0)=l(0)_i)}\) or \(sw_i=\frac{P(A=a_i \mid L_\text{male}(0))}{P(A=a_i \mid L(0)=l(0)_i)}\).
As in chapter 7, the “no-unmeasured confounding” assumption is addressed by the application of weights \(w_i\) or \(sw_i\), which balance confounders \(L(0)\) relative to the exposure \(A\).
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## 1. Denominator of the weight
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
family = "binomial", data = df1_int)
# 1b. Predict each individual's probability of being exposed to her own exposure
# predict the probabilities P(A0_PM2.5 = 1) & P(A0_PM2.5 = 0)
pred.g1.L <- predict(g.A.L, type="response")
pred.g0.L <- 1 - pred.g1.L
# the predicted probability of the observed treatment P(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- pred.g1.L[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- pred.g0.L[df1_int$A0_PM2.5==0]
## 2. Numerator of the weight
# The numerator of the weight can be 1 for simple weights,
# or g(A=a_i|V) to obtain stabilized weights which put less weight to individuals
# with less observation. Stabilized weights enable a weaker positivity assumption.
# 2a. Estimate g(A=a_i | sex) (numerator of the stabilized weight)
g.A.sex <- glm(A0_PM2.5 ~ L0_male,
family = "binomial", data = df1_int)
# 2b. Predict each individual's probability of being exposed to her own exposure
# predict the probabilities P(A0_PM2.5 = 1 | sex) & P(A0_PM2.5 = 0 | sex)
pred.g1.sex <- predict(g.A.sex, type="response")
pred.g0.sex <- 1 - pred.g1.sex
# the predicted probability of the observed treatment P(A = a_i | sex) is :
gAi.sex <- rep(NA, nrow(df1_int))
gAi.sex[df1_int$A0_PM2.5==1] <- pred.g1.sex[df1_int$A0_PM2.5==1]
gAi.sex[df1_int$A0_PM2.5==0] <- pred.g0.sex[df1_int$A0_PM2.5==0]
## 3. Define individual weights:
# We can use simple weights w = 1 / g(A=a_i | L(0))
w <- 1 / gAi.L
# Or alternatively, we can use stabilized weights :
# sw = g(A=a_i | sex) / g(A=a_i | L(0))
sw <- gAi.sex / gAi.L
# we can see that stabilized weights have less extreme values
par(mfcol = c(1,2))
boxplot(w ~ df1_int$A0_PM2.5)
boxplot(sw ~ df1_int$A0_PM2.5)
par(mfcol = c(1,1))
## applying these weights creates a pseudo-population were the baseline
## confounders are balanced, relative to the exposure:
## before applying weights to the individuals:
table(df1_int$L0_male, df1_int$A0_PM2.5, deparse.level = 2)
# df1_int$A0_PM2.5
# df1_int$L0_male 0 1
# 0 4527 463
# 1 4349 661
prop.table(table(df1_int$L0_male, df1_int$A0_PM2.5, deparse.level = 2),
margin = 2)
# df1_int$A0_PM2.5
# df1_int$L0_male 0 1
# 0 0.5100270 0.4119217
# 1 0.4899730 0.5880783 # sex is unbalanced: 49% versus 59%
## after applying weights to the individuals:
library(questionr) # The questionr package enables to describe weighted populations
wtd.table(x = df1_int$L0_male, y = df1_int$A0_PM2.5,
weights = w)
# 0 1
# 0 4989.425 4918.462
# 1 5010.862 5057.676
prop.table(wtd.table(x = df1_int$L0_male, y = df1_int$A0_PM2.5,
weights = w), margin = 2)
# 0 1
# 0 0.4989282 0.4930227
# 1 0.5010718 0.5069773 # sex is balanced in the weighted population
## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
# (for relative risk or rate ratios, we can apply a Poisson family;
# for OR, we can apply a binomial family)
msm1 <- glm(Y_death ~ A0_PM2.5 + L0_male + A0_PM2.5*L0_male,
weights = w,
family = "gaussian",
data = df1_int)
coef(msm1)
# (Intercept) A0_PM2.5 L0_male A0_PM2.5:L0_male
# 0.17573472 0.05883228 0.04598911 0.03077614
msm2 <- glm(Y_death ~ A0_PM2.5 + L0_male + A0_PM2.5*L0_male,
weights = sw,
family = "gaussian",
data = df1_int)
coef(msm2)
# (Intercept) A0_PM2.5 L0_male A0_PM2.5:L0_male
# 0.17573472 0.05883228 0.04598911 0.03077614
## 5. Estimate the ATE stratified by sex
# According to MSM1 (with simple weights)
ATE.msm1.male0 <- coef(msm1)["A0_PM2.5"]
# 0.05883228
ATE.msm1.male1 <- coef(msm1)["A0_PM2.5"] + coef(msm1)["A0_PM2.5:L0_male"]
# 0.08960842
# According to the MSM2 (with stabilized weights)
ATE.msm2.male0 <- coef(msm2)["A0_PM2.5"]
# 0.05883228
ATE.msm2.male1 <- coef(msm2)["A0_PM2.5"] + coef(msm2)["A0_PM2.5:L0_male"]
# 0.08960842
# The results are the same because there is no violation of the positivity assumption
# In case of positivity violation, stabilized weights would give more accurate estimates
The ATE estimates of death probability using an MSM estimated by IPTW are respectively +5.9% in women and +9.0% in men.
95% confidence intervals can be calculated by bootstrap.
Note: Using the true data generating model used to simulate the illustrative datasets, the “true” value of the ATE stratified by sex can be calculated:
the “true” \((ATE \mid L_\text{male}(0) = 0) = 0.0688\) in women,
the “true” \((ATE \mid L_\text{male}(0) = 1) = 0.0703\) in men.
8.1.3 Estimation of the MSM coefficients by G-computation (imputation)
We can also use a G-computation (sometimes described as an imputation) approach to estimate the coefficients of an MSM.
The following steps can be applied:
Fit a (logistic or a linear) regression to estimate \(\overline{Q} = \mathbb{E}(Y \mid A, L(0))\). Don’t forget to add an \(A*L_{\text{male}}(0)\) interaction term, because we aim to estimate the effect of the exposure \(A\), modified by sex.
Use this estimate to predict an outcome for each subject under the counterfactual scenarios \(\hat{\overline{Q}}(A=0)_i\) and \(\hat{\overline{Q}}(A=1)_i\), by evaluating the regression fit \(\overline{Q}\) at \(A=0\) and \(A=1\) respectively
Duplicate the initial dataset in a single long dataset in which:
- the first half of the long dataset corresponds to the first counterfactual scenario with \(A=0\) for all individuals and an additional column for the predicted counterfactual outcome \(\hat{\overline{Q}}(A=0)\) ;
- the second half of the long dataset corresponds to the second counterfactual scenario with \(A=1\) for all individuals and \(\hat{\overline{Q}}(A=1)\) for the predicted counterfactual column.
- Fit the MSM \(\mathbb{E}\left[Y_a \mid L_\text{male}(0)\right]\) using the long dataset.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## 1. Estimate Qbar
Q.tot.death <- glm(Y_death ~ A0_PM2.5 + L0_male + L0_soc_env +
A0_PM2.5:L0_male, # don't forget the interaction term
family = "binomial", data = df1_int)
# The final result would be sligthly different if we apply a binomial family.
## 2. Predict an outcome for each subject, in 2 counterfactual scenarios
## setting A=0 and A=1
# prepare data sets used to predict the outcome under the counterfactual
# scenarios setting A=0 and A=1
data.A1 <- data.A0 <- df1_int
data.A1$A0_PM2.5 <- 1
data.A0$A0_PM2.5 <- 0
# predict values under the same name in the corresponding counterfactual dataset
data.A1$Ya.death.pred <- predict(Q.tot.death, newdata = data.A1, type = "response")
data.A0$Ya.death.pred <- predict(Q.tot.death, newdata = data.A0, type = "response")
## 3. Append both counterfactual datasets in a single long dataset
# (the number of row is twice the initial number of row because there are 2
# counterfactual scenarios)
data.2scenarios <- rbind(data.A0, data.A1)
## 4. fit the MSM: E(Y_a|sex)
# a GLM with gaussian family can be applied to estimate risk differences
MSM.ATE.gcomp <- glm(Ya.death.pred ~ A0_PM2.5 + L0_male + A0_PM2.5:L0_male,
family = "gaussian",
data = data.2scenarios)
coef(MSM.ATE.gcomp)
# (Intercept) A0_PM2.5 L0_male A0_PM2.5:L0_male
# 0.17571658 0.06108129 0.04602303 0.02516386
## 5. Estimate the ATE stratified by sex
# According to MSM.ATE.gcomp
ATE.MSM.gcomp.male0 <- coef(MSM.ATE.gcomp)["A0_PM2.5"]
# 0.06108129
ATE.MSM.gcomp.male1 <- (coef(MSM.ATE.gcomp)["A0_PM2.5"] +
coef(MSM.ATE.gcomp)["A0_PM2.5:L0_male"])
# 0.08624515
8.2 MSM for Controlled Direct Effects
8.2.1 Expressing the CDE using coefficients of an MSM
The controlled direct effect is defined as \(\text{CDE}_m= \mathbb{E}(Y_{am}) - \mathbb{E}(Y_{a^*m})\).
Using the following MSM \[\begin{equation} \mathbb{E}(Y_{am}) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m \tag{8.4} \end{equation}\] the controlled direct effect (keeping the mediator constant to the value \(M=m\)) can be expressed using the coefficients of the MSM (8.4): \[\begin{align*} \text{CDE}_m &= (\alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m) - (\alpha_0 + \alpha_A a^* + \alpha_M m + \alpha_{A \ast M} a^* \times m) \\ \text{CDE}_m &= \alpha_A(a - a^* ) + \alpha_{A \ast M} \times (a - a^*) \times m \end{align*}\] For a binary exposure \(A\), we have \(\text{CDE}_m=\alpha_A + \alpha_{A \ast M} \times m\).
8.2.2 Estimation of the MSM coefficients by IPTW
MSM coefficients can be easily estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.
In order to fit the MSM (8.4), we can use a linear regression of the (observed) outcome \(Y\) on the exposure and mediator, weighted by individual stabilized weights \(sw_i\) ((VanderWeele 2009)): \[\begin{equation} \mathbb{E}\left(Y \mid A,M\right) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m \end{equation}\]
where \(sw_i\) is the product of two weights \(sw_i = sw_{A,i} \times sw_{M,i}\),
\(sw_{A,i}=\frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)}\) and \(sw_{M,i}=\frac{P(M=m_i \mid A=a_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i}\).
The “no-unmeasured confounding” assumption is addressed by the application of weights \(sw_i\), which balances confounders \(L(0)\) relative to the exposure-outcome \(A-Y\) relationship, and balance the set of confounders \(\{L(0),A,L(1)\}\) relative to the mediator-outcome \(M-Y\) relationship.
8.2.2.1 Example 1, with no intermediate confounder affected by the exposure
### MSM of CDE, estimated by IPTW ----------------------------------------------
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
family = "binomial", data = df1_int)
# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df1_int$A0_PM2.5==0]
# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df1_int))
gAi[df1_int$A0_PM2.5==1] <- predict(g.A, type="response")[df1_int$A0_PM2.5==1]
gAi[df1_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df1_int$A0_PM2.5==0]
# 1e. Calculate the weight for the exposure A: sw_{A,i}
sw_Ai <- gAi / gAi.L
## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
family = "binomial", data = df1_int)
# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df1_int))
gMi.L[df1_int$M_diabetes==1] <- predict(g.M.L, type="response")[df1_int$M_diabetes==1]
gMi.L[df1_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df1_int$M_diabetes==0]
# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df1_int)
# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df1_int))
gMi.A[df1_int$M_diabetes==1] <- predict(g.M.A, type="response")[df1_int$M_diabetes==1]
gMi.A[df1_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df1_int$M_diabetes==0]
# 2e. Calculate the weight for the mediator M: sw_{M,i}
sw_Mi <- gMi.A / gMi.L
## 3. Define the individual stabilized weight for the CDE_m
sw_cde <- sw_Ai * sw_Mi
## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
msm_cde <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5*M_diabetes,
weights = sw_cde,
family = "gaussian",
data = df1_int)
coef(msm_cde)
# (Intercept) A0_PM2.5 M_diabetes A0_PM2.5:M_diabetes
# 0.17891689 0.06798282 0.06729724 -0.00495314
## 5. Estimate CDE for m=0 and for m=1 using the MSM's coefficients
CDE_mis0 <- coef(msm_cde)["A0_PM2.5"]
# 0.06798282
CDE_mis1 <- coef(msm_cde)["A0_PM2.5"] + coef(msm_cde)["A0_PM2.5:M_diabetes"]
# 0.06302968
In this example, our estimates of the controlled direct effects are \(CDE_{M=0} = 6.8\%\) and \(CDE_{M=1} = 6.3\%\). Confidence intervals can be calculated by bootstrap.
Importantly, this approach for the estimation of the controlled direct effect \(\text{CDE}_m\) by IPTW is also valid if the exposure \(A\) affects the intermediate confounder \(L(1)\) (as with the Causal model 2).
8.2.2.2 Example 2, with intermediate confounder affected by the exposure
Below, we estimate the CDE by hand, and using the ltmle
package and the CMAverse
package
### MSM of CDE, estimated by IPTW
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")
## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
family = "binomial", data = df2_int)
# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df2_int))
gAi.L[df2_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df2_int$A0_PM2.5==1]
gAi.L[df2_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df2_int$A0_PM2.5==0]
# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df2_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df2_int))
gAi[df2_int$A0_PM2.5==1] <- predict(g.A, type="response")[df2_int$A0_PM2.5==1]
gAi[df2_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df2_int$A0_PM2.5==0]
# 1e. Calculate the weight for the exposure A: sw_{A,i}
sw_Ai <- gAi / gAi.L
## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
family = "binomial", data = df2_int)
# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df2_int))
gMi.L[df2_int$M_diabetes==1] <- predict(g.M.L, type="response")[df2_int$M_diabetes==1]
gMi.L[df2_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df2_int$M_diabetes==0]
# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df2_int)
# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df2_int))
gMi.A[df2_int$M_diabetes==1] <- predict(g.M.A, type="response")[df2_int$M_diabetes==1]
gMi.A[df2_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df2_int$M_diabetes==0]
# 2e. Calculate the weight for the mediator M: sw_{M,i}
sw_Mi <- gMi.A / gMi.L
## 3. Define the individual stabilized weight for the CDE_m
sw_cde <- sw_Ai * sw_Mi
## 4. Estimate coefficients of the MSM using a weighted regression E(Y | A, sex)
# a GLM with gaussian family can be applied to estimate risk differences
msm_cde <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5*M_diabetes,
weights = sw_cde,
family = "gaussian",
data = df2_int)
coef(msm_cde)
# (Intercept) A0_PM2.5 M_diabetes A0_PM2.5:M_diabetes
# 0.17932146 0.06012920 0.07239661 0.03017266
## 5. Estimate CDE for m=0 and for m=1 using the MSM's coefficients
CDE_mis0 <- coef(msm_cde)["A0_PM2.5"]
# 0.0601292
CDE_mis1 <- coef(msm_cde)["A0_PM2.5"] + coef(msm_cde)["A0_PM2.5:M_diabetes"]
# 0.09030186
### Estimation by IPTW using ltmle package
library(ltmle)
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
Y_death="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
A0_PM2.5 * M_diabetes")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env",
"M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1")
data_binary <- subset(df2_int, select = c(L0_male, L0_soc_env,
A0_PM2.5, L1,
M_diabetes, Y_death))
CDE_ltmle_M0_death <- ltmle(data = data_binary,
Anodes = c("A0_PM2.5", "M_diabetes"),
Lnodes = c("L1"), # intermediate confounders +/- baseline
Ynodes = c("Y_death"),
survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
Qform = Qform,
gform = gform,
abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
c(0,0)), # counterfactual intervention do(A=0,M=0)
SL.library = NULL,
estimate.time = FALSE, # estimate computation time?
iptw.only = TRUE, # for only IPTW estimations (no TMLE)
gcomp = FALSE,
variance.method = "iptw") # IPTW influence curve
# CDE with M=0
summary(CDE_ltmle_M0_death, estimator = "iptw")$effect.measures$ATE
# $estimate
# [1] 0.0601292
# $CI
# 2.5% 97.5%
# [1,] 0.02424819 0.09601021
CDE_ltmle_M1_death <- ltmle(data = data_binary,
Anodes = c("A0_PM2.5", "M_diabetes"),
Lnodes = c("L1"), # intermediate confounders +/- baseline
Ynodes = c("Y_death"),
survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
Qform = Qform,
gform = gform,
abar = list(c(1,1), # counterfactual intervention do(A=1,M=0)
c(0,1)), # counterfactual intervention do(A=0,M=0)
SL.library = NULL,
estimate.time = FALSE, # estimate computation time?
iptw.only = TRUE, # for only IPTW estimations (no TMLE)
gcomp = FALSE,
variance.method = "iptw") # IPTW influence curve
# CDE with M=1
summary(CDE_ltmle_M1_death, estimator = "iptw")$effect.measures$ATE
# $estimate
# [1] 0.09030186
# $CI
# 2.5% 97.5%
# [1,] 0.04084792 0.1397558
## we obtain the same results as the IPTW estimation by hand
### Estimation by IPTW using ltmle package
library(CMAverse)
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")
cmdag(outcome = "Y_death", exposure = "A0_PM2.5", mediator = "M_diabetes",
basec = c("L0_male", "L0_soc_env"), postc = "L1", node = TRUE, text_col = "white")
# In this setting, we can use the marginal structural model and the $g$-formula approach. The results are shown below.
## The Marginal Structural Model
res_msm_RD <- cmest(data = df2_int,
model = "msm",
outcome = "Y_death",
exposure = "A0_PM2.5",
mediator = "M_diabetes",
basec = c("L0_male", "L0_soc_env"),
postc = "L1",
EMint = TRUE, # E*M interaction
ereg = "logistic",
yreg = "linear", # MSM is a linear regression (to get RD)
mreg = list("logistic"),
wmnomreg = list("logistic"),
wmdenomreg = list("logistic"),
astar = 0, #E(Y_{A=0,M=1})
a = 1, #E(Y_{A=1,M=1})
mval = list(1), # for the CDE, set mediator to M=1
estimation = "imputation",
inference = "bootstrap",
nboot = 2)
summary(res_msm_RD)
# Causal Mediation Analysis
# # Outcome regression:
# Call:
# glm(formula = Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes,
# family = gaussian(), data = getCall(x$reg.output$yreg)$data,
# weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|) # MSM coef
# (Intercept) 0.179321 0.005242 34.210 < 2e-16 *** # are the
# A0_PM2.5 0.060129 0.017253 3.485 0.000494 *** # same as
# M_diabetes 0.072397 0.009242 7.834 5.21e-15 *** # calculation
# A0_PM2.5:M_diabetes 0.030173 0.025895 1.165 0.243960 # by hand
#
# # Mediator regressions:
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data,
# weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.73432 0.02268 -32.384 < 2e-16 ***
# A0_PM2.5 0.51531 0.06422 8.024 1.02e-15 ***
#
# # Mediator regressions for weighting (denominator):
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env +
# L1, family = binomial(), data = getCall(x$reg.output$wmdenomreg[[1L]])$data,
# weights = getCall(x$reg.output$wmdenomreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.36249 0.04783 -28.488 < 2e-16 ***
# A0_PM2.5 0.30994 0.06668 4.648 3.35e-06 ***
# L0_male 0.24661 0.04369 5.644 1.66e-08 ***
# L0_soc_env 0.30628 0.04650 6.587 4.50e-11 ***
# L1 0.86045 0.04493 19.152 < 2e-16 ***
#
# # Mediator regressions for weighting (nominator):
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(), data = getCall(x$reg.output$wmnomreg[[1L]])$data,
# weights = getCall(x$reg.output$wmnomreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.74205 0.02271 -32.680 <2e-16 ***
# A0_PM2.5 0.55288 0.06408 8.628 <2e-16 ***
#
# # Exposure regression for weighting:
# Call:
# glm(formula = A0_PM2.5 ~ L0_male + L0_soc_env, family = binomial(),
# data = getCall(x$reg.output$ereg)$data, weights = getCall(x$reg.output$ereg)$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.73244 0.07425 -36.799 < 2e-16 ***
# L0_male 0.40580 0.06447 6.294 3.09e-10 ***
# L0_soc_env 0.64060 0.07350 8.716 < 2e-16 ***
#
# # Effect decomposition on the mean difference scale via the marginal structural model
# Direct counterfactual imputation estimation with
# bootstrap standard errors, percentile confidence intervals and p-values
# Estimate Std.error 95% CIL 95% CIU P.val
# cde 9.030e-02 1.601e-02 6.094e-02 0.082 <2e-16 *** # same result
# rpnde 7.003e-02 1.600e-02 5.838e-02 0.080 <2e-16 ***
# rtnde 7.374e-02 1.597e-02 5.888e-02 0.080 <2e-16 ***
# rpnie 8.903e-03 9.159e-04 8.284e-03 0.010 <2e-16 ***
# rtnie 1.261e-02 9.523e-04 8.738e-03 0.010 <2e-16 ***
# te 8.265e-02 1.505e-02 6.839e-02 0.089 <2e-16 ***
# rintref -2.027e-02 7.341e-06 -2.571e-03 -0.003 <2e-16 ***
# rintmed 3.710e-03 3.634e-05 4.542e-04 0.001 <2e-16 ***
# cde(prop) 1.093e+00 2.940e-02 8.907e-01 0.930 <2e-16 ***
# rintref(prop) -2.452e-01 6.289e-03 -3.752e-02 -0.029 <2e-16 ***
# rintmed(prop) 4.489e-02 1.662e-03 5.140e-03 0.007 <2e-16 ***
# rpnie(prop) 1.077e-01 3.403e-02 9.376e-02 0.139 <2e-16 ***
# rpm 1.526e-01 3.569e-02 9.890e-02 0.147 <2e-16 ***
# rint -2.004e-01 4.627e-03 -3.014e-02 -0.024 <2e-16 ***
# rpe -9.263e-02 2.940e-02 6.983e-02 0.109 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (cde: controlled direct effect;
# rpnde: randomized analogue of pure natural direct effect;
# rtnde: randomized analogue of total natural direct effect;
# rpnie: randomized analogue of pure natural indirect effect;
# rtnie: randomized analogue of total natural indirect effect;
# te: total effect; rintref: randomized analogue of reference interaction;
# rintmed: randomized analogue of mediated interaction;
# cde(prop): proportion cde;
# rintref(prop): proportion rintref;
# rintmed(prop): proportion rintmed;
# rpnie(prop): proportion rpnie;
# rpm: randomized analogue of overall proportion mediated;
# rint: randomized analogue of overall proportion attributable to interaction;
# rpe: randomized analogue of overall proportion eliminated)
8.2.3 Estimation of the MSM coefficients by G-computation
As for the ATE, we can use G-computation to estimate the coefficients of the MSM to estimate Controlled Direct Effects.
Note that the algorithm described below is correct only if the exposure \(A\) doesn’t affect the intermediate confounders \(L(1)\) of the \(M \rightarrow Y\) relationship (such as the data simulated from the Causal model 1 in our examples). In that case, the following steps can be applied:
Fit a (logistic or a linear) regression to estimate \(\overline{Q} = \mathbb{E}(Y \mid A,M, L(0),L(1))\)
Use this estimate to predict an outcome for each subject under the counterfactual scenarios \(\hat{\overline{Q}}(A=0,M=m,L(0),L(1))_i\) and \(\hat{\overline{Q}}(A=1,M=m,L(0),L(1))_i\), by evaluating the regression fit \(\overline{Q}\) at \((A=0,M=m)\) and \((A=1,M=m)\) respectively. If we want to set the level of the mediator to \(M=0\) and \(M=1\), this would give 4 counterfactual scenarios \(\text{do}(A=0,M=0)\), \(\text{do}(A=1,M=0)\), \(\text{do}(A=0,M=1)\) and \(\text{do}(A=1,M=1)\).
Duplicate the initial dataset for each scenario in a single long dataset in which:
- the 1st part of the long dataset corresponds to the first counterfactual scenario with \((A=0,M=0)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=0,M=0}\mid L(0),L(1)) = \hat{\overline{Q}}(A=0,M=0,L(0),L(1))\) ;
- the 2d part of the long dataset corresponds to the second counterfactual scenario with \((A=1,M=0)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=1,M=0}\mid L(0),L(1)) = \hat{\overline{Q}}(A=1,M=0,L(0),L(1))\) ;
- the 3d part of the long dataset corresponds to the second counterfactual scenario with \((A=0,M=1)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=0,M=1}\mid L(0),L(1)) = \hat{\overline{Q}}(A=0,M=1,L(0),L(1))\) ;
- the 4th part of the long dataset corresponds to the second counterfactual scenario with \((A=1,M=1)\) for all individuals and an additional column for the predicted counterfactual outcome \(\mathbb{E}(Y_{A=1,M=1}\mid L(0),L(1)) = \hat{\overline{Q}}(A=1,M=1,L(0),L(1))\) ;
- Fit the MSM \(\mathbb{E}(Y_{am}) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m\) using the long dataset.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
### MSM of CDE, estimated by G-computation -------------------------------------
## 1. Estimate Qbar(A,M,L0,L1)
Q.cde.death <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes +
L0_male + L0_soc_env + L1,
family = "gaussian", data = df1_int)
# The final result would be sligthly different if we applied a binomial family
# The Gaussian family corresponds to the true generating model in this example.
## 2. Predict an outcome for each subject, in each counterfactual scenario
# Prepare data sets that will be used to predict the outcome under the counterfactual
# 4 counterfactual scenarios setting (A=0,M=0), (A=1,M=0), (A=0,M=1) and (A=1,M=1)
data.A0M0 <- data.A1M0 <- data.A0M1 <- data.A1M1 <- df1_int
data.A0M0$A0_PM2.5 <- 0
data.A0M0$M_diabetes <- 0
data.A1M0$A0_PM2.5 <- 1
data.A1M0$M_diabetes <- 0
data.A0M1$A0_PM2.5 <- 0
data.A0M1$M_diabetes <- 1
data.A1M1$A0_PM2.5 <- 1
data.A1M1$M_diabetes <- 1
# predict values under the same name in the corresponding counterfactual dataset
data.A0M0$Yam.death.pred <- predict(Q.cde.death, newdata = data.A0M0, type = "response")
data.A1M0$Yam.death.pred <- predict(Q.cde.death, newdata = data.A1M0, type = "response")
data.A0M1$Yam.death.pred <- predict(Q.cde.death, newdata = data.A0M1, type = "response")
data.A1M1$Yam.death.pred <- predict(Q.cde.death, newdata = data.A1M1, type = "response")
## 3. Append the 4 counterfactual datasets in a single long dataset
# number of row is 4 times the initial value (we have 4 counterfactual scenarios)
data.4scenarios <- rbind(data.A0M0, data.A1M0,data.A0M1,data.A1M1)
## 4. fit the MSM: E(Y_am) = alpha_0 + alpha_A a + alpha_M m + alpha_AM a:m
MSM.CDE.gcomp <- glm(Yam.death.pred ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes,
family = "gaussian", # gaussian family for risk differences
data = data.4scenarios)
coef(MSM.CDE.gcomp)
# (Intercept) A0_PM2.5 M_diabetes A0_PM2.5:M_diabetes
# 0.17968603 0.06000138 0.06757214 0.01918153
## 5. Estimate the CDE(M=m)
# CDE(M=0) = E(Y_{A=1,M=0}) - E(Y_{A=0,M=0})
CDE_mis0_gcomp <- coef(MSM.CDE.gcomp)["A0_PM2.5"]
# 0.06000138
# CDE(M=1) = E(Y_{A=1,M=1}) - E(Y_{A=0,M=1})
CDE_mis1_gcomp <- (coef(MSM.CDE.gcomp)["A0_PM2.5"] +
coef(MSM.CDE.gcomp)["A0_PM2.5:M_diabetes"])
# 0.07918291
# Note: Applying a binomial family for the first Qbar model would result in two
# sligthly different values of the CDE(M=m)
# => 0.05934409 in setting M=0
# => 0.07537874 in setting M=1
If the exposure \(A\) affects the intermediate confounder \(L(1)\), as in the Causal model 2, the steps 1) and 2) of the algorithm should follow the method described in paragraph 6.2.1 or 6.2.2.
Here is an example applying G-computation by iterative conditional expectation:
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")
### MSM of CDE, estimated by G-computation (by ICE) ----------------------------
## 1a) Regress the outcome on L0, A, L1 and M (and the A*M interaction if appropriate)
Y.death.model <- glm(Y_death ~ L0_male + L0_soc_env + A0_PM2.5 + L1 +
M_diabetes + A0_PM2.5:M_diabetes,
family = "binomial", data = df2_int)
## 1b) Generate predicted values by evaluating the regression
## under the 4 counterfactual scenarios
data.A0M0 <- data.A1M0 <- data.A0M1 <- data.A1M1 <- df2_int
data.A0M0$A0_PM2.5 <- 0
data.A0M0$M_diabetes <- 0
data.A1M0$A0_PM2.5 <- 1
data.A1M0$M_diabetes <- 0
data.A0M1$A0_PM2.5 <- 0
data.A0M1$M_diabetes <- 1
data.A1M1$A0_PM2.5 <- 1
data.A1M1$M_diabetes <- 1
Q.Y.death.A0M0 <- predict(Y.death.model, newdata = data.A0M0, type = "response")
Q.Y.death.A1M0 <- predict(Y.death.model, newdata = data.A1M0, type = "response")
Q.Y.death.A0M1 <- predict(Y.death.model, newdata = data.A0M1, type = "response")
Q.Y.death.A1M1 <- predict(Y.death.model, newdata = data.A1M1, type = "response")
## 2a) Regress the predicted values conditional on the observed exposure A
## and baseline confounders L(0)
L1.death.A0M0.model <- glm(Q.Y.death.A0M0 ~ L0_male + L0_soc_env + A0_PM2.5,
family = "quasibinomial", data = df2_int)
L1.death.A1M0.model <- glm(Q.Y.death.A1M0 ~ L0_male + L0_soc_env + A0_PM2.5,
family = "quasibinomial", data = df2_int)
L1.death.A0M1.model <- glm(Q.Y.death.A0M1 ~ L0_male + L0_soc_env + A0_PM2.5,
family = "quasibinomial", data = df2_int)
L1.death.A1M1.model <- glm(Q.Y.death.A1M1 ~ L0_male + L0_soc_env + A0_PM2.5,
family = "quasibinomial", data = df2_int)
## 2b) generate predicted values by evaluating the regression at exposure
## of interest: {A=0,M=0}, {A=1,M=0}, {A=0,M=1}, {A=1,M=1}
data.A0M0$Yam.death.pred <- predict(L1.death.A0M0.model,
newdata = data.A0M0, type = "response")
data.A1M0$Yam.death.pred <- predict(L1.death.A1M0.model,
newdata = data.A1M0, type = "response")
data.A0M1$Yam.death.pred <- predict(L1.death.A0M1.model,
newdata = data.A0M1, type = "response")
data.A1M1$Yam.death.pred <- predict(L1.death.A1M1.model,
newdata = data.A1M1, type = "response")
## 3. Append the 4 counterfactual datasets in a single long dataset
# number of row is 4 times the initial value (we have 4 counterfactual scenarios)
data.4scenarios <- rbind(data.A0M0, data.A1M0,data.A0M1,data.A1M1)
## 4. fit the MSM: E(Y_am) = alpha_0 + alpha_A a + alpha_M m + alpha_AM a:m
MSM.CDE.gcomp <- glm(Yam.death.pred ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes,
family = "gaussian", # gaussian family for risk differences
data = data.4scenarios)
coef(MSM.CDE.gcomp)
# (Intercept) A0_PM2.5 M_diabetes A0_PM2.5:M_diabetes
# 0.17974947 0.06342833 0.07366466 0.02469485
## 5. Estimate the CDE(M=m)
# CDE(M=0) = E(Y_{A=1,M=0}) - E(Y_{A=0,M=0})
CDE_mis0_gcomp_ice <- coef(MSM.CDE.gcomp)["A0_PM2.5"]
# 0.06342833
# CDE(M=1) = E(Y_{A=1,M=1}) - E(Y_{A=0,M=1})
CDE_mis1_gcomp_ice <- (coef(MSM.CDE.gcomp)["A0_PM2.5"] +
coef(MSM.CDE.gcomp)["A0_PM2.5:M_diabetes"])
# 0.08812318
The example above (MSM estimation using G-computation by ICE) corresponds to the algorithm applied by the ltmle
package:
library(ltmle)
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
Y_death="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
A0_PM2.5 * M_diabetes")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env",
"M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1")
# in this example of g-computation, the propensity scores 'gform' will not be used
data_binary <- subset(df2_int, select = c(L0_male, L0_soc_env,
A0_PM2.5, L1,
M_diabetes, Y_death))
CDE_ltmle_M0 <- ltmle(data = data_binary,
Anodes = c("A0_PM2.5", "M_diabetes"),
Lnodes = c("L1"), # intermediate confounders +/- baseline
Ynodes = c("Y_death"),
survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
Qform = Qform,
gform = gform,
abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
c(0,0)), # counterfactual intervention do(A=0,M=0)
SL.library = NULL, # calls glm() instead of SuperLearner
estimate.time = FALSE, # estimate computation time?
gcomp = TRUE, # to apply g-computation
variance.method = "ic")
# CDE with M=0
summary(CDE_ltmle_M0)$effect.measures$ATE$estimate
# Parameter Estimate: 0.06342833
CDE_ltmle_M1 <- ltmle(data = data_binary,
Anodes = c("A0_PM2.5", "M_diabetes"),
Lnodes = c("L1"), # intermediate confounders +/- baseline
Ynodes = c("Y_death"),
survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
Qform = Qform,
gform = gform,
abar = list(c(1,1), # counterfactual intervention do(A=1,M=0)
c(0,1)), # counterfactual intervention do(A=0,M=0)
SL.library = NULL, # calls glm() instead of SuperLearner
estimate.time = FALSE, # estimate computation time?
gcomp = TRUE, # to apply g-computation
variance.method = "ic")
# CDE with M=1
summary(CDE_ltmle_M1)$effect.measures$ATE$estimate
# Parameter Estimate: 0.08812318
8.3 MSM for Natural Direct and Indirect Effects
8.3.1 Expressing the NDE and NIE using coefficients of 2 MSMs
The (Pure) Natural Direct Effect is defined by \(\text{PNDE} = \mathbb{E}(Y_{a,M_{a^*}}) - \mathbb{E}(Y_{a^*,M_{a^*}})\) and the (Total) Natural Indirect Effect is defined by \(\text{TNIE} = \mathbb{E}(Y_{a,M_a}) - \mathbb{E}(Y_{a,M_{a^*}})\).
VanderWeele suggested using 2 MSMs conditional on baseline confounders \(L(0)\) in order to estimate natural direct and indirect effects (VanderWeele 2009):
a model of the counterfactual values of the outcome \(\mathbb{E}(Y_{a,m} \mid l(0))=h^{-1}(a,m,l(0),l(1))\), where \(h\) is a link function. For example: \[\begin{equation} \mathbb{E}(Y_{a,m} \mid l(0),l(1)) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{AM} a \times m + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \tag{8.5} \end{equation}\] (where \(h\) is the identity function, so that the model can be used to express risk differences)
a model of the counterfactual values of the mediator \(\mathbb{E}(M_{a} \mid L(0))=g^{-1}(a,l(0),l(1))\), where \(g\) is a link function. For example with a binary mediator: \[\begin{equation} \mathbb{E}(M_a \mid l(0),l(1)) = g^{-1}\left[\beta_0 + \beta_A a + \beta_{L(0)} l(0) + \beta_{L(1)} l(1) \right] \tag{8.6} \end{equation}\] (where \(g\) is the logit function because the mediator is binary).
VanderWeele shows that if the function \(h\) is linear in \(m\) (no quadratic terms in \(m\), nor transformations such as \(\log(m)\) or \(\sqrt{m}\), etc) and the exposure \(A\) does not affect the intermediate confounder \(L(1)\), then \[\begin{equation*} \mathbb{E}(Y_{a,M_{a^*}}) = h^{-1}\left[a,g^{-1}\left(a^*,l(0),l(1)\right),l(0),l(1)\right] \end{equation*}\]
Using the 2 MSMs, we can express the Natural Direct and Indirect Effects conditional on baseline confounders \(L(0)\). In our example: \[\begin{align*} \text{PNDE} \mid L(0),L(1) &= \mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1)) - \mathbb{E}(Y_{a^*,M_{a^*}} \mid L(0),L(1)) \\ &= \{\alpha_0 + \alpha_A a + [\alpha_M + \alpha_{AM}a] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ & \quad \quad - \{ \alpha_0 + \alpha_A a^*+ [\alpha_M + \alpha_{AM}a^*] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1)\} \\ &= (a - a^*) \times [\alpha_A + \alpha_{AM} \times g^{-1}(a^*,l(0),l(1))] \end{align*}\]
\[\begin{align*} \text{TNIE} \mid L(0),L(1) &= \mathbb{E}(Y_{a,M_{a}} \mid L(0),L(1)) - \mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1)) \\ &= \{\alpha_0 + \alpha_A a + [\alpha_M + \alpha_{AM}a] \times g^{-1}(a,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ & \quad \quad - \{ \alpha_0 + \alpha_A a+ [\alpha_M + \alpha_{AM}a] \times g^{-1}(a^*,l(0),l(1)) + \alpha_{L(0)} l(0) + \alpha_{L(1)} l(1) \} \\ &= \left[ g^{-1}(a,l(0),l(1)) - g^{-1}(a^*,l(0),l(1)) \right](\alpha_M + \alpha_{AM} a) \end{align*}\]
Marginal Natural Direct and Indirect effect can then be obtained: \[\begin{align*} \text{PNDE} &= \sum_{l(0),l(1)} \left[ \text{PNDE} \mid L(0) = l(0), L(1)=l(1) \right] \times P(L(0)=l(0),L(1)=l(1)) \\ \quad \text{TNIE} &= \sum_{l(0),l(1)} \left[ \text{TNIE} \mid L(0) = l(0), L(1)=l(1) \right] \times P(L(0)=l(0),L(1)=l(1)) \end{align*}\]
8.3.2 Estimation of the 2 MSMs coefficients by IPTW for NDE and NIE
As previously, MSM coefficients can be estimated using an Inverse Probability of Treatment (IPTW) approach based on weighted regressions.
In order to fit the 1st MSM (8.5), we can use a linear regression of the (observed) outcome \(Y\) on the exposure and mediator, adjusted for \(L(0)\), weighted by individual stabilized weights \(sw_{msm1,i}\) (VanderWeele 2009): \[\begin{equation*} \mathbb{E}\left(Y \mid A,M,L(0)\right) = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{AM} a \times m + \alpha_{L(0)} L(0) \end{equation*}\]
where \(sw_{msm1,i}\) is the product of two weights \(sw_{msm1,i} = sw_{A,i} \times sw_{M,i}\), \[\begin{align*} sw_{A,i} =& \frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)} \quad \text{or} \quad sw_{A,i} = \frac{P(A=a_i \mid L(0)=l(0)_i)}{P(A=a_i \mid L(0)=l(0)_i)} = 1 \\ sw_{M,i} =& \frac{P(M=m_i \mid A=a_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i} \\ & \quad \text{or} \quad sw_{M,i} = \frac{P(M=m_i \mid A=a_i,L(0)=l(0)_i)}{P(M = m_i \mid A=a_i,L(0)=l(0)_i), L(1)=l(1)_i} \\ \end{align*}\]
In order to fit the 2nd MSM (8.6), we can use a logistic regression of the (observed) mediator \(M\) on the exposure, adjusted for \(L(0)\), weighted by individual stabilized weights \(sw_{msm2,i}\): \[\begin{equation*} \text{logit} \mathbb{E}(M \mid a,l(0)) = \beta_0 + \beta_A a + \beta_{L(0)} l(0) \end{equation*}\]
\[\begin{equation*} \text{where} \quad sw_{msm2,i} = \frac{P(A=a_i)}{P(A=a_i \mid L(0)=l(0)_i)} \end{equation*}\]
### MSM of NDE & NIE, estimated by IPTW ----------------------------------------
## 1. Stabilized weight for the MSM1
# 1a. sw_Ai = g(A=a_i | L(0)) / g(A=a_i | L(0)) = 1
sw_Ai <- rep(1, nrow(df1_int))
# 1b. sw_Mi = g(M=m_i | A,L(0)) / g(M=m_i | A,L(0),L(1))
g.M.AL0 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env,
family = "binomial", data = df1_int)
g.Mis1.AL0 <- predict(g.M.AL0, type = "response")
sw_M.num <- rep(NA, nrow(df1_int))
sw_M.num[df1_int$M_diabetes==1] <- g.Mis1.AL0[df1_int$M_diabetes==1]
sw_M.num[df1_int$M_diabetes==0] <- (1 - g.Mis1.AL0[df1_int$M_diabetes==0])
g.M.AL0L1 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env + L1,
family = "binomial", data = df1_int)
g.Mis1.AL0L1 <- predict(g.M.AL0L1, type = "response")
sw_M.denom <- rep(NA, nrow(df1_int))
sw_M.denom[df1_int$M_diabetes==1] <- g.Mis1.AL0L1[df1_int$M_diabetes==1]
sw_M.denom[df1_int$M_diabetes==0] <- (1 - g.Mis1.AL0L1[df1_int$M_diabetes==0])
sw_msm1 <- sw_Ai * sw_M.num / sw_M.denom
## 2. Estimate coefficients of the MSM1
MSM1 <- glm(Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5:M_diabetes +
L0_male + L0_soc_env,
weights = sw_msm1,
family = "gaussian",
data = df1_int)
coef(MSM1)
# (Intercept) A0_PM2.5 M_diabetes L0_male L0_soc_env A0_PM2.5:M_diabetes
# 0.12033221 0.06381257 0.06691712 0.04671886 0.05521263 0.01652446
## 3. Stabilized weight for the MSM2
# 3a. sw_A = g(A=a_i) / g(A=a_i | L(0))
# numerator
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
g.Ais1 <- predict(g.A, type = "response")
sw_msm2.num <- rep(NA, nrow(df1_int))
sw_msm2.num[df1_int$A0_PM2.5==1] <- g.Ais1[df1_int$A0_PM2.5==1]
sw_msm2.num[df1_int$A0_PM2.5==0] <- (1 - g.Ais1[df1_int$A0_PM2.5==0])
# denominator
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
family = "binomial", data = df1_int)
g.Ais1.L0 <- predict(g.A.L0, type = "response")
sw_msm2.denom <- rep(NA, nrow(df1_int))
sw_msm2.denom[df1_int$A0_PM2.5==1] <- g.Ais1.L0[df1_int$A0_PM2.5==1]
sw_msm2.denom[df1_int$A0_PM2.5==0] <- (1 - g.Ais1.L0[df1_int$A0_PM2.5==0])
# stabilized weight
sw_msm2 <- sw_msm2.num / sw_msm2.denom
## 3. Estimate coefficients of the MSM2
MSM2 <- glm(M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env,
weights = sw_msm2,
family = "binomial",
data = df1_int)
coef(MSM2)
# (Intercept) A0_PM2.5 L0_male L0_soc_env
# -1.2723106 0.5883720 0.2566129 0.3270087
## 4. Estimate PNDE conditional on L(0), and the marginal value of PNDE
# a = 1 and a* = 0
# PNDE|L(0) = (a - a*)[alpha_A + alpha_AM.g^-1(a^*,l(0))]
g.minus1.A0 <- plogis(coef(MSM2)["(Intercept)"] + coef(MSM2)["A0_PM2.5"] * 0 +
coef(MSM2)["L0_male"] * df1_int$L0_male +
coef(MSM2)["L0_soc_env"] * df1_int$L0_soc_env)
# PNDE conditional on L(0)
PNDE_L0 <- (1 - 0) * (coef(MSM1)["A0_PM2.5"] +
coef(MSM1)["A0_PM2.5:M_diabetes"] * g.minus1.A0)
# marginal PNDE
PNDE <- mean(PNDE_L0)
# [1] 0.06850657
## 4. Estimate TNIE conditional on L(0), and the marginal value of TNIE
# TNIE|L(0) = [g^-1(a,l(0)) - g^-1(a^*,l(0))] * (alpha_M + alpha_AM * a)
g.minus1.A1 <- plogis(coef(MSM2)["(Intercept)"] + coef(MSM2)["A0_PM2.5"] * 1 +
coef(MSM2)["L0_male"] * df1_int$L0_male +
coef(MSM2)["L0_soc_env"] * df1_int$L0_soc_env)
# TNIE conditional on L(0)
TNIE_L0 <- (g.minus1.A1 - g.minus1.A0) * (coef(MSM1)["M_diabetes"] +
coef(MSM1)["A0_PM2.5:M_diabetes"] * 1)
# marginal PNDE
TNIE <- mean(TNIE_L0)
# [1] 0.01096799
In this example, the estimation of the PNDE is 6.9% and the estimation of the TNIE is 1.1%. Confidence intervals can be calculated by bootstrap.
8.4 MSM estimated by the CMAverse
package
8.4.1 Calculation by hand, using the “mediation formula”
The counterfactual quantity \(\mathbb{E}(Y_{a,M_{a^\prime}})\) used to defined the Natural Direct and Indirect effects can be expressed by the mediation formula (conditional on \(\{L(0),L(1)\}\), or the population average): \[\begin{align*} \mathbb{E}\left( Y_{a,M_{a^\prime}} \mid L(0),L(1) \right) &= \sum_m \mathbb{E}\left( Y_{a,m} \mid L(0),L(1) \right) \times \mathbb{P}\left(M_{a^\prime} = m \mid L(0),L(1)\right) \\ \text{ or } & \\ \mathbb{E}\left( Y_{a,M_{a^\prime}} \right) &= \sum_m \mathbb{E}\left( Y_{a,m} \right) \times \mathbb{P}(M_{a^\prime} = m) \\ \end{align*}\]
In order to estimate the \(\text{PNDE} = \mathbb{E}(Y_{1,M_0}) - \mathbb{E}(Y_{0,M_0})\) and the \(\text{TNIE} = \mathbb{E}(Y_{1,M_1}) - \mathbb{E}(Y_{1,M_0})\) we can estimate 2 Marginal Structural Models:
- a model of the counterfactual mediator (where \(g\) is a link function), for example: \[\begin{equation*} g \left[ \mathbb{P}(M_{a^\prime} = 1)\right] = \beta_0 + \beta_A a^\prime \end{equation*}\]
- a model of the counterfactual outcome (where \(h\) is a link function), for example: \[\begin{equation*} h \left[ \mathbb{E}(Y_{a,m})\right] = \alpha_0 + \alpha_A a + \alpha_M m + \alpha_{A \ast M} a \times m) \end{equation*}\]
Below, we will estimate the coefficients of the 2 MSMs by Inverse Probability of Treatment Weighting (IPTW).
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## 1. Stabilized weight for the exposure sw_{A,i}
# 1a. Estimate g(A=a_i|L(0)) (denominator of the weight)
g.A.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1,
family = "binomial", data = df1_int)
summary(g.A.L)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.74236 0.07729 -35.484 < 2e-16 ***
# L0_male 0.40610 0.06448 6.298 3.01e-10 ***
# L0_soc_env 0.64079 0.07350 8.718 < 2e-16 ***
# L1 0.03257 0.06968 0.467 0.64
# 1b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gAi.L <- rep(NA, nrow(df1_int))
gAi.L[df1_int$A0_PM2.5==1] <- predict(g.A.L, type="response")[df1_int$A0_PM2.5==1]
gAi.L[df1_int$A0_PM2.5==0] <- (1 - predict(g.A.L, type="response"))[df1_int$A0_PM2.5==0]
# 1c. Estimate g(A=a_i) (numerator of the weight)
g.A <- glm(A0_PM2.5 ~ 1, family = "binomial", data = df1_int)
# 1d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i) is :
gAi <- rep(NA, nrow(df1_int))
gAi[df1_int$A0_PM2.5==1] <- predict(g.A, type="response")[df1_int$A0_PM2.5==1]
gAi[df1_int$A0_PM2.5==0] <- (1 - predict(g.A, type="response"))[df1_int$A0_PM2.5==0]
# 1e. Calculate sw_{A,i}
sw_Ai <- gAi / gAi.L
## 2. Stabilized weight for the mediator sw_{M,i}
# 2a. Estimate g(M=m_i|L(0),A,L(1)) (denominator of the weight)
g.M.L <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1,
family = "binomial", data = df1_int)
summary(g.M.L)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.37880 0.04872 -28.303 < 2e-16 ***
# L0_male 0.25861 0.04437 5.829 5.57e-09 ***
# L0_soc_env 0.33050 0.04744 6.967 3.23e-12 ***
# A0_PM2.5 0.56260 0.06555 8.583 < 2e-16 ***
# L1 0.33462 0.04744 7.054 1.74e-12 ***
# 2b. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(A = a_i | L(0)) is :
gMi.L <- rep(NA, nrow(df1_int))
gMi.L[df1_int$M_diabetes==1] <- predict(g.M.L, type="response")[df1_int$M_diabetes==1]
gMi.L[df1_int$M_diabetes==0] <- (1 - predict(g.M.L, type="response"))[df1_int$M_diabetes==0]
# 2c. Estimate g(M=m_i|A) (numerator of the weight)
g.M.A <- glm(M_diabetes ~ A0_PM2.5, family = "binomial", data = df1_int)
summary(g.M.A)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.93236 0.02358 -39.544 <2e-16 ***
# A0_PM2.5 0.62388 0.06481 9.627 <2e-16 ***
# 2d. Predict each individual's probability of being exposed to her own exposure
# the predicted probability of the observed treatment g(M = m_i|A) is :
gMi.A <- rep(NA, nrow(df1_int))
gMi.A[df1_int$M_diabetes==1] <- predict(g.M.A, type="response")[df1_int$M_diabetes==1]
gMi.A[df1_int$M_diabetes==0] <- (1 - predict(g.M.A, type="response"))[df1_int$M_diabetes==0]
# 2e. Calculate sw_{M,i}
sw_Mi <- gMi.A / gMi.L
## 3. Estimate marginal model for the counterfactual outcome Y_{am}
model.Yam.death <- glm(Y_death ~ A0_PM2.5 * M_diabetes,
weights = sw_Ai * sw_Mi,
family = "gaussian",
data = df1_int)
summary(model.Yam.death)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.178967 0.005049 35.445 < 2e-16 ***
# A0_PM2.5 0.067151 0.016679 4.026 5.72e-05 ***
# M_diabetes 0.067321 0.009508 7.080 1.54e-12 ***
# A0_PM2.5:M_diabetes -0.004491 0.026027 -0.173 0.863
## 4) Estimate marginal model for the counterfactual mediator P(M_a* = 1)
model.Ma <- glm(M_diabetes ~ A0_PM2.5,
weights = sw_Ai,
family = "binomial",
data = df1_int)
summary(model.Ma)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.92405 0.02353 -39.264 <2e-16 ***
# A0_PM2.5 0.58413 0.06501 8.986 <2e-16 ***
## 5) Estimate population average counterfactuals using the "mediation formula"
## E(Y_{a,M_a*}) = sum_m E(Y_{am}) * P(M_a^* = m)
E.Y0M0 <- ((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a0}) * P(M_a^* = 0)
coef(model.Yam.death)["A0_PM2.5"] * 0 +
coef(model.Yam.death)["M_diabetes"] * 0 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * 0) *
(1 - plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 0))) +
((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a1}) * P(M_a^* = 1)
coef(model.Yam.death)["A0_PM2.5"] * 0 +
coef(model.Yam.death)["M_diabetes"] * 1 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * 1) *
(plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 0)))
E.Y1M0 <- ((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a,M=0}) * P(M_0 = 0)
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * 0 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 0) *
(1 - plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 0))) +
((coef(model.Yam.death)["(Intercept)"] + # sum_m E(Y_{a,M=1}) * P(M_0 = 1)
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * 1 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 1) *
(plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 0)))
E.Y1M1 <- ((coef(model.Yam.death)["(Intercept)"] + # E(Y_{a,M=0}) * P(M_1 = 0)
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * 0 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 0) *
(1 - plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 1))) +
((coef(model.Yam.death)["(Intercept)"] + # E(Y_{a,M=1}) * P(M_1 = 1)
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * 1 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * 1) *
(plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 1)))
PNDE.death <- E.Y1M0 - E.Y0M0
# 0.06587481
TNIE.death <- E.Y1M1 - E.Y1M0
# 0.008274351
In the example above, we directly used the estimated coefficients of the MSMs to calculate the quantities \(\mathbb{E}(Y_{a,M_{a^\prime}})\) for the PNDE and the TNIE. When using the CMAverse, individual values for the counterfactual mediator are simulated as shown below. This strategy can be useful if the mediator is continuous of high-dimensional.
## 6) Estimate population average counterfactuals using simulated
## values of the mediator
PNDE.death.sim <- rep(NA,5)
TNIE.death.sim <- rep(NA,5)
set.seed(1234)
for(k in 1:15){
M0 <- rbinom(n = nrow(df1_int), size = 1,
prob = plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 0))
M1 <- rbinom(n = nrow(df1_int), size = 1,
prob = plogis(coef(model.Ma)["(Intercept)"] +
coef(model.Ma)["A0_PM2.5"] * 1))
E.Y0M0 <- mean((coef(model.Yam.death)["(Intercept)"] +
coef(model.Yam.death)["A0_PM2.5"] * 0 +
coef(model.Yam.death)["M_diabetes"] * M0 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 0 * M0))
E.Y1M0 <- mean((coef(model.Yam.death)["(Intercept)"] +
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * M0 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * M0))
E.Y1M1 <- mean((coef(model.Yam.death)["(Intercept)"] +
coef(model.Yam.death)["A0_PM2.5"] * 1 +
coef(model.Yam.death)["M_diabetes"] * M1 +
coef(model.Yam.death)["A0_PM2.5:M_diabetes"] * 1 * M1))
PNDE.death.sim[k] <- E.Y1M0 - E.Y0M0
TNIE.death.sim[k] <- E.Y1M1 - E.Y1M0
}
mean(PNDE.death.sim)
# [1] 0.06587304
mean(TNIE.death.sim)
# [1] 0.00824998
8.4.2 Estimation of NDE and NIE using CMAverse (by IPTW or gcomputation)
We can use the CMAverse
package to estimate the coefficients of the 2 MSMs required to estimate Natural Direct and Indirect Effects.
For estimation by IPTW, we have to specify the following arguments model = msm
(to estimate the MSM of the outcome and the MSM of the mediator), and that the counterfactual mediator values are simulated (estimation = "imputation"
). Because the coefficients of the MSMs are estimated by IPTW, we have to specify
- the link functions for the MSM of the outcome (
yreg
) and the MSM of the mediator (mreg
) - and the link function to calculate the weights
ereg
for \(\mathbb{P}(A=1|L(0))\),wmnomreg
for the numerator of the mediator’s weight andwmdenomreg
for the denominator of the mediator’s weight.
## Using the CMAverse to estimate MSM estimated by IPTW
set.seed(1234) # note: there is randomness, even with estimation by IPTW
# => counterfactuals are imputed (estimation = "imputation")
res_msm_df1 <- cmest(data = df1_int,
model = "msm",
outcome = "Y_death",
exposure = "A0_PM2.5",
mediator = "M_diabetes",
basec = c("L0_male", "L0_soc_env","L1"),
postc = NULL,
EMint = TRUE, # E*M interaction
ereg = "logistic", # exposure regression model g(A=1|L(0))
yreg = "linear", # to get risk difference
mreg = list("logistic"), # mediation model g(M=1|L1,A,L0)
wmnomreg = list("logistic"), #g(M=1|A) wgt nominator
wmdenomreg = list("logistic"), # g(M=1|L1,A,L(0)) wgt denom
astar = 0, #E(Y_{A=0,M=1})
a = 1, #E(Y_{A=1,M=1})
mval = list(0), # mediator value at which the variable is controlled
estimation = "imputation",
inference = "bootstrap",
nboot = 2)
summary(res_msm_df1)
# Causal Mediation Analysis
#
# # Outcome regression:
# Call:
# glm(formula = Y_death ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes,
# family = gaussian(), data = getCall(x$reg.output$yreg)$data,
# weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.178967 0.005049 35.445 < 2e-16 ***
# A0_PM2.5 0.067151 0.016679 4.026 5.72e-05 ***
# M_diabetes 0.067321 0.009508 7.080 1.54e-12 ***
# A0_PM2.5:M_diabetes -0.004491 0.026027 -0.173 0.863
#
# # Mediator regressions:
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(),
# data = getCall(x$reg.output$mreg[[1L]])$data,
# weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.92405 0.02353 -39.264 <2e-16 ***
# A0_PM2.5 0.58413 0.06501 8.986 <2e-16 ***
#
# # Mediator regressions for weighting (denominator):
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env +
# L1, family = binomial(), data = getCall(x$reg.output$wmdenomreg[[1L]])$data,
# weights = getCall(x$reg.output$wmdenomreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.37880 0.04872 -28.303 < 2e-16 ***
# A0_PM2.5 0.56260 0.06555 8.583 < 2e-16 ***
# L0_male 0.25861 0.04437 5.829 5.57e-09 ***
# L0_soc_env 0.33050 0.04744 6.967 3.23e-12 ***
# L1 0.33462 0.04744 7.054 1.74e-12 ***
#
# # Mediator regressions for weighting (nominator):
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5, family = binomial(),
# data = getCall(x$reg.output$wmnomreg[[1L]])$data,
# weights = getCall(x$reg.output$wmnomreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.93236 0.02358 -39.544 <2e-16 ***
# A0_PM2.5 0.62388 0.06481 9.627 <2e-16 ***
#
# # Exposure regression for weighting:
# Call:
# glm(formula = A0_PM2.5 ~ L0_male + L0_soc_env + L1, family = binomial(),
# data = getCall(x$reg.output$ereg)$data,
# weights = getCall(x$reg.output$ereg)$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.74236 0.07729 -35.484 < 2e-16 ***
# L0_male 0.40610 0.06448 6.298 3.01e-10 ***
# L0_soc_env 0.64079 0.07350 8.718 < 2e-16 ***
# L1 0.03257 0.06968 0.467 0.64
#
# # Effect decomposition on the mean difference scale via the marginal structural model
# Direct counterfactual imputation estimation with
# bootstrap standard errors, percentile confidence intervals and p-values
#
# Estimate Std.error 95% CIL 95% CIU P.val
# cde 0.0671509 0.0157808 0.0761499 0.097 <2e-16 ***
# pnde 0.0658727 0.0169713 0.0643983 0.087 <2e-16 *** <- PNDE
# tnde 0.0652889 0.0172698 0.0594343 0.083 <2e-16 ***
# pnie 0.0087514 0.0013389 0.0070189 0.009 <2e-16 ***
# tnie 0.0081675 0.0010405 0.0024559 0.004 <2e-16 *** <- TNDE
# te 0.0740402 0.0159309 0.0682520 0.090 <2e-16 ***
# intref -0.0012782 0.0011906 -0.0117517 -0.010 <2e-16 ***
# intmed -0.0005839 0.0002984 -0.0049640 -0.005 <2e-16 ***
# cde(prop) 0.9069522 0.0222813 1.0860362 1.116 <2e-16 ***
# intref(prop) -0.0172642 0.0439631 -0.1726807 -0.114 <2e-16 ***
# intmed(prop) -0.0078856 0.0162851 -0.0729153 -0.051 <2e-16 ***
# pnie(prop) 0.1181976 0.0379668 0.0786163 0.130 <2e-16 ***
# pm 0.1103120 0.0216818 0.0275800 0.057 <2e-16 ***
# int -0.0251498 0.0602482 -0.2455960 -0.165 <2e-16 ***
# pe 0.0930478 0.0222813 -0.1159712 -0.086 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (cde: controlled direct effect;
# pnde: pure natural direct effect;
# tnde: total natural direct effect;
# pnie: pure natural indirect effect;
# tnie: total natural indirect effect;
# te: total effect;
# intref: reference interaction;
# intmed: mediated interaction;
# cde(prop): proportion cde;
# intref(prop): proportion intref;
# intmed(prop): proportion intmed;
# pnie(prop): proportion pnie;
# pm: overall proportion mediated;
# int: overall proportion attributable to interaction;
# pe: overall proportion eliminated)
The CMAverse can also estimate the coefficients of the 2 MSM by (parametric) g-computation. For this estimation, we have to specify the following arguments model = gformula
, and that the counterfactual outcome and mediator values are simulated (estimation = "imputation"
). We don’t need to specify the ereg
, wmnomreg
and wmdenomreg
arguments which were used to calculate weights in the previous approach.
Note that the estimated models of the outcome and the mediator are conditional on baseline confounders \(L(0)\) and \(L(1)\).
## Using the CMAverse to estimate MSM estimated by g-comp
set.seed(1234) # note: there is randomness, even with estimation by IPTW
# => counterfactuals are imputed (estimation = "imputation")
res_msm_df1.qol <- cmest(data = df1_int,
model = "gformula",
outcome = "Y_qol",
exposure = "A0_PM2.5",
mediator = "M_diabetes",
basec = c("L0_male", "L0_soc_env","L1"),
postc = NULL,
EMint = TRUE, # E*M interaction
# ereg = "logistic", # # not needed with gcomp
yreg = "linear", # to get risk difference
mreg = list("logistic"), # mediation model g(M=1|L1,A,L0)
# wmnomreg = list("logistic"), # not needed with gcomp
# wmdenomreg = list("logistic"), ## not needed with gcomp
astar = 0, #E(Y_{A=0,M=1})
a = 1, #E(Y_{A=1,M=1})
mval = list(0), # mediator value at which the variable is controlled
estimation = "imputation",
inference = "bootstrap",
nboot = 2)
summary(res_msm_df1.qol)
# Causal Mediation Analysis
#
# # Outcome regression:
# Call:
# glm(formula = Y_qol ~ A0_PM2.5 + M_diabetes + A0_PM2.5 * M_diabetes +
# L0_male + L0_soc_env + L1, family = gaussian(), data = getCall(x$reg.output$yreg)$data,
# weights = getCall(x$reg.output$yreg)$weights)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 74.7669 0.2139 349.557 < 2e-16 ***
# A0_PM2.5 -3.7153 0.4160 -8.931 < 2e-16 ***
# M_diabetes -8.6317 0.2385 -36.197 < 2e-16 ***
# L0_male -0.7235 0.2018 -3.586 0.000337 ***
# L0_soc_env -2.8899 0.2112 -13.684 < 2e-16 ***
# L1 -3.4280 0.2212 -15.494 < 2e-16 ***
# A0_PM2.5:M_diabetes -5.6154 0.6514 -8.621 < 2e-16 ***
#
# # Mediator regressions:
# Call:
# glm(formula = M_diabetes ~ A0_PM2.5 + L0_male + L0_soc_env +
# L1, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data,
# weights = getCall(x$reg.output$mreg[[1L]])$weights)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.37880 0.04872 -28.303 < 2e-16 ***
# A0_PM2.5 0.56260 0.06555 8.583 < 2e-16 ***
# L0_male 0.25861 0.04437 5.829 5.57e-09 ***
# L0_soc_env 0.33050 0.04744 6.967 3.23e-12 ***
# L1 0.33462 0.04744 7.054 1.74e-12 ***
#
# # Effect decomposition on the mean difference scale via the g-formula approach
#
# Direct counterfactual imputation estimation with
# bootstrap standard errors, percentile confidence intervals and p-values
#
# Estimate Std.error 95% CIL 95% CIU P.val
# cde -3.715265 0.114707 -3.084527 -2.930 <2e-16 ***
# pnde -5.103390 0.008919 -4.820137 -4.808 <2e-16 ***
# tnde -5.659875 0.006095 -5.579563 -5.571 <2e-16 ***
# pnie -0.855400 0.040149 -1.006846 -0.953 <2e-16 ***
# tnie -1.411886 0.025135 -1.758083 -1.724 <2e-16 ***
# te -6.515276 0.034054 -6.578220 -6.532 <2e-16 ***
# intref -1.388125 0.105788 -1.877736 -1.736 <2e-16 ***
# intmed -0.556485 0.015014 -0.771408 -0.751 <2e-16 ***
# cde(prop) 0.570239 0.015115 0.448589 0.469 <2e-16 ***
# intref(prop) 0.213057 0.017570 0.263846 0.287 <2e-16 ***
# intmed(prop) 0.085412 0.002894 0.114201 0.118 <2e-16 ***
# pnie(prop) 0.131292 0.005348 0.145871 0.153 <2e-16 ***
# pm 0.216704 0.002454 0.263960 0.267 <2e-16 ***
# int 0.298469 0.020463 0.378048 0.406 <2e-16 ***
# pe 0.429761 0.015115 0.531104 0.551 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (cde: controlled direct effect;
# pnde: pure natural direct effect;
# tnde: total natural direct effect;
# pnie: pure natural indirect effect;
# tnie: total natural indirect effect;
# te: total effect; intref:
# reference interaction;
# intmed: mediated interaction;
# cde(prop): proportion cde;
# intref(prop): proportion intref;
# intmed(prop): proportion intmed;
# pnie(prop): proportion pnie;
# pm: overall proportion mediated;
# int: overall proportion attributable to interaction;
# pe: overall proportion eliminated)
8.5 “Unified” approach for estimating Natural Direct and Indirect effects
(Lange, Vansteelandt, and Bekaert 2012) suggested to use a “unified” approach for estimating Natural Direct and Indirect effects (when they are identifiable), based on the following Marginal Structural Model: \[\begin{equation*} h\left[\mathbb{E}(Y_{a,M_{a^*}})\right] = \gamma_0 + \gamma_1 a + \gamma_2 a^* + \gamma_3 (a \times a^*) \text{ where } h \text{ is a link function} \end{equation*}\]
So that we can express the PNDE as: \[\begin{align*} \text{PNDE} &= \mathbb{E}(Y_{1,M_{0}}) - \mathbb{E}(Y_{0,M_{0}}) \\ \text{PNDE} &= h^{-1}\left(\gamma_0 + \gamma_1 \right) - h^{-1}\left(\gamma_0 \right) \\ \text{PNDE} &= \gamma_1 \text{ if the link function is the identity function} \end{align*}\]
and we can express the TNIE as: \[\begin{align*} \text{TNIE} &= \mathbb{E}(Y_{1,M_{1}}) - \mathbb{E}(Y_{1,M_{0}}) \\ \text{TNIE} &= h^{-1}\left(\gamma_0 + \gamma_1 + \gamma_2 + \gamma_3 \right) - h^{-1}\left(\gamma_0 + \gamma_1 \right) \\ \text{TNIE} &= \gamma_2 + \gamma_3 \text{ if the link function is the identity function} \end{align*}\]
Note that this model can also be extended to study effect modifications by baseline confounders.
8.5.1 Estimation of the MSM by IPTW
8.5.1.1 Manual calculation
The estimation of the “unified” MSM for Natural Direct and Indirect effects relies on the following steps:
- Estimate suitable models for the exposure conditional on baseline confounders (\(L(0)\) and \(L(1)\)), using the original data set, in order to compute weights for the exposure \(A\)
- Estimate a suitable model for the mediator conditional on baseline confounders (\(L(0)\) and \(L(1)\)), using the original data set, in order to compute weights for the mediator \(M\)
- Construct a new data set by repeating the original data set twice, and including an additional variable \(A^*\), equal to the original exposure in the 1st data set, and equal to \(1-A\) in the 2nd data set (for multicategorical exposures, see supplementary material of (Lange, Vansteelandt, and Bekaert 2012)).
- Compute weights by applying the fitted models from steps 1 and 2 \[\begin{align*} w_i &= \frac{1}{P(A=A_i\mid L(0)=L(0)_i,L(1)=L(1)_i)} \times \frac{P(M=M_i \mid A = A^*_i,L(0)=L(0)_i,L(1)=L(1)_i)}{P(M=M_i \mid A = A_i,L(0)=L(0)_i,L(1)=L(1)_i)} \\ \text{ or } sw_i &= \frac{P(A=A_i)}{P(A=A_i\mid L(0)=L(0)_i,L(1)=L(1)_i)} \times \frac{P(M=M_i \mid A = A^*_i,L(0)=L(0)_i,L(1)=L(1)_i)}{P(M=M_i \mid A = A_i,L(0)=L(0)_i,L(1)=L(1)_i)} \\ \end{align*}\]
- Estimate the unified MSM as a weighted model of the outcome including only \(A\) and \(A^*\) (and their interaction), weighted by simple weights or stabilized weights. Use the estimated coefficients of the MSM to calculate the PNDE and TNIE
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## Step 1. Estimate a suitable model for the exposure conditional on confounders,
## using the original data set
# for A weight numerator
g.A <- glm(A0_PM2.5 ~ 1,
family = "binomial", data = df1_int)
# for A weight denominator
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # L1 not necessary in our example
family = "binomial", data = df1_int)
## Step 2. Estimate a suitable model for the mediator conditional on confounders,
## using the original data set
df1_int$A0_PM2.5_temp <- df1_int$A0_PM2.5
g.M.AL0 <- glm(M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5_temp + L1,
family = "binomial", data = df1_int)
## Step 3. Construct a new data set by repeating the original data set twice,
## and including an additional variable A*, equal to the original exposure
## in the 1st data set, and equal to 1-A in the 2nd data set
## (note for multicategorical exposure, see supplementary material of Lange et al)
# create identifier
df1_int$id <- 1:nrow(df1_int)
# dupplicate the original data set
df1_int.1 <- df1_int.2 <- df1_int
# Add an A* variable
df1_int.1$A0_PM2.5_star <- df1_int$A0_PM2.5 # A* = A in the 1st data set
df1_int.2$A0_PM2.5_star <- 1 - df1_int$A0_PM2.5 # A* = 1 - A in the 2d data set
# stack the 2 new datasets
df1_int.new <- rbind(df1_int.1,df1_int.2)
# sort the data set by the id-variable to use geeglm
# (necessary to use geepack and the geeglm() later)
df1_int.new <- df1_int.new[order(df1_int.new$id), ]
head(df1_int.new, 12)
# L0_male L0_soc_env A0_PM2.5 L1 M_diabetes Y_death Y_qol A0_PM2.5_temp id A0_PM2.5_star
# 1 0 1 0 1 0 0 93.4 0 1 0
# 10001 0 1 0 1 0 0 93.4 0 1 1
# 2 1 1 1 1 0 1 64.0 1 2 1
# 10002 1 1 1 1 0 1 64.0 1 2 0
# 3 1 1 0 0 0 0 75.6 0 3 0
# 10003 1 1 0 0 0 0 75.6 0 3 1
# 4 1 0 0 0 0 0 89.8 0 4 0
# 10004 1 0 0 0 0 0 89.8 0 4 1
# 5 1 1 0 0 0 0 77.2 0 5 0
# 10005 1 1 0 0 0 0 77.2 0 5 1
# 6 1 1 0 0 1 0 73.9 0 6 0
# 10006 1 1 0 0 1 0 73.9 0 6 1
## Step 4. Compute weights by applying the fitted models from steps 1 and 2
# weights for A
g.Ais1 <- predict(g.A, newdata = df1_int.new, type = "response")
sw_A.num <- rep(NA, nrow(df1_int.new))
sw_A.num[df1_int.new$A0_PM2.5==1] <- g.Ais1[df1_int.new$A0_PM2.5==1]
sw_A.num[df1_int.new$A0_PM2.5==0] <- (1 - g.Ais1[df1_int.new$A0_PM2.5==0])
g.Ais1.L0 <- predict(g.A.L0, newdata = df1_int.new, type = "response")
sw_A.denom <- rep(NA, nrow(df1_int.new))
sw_A.denom[df1_int.new$A0_PM2.5==1] <- g.Ais1.L0[df1_int.new$A0_PM2.5==1]
sw_A.denom[df1_int.new$A0_PM2.5==0] <- (1 - g.Ais1.L0[df1_int.new$A0_PM2.5==0])
w_A <- 1 / sw_A.denom
sw_A <- sw_A.num / sw_A.denom
# weights for M|A*
df1_int.new$A0_PM2.5_temp <- df1_int.new$A0_PM2.5_star
g.Mis1.Astar <- predict(g.M.AL0, newdata = df1_int.new, type = "response")
sw_M.Astar <- rep(NA, nrow(df1_int.new))
sw_M.Astar[df1_int.new$M_diabetes==1] <- g.Mis1.Astar[df1_int.new$M_diabetes==1]
sw_M.Astar[df1_int.new$M_diabetes==0] <- 1 - g.Mis1.Astar[df1_int.new$M_diabetes==0]
# weight for M|A
df1_int.new$A0_PM2.5_temp <- df1_int.new$A0_PM2.5
g.Mis1.A <- predict(g.M.AL0, newdata = df1_int.new, type = "response")
sw_M.A <- rep(NA, nrow(df1_int.new))
sw_M.A[df1_int.new$M_diabetes==1] <- g.Mis1.A[df1_int.new$M_diabetes==1]
sw_M.A[df1_int.new$M_diabetes==0] <- 1 - g.Mis1.A[df1_int.new$M_diabetes==0]
# non-stabilized weight
w <- w_A * (sw_M.Astar / sw_M.A)
# stabilized weight
sw <- sw_A * (sw_M.Astar / sw_M.A)
boxplot(data.frame(w, sw))
## Step 5. Fit a suitable model to the outcome including only A and A* (and their
## interaction) using a weighted regression
# we will use a GEE model in order to get "robust" standard errors
library(geepack)
# note: be careful that individuals in the "df1_int.new" data set
# and in the "w" vector have the same order
## for a risk difference in death
MSM.model.death <- geeglm(Y_death ~ A0_PM2.5 * A0_PM2.5_star,
weights = sw,
family = "gaussian",
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.death)
# Coefficients:
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 0.198840 0.004251 2187.61 < 2e-16 ***
# A0_PM2.5 0.065950 0.014539 20.58 5.7e-06 ***
# A0_PM2.5_star 0.008534 0.001242 47.25 6.2e-12 ***
# A0_PM2.5:A0_PM2.5_star -0.000406 0.003762 0.01 0.91
## Calculate PNDE and PNIE using the MSM's coefficients
unif.PNDE.death <- coef(MSM.model.death)["A0_PM2.5"]
# 0.066
unif.TNIE.death <- (coef(MSM.model.death)["A0_PM2.5_star"] +
coef(MSM.model.death)["A0_PM2.5:A0_PM2.5_star"])
# 0.00813
## For Quality of life
MSM.model.qol <- geeglm(Y_qol ~ A0_PM2.5 * A0_PM2.5_star,
weights = sw,
family = "gaussian",
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.qol)
# Coefficients:
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 69.0853 0.1174 346240.4 < 2e-16 ***
# A0_PM2.5 -5.3771 0.4010 179.8 < 2e-16 ***
# A0_PM2.5_star -1.0769 0.0311 1197.2 < 2e-16 ***
# A0_PM2.5:A0_PM2.5_star -0.6947 0.0909 58.4 2.2e-14 ***
## Calculate PNDE and PNIE using the MSM's coefficients
unif.PNDE.qol <- coef(MSM.model.qol)["A0_PM2.5"]
# -5.38
unif.TNIE.qol <- (coef(MSM.model.qol)["A0_PM2.5_star"] +
coef(MSM.model.qol)["A0_PM2.5:A0_PM2.5_star"])
# -1.77
# 95% confidence intervals can also be computed by bootstrap
8.5.1.2 Using the medflex
package
We can used the medflex
package to apply this approach, as described below.
The “unified” MSM can be defined as a model of the expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1))\) conditional on the baseline confounders \(L(0)\) and \(L(1)\), or as a model the marginal expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}}\).
For estimation by IPTW, the data is expanded using the neWeight
function, and the MSM is estimated using the neModel
function.Population average estimations are obtained using a weighted regression, where the weights are calculated using a model of the exposure, stated at the xFit
argument inside the neModel
function.
Standard errors can be computed by bootstrap or as robust “Sandwich” standard errors.
library(medflex)
# this package can used to estimate Natural direct and indirect effects when they are
# identifiable
### estimate PNDE and TNDE, conditional on L0 and L1
# ---------------------------------------------------------------------------- #
# get back to the original data set
df1_int <- subset(df1_int, select = -c(A0_PM2.5_temp, id))
## 1) Expand the dataset and calculate weights using the neWeight() function
g.M <- glm(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
L0_male + L0_soc_env + L1, # then add baseline confounders
family = "binomial", data = df1_int)
# expend the data set and add a second column A* = 1 - A
# and calculate the weights = sw_M.Astar / sw_M.A
exp.Data <- neWeight(g.M)
# it is also possible to use the neWeight function directly:
# exp.Data <- neWeight(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
# L0_male + L0_soc_env + L1, # then add baseline confounders
# family = "binomial", data = df1_int)
class(exp.Data)
# [1] "data.frame" "expData" "weightData"
head(exp.Data)
# id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1 1 0 0 0 1 1 0 0 93.4
# 2 1 0 1 0 1 1 0 0 93.4
# 3 2 1 1 1 1 1 0 1 64.0
# 4 2 1 0 1 1 1 0 1 64.0
# 5 3 0 0 1 1 0 0 0 75.6
# 6 3 0 1 1 1 0 0 0 75.6
# the new variables A0_PM2.50 and A0_PM2.51 correspond to A and A*
# check the weights:
w <- weights(exp.Data)
boxplot(w)
# Note that we estimated the same weights by hand previously:
head(data.frame(w, sw_M.Astar / sw_M.A), 6)
# w sw_M.Astar.sw_M.A
# 1 1.000 1.000
# 2 0.801 0.801
# 3 1.000 1.000
# 4 1.293 1.293
# 5 1.000 1.000
# 6 0.809 0.809
plot(w, sw_M.Astar / sw_M.A)
abline(0,1)
## 2) Fit the natural effect model using the neModel() function
## (adjusted for baseline confounders)
## for death
set.seed(1234)
neMod.death <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
family = "gaussian", # to estimate risk difference
expData = exp.Data,
se = c("bootstrap"), # or "robust"
nBoot = 10, # use >= 1000 samples, if se = bootstrap
)
summary(neMod.death)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.10982 0.00807 13.61 < 2e-16 ***
# A0_PM2.501 0.06576 0.01642 4.01 6.2e-05 ***
# A0_PM2.511 0.00836 0.00214 3.91 9.1e-05 ***
# L0_male 0.05041 0.00741 6.80 1.0e-11 ***
# L0_soc_env 0.06113 0.00609 10.03 < 2e-16 ***
# L1 0.08341 0.01281 6.51 7.5e-11 ***
# A0_PM2.501:A0_PM2.511 0.00238 0.00446 0.53 0.59
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.death.L0 <- coef(neMod.death)["A0_PM2.501"]
# 0.0658
medflex.TNIE.death.L0 <- (coef(neMod.death)["A0_PM2.511"] +
coef(neMod.death)["A0_PM2.501:A0_PM2.511"])
# 0.0107
# 95% CI of the TNIE (which combines 2 coefficients) ?
# cov(b1,b2) = var(b1) + var(b2) + 2 * cov(b1,b2)
c(medflex.TNIE.death.L0 -
qnorm(0.975) * sqrt(var(neMod.death$bootRes$t[,3]) + # column of A0_PM2.511
var(neMod.death$bootRes$t[,7]) + # column of A0_PM2.501:A0_PM2.511
2 * var(neMod.death$bootRes$t[,c(3,7)])[1,2]),
medflex.TNIE.death.L0 +
qnorm(0.975) * sqrt(var(neMod.death$bootRes$t[,3]) +
var(neMod.death$bootRes$t[,7]) +
2 * var(neMod.death$bootRes$t[,c(3,7)])[1,2]))
# A0_PM2.511 A0_PM2.511
# 0.0027 0.0188
## for quality of life
neMod.qol <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
family = "gaussian", # to estimate risk difference
expData = exp.Data,
se = c("robust") # or "bootstrap"
# nBoot = 1000, # if se = bootstrap
)
summary(neMod.qol)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 73.187 0.226 324.09 < 2e-16 ***
# A0_PM2.501 -5.336 0.338 -15.77 < 2e-16 ***
# A0_PM2.511 -1.069 0.136 -7.87 3.4e-15 ***
# L0_male -1.231 0.223 -5.53 3.3e-08 ***
# L0_soc_env -3.546 0.230 -15.41 < 2e-16 ***
# L1 -4.101 0.243 -16.87 < 2e-16 ***
# A0_PM2.501:A0_PM2.511 -0.765 0.126 -6.09 1.1e-09 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.qol.L0 <- coef(neMod.qol)["A0_PM2.501"]
# -5.34
medflex.TNIE.qol.L0 <- (coef(neMod.qol)["A0_PM2.511"] +
coef(neMod.qol)["A0_PM2.501:A0_PM2.511"])
# -1.83
# 95% CI of the TNIE (which combines 2 coefficients) ?
# cov(b1,b2) = var(b1) + var(b2) + 2cov(b1,b2)
c(medflex.TNIE.qol.L0 -
qnorm(0.975) * sqrt(neMod.qol$vcov["A0_PM2.511","A0_PM2.511"] +
neMod.qol$vcov["A0_PM2.501:A0_PM2.511","A0_PM2.501:A0_PM2.511"] +
2 * neMod.qol$vcov["A0_PM2.511","A0_PM2.501:A0_PM2.511"]),
medflex.TNIE.qol.L0 +
qnorm(0.975) * sqrt(neMod.qol$vcov["A0_PM2.511","A0_PM2.511"] +
neMod.qol$vcov["A0_PM2.501:A0_PM2.511","A0_PM2.501:A0_PM2.511"] +
2 * neMod.qol$vcov["A0_PM2.511","A0_PM2.501:A0_PM2.511"]))
# A0_PM2.511 A0_PM2.511
# -2.30 -1.37
# plot results
plot(neMod.qol)
### estimate marginal PNDE and TNDE (population average)
# ---------------------------------------------------------------------------- #
## 1) Expand the dataset and calculate weights using the neWeight() function
## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
family = "binomial", data = df1_int)
## Estimate a model of the mediator
g.M <- glm(M_diabetes ~ factor(A0_PM2.5) + # the 1st variable should be the exposure
L0_male + L0_soc_env + L1, # then add baseline confounders
family = "binomial", data = df1_int)
# expend the data set and add a second column A* = 1 - A
# and calculate the weights = sw_M.Astar / sw_M.A
exp.Data <- neWeight(g.M)
## 2) Fit the natural effect model using the neModel() function
## for death
neMod.death.pop <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
family = "gaussian", # to estimate risk difference
expData = exp.Data,
xFit = g.A.L0, # model of the exposure
se = c("robust")) # or "bootstrap"
summary(neMod.death.pop)
# Parameter estimates:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.198840 0.004247 46.82 < 2e-16 ***
# A0_PM2.501 0.065950 0.014346 4.60 4.3e-06 ***
# A0_PM2.511 0.008534 0.001619 5.27 1.4e-07 ***
# A0_PM2.501:A0_PM2.511 -0.000406 0.003744 -0.11 0.91
## estimate the population average PNDE and TNDE,
medflex.PNDE.death <- coef(neMod.death.pop)["A0_PM2.501"]
# 0.066
medflex.TNIE.death <- (coef(neMod.death.pop)["A0_PM2.511"] +
coef(neMod.death.pop)["A0_PM2.501:A0_PM2.511"])
# 0.00813
# the results are the same than results calculated by hand
## for quality of life
neMod.qol.pop <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
family = "gaussian", # to estimate risk difference
expData = exp.Data,
xFit = g.A.L0, # model of the exposure
se = c("robust")) # or "bootstrap"
summary(neMod.qol.pop)
# Parameter estimates:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 69.085 0.117 590.36 < 2e-16 ***
# A0_PM2.501 -5.377 0.350 -15.38 < 2e-16 ***
# A0_PM2.511 -1.077 0.137 -7.88 3.3e-15 ***
# A0_PM2.501:A0_PM2.511 -0.695 0.120 -5.80 6.7e-09 ***
## estimate the population average PNDE and TNDE,
medflex.PNDE.qol <- coef(neMod.qol.pop)["A0_PM2.501"]
# -5.38
medflex.TNIE.qol <- (coef(neMod.qol.pop)["A0_PM2.511"] +
coef(neMod.qol.pop)["A0_PM2.501:A0_PM2.511"])
# -1.77
# the results are the same than results calculated by hand
8.5.2 Estimation of the MSM by g-computation
8.5.2.1 Manual calculation
The “unified” MSM can also be estimated by g-computation. Note that for the estimation of the population average effects, calculation relies on a weighted regression, as with the IPTW approach.
We can apply the following steps:
- Estimate a suitable model for the outcome conditional on the exposure, mediator and confounders, using the original data set
- Construct a new data set by repeating the original data set twice, and including an additional variable \(A^*\), equal to the original exposure in both data sets, and change the value of \(A\) from \(A\) to \(1-A\) in the 2nd data set. Then impute counterfactuals \(\mathbb{E}(Y_{a,M_a*}|L(0),L(1))\)
- Estimate the unified MSM as a model of the outcome including \(A\) and \(A^*\) (and their interaction), adjusted for baseline confounders \(L(0)\) and \(L(1)\). In order to obtain population average model (unconditional on baseline confounders), coefficients can be estimated using a weighted regression (with simple weights or stabilized weights of the exposure). Use the estimated coefficients of the MSM to calculate the PNDE and TNIE.
rm(list=ls())
df1_int <- read.csv(file = "data/df1_int.csv")
## Step 1. Estimate a suitable model for the outcome conditional on the exposure,
## mediator and confounders, using the original data set
Q.Y.death <- glm(Y_death ~ L0_male + L0_soc_env + L1 + A0_PM2.5 * M_diabetes,
family = "binomial",
data = df1_int)
Q.Y.qol <- glm(Y_qol ~ L0_male + L0_soc_env + L1 + A0_PM2.5 * M_diabetes,
family = "gaussian",
data = df1_int)
## Step 2. Expand data and impute counterfactuals E(Y_{a,M_a*}|L(0),L(1))
## Expand data
# create identifier
df1_int$id <- 1:nrow(df1_int)
# dupplicate the original data set
df1_int.1 <- df1_int.2 <- df1_int
# Add an A* variable where A* = A
df1_int.1$A0_PM2.5_star <- df1_int$A0_PM2.5
df1_int.2$A0_PM2.5_star <- df1_int$A0_PM2.5
# for the 2nd data set, change the exposure A = 1 - A
df1_int.2$A0_PM2.5 <- 1 - df1_int$A0_PM2.5
# stack the 2 new datasets
df1_int.new <- rbind(df1_int.1,df1_int.2)
# sort the data set by the id-variable to use geeglm
# (necessary to use geepack and the geeglm() later)
df1_int.new <- df1_int.new[order(df1_int.new$id), ]
head(df1_int.new, 12)
# L0_male L0_soc_env A0_PM2.5 L1 M_diabetes Y_death Y_qol A0_PM2.5_temp id A0_PM2.5_star
# 1 0 1 0 1 0 0 93.4 0 1 0
# 10001 0 1 1 1 0 0 93.4 0 1 0
# 2 1 1 1 1 0 1 64.0 1 2 1
# 10002 1 1 0 1 0 1 64.0 1 2 1
# 3 1 1 0 0 0 0 75.6 0 3 0
# 10003 1 1 1 0 0 0 75.6 0 3 0
# 4 1 0 0 0 0 0 89.8 0 4 0
# 10004 1 0 1 0 0 0 89.8 0 4 0
# 5 1 1 0 0 0 0 77.2 0 5 0
# 10005 1 1 1 0 0 0 77.2 0 5 0
# 6 1 1 0 0 1 0 73.9 0 6 0
# 10006 1 1 1 0 1 0 73.9 0 6 0
## Impute counterfactuals
# death - we predict the probability P(Y=1|A,M,L(0),L(1))
Y.pred.death <- predict(Q.Y.death, newdata = df1_int.new, type = "response")
# qol - we can predict the mean E(Y|A,M,L(0),L(1))
Y.pred.qol <- predict(Q.Y.qol, newdata = df1_int.new, type = "response")
## Step 3. Fit a suitable model to the outcome incluing only A and A* (and their
## interaction)
library(geepack)
# for death (conditional)
MSM.model.death.1 <- geeglm(Y.pred.death ~ A0_PM2.5 * A0_PM2.5_star +
L0_male + L0_soc_env + L1,
family = "gaussian",
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.death.1)
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 0.104153 0.000611 29104 < 2e-16 ***
# A0_PM2.5 0.063543 0.000138 210638 < 2e-16 ***
# A0_PM2.5_star 0.007032 0.001112 40 2.5e-10 ***
# L0_male 0.054291 0.000706 5922 < 2e-16 ***
# L0_soc_env 0.065692 0.000692 9004 < 2e-16 ***
# L1 0.086446 0.000860 10100 < 2e-16 ***
# A0_PM2.5:A0_PM2.5_star 0.004918 0.000397 154 < 2e-16 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.death.L0 <- coef(MSM.model.death.1)["A0_PM2.5"]
# 0.0635
TNIE.death.L0 <- (coef(MSM.model.death.1)["A0_PM2.5_star"] +
coef(MSM.model.death.1)["A0_PM2.5:A0_PM2.5_star"])
# 0.012
# For quality of life (conditional)
MSM.model.qol.1 <- geeglm(Y.pred.qol ~ A0_PM2.5 * A0_PM2.5_star +
L0_male + L0_soc_env + L1,
family = "gaussian",
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.qol.1)
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 73.3371 0.0949 597466.3 < 2e-16 ***
# A0_PM2.5 -5.3013 0.0268 39033.2 < 2e-16 ***
# A0_PM2.5_star -1.0535 0.1353 60.6 6.9e-15 ***
# L0_male -1.3289 0.1038 164.0 < 2e-16 ***
# L0_soc_env -3.6467 0.1061 1180.6 < 2e-16 ***
# L1 -4.2325 0.1168 1312.8 < 2e-16 ***
# A0_PM2.5:A0_PM2.5_star -0.7920 0.0870 82.9 < 2e-16 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.qol.L0 <- coef(MSM.model.qol.1)["A0_PM2.5"]
# -5.3
TNIE.qol.L0 <- (coef(MSM.model.qol.1)["A0_PM2.5_star"] +
coef(MSM.model.qol.1)["A0_PM2.5:A0_PM2.5_star"])
# -1.85
## For population average estimations
## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
family = "binomial", data = df1_int)
g.Ais1.L0 <- predict(g.A.L0, newdata = df1_int.new, type = "response")
w <- rep(NA, nrow(df1_int.new))
w[df1_int.new$A0_PM2.5_star == 1] <- 1 / g.Ais1.L0[df1_int.new$A0_PM2.5_star == 1]
w[df1_int.new$A0_PM2.5_star == 0] <- 1 / (1 - g.Ais1.L0[df1_int.new$A0_PM2.5_star == 0])
# for death (population average), we use the weighted regression
MSM.model.death.2 <- geeglm(Y.pred.death ~ A0_PM2.5 * A0_PM2.5_star,
family = "gaussian",
weights = w,
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.death.2)
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 0.198882 0.000644 95250.3 < 2e-16 ***
# A0_PM2.5 0.063867 0.000138 213552.3 < 2e-16 ***
# A0_PM2.5_star 0.009213 0.002009 21.0 4.5e-06 ***
# A0_PM2.5:A0_PM2.5_star 0.002330 0.000445 27.4 1.7e-07 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.death <- coef(MSM.model.death.2)["A0_PM2.5"]
# 0.0639
TNIE.death <- (coef(MSM.model.death.2)["A0_PM2.5_star"] +
coef(MSM.model.death.2)["A0_PM2.5:A0_PM2.5_star"])
# 0.0115
# For quality of life (population average), we use the weighted regression
MSM.model.qol.2 <- geeglm(Y.pred.qol ~ A0_PM2.5 * A0_PM2.5_star,
family = "gaussian",
weights = w,
id = df1_int.new$id,
data = df1_int.new,
scale.fix = TRUE)
summary(MSM.model.qol.2)
# Estimate Std.err Wald Pr(>|W|)
# (Intercept) 69.0849 0.0492 1.97e+06 < 2e-16 ***
# A0_PM2.5 -5.3108 0.0269 3.88e+04 < 2e-16 ***
# A0_PM2.5_star -1.1690 0.1624 5.18e+01 6.1e-13 ***
# A0_PM2.5:A0_PM2.5_star -0.7395 0.0906 6.66e+01 3.3e-16 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
PNDE.qol.L0 <- coef(MSM.model.qol.2)["A0_PM2.5"]
# -5.31
TNIE.qol.L0 <- (coef(MSM.model.qol.2)["A0_PM2.5_star"] +
coef(MSM.model.qol.2)["A0_PM2.5:A0_PM2.5_star"])
# -1.91
8.5.2.2 Using the medflex
package
We can used the medflex
package to apply this approach, as described below.
The “unified” MSM can be defined as a model of the expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}} \mid L(0),L(1))\) conditional on the baseline confounders \(L(0)\) and \(L(1)\), or as a model the marginal expected counterfactual outcome \(\mathbb{E}(Y_{a,M_{a^*}}\).
For estimation by g-computation, the data is expanded using the neImpute
function, and the MSM is estimated using the neModel
function. Population average estimations are obtained using a weighted regression, where the weights are calcuated using a model of the exposure, stated at the xFit
argument inside the neModel
function.
Standard errors can be computed by bootstrap or as robust “Sandwich” standard errors.
library(medflex)
### estimate PNDE and TNDE, conditional on L0 and L1
# ---------------------------------------------------------------------------- #
# get back to the original data set
df1_int <- subset(df1_int, select = -c(id))
df1_int$A0_PM2.5 <- factor(df1_int$A0_PM2.5)
## 1) Fit a model for the outcome
Q.Y.death <- glm(Y_death ~ A0_PM2.5 + # start with the exposure
M_diabetes + A0_PM2.5:M_diabetes + # then the mediator
L0_male + L0_soc_env + L1, # then baseline confounders
family = "binomial",
data = df1_int)
Q.Y.qol <- glm(Y_qol ~ A0_PM2.5 + # start with the exposure
M_diabetes + A0_PM2.5:M_diabetes + # then the mediator
L0_male + L0_soc_env + L1, # then baseline confounders
family = "gaussian",
data = df1_int)
## 2) expend the data set and add a second column A* = 1 - A
exp.Data.death <- neImpute(Q.Y.death)
head(exp.Data.death)
# id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1 1 0 0 0 1 1 0 0.222 93.4
# 2 1 1 0 0 1 1 0 0.292 93.4
# 3 2 1 1 1 1 1 0 0.356 64.0
# 4 2 0 1 1 1 1 0 0.277 64.0
# 5 3 0 0 1 1 0 0 0.197 75.6
# 6 3 1 0 1 1 0 0 0.261 75.6
exp.Data.qol <- neImpute(Q.Y.qol)
head(exp.Data.qol)
# id A0_PM2.50 A0_PM2.51 L0_male L0_soc_env L1 M_diabetes Y_death Y_qol
# 1 1 0 0 0 1 1 0 0 68.4
# 2 1 1 0 0 1 1 0 0 64.7
# 3 2 1 1 1 1 1 0 1 64.0
# 4 2 0 1 1 1 1 0 1 67.7
# 5 3 0 0 1 1 0 0 0 71.2
# 6 3 1 0 1 1 0 0 0 67.4
# the predicted outcomes by medflex are the same than our previous predictions
plot(exp.Data.death$Y_death, Y.pred.death)
abline(0,1)
plot(exp.Data.qol$Y_qol, Y.pred.qol)
abline(0,1)
## 3) Fit the natural effect model using the neModel() function
## (adjusted for baseline confounders)
## for death
neMod.death <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
family = "gaussian", # to estimate risk difference
expData = exp.Data.death,
se = c("robust") # or "bootstrap"
)
summary(neMod.death)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.10415 0.00829 12.56 < 2e-16 ***
# A0_PM2.501 0.06354 0.01373 4.63 3.7e-06 ***
# A0_PM2.511 0.00703 0.00175 4.01 6.1e-05 ***
# L0_male 0.05429 0.00867 6.26 3.8e-10 ***
# L0_soc_env 0.06569 0.00882 7.45 9.3e-14 ***
# L1 0.08645 0.00999 8.66 < 2e-16 ***
# A0_PM2.501:A0_PM2.511 0.00492 0.00391 1.26 0.21
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.death.L0 <- coef(neMod.death)["A0_PM2.501"]
# 0.0635
medflex.TNIE.death.L0 <- (coef(neMod.death)["A0_PM2.511"] +
coef(neMod.death)["A0_PM2.501:A0_PM2.511"])
# 0.012
## for quality of life
neMod.qol <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51 + L0_male + L0_soc_env + L1,
family = "gaussian", # to estimate risk difference
expData = exp.Data.qol,
se = c("robust")) # or "bootstrap"
summary(neMod.qol)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 73.337 0.230 319.17 < 2e-16 ***
# A0_PM2.501 -5.301 0.340 -15.61 < 2e-16 ***
# A0_PM2.511 -1.054 0.139 -7.58 3.6e-14 ***
# L0_male -1.329 0.227 -5.85 5.0e-09 ***
# L0_soc_env -3.647 0.235 -15.54 < 2e-16 ***
# L1 -4.233 0.249 -17.01 < 2e-16 ***
# A0_PM2.501:A0_PM2.511 -0.792 0.127 -6.21 5.1e-10 ***
## estimate the conditional PNDE and TNDE, given L(0)=0 and L(1)=0
medflex.PNDE.qol.L0 <- coef(neMod.qol)["A0_PM2.501"]
# -5.3
medflex.TNIE.qol.L0 <- (coef(neMod.qol)["A0_PM2.511"] +
coef(neMod.qol)["A0_PM2.501:A0_PM2.511"])
# -1.85
### estimate marginal PNDE and TNDE (population average)
# ---------------------------------------------------------------------------- #
## Estimate a model of the exposure
g.A.L0 <- glm(A0_PM2.5 ~ L0_male + L0_soc_env + L1, # will no work without L(1) !
family = "binomial", data = df1_int)
## 3) Fit the natural effect model using the neModel() function
## for death
neMod.death.pop <- neModel(Y_death ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
family = "gaussian", # to estimate risk difference
expData = exp.Data.death,
xFit = g.A.L0, # model of the exposure
se = c("robust")) # or "bootstrap"
summary(neMod.death.pop)
# Parameter estimates:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.19888 0.00425 46.83 < 2e-16 ***
# A0_PM2.501 0.06387 0.01378 4.64 3.6e-06 ***
# A0_PM2.511 0.00921 0.00169 5.45 5.1e-08 ***
# A0_PM2.501:A0_PM2.511 0.00233 0.00367 0.64 0.53
## estimate the population average PNDE and TNDE,
medflex.PNDE.death <- coef(neMod.death.pop)["A0_PM2.501"]
# 0.0639
medflex.TNIE.death <- (coef(neMod.death.pop)["A0_PM2.511"] +
coef(neMod.death.pop)["A0_PM2.501:A0_PM2.511"])
# 0.0115
# the results are the same than results calculated by hand
## for quality of life
neMod.qol.pop <- neModel(Y_qol ~ A0_PM2.50 * A0_PM2.51, # no need for L(0),L(1)
family = "gaussian", # to estimate risk difference
expData = exp.Data.qol,
xFit = g.A.L0, # model of the exposure
se = c("robust")) # or "bootstrap"
summary(neMod.qol.pop)
# Parameter estimates:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 69.085 0.117 590.50 < 2e-16 ***
# A0_PM2.501 -5.311 0.339 -15.65 < 2e-16 ***
# A0_PM2.511 -1.169 0.145 -8.09 6.1e-16 ***
# A0_PM2.501:A0_PM2.511 -0.740 0.125 -5.92 3.2e-09 ***
## estimate the population average PNDE and TNDE,
medflex.PNDE.qol <- coef(neMod.qol.pop)["A0_PM2.501"]
# -5.31
medflex.TNIE.qol <- (coef(neMod.qol.pop)["A0_PM2.511"] +
coef(neMod.qol.pop)["A0_PM2.501:A0_PM2.511"])
# -1.91
# the results are the same than results calculated by hand