Chapter 9 Targeted Maximum Likelihood Estimation (TMLE)

When estimating a mean counterfactual outcome using g-computation methods, we have to estimate some \(\bar{Q}\) functions (functions of the outcome conditional on the exposures and confounders, \(\bar{Q}=\mathbb{E}\left(Y\mid A,L(0)\right)\)). For example, the Average Total Effect (ATE) is defined as a marginal effect, estimated using the empirical mean of such \(\bar{Q}\) functions: \[\begin{equation*} \hat{\Psi}^{\text{ATE}}_{\text{gcomp}} = \frac{1}{n} \sum_{i=1}^n \left[ \hat{\overline{Q}}(A=1)_i - \hat{\overline{Q}}(A=0)_i \right] \end{equation*}\]

Unless the \(\bar{Q}\) functions are not misspecified, its estimate is expected to be biased (and \(\bar{Q}\) are expected to be misspecified, especially if the set of baseline confounders \(L(0)\) is high dimensional, for example if it includes is a large number of variables or continuous variables). In order to improve the estimation of \(\bar{Q}(A,L)\), it is possible to use data-adaptive methods (machine learning algorithms) in order to optimize the bias-variance trade-off. However, this bias-variance trade-off would be optimized for the \(\bar{Q}\) functions, not for the ATE estimate \(\hat{\Psi}^\text{ATE}_\text{gcomp}\). If the \(\bar{Q}\) function is unknown and has to be estimated (preferably by data-adaptive methods), it can be shown that the g-computation estimate of \(\Psi^\text{ATE}\) is asymptotically biased.

The Targeted Maximum Likelihood Estimation (TMLE) method has been developed as an asymptotically linear estimator, so that the estimation of any target parameter in any semiparametric statistical model is unbiased and efficient. In order to estimate a parameter \(\Psi(P_0)\), where \(P_0\) is an unknown probability distribution among a set \(\mathcal{M}\) of possible statistical models, the TMLE is described as a two-step procedure (Laan and Rose 2011):

The first step is to obtain an initial estimate of the relevant part (\(\bar{Q}_0\) in our applications) of the probability distribution \(P_0\). Data adaptive methods (machine learning algorithms) can be used to optimize this first step.
The second step is to update the initial fit in order to “target toward making an optimal bias-variance tradeoff for the parameter of interest” \(\Psi(\bar{Q})\).

Several R packages have been developed in order to carry out TMLE estimation of causal effects. We will begin using the ltmle package, as it can be used to estimate ATE or CDE. More generally, this package can be used to estimate the counterfactual effects of repeated exposure in time-to-event settings. In the setting of mediation analysis, a controlled direct effect (CDE) corresponds to a sequence of counterfactual interventions on 2 “exposure variables”: the initial exposure \(A\) and the mediator of interest \(M\). The package can also be used in simpler settings with only one binary or continuous outcome, measure only once at the end a the study.

9.1 TMLE for the ATE

In order to illustrate the TMLE procedure, the estimation of a mean counterfactual outcome, denoted \(\Psi(A=1) = \mathbb{E} \left[\bar{Q}(A=1,L(0))\right]\), will be described in detail, following the algorithm implemented in the ltmle package.

The basic steps of the procedure are the following (Laan and Rose 2011):

Estimate \(\bar{Q}_0\). Data-adaptive methods can be used here, the ltmle package relies on the SuperLearner package to fit and predict \(\hat{\bar{Q}}(A=1)\).
Estimate the treatment mechanism \(g(A=1 \mid L(0))\). Once again, data-adaptive methods can be used to improve the estimation.
The initial estimator of \(\bar{Q}_0(A=1)\) will be slightly modified using a parametric fluctuation model, in order to reduce the bias when estimating the ATE. For example, the following parametric model of \(\bar{Q}_0(A=1)\) and a “clever covariate” \(H = \frac{I(A=1)}{\hat{g}(A=1 \mid L(0))}\) can be applied: \[\begin{equation*} \text{logit} P(Y \mid \hat{\bar{Q}}, H) = \hat{\text{logit} \bar{Q}} + \varepsilon H \end{equation*}\] The parametric fluctuation model is chosen so that the derivative of its log-likelihood loss function is equal to the appropriate component of the efficient influence curve of the target parameter \(\Psi(A=1)\).
Modify the initial estimator of \(\bar{Q}_0(A=1)\) with the parametric fluctuation model (using the estimation \(\hat{\varepsilon}\) from the previous step). We denote \(\hat{\bar{Q}}^*(A=1)\) the updated value of \(\hat{\bar{Q}}(A=1)\)
Use the updated values \(\hat{\bar{Q}}^*(A=1)\) in the substitution estimator to estimate the target parameter \(\Psi(A=1)\) : \[\begin{equation*} \hat{\Psi}(A=1)_\text{TMLE} = \frac{1}{n} \sum_{i=1}^n \hat{\bar{Q}}^* (A=1,L(0)) \end{equation*}\]
Estimate the efficient influence curve \(D^*(Q_0,g_0)\) : \[\begin{equation*} D^*(Q_0,g_0) = \frac{I(A=1)}{g_0(A=1 \mid L(0))}(Y - \bar{Q}_0(A,L(O))) + \bar{Q}_0(A=1,L(0)) - \Psi(A=1) \end{equation*}\]

The variance of the target parameter can then be calculated using the variance of the efficient influence curve: \[\begin{equation*} \text{var} \hat{\Psi}(A=1)_\text{TMLE} = \frac{\text{var} \hat{D}^*}{n} \end{equation*}\]

rm(list=ls())
df2_int <- read.csv(file = "./data/df2_int.csv")

## 1) Estimate Qbar and predict Qbar when A0_PM2.5 is set to 1
Q.fit <- glm(Y_death ~ A0_PM2.5 + L0_male + L0_soc_env,
             family = "binomial", data = df2_int)
data.A1 <- df2_int
data.A1$A0_PM2.5 <- 1

# predict the Qvar function when setting the exposure to A=1, on the logit scale
logitQ <- predict(Q.fit, newdata = data.A1, type = "link")

## 2) Estimate the treatment mechanism
g.L <- glm(A0_PM2.5 ~ L0_male + L0_soc_env,
           family = "binomial", data = df2_int)
# predict the probabilities g(A=1 | L(0)) = P(A0_PM2.5=1|L(0))
g1.L <- predict(g.L, type="response")

head(g1.L)
#          1          2          3          4          5          6
# 0.10989220 0.15629749 0.15629749 0.08894074 0.15629749 0.15629749

# It is useful to check the distribution of gA.L, as values close to 0 or 1 are 
# indicators of near positivity violation and can result in large variance for the 
# estimation. 
# In case of near positivity violation, gA.L values can be truncated to decrease
# the variance (at the cost a increased bias).
summary(g1.L)
#     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.06109 0.08894 0.10989 0.11240 0.15630 0.15630
# there is no positivity issues in this example.

## 3) Determine a parametric family of fluctuations of Qbar. 
# The fluctuation model is a model of logitQbar and g(A=1|L(0)) 

# The clever covariate H(A,L(0)) depends on g(A=1|L(0)):
H <- (df2_int$A0_PM2.5 == 1) / g1.L

# Update the initial fit Qbar from step 1.
# This is achieved by holding Qbar fixed (as intercept) while estimating the
# coefficient epsilon for H

# for example we could use the following fluctuation model (from the "Targeted 
# Learning" book)
update.fit <- glm(df2_int$Y_death ~ -1 + offset(logitQ) + H,
                  family = "quasibinomial")
# Coefficients:
#   H
# -0.0001756
Qstar <- predict(update.fit, data = data.frame(logitQ, H), type = "response")

# In the ltmle package, the fluctuation parametric model is slightly different
# (but with the same purpose). The "clever covariate" H is scaled and used as a 
# weight in the parametric quasi-logistic regression
S1 <- rep(1, nrow(df2_int))
update.fit.ltmle <- glm(df2_int$Y_death ~ -1 + S1 + offset(logitQ),
                        family = "quasibinomial",
                        weights = scale(H, center = FALSE))
# Coefficients:
#   S1
# -0.001667

## 4) Update the initial estimate of Qbar using the fluctuation parametric model
Qstar.tmle <- predict(update.fit.ltmle, 
                      data = data.frame(logitQ, H), 
                      type = "response")


head(Qstar.tmle)
#         1         2         3         4         5         6 
# 0.2872412 0.3441344 0.3441344 0.2591356 0.3441344 0.3441344

## 5) Obtain the substition estimator of Psi_Ais1
Psi_Ais1 <- mean(Qstar.tmle)
# [1] 0.2871408

## 5) Calculate standard errors based on the influence curve of the TMLE
IC <- H * (df2_int$Y_death - Qstar.tmle) + Qstar.tmle - Psi_Ais1
head(IC)
# 0.0001003559  4.2532581791  0.0569935644 -0.0280052148  0.0569935644  0.056993564

# The standard error of the target parameter Psi(A=1) can be estimated by :
sqrt(var(IC)/nrow(df2_int))
# [1] 0.01383821

We can see that we can get the same output using the ltmle package:

rm(list=ls())
df2_int <- read.csv(file = "./data/df2_int.csv")

library(ltmle)
?ltmle

# The Qform and gform arguments are defined from the DAG
Qform <- c(Y_death="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env")

# in the ltmle package, the data set should be formated so that the order of the 
# columns corresponds to the time-ordering of the model
data_ltmle <- subset(df2_int, 
                     select = c(L0_male, L0_soc_env, A0_PM2.5, Y_death))

# the counterfactual intervention is defined in the abar argument
abar <- 1

Psi_Ais1 <- ltmle(data_ltmle,
                  Anodes = "A0_PM2.5",
                  Ynodes = "Y_death",
                  Qform = Qform,
                  gform = gform,
                  gbounds = c(0.01, 1), # by default, g function are truncated at 0.01
                  abar = abar,
                  SL.library = "glm",
                  variance.method = "ic")

# from the ltmle() function, we can get the point estimate, its standard error, 
# 95% confidence interval and the p-value for the null hypothesis.
summary(Psi_Ais1, "tmle")
# Parameter Estimate:  0.28714
#  Estimated Std Err:  0.013838
#            p-value:  <2e-16
#  95% Conf Interval: (0.26002, 0.31426)

# The ltmle() function returns an object with several outputs.
# We can see that g functions are the same as in the previous manual calculation
head(Psi_Ais1$cum.g)
#            [,1]
# [1,] 0.10989220
# [2,] 0.15629749
# [3,] 0.15629749
# [4,] 0.08894074
# [5,] 0.15629749
# [6,] 0.15629749

# we can get the estimation of the epsilon parameter from the fluctuation model
Psi_Ais1$fit$Qstar
# Coefficients:
#   S1
# -0.001667
# Degrees of Freedom: 1124 Total (i.e. Null);  1123 Residual

# we can get the updated Qbar functions:
head(Psi_Ais1$Qstar)
# [1] 0.2872412 0.3441344 0.3441344 0.2591356 0.3441344 0.3441344

# we can get the influence curve 
head(Psi_Ais1$IC$tmle)
# [1]  0.0001003559  4.2532581791  0.0569935644 -0.0280052148  0.0569935644  0.0569935644

In practice, it is recommended to apply data-adaptive algorithms to estimate \(\bar{Q}\) and \(g\) functions: the ltmle package relies on the SuperLearner package. As indicated in the Guide to SuperLearner, The SuperLearner is “an algorithm that uses cross-validation to estimate the performance of multiple machine learning models, or the same model with different settings. It then creates an optimal weighted average of those models (ensemble learning) using the test data performance.”

Here is an example for our estimation of the Average Total Effect (ATE).

The SuperLearner package includes a set of algorithms with default parameters (showed by listWrappers()). Because the simulated data set only have 2 binary baseline variables, the set \(\mathcal{M}\) of possible statistical models is limited. In order to estimate the ATE, we will include a library with:

SL.mean, the null-model which only predict the marginal mean (it can be used as a reference for a bad model);
SL.glm, a glm using the main terms from the Qform and gform argument;
SL.interaction.back, a step-by-step backward GLM prodecure (based on the AIC), starting with all \(2 \times 2\) interactions between main terms. Interaction terms might be useful to estimate the \(\bar{Q}(A,L(0))\) function because the dataset was generated from and additive model, where as the function is estimated below using a logistic (multiplicative) model.
SL.xgboost.custom a customized xgboost algorithm from the initial SL.xgboost algorithm, showing how we can modify some default arguments.

library(SuperLearner)
library(xgboost)
# Below, we use the same ltmle() function than previously, 
# and specify a family of algorithms to be used with the SuperLearner

## we can change the default argument of the SL.xgboost algorithm and the 
## SL.step.interaction algorithm

# We can check how arguments are used in the pre-specified algorithms
SL.step.interaction
# function (Y, X, newX, family, direction = "both", trace = 0, 
#     k = 2, ...) 
# {
#     fit.glm <- glm(Y ~ ., data = X, family = family)
#     fit.step <- step(fit.glm, scope = Y ~ .^2, direction = direction, 
#         trace = trace, k = k)
#     pred <- predict(fit.step, newdata = newX, type = "response")
#     fit <- list(object = fit.step)
#     out <- list(pred = pred, fit = fit)
#     class(out$fit) <- c("SL.step")
#     return(out)
# }
# <bytecode: 0x000001b965ed0dc0>
# <environment: namespace:SuperLearner>

# The pre-specified algorithm can be easily modified to obtain a step-by-step backward 
# selection.
SL.interaction.back = function(...) {
  SL.step.interaction(..., direction = "backward")
}

# The same principle can be applied to modify the SL.xgboost default algorithm
SL.xgboost
SL.xgboost.custom = function(...) {
  SL.xgboost(..., ntrees = 50)
}


## the algorithms we would like to use can be specified separately for the Q and 
# g functions
SL.library <- list(Q=c("SL.mean","SL.glm","SL.interaction.back", "SL.xgboost.custom"),
                   g=c("SL.mean","SL.glm","SL.interaction.back", "SL.xgboost.custom"))

set.seed(42)
Psi_ATE_tmle <- ltmle(data = data_ltmle,
                      Anodes = "A0_PM2.5",
                      Ynodes = "Y_death",
                      Qform = Qform,
                      gform = gform,
                      gbounds = c(0.01, 1),
                      abar = list(1,0), # vector of the counterfactual treatment
                      SL.library = SL.library,
                      variance.method = "ic")
summary(Psi_ATE_tmle, estimator = "tmle")
# The function give the ATE on the difference scale (as well, as RR and OR)
# Additive Treatment Effect:
   # Parameter Estimate:  0.081832 
   #  Estimated Std Err:  0.014291 
   #            p-value:  1.0275e-08 
   #  95% Conf Interval: (0.053822, 0.10984) 

## We can see how the SuperLearner used the algorithms for the g function
Psi_ATE_tmle$fit$g
# [[1]]$A0_PM2.5
#                               Risk        Coef
# SL.mean_All             0.09976892 0.003545569 # risk is higher for the bad model
# SL.glm_All              0.09865424 0.416238369
# SL.interaction.back_All 0.09865424 0.000000000
# SL.xgboost.custom_All   0.09865550 0.580216062

# for the g function, the SuperLearner predicts the treatment mechanism
# base on a mix between the glm and the customized xgboost algorithm.

## We can see how the SuperLearner used the algorithms for the g function
Psi_ATE_tmle$fit$Q
#                              Risk       Coef
# SL.mean_All             0.1684737 0.02003166 # risk is higher for the bad model
# SL.glm_All              0.1662241 0.00000000
# SL.interaction.back_All 0.1662241 0.55956284
# SL.xgboost.custom_All   0.1662422 0.42040550

# The SuperLearner predicts both the treatment mechanism g and the Q function
# from a mix between the backward interaction glm (or the main term glm) and the 
# customized xgboost algorithm. 
# However, the choice between the SL.glm and the SL.interaction.back 
# procedure was arbitrary: as we can see the Risk is exactly the same for both 
# algorithms. The final model from the step-by-step procedure was much probably 
# a main term glm.


## The `ltmle` package can also be used to estimate the effect of binary exposures
## on continous outcomes
Qform <- c(Y_qol="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5")
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env")

set.seed(42)
Psi_ATE_tmle_qol <- ltmle(data = subset(df2_int, 
                                        select = c(L0_male,L0_soc_env,
                                         A0_PM2.5,
                                         Y_qol)),
                      Anodes = "A0_PM2.5",
                      Ynodes = "Y_qol",
                      Qform = Qform,
                      gform = gform,
                      gbounds = c(0.01, 1),
                      abar = list(1,0), # vector of the counterfactual treatment 
                      SL.library = SL.library,
                      variance.method = "ic")
summary(Psi_ATE_tmle_qol, estimator = "tmle")
# Additive Treatment Effect:
#    Parameter Estimate:  -8.265 
#     Estimated Std Err:  0.41008 
#               p-value:  <2e-16 
#     95% Conf Interval: (-9.0687, -7.4612)

The TMLE estimation of the ATE from the ltmle package for death probability and mean quality of life is +8.18% (95% CI=[+5.38%, +10.98%]) and -8.27 [-9.07, -7.46].

Note that the ltmle package can also be used to calculate the IPTW estimation of the ATE and the CDE.

# using the output from the previous ltmle() procedure
summary(Psi_ATE_tmle, estimator = "iptw")
# Additive Treatment Effect:
#    Parameter Estimate:  0.082578 
#     Estimated Std Err:  0.014415 
#               p-value:  1.0135e-08 
#     95% Conf Interval: (0.054325, 0.11083) 

summary(Psi_ATE_tmle_qol, estimator = "iptw")
# Additive Treatment Effect:
#    Parameter Estimate:  -8.2887 
#     Estimated Std Err:  0.41799 
#               p-value:  <2e-16 
#     95% Conf Interval: (-9.108, -7.4695)

The IPTW estimation of the ATE from the ltmle package for death probability and mean quality of life is +8.26% (95% CI=[+5.43%, +11.08%]) and -8.29 [-9.11, -7.47].

9.2 TMLE of the Controlled direct effect (CDE)

If the controlled direct effect (CDE) is identifiable, the ltmle package can be used to calculate a TMLE estimation of the CDE \(\Psi^{\text{CDE}_m} = \mathbb{E}(Y_{A=1,M=m}) - \mathbb{E}(Y_{A=0,M=m})\).

Below, we show how to use the ltmle() function to estimate CDE by TMLE, with data generated from the causal model with the presence of confounders of the mediator-outcome relationship (\(L(1)\)) affected by the exposure \(A\) (Figure 3.2), and an \(A \ast M\) interaction effect on the outcome.

As with the G-computation method by iterative conditional expectation, the TMLE procedure relies on the estimation of 2 \(\bar{Q}\) functions:

\(\bar{Q}_Y = \mathbb{E}(Y \mid L(0),A,L(1),M)\)
and \(\bar{Q}_{L(1)} = \mathbb{E}(\hat{\bar{Q}}_Y(M=m) \mid L(0),A)\);

And as with the IPTW method, the TMLE procedure relies also on the estimation of the 2 treatment mechanisms \(g\):

\(g_A(L(0)) = P(A=1 \mid L(0))\)
and \(g_M(L(0),A,L(1)) = P(M=1 \mid L(0),A,L(1))\).

Note: for continuous outcomes, the ltmle package transforms the outcome on a 0 to 1 continuous scale, \(Y_\text{transformed} = \frac{Y - \min(Y)}{\max(Y) - \min(Y)}\), so that quasi-binomial parametric models can be used in the computation procedure. Mean predictions are then back-transformed on the original scale.

9.2.1 For binary outcomes

rm(list=ls())
df2_int <- read.csv(file = "./data/df2_int.csv")

library(ltmle)
# Define the formulas for the estimation of the 2 barQ functions
# Note that it is possible to specify the A*M interaction, if we really want to
# take it into account.
# Another option is to indicate prediction algorithms well adapted to the estimation
# of interaction phenomena into the SuperLearner arguments.
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
           Y_death="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
                    A0_PM2.5 * M_diabetes")

# Define the formulas for the estimation of the 2 g function
gform <- c("A0_PM2.5 ~ L0_male + L0_soc_env",
           "M_diabetes ~ L0_male + L0_soc_env + A0_PM2.5 + L1")

# The data frame should follow the time-ordering of the nodes
data_binary <- subset(df2_int, select = c(L0_male, L0_soc_env,
                                          A0_PM2.5, L1,
                                          M_diabetes, Y_death))



# Choose a family of data-adaptive algorithms from the SuperLearner package
SL.library <- list(Q=c("SL.mean","SL.glm","SL.step.interaction","SL.xgboost"),
                   g=c("SL.mean","SL.glm","SL.step.interaction","SL.xgboost"))


## CDE, setting M=0
set.seed(42) # for reproducibility (xgboost algorithm relies on random procedures)
CDE_ltmle_M0_death <- ltmle(data = data_binary,
                            Anodes = c("A0_PM2.5", "M_diabetes"),
                            Lnodes = c("L1"), # intermediate confounders +/- baseline
                            Ynodes = c("Y_death"),
                            survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                            Qform = Qform,
                            gform = gform,
                            abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
                                        c(0,0)), # counterfactual intervention do(A=0,M=0)
                            SL.library = SL.library,
                            estimate.time = FALSE, # estimate computation time?
                            gcomp = FALSE,
                            variance.method = "ic") # a more robust variance can
                                                    # be estimated with
                                                    # variance.method = "tmle"
summary(CDE_ltmle_M0_death)
# Additive Treatment Effect:
# Parameter Estimate:  0.056766
#  Estimated Std Err:  0.018037
#            p-value:  0.0016488
#  95% Conf Interval: (0.021413, 0.092118)

## CDE, setting M=1
set.seed(42) # for reproducibility
CDE_ltmle_M1_death <- ltmle(data = data_binary,
                            Anodes = c("A0_PM2.5", "M_diabetes"),
                            Lnodes = c("L1"), # intermediate confounders +/- baseline
                            Ynodes = c("Y_death"),
                            survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                            Qform = Qform,
                            gform = gform,
                            abar = list(c(1,1), # counterfactual intervention do(A=1,M=1)
                                        c(0,1)), # counterfactual intervention do(A=0,M=1)
                            SL.library = SL.library,
                            estimate.time = FALSE, # estimate computation time?
                            gcomp = FALSE,
                            variance.method = "ic")
summary(CDE_ltmle_M1_death)
# Additive Treatment Effect:
# Parameter Estimate:  0.094776
#  Estimated Std Err:  0.024
#            p-value:  7.8496e-05
#  95% Conf Interval: (0.047736, 0.14182)

The controlled direct effect of ACE on the probability of death, had the mediator been set to 0 for every participant is 5.68%, 95%CI=[2.14%, 9.21%].

The controlled direct effect of ACE on the probability of death, had the mediator been set to 1 for every participant is 9.48%, 95%CI=[4.77%, 14.18%].

9.2.2 For continuous outcomes

# Define the data set with the continuous outcome Y_qol
data_continuous <- subset(df2_int, select = c(L0_male, L0_soc_env,
                                              A0_PM2.5, L1,
                                              M_diabetes, Y_qol))

# Replace the Qbar function (the 2d formula should be named Y_qol instead of Y_death)
Qform <- c(L1="Q.kplus1 ~ L0_male + L0_soc_env + A0_PM2.5",
           Y_qol="Q.kplus1 ~ L0_male + L0_soc_env + L1 +
                    A0_PM2.5 * M_diabetes")

set.seed(42)
## CDE, setting M=0
CDE_ltmle_M0_qol <- ltmle(data = data_continuous,
                      Anodes = c("A0_PM2.5", "M_diabetes"),
                      Lnodes = c("L1"), # intermediate confounders +/- baseline confounders
                      Ynodes = c("Y_qol"),
                      survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                      Qform = Qform,
                      gform = gform,
                      abar = list(c(1,0), # counterfactual intervention do(A=1,M=0)
                                  c(0,0)), # counterfactual intervention do(A=0,M=0)
                      SL.library = SL.library,
                      estimate.time = FALSE, # estimate computation time?
                      gcomp = FALSE,
                      variance.method = "ic")
summary(CDE_ltmle_M0_qol)
# Additive Treatment Effect:
#   Parameter Estimate:  -4.8023
#    Estimated Std Err:  0.43135
#              p-value:  <2e-16
#    95% Conf Interval: (-5.6477, -3.9569)

## CDE, setting M=1
set.seed(42)
CDE_ltmle_M1_qol <- ltmle(data = data_continuous,
                          Anodes = c("A0_PM2.5", "M_diabetes"),
                          Lnodes = c("L1"), # intermediate confounders +/- baseline
                          Ynodes = c("Y_qol"),
                          survivalOutcome = FALSE, # TRUE for time-to-event outcomes Y
                          Qform = Qform,
                          gform = gform,
                          abar = list(c(1,1), # counterfactual intervention do(A=1,M=1)
                                      c(0,1)), # counterfactual intervention do(A=0,M=1)
                          SL.library = SL.library,
                          estimate.time = FALSE, # estimate computation time?
                          gcomp = FALSE,
                          variance.method = "ic")
summary(CDE_ltmle_M1_qol)
# Additive Treatment Effect:
#   Parameter Estimate:  -10.219
#    Estimated Std Err:  0.544
#              p-value:  <2e-16
#    95% Conf Interval: (-11.285, -9.1523)

The controlled direct effect of ACE on the quality of life score, had the mediator been set to 0 for every participant is -4.8, 95%CI=[-5.6, -4.0].

The controlled direct effect of ACE on the quality of life score, had the mediator been set to 1 for every participant is -10.2, 95%CI=[-11.3, -9.2].

9.3 One-step and TMLE estimator of the Marginal Randomized/Interventional Direct and Indirect Effects

The medoutcon package (Hejazi, Rudolph, and Díaz 2022),(Díaz et al. 2020) enables the estimation of Marginal Randomized/Interventional Direct and Indirect Effects (analogues of the Natural direct and indirect effects) by one-step estimation or TMLE. The one-step estimator relies on “cross-fitting” and the TMLE relies on cross-validation.

More practical details can be found in the Materials for the workshop “Modern Causal Mediation Analysis” at the 2024 Society for Epidemiologic Research (SER) annual meeting in Austin, TX.

This package is associated with the tlverse ecosystem which has been developed for applying Targeted Learning methodology in practice. To use the medoutcon package, we will need:

the sl3 package (to implement SuperLearning)
and the Highly-adaptive lasso hal9001 package to estimate some functions of interest.

Note that the medoutcon package can deal with multiple mediators, but only a single binary intermediate confounder \(L(1)\). For causal structures with multiple mediators and multiple intermediate confounders, the HDmediation package could be used instead.

Below, we apply the one-step estimator to the data set with an intermediate confounder \(L(1)\) affected by the exposure, and the presence of an \(A \ast M\) interaction effect on the outcome.

# remotes::install_github("nhejazi/medoutcon")
# remotes::install_github("nhejazi/medoutcon", INSTALL_opts=c("--no-multiarch"))
# https://arxiv.org/pdf/1912.09936
rm(list=ls())
df2_int <- read.csv(file = "data/df2_int.csv")

library(tidyverse)
library(sl3)
library(medoutcon)
library(hal9001)

?medoutcon::medoutcon

## 1) binary outcome 
# ---------------------------------------------------------------------------- #
### compute one-step estimate of the interventional direct effect
set.seed(1234)
os_de <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                   A = df2_int$A0_PM2.5, # numeric vector of the exposure
                   Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                   M = df2_int$M_diabetes, # numeric vector or matrix
                   Y = df2_int$Y_death, # numeric vector
                   effect = "direct",
                   b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                   estimator = "onestep")
os_de
# Interventional Direct Effect
# Estimator: onestep
# Estimate: 0.068
# Std. Error: 0.015
# 95% CI: [0.039, 0.096]

os_de$theta
# [1] 0.06763031

sqrt(os_de$var)
# [1] 0.01455125

set.seed(1234)
os_ie <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                   A = df2_int$A0_PM2.5, # numeric vector of the exposure
                   Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                   M = df2_int$M_diabetes, # numeric vector or matrix
                   Y = df2_int$Y_death, # numeric vector
                   effect = "indirect",
                   b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                   estimator = "onestep")
os_ie
# Interventional Indirect Effect
# Estimator: onestep
# Estimate: 0.015
# Std. Error: 0.004
# 95% CI: [0.008, 0.022]

os_ie$theta
# [1] 0.01502601

### Estimate counterfactuals E(Y_{a,G_{a*}})
set.seed(1234)
EY_1M0 <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")],
                    A = df2_int$A0_PM2.5,
                    Z = df2_int$L1,
                    M = df2_int$M_diabetes,
                    Y = df2_int$Y_death,
                    contrast = c(1,0),
                    b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                    estimator = "onestep")
EY_1M0
# Counterfactual TSM
# Contrast: A = 1, M(A = 0)
# Estimator: onestep
# Estimate: 0.273
# Std. Error: 0.014
# 95% CI: [0.246, 0.301]

EY_0M0 <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")],
                    A = df2_int$A0_PM2.5,
                    Z = df2_int$L1,
                    M = df2_int$M_diabetes,
                    Y = df2_int$Y_death,
                    contrast = c(0,0),
                    b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                    estimator = "onestep")
EY_0M0
# Counterfactual TSM
# Contrast: A = 0, M(A = 0)
# Estimator: onestep
# Estimate: 0.203
# Std. Error: 0.004
# 95% CI: [0.195, 0.212]

## randomized Natural Direct Effect =
rNDE <- EY_1M0$theta - EY_0M0$theta
# [1] 0.06964436

## se
se_rNDE <- sqrt(var(EY_1M0$eif - EY_0M0$eif) / nrow(df2_int))
# [1] 0.01459493

## 95%CI
c(rNDE - qnorm(0.975) * se_rNDE,
  rNDE + qnorm(0.975) * se_rNDE)
# [1] 0.04103882 0.09824989

## 2) continuous outcome 
# ---------------------------------------------------------------------------- #
### compute one-step estimate of the interventional direct effect
set.seed(1234)
os_de <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                   A = df2_int$A0_PM2.5, # numeric vector of the exposure
                   Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                   M = df2_int$M_diabetes, # numeric vector or matrix
                   Y = df2_int$Y_qol, # numeric vector
                   effect = "direct",
                   b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                   estimator = "onestep")
os_de
# Interventional Direct Effect
# Estimator: onestep
# Estimate: -6.597
# Std. Error: 0.346
# 95% CI: [-7.276, -5.918]

set.seed(1234)
os_ie <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                   A = df2_int$A0_PM2.5, # numeric vector of the exposure
                   Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                   M = df2_int$M_diabetes, # numeric vector or matrix
                   Y = df2_int$Y_qol, # numeric vector
                   effect = "indirect",
                   b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                   estimator = "onestep")
os_ie
# Interventional Indirect Effect
# Estimator: onestep
# Estimate: -1.703
# Std. Error: 0.239
# 95% CI: [-2.172, -1.234]

Below, we apply the TMLE estimator to the same data set. Note that the estimation of the direct effect seems to be biased (confirmed on simulations): maybe the arguments given in the medoutcon function are not correct?

## 1) binary outcome 
# ---------------------------------------------------------------------------- #
### compute tmle estimate of the interventional direct effect & indirect effects
set.seed(1234)
tmle_de <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                     A = df2_int$A0_PM2.5, # numeric vector of the exposure
                     Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                     M = df2_int$M_diabetes, # numeric vector or matrix
                     Y = df2_int$Y_death, # numeric vector
                     effect = "direct",
                     estimator = "tmle",
                     b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                     # estimator_args = list(cv_folds = 10L, # instead of 5L
                     #                       max_iter = 1L, # instead of 5L
                     #                       tiltmod_tol = 1) # instead of 5
                    )
tmle_de
# Interventional Direct Effect
# Estimator: tmle
# Estimate: 0.042             <- ?! biased (confirmed on simulations)
# Std. Error: 0.015
# 95% CI: [0.014, 0.071]

tmle_de$theta
# [1] 0.04222315

sqrt(tmle_de$var)
# [1] 0.01455125

set.seed(1234)
tmle_ie <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                   A = df2_int$A0_PM2.5, # numeric vector of the exposure
                   Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                   M = df2_int$M_diabetes, # numeric vector or matrix
                   Y = df2_int$Y_death, # numeric vector
                   effect = "indirect",
                   b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                   estimator = "tmle")
tmle_ie
# Interventional Indirect Effect
# Estimator: tmle
# Estimate: 0.013
# Std. Error: 0.004
# 95% CI: [0.006, 0.021]

## 2) continuous outcome
# ---------------------------------------------------------------------------- #
### compute tmle estimate of the interventional direct effect & indirect effects
set.seed(1234)
tmle_de <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                     A = df2_int$A0_PM2.5, # numeric vector of the exposure
                     Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                     M = df2_int$M_diabetes, # numeric vector or matrix
                     Y = df2_int$Y_qol, # numeric vector
                     effect = "direct",
                     b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                     estimator = "tmle")
tmle_de
# Interventional Direct Effect
# Estimator: tmle
# Estimate: -4.808          <- also biased ?
# Std. Error: 0.346
# 95% CI: [-5.487, -4.129]

set.seed(1234)
tmle_ie <- medoutcon(W = df2_int[,c("L0_male","L0_soc_env")], #matrix of baseline L(0)
                     A = df2_int$A0_PM2.5, # numeric vector of the exposure
                     Z = df2_int$L1, # numeric vector L(1) (only 1 variable)
                     M = df2_int$M_diabetes, # numeric vector or matrix
                     Y = df2_int$Y_qol, # numeric vector
                     effect = "indirect",
                     b_learners = sl3::Lrnr_hal9001$new(), # outcome regression
                     estimator = "tmle")
tmle_ie
# Interventional Indirect Effect
# Estimator: tmle
# Estimate: -1.738
# Std. Error: 0.239
# 95% CI: [-2.207, -1.269]

References

Díaz, I, N S Hejazi, K E Rudolph, and M J van Der Laan. 2020. “Nonparametric efficient causal mediation with intermediate confounders.” Biometrika 108 (3): 627–41. https://doi.org/10.1093/biomet/asaa085.

Hejazi, Nima S., Kara E. Rudolph, and Iván Díaz. 2022. “‘Medoutcon‘: Nonparametric Efficient Causal Mediation Analysis with Machine Learning in ‘r‘.” Journal of Open Source Software 7 (69): 3979. https://doi.org/10.21105/joss.03979.

Laan, Mark J. van der, and Sherri Rose. 2011. Targeted Learning: Causal Inference for Observational and Experimental Data. 1st ed. Springer Series in Statistics. New York, NY: Springer.