Chapter 7 ltmle package with interval censored survival data

7.1 Setting

The ltmle package can be used to estimate cumulative risks with interval censored survival data.

Here we simulate a simple example based on 2 intervals between 3 visits:

  • baseline confounders \(L(0)\) and the exposure of interest \(A\) are measured at visit 1.
  • intermediate confounders \(L(1)\) and the mediator \(M\) are measured at visit 2. Death and censoring can occur before the mediator is measured:
  • \(Y(1)\) and \(Y(2)\) are occurrences of death between visits 1 and 2, and between visits 2 and 3, respectively.
  • \(C(1)\) and \(C(2)\) are censoring events occurring before visit 2 and before visit 3, respectively. They will be considered as exposure variables (treatment mechanisms) and counterfactual scenarios will be emulated under “no censoring”. Censoring can be informative (influenced by previous variables in the system).

Figure 7.1: DAG with interval censored survival data

In this example, we simulate the exposure and the mediator as 3-level categorical variables. In order to use the ltmle function, each one needs to be recoded into 2 dummy variables. For example:

  • The exposure \(A(1) \in \{0,1,2\}\) can be recoded into 2 dummy variables (\(A(1)_1\) and \(A(1)_2\)): \[\begin{align*} A(1)_1 &= \mathbb{I}(A(1) = 1) \\ A(1)_2 &= \mathbb{I}(A(1) = 2) \end{align*}\] and the distribution factorizes as \(P(a(1)) = P(a(1)_1, a(1)_2) = P(a(1)_1) \times P(a(1)_2 \mid a(1)_1)\). We also know that if the first dummy variable \(A(1)_1 = 1\), then \(A(1)_2 = 0\) deterministically. However, if \(A(1)_1 = 0\), then we can apply the conditional probability \(P(A(1)_2 = 1 \mid A(1)_1 = 0, L(0))\), as illustrated in the short sketch below.
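
For instance, a minimal sketch of this recoding, using an illustrative 3-level vector A1 coded 0/1/2 (not part of the data-generating function below):

A1 <- sample(0:2, size = 10, replace = TRUE) # illustrative 3-level exposure
A1_1 <- as.integer(A1 == 1)                  # first dummy variable
A1_2 <- as.integer(A1 == 2)                  # second dummy variable
stopifnot(all(A1_2[A1_1 == 1] == 0))         # deterministic constraint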

7.2 Data generating function

Here is an example of a function simulating data corresponding to this DAG.

Note that, in this example, we did not add “exposure \(\times\) mediator” interaction terms in the 2 outcome models, so that the controlled direct effect is expected to be independent of the value \(m\) chosen when setting \(M = m\) for Controlled Direct Effects.

rm(list = ls())

## Data generating function ----
GenerateData.CDE <- function(N) { 
  # rexpit function
  rexpit <- function (x) rbinom(length(x), 1, plogis(x))
  
  # baseline confounders L0
  L0_1 <- rbinom(N, size = 1, prob = 0.45) 
  L0_2 <- rexpit(qlogis(0.6) + log(1.5) * L0_1) 
  
  # exposure A: treatment
  A1_1 <- rexpit(qlogis(0.2) + log(0.9) * L0_1  + log(1.5) * L0_2)
  A1_2 <- ifelse(A1_1 == 1,
                 0,
                 rexpit(qlogis(0.5) + log(0.8) * L0_1  + log(2) * L0_2))
  
  # censoring C1
  C1 <- rexpit(qlogis(0.03) + log(1.4) * L0_1  + log(1.4) * L0_2 + 
                 log(1.1) * A1_1 + log(1.4) * A1_2)
  
  # death Y1
  Y1 <- rexpit(qlogis(0.05) + log(1.2) * L0_1  + log(1.5) * L0_2 + 
                 log(2) * A1_1 + log(3) * A1_2)  
  
  # intermediate confounders L1
  L1 <- ifelse(Y1 == 1,
               NA,
               rnorm(N, mean = 50 + (5 * L0_1) + 
                       (-3 * L0_2) + (4 * A1_1) + (10 * A1_2),
                     sd = 15))

  # mediator M2: continuing treatment
  # (drawn only among participants still alive at visit 2, i.e. Y1 == 0)
  M2_1 <- rep(NA, N)
  M2_1[Y1 == 0] <- rexpit(qlogis(0.4) + 
                            log(0.9) * L0_1[Y1 == 0] + 
                            log(1.5) * L0_2[Y1 == 0] + 
                            log(1.3) * A1_1[Y1 == 0] + log(1.6) * A1_2[Y1 == 0] + 
                            log(1.02) * L1[Y1 == 0])
  M2_2 <- rep(NA, N)
  M2_2[Y1 == 1] <- NA
  M2_2[Y1 == 0 & M2_1 == 1] <- 0
  M2_2[Y1 == 0 & M2_1 == 0] <- rexpit(qlogis(0.4) + 
                                        log(0.9) * L0_1[Y1 == 0 & M2_1 == 0]  + 
                                        log(1.5) * L0_2[Y1 == 0 & M2_1 == 0] + 
                                        log(1.5) * A1_1[Y1 == 0 & M2_1 == 0] + 
                                        log(2) * A1_2[Y1 == 0 & M2_1 == 0] + 
                                        log(1.03) * L1[Y1 == 0 & M2_1 == 0])
  
  # censoring C2
  C2 <- rep(NA, N)
  C2[Y1 == 1] <- NA
  C2[Y1 == 0 & C1 == 1] <- 1
  C2[Y1 == 0 & C1 == 0] <- rexpit(qlogis(0.03) + 
                                    log(1.4) * L0_1[Y1 == 0 & C1 == 0] + 
                                    log(1.4) * L0_2[Y1 == 0 & C1 == 0] + 
                                    log(1.05) * A1_1[Y1 == 0 & C1 == 0] + 
                                    log(1.2) * A1_2[Y1 == 0 & C1 == 0] + 
                                    log(1.01) * L1[Y1 == 0 & C1 == 0] + 
                                    log(1.1) * M2_1[Y1 == 0 & C1 == 0] + 
                                    log(1.4) * M2_2[Y1 == 0 & C1 == 0])

  # death Y2
  Y2 <- rep(NA, N)
  Y2[Y1 == 1] <- 1
  Y2[Y1 == 0] <- rexpit(qlogis(0.05) + log(1.2) * L0_1[Y1 == 0] + 
                          log(1.5) * L0_2[Y1 == 0] + 
                          log(1.5) * A1_1[Y1 == 0] + log(2) * A1_2[Y1 == 0] + 
                          log(1.01) * L1[Y1 == 0] + 
                          log(1.5) * M2_1[Y1 == 0] + log(2) * M2_2[Y1 == 0])  
                               
  df <- data.frame(L0_1 = L0_1, L0_2 = L0_2,
                   A1_1 = A1_1, A1_2 = A1_2, C1,
                   Y1 = Y1, L1 = L1,
                   M2_1 = M2_1, M2_2 = M2_2, C2,
                   Y2 = Y2)
  
  df$Y1 <- ifelse(df$C1 == 1, NA, df$Y1)
  df$L1 <- ifelse(df$C1 == 1, NA, df$L1)
  df$M2_1 <- ifelse(df$C1 == 1, NA, df$M2_1)
  df$M2_2 <- ifelse(df$C1 == 1, NA, df$M2_2)
  df$C2 <- ifelse(df$C1 == 1, NA, df$C2)
  df$Y2 <- ifelse(df$C2 == 1, NA, df$Y2)

  return(df)
}

## Generate 1 data frame, named df2 ----
set.seed(1234)
df2 <- GenerateData.CDE(N = 10000)
write.csv2(df2, "data/df2.csv")

## Import data df2
## (note: read.csv2 re-imports the row names as a column X, which will appear in the default Q- and g-formulas below)
df2 <- read.csv2("data/df2.csv")

7.3 Inspect the data generated

Let’s inspect the data.

View(df2)
df2 |> 
  with(table(A1_1, A1_2)) # distribution of the exposure

df2 |> 
  subset(subset = (C1 == 1)) |>
  summary() # 519 censored after C1

df2 |> 
  subset(subset = (C1 == 0)) |>
  summary() # distribution of variables uncensored after C1

df2 |> 
  subset(subset = (C1 == 0 & Y1 == 1)) |>
  summary() # among uncensored participants at C1, 1319 deaths at Y1

df2 |> 
  subset(subset = (C1 == 0 & Y1 == 0)) |>
  summary() # distribution of variables uncensored at C1 and alive at Y1
            

df2 |> 
  subset(subset = (C1 == 0 & Y1 == 0 & C2 == 1)) |>
  summary() # 754 censored at C2

df2 |> 
  subset(subset = (C1 == 0 & Y1 == 0 & C2 == 0)) |>
  summary() # distribution of variables among participants alive at Y1 and uncensored at C2

df2 |> 
  subset(subset = (C1 == 0 & Y1 == 0)) |>
  with(table(M2_1, M2_2)) # distribution of the mediator
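
We can also check that the deterministic structure of the dummy coding holds in the simulated data (a quick sanity check; it is not required by the ltmle analysis below):

df2 |> 
  with(stopifnot(all(A1_2[A1_1 == 1] == 0),
                 all(M2_2[M2_1 == 1] == 0, na.rm = TRUE)))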

7.4 Estimate the controlled direct effect

We can use the ltmle function to estimate a Controlled Direct Effect: \[\begin{equation*} \small CDE(M=0) = \mathbb{E}\left(Y_{A=2, M=0} \right) - \mathbb{E}\left(Y_{A=0, M=0} \right) = \mathbb{E}\left(Y_{A(1)_1=0, A(1)_2 = 1, M_1 = 0, M_2 = 0} \right) - \mathbb{E}\left(Y_{A(1)_1=0, A(1)_2 = 0, M_1 = 0, M_2 = 0} \right) \end{equation*}\] We will also estimate the Average Total Effect: \[\begin{equation*} ATE_{A=2 \text{ vs } A=0} = \mathbb{E}\left(Y_{A=2} \right) - \mathbb{E}\left(Y_{A=0} \right) = \mathbb{E}\left(Y_{A(1)_1=0, A(1)_2 = 1} \right) - \mathbb{E}\left(Y_{A(1)_1=0, A(1)_2 = 0} \right) \end{equation*}\]

Before running the ltmle function:

  • We need to format the censoring variables, using the BinaryToCensoring function, which converts binary censoring indicators into the censored/uncensored format expected by ltmle.
  • We can define a deterministic g-function to handle the deterministic relationship between the dummy variables of the 3-level exposure and mediator.

Note 1: in this example, we let the ltmle function automatically define the variables to include in the Q-formulas and g-formulas.

Note 2: the g-formulas should include the censoring mechanisms, in addition to the exposure and mediator mechanisms.
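
As an alternative to these defaults, the regression formulas could also be written out explicitly through the Qform and gform arguments of the ltmle function. The sketch below simply mirrors the default formulas printed further down, except that it omits the row-index column X created by read.csv2; it is shown for illustration only and is not used in the calls that follow.

## explicit Q-formulas (one per block of Y/L nodes) and g-formulas
## (one per exposure, mediator and censoring node):
Qform <- c(Y1 = "Q.kplus1 ~ L0_1 + L0_2 + A1_1 + A1_2",
           Y2 = "Q.kplus1 ~ L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1 + M2_2")
gform <- c("A1_1 ~ L0_1 + L0_2",
           "A1_2 ~ L0_1 + L0_2 + A1_1",
           "C1 ~ L0_1 + L0_2 + A1_1 + A1_2",
           "M2_1 ~ L0_1 + L0_2 + A1_1 + A1_2 + L1",
           "M2_2 ~ L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1",
           "C2 ~ L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1 + M2_2")
## these could then be passed to ltmle() via the arguments Qform = Qform and gform = gform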

library(ltmle)
df_ltmle <- df2

## format censoring variables:
df_ltmle$C1 <- BinaryToCensoring(is.censored = df_ltmle$C1)
df_ltmle$C2 <- BinaryToCensoring(is.censored = df_ltmle$C2)

## format the outcome (see the ltmle documentation ?ltmle):
# once a Ynode jumps to 1 (e.g. death), all subsequent Ynode values should be 1
df_ltmle$Y2 <- ifelse(df_ltmle$Y1 == 1, 1, df_ltmle$Y2) 

head(df_ltmle, 13)
##     X L0_1 L0_2 A1_1 A1_2         C1 Y1       L1 M2_1 M2_2         C2 Y2
## 1   1    0    1    0    0 uncensored  0 84.37728    1    0 uncensored  0
## 2   2    1    1    1    0 uncensored  0 56.79832    0    1 uncensored  1
## 3   3    1    1    0    1 uncensored  0 68.84374    1    0   censored NA
## 4   4    1    0    0    0 uncensored  0 78.65583    1    0 uncensored  0
## 5   5    1    1    0    0 uncensored  0 61.33529    1    0 uncensored  0
## 6   6    1    1    0    0 uncensored  0 69.81963    0    1 uncensored  0
## 7   7    0    1    0    0 uncensored  0 42.79730    1    0 uncensored  0
## 8   8    0    1    0    0 uncensored  0 26.72749    1    0 uncensored  0
## 9   9    1    1    1    0 uncensored  0 51.65862    1    0 uncensored  0
## 10 10    0    1    0    0 uncensored  0 29.31751    1    0 uncensored  0
## 11 11    1    0    0    0 uncensored  1       NA   NA   NA       <NA>  1
## 12 12    0    0    0    0 uncensored  0 54.69811    0    0 uncensored  0
## 13 13    0    0    0    1 uncensored  0 36.30073    1    0 uncensored  0
## Define a SuperLearner library (choose a richer set of learners in practice)
SL.library <- c("SL.glm", "SL.mean")
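## for example, a richer library could be used, assuming the SuperLearner
## wrappers SL.glmnet and SL.ranger (and their backend packages) are installed:
# SL.library <- c("SL.glm", "SL.glmnet", "SL.ranger", "SL.mean")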

## Dealing with multi-categorical exposures and mediators:
## we can incorporate deterministic knowledge through the 
## "deterministic.g.function" argument of the ltmle function:
det.g <- function(data, current.node, nodes) {
  if (names(data)[current.node] != "A1_2" &  names(data)[current.node] != "M2_2") {
    return(NULL) # for other variables
  } else if (names(data)[current.node] == "A1_2") {
    is.deterministic <- data$A1_1 == 1 # if we're regressing A1_2, 
                                       # then: if A1_1=1 then P(A1_2 = 1) = 0
    return(list(is.deterministic = is.deterministic, prob1 = 0))
  } else if (names(data)[current.node] == "M2_2") {
    is.deterministic <- data$M2_1 == 1 # if we're regressing M2_2, 
                                       # then: if M2_1=1 then P(M2_2 = 1) = 0
    return(list(is.deterministic = is.deterministic, prob1 = 0))
  } else {
    stop("something went wrong!") # defensive programming
  }
}
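## optional sanity check (not required by ltmle): applied to the A1_2 node,
## det.g should flag the rows with A1_1 = 1 as deterministic, with prob1 = 0
# str(det.g(data = df_ltmle, current.node = which(names(df_ltmle) == "A1_2"),
#           nodes = NULL))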

## run the ltmle function
CDE_A2vA0_M0 <- ltmle(data = df_ltmle,
                Anodes = c("A1_1", "A1_2", "M2_1", "M2_2"), # exposure and mediator
                Cnodes = c("C1", "C2"),
                Lnodes = c("L1"),
                Ynodes = c("Y1", "Y2"),
                survivalOutcome = TRUE,
                abar = list(c(0,1,0,0), # A1_1=0, A1_2=1, M2_1=0, M2_2=0: E(Y_{A=2, M=0})
                            c(0,0,0,0)), # A1_1=0, A1_2=0, M2_1=0, M2_2=0: E(Y_{A=0, M=0})
                deterministic.g.function = det.g,
                SL.library = SL.library,
                gcomp = FALSE,
                variance.method = "ic")
## Qform not specified, using defaults:
## formula for Y1:
## Q.kplus1 ~ X + L0_1 + L0_2 + A1_1 + A1_2
## formula for Y2:
## Q.kplus1 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1 + M2_2
## 
## gform not specified, using defaults:
## formula for A1_1:
## A1_1 ~ X + L0_1 + L0_2
## formula for A1_2:
## A1_2 ~ X + L0_1 + L0_2 + A1_1
## formula for C1:
## C1 ~ X + L0_1 + L0_2 + A1_1 + A1_2
## formula for M2_1:
## M2_1 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1
## formula for M2_2:
## M2_2 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1
## formula for C2:
## C2 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1 + M2_1 + M2_2
## 
## Estimate of time to completion: 2 to 2 minutes
summary(CDE_A2vA0_M0)
## Estimator:  tmle 
## Call:
## ltmle(data = df_ltmle, Anodes = c("A1_1", "A1_2", "M2_1", "M2_2"), 
##     Cnodes = c("C1", "C2"), Lnodes = c("L1"), Ynodes = c("Y1", 
##         "Y2"), survivalOutcome = TRUE, abar = list(c(0, 1, 0, 
##         0), c(0, 0, 0, 0)), deterministic.g.function = det.g, 
##     SL.library = SL.library, gcomp = FALSE, variance.method = "ic")
## 
## Treatment Estimate:
##    Parameter Estimate:  0.4033 
##     Estimated Std Err:  0.033428 
##               p-value:  <2e-16 
##     95% Conf Interval: (0.33778, 0.46882) 
## 
## Control Estimate:
##    Parameter Estimate:  0.15786 
##     Estimated Std Err:  0.02203 
##               p-value:  7.7231e-13 
##     95% Conf Interval: (0.11469, 0.20104) 
## 
## Additive Treatment Effect:
##    Parameter Estimate:  0.24544 
##     Estimated Std Err:  0.040026 
##               p-value:  8.6843e-10 
##     95% Conf Interval: (0.16699, 0.32389) 
## 
## Relative Risk:
##    Parameter Estimate:  2.5547 
##   Est Std Err log(RR):  0.16228 
##               p-value:  7.4749e-09 
##     95% Conf Interval: (1.8587, 3.5114) 
## 
## Odds Ratio:
##    Parameter Estimate:  3.6056 
##   Est Std Err log(OR):  0.21618 
##               p-value:  2.9857e-09 
##     95% Conf Interval: (2.3602, 5.5079)
CDE_A2vA0_M0$fit$g
## [[1]]
## [[1]]$A1_1
##                  Risk      Coef
## SL.glm_All  0.1828096 0.9600873
## SL.mean_All 0.1839670 0.0399127
## 
## [[1]]$A1_2
##                  Risk       Coef
## SL.glm_All  0.2358587 0.98738181
## SL.mean_All 0.2438133 0.01261819
## 
## [[1]]$C1
##                   Risk      Coef
## SL.glm_All  0.04905206 0.9073443
## SL.mean_All 0.04922034 0.0926557
## 
## [[1]]$M2_1
##                  Risk      Coef
## SL.glm_All  0.1839150 0.2953515
## SL.mean_All 0.1838302 0.7046485
## 
## [[1]]$M2_2
##                  Risk       Coef
## SL.glm_All  0.1302708 0.92939499
## SL.mean_All 0.1376532 0.07060501
## 
## [[1]]$C2
##                   Risk       Coef
## SL.glm_All  0.08307675 0.91616088
## SL.mean_All 0.08385440 0.08383912
## 
## 
## [[2]]
## [[2]]$A1_1
##                  Risk      Coef
## SL.glm_All  0.1828096 0.9600873
## SL.mean_All 0.1839670 0.0399127
## 
## [[2]]$A1_2
##                  Risk       Coef
## SL.glm_All  0.2358587 0.98738181
## SL.mean_All 0.2438133 0.01261819
## 
## [[2]]$C1
##                   Risk      Coef
## SL.glm_All  0.04905206 0.9073443
## SL.mean_All 0.04922034 0.0926557
## 
## [[2]]$M2_1
##                  Risk      Coef
## SL.glm_All  0.1839150 0.2953515
## SL.mean_All 0.1838302 0.7046485
## 
## [[2]]$M2_2
##                  Risk       Coef
## SL.glm_All  0.1302708 0.92939499
## SL.mean_All 0.1376532 0.07060501
## 
## [[2]]$C2
##                   Risk       Coef
## SL.glm_All  0.08307675 0.91616088
## SL.mean_All 0.08385440 0.08383912
CDE_A2vA0_M0$fit$Q
## [[1]]
## [[1]]$Y1
##                   Risk        Coef
## SL.glm_All  0.06307006 0.994996871
## SL.mean_All 0.06833395 0.005003129
## 
## [[1]]$Y2
##                  Risk       Coef
## SL.glm_All  0.1708663 0.98570313
## SL.mean_All 0.1770992 0.01429687
## 
## 
## [[2]]
## [[2]]$Y1
##                   Risk       Coef
## SL.glm_All  0.09451827 0.98460715
## SL.mean_All 0.09865130 0.01539285
## 
## [[2]]$Y2
##                  Risk       Coef
## SL.glm_All  0.1708663 0.98570313
## SL.mean_All 0.1770992 0.01429687
## Average total effect
## we do not need the mediators, but we need the censoring variables
## (which are considered as exposures)
ATE_A2vA0 <- ltmle(data = subset(df_ltmle, select = -c(M2_1, M2_2)),
                      Anodes = c("A1_1", "A1_2"), # exposure and mediator
                      Cnodes = c("C1", "C2"),
                      Lnodes = c("L1"),
                      Ynodes = c("Y1", "Y2"),
                      survivalOutcome = TRUE,
                      abar = list(c(0,1), # A1_1=0, A1_2=1: E(Y_{A=2})
                                  c(0,0)), # A1_1=0, A1_2=0: E(Y_{A=0})
                      deterministic.g.function = det.g,
                      SL.library = SL.library,
                      gcomp = FALSE,
                      variance.method = "ic")
## Qform not specified, using defaults:
## formula for Y1:
## Q.kplus1 ~ X + L0_1 + L0_2 + A1_1 + A1_2
## formula for Y2:
## Q.kplus1 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1
## 
## gform not specified, using defaults:
## formula for A1_1:
## A1_1 ~ X + L0_1 + L0_2
## formula for A1_2:
## A1_2 ~ X + L0_1 + L0_2 + A1_1
## formula for C1:
## C1 ~ X + L0_1 + L0_2 + A1_1 + A1_2
## formula for C2:
## C2 ~ X + L0_1 + L0_2 + A1_1 + A1_2 + L1
## 
## Estimate of time to completion: 1 to 2 minutes
summary(ATE_A2vA0)
## Estimator:  tmle 
## Call:
## ltmle(data = subset(df_ltmle, select = -c(M2_1, M2_2)), Anodes = c("A1_1", 
##     "A1_2"), Cnodes = c("C1", "C2"), Lnodes = c("L1"), Ynodes = c("Y1", 
##     "Y2"), survivalOutcome = TRUE, abar = list(c(0, 1), c(0, 
##     0)), deterministic.g.function = det.g, SL.library = SL.library, 
##     gcomp = FALSE, variance.method = "ic")
## 
## Treatment Estimate:
##    Parameter Estimate:  0.42858 
##     Estimated Std Err:  0.0080164 
##               p-value:  <2e-16 
##     95% Conf Interval: (0.41287, 0.4443) 
## 
## Control Estimate:
##    Parameter Estimate:  0.22137 
##     Estimated Std Err:  0.0081064 
##               p-value:  <2e-16 
##     95% Conf Interval: (0.20548, 0.23726) 
## 
## Additive Treatment Effect:
##    Parameter Estimate:  0.20721 
##     Estimated Std Err:  0.01138 
##               p-value:  <2e-16 
##     95% Conf Interval: (0.18491, 0.22952) 
## 
## Relative Risk:
##    Parameter Estimate:  1.936 
##   Est Std Err log(RR):  0.041059 
##               p-value:  <2e-16 
##     95% Conf Interval: (1.7863, 2.0983) 
## 
## Odds Ratio:
##    Parameter Estimate:  2.6381 
##   Est Std Err log(OR):  0.057203 
##               p-value:  <2e-16 
##     95% Conf Interval: (2.3583, 2.9511)

In this example,

  • the g-computation algorithm estimates the \(\bar{Q}\) functions among uncensored participants only, so the censoring mechanism is not taken into account by g-computation and the estimation can be biased by attrition (informative censoring);
  • however, the censoring mechanism is taken into account by the IPTW estimator (similarly to an estimation by Inverse Probability of Censoring Weighting, IPCW), and it is also used by the TMLE through the weights of its targeting step (see the short example below).
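
To compare the TMLE estimates with the corresponding IPTW estimates (which rely on the estimated treatment and censoring mechanisms), one possibility is to use the estimator argument of the summary method of the ltmle package (output not shown):

summary(CDE_A2vA0_M0, estimator = "iptw")
summary(ATE_A2vA0, estimator = "iptw")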