1 Simulation

Simulations are an important part of the pre-data step in statistical analyses. The basic steps are:

Create simulation modules
Test simulation modules
Run simulation studies
Document simulation modules and simulation studies in quarto project report

Simulation modules are recipes for creating data that is analysed in simulation studies. They therefore represent the data-generating process and contain important assumptions about the causal model. Specifying detailed simulations can therefore also help to think about and communicate your causal model.

Simulation studies help in validating the following domains (Betancourt 2020):

Domain Expertise Consistency: Is our model consistent with our domain expertise?
Computational Faithfulness: Will our computational tools be sufficient to accurately fit our model?
Inferential Adequacy: Will our inference provide enough information to answer our questions?
Model Adequacy: Is our model rich enough to capture the relevant structure of the true data generating process?

The following packages are used:

Creating simulation modules: simcausal (Sofrygin, Laan, and Neugebauer 2017)
Running simulation studies: simChef (Tang and Duncan 2025)

1.1 Creating simulation modules with `simcausal`

Simulation modules are created with the following basic code.

sim_mod <- DAG.empty() %>% 
  add.nodes(
    node("name",
      t = 0,
      distr = "",
      ...
    )
  ) %>% 
  add.nodes(
    node("name",
      t = 1:2,
      distr = "",
      ...
    )
  )

DAG.empty() creates an empty DAG
Pipe into add.nodes() to add a single variable with node()
Inside node(), specify a name, a distribution distr, and the parameters of this distribution
Specify timepoint t, if you simulate longitudinal data. For background variables, use t = 0. Once, you used t, all subsequent variables also need it. For cross-sectional data, t can be omitted.
Repeat piping into add.nodes() to add a single variable with node()

Here are some distributions and there parameters:

rconst: const
rbern: prob
rnorm: mean, sd
truncnorm::rtruncnorm: a, b, mean, sd
supfuns::rdiscunif: min, max
rpois: lambda

If the value of a parameter depends on the value of another variable, make sure to add the depending variable after the variable it depends on. Include the variable by its name and add the time in square brackets. Use one or more ifelse() to code the dependencies.

sim_mod <- DAG.empty() %>% 
  add.nodes(
    node("asthma",
      t = 1,
      distr = "rbern",
      prob = p_asthma_1
    )
  ) %>% 
  add.nodes(
    node("asthma", 
      t = 2:5,
      distr = "rbern",
      prob = ifelse(asthma[t - 1] == 1,
        1 - p_remission,
        p_relapse
      )
    )
  )

In the code above, note how you can re-use the name ("asthma") if t differs. asthma[1] needs to be defined first, because its prob parameter is different from the t = 2:5 as it does not depend on a previous asthma status. Note also that t can be used when addressing a simulated variable. prob of asthma[2] depends on asthma[1], prob of asthma[3] depends on asthma[2], etc. Also note that in addition to variable asthma, there are parameters p_asthma_1, p_remission, p_relapse. Parameters have a single value that is the same for all simulated observations. However, since we want to change parameter values in simulation studies, we use a variable, i.e., placeholder. In addition to the code, the R script of a simulation module should include a list and description of variables and parameters.

# Title -------------------------------------------------------------------
# Project Title: ...
# Script Title: ...

# Script Description: Creating simulation module "..."


## Variables:
# asthma[1:5] (binary asthma indicator)

## Parameters:
# p_asthma_1 (probability of asthma at time 1)
# p_remission (probability of asthma remission)
# p_relapse (probability of asthma relapse)


# Module ------------------------------------------------------------------

sim_mod_asthma <- DAG.empty() %>% 
  add.nodes(
    node("asthma",
      t = 1,
      distr = "rbern",
      prob = p_asthma_1
    )
  ) %>% 
  add.nodes(
    node("asthma", 
      t = 2:5,
      distr = "rbern",
      prob = ifelse(asthma[t - 1] == 1,
        1 - p_remission,
        p_relapse
      )
    )
  )

Multiple simulation modules can be combined with each other and variables from earlier modules can be used in later modules. The complete DAG, i.e., after combining simulation modules, cannot include the same variables multiple times, and therefore, when you create modules that are always combined, don’t include variables from earlier modules in later modules. Also, when using supfuns::run_sim(), make sure that the modules are listed in the correct order in the list passed to argument dags.

1.2 Testing simulation modules

Before conducting simulation studies, test if the module has been implemented correctly. For this, simulate some data based on pre-specified parameter values and evaluate if they can be recovered. The workflow for running simulation studies is also used for testing simulation modules. See an example below:

Example of testing a simulation module

# Title -------------------------------------------------------------------
# Project Title: greenstress
# Script Title: pre_2_2_test_outc_miscl_2

# Script Description: Testing simulation module "Outcome misclassification
# with predictor"


# Prepare experiment ------------------------------------------------------

## Create functions ----

test_2_dgp <- function(n, pX, pAlpha, ORx, se, sp) {
  dags <- list(sim_mod_2)
  args <- list(
    pX = pX,
    pAlpha = pAlpha,
    ORx = ORx,
    se = se,
    sp = sp
  )

  df <- run_sim(
    dags = dags,
    args = args,
    n = n
  )

  out <- list(
    df = df,
    se = se,
    sp = sp
  )

  return(out)
}

test_2_method <- function(df, se, sp) {
  out <- list(
    p_x = mean(df$x),
    p_yTrue_x0 = mean(df$yTrue[df$x == 0]),
    or_true = exp(qlogis(mean(df$yTrue[df$x == 1])) - qlogis(mean(df$yTrue[df$x == 0]))),
    se_true = mean(df$yObs[df$yTrue == 1] == 1),
    sp_true = mean(df$yObs[df$yTrue == 0] == 0)
  )

  return(out)
}

## Create experiment ----

dgp2 <- create_dgp(
  .dgp_fun = test_2_dgp,
  .name = "Test Module 2: DGP",
  n = 1000
)

meth2 <- create_method(
  .method_fun = test_2_method,
  .name = "Test Module 2: Method"
)

test_2_exp <- create_experiment(
  name = "Test Module 2"
) %>%
  add_dgp(dgp2) %>%
  add_method(meth2) %>%
  add_vary_across(
    .dgp = dgp2,
    pX = list(0.5),
    pAlpha = list(0.005, 0.05, 0.5),
    ORx = list(1, 1.2, 2),
    se = list(0.6),
    sp = list(0.8)
  )

## Clean
rm(test_2_dgp, test_2_method, dgp2, meth2)


# Run experiment ----------------------------------------------------------

if (cache$exists("test_2_result")) {
  test_2_result <- cache$get("test_2_result")
} else {
  test_2_result <- test_2_exp %>%
    run_experiment(n_reps = 100, save = FALSE)
  cache$set("test_2_result", test_2_result)
}
rm(test_2_exp)


# Visualise ---------------------------------------------------------------

df_s_no <- test_2_result$fit_results %>%
  filter(pAlpha == 0.005, ORx == 1)

df_s_low <- test_2_result$fit_results %>%
  filter(pAlpha == 0.005, ORx == 1.2)

df_s_high <- test_2_result$fit_results %>%
  filter(pAlpha == 0.005, ORx == 2)

df_m_no <- test_2_result$fit_results %>%
  filter(pAlpha == 0.05, ORx == 1)

df_m_low <- test_2_result$fit_results %>%
  filter(pAlpha == 0.05, ORx == 1.2)

df_m_high <- test_2_result$fit_results %>%
  filter(pAlpha == 0.05, ORx == 2)

df_l_no <- test_2_result$fit_results %>%
  filter(pAlpha == 0.5, ORx == 1)

df_l_low <- test_2_result$fit_results %>%
  filter(pAlpha == 0.5, ORx == 1.2)

df_l_high <- test_2_result$fit_results %>%
  filter(pAlpha == 0.5, ORx == 2)

## 0.005, 1 ----
# 0.005, 1, pX
p_pX_s_no <- df_s_no %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_s_no)$y$range$range[2]

p_pX_s_no <- p_pX_s_no +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.005, 1, pAlpha
p_pAlpha_s_no <- df_s_no %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.005)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_s_no)$y$range$range[2]

p_pAlpha_s_no <- p_pAlpha_s_no +
  annotate("label",
    x = 0.005,
    y = y_max / 2,
    label = "pAlpha = 0.005",
    hjust = "left"
  )

# 0.005, 1, ORx
p_ORx_s_no <- df_s_no %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_s_no)$y$range$range[2]

p_ORx_s_no <- p_ORx_s_no +
  annotate("label",
    x = 1,
    y = y_max / 2,
    label = "ORx = 1"
  )

# 0.005, 1, se
p_se_s_no <- df_s_no %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_s_no)$y$range$range[2]

p_se_s_no <- p_se_s_no +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.005, 1, sp
p_sp_s_no <- df_s_no %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_s_no)$y$range$range[2]

p_sp_s_no <- p_sp_s_no +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.005, 1.2 ----
# 0.005, 1.2, pX
p_pX_s_low <- df_s_low %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1.2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_s_low)$y$range$range[2]

p_pX_s_low <- p_pX_s_low +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.005, 1.2, pAlpha
p_pAlpha_s_low <- df_s_low %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.005)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_s_low)$y$range$range[2]

p_pAlpha_s_low <- p_pAlpha_s_low +
  annotate("label",
    x = 0.005,
    y = y_max / 2,
    label = "pAlpha = 0.005",
    hjust = "left"
  )

# 0.005, 1.2, ORx
p_ORx_s_low <- df_s_low %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1.2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_s_low)$y$range$range[2]

p_ORx_s_low <- p_ORx_s_low +
  annotate("label",
    x = 1.2,
    y = y_max / 2,
    label = "ORx = 1.2"
  )

# 0.005, 1.2, se
p_se_s_low <- df_s_low %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_s_low)$y$range$range[2]

p_se_s_low <- p_se_s_low +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.005, 1.2, sp
p_sp_s_low <- df_s_low %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_s_low)$y$range$range[2]

p_sp_s_low <- p_sp_s_low +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.005, 2 ----
# 0.005, 2, pX
p_pX_s_high <- df_s_high %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_s_high)$y$range$range[2]

p_pX_s_high <- p_pX_s_high +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.005, 2, pAlpha
p_pAlpha_s_high <- df_s_high %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.005)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_s_high)$y$range$range[2]

p_pAlpha_s_high <- p_pAlpha_s_high +
  annotate("label",
    x = 0.005,
    y = y_max / 2,
    label = "pAlpha = 0.005",
    hjust = "left"
  )

# 0.005, 2, ORx
p_ORx_s_high <- df_s_high %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_s_high)$y$range$range[2]

p_ORx_s_high <- p_ORx_s_high +
  annotate("label",
    x = 2,
    y = y_max / 2,
    label = "ORx = 2"
  )

# 0.005, 2, se
p_se_s_high <- df_s_high %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_s_high)$y$range$range[2]

p_se_s_high <- p_se_s_high +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.005, 2, sp
p_sp_s_high <- df_s_high %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_s_high)$y$range$range[2]

p_sp_s_high <- p_sp_s_high +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

p_s <- wrap_plots(
  p_pX_s_no, p_pX_s_low, p_pX_s_high,
  p_pAlpha_s_no, p_pAlpha_s_low, p_pAlpha_s_high,
  p_ORx_s_no, p_ORx_s_low, p_ORx_s_high,
  p_se_s_no, p_se_s_low, p_se_s_high,
  p_sp_s_no, p_sp_s_low, p_sp_s_high,
  nrow = 5, ncol = 3,
  axes = "collect"
) +
  plot_annotation(tag_levels = "A")

rm(
  p_pX_s_no, p_pX_s_low, p_pX_s_high,
  p_pAlpha_s_no, p_pAlpha_s_low, p_pAlpha_s_high,
  p_ORx_s_no, p_ORx_s_low, p_ORx_s_high,
  p_se_s_no, p_se_s_low, p_se_s_high,
  p_sp_s_no, p_sp_s_low, p_sp_s_high
)


## 0.05, 1 ----
# 0.05, 1, pX
p_pX_m_no <- df_m_no %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_m_no)$y$range$range[2]

p_pX_m_no <- p_pX_m_no +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.05, 1, pAlpha
p_pAlpha_m_no <- df_m_no %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.05)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_m_no)$y$range$range[2]

p_pAlpha_m_no <- p_pAlpha_m_no +
  annotate("label",
    x = 0.05,
    y = y_max / 2,
    label = "pAlpha = 0.05",
    hjust = "left"
  )

# 0.05, 1, ORx
p_ORx_m_no <- df_m_no %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_m_no)$y$range$range[2]

p_ORx_m_no <- p_ORx_m_no +
  annotate("label",
    x = 1,
    y = y_max / 2,
    label = "ORx = 1"
  )

# 0.05, 1, se
p_se_m_no <- df_m_no %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_m_no)$y$range$range[2]

p_se_m_no <- p_se_m_no +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.05, 1, sp
p_sp_m_no <- df_m_no %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_m_no)$y$range$range[2]

p_sp_m_no <- p_sp_m_no +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.05, 1.2 ----
# 0.05, 1.2, pX
p_pX_m_low <- df_m_low %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1.2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_m_low)$y$range$range[2]

p_pX_m_low <- p_pX_m_low +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.05, 1.2, pAlpha
p_pAlpha_m_low <- df_m_low %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.05)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_m_low)$y$range$range[2]

p_pAlpha_m_low <- p_pAlpha_m_low +
  annotate("label",
    x = 0.05,
    y = y_max / 2,
    label = "pAlpha = 0.05",
    hjust = "left"
  )

# 0.05, 1.2, ORx
p_ORx_m_low <- df_m_low %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1.2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_m_low)$y$range$range[2]

p_ORx_m_low <- p_ORx_m_low +
  annotate("label",
    x = 1.2,
    y = y_max / 2,
    label = "ORx = 1.2"
  )

# 0.05, 1.2, se
p_se_m_low <- df_m_low %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_m_low)$y$range$range[2]

p_se_m_low <- p_se_m_low +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.05, 1.2, sp
p_sp_m_low <- df_m_low %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_m_low)$y$range$range[2]

p_sp_m_low <- p_sp_m_low +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.05, 2 ----
# 0.05, 2, pX
p_pX_m_high <- df_m_high %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_m_high)$y$range$range[2]

p_pX_m_high <- p_pX_m_high +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.05, 2, pAlpha
p_pAlpha_m_high <- df_m_high %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.05)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_m_high)$y$range$range[2]

p_pAlpha_m_high <- p_pAlpha_m_high +
  annotate("label",
    x = 0.05,
    y = y_max / 2,
    label = "pAlpha = 0.05",
    hjust = "left"
  )

# 0.05, 2, ORx
p_ORx_m_high <- df_m_high %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_m_high)$y$range$range[2]

p_ORx_m_high <- p_ORx_m_high +
  annotate("label",
    x = 2,
    y = y_max / 2,
    label = "ORx = 2"
  )

# 0.05, 2, se
p_se_m_high <- df_m_high %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_m_high)$y$range$range[2]

p_se_m_high <- p_se_m_high +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.05, 2, sp
p_sp_m_high <- df_m_high %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_m_high)$y$range$range[2]

p_sp_m_high <- p_sp_m_high +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

p_m <- wrap_plots(
  p_pX_m_no, p_pX_m_low, p_pX_m_high,
  p_pAlpha_m_no, p_pAlpha_m_low, p_pAlpha_m_high,
  p_ORx_m_no, p_ORx_m_low, p_ORx_m_high,
  p_se_m_no, p_se_m_low, p_se_m_high,
  p_sp_m_no, p_sp_m_low, p_sp_m_high,
  nrow = 5, ncol = 3,
  axes = "collect"
) +
  plot_annotation(tag_levels = "A")

rm(
  p_pX_m_no, p_pX_m_low, p_pX_m_high,
  p_pAlpha_m_no, p_pAlpha_m_low, p_pAlpha_m_high,
  p_ORx_m_no, p_ORx_m_low, p_ORx_m_high,
  p_se_m_no, p_se_m_low, p_se_m_high,
  p_sp_m_no, p_sp_m_low, p_sp_m_high
)


## 0.5, 1 ----
# 0.5, 1, pX
p_pX_l_no <- df_l_no %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_l_no)$y$range$range[2]

p_pX_l_no <- p_pX_l_no +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.5, 1, pAlpha
p_pAlpha_l_no <- df_l_no %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_l_no)$y$range$range[2]

p_pAlpha_l_no <- p_pAlpha_l_no +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pAlpha = 0.5"
  )

# 0.5, 1, ORx
p_ORx_l_no <- df_l_no %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_l_no)$y$range$range[2]

p_ORx_l_no <- p_ORx_l_no +
  annotate("label",
    x = 1,
    y = y_max / 2,
    label = "ORx = 1"
  )

# 0.5, 1, se
p_se_l_no <- df_l_no %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_l_no)$y$range$range[2]

p_se_l_no <- p_se_l_no +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.5, 1, sp
p_sp_l_no <- df_l_no %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_l_no)$y$range$range[2]

p_sp_l_no <- p_sp_l_no +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.5, 1.2 ----
# 0.5, 1.2, pX
p_pX_l_low <- df_l_low %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 1.2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_l_low)$y$range$range[2]

p_pX_l_low <- p_pX_l_low +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.5, 1.2, pAlpha
p_pAlpha_l_low <- df_l_low %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_l_low)$y$range$range[2]

p_pAlpha_l_low <- p_pAlpha_l_low +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pAlpha = 0.5"
  )

# 0.5, 1.2, ORx
p_ORx_l_low <- df_l_low %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 1.2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_l_low)$y$range$range[2]

p_ORx_l_low <- p_ORx_l_low +
  annotate("label",
    x = 1.2,
    y = y_max / 2,
    label = "ORx = 1.2"
  )

# 0.5, 1.2, se
p_se_l_low <- df_l_low %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_l_low)$y$range$range[2]

p_se_l_low <- p_se_l_low +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.5, 1.2, sp
p_sp_l_low <- df_l_low %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_l_low)$y$range$range[2]

p_sp_l_low <- p_sp_l_low +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

## 0.5, 2 ----
# 0.5, 2, pX
p_pX_l_high <- df_l_high %>%
  ggplot(aes(x = p_x)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(X)", y = "Density", title = "OR = 2") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pX_l_high)$y$range$range[2]

p_pX_l_high <- p_pX_l_high +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pX = 0.5"
  )

# 0.5, 2, pAlpha
p_pAlpha_l_high <- df_l_high %>%
  ggplot(aes(x = p_yTrue_x0)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.5)) +
  theme_bw() +
  labs(x = "p(yTrue, X=0)", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_pAlpha_l_high)$y$range$range[2]

p_pAlpha_l_high <- p_pAlpha_l_high +
  annotate("label",
    x = 0.5,
    y = y_max / 2,
    label = "pAlpha = 0.5"
  )

# 0.5, 2, ORx
p_ORx_l_high <- df_l_high %>%
  ggplot(aes(x = or_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 2)) +
  theme_bw() +
  labs(x = "OR", y = "Density") +
  xlim(0, 4)

y_max <- ggplot2::layer_scales(p_ORx_l_high)$y$range$range[2]

p_ORx_l_high <- p_ORx_l_high +
  annotate("label",
    x = 2,
    y = y_max / 2,
    label = "ORx = 2"
  )

# 0.5, 2, se
p_se_l_high <- df_l_high %>%
  ggplot(aes(x = se_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.6)) +
  theme_bw() +
  labs(x = "Sensitivity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_se_l_high)$y$range$range[2]

p_se_l_high <- p_se_l_high +
  annotate("label",
    x = 0.6,
    y = y_max / 2,
    label = "se = 0.6"
  )

# 0.5, 2, sp
p_sp_l_high <- df_l_high %>%
  ggplot(aes(x = sp_true)) +
  geom_density(fill = "lightblue") +
  geom_vline(aes(xintercept = 0.8)) +
  theme_bw() +
  labs(x = "Specificity", y = "Density") +
  xlim(0, 1)

y_max <- ggplot2::layer_scales(p_sp_l_high)$y$range$range[2]

p_sp_l_high <- p_sp_l_high +
  annotate("label",
    x = 0.8,
    y = y_max / 2,
    label = "sp = 0.8",
    hjust = "right"
  )

p_l <- wrap_plots(
  p_pX_l_no, p_pX_l_low, p_pX_l_high,
  p_pAlpha_l_no, p_pAlpha_l_low, p_pAlpha_l_high,
  p_ORx_l_no, p_ORx_l_low, p_ORx_l_high,
  p_se_l_no, p_se_l_low, p_se_l_high,
  p_sp_l_no, p_sp_l_low, p_sp_l_high,
  nrow = 5, ncol = 3,
  axes = "collect"
) +
  plot_annotation(tag_levels = "A")

rm(
  p_pX_l_no, p_pX_l_low, p_pX_l_high,
  p_pAlpha_l_no, p_pAlpha_l_low, p_pAlpha_l_high,
  p_ORx_l_no, p_ORx_l_low, p_ORx_l_high,
  p_se_l_no, p_se_l_low, p_se_l_high,
  p_sp_l_no, p_sp_l_low, p_sp_l_high
)

rm(
  df_l_high, df_l_low, df_l_no, df_m_high, df_m_low, df_m_no, df_s_high,
  df_s_low, df_s_no, y_max, test_2_result
)

1.3 Running simulation studies with `simChef`

Simulation studies need the following steps:

Prepare experiment
- Create functions
- Create experiment
Run experiment
Visualise

All steps are written in a single R script, which is included in the quarto report using #| file:.... See an example in Section 5.2.2.5.

1.3.1 Prepare experiment

1.3.1.1 Create functions

First, functions used in the simulation experiments need to be created. simChef knows the following types:

Data-generating process (DGP): Provide exactly one DGP function
Method: Describes analysis methods. Provide one or more.
Evaluator: Post-processing. Not necessary but a single evaluator can be provided.
Visualiser: Not used. Create plots after completing the simulation.

1.3.1.1.1 DGP

Arguments of the DGP function are the simulation modules’ parameters as well as n, which describes the number of simulated observations.

If any parameters are drawn from a distribution, start the function with drawing a single value.

Next, create the list of simulation module objects that is passed to argument dags of supfuns::run_sim(). Make sure to list them in the correct order.

Next, create the list of parameters that is passed to argument args of supfuns::run_sim() from the arguments of the DGP function and the values drawn from distributions.

Next, run function supfuns::run_sim().

Finally, return values. Return at least the data.frame created by supfuns::run_sim() and add parameter values in case they are needed by method or evaluator functions.

sim_dgp <- function(n, par_a, par_b) {
  par_c <- rbeta(1, 5, 10)
  
  dags <- list(sim_mod_1, sim_mod_2)
  args <- list(
    par_a = par_a,
    par_b = par_b,
    par_c = par_c
  )
  
  df <- run_sim(
    dags = dags,
    args = args,
    n = n
  )
  
  out <- list(
    df = df,
    par_a = par_a
  )
  
  return(out)
}

1.3.1.1.2 Methods

Method functions have exactly the arguments that are returned by the DGP function, i.e., at least df and maybe additional simulation module parameters. Note that all method functions need to have all returned values as arguments, i.e., even if an additional parameter is only needed by one of the method functions, this additional parameter nonetheless needs to be an argument of all method functions.

Method functions analyse the simulated data.frame and return analysis results. You can either conduct summaries here or move them to the evaluator function. Always name the elements in the list returned by the function.

sim_method <- function(df, par_a) {
  m <- glm(y ~ x, data = df)
  
  out <- list(
    model = m,
    nobs = nobs(m)
  )
  
  return(out)
}

1.3.1.1.3 Evaluator

A function that usually has the single argument fit_results and also returns the object fit_results. The function processes the method functions’ results.

sim_eval <- function(fit_results) {
  fit_results %<>%
    ...
  
  return(fit_results)
}

1.3.1.2 Create experiment

Next, based on the functions, simulation study elements must be created using simChef::create_dgp(), simChef::create_method(), and simChef::create_evaluator().

dgp <- create_dgp(
  .dgp_fun = sim_dgp,
  .name = "dgp"
)

meth <- create_method(
  .method_fun = sim_method,
  .name = "regression"
)

ev <- create_evaluator(
  .eval_fun = sim_eval,
  .name = "eval"
)

Next, one or more experiments need to be created using simChef::create_experiment(). Provide a name to the experiment, as well as future.packages and future.globals, if you want to use parallelisation. future.packages contains packages used in DGP, method, and evaluator functions. It contains at least supfuns as run_sim() is always used. future.global contains the simulation module objects.

Then, add DGP, method, and evaluator simulation study elements to the experiment using simChef::add_dgp(), simChef::add_method(), and simChef::add_evaluator().

Finally, use simChef::add_vary_across() to specify parameter values. Specify your DGP simulation study element for argument .dgp and include all arguments of the DGP function as arguments of add_vary_across(). Pass a list to each argument that contains one or more parameter values. All possible combinations of provided values are simulated. Then remove functions and simulation study elements.

sim_exp <- create_experiment(
  name = "My simulation study",
  future.packages = "supfuns",
  future.globals = c("sim_mod_1", "sim_mod_2")
) %>% 
  add_dgp(dgp) %>% 
  add_method(meth) %>% 
  add_evaluator(ev) %>% 
  add_vary_across(
    .dgp = dgp,
    n = list(1000),
    par_a = list(0.5), 
    par_b = list(0.3, 0.4, 0.8)
  )

rm(sim_dgp, sim_method, sim_eval, dgp, meth, ev)

1.3.2 Run experiment

After the experiment has been created, use simChef::run_experiment() to run the simulation study. Use caching as described below.

if (cache$exists("sim_result")) {
  sim_result <- cache$get("sim_result")
} else {
  sim_result <- sim_exp %>%
    run_experiment(n_reps = 100, save = FALSE)
  cache$set("sim_result", sim_result)
}
rm(sim_exp)

1.3.3 Visualise

Finally, create plots from the simulated data.

1.4 Documenting simulation modules in `quarto`

Simulation modules are documented in the “Pre-data” chapter of a project report. Sub-chapter “Causal model” contains one sub-chapter for each simulation module. See the template below and examples in Section 5.2.1 and Section 5.2.2.

1.4.1 Module <number>: <name>

1.4.1.1 Assumptions

Table 1.1: Assumptions of simulation module “<name>”

Assumption	Justification
<Assumption 1>	<Justification 1>
…	…

1.4.1.2 Variables

Table 1.2: Variables of simulation module “<name>”

Variable	Description
<Variable 1 using $Inline Math$>	<Variable description 1>
…	…

1.4.1.3 Parameters

Table 1.3: Parameters of simulation module “<name>”

Parameter	Description
<Parameter 1 using $Inline Math$>	<Parameter description 1>
…	…

1.4.1.4 Technical details

1.4.1.4.1 Equation block

The complete equation is inside double $$, i.e., double $$ as both the first and the last line
Use align to align all equations at = or ~
- Second line, i.e., directly after $$, is \begin{align}
- Second to last line, i.e., directly before $$, is \end{align}
- Add & in front of the symbols to align, i.e., use &= for $=$ and &\sim for $\sim$
Use \text{some text} to write some text in more of a text format than an equation format
Use _ for subscript and include the subcript in curly brackets if it is longer than 1 character, e.g.: Y_{True} or p_X
End your equations with \\ to start a new equation in a new line
Cases
- Start with \begin{cases} and end with \end{cases}
- To specify, in which cases a equation is applied, end all equations in the cases block with , &\quad\text{for} x = 1
- & in front of \quad is used for aligning them

1.4.1.4.2 Graphs

Type /dot until GraphViz Code Chunk is suggested, then click tab
YAML options in the codeblock start with //|
1. //| label: fig-graph-module<number>
2. //| fig-cap: 'Graph of simulation module "<name>" (Ellipse: Variable, Diamond: Parameter, Box: Variable from other module)'
3. //| fig-height: 2; fig-height is often needed to delete too much white space above and below the graph. Large graphs do not need it. Sometimes, other values give better results, e.g., 3.
Start with digraph keyword, graph name, e.g., sim_mod_1, and curly brackets: digraph sim_mod_number {}; All code goes inside the curly brackets.
Options:
1. rankdir = LR
2. bgcolor = "transparent"
Variable specification: Start with the name and add options in brackets node_name []
1. label =; Put label inside < >; put subscript inside <SUB> </SUB>
2. shape =; no shape for variables (ellipses), box for variables from other modules, diamond for parameters
Edges: node_name1 -> node_name2

digraph m_a {   
  rankdir = LR   
  bgcolor = "transparent"      
  
  from_other_module [label = <From other<SUB>module</SUB>>, shape = box]      
  var [label = <Var<SUB>1</SUB>>]      
  par [label = <A parameter>, shape = diamond]      
  
  from_other_module -> var   
  par -> var 
}

Figure 1.1: Graph of simulation module “name” (Ellipse: Variable, Diamond: Parameter, Box: Variable from other module)

1.4.1.4.3 Code

Create an R code chunk
Add three YAML options:
1. code-fold: true
2. code-summary: "Create module '<name>' in simcausal"
3. file: R/<file name>.R

1.5 Documenting simulation studies in `quarto`

Simulation studies are documented in the “Pre-data” chapter of a project report. Sub-chapter “Simulation” contains one sub-chapter for each simulation module. In addition, the first sub-chapter of “Simulation” is called “Testing” and contains the evaluation, if the simulation modules have been implemented correctly (see Section 1.2). See the template below.

1.5.1 Simulation <number>: <name>

Table 1.4: Input values for simulation <number>

Parameter	Input values
<Parameter 1 using \(Inline Math\)>	<Parameter 1 input value>
<Parameter 2 with different scenarios using \(Inline Math\)>	Scenario A: <Parameter 2 input value for scenario A> Scenario B: <Parameter 2 input value for scenario B>
…	…

Scenario A: <Parameter 2 input value for scenario A>

Scenario B: <Parameter 2 input value for scenario B>

…

<Include code chunk with the simulation script

code-fold: true
code-summary: "Running simulation <number>"
file: R/pre_3_<number>_sim<number>.R>

1.5.1.1 How to include Stan models?

Create an R code block
Use options:
1. #| class-output: stan
2. #| echo: false
Input code
1. For a stan file use: cat(readLines("Stan/file_name.stan"), sep = "\n")
2. For a few lines use: cat(c("Line 1", "Line 2", "..."), sep = "\n")
3. Example for b.: cat(c("// Priors", "a ~ normal(-3.4, 0.6);", "b ~ normal(0, 0.7);"), sep = "\n")

1.1 Creating simulation modules with simcausal