Replication notebook

Accuracy of the LiNGAM Estimation

Replication notebook for evaluating the accuracy of the LiNGAM-based contemporaneous causal structure estimation used in the event-day network analysis.

Overview

This notebook reproduces the Monte Carlo exercise used to evaluate how accurately LiNGAM recovers the contemporaneous causal structure in the illustrative structural system. The goal is to check whether the first identification step of the methodology can recover the correct directed graph from simulated data.

The simulation is based on two versions of the same four-variable contemporaneous system. In the baseline design, the nonzero contemporaneous coefficients are set to $\beta = 0.5$. In the reduced-signal design, the same edges are retained, but their magnitude is lowered to $\beta = 0.3$. This allows the replication to compare graph recovery when the structural links are stronger versus weaker.

For each design, the notebook generates repeated samples from the structural system using Student-$t$ shocks with different degrees of freedom. The degrees of freedom control the degree of non-Gaussianity in the shocks, which is important because LiNGAM relies on non-Gaussian independent innovations for causal orientation. After each simulated dataset is generated, the notebook estimates the graph with pcalg::lingam, compares the estimated pairwise relations with the true relations implied by $A_0$, and summarizes recovery accuracy across Monte Carlo replications.

The main outputs are pair-level recovery tables and graph-level summary tables. These correspond to the LiNGAM recovery results reported in Appendix A of the paper. Because the exercise uses repeated simulations, the notebook may take some time to run. The random seed is fixed to make the replication as reproducible as possible.

Helper functions for the Monte Carlo simulation

This section defines the auxiliary functions used throughout the notebook. The functions convert the true $A_0$ matrix into pairwise causal relations, normalize the adjacency matrix estimated by LiNGAM, compare estimated and true relations, and summarize the Monte Carlo results.

The functions are written once and then reused for both data-generating processes.

R code Cell 3
Show code
library(pcalg)

# The function `gen_svar()` used below is expected to be available in the
# R session, for example after loading the package/code used in the paper.
# `pcalg::lingam()` is used for the actual LiNGAM graph estimation.

# Safe division helper used for precision, recall, specificity and F1.
# If the denominator is zero, the function returns NA instead of producing
# Inf or NaN values in the summary tables.
safe_divide <- function(x, y) {
  ifelse(y > 0, x / y, NA_real_)
}

# Convert the true contemporaneous matrix A0 into a pairwise truth table.
#
# In the structural equation A0 y_t = epsilon_t, a nonzero off-diagonal
# element A0[i, j] means that variable j has a contemporaneous effect on
# variable i. Therefore, for each unordered pair (x, y), the function checks
# whether the true direction is x -> y, y -> x, or no edge.
make_true_pair_table <- function(A0, tol = 1e-10) {
  vars <- colnames(A0)
  pairs <- combn(vars, 2, simplify = FALSE)

  out <- lapply(pairs, function(pr) {
    x <- pr[1]
    y <- pr[2]
    ix <- match(x, vars)
    iy <- match(y, vars)

    # Directional indicators implied by the off-diagonal entries of A0.
    x_to_y <- abs(A0[iy, ix]) > tol
    y_to_x <- abs(A0[ix, iy]) > tol

    # Store the true pairwise relation in a readable string format.
    truth <- if (x_to_y && !y_to_x) {
      paste0(x, " -> ", y)
    } else if (y_to_x && !x_to_y) {
      paste0(y, " -> ", x)
    } else if (!x_to_y && !y_to_x) {
      "no edge"
    } else {
      stop("A0 is not acyclic.")
    }

    data.frame(
      pair  = paste0(x, " - ", y),
      x     = x,
      y     = y,
      truth = truth,
      stringsAsFactors = FALSE
    )
  })

  do.call(rbind, out)
}

# Convert a LiNGAM coefficient matrix into a binary adjacency matrix.
#
# The resulting matrix only records whether an edge is present or absent.
# The diagonal is set to zero because self-loops are not part of the graph
# recovery exercise.
normalize_amat <- function(amat, vars) {
  amat <- as.matrix(amat)
  amat[is.na(amat)] <- 0
  amat <- (amat != 0) * 1L
  diag(amat) <- 0L
  colnames(amat) <- rownames(amat) <- vars
  amat
}

# Extract the estimated relation for one unordered pair of variables.
#
# The convention is the same as above: Adj[i, j] = 1 means j -> i.
# The function returns x -> y, y -> x, or no edge.
get_estimated_relation <- function(Adj, x, y) {
  vars <- colnames(Adj)
  ix <- match(x, vars)
  iy <- match(y, vars)

  x_to_y <- Adj[iy, ix] == 1L
  y_to_x <- Adj[ix, iy] == 1L

  if (x_to_y && !y_to_x) {
    paste0(x, " -> ", y)
  } else if (y_to_x && !x_to_y) {
    paste0(y, " -> ", x)
  } else if (!x_to_y && !y_to_x) {
    "no edge"
  } else {
    stop("Estimated adjacency contains a 2-cycle.")
  }
}

# Pairwise structural Hamming distance.
#
# In this simplified pairwise setting, SHD is the number of unordered pairs
# for which the estimated relation differs from the true relation.
compute_pairwise_shd <- function(est_rel, true_rel) {
  sum(est_rel != true_rel)
}

# Summarize directional recovery for pairs where a true edge exists.
#
# The table separates three cases:
#   1. the correct direction was found;
#   2. the opposite direction was estimated;
#   3. the edge was missed.
summarize_orientation <- function(pair_counts, df, reps) {
  edge_rows <- pair_counts$truth != "no edge"

  if (!any(edge_rows)) {
    return(data.frame(
      df = df,
      reps = reps,
      n_true_edges = 0L,
      correct_dir_rate = NA_real_,
      reversed_rate    = NA_real_,
      missing_rate     = NA_real_,
      stringsAsFactors = FALSE
    ))
  }

  correct_dir <- 0L
  reversed    <- 0L
  missing     <- 0L

  idx <- which(edge_rows)

  for (i in idx) {
    truth_xy <- paste0(pair_counts$x[i], " -> ", pair_counts$y[i])
    truth_yx <- paste0(pair_counts$y[i], " -> ", pair_counts$x[i])

    if (pair_counts$truth[i] == truth_xy) {
      correct_dir <- correct_dir + pair_counts$x_to_y[i]
      reversed    <- reversed    + pair_counts$y_to_x[i]
    } else if (pair_counts$truth[i] == truth_yx) {
      correct_dir <- correct_dir + pair_counts$y_to_x[i]
      reversed    <- reversed    + pair_counts$x_to_y[i]
    }

    missing <- missing + pair_counts$none[i]
  }

  total_cases <- length(idx) * reps

  data.frame(
    df = df,
    reps = reps,
    n_true_edges = length(idx),
    correct_dir_rate = correct_dir / total_cases,
    reversed_rate    = reversed    / total_cases,
    missing_rate     = missing     / total_cases,
    stringsAsFactors = FALSE
  )
}

# Summarize binary edge detection, ignoring the direction of the edge.
#
# This distinguishes true positives, false negatives, false positives and
# true negatives at the pair level. These counts are then converted into
# standard classification metrics.
summarize_binary_detection <- function(pair_counts, df, reps) {
  true_has_edge <- pair_counts$truth != "no edge"
  est_has_edge_count <- pair_counts$x_to_y + pair_counts$y_to_x

  TP <- sum(est_has_edge_count[ true_has_edge])
  FN <- sum(pair_counts$none[   true_has_edge])
  FP <- sum(est_has_edge_count[!true_has_edge])
  TN <- sum(pair_counts$none[   !true_has_edge])

  precision   <- safe_divide(TP, TP + FP)
  recall      <- safe_divide(TP, TP + FN)
  specificity <- safe_divide(TN, TN + FP)
  f1 <- if (is.na(precision) || is.na(recall) || (precision + recall) == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }

  data.frame(
    df = df,
    reps = reps,
    TP = TP,
    FN = FN,
    FP = FP,
    TN = TN,
    precision = precision,
    recall = recall,
    specificity = specificity,
    F1 = f1,
    stringsAsFactors = FALSE
  )
}

# Summarize graph-level recovery across Monte Carlo replications.
#
# `pair_accuracy` is the share of correctly classified pairwise relations.
# `mean_SHD` and `median_SHD` summarize the number of pairwise mistakes.
# `exact_recovery_rate` is the share of replications where the whole graph
# is recovered exactly.
summarize_graph_recovery <- function(pair_counts, shd_vec, df, reps, failed_runs, attempts) {
  pair_accuracy <- sum(pair_counts$correct) / (nrow(pair_counts) * reps)
  exact_recovery_count <- sum(shd_vec == 0L)

  data.frame(
    df = df,
    reps = reps,
    attempts = attempts,
    failed_runs = failed_runs,
    pair_accuracy = pair_accuracy,
    mean_SHD = mean(shd_vec),
    median_SHD = stats::median(shd_vec),
    sd_SHD = stats::sd(shd_vec),
    exact_recovery_count = exact_recovery_count,
    exact_recovery_rate = exact_recovery_count / reps,
    stringsAsFactors = FALSE
  )
}

# Run the Monte Carlo simulation for one value of the degrees of freedom.
#
# For each successful replication, the function:
#   1. generates a sample from the structural system;
#   2. estimates the contemporaneous graph with LiNGAM;
#   3. compares all pairwise estimated relations with the true relations;
#   4. stores pair-level and graph-level recovery statistics.
run_mc_single_df <- function(A0,
                             df,
                             n = 400,
                             burnin = 100,
                             reps = 1000,
                             standardize_for_dag = FALSE,
                             tol = 1e-8,
                             max_attempt_factor = 5) {

  vars <- colnames(A0)
  true_pairs <- make_true_pair_table(A0, tol = tol)

  # Initialize the pair-level counters. Each row corresponds to one
  # unordered pair of variables.
  pair_counts <- true_pairs
  pair_counts$x_to_y  <- 0L
  pair_counts$y_to_x  <- 0L
  pair_counts$none    <- 0L
  pair_counts$correct <- 0L

  shd_vec <- integer(reps)

  success <- 0L
  attempts <- 0L
  failed_runs <- 0L
  max_attempts <- reps * max_attempt_factor

  while (success < reps) {
    attempts <- attempts + 1L

    if (attempts > max_attempts) {
      stop("Too many failed simulation/estimation attempts.")
    }

    # Generate one Monte Carlo sample. The model has no lagged VAR part in
    # this exercise, so `list_A = list()` leaves only the contemporaneous
    # structure given by A0.
    X <- tryCatch(
      gen_svar(
        n = n,
        A0 = A0,
        list_A = list(),
        df = df,
        burnin = burnin
      ),
      error = function(e) NULL
    )

    if (is.null(X)) {
      failed_runs <- failed_runs + 1L
      next
    }

    X <- as.data.frame(as.matrix(X))
    colnames(X) <- vars

    # Standardization is optional. The default keeps the generated data on
    # its original scale, matching the reported replication setting.
    X_dag <- if (standardize_for_dag) scale(X) else as.matrix(X)

    # `pcalg::lingam()` uses random components. To make each LiNGAM call
    # internally reproducible without disturbing the outer Monte Carlo seed,
    # the current random state is saved and restored after estimation.
    old_seed <- if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
      get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      NULL
    }

    set.seed(1119)
    fit_lingam <- tryCatch(
      pcalg::lingam(X_dag, verbose = FALSE),
      error = function(e) NULL
    )

    if (is.null(old_seed)) {
      rm(".Random.seed", envir = .GlobalEnv)
    } else {
      assign(".Random.seed", old_seed, envir = .GlobalEnv)
    }

    if (is.null(fit_lingam)) {
      failed_runs <- failed_runs + 1L
      next
    }

    # Prefer the pruned LiNGAM coefficient matrix when available.
    Bhat <- if (!is.null(fit_lingam$Bpruned)) fit_lingam$Bpruned else fit_lingam$B

    if (is.null(Bhat)) {
      failed_runs <- failed_runs + 1L
      next
    }

    Adj <- normalize_amat(Bhat, vars)

    # Translate the estimated adjacency matrix into pairwise string labels
    # so that the estimates can be compared directly to the truth table.
    est_rel <- mapply(
      FUN = function(x, y) get_estimated_relation(Adj, x, y),
      x = true_pairs$x,
      y = true_pairs$y,
      USE.NAMES = FALSE
    )

    # Update the Monte Carlo counters for each pair.
    pair_counts$x_to_y  <- pair_counts$x_to_y  + as.integer(est_rel == paste0(pair_counts$x, " -> ", pair_counts$y))
    pair_counts$y_to_x  <- pair_counts$y_to_x  + as.integer(est_rel == paste0(pair_counts$y, " -> ", pair_counts$x))
    pair_counts$none    <- pair_counts$none    + as.integer(est_rel == "no edge")
    pair_counts$correct <- pair_counts$correct + as.integer(est_rel == pair_counts$truth)

    success <- success + 1L
    shd_vec[success] <- compute_pairwise_shd(est_rel, true_pairs$truth)
  }

  # Convert raw counts into recovery rates.
  pair_counts$x_to_y_rate  <- pair_counts$x_to_y / reps
  pair_counts$y_to_x_rate  <- pair_counts$y_to_x / reps
  pair_counts$none_rate    <- pair_counts$none / reps
  pair_counts$correct_rate <- pair_counts$correct / reps

  orientation_summary <- summarize_orientation(pair_counts, df = df, reps = reps)
  binary_summary <- summarize_binary_detection(pair_counts, df = df, reps = reps)
  graph_summary <- summarize_graph_recovery(
    pair_counts = pair_counts,
    shd_vec = shd_vec,
    df = df,
    reps = reps,
    failed_runs = failed_runs,
    attempts = attempts
  )

  list(
    pair_counts = pair_counts,
    orientation_summary = orientation_summary,
    binary_summary = binary_summary,
    graph_summary = graph_summary,
    shd_vector = shd_vec
  )
}

# Combine the pair-level results from the different degrees of freedom into
# a single wide table. This is the format used in the appendix tables.
make_pair_wide_table <- function(res_list, dfs, type = c("counts", "rates")) {
  type <- match.arg(type)

  if (type == "counts") {
    cols_keep <- c("x_to_y", "y_to_x", "none", "correct")
  } else {
    cols_keep <- c("x_to_y_rate", "y_to_x_rate", "none_rate", "correct_rate")
  }

  tabs <- lapply(seq_along(dfs), function(i) {
    d <- dfs[i]
    tmp <- res_list[[i]]$pair_counts[, c("pair", "x", "y", "truth", cols_keep)]
    names(tmp)[-(1:4)] <- paste0("df", d, "_", cols_keep)
    tmp
  })

  out <- Reduce(function(a, b) {
    merge(a, b, by = c("pair", "x", "y", "truth"), all = TRUE)
  }, tabs)

  out[order(out$pair), ]
}

# Main wrapper used in the notebook.
#
# It runs the Monte Carlo exercise for all requested degrees of freedom and
# returns both pair-level tables and aggregate graph recovery summaries.
run_lingam_mc <- function(A0,
                          dfs = c(3, 6, 9),
                          n = 400,
                          burnin = 100,
                          reps = 1000,
                          standardize_for_dag = FALSE,
                          tol = 1e-8,
                          seed = 123) {

  set.seed(seed)

  res_list <- lapply(dfs, function(d) {
    run_mc_single_df(
      A0 = A0,
      df = d,
      n = n,
      burnin = burnin,
      reps = reps,
      standardize_for_dag = standardize_for_dag,
      tol = tol
    )
  })

  pair_table_counts <- make_pair_wide_table(res_list, dfs, type = "counts")
  pair_table_rates  <- make_pair_wide_table(res_list, dfs, type = "rates")

  orientation_table <- do.call(
    rbind,
    lapply(res_list, function(x) x$orientation_summary)
  )
  orientation_table <- orientation_table[order(orientation_table$df), ]

  binary_detection_table <- do.call(
    rbind,
    lapply(res_list, function(x) x$binary_summary)
  )
  binary_detection_table <- binary_detection_table[order(binary_detection_table$df), ]

  graph_recovery_table <- do.call(
    rbind,
    lapply(res_list, function(x) x$graph_summary)
  )
  graph_recovery_table <- graph_recovery_table[order(graph_recovery_table$df), ]

  list(
    pair_table_counts = pair_table_counts,
    pair_table_rates = pair_table_rates,
    orientation_table = orientation_table,
    binary_detection_table = binary_detection_table,
    graph_recovery_table = graph_recovery_table,
    raw_results = res_list
  )
}

Baseline LiNGAM recovery experiment: $\beta = 0.5$

This section runs the Monte Carlo recovery exercise for the baseline data-generating process. The nonzero contemporaneous coefficients in $A_0$ are set to $-0.5$, which corresponds to the stronger-link version of the illustrative system.

The simulation is run for Student-$t$ shocks with 3, 6 and 9 degrees of freedom. Lower degrees of freedom imply stronger non-Gaussianity, which should generally make LiNGAM orientation easier. The resulting pair-level counts and graph-level recovery statistics correspond to Table A.4 in Appendix A.

R code Cell 5
Show code
# Baseline contemporaneous structural matrix.
#
# The diagonal elements are fixed at one. The nonzero off-diagonal
# elements define the true contemporaneous causal graph:
#   Y1 -> Y3, Y2 -> Y3, Y1 -> Y4, and Y3 -> Y4.
A0 <- matrix(c(
  1.0,  0.0,  0.0,  0.0,
  0.0,  1.0,  0.0,  0.0,
 -0.5, -0.5,  1.0,  0.0,
 -0.5,  0.0, -0.5,  1.0
), nrow = 4, byrow = TRUE)

# Assign variable names used in the pairwise recovery tables.
colnames(A0) <- rownames(A0) <- paste0("Y", seq_len(nrow(A0)))

# Run the Monte Carlo experiment.
#
# `dfs = c(3, 6, 9)` compares different levels of non-Gaussianity.
# `n = 400` is the sample size used in each replication.
# `reps = 1000` is the number of successful Monte Carlo replications.
mc_out <- run_lingam_mc(
  A0 = A0,
  dfs = c(3, 6, 9),
  n = 400,
  burnin = 100,
  reps = 1000,
  tol = 1e-8,
  seed = 123
)

# Pair-level recovery counts by degrees of freedom.
mc_out$pair_table_counts

# Aggregate graph recovery statistics.
mc_out$graph_recovery_table
Output 1
A data.frame: 6 × 16
pairxytruthdf3_x_to_ydf3_y_to_xdf3_nonedf3_correctdf6_x_to_ydf6_y_to_xdf6_nonedf6_correctdf9_x_to_ydf9_y_to_xdf9_nonedf9_correct
<chr><chr><chr><chr><int><int><int><int><int><int><int><int><int><int><int><int>
1Y1 - Y2Y1Y2no edge 75988 988 2533942942 97 77826826
2Y1 - Y3Y1Y3Y1 -> Y3 9940 6 99495825 17958819105 76819
3Y1 - Y4Y1Y4Y1 -> Y4 9920 8 992993 7 0993958 42 0958
4Y2 - Y3Y2Y3Y2 -> Y310000 0100097129 0971869131 0869
5Y2 - Y4Y2Y4no edge 100990 990 28 7965965 61 35904904
6Y3 - Y4Y3Y4Y3 -> Y4 9920 8 99297030 0970870130 0870
Output 2
A data.frame: 3 × 10
dfrepsattemptsfailed_runspair_accuracymean_SHDmedian_SHDsd_SHDexact_recovery_countexact_recovery_rate
<dbl><dbl><int><int><dbl><dbl><dbl><dbl><int><dbl>
131000100000.99266670.04400.21001939570.957
261000100000.96650000.20100.63639928920.892
391000100000.87433330.75401.18274386640.664

Reduced-signal LiNGAM recovery experiment: $\beta = 0.3$

This section repeats the same Monte Carlo design after reducing the magnitude of the nonzero contemporaneous coefficients from $0.5$ to $0.3$. The true graph is unchanged, but the causal signal is weaker, so the exercise checks whether graph recovery deteriorates when the structural links are harder to detect.

The outputs are directly comparable to the baseline experiment above and correspond to Table A.5 in Appendix A.

R code Cell 7
Show code
# Reduced-signal contemporaneous structural matrix.
#
# The graph is the same as in the baseline experiment, but the nonzero
# coefficients are smaller in absolute value. This makes the causal links
# weaker while keeping the true edge pattern unchanged.
A0 <- matrix(c(
  1.0,  0.0,  0.0,  0.0,
  0.0,  1.0,  0.0,  0.0,
 -0.3, -0.3,  1.0,  0.0,
 -0.3,  0.0, -0.3,  1.0
), nrow = 4, byrow = TRUE)

# Assign the same variable names as in the baseline experiment.
colnames(A0) <- rownames(A0) <- paste0("Y", seq_len(nrow(A0)))

# Run the same Monte Carlo experiment with the weaker A0 coefficients.
mc_out <- run_lingam_mc(
  A0 = A0,
  dfs = c(3, 6, 9),
  n = 400,
  burnin = 100,
  reps = 1000,
  tol = 1e-8,
  seed = 123
)

# Pair-level recovery counts by degrees of freedom.
mc_out$pair_table_counts

# Aggregate graph recovery statistics.
mc_out$graph_recovery_table
Output 1
A data.frame: 6 × 16
pairxytruthdf3_x_to_ydf3_y_to_xdf3_nonedf3_correctdf6_x_to_ydf6_y_to_xdf6_nonedf6_correctdf9_x_to_ydf9_y_to_xdf9_nonedf9_correct
<chr><chr><chr><chr><int><int><int><int><int><int><int><int><int><int><int><int>
1Y1 - Y2Y1Y2no edge 75988988 8 18974974 21 22957957
2Y1 - Y3Y1Y3Y1 -> Y39242 74924854 65 81854702186112702
3Y1 - Y4Y1Y4Y1 -> Y49060 94906938 24 38938843115 42843
4Y2 - Y3Y2Y3Y2 -> Y39310 69931889 67 44889778187 35778
5Y2 - Y4Y2Y4no edge 110989989 24 8968968 27 13960960
6Y3 - Y4Y3Y4Y3 -> Y49131 86913864104 32864754205 41754
Output 2
A data.frame: 3 × 10
dfrepsattemptsfailed_runspair_accuracymean_SHDmedian_SHDsd_SHDexact_recovery_countexact_recovery_rate
<dbl><dbl><int><int><dbl><dbl><dbl><dbl><int><dbl>
131000100000.94183330.34900.56348606930.693
261000100000.91450000.51300.80154446390.639
391000100000.83233331.00611.08033844250.425