Lab 13 — A method-comparison simulation

Simulating Type I error, power, and coverage for several methods across data-generating processes

Purpose. This lab is the hands-on companion to Week 13 — Simulation study of method behavior. The note argues that for twelve weeks you reasoned about which assumption-light method to use, and that a simulation study turns that reasoning into an experiment on the methods themselves: invent a world where the truth is known, run each method on thousands of samples from that world, and tally how often it keeps its promises. Here you build that machine. You will fix four data-generating processes — normal, right-skewed (lognormal), heavy-tailed (\(t_3\)), and contaminated (5% outliers) — and, for each, use replicate() to estimate the Type I error, power, and 95% confidence-interval coverage of the t-test, a permutation test, the Wilcoxon rank-sum, and a trimmed-mean interval. You will then read the locked headline numbers off the result: under \(t_3\), power of about \(0.55\) for the t-test versus about \(0.70\) for the Wilcoxon; under contamination, mean-CI coverage of about \(0.86\) versus about \(0.94\) for the trimmed-mean interval; and under the normal world, everything near its nominal level. Throughout you will carry the Monte Carlo error so you can tell a real gap from simulation noise. The code is shown for study and is not executed on this site; you run it in your own R session.

The idea

Every test you have met comes with a promise. A 5% test should reject a true null about 5% of the time — that is its Type I error. A 95% interval should trap the true parameter about 95% of the time — that is its coverage. And against a real effect, a good method should reject often — that is its power. These promises are not unconditional gifts; they are consequences of the assumptions baked into each method’s derivation. The t-test leans on the sampling distribution of the mean being roughly normal; the permutation test leans on exchangeability under the null; the Wilcoxon leans on a location-shift model with exchangeable ranks; the trimmed mean’s interval leans on the trimmed distribution being well-behaved. When the data violate those assumptions, the promises can quietly break — and a p-value will never volunteer that it has broken.

A simulation study is how you catch the break in the act. The move is simple but powerful: you set the truth yourself. You write down a data-generating process (DGP) — a recipe that manufactures samples whose true mean, true shape, and true effect you know, because you typed them. Then you run a method on thousands of those samples and count. If you generate data under the null and a 5% test rejects 5% of the time, the test is honest in that world. If the same test rejects 12% of the time, it is inflating its Type I error — claiming significance it has not earned. If a 95% interval covers the true value only 86% of the time, it is over-confident: narrower than it has any right to be. There is no formula to argue with here, only a tally you generated under conditions you fully control.

This is the course’s assumption-ladder discipline made empirical. All term you argued that ranks resist heavy tails and that a trimmed mean resists contamination. This lab measures it. And it measures the discipline’s limit too: a simulation is itself a sample, so each estimated proportion carries Monte Carlo error, and it speaks only about the specific worlds you chose to simulate. No method wins everywhere; the lab’s payoff is the habit of reading a simulation as bounded, world-specific, noisy evidence — never as a final verdict.

Goal

By the end of this lab you should be able to:

Encode four data-generating processes in R with a fixed seed — normal, lognormal (right-skewed), \(t_3\) (heavy-tailed), and a 5%-contaminated mixture — and explain what each one assumes and breaks.
Write a one-sample (or two-sample) test driver for each method (t-test, permutation, Wilcoxon, trimmed-mean interval) that returns a reject/cover decision for one simulated sample.
Use replicate() to run each method on thousands of samples and estimate its Type I error (data under the null), power (data under an alternative), and 95% CI coverage.
Compute the Monte Carlo standard error of each estimated proportion and use it to decide whether two methods’ results genuinely differ rather than wobbling in simulation noise.
Read the locked headline numbers — \(t_3\) power \(\approx 0.55\) (t) vs \(\approx 0.70\) (Wilcoxon); contaminated coverage \(\approx 0.86\) (mean) vs \(\approx 0.94\) (trimmed); normal \(\approx\) nominal — and state, for each, the assumption-ladder move it illustrates.

The point of the lab is the comparison experiment: you are not asking “is my effect significant?” but “which method keeps its promises in which world, and by how much?” The code is the means; the read of the table is the message.

Setup

You need only base R — no packages (the Wilcoxon and t-test are base functions; the permutation and trimmed-mean drivers are short). Every chunk that draws randomness starts with set.seed(45203), the course-wide seed, so your run reproduces the locked Week 13 numbers. The data are synthetic; seed set — these are simulated worlds, not real records.

Fix the simulation’s design facts before any code runs, because every estimate below is read against them:

The four DGPs. Each is a recipe for one sample. The normal world is the friendly baseline where every method’s assumptions hold. The lognormal world is right-skewed (the wait-time shape of Dataset W). The \(t_3\) world is symmetric but heavy-tailed — extreme values are common. The contaminated world is mostly normal but with \(5\%\) of points drawn from a far-off outlier distribution (the spirit of Dataset D’s two bad points, scaled up).
The four methods. The t-test (t.test), a permutation test on the difference in means, the Wilcoxon rank-sum (wilcox.test), and a trimmed-mean interval (a \(10\%\) trimmed mean with a matching trimmed-variance CI). These are exactly the four the course has threaded all term.
The three quantities. Type I error = rejection rate when data are generated under the null (no effect); target \(0.05\). Power = rejection rate when data are generated under an alternative (a real effect); higher is better, given the level is honest. Coverage = the fraction of \(95\%\) intervals that contain the true parameter; target \(0.95\).
The replication count. n_sim <- 2000 samples per cell. With \(2000\) replicates a proportion near \(p\) carries a Monte Carlo standard error of about \(\sqrt{p(1-p)/2000}\) — roughly \(0.011\) near \(p=0.5\) and about \(0.005\) near \(p=0.05\). That number is what lets you say a gap is real.
The locked headline cells (synthetic, provisional). Under \(t_3\): power \(\approx 0.55\) (t-test) vs \(\approx 0.70\) (Wilcoxon). Under contamination: \(95\%\) CI coverage \(\approx 0.86\) (ordinary mean) vs \(\approx 0.94\) (trimmed mean). Under normal: all Type I errors \(\approx 0.05\), power comparable (t slightly best), coverage \(\approx 0.95\). Under lognormal: the t-test CI under-covers (\(\approx 0.91\)) while the rank method holds level and gains power.

set.seed(45203)

# --- Method-comparison simulation: four DGPs x four methods --------------------
# Synthetic worlds; the headline cells are the LOCKED Week 13 results.
n        <- 30        # sample size per group
n_sim    <- 2000      # Monte Carlo replications per cell
n_perm   <- 1000      # shuffles inside each permutation test
delta    <- 0.8       # true mean shift used for the POWER runs (0 for Type I / coverage-null)
trim     <- 0.10      # trimming fraction for the trimmed-mean method

# Four data-generating processes, each returns a length-n sample with mean ~ mu.
# `mu` is the location we shift to create an effect; the SHAPE is what differs.
draw <- list(
  normal       = function(n, mu) rnorm(n, mean = mu, sd = 1),
  lognormal    = function(n, mu) mu + (rlnorm(n, 0, 1) - exp(0.5)),   # right-skew, re-centered
  t3           = function(n, mu) mu + rt(n, df = 3),                  # heavy tails
  contaminated = function(n, mu) {                                    # 5% far outliers
    z <- rnorm(n, mean = mu, sd = 1)
    k <- rbinom(n, 1, 0.05)                # 5% flagged for contamination
    z + k * rnorm(n, mean = 10, sd = 5)    # bumped far away
  }
)

# The four 95% CI / test methods, each a function of one or two samples.
# (t.test and wilcox.test are base R; the permutation and trimmed pieces are short.)
c(n = n, n_sim = n_sim, n_perm = n_perm, delta = delta, trim = trim)
# -> the locked simulation design this lab reads from

The design here is the whole lesson in one screen: the DGP list fixes the worlds, the method functions fix the competitors, and n_sim fixes how sharply you can see. Everything below just turns the crank and counts. Note the discipline already visible in the recipes — the lognormal and contaminated draws are re-centered so that “no effect” really means the same true mean across worlds; if you skipped that, a coverage shortfall could be an artifact of a mis-aimed target rather than a real method failure.

Steps

You will build the simulation in three moves: write the per-method decision drivers for one sample (Step 1), wrap each in replicate() to estimate one quantity in one world (Step 2), then sweep all methods across all DGPs and read the locked table with Monte Carlo error attached (Step 3).

Step 1 — Write the per-sample decision drivers

Every simulated quantity is an average of a yes/no decision over many samples. So the first move is to write, for each method, a function that takes one (or two) simulated samples and returns a single logical: did it reject the null, or did its interval cover the truth? Keep the truth — mu0 for the null value, delta for the true effect — as explicit arguments, because the whole point is that you know it.

set.seed(45203)

# --- Decision drivers: each returns TRUE/FALSE for ONE simulated data set ------

# Two-sample t-test: reject H0 of equal means at level 0.05?
reject_t <- function(x, y) {
  t.test(x, y)$p.value < 0.05
}

# Permutation test on the difference in means: reject at 0.05?
# Pool, shuffle the group labels n_perm times, compare observed gap to the null pile.
reject_perm <- function(x, y, n_perm = 1000) {
  obs  <- mean(x) - mean(y)
  pool <- c(x, y); g <- length(x)
  null <- replicate(n_perm, {
    idx <- sample(length(pool), g)                 # relabel: pick g "x" indices
    mean(pool[idx]) - mean(pool[-idx])             # difference under this shuffle
  })
  mean(abs(null) >= abs(obs)) < 0.05               # two-sided permutation p < 0.05
}

# Wilcoxon rank-sum: reject the location-shift null at 0.05?
reject_wilcox <- function(x, y) {
  wilcox.test(x, y)$p.value < 0.05
}

# Does a 95% interval for the CENTER cover the true value `true`?
# Ordinary-mean t-interval vs a 10% trimmed-mean interval (winsorized SE).
cover_mean <- function(x, true) {
  ci <- t.test(x)$conf.int
  true >= ci[1] && true <= ci[2]
}
cover_trim <- function(x, true, trim = 0.10) {
  n  <- length(x); g <- floor(trim * n)
  xt <- sort(x)[(g + 1):(n - g)]                   # the trimmed sample
  xw <- pmin(pmax(x, min(xt)), max(xt))            # winsorized for the SE
  m  <- mean(xt)
  se <- sqrt(sum((xw - mean(xw))^2)) / ((n - 2 * g) * sqrt(n - 1))
  m + c(-1, 1) * qt(0.975, df = n - 2 * g - 1) * se -> ci
  true >= ci[1] && true <= ci[2]
}

Each driver is the method reduced to a single decision, which is exactly the unit a rate is an average of. Notice the assumption-ladder fingerprints already showing. reject_t assumes the sampling distribution of the mean is roughly normal; reject_perm resamples the labels and assumes only exchangeability under the null; reject_wilcox ranks and assumes a location-shift model; cover_trim downweights the tails by trimming and protects against a few wild points moving the center. None of the four is assumption-free — reject_perm still assumes the labels are exchangeable, and cover_trim still assumes the trimmed distribution is well-behaved. The simulation cannot prove any method correct; it can only show, for each chosen world, how often the promise held.

Step 2 — Estimate one quantity in one world

Now wrap a driver in replicate() to turn one decision into a rate. Generate n_sim fresh samples from a DGP, apply the driver to each, and average the logicals. Generating under the null (no shift) gives a Type I error or a coverage estimate; generating under the alternative (a real delta) gives power. Do this once, in the \(t_3\) world, to see the headline power gap appear.

set.seed(45203)

# --- One cell: POWER of the t-test vs Wilcoxon in the heavy-tailed t_3 world ----
# Generate group x with a real mean shift `delta`, group y at the null mean 0.
power_in_world <- function(world, method, delta, n, n_sim, ...) {
  gen <- draw[[world]]
  mean(replicate(n_sim, {
    x <- gen(n, mu = delta)          # treated: shifted by the true effect
    y <- gen(n, mu = 0)              # control: at the null
    method(x, y, ...)               # did this method reject?
  }))
}

pow_t   <- power_in_world("t3", reject_t,      delta = delta, n = n, n_sim = n_sim)
pow_wil <- power_in_world("t3", reject_wilcox, delta = delta, n = n, n_sim = n_sim)

round(c(t_test = pow_t, wilcoxon = pow_wil), 2)
# -> t_test ~ 0.55,  wilcoxon ~ 0.70   (LOCKED Week 13 cells)

# Monte Carlo standard error of a proportion p over n_sim replicates:
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)
round(c(se_t = mc_se(pow_t, n_sim), se_wil = mc_se(pow_wil, n_sim)), 3)
# -> se ~ 0.011 each: the 0.15 gap is ~ 13 SEs wide -> real, not noise

The two power estimates come out at about \(0.55\) for the t-test and about \(0.70\) for the Wilcoxon in the heavy-tailed \(t_3\) world. Interpret that gap directly: when the data carry frequent extreme values, the t-test’s reliance on the mean is a liability — a single heavy-tailed draw inflates the variance, widens the interval, and costs power — whereas the Wilcoxon, which only uses ranks, is barely disturbed by how far out a tail value lands, so it detects the same real shift more often. That is the assumption-ladder move in one sentence: the rank method assumes a location-shift model, ranks rather than averaging, protects against heavy tails, and cannot prove the shift is a difference in means (it is a stochastic-shift statement). The Monte Carlo SE of about \(0.011\) each makes the gap roughly thirteen standard errors wide — this is a genuine difference between methods, not two numbers that happened to fall apart on one run.

Step 3 — Sweep all methods across all DGPs and read the table

Now turn the full crank: loop the right driver over each DGP for each of the three quantities, and lay the results out as the comparison table the week is built around. The contaminated-world coverage row is the second locked headline; the normal row is the sanity baseline.

set.seed(45203)

# --- Coverage cell: contaminated world, ordinary-mean vs trimmed-mean 95% CI ----
# Data generated under the NULL (true center = 0); does each 95% CI cover 0?
coverage_in_world <- function(world, cover_fn, true, n, n_sim, ...) {
  gen <- draw[[world]]
  mean(replicate(n_sim, {
    x <- gen(n, mu = true)
    cover_fn(x, true = true, ...)
  }))
}

cov_mean <- coverage_in_world("contaminated", cover_mean, true = 0, n = n, n_sim = n_sim)
cov_trim <- coverage_in_world("contaminated", cover_trim, true = 0, n = n, n_sim = n_sim, trim = trim)

round(c(mean_CI = cov_mean, trimmed_CI = cov_trim), 2)
# -> mean_CI ~ 0.86,  trimmed_CI ~ 0.94   (LOCKED Week 13 cells)

# --- The sanity baseline: the NORMAL world, where every promise should hold -----
typeI_t_normal <- power_in_world("normal", reject_t, delta = 0, n = n, n_sim = n_sim)   # delta=0 -> Type I
cov_mean_norm  <- coverage_in_world("normal", cover_mean, true = 0, n = n, n_sim = n_sim)
round(c(typeI_t = typeI_t_normal, coverage_mean = cov_mean_norm), 2)
# -> typeI_t ~ 0.05,  coverage_mean ~ 0.95   (all ~ nominal in the friendly world)

# --- Assemble the comparison table (one row per DGP) -----------------------------
worlds  <- c("normal", "lognormal", "t3", "contaminated")
results <- data.frame(
  DGP            = worlds,
  typeI_t        = sapply(worlds, function(w) power_in_world(w, reject_t,    delta = 0,     n = n, n_sim = n_sim)),
  power_t        = sapply(worlds, function(w) power_in_world(w, reject_t,    delta = delta, n = n, n_sim = n_sim)),
  power_wilcox   = sapply(worlds, function(w) power_in_world(w, reject_wilcox, delta = delta, n = n, n_sim = n_sim)),
  coverage_mean  = sapply(worlds, function(w) coverage_in_world(w, cover_mean, true = 0, n = n, n_sim = n_sim)),
  coverage_trim  = sapply(worlds, function(w) coverage_in_world(w, cover_trim, true = 0, n = n, n_sim = n_sim, trim = trim))
)
round(results[, -1], 2)
# Expected (LOCKED, synthetic):
#   DGP            typeI_t  power_t  power_wilcox  coverage_mean  coverage_trim
#   normal           0.05     0.78        0.75          0.95           0.95
#   lognormal        0.06     0.72        0.80          0.91           0.94
#   t3               0.05     0.55        0.70          0.93           0.94
#   contaminated     0.07     0.50        0.68          0.86           0.94

Read the table as a story about matching the method to the world. In the normal row everything sits at its nominal value — Type I error near \(0.05\), coverage near \(0.95\), and the t-test’s power even slightly ahead — which is the baseline that tells you the simulation machinery itself is honest. In the lognormal row the t-test’s interval starts to under-cover (about \(0.91\) against a \(0.95\) target) and the rank method gains power, because skew distorts the mean’s sampling distribution but leaves ranks alone. In the \(t_3\) row the headline power gap appears — about \(0.55\) for the t-test against about \(0.70\) for the Wilcoxon — because heavy tails inflate the variance the t-test depends on. In the contaminated row the ordinary-mean interval’s coverage collapses to about \(0.86\) while the trimmed-mean interval holds near \(0.94\), because \(5\%\) of far-off points drag the mean and balloon its SE, whereas trimming downweights them away. Each cell is the assumption-ladder made visible: the t-test assumes normality of the mean’s sampling distribution and pays when that fails; the permutation test resamples and holds its level; the Wilcoxon ranks and resists tails; the trimmed mean downweights and resists contamination — and none of them proves anything beyond the worlds you simulated.

Verify

This is the moment where the simulation and the Week 13 reasoning are supposed to meet. Check each of these against the note, and treat a mismatch as a bug in your code, not a discovery about the world — every target value is the locked, synthetic Week 13 cell.

The normal world is the honest baseline. typeI_t in the normal row should land near \(0.05\) and coverage_mean near \(0.95\). If they do not, your generator is mis-centered or your level/true argument is wrong — fix that before trusting any other row, because the normal row is what certifies the machinery.
The \(t_3\) power gap is real. power_t \(\approx 0.55\) versus power_wilcox \(\approx 0.70\) in the \(t_3\) row. With \(n\_sim = 2000\) the Monte Carlo SE is about \(0.011\) on each, so the \(\approx 0.15\) gap is more than ten SEs wide — a genuine method difference. If the two are within an SE of each other, check that your heavy-tailed draw really uses rt(n, df = 3) and not a normal.
The contaminated coverage shortfall is real. coverage_mean \(\approx 0.86\) versus coverage_trim \(\approx 0.94\) in the contaminated row. If the mean interval covers near \(0.95\), your contamination is too mild (confirm the \(5\%\) rate and the far-off outlier mean) or you trimmed the wrong sample.
Every estimate carries Monte Carlo error. Report each rate as \(\hat p \pm \text{MC SE}\) with \(\text{MC SE} = \sqrt{\hat p(1-\hat p)/n\_sim}\). A difference smaller than two MC SEs is not evidence of a real gap — it is simulation noise, and increasing n_sim is the only honest way to shrink it.
The conclusion is stated as world-specific. In one sentence: no method dominates — the t-test is fine and even best under normality, the Wilcoxon wins power under heavy tails, the trimmed mean protects coverage under contamination, and each statement holds only for the worlds I simulated. If your written conclusion crowns a single “best test,” or reads one DGP as the whole truth, the reasoning — not the code — needs fixing.

A small honest note on reproducibility: every cell in this table is itself a Monte Carlo estimate, so across seeds or replication counts each value will wobble in its last digit. Increasing n_sim tightens that wobble; it does not change which method matches which world. The wobble is the Monte Carlo error the week asks you to report — do not hide it.

AI use note

If you use an AI assistant on this lab, record it briefly. The load-bearing column is Verification — how you confirmed the output yourself, against the Week 13 note and the locked numbers, rather than trusting the tool. A simulation that an assistant wrote for you is still your claim about how the methods behave; you have to be able to defend every cell.

Tool	Purpose	Verification
LLM chat assistant	Explain why the t-test loses power under \(t_3\) while the Wilcoxon holds	Re-ran the \(t_3\) power cell, confirmed `power_t` \(\approx 0.55\) vs `power_wilcox` \(\approx 0.70\), and reconciled the explanation against the Week 13 note’s heavy-tail argument
LLM chat assistant	Draft the trimmed-mean coverage driver `cover_trim`	Checked it against the normal row (coverage \(\approx 0.95\)) as a baseline, then confirmed the contaminated-row value \(\approx 0.94\) vs the mean interval’s \(\approx 0.86\); verified the trimming fraction was 10%
Code formatter / linter	Tidy the `sapply` sweep building the results table	Diffed before/after to confirm only whitespace changed; re-ran to confirm the normal/\(t_3\)/contaminated headline cells were unchanged