Week 13 — Simulation study of method behavior

How Type I error, power, and coverage compare across data-generating processes

The week question

For twelve weeks you have chosen a method — the t-test, a permutation test, the Wilcoxon rank-sum, a trimmed mean — by reasoning about what the data look like and what each method assumes. This week you turn that reasoning into an experiment on the methods themselves. The question is sharp and load-bearing: when the truth is known, how often does each method behave the way it promises — keep its Type I error at the nominal level, find a real effect, and cover the true value — and how does that behavior change as the data-generating process changes? A simulation study answers it by inventing a world, running the methods on thousands of samples drawn from that world, and tallying how they do. The payoff is a defensible, evidence-based answer to “which method here, and why,” instead of a habit.

Why this matters

Every test and interval you have learned comes with a promise: a 5% test rejects a true null about 5% of the time; a 95% interval covers the true parameter about 95% of the time. Those promises are conditional — they hold under the assumptions baked into the method’s derivation. The t-test’s promises lean on approximate normality of the sampling distribution; the permutation test’s lean on exchangeability under the null; the Wilcoxon’s lean on a location-shift model and exchangeable ranks; the trimmed mean’s interval leans on the trimmed distribution being well-behaved. When the real data violate those assumptions — skew, heavy tails, contamination — the promises can quietly break, and a formula will not tell you it has broken. A simulation tells you, because you set the truth yourself and then check whether the method recovered it.

This is the course’s assumption-ladder discipline made empirical. For twelve weeks you argued that ranks resist heavy tails and that a trimmed mean resists contamination. This week you measure it: under a \(t_3\) heavy-tailed world the Wilcoxon’s power (\(\approx 0.70\)) overtakes the t-test’s (\(\approx 0.55\)); under 5% contamination the ordinary-mean interval’s coverage sags to \(\approx 0.86\) while a trimmed-mean interval holds near \(\approx 0.94\). Crucially, you also learn the discipline’s limit: a simulation is itself a sample, so it carries Monte Carlo error, and it speaks only about the specific worlds you simulated. No method wins everywhere. The skill is to read a simulation as evidence — bounded, world-specific, noisy — rather than as a verdict.

Learning goals

By the end of this week you should be able to:

Define a data-generating process (DGP) and explain why fixing the truth is what makes a simulation able to score a method’s behavior.
Estimate a method’s Type I error, power, and confidence-interval coverage by replication, and say in words what each quantity promises and what a deviation from the nominal value means.
Read a method-comparison table across normal, right-skewed (lognormal), heavy-tailed (\(t_3\)), and contaminated (5% outliers) DGPs, and explain why each method rises or falls as the DGP changes.
Compute and report the Monte Carlo standard error of a simulated proportion, and use it to decide whether two methods’ results actually differ.
Articulate the week’s central caution — no method dominates across all DGPs — and resist reading a single run or a single DGP as the whole truth.

Core vocabulary

Data-generating process (DGP) — the known mechanism you sample from in a simulation: a distribution (or pair of distributions) with parameters you set, so the “truth” — the real effect, the real coverage target — is known by construction.
Type I error rate — the probability a test rejects when the null is true; you estimate it by generating data with no real effect and recording the fraction of rejections. The nominal target is the chosen level, here \(\alpha = 0.05\).
Power — the probability a test rejects when the null is false; you estimate it by generating data with a real effect of a set size and recording the fraction of rejections. Higher is better, but only among methods that first hold their level.
Coverage — the fraction of confidence intervals that contain the true parameter; a valid 95% interval covers about \(0.95\) of the time. Under-coverage (e.g. \(0.86\)) means the interval is secretly more confident than it admits.
Replication (\(B\)) — the number of simulated samples drawn from a DGP; each is analyzed by every method, and the results are averaged. Larger \(B\) shrinks Monte Carlo error.
Monte Carlo error / Monte Carlo standard error (MCSE) — the sampling uncertainty of a simulated estimate, because a simulation is itself a finite sample. For an estimated proportion \(\hat p\) from \(B\) replications, \(\operatorname{MCSE}(\hat p) = \sqrt{\hat p(1-\hat p)/B}\).
Nominal vs. actual — the promised level/coverage (\(0.05\), \(0.95\)) versus what the method actually delivers under a given DGP. The gap between them is the whole subject of this week.
Robustness (of a method’s validity) — whether a method keeps its promises as the DGP departs from the ideal. A method can be valid (holds level/coverage) without being the most powerful, and the two properties can trade off against each other.

Concept development

A simulation study is an experiment whose units are samples

A simulation study flips the usual data analysis on its head. Normally you have one dataset and an unknown truth, and you infer the truth. Here you fix the truth — you choose the DGP, so you know the real effect and the real parameter exactly — and then you ask how a method behaves across many datasets that truth could produce. The “units” of this experiment are simulated samples; the “treatment” is the method; the “outcome” is whether the method did what it promised (rejected correctly, covered the truth).

The recipe is the same every time. (1) Specify a DGP with the truth known. (2) Draw a sample of size \(n\) from it. (3) Run every method on that sample and record each method’s decision (reject or not) and each interval (did it cover the true value?). (4) Repeat \(B\) times. (5) Average: the fraction of rejections is an estimate of Type I error (if the null is true) or power (if it is false); the fraction of covering intervals is an estimate of coverage. What is assumed is that your DGP is a relevant stand-in for the real-world data shape. What is resampled is nothing physical — you are drawing fresh synthetic samples from a known law. What it protects against is choosing a method on faith; what it cannot prove is that a real dataset matches any DGP you simulated. That last gap is why a simulation informs a choice but never closes it.
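The five steps can be sketched in R for a single method. This is a minimal illustration, not the locked meta-study: the names `dgp_null_normal` and `estimate_rejection_rate`, the small `B`, and the seed are assumptions chosen for the example.

```r
# Minimal sketch of the five-step recipe (illustrative names, small B).
set.seed(1)  # arbitrary seed for this sketch only, not the course seed

alpha <- 0.05
n <- 30

# (1) DGP with the truth known: both groups share one law, so the true effect is 0.
dgp_null_normal <- function(n) list(x = rnorm(n), y = rnorm(n))

# (2)-(4) Draw a sample, run the method, record the decision, repeat B times.
estimate_rejection_rate <- function(dgp, n, B, alpha) {
  rejected <- replicate(B, {
    s <- dgp(n)
    t.test(s$x, s$y)$p.value < alpha
  })
  # (5) Average the decisions; under a true null this estimates Type I error.
  mean(rejected)
}

type1_hat <- estimate_rejection_rate(dgp_null_normal, n = n, B = 2000, alpha = alpha)
type1_hat
```

Swapping in a DGP with a nonzero shift turns the same average into a power estimate; nothing else in the loop changes.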

For this week’s locked meta-study, the four methods under test are the two-sample t-test, a permutation test (difference in means, labels shuffled under the null), the Wilcoxon rank-sum, and a trimmed-mean procedure (a 10% trimmed mean with its own interval). They are run across four DGPs, all synthetic and generated under set.seed(45203).

Type I error and power: a method must hold its level before its power counts

Type I error is the first thing to check, because power is only meaningful for a method that holds its level. To estimate it, generate data under the null — two groups drawn from the same distribution, so the true effect is zero — and count rejections. A valid 5% method rejects about 5% of the time. To estimate power, generate data with a real shift of a fixed size and count rejections; among level-holding methods, the one rejecting most often is the most powerful for that DGP.

The locked results sort cleanly by DGP. Under a normal DGP all four methods hold the level — Type I error \(\approx 0.05\) for each — and their power is comparable, with the t-test slightly best (normality is exactly the world the t-test was built for, so nothing else can beat it by much, and ranks and trimming pay a small efficiency price). Under a heavy-tailed \(t_3\) DGP the story inverts: the permutation test still holds its level (it makes no normality assumption — it shuffles labels), but on power the Wilcoxon reaches \(\approx 0.70\) while the t-test reaches only \(\approx 0.55\). Heavy tails inflate the sample variance, which widens the t-test’s denominator and dulls its sensitivity; ranks are immune to how far out the tails reach, so the rank test keeps its edge. What is assumed by the permutation/Wilcoxon route is exchangeability under the null, not normality; what is ranked/shuffled is the data’s order/labels, not their raw magnitudes; what it protects against is tail-driven variance inflation; what it cannot prove is that the shift is the only difference between groups.
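A hedged sketch of the level-then-power check under heavy tails follows. The shift size (`0.8`), the per-group `n`, the small `B`, and the seed are assumptions for illustration, so the output should not be expected to reproduce the locked \(0.55\) and \(0.70\) values.

```r
# Sketch: estimate Type I error and power for the t-test and Wilcoxon under t_3.
set.seed(2)    # arbitrary seed for this sketch only

alpha <- 0.05
n     <- 30
B     <- 2000
shift <- 0.8   # hypothetical true location shift used for the power run

# One replication: draw both groups from t_3, shift the second, record both decisions.
one_rep <- function(shift) {
  x <- rt(n, df = 3)
  y <- rt(n, df = 3) + shift
  c(t    = t.test(x, y)$p.value      < alpha,
    wilc = wilcox.test(x, y)$p.value < alpha)
}

type1 <- rowMeans(replicate(B, one_rep(0)))      # null true: rejection rate = Type I error
power <- rowMeans(replicate(B, one_rep(shift)))  # null false: rejection rate = power

rbind(type1, power)
```

Reading the `type1` row first mirrors the rule in the text: a method's power only counts once its rejection rate under the null sits near `alpha`.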

Coverage: an interval can lie about its own confidence

A confidence interval’s promise is coverage — a 95% interval should contain the true parameter about 95% of the time across repeated samples. A simulation checks this directly: build the interval on each of the \(B\) samples and count how many contain the known truth. When that fraction falls below the nominal level, the interval under-covers — it is narrower, and so more confident, than it has any right to be, and a reader who trusts it is being misled.

Two locked DGPs expose this. Under a right-skewed (lognormal) DGP, the t-based CI under-covers at \(\approx 0.91\) instead of \(0.95\): skew makes the sampling distribution of the mean asymmetric, so a symmetric t-interval misses the true mean too often on the long-tail side. The rank-sum route here holds its level and even gains power, because it targets a stochastic shift rather than a fragile mean. Under a contaminated DGP (5% of points replaced by outliers), the ordinary mean-based CI covers only \(\approx 0.86\), badly under-covering, while the trimmed-mean CI recovers to \(\approx 0.94\), essentially restoring the promise. What is assumed by the trimmed interval is that the central, untrimmed part of the distribution carries the signal; what is downweighted is the trimmed tail fraction (here 10% from each end); what it protects against is a few contaminating points dragging the center and inflating the spread; what it cannot prove is that the trimmed-away points were truly contamination rather than real, rare, informative responders. Trimming is a deliberate trade — resistance bought with a little efficiency and a judgment call about what to discard.
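The coverage check can be sketched the same way. The example below uses a one-sample t-interval for a lognormal mean as a simplified stand-in; the `sdlog` value, `n`, `B`, and seed are assumptions, and the locked study's exact interval setup may differ.

```r
# Sketch: estimate coverage of a nominal 95% t-interval under a right-skewed DGP.
set.seed(3)   # arbitrary seed for this sketch only

n       <- 30
B       <- 2000
meanlog <- 0
sdlog   <- 1  # hypothetical amount of skew

# The truth is known by construction: the mean of a lognormal is exp(meanlog + sdlog^2 / 2).
true_mean <- exp(meanlog + sdlog^2 / 2)

covered <- replicate(B, {
  x  <- rlnorm(n, meanlog = meanlog, sdlog = sdlog)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]   # did this interval contain the truth?
})

coverage_hat <- mean(covered)
mcse         <- sqrt(coverage_hat * (1 - coverage_hat) / B)
c(coverage = coverage_hat, mcse = mcse)
```

Comparing `coverage_hat` with \(0.95\) in units of `mcse` shows whether any shortfall is under-coverage or just simulation wobble.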

Monte Carlo error: a simulation is itself a sample

Every number in a simulation table is an estimate from \(B\) draws, so it carries its own uncertainty — Monte Carlo error. If you simulate the same DGP again with a different seed, the estimated Type I error will not land on exactly the same value; it will wobble. For a simulated proportion \(\hat p\) (a rejection rate or a coverage rate) from \(B\) replications, the Monte Carlo standard error is

\[ \operatorname{MCSE}(\hat p) = \sqrt{\frac{\hat p\,(1 - \hat p)}{B}} . \]

With \(B = 10{,}000\) and \(\hat p \approx 0.05\), \(\operatorname{MCSE} = \sqrt{0.05 \cdot 0.95 / 10{,}000} \approx 0.0022\) — so a simulated Type I error of \(0.05\) is really “\(0.05\) give or take about \(0.002\),” and an estimated \(0.052\) versus \(0.050\) is not a real difference. With \(\hat p \approx 0.70\) (the Wilcoxon’s \(t_3\) power), \(\operatorname{MCSE} = \sqrt{0.70 \cdot 0.30 / 10{,}000} \approx 0.0046\). The gap between Wilcoxon power \(0.70\) and t-test power \(0.55\) is about \(0.15\) — roughly \(30\) Monte Carlo standard errors wide — so that difference is real and not simulation noise. Even the standard error of the difference between the two rates, about \(0.007\) when they are treated as independent, leaves the gap more than \(20\) standard errors wide. What is assumed is that the replications are independent draws from the DGP; what is computed is the binomial spread of a proportion; what it protects against is over-reading a wobble as a finding; what it cannot prove is anything about a DGP you did not simulate. Always report \(B\) and an MCSE, or at least state that you checked it — a comparison table without an error bar invites exactly the mistake this week warns against.
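The arithmetic above can be wrapped in a small helper. The function name `mcse` is an assumption for this sketch, and the difference calculation treats the two rates as independent; when both methods are scored on the same simulated samples their decisions are paired, so that figure is only a rough guide.

```r
# Sketch: Monte Carlo standard error of a simulated proportion.
mcse <- function(p_hat, B) sqrt(p_hat * (1 - p_hat) / B)

B <- 10000
mcse_level <- mcse(0.05, B)   # about 0.0022
mcse_wilc  <- mcse(0.70, B)   # about 0.0046
mcse_t     <- mcse(0.55, B)   # about 0.0050

gap <- 0.70 - 0.55
gap / mcse_wilc               # about 33 single-proportion MCSEs

# Treating the two rates as independent (an assumption), the SE of their difference:
se_diff <- sqrt(mcse_wilc^2 + mcse_t^2)   # about 0.0068
gap / se_diff                             # about 22
```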

Worked examples

Worked example — the four-DGP meta-simulation (recurring slice)

What is assumed. You compare four methods — t-test, permutation, Wilcoxon rank-sum, 10% trimmed mean — across four data-generating processes — normal, right-skewed lognormal, heavy-tailed \(t_3\), and 5%-contaminated. For Type I error you draw both groups from the same distribution (true effect zero); for power you add a fixed real shift; for coverage you check whether each interval contains the known true center. Every sample is synthetic, \(B\) replications, set.seed(45203). The DGP is assumed to be a useful stand-in for a data shape you might actually meet — not assumed to be any particular real dataset.

Computation. The static R below shows the simulation skeleton and tabulates the locked results. It is shown as teaching code and is not executed here.

set.seed(45203)

B <- 10000          # replications per DGP (report this; it sets Monte Carlo error)
n <- 30             # per-group sample size
alpha <- 0.05       # nominal test level; 1 - alpha = 0.95 is the nominal CI coverage

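# One DGP is a function returning list(x = ..., y = ...); the "truth" (shift) is set by us.
# (Skeleton only: the four DGPs swap in rnorm / rlnorm / rt(df = 3) / a 5%-outlier mix, and
#  perm_test() and trimmed_ci_excludes_0() are helpers that are not defined in this skeleton.)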
run_methods <- function(x, y) {
  c(
    t    = t.test(x, y)$p.value          < alpha,   # parametric two-sample t
    perm = perm_test(x, y)               < alpha,   # difference in means, labels shuffled
    wilc = wilcox.test(x, y)$p.value     < alpha,   # rank-sum / Mann-Whitney
    trim = trimmed_ci_excludes_0(x, y)              # 10% trimmed-mean interval test
  )
}

# Replicate across one DGP, then average -> Type I (no shift) or power (shift set).
reject_rate <- function(dgp) rowMeans(replicate(B, do.call(run_methods, dgp())))

# ----- LOCKED illustrative results (synthetic; verified: false) -----
# "--" = not reported for this row
# DGP            method    Type I   power           CI coverage   note
# Normal         t         ~0.05    slightly best   ~0.95         t-test slightly best; all comparable
# Normal         perm      ~0.05    comparable      --            holds level
# Normal         wilc      ~0.05    comparable      --            small efficiency cost
# Right-skewed   t         ~0.05    --              ~0.91         t CI UNDER-covers under skew
# Right-skewed   wilc      ~0.05    gains power     --            rank-sum holds level + more power
# Heavy-tailed   t         ~0.05    ~0.55           --            tails inflate variance -> dull
# Heavy-tailed   wilc      ~0.05    ~0.70           --            ranks immune to tail reach
# Heavy-tailed   perm      ~0.05    --              --            permutation holds level
# Contaminated   mean-CI   --       --              ~0.86         5% outliers drag the center
# Contaminated   trim-CI   --       --              ~0.94         10% trimming restores coverage
#
# Monte Carlo SE at B = 10000:  sqrt(0.05*0.95/B) ~= 0.0022   (a 0.05 vs 0.052 gap is noise)
#                               sqrt(0.70*0.30/B) ~= 0.0046   (0.70 vs 0.55 power gap is REAL)
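The skeleton calls `perm_test()` and `trimmed_ci_excludes_0()` without defining them. One possible way to fill them in is sketched below. These are hypothetical stand-ins, not the course's own helpers: the permutation test uses 999 random relabelings, and the trimmed-mean decision uses a Yuen-style interval for the difference of 10% trimmed means (winsorized variances, Welch-type degrees of freedom). The name `dgp_t3` is likewise only an illustration of the list-of-two-groups shape the skeleton expects.

```r
# Hypothetical stand-ins for the helpers the skeleton calls (assumptions, not course code).
alpha <- 0.05

# Monte Carlo permutation test for a difference in means; returns a two-sided p-value.
perm_test <- function(x, y, n_perm = 999) {
  pooled   <- c(x, y)
  nx       <- length(x)
  observed <- mean(x) - mean(y)
  shuffled <- replicate(n_perm, {
    idx <- sample(length(pooled), nx)            # relabel nx points as "group x"
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  (1 + sum(abs(shuffled) >= abs(observed))) / (n_perm + 1)
}

# Per-group pieces: trimmed mean, squared standard error, and effective sample size.
trim_parts <- function(x, trim) {
  n    <- length(x)
  g    <- floor(trim * n)                        # points trimmed from each end
  h    <- n - 2 * g                              # points left after trimming
  xs   <- sort(x)
  wins <- pmin(pmax(xs, xs[g + 1]), xs[n - g])   # winsorized sample
  list(tm = mean(x, trim = trim),
       d  = (n - 1) * var(wins) / (h * (h - 1)),
       h  = h)
}

# TRUE when the interval for the difference in trimmed means excludes 0.
trimmed_ci_excludes_0 <- function(x, y, trim = 0.10, level = 1 - alpha) {
  a    <- trim_parts(x, trim)
  b    <- trim_parts(y, trim)
  se   <- sqrt(a$d + b$d)
  df   <- (a$d + b$d)^2 / (a$d^2 / (a$h - 1) + b$d^2 / (b$h - 1))
  half <- qt(1 - (1 - level) / 2, df) * se
  diff <- a$tm - b$tm
  (diff - half > 0) || (diff + half < 0)
}

# Example DGP in the shape the skeleton expects: a list with elements x and y.
dgp_t3 <- function(n = 30, shift = 0) list(x = rt(n, df = 3), y = rt(n, df = 3) + shift)

set.seed(4)   # arbitrary seed for this usage example only
s <- dgp_t3(shift = 1)
perm_test(s$x, s$y)
trimmed_ci_excludes_0(s$x, s$y)
```

A permutation test inside `B = 10000` replications multiplies the work by `n_perm`, so a smaller `B` or `n_perm` is a sensible choice while debugging.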

Interpretation. Read the table by DGP, never by method alone. Under the normal world every method holds Type I error near \(0.05\) and powers are comparable, the t-test slightly best — the ideal world rewards the method built for it, and ranks/trimming pay only a small efficiency price. Under skew, the t-interval’s coverage drops to \(\approx 0.91\) (it quietly over-promises), while the rank-sum holds its level and gains power. Under heavy tails, the permutation test holds its level and the Wilcoxon’s power \(\approx 0.70\) beats the t-test’s \(\approx 0.55\), a gap of \(0.15\) that dwarfs the Monte Carlo standard error (\(\approx 0.005\)), so it is a real difference, not simulation wobble. Under contamination, the ordinary-mean interval covers only \(\approx 0.86\) while the trimmed-mean interval recovers to \(\approx 0.94\). The single honest summary: no method wins everywhere — match the method to the data-generating reality, and report the Monte Carlo error so a reader can tell a finding from a wobble. What this simulation cannot prove is that your next real dataset is any of these four worlds; it narrows and justifies the choice, it does not make it for you.

Worked example — sizing a planned study with a power simulation (transfer, new context)

What is assumed. A campus counseling center is planning a trial of a brief sleep-hygiene module and wants to know whether \(n = 40\) participants per arm gives enough power to detect a wellbeing improvement of about half a standard deviation. Past intake data look mildly right-skewed, so the planners refuse to assume normality. They will compare the t-test and the Wilcoxon rank-sum before collecting any data, by simulating from a plausible skewed DGP with the real effect set to \(0.5\) SD. These numbers are illustrative and distinct from the four-DGP meta-study above.

Computation. The skeleton mirrors the meta-study: fix the DGP (a lognormal-ish skew with a true \(0.5\)-SD shift), draw two groups of \(40\), run both tests, repeat \(B = 10{,}000\) times, and record each method’s rejection rate as its estimated power.

set.seed(45203)

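B <- 10000;  n <- 40;  alpha <- 0.05;  true_shift <- 0.5   # in SD units
sdlog   <- 0.5                                             # mild right skew
sd_ctrl <- sqrt((exp(sdlog^2) - 1) * exp(sdlog^2))         # SD of lognormal(meanlog = 0, sdlog)

# The shift is added on the outcome scale. Moving meanlog by 0.5 instead would rescale
# the whole treated arm (a gap of about 1.2 control SDs here), not shift it by 0.5 SD.
planned_power <- replicate(B, {
  x <- rlnorm(n, meanlog = 0, sdlog = sdlog)                          # control arm
  y <- rlnorm(n, meanlog = 0, sdlog = sdlog) + true_shift * sd_ctrl   # treated arm, 0.5-SD shift
  c(t    = t.test(x, y)$p.value      < alpha,
    wilc = wilcox.test(x, y)$p.value < alpha)
})
power_est <- rowMeans(planned_power)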
# Read the estimated power for each method, then attach its Monte Carlo SE:
#   MCSE = sqrt(p_hat * (1 - p_hat) / B)

Interpretation. The planners do not get a single “the study is powered” yes/no; they get two power estimates, one per method, each with a Monte Carlo standard error attached. Because the assumed world is skewed — exactly the regime where the meta-study showed the rank-sum holding its level and gaining power — the simulation will typically credit the Wilcoxon with the higher detection rate at this \(n\), so the defensible plan is to pre-register the rank-based analysis rather than the t-test, and to report \(\operatorname{MCSE} = \sqrt{\hat p(1-\hat p)/B}\) alongside each number. Note what carried over and what changed: the machinery is identical — fix a DGP with a known truth, replicate, average, attach an MCSE — only the purpose (planning a sample size, not auditing methods) and the context differ. And the same caution applies: this powers the study for the assumed skewed DGP; if the real wellbeing scores turn out heavy-tailed or contaminated instead, the plan must be revisited, because a simulation speaks only about the world you simulated.
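Because the planning question is whether \(n = 40\) per arm is enough, the same machinery can be run at several candidate sizes. The sketch below is an illustration under stated assumptions: the candidate sizes, the small `B`, the seed, and the function name `power_at_n` are all hypothetical, and the treated arm is shifted additively by \(0.5\) control-arm SDs, which is one reading of “half a standard deviation.”

```r
# Sketch: estimated power (with MCSE) at several per-arm sample sizes.
set.seed(5)   # arbitrary seed for this sketch only

alpha    <- 0.05
B        <- 1000
sdlog    <- 0.5
shift_sd <- 0.5
sd_ctrl  <- sqrt((exp(sdlog^2) - 1) * exp(sdlog^2))   # SD of lognormal(0, sdlog)

power_at_n <- function(n) {
  rejected <- replicate(B, {
    x <- rlnorm(n, meanlog = 0, sdlog = sdlog)
    y <- rlnorm(n, meanlog = 0, sdlog = sdlog) + shift_sd * sd_ctrl
    c(t    = t.test(x, y)$p.value      < alpha,
      wilc = wilcox.test(x, y)$p.value < alpha)
  })
  p <- rowMeans(rejected)
  rbind(power = p, mcse = sqrt(p * (1 - p) / B))   # one column per method
}

sizes   <- c(20, 40, 60)            # hypothetical candidate sizes
results <- lapply(sizes, power_at_n)
names(results) <- paste0("n=", sizes)
results
```

Each entry is a two-row matrix, so a planner can read a power estimate and its Monte Carlo error side by side before committing to a sample size.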

A common mistake

The week’s central trap — and the reviewer check the build calls Risk 14 and Risk 15 — is reading a single simulation, or a single DGP, as the truth. It shows up in two braided forms.

The first form is ignoring Monte Carlo error: treating \(0.052\) as worse than \(0.050\), or announcing that “method A beats method B” because A’s simulated power was \(0.71\) and B’s was \(0.69\). A simulation is a finite sample, so every entry wobbles. At \(B = 10{,}000\) a rejection rate near \(0.05\) has a Monte Carlo standard error of about \(0.002\) and one near \(0.70\) has about \(0.005\); differences smaller than a couple of MCSEs are noise, not findings. The fix is mechanical and non-negotiable: report \(B\), compute \(\operatorname{MCSE}(\hat p) = \sqrt{\hat p(1-\hat p)/B}\), and only call a gap real when it is several MCSEs wide. (The Wilcoxon’s \(0.70\) versus the t-test’s \(0.55\) under \(t_3\) is real — about \(30\) MCSEs apart; a \(0.052\)-versus-\(0.050\) Type I gap is not.)

The second form is overgeneralizing from one DGP: running a normal-world simulation, seeing the t-test win, and concluding “the t-test is best.” It is best in the normal world — and only slightly. Change the DGP to skewed and its interval under-covers (\(0.91\)); change it to heavy-tailed and its power collapses behind the Wilcoxon’s (\(0.55\) vs \(0.70\)); change it to contaminated and the mean-CI’s coverage sags to \(0.86\) while a trimmed interval holds \(0.94\). No method dominates across all DGPs. A simulation study is only as broad as the worlds it sampled, so the honest report states the DGP coverage explicitly — “across normal, skewed, heavy-tailed, and contaminated processes” — and refuses to crown a universal winner. The deepest error is forgetting that even a careful four-DGP study cannot prove your real data are any of those four; it informs the choice, it does not certify it.

Low-stakes self-checks (ungraded)

These are for your own practice — ungraded, no submission.

In one sentence each, define Type I error, power, and coverage as quantities a simulation estimates, and say which DGP setting (null or with-effect) you use to estimate each.
At \(B = 10{,}000\), compute \(\operatorname{MCSE}\) for a simulated coverage of \(\hat p = 0.91\). Is the gap from the nominal \(0.95\) several MCSEs wide — i.e. real under-coverage, or plausibly noise?
A classmate runs one normal-DGP simulation, sees the t-test slightly ahead on power, and concludes “the t-test is the best method.” Name the two errors from “A common mistake” they have committed.
Explain, in your own words, why the t-test’s power falls behind the Wilcoxon’s under a heavy-tailed \(t_3\) DGP, in terms of what heavy tails do to the variance the t-test divides by.
Under the contaminated DGP, the mean-CI covers \(\approx 0.86\) and the trimmed-mean CI \(\approx 0.94\). Name the assumption-ladder move the trimmed interval makes — what it downweights, what it protects against, and what it cannot prove about the trimmed-away points.

Reading and source pointer

This week is grounded in the instructor notes (the primary course materials) for the logic of simulation studies: DGPs, Type I error, power, coverage, and Monte Carlo error. The ModernDive (Ismay, Kim & Valdivia) treatment of simulation studies and repeated analysis grounds the replicate-and-tabulate workflow you will build in the companion lab, and the IMS (Çetinkaya-Rundel & Hardin) chapter on simulation-based inference grounds the resampling reference distributions the permutation and rank methods rely on. These notes are the course’s own synthesis, grounded in but not copied from the sources. No prose, examples, exercises, figures, or solutions are reproduced from any source.

Evidence and verification status

verified: false. The simulation logic on this page is course-authored, but every numeric value here is drafted, synthetic, and not independently checked. The load-bearing synthetic numbers are: the normal DGP Type I errors \(\approx 0.05\) with comparable power (t-test slightly best); the right-skewed t-CI coverage \(\approx 0.91\) with the rank-sum holding level and gaining power; the heavy-tailed \(t_3\) power of \(\approx 0.55\) (t-test) versus \(\approx 0.70\) (Wilcoxon) with the permutation test holding level; the contaminated mean-CI coverage \(\approx 0.86\) versus trimmed-mean CI coverage \(\approx 0.94\); and the Monte Carlo standard errors \(\approx 0.0022\) (at \(\hat p \approx 0.05\)) and \(\approx 0.0046\) (at \(\hat p \approx 0.70\)) computed at \(B = 10{,}000\). All simulated data are synthetic with set.seed(45203). These worked numbers are provisional and not independently verified — treat them as targets to reproduce, not as confirmed reference values.

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded method checkpoints, weekly quizzes, homework and method reports, resampling and robustness labs, the midterm, the applied robust-methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.

Looking ahead

Next week we stop running methods against each other and start writing up a single applied comparison. The Week 14 robust-methods report workshop takes everything from this simulation discipline — state the data shape, run at least two reasonable methods, compare, check sensitivity, and bound the conclusion honestly — and turns it into a clear, reproducible report. The simulation table you read this week becomes the justification you cite when a report explains why it chose ranks over a t-test for a skewed outcome: not habit, but measured behavior across the worlds the data might be.