Week 4 — Randomization tests

When the reshuffle mimics the assignment mechanism, what does the test license?

The week question

Last week you shuffled group labels to build a reference distribution and read a permutation \(p\)-value, and you were careful to call the result a statement about a difference between two observed groups — nothing more. This week the shuffle machinery is the same down to the last line of code, but the study behind it is different in one decisive way: the Express workflow was randomly assigned to arrivals. That single design fact changes what the same number is allowed to say. So the week’s question is narrow and load-bearing: when the reshuffle mimics the assignment mechanism that actually produced the data, what does the test license — and what is it that does the licensing, the design or the \(p\)-value?

The honest one-line answer, which the rest of the page unpacks, is that a randomization test built on top of real random assignment licenses a causal read of the same shift you saw last week, and the license comes from the assignment, not from the size of \(p\). The reshuffle is just the engine that tells you how big a difference the assignment alone could have manufactured under a no-effect null.

Why this matters

This is the place in the course where two ideas that look identical in code part ways in meaning. A permutation test of an observational comparison (week 3) and a randomization test of a designed experiment (this week) both pool the data, shuffle labels, recompute a statistic, and read a tail probability. Run them on the same numbers and you get the same \(p \approx 0.02\). But the conclusion each one supports is different, and the difference lives entirely in how the data came to be, not in the arithmetic.

It matters because the most expensive mistake in applied work is over-claiming: reporting a permutation \(p\) from an observational comparison and then quietly writing “the Express workflow caused shorter waits.” The causal verb is not earned by a small \(p\); it is earned by the random assignment that makes the two arms exchangeable in expectation. When assignment is genuinely random, the reshuffle you perform on the computer is a faithful re-enactment of the chance mechanism the world already used — so the reference distribution is not an analogy, it is the design’s own null behavior. When assignment is not random, the very same reshuffle is only a model of “what if the labels were swappable,” and the strongest honest read is associational.

This is also the course’s assumption-light discipline in its sharpest form. A randomization test makes almost no distributional assumption — no normality, no equal variances, no large-sample formula. But “assumption-light” is never “assumption-free.” The randomization test leans hard on one structural assumption that the design, not the data, must supply: that the labels really were assigned by a known random mechanism. Name that trade every time, and you will never confuse a clean \(p\) with a clean causal claim.

Learning goals

By the end of this week you should be able to:

  • Distinguish a permutation test of an observational comparison from a randomization test of a randomized experiment, and state the one design fact — random assignment — that separates them.
  • Describe the randomization reference distribution — shuffle the assigned labels under the no-effect null, recompute the statistic, repeat — and read the randomization \(p\)-value as a tail probability of that distribution.
  • Explain why, when assignment was random, the reshuffle mimics the assignment mechanism and the test therefore licenses a causal reading of the shift, while the identical reshuffle on observational data licenses only an associational one.
  • Compute the observed effect on Dataset W as a difference in medians, situate \(-6\) minutes against a randomization distribution centered at \(0\), and read the randomization \(p \approx 0.02\).
  • Name the assumption-ladder move precisely: what is assumed (the assignment mechanism), what is resampled (the labels), what it protects against (confounding, distributional misspecification), and what it still cannot prove (external generalization; a guarantee that the design was sound).
  • Diagnose the week’s classic error — permuting the wrong thing / over-claiming — and state where the causal license actually comes from.

Core vocabulary

  • Randomization test — a test whose reference distribution is generated by re-applying, on the computer, the same random assignment mechanism the design used, under a null of no treatment effect. Mechanically a label shuffle; conceptually a re-enactment of the design.
  • Permutation test (contrast) — the identical shuffle, but justified by an assumption of exchangeability of two observed groups rather than by a known assignment mechanism. Same engine, weaker license.
  • Random assignment — a known chance mechanism (a coin, a draw, sample()) that decides which unit receives the treatment; the design fact that makes the two arms exchangeable in expectation and thereby blocks confounding.
  • Assignment mechanism — the actual rule by which units came to be treated or not. A randomization test is valid exactly when the on-computer reshuffle matches this real mechanism.
  • Randomization reference distribution — the spread of the test statistic across many reshuffles of the labels, holding the outcomes fixed; the set of effects the assignment alone could produce when the treatment does nothing.
  • Randomization \(p\)-value — the fraction of reshuffled statistics at least as extreme as the observed one; a tail probability, not a measure of design quality or of effect size.
  • Test statistic — here the difference in medians \(\tilde y_T - \tilde y_C\), chosen because Dataset W is right-skewed and the median resists the long tail.
  • Causal vs. associational read — what the design lets you say. Random assignment licenses “caused”; an observational comparison licenses only “is associated with.”

Concept development

Same machine, different warrant

Start with the machine, because it is genuinely the same as last week. Pool the \(50\) wait times, hold the \(50\) outcomes fixed, randomly relabel \(25\) of them “Express” and \(25\) “Standard,” recompute the difference in medians, and repeat about \(10{,}000\) times. That loop is the engine for both a permutation test and a randomization test. If you only watched the code run, you could not tell which one you were doing.
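One turn of that loop can be sketched as follows. This is a minimal illustration under stated assumptions: the eight wait times are a hypothetical toy vector (not Dataset W), and the 4/4 split stands in for the 25/25 split described above.

```r
set.seed(1)

# Hypothetical toy outcomes (NOT Dataset W): 8 waits in minutes, held fixed.
waits <- c(9, 11, 12, 14, 15, 18, 21, 40)

# One reshuffle: deal 4 "E" and 4 "S" labels at random over the fixed outcomes.
lab <- sample(rep(c("E", "S"), each = 4))

# Recompute the statistic for this one relabeling.
t_one <- median(waits[lab == "E"]) - median(waits[lab == "S"])

table(lab)   # still 4 of each label; only their positions moved
t_one        # one draw from the reference distribution
```

Repeating the last two steps many times and collecting the values of `t_one` gives the reference distribution. Nothing in the code records whether the labels were assigned or merely observed, which is why the warrant has to come from outside the code.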

The difference is the warrant — the justification for why that loop is the right null model. In week 3 the warrant was an assumption: “if the Express and Standard waits are exchangeable (the labels carry no information), then any relabeling is as likely as the one we saw.” You assumed exchangeability; you did not know it. This week the warrant is a design fact: the Express workflow was randomly assigned to arrivals, so under a no-effect null each arrival’s wait would have been identical whichever label it drew, and the labels genuinely are interchangeable tags. You do not have to assume exchangeability — the randomization built it in. The reshuffle on your computer re-enacts the very coin the front desk flipped.

So the assumption-ladder move sharpens. What is assumed: under the null, an arrival’s wait does not depend on its label, and — supplied by the design, not the data — the labels were assigned by a known random mechanism. What is resampled: the assignment labels, with outcomes held fixed. What it protects against: confounding (random assignment balances measured and unmeasured pre-treatment characteristics in expectation) and distributional misspecification (no normal model is invoked). What it cannot prove: that the effect generalizes beyond units like these, or that the study was otherwise well run.

The reference distribution is the design’s own null behavior

Here is the conceptual payoff of random assignment. When labels were assigned by a known chance mechanism, the on-computer reshuffle is not a model of the null — it is the null distribution the design would generate if the treatment did nothing. Each reshuffle is one of the assignments the front desk could have made; the collection of their difference-in-medians is exactly the spread of effects the design manufactures from noise alone.

Write the statistic as \(T = \tilde y_T - \tilde y_C\), the difference in group medians. Under the no-effect null, relabeling cannot change any arrival’s wait, so the reshuffled statistics

\[ T^{(1)}, \; T^{(2)}, \; \dots, \; T^{(B)} \]

trace out the randomization reference distribution. Because Express and Standard each get \(25\) labels and the outcomes are fixed, every relabeling has a mirror image with the two labels swapped, and swapping flips the sign of \(T\). The distribution is therefore symmetric about \(0\) and centered there: a relabeling is as likely to favor “Express” as “Standard” when the treatment is inert. The two-sided randomization \(p\)-value is the fraction landing at least as far from \(0\) as your observed \(T\):

\[ p_{\text{rand}} = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\bigl\{\,|T^{(b)}| \ge |T_{\text{obs}}|\,\bigr\}. \]

This is a tail probability of a distribution the assignment produced — which is why, with a real random assignment, the same shuffle that gave an associational read last week gives a causal one this week. The number did not change; the warrant did.
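On a small enough study, the reference distribution can be listed in full rather than sampled. The sketch below does that for a hypothetical eight-unit toy vector (not Dataset W), assuming a completely randomized 4/4 design and treating the first four units as the observed treated arm. It applies the same tail-fraction formula as above, with the sum taken over every possible assignment instead of \(B\) random ones.

```r
# Hypothetical toy outcomes (NOT Dataset W): 8 waits, 4 per arm.
waits <- c(9, 11, 12, 14, 15, 18, 21, 40)
n <- length(waits)

# Every way to choose which 4 of the 8 units carry the treatment label.
treated_sets <- combn(n, 4)            # 4 x 70 matrix, one assignment per column

# Difference in medians for each possible assignment, outcomes held fixed.
t_all <- apply(treated_sets, 2, function(idx) {
  median(waits[idx]) - median(waits[-idx])
})

# Assumption for illustration: units 1-4 were the ones actually treated.
t_obs <- median(waits[1:4]) - median(waits[5:8])   # 11.5 - 19.5 = -8

# Two-sided p: share of assignments at least as far from 0 as t_obs.
p_exact <- mean(abs(t_all) >= abs(t_obs))

length(t_all)    # 70 possible assignments
mean(t_all)      # 0: each assignment's mirror image flips the sign
p_exact          # 4/70, about 0.057

choose(50, 25)   # about 1.26e14 assignments for a 25/25 split of 50 units
```

Full enumeration is out of reach for a 25/25 split of \(50\) units, which is why the reference distribution there is approximated with a large random sample of reshuffles.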

The locked numeric instance: Dataset W, now design-based

Dataset W is the Riverside service-wait world. Express intake (treatment) versus Standard intake (control), wait times in minutes, right-skewed with a couple of very long Standard waits. Data are synthetic; seed set. The observed slice is the same one you have carried since week 1:

  • Standard arm: \(n_C = 25\), median \(= 18\) minutes (mean \(\approx 22\), dragged up by two long waits near \(64\) and \(88\)).
  • Express arm: \(n_T = 25\), median \(= 12\) minutes (mean \(\approx 15\)).
  • Observed effect: difference in medians \(T_{\text{obs}} = 12 - 18 = -6\) minutes — Express is faster. (The difference in means is \(-7\), but it is unstable because the two long Standard waits move it, so the course reads the resistant median difference.)
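The reason for reading the median can be seen on a small example. The values below are a hypothetical nine-value toy vector (not Dataset W); the only thing borrowed from the slice is the idea of two long waits near \(64\) and \(88\).

```r
# Hypothetical toy waits (NOT Dataset W), in minutes.
typical   <- c(10, 12, 15, 17, 18, 19, 22, 25, 27)

# Same waits, but the two largest are replaced by two very long waits.
with_tail <- replace(typical, 8:9, c(64, 88))

c(mean = mean(typical),   median = median(typical))     # mean ~18.3, median 18
c(mean = mean(with_tail), median = median(with_tail))   # mean ~29.4, median 18
```

Two extreme values move the mean by about eleven minutes and leave the median where it was, which is the sense in which the median difference is the resistant statistic.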

Now the design fact that defines this week. Suppose arrivals were randomly assigned to the Express or the Standard workflow — a coin, not a self-selected queue. Then reshuffling the \(50\) labels mimics the assignment mechanism exactly. Build the randomization distribution of \(T\) by relabeling \(25/25\) about \(10{,}000\) times; it is centered at \(0\), and the observed \(-6\) minutes sits out in the tail. The two-sided randomization \(p \approx 0.02\).

Interpret it. About \(2\%\) of random reassignments of these same \(50\) waits would manufacture a median gap at least as large as \(6\) minutes, in either direction, if the Express workflow did nothing, so a gap this size is unlikely to be a fluke of the draw. And because the labels were assigned at random, the \(-6\) minutes reads as the causal effect of the Express workflow for arrivals like these: random assignment, not the small \(p\), is what upgrades “associated with shorter waits” to “shortened waits.” The assumption ladder runs as follows. Assumed: the known random assignment and the no-effect null. Resampled: the \(50\) labels. Protected against: confounding (a busier-or-calmer time of day cannot have sorted patient types into arms, because the coin did the sorting) and non-normality (the skew is irrelevant to the validity of a label shuffle). Cannot prove: that other clinics, staff, or seasons would show the same gain.

The contrast with week 3, stated cleanly

It is worth pinning the contrast down in one place, because the entire week hinges on it.

  • Week 3 (permutation of an observational comparison): the Express and Standard waits were simply observed; no one was assigned. The shuffle is warranted by an assumed exchangeability. The permutation \(p \approx 0.02\) supports: Express waits are shifted shorter than Standard waits. The honest verb is associational — a confounder (maybe Express was offered first to calmer arrivals) could explain the shift.
  • Week 4 (randomization of a designed experiment): arrivals were randomly assigned. The shuffle is warranted by the known assignment mechanism. The randomization \(p \approx 0.02\) supports: the Express workflow caused shorter waits for arrivals like these. The honest verb is causal — random assignment rules out the confounder by balancing arms in expectation.

Same data shape, same statistic, same \(p \approx 0.02\), same code. Different design, different license. The randomization test does not get its causal warrant from a smaller \(p\); it gets it from the assignment, and the \(p\) only calibrates how surprising the observed gap would be under the no-effect null.

Worked examples

Worked example — Dataset W as a randomized service-wait experiment (recurring slice)

What is assumed. Arrivals at the Riverside service desk were randomly assigned, one at a time, to the Express intake workflow (\(n_T = 25\)) or the Standard workflow (\(n_C = 25\)) — a designed experiment, with the arrival as the assigned unit. The outcome is wait time in minutes, right-skewed. The null is no treatment effect: under it, each arrival’s wait is the same whichever workflow it drew. Data are synthetic; seed set.

The computation. The static R code below reframes last week’s shuffle as design-based: it pools the waits, holds them fixed, re-enacts the random assignment about \(10{,}000\) times, and reads the randomization \(p\) on the difference in medians. It is shown as teaching code and is not executed on this draft site.

set.seed(45203)

# Dataset W (synthetic; seed set): 25 Express vs 25 Standard service waits, in minutes.
# Right-skewed; two long Standard waits near 64 and 88 drive the mean but not the median.
# Summarized to the locked medians for this static slice.
express  <- c(...)   # n_T = 25, median = 12 min
standard <- c(...)   # n_C = 25, median = 18 min, two long waits ~64 and ~88

t_obs <- median(express) - median(standard)   # observed effect -> -6 minutes (Express faster)

# Randomization reference distribution.
# DESIGN FACT: the Express workflow was RANDOMLY ASSIGNED to arrivals, so this reshuffle
# mimics the assignment mechanism -- it is the design's own null behavior, not a mere model.
waits <- c(express, standard)                  # pool the 50 outcomes, held fixed
rand  <- replicate(10000, {
  lab <- sample(rep(c("E", "S"), each = 25))   # re-enact the random assignment, outcomes fixed
  median(waits[lab == "E"]) - median(waits[lab == "S"])
})

rand_p <- mean(abs(rand) >= abs(t_obs))        # fraction at least as extreme as -6  -> ~0.02

# t_obs = -6 min   reference distribution centered at 0   randomization p ~= 0.02
# The CAUSAL read rests on the random assignment, NOT on the p-value.

The interpretation. The randomization distribution is centered at \(0\), and the observed \(T_{\text{obs}} = -6\) minutes sits in its tail, giving a two-sided randomization \(p \approx 0.02\). In words: only about \(2\%\) of random reassignments of these \(50\) waits would produce a median gap at least as large as \(6\) minutes, in either direction, if the Express workflow truly did nothing, so the observed gap is unlikely to be an artifact of which arrivals happened to land where. Because the workflow was assigned at random, this \(-6\) minutes reads as the causal effect of Express for arrivals like these: the assignment, not the \(0.02\), is what licenses the causal verb. The assumption-ladder move, stated plainly, runs as follows. Assumed: a known random assignment plus the no-effect null. Resampled: the \(50\) labels, outcomes fixed. Protected against: confounding and any reliance on a normal model under skew. Cannot prove: that the same effect would appear at a different clinic, with different staff, in a different season (that is external validity, which random assignment does not buy).
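Because the reference distribution is built from about \(10{,}000\) random reshuffles rather than every possible assignment, the reported \(p\) carries a little simulation noise of its own. A rough sketch of its size, assuming the reshuffles are independent draws so that the tail fraction behaves like a binomial proportion, and plugging in the provisional \(p \approx 0.02\) from this page:

```r
p_hat <- 0.02     # approximate randomization p quoted on this page (provisional)
B     <- 10000    # number of reshuffles

# Monte Carlo standard error of a tail fraction estimated from B independent draws.
mc_se <- sqrt(p_hat * (1 - p_hat) / B)
mc_se             # 0.0014
```

Under those assumptions a rerun with a different seed would typically move the second decimal place of \(p\) very little, so "\(p \approx 0.02\)" is a stable reading at this number of reshuffles. This noise concerns only the simulation; it says nothing about the design.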

Worked example — a randomized A/B test of a page layout (transfer, new context)

What is assumed. A product team wants to know whether a redesigned page layout lowers the time on task for a checkout flow. It enrolls \(400\) sessions and, by a random draw, sends \(200\) to the new layout (treatment) and \(200\) to the current layout (control) — a completely randomized A/B test with the session as the assigned unit. Time on task is heavily right-skewed (a few sessions stall for a very long time), so the team uses the difference in medians as its statistic, exactly as Dataset W does. These numbers are illustrative and distinct from the Riverside world.

The computation. Suppose the new layout’s median time on task is \(48\) seconds and the current layout’s is \(55\) seconds, so the observed statistic is

\[ T_{\text{obs}} = 48 - 55 = -7 \text{ seconds.} \]

A randomization test shuffles the \(400\) layout labels under the null of no effect, holding each session’s time fixed, recomputes the median difference each time, and reads the fraction of reshuffles with \(|T^{(b)}| \ge 7\):

set.seed(45203)

# A/B test (synthetic; seed set): 200 new-layout vs 200 current-layout sessions, time-on-task (sec).
# Right-skewed; the median resists the long-stall tail. Summarized to the illustrative medians.
new     <- c(...)   # n = 200, median = 48 sec
current <- c(...)   # n = 200, median = 55 sec

t_obs <- median(new) - median(current)         # -> -7 seconds

times <- c(new, current)                        # pool, hold fixed
rand  <- replicate(10000, {
  lab <- sample(rep(c("N", "C"), each = 200))   # re-enact the random assignment
  median(times[lab == "N"]) - median(times[lab == "C"])
})
rand_p <- mean(abs(rand) >= abs(t_obs))         # randomization p for the A/B test
# t_obs = -7 sec   distribution centered at 0   (illustrative; numbers synthetic)

The interpretation. Because the layout was assigned at random to sessions, the \(-7\) seconds reads as the causal effect of the redesign on median time on task for sessions like these — random assignment, not a belief that “the new layout is obviously cleaner,” is what licenses the causal verb. Notice what transferred and what did not: the design move is identical to Dataset W — randomly assign, pick a resistant statistic, build the reference distribution by re-enacting the assignment, read the tail — and only the context, the unit, and the numbers changed. The same assumption ladder applies: assumed (the random assignment and the no-effect null); resampled (the \(400\) labels); protected against (confounding such as time-of-day traffic, and the skew that would wreck a mean-based normal test); and unable to prove (that the redesign helps on a different site, audience, or device — external validity the A/B test does not settle).
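What transferred between the two worked examples can be written once as a small function. The sketch below is illustrative only: `randomization_test()` is a hypothetical helper (not part of the course materials or any package), it assumes a two-arm completely randomized design with a difference-in-medians statistic, and the data are stand-in log-normal draws, not Dataset W or the A/B sessions.

```r
# Hypothetical helper: two-arm, completely randomized design,
# difference in medians (treated minus control) as the statistic.
randomization_test <- function(outcome, label, treated, B = 10000) {
  stopifnot(length(outcome) == length(label))
  diff_medians <- function(lab) {
    median(outcome[lab == treated]) - median(outcome[lab != treated])
  }
  t_obs <- diff_medians(label)
  # sample(label) permutes the labels: arm sizes and outcomes stay fixed.
  t_null <- replicate(B, diff_medians(sample(label)))
  list(t_obs = t_obs, p = mean(abs(t_null) >= abs(t_obs)), t_null = t_null)
}

set.seed(2024)
# Stand-in data only: right-skewed draws, NOT Dataset W or the A/B sessions.
y   <- c(rlnorm(30, meanlog = 2.4, sdlog = 0.5),   # "treated" arm
         rlnorm(30, meanlog = 2.8, sdlog = 0.5))   # "control" arm
arm <- rep(c("T", "C"), each = 30)

res <- randomization_test(y, arm, treated = "T", B = 2000)
res$t_obs   # observed difference in medians for the stand-in data
res$p       # two-sided randomization p for the stand-in data
```

The function takes outcomes and labels and nothing else, so it cannot know whether the labels were randomly assigned. Whoever calls it has to supply that fact from the design before reading the result causally.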

A common mistake

This week’s classic error is permuting the wrong thing and, hand in hand with it, over-claiming. It shows up in two linked forms.

The first form is attaching a causal verb to a small \(p\) from data that were never randomly assigned. Someone runs the week-3 shuffle on an observational Express-vs-Standard comparison, sees \(p \approx 0.02\), and writes “the Express workflow caused shorter waits.” The \(p\) is fine; the verb is not. The reshuffle there is warranted only by an assumed exchangeability, so the strongest honest read is associational — a confounder (perhaps Express was quietly offered first to calmer, less complicated arrivals) could produce the same \(-6\) minutes with no causal effect at all. The fix is to keep the warrant attached to the data: a randomization test earns “caused” only when the labels really were assigned by a known random mechanism, and you should be able to point to that mechanism, not to the \(p\)-value, when you justify the causal claim.

The second form is shuffling labels that the design does not let you swap. If arrivals were assigned in paired time-slots, or in blocks by clinic, or in clusters (whole sessions assigned together), then a free \(25/25\) relabeling no longer re-enacts the real assignment mechanism — it manufactures reassignments the design could never have produced, and the reference distribution becomes the wrong null. The reshuffle must mimic the actual assignment: shuffle within pairs for a paired design, within blocks for a blocked design, at the cluster level for a clustered design. Permute the wrong unit and the test is simply measuring the wrong thing, however small the resulting \(p\) looks.
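A restricted reshuffle can be sketched in a few lines. The helpers `shuffle_within()` and `shuffle_clusters()` are hypothetical names introduced here for illustration, and the tiny layouts are made up (not Dataset W); the point is only that the labels move the way the design moved them.

```r
set.seed(7)

# Paired or blocked design: labels may only move within their own pair/block.
shuffle_within <- function(lab, block) {
  pieces <- lapply(split(lab, block), function(x) x[sample.int(length(x))])
  unsplit(pieces, block)              # back to the original unit order
}

pair <- rep(1:3, each = 2)            # hypothetical: 3 pairs of 2 units
lab  <- rep(c("E", "S"), times = 3)   # one of each label per pair
lab_paired <- shuffle_within(lab, pair)

# Clustered design: every unit in a cluster shares one label, so labels are
# permuted across clusters and then expanded back to the unit level.
shuffle_clusters <- function(cluster_lab, cluster) {
  sample(cluster_lab)[cluster]        # assumes cluster ids are 1..K
}

cluster     <- rep(1:4, each = 3)     # hypothetical: 4 clusters of 3 units
cluster_lab <- c("E", "S", "E", "S")  # one label per cluster
lab_cluster <- shuffle_clusters(cluster_lab, cluster)

tapply(lab_paired, pair, paste, collapse = "")   # each pair keeps one E and one S
tapply(lab_cluster, cluster, unique)             # each cluster keeps a single label
```

Swapping one of these in for the free relabeling changes only the line that generates labels; the statistic and the tail fraction are computed exactly as before.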

Both forms reduce to one sentence worth memorizing: the reshuffle mimics the assignment mechanism; the causal read rests on the random assignment, not on the \(p\)-value. A small \(p\) from a broken or mis-specified shuffle is a precise description of the wrong null — and a small \(p\) from data with no random assignment is a precise description of an association, nothing more.

Low-stakes self-checks (ungraded)

These are for your own practice — ungraded, no submission.

  1. In one sentence each, state the warrant for the reshuffle in (a) a week-3 permutation test of an observational comparison and (b) a week-4 randomization test of a designed experiment. Which one lets you write “caused,” and why?
  2. The Dataset W randomization distribution is centered at \(0\). Explain in your own words why it is centered at \(0\) under the no-effect null, in terms of what relabeling does to the outcomes.
  3. A colleague reports a randomization \(p \approx 0.02\) from an Express-vs-Standard comparison and concludes the Express workflow caused the improvement. What is the one fact about the study you must know before you can agree with the word “caused”?
  4. The arrivals were assigned to Express or Standard in matched pairs (similar arrival times paired, one of each pair to each arm). Explain why a free \(25/25\) relabeling is now the wrong reshuffle, and say how you would fix it.
  5. Someone claims the randomization test is “assumption-free.” Name the one structural assumption it still requires, and say where that assumption comes from — the data or the design.
  6. The observed effect is a difference in medians of \(-6\) minutes rather than a difference in means of \(-7\). Give the one-sentence reason the course reads the median difference on Dataset W.

Reading and source pointer

This week is grounded in the instructor notes (the primary course materials) for the randomization test and the design-based causal read. The IMS (Çetinkaya-Rundel & Hardin) treatment of randomization-based hypothesis testing supplies the reshuffle logic and the sequencing, and the ModernDive (Ismay, Kim & Valdivia) infer workflow (specify → hypothesize → generate → calculate) supplies the reproducible reference-distribution pipeline you will build in the companion lab, Lab 4 (Build a randomization test). These notes are the course’s own synthesis, grounded in but not copied from the sources: no prose, examples, exercises, figures, or solutions are reproduced from any source.

Evidence and verification status

verified: false. The method logic on this page is course-authored, but every numeric value here is drafted, synthetic, and not independently checked. The load-bearing numbers are the Dataset W slice — Standard median \(18\) minutes, Express median \(12\) minutes, the observed difference in medians \(T_{\text{obs}} = -6\) minutes, the randomization reference distribution centered at \(0\), and the two-sided randomization \(p \approx 0.02\) — together with the illustrative A/B-test transfer figures (median \(48\) vs \(55\) seconds, \(T_{\text{obs}} = -7\)). All example data are synthetic with set.seed(45203) and were not executed here. These worked numbers are provisional and not independently verified — treat them as targets to reproduce, not as confirmed reference values.

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded method checkpoints, weekly quizzes, homework and method reports, resampling and robustness labs, the midterm, the applied robust-methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.

Looking ahead

Next week we keep Dataset W and the same Express median, but we stop asking “is there a shift?” and start asking “how precise is the Express median itself?” The bootstrap answers that by resampling each arm with replacement and recomputing the median — and you will see the median’s bootstrap distribution come out lumpy and discrete (it can only take a handful of order-statistic values), with a bootstrap SE of the Express median around \(1.2\) minutes. The reshuffle gives way to the resample, and a new assumption-ladder trade comes with it: the bootstrap estimates sampling variability, and — unlike the randomization test — it can fail.

See also