Week 12 — Comparing parametric and nonparametric conclusions

When the methods agree, when they diverge, and how to report the difference

The week question

You have now built four ways to ask “is there a real difference here?” — the familiar parametric \(t\)-test, the permutation test, the rank-sum test, and (for a slope) least squares versus a robust fit. Run all of them on the same data and they will not always return the same number. This week’s question is the one a working analyst actually faces: when do the parametric and assumption-light conclusions agree, when do they diverge, and what do you do — and say — when they disagree? The answer is not “pick the test that gives the smallest p-value,” and it is not “the nonparametric test is always safer.” The answer is to read the shape of the data, choose the method whose assumptions that shape respects, and report the difference between methods honestly rather than burying it.

Why this matters

For eleven weeks the course has built assumption-light tools one at a time, each in the setting where it earns its keep. This week those tools finally stand side by side on the same dataset, and the comparison exposes a temptation that quietly distorts a great deal of applied work: method-shopping. When the \(t\)-test says \(p \approx 0.08\) and the rank-sum says \(p \approx 0.01\), it is easy to report only the one that crosses your favorite line and call it a day. That is not analysis; it is cherry-picking with extra steps.

The honest move is the opposite. A disagreement between methods is information — it is the data telling you that an assumption matters here. When the \(t\)-test and the permutation test agree, you have learned that the normal approximation was harmless for this question. When they diverge, you have learned that the skew or the outliers were doing real work, and the method that does not lean on the broken assumption is the one to trust. The disagreement is not a problem to be hidden; it is the diagnostic. This is also where the course’s signature discipline — the assumption ladder — pays off most directly: you can only say which method to trust if you can say, for each one, what it assumed, what it resampled or ranked or downweighted, what that protected against, and what it still cannot prove.

This matters beyond a grade. Method choice that follows the data shape is defensible to a reviewer, a regulator, or a co-author; method choice that follows the desired p-value is not. The skill this week is reporting a comparison in a way that someone who disagrees with you can still trust.

Learning goals

By the end of this week you should be able to:

Run a parametric and an assumption-light method on the same data and lay their conclusions side by side without privileging either one in advance.
Explain why the \(t\)-test, permutation test, and rank-sum test can give different p-values on skewed, outlier-laden data — and why they would agree on a large, clean, symmetric sample.
Diagnose a divergence by naming the assumption that fails (normality of the mean’s sampling distribution; the influence of a far point on least squares) rather than by declaring one test “more correct.”
Compare an OLS slope with a robust slope on a contaminated scatter and say what each estimates and which one answers the question asked.
Write an honest comparison: report what every reasonable method said, scope the conclusion to the data shape, and resist “the method that wins here wins everywhere.”

Core vocabulary

Parametric method — an inference procedure that assumes a distributional form (here, that the relevant sampling distribution is approximately normal), e.g. the two-sample \(t\)-test, which leans on the mean and its standard error.
Assumption-light method — a procedure that replaces a distributional assumption with resampling, ranking, or downweighting: the permutation test, the rank-sum test, the bootstrap, a robust estimator. Light is not free — each still assumes something (often exchangeability, or symmetry, or that contamination is a minority).
Agreement / divergence — whether two methods point to the same conclusion. Divergence is diagnostic: it flags that an assumption is doing real work on this dataset.
Data shape — the features that decide which assumptions hold: sample size, skew, heavy tails, outliers, ties, ordinal scale, contamination. The shape, not a ranking of tests, drives the choice.
Concordant vs. discordant conclusions — concordant: all reasonable methods agree, so the conclusion is robust to method choice. Discordant: they disagree, so you must say which assumption broke and report the spread.
Sensitivity (to method) — how much the answer moves when you change the method. Low method-sensitivity is reassuring; high method-sensitivity must be reported, not hidden.
Honest reporting — stating what every reasonable method returned, naming the data feature that drove any divergence, and scoping the conclusion to the data shape.

Concept development

Why agreement and divergence both happen

Start with the cleanest case. On a large, symmetric, outlier-free sample the \(t\)-test, the permutation test, and the rank-sum test will return nearly the same p-value, because the assumption the \(t\)-test needs (that the mean’s sampling distribution is approximately normal) is essentially true there. The Central Limit Theorem makes the sampling distribution of the mean approximately normal as \(n\) grows, provided the variance is finite and the tails are not too wild; the permutation test builds essentially the same reference distribution by shuffling, and on symmetric data ranks track the raw values closely. Three roads, one destination. When that happens, the comparison has taught you something valuable: the parametric approximation was harmless here, so you may report the familiar \(t\)-test result with a clear conscience.

Now break the shape. Introduce right skew and a couple of outliers far out in the long tail, as in the service-wait data. Two things go wrong for the \(t\)-test at once. First, the mean is pulled toward the long tail, so it is no longer a faithful summary of the typical value. Second, and this is the load-bearing failure, the few extreme observations inflate the sample standard deviation, which inflates the standard error in the denominator of the \(t\)-statistic, which shrinks \(t\) toward zero and pushes the p-value up. The \(t\)-test becomes conservative in the worst way: it loses power to detect a shift that is plainly there, because the same outliers that signal a heavy tail also blur the test’s own ruler. The permutation test (run here on the median gap and read against the data’s own shuffle distribution) and the rank-sum test (which replaces values with ranks, capping the influence of any one point) do not pay that penalty. So they keep the power the \(t\)-test gave away, and they return smaller p-values.
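The second failure can be written out as a short worked limit. This is a sketch under stated assumptions: the unpooled (Welch) form is used because it is what R’s t.test computes by default, and the symbol \(M\) for a single dominating Standard wait is introduced here for illustration only.

\[ t = \frac{\bar y_T - \bar y_C}{\sqrt{s_T^2/n_T + s_C^2/n_C}}, \qquad s_C^2 = \frac{1}{n_C - 1}\sum_{i=1}^{n_C} \left(y_{C,i} - \bar y_C\right)^2 . \]

If one Standard wait \(M\) grows far beyond every other observation, then \(\bar y_C \approx M/n_C\) and \(s_C^2 \approx M^2/n_C\), so

\[ |t| \approx \frac{M/n_C}{\sqrt{M^2/n_C^2}} = 1 . \]

The outlier enlarges the mean gap in the numerator, but it enlarges the standard error at the same rate. A single extreme wait therefore pulls a large \(|t|\) down toward \(1\), well short of any conventional threshold, however clear the shift is in the remaining observations.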

This is the whole logic of the week in one paragraph: the methods diverge exactly when an assumption is doing work, and they converge when it is not. Divergence is not noise to be averaged away — it is a measurement of how much the assumption mattered.

The locked instance — Dataset W (service wait times)

Hold the same skewed, outlier-laden two-group comparison the course has used since Week 1: Standard intake (\(n_C = 25\), median \(18\) min, mean \(\approx 22\) with two long waits near \(64\) and \(88\)) versus Express intake (\(n_T = 25\), median \(12\) min, mean \(\approx 15\)). Express is faster; the question is whether that shift is real. Run three tests on the same 50 waits and lay the conclusions side by side (all numbers synthetic; seed set):

\[ p_{t\text{-test}} \approx 0.08, \qquad p_{\text{permutation}} \approx 0.02, \qquad p_{\text{rank-sum}} \approx 0.01 . \]

Read the gradient. The \(t\)-test (\(p \approx 0.08\)) does not clear the conventional \(0.05\) line, not because the effect is absent, but because the two very long Standard waits near \(64\) and \(88\) minutes inflate the Standard group’s sample standard deviation, swell the standard error, and weaken the test. The permutation test (\(p \approx 0.02\)) shuffles the 50 group labels under exchangeability and reads the observed median gap against the data’s own reference distribution, so the inflated SD never enters; it detects the shift. The rank-sum test (\(p \approx 0.01\)) replaces the waits with their ranks, so the \(88\)-minute wait becomes simply “the largest rank” and cannot drag the analysis around; it is the sharpest of the three here. Assumption-ladder move: the \(t\)-test assumes an approximately normal sampling distribution for the mean difference and is not protected against the tail; the permutation test assumes exchangeability under the null and resamples labels, protecting against the distributional shape but not against a wrong null; the rank-sum assumes the two distributions differ by a location shift and ranks the data, protecting against outliers’ magnitude but discarding how big the difference is. None of the three can prove why a wait was long; they can show only that, accounting for shape, Express is faster.
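The claim that ranking caps a long wait’s influence can be checked directly. The following is a minimal sketch in base R on a handful of made-up waits (hypothetical values, not Dataset W); the variable names are illustrative.

```r
# Hypothetical waits in minutes; the last Standard value is the long one.
express  <- c(9, 11, 12, 14, 15)
standard <- c(13, 17, 18, 21, 88)

# Same data, but the longest wait is made ten times longer.
standard_extreme <- replace(standard, 5, 880)

# Ranks of the pooled data are identical: 88 and 880 are both just "the largest".
ranks_a    <- rank(c(express, standard))
ranks_b    <- rank(c(express, standard_extreme))
same_ranks <- identical(ranks_a, ranks_b)

# The rank-sum p-value therefore cannot move; the t-test p-value can.
p_rank_a <- wilcox.test(express, standard)$p.value
p_rank_b <- wilcox.test(express, standard_extreme)$p.value
p_t_a    <- t.test(express, standard)$p.value
p_t_b    <- t.test(express, standard_extreme)$p.value

print(c(same_ranks = same_ranks,
        rank_p_unchanged = isTRUE(all.equal(p_rank_a, p_rank_b))))
print(c(p_t_a = p_t_a, p_t_b = p_t_b))
```

Stretching the longest wait leaves the rank-sum result untouched and makes the \(t\)-test weaker, which is the same mechanism that separates the two tests on the service-wait data.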

The pivotal reading: on this shape the assumption-light methods are not “being generous.” They are recovering power the \(t\)-test threw away when its assumption broke. On a large, clean, symmetric version of the same comparison, all three p-values would land in the same neighborhood, and you would report the \(t\)-test without apology. The disagreement is the data shape, made numeric.

The same logic for a slope — Dataset D (engagement vs. wellbeing gain)

The agree-or-diverge question is not only about p-values; it is about estimates too. Take Dataset D: wellbeing gain against sessions attended for \(n = 40\) participants, whose clean structure (\(\text{gain} \approx 2 + 1.5 \cdot \text{sessions}\)) is spoiled by two contaminating points — a high-leverage data-entry-style point at sessions \(= 20\) with gain \(= 2\), and a vertical outlier at sessions \(= 5\) with gain \(= 40\). Fit the line two ways:

\[ \hat\beta_1^{\text{OLS}} \approx 0.6, \qquad \hat\beta_1^{\text{robust}} \approx 1.45 . \]

These are not two estimates of the same number with sampling noise between them; they are answering different questions about a contaminated dataset. Least squares minimizes \(\sum r_i^2\), so the squared residual of the far high-leverage point dominates the fit and flattens the slope from the clean \(1.5\) down to \(\approx 0.6\). The robust fit (Theil–Sen, the median of pairwise slopes, \(\approx 1.45\)) is barely moved by that point: pairs that include a contaminating point are a small minority of all pairwise slopes, so the median passes over them and recovers the structure the bulk of the data actually shows. Assumption-ladder move: OLS assumes that every point is a trustworthy draw from one model and is not protected against a single far point; the robust fit assumes that contamination is a minority and takes a median over pairwise slopes, protecting the slope against the two bad points, but it cannot prove those two points are errors rather than signal. So the honest report is not “the slope is \(0.6\)” or “the slope is \(1.45\)” but: the bulk of participants show a slope near \(1.45\); two contaminating points pull the least-squares line down to \(0.6\); here is which is which and why. Same discipline as the p-value gradient: report the difference, name the data feature that caused it, and scope the claim.
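The two fits can be placed side by side in a few lines of base R. This is a sketch only: the generator below is a hypothetical stand-in for Dataset D (the session range 1 to 12, the noise SD of 2, and the seed are assumptions made for illustration), so it will not reproduce \(0.6\) and \(1.45\) exactly. The helper theil_sen_slope is written here for the example and is not a course-provided function.

```r
set.seed(1)  # arbitrary seed, for this illustration only

# Hypothetical stand-in for Dataset D: 38 clean points plus two contaminating ones.
sessions <- c(sample(1:12, 38, replace = TRUE), 20, 5)
gain     <- c(2 + 1.5 * sessions[1:38] + rnorm(38, sd = 2), 2, 40)

# Least-squares slope.
ols_slope <- unname(coef(lm(gain ~ sessions))["sessions"])

# Theil-Sen slope: the median of the slopes over all pairs with distinct x.
theil_sen_slope <- function(x, y) {
  pairs <- combn(length(x), 2)          # 2 x choose(n, 2) matrix of index pairs
  dx <- x[pairs[2, ]] - x[pairs[1, ]]
  dy <- y[pairs[2, ]] - y[pairs[1, ]]
  median(dy[dx != 0] / dx[dx != 0])     # skip pairs with equal x (undefined slope)
}
robust_slope <- theil_sen_slope(sessions, gain)

print(c(ols = ols_slope, theil_sen = robust_slope))
```

With 40 points there are 780 pairs, and only 77 of them involve a contaminating point, which is why the median of pairwise slopes stays close to the clean slope while the least-squares slope does not.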

Choosing — the question and the shape, not “which test is more correct”

Put the two examples together and the decision rule falls out. You do not choose a method by asking “which test is more correct in the abstract?” — there is no such ranking. You choose by asking two questions in order. First, what is the question? A difference in typical wait points you toward the median, ranks, or a robust center; a difference in total throughput might genuinely be about the mean (a hospital that cares about total minutes of waiting does care about the long tail). The rank-sum answers “is an Express wait usually shorter?”; the \(t\)-test answers “is the mean wait lower?” — and on skewed data those are not the same question. Second, what is the data shape? If it is large, clean, and symmetric, the methods agree and the parametric route is fine and efficient. If it is skewed, heavy-tailed, ordinal, or contaminated, the assumption-light route protects the conclusion. The method follows the question and the shape; the p-value is a consequence of that choice, never the reason for it.
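The second question, “what is the data shape?”, can be given a numeric first pass before any test is run. The helper below is a hypothetical sketch, not a course-prescribed diagnostic: the name shape_report, the choice of summaries, and the 1.5 × IQR fence rule are assumptions, and no cutoff here decides the method for you.

```r
# Hypothetical helper: numeric summaries of the features the text calls "data shape".
shape_report <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  dev <- x - mean(x)
  c(
    n                 = length(x),
    mean_minus_median = mean(x) - median(x),            # > 0 suggests right skew
    skewness          = mean(dev^3) / mean(dev^2)^1.5,  # moment-based sample skewness
    n_outside_fences  = sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr),
    n_tied            = sum(duplicated(x))              # ties matter for rank tests
  )
}

skewed    <- c(10, 12, 14, 16, 18, 20, 88)  # made-up waits with one long value
symmetric <- 1:9                            # made-up symmetric values

print(round(shape_report(skewed), 2))
print(round(shape_report(symmetric), 2))
```

Read the output alongside a plot rather than instead of one; the summaries only flag which assumptions deserve a closer look.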

Worked examples

Worked example — Dataset W, three tests side by side (recurring slice)

What is assumed. Fifty service waits in minutes, \(25\) Standard and \(25\) Express, right-skewed with two long Standard waits near \(64\) and \(88\). The question is whether Express intake shifts waits lower. We run a two-sample \(t\)-test (assumes an approximately normal sampling distribution for the mean difference), a permutation test (assumes exchangeability of the labels under the null), and a Wilcoxon rank-sum test (assumes a location shift between two otherwise-similar distributions). Data are synthetic; seed set.

Computation. The static R below lines up all three p-values on the same 50 waits. It is shown as teaching code and is not executed here.

set.seed(45203)

# Synthetic Dataset W: right-skewed service waits, two long Standard waits.
standard <- c(rgamma(23, shape = 4, scale = 5), 64, 88)   # n_C = 25; gamma part has median ~ 18
express  <- rgamma(25, shape = 3, scale = 4)               # n_T = 25; gamma(3, 4) has mean 12, median ~ 10.7

# 1) Parametric: two-sample t-test on the means (Welch, unpooled, is R's default)
p_t <- t.test(express, standard)$p.value
#   the two long Standard waits inflate the SD -> SE up -> t down -> target p_t ~ 0.08

# 2) Permutation: shuffle the 50 labels, recompute the median difference
waits  <- c(express, standard)
labels <- rep(c("E", "S"), each = 25)
obs    <- median(express) - median(standard)              # observed gap ~ -6 min
perm   <- replicate(10000, {
  lab <- sample(labels)                                   # reshuffle under H0
  median(waits[lab == "E"]) - median(waits[lab == "S"])
})
p_perm <- mean(abs(perm) >= abs(obs))                     # two-sided -> p_perm ~ 0.02

# 3) Rank-based: Wilcoxon rank-sum / Mann-Whitney on the ranks
p_rank <- wilcox.test(express, standard)$p.value          # ranks cap outliers -> p_rank ~ 0.01

# p_t ~ 0.08   p_perm ~ 0.02   p_rank ~ 0.01
# SAME data, SAME effect (Express faster) -- the spread is the data SHAPE talking.

Interpretation. The three p-values (\(0.08\), \(0.02\), \(0.01\)) describe the same shift in the same data; the gradient is the skew and the two long waits, not three different effects. The \(t\)-test (\(0.08\)) fails to clear the conventional \(0.05\) line because the \(64\)- and \(88\)-minute waits inflate its standard deviation and blunt its ruler; the permutation test (\(0.02\)) sidesteps that by reading the median gap against the data’s own shuffle distribution; the rank-sum (\(0.01\)) is sharpest because ranks cap the influence of even the longest wait. Assumption-ladder move: what was assumed differs by method (normal mean-difference / exchangeability / location shift); what was resampled or ranked differs (nothing / labels / values→ranks); only the last two are protected against the tail; and none of them can prove the cause of any single long wait. The honest one-line report is: all three methods point the same way (Express is faster), but only the permutation and rank-sum tests clear the conventional line; the \(t\)-test is weakest here because the right tail inflates its standard error, so on this skewed, outlier-laden shape the permutation and rank-sum conclusions are the ones to trust. On a large, clean, symmetric sample these three would have agreed, and the \(t\)-test would have been perfectly fine; the divergence is specific to this shape.
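A p-value gradient says nothing about how large the shift is, so an honest report usually carries an effect estimate next to each test. The sketch below uses made-up waits (hypothetical values, not Dataset W). The Hodges–Lehmann shift, obtained from wilcox.test with conf.int = TRUE, is a swapped-in companion to the rank-sum test rather than something the worked example computes.

```r
# Made-up waits (minutes), not Dataset W: two long Standard waits, no ties.
standard <- c(9.5, 12.1, 14.3, 15.8, 17.2, 18.4, 19.9, 21.5, 24.0, 27.6, 64.0, 88.0)
express  <- c(5.2, 7.4, 8.9, 10.3, 11.6, 12.2, 13.7, 15.1, 16.8, 19.4, 22.3, 25.9)

# Effect estimates (Express minus Standard), each matched to the test it belongs to.
mean_gap   <- mean(express) - mean(standard)       # what the t-test is about
median_gap <- median(express) - median(standard)   # statistic used in the permutation test
hl         <- wilcox.test(express, standard, conf.int = TRUE)
hl_shift   <- unname(hl$estimate)                  # Hodges-Lehmann: median of pairwise differences

print(round(c(mean_gap = mean_gap, median_gap = median_gap, hl_shift = hl_shift), 2))
print(round(hl$conf.int, 2))                       # interval for the shift, in minutes
```

The mean gap is the most negative of the three because the two long waits sit in its numerator. The median gap and the Hodges–Lehmann shift describe the typical difference instead, which is the quantity the rank-based conclusion refers to.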

Worked example — a clean, symmetric sample where the methods agree (transfer, new context)

What is assumed. A campus testing center measures the time (in seconds) to complete a short standardized task for \(n = 120\) students split evenly between two interface versions, A and B. The task times are roughly symmetric, mound-shaped, with no outliers and a large sample. The question is whether version B is faster. We run the same three methods. These numbers are illustrative and distinct from Dataset W.

Computation. Suppose version A averages \(\bar y_A \approx 42\) s and version B averages \(\bar y_B \approx 39\) s, with comparable spreads and no long tail. Running the trio:

\[ p_{t\text{-test}} \approx 0.03, \qquad p_{\text{permutation}} \approx 0.03, \qquad p_{\text{rank-sum}} \approx 0.03 . \]

set.seed(45203)
# Large, clean, symmetric: 60 per arm, mound-shaped, no outliers (illustrative).
a <- rnorm(60, mean = 42, sd = 6)
b <- rnorm(60, mean = 39, sd = 6)

p_t    <- t.test(b, a)$p.value            # ~ 0.03
obs    <- mean(b) - mean(a)
pool   <- c(b, a)
lab0   <- rep(c("B", "A"), each = 60)
perm   <- replicate(10000, {
  l <- sample(lab0)                       # reshuffle under H0
  mean(pool[l == "B"]) - mean(pool[l == "A"])
})
p_perm <- mean(abs(perm) >= abs(obs))     # ~ 0.03
p_rank <- wilcox.test(b, a)$p.value       # ~ 0.03

# p_t ~ p_perm ~ p_rank ~ 0.03  -- on clean, symmetric, large data the methods AGREE.

Interpretation. Here all three p-values land at \(\approx 0.03\). The agreement is not luck: with a large, symmetric, outlier-free sample the mean’s sampling distribution really is approximately normal, so the \(t\)-test’s assumption holds, and ranks track the raw values closely, so the rank-sum carries the same signal. Assumption-ladder move: the same three assumptions are made, but here none of them is strained, so the resampling and ranking buy you nothing extra — and that is itself the finding. The design move is identical to Dataset W — run a parametric and two assumption-light methods, lay the conclusions side by side — but because the shape is clean, the honest report is the mirror image: the methods agree, so the conclusion is robust to method choice, and the efficient \(t\)-test is a perfectly good way to report it. This is the case that keeps you from over-correcting: assumption-light methods are insurance against bad shape, not a tax you must always pay.

A common mistake

The week’s central trap is “the method that wins here wins everywhere” (Risks 7, 12, and 15) — generalizing a method’s victory on one data shape into a universal ranking of tests.

It sounds like: “the rank-sum beat the \(t\)-test on the waits, so nonparametric tests are just better — I’ll use them for everything.” Or its mirror image: “the \(t\)-test is the standard, and the nonparametric tests only gave smaller p-values because they’re looser, so I’ll ignore them.” Both draw a universal conclusion from a shape-specific result, and both are wrong. The rank-sum won on Dataset W because the data were skewed with outliers — exactly the shape that breaks the \(t\)-test’s standard error. Change the shape to large, clean, and symmetric (the transfer example) and the advantage evaporates: all three agree, and the \(t\)-test is the efficient choice. There is no context-free ranking of tests, so “wins here” never licenses “wins everywhere.”

Three corrections keep you honest. First, scope every conclusion to the data shape. Say “on this skewed, outlier-laden sample the rank-sum is sharper,” not “the rank-sum is better.” The qualifier is the whole point. Second, report the difference rather than hiding it. When the \(t\)-test says \(0.08\) and the rank-sum says \(0.01\), the honest report states both, names the right tail as the cause, and explains which to trust and why — it does not quietly drop the \(0.08\) because it is inconvenient, nor quietly drop the \(0.01\) because it is unfamiliar. Selectively reporting the method that crosses your preferred line is method-shopping, and a reader who re-runs the other test will catch it. Third, remember assumption-light is not assumption-free. The permutation test still assumes exchangeability; the rank-sum still assumes a location shift; the robust slope still assumes contamination is a minority. A method that “wins” by quietly violating its own assumption has not won at all — so naming what each method assumed is part of reporting the comparison, not an optional extra.
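One way to make “report the difference rather than hiding it” routine is to have a single helper return every method’s result together with what it assumed. This is a hypothetical sketch: the function name compare_methods, its columns, and the demo data are assumptions for illustration. It also uses the (1 + count) / (n_perm + 1) form of the permutation p-value, a common variant that differs slightly from the plain proportion used in the worked example.

```r
# Hypothetical helper: run all three tests and return them side by side.
compare_methods <- function(x, y, n_perm = 5000) {
  pooled <- c(x, y)
  group  <- rep(c("x", "y"), times = c(length(x), length(y)))
  obs    <- median(x) - median(y)
  perm   <- replicate(n_perm, {
    g <- sample(group)                               # reshuffle labels under H0
    median(pooled[g == "x"]) - median(pooled[g == "y"])
  })
  data.frame(
    method  = c("Welch t-test", "permutation (median gap)", "Wilcoxon rank-sum"),
    assumes = c("approx. normal mean difference",
                "exchangeable labels under H0",
                "location shift"),
    p_value = c(
      t.test(x, y)$p.value,
      (1 + sum(abs(perm) >= abs(obs))) / (n_perm + 1),  # +1 keeps the estimate above zero
      wilcox.test(x, y)$p.value
    )
  )
}

set.seed(2)  # arbitrary seed, for this illustration only
x <- rgamma(25, shape = 3, scale = 4)                    # made-up "Express"-like waits
y <- c(rgamma(23, shape = 4, scale = 5), 64, 88)         # made-up "Standard"-like waits
report <- compare_methods(x, y)
print(report)
```

Because every row is always printed, dropping the inconvenient one becomes a deliberate act rather than an omission.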

Low-stakes self-checks (ungraded)

These are for your own practice — ungraded, no submission.

On Dataset W the \(t\)-test gives \(p \approx 0.08\) while the rank-sum gives \(p \approx 0.01\), on the same data. In one sentence, name the data feature that causes the gap and the mechanism by which it weakens the \(t\)-test specifically.
A classmate says “the permutation test gave a smaller p-value, so it found a bigger effect.” Explain what is wrong with equating a smaller p-value with a bigger effect.
For the clean, symmetric transfer example all three methods returned \(\approx 0.03\). State what that agreement tells you about the \(t\)-test’s assumption there, and what you would therefore report.
On Dataset D the OLS slope is \(\approx 0.6\) and the robust slope is \(\approx 1.45\). Say which one describes the bulk of the participants, which question each one answers, and why they differ.
Write one honest sentence reporting the Dataset W comparison to a reader who has not seen the data — one that scopes the conclusion to the data shape and reports the difference between methods.
A colleague concludes “nonparametric tests are always safer, so I’ll never use a \(t\)-test again.” Give two reasons this over-corrects, and name a data shape where the \(t\)-test is the better choice.

Reading and source pointer

This week is grounded in the instructor notes (the primary course materials) for comparing parametric and assumption-light conclusions and reporting the difference honestly, with IMS (Çetinkaya-Rundel & Hardin) on choosing and comparing inferential approaches for the framing of when simulation-based and theory-based methods agree and diverge. These notes are the course’s own synthesis, grounded in but not copied from the sources. No prose, examples, exercises, figures, or solutions are reproduced from any source.

Evidence and verification status

verified: false. The method logic on this page is course-authored, but every numeric value here is drafted, synthetic, and not independently checked. The page’s load-bearing numbers are the Dataset W comparison — \(t\)-test \(p \approx 0.08\), permutation \(p \approx 0.02\), rank-sum \(p \approx 0.01\) on the same 50 waits (Standard median \(18\), Express median \(12\), two long Standard waits near \(64\) and \(88\)) — the Dataset D slopes — OLS \(\approx 0.6\) vs robust \(\approx 1.45\) — and the illustrative clean-sample transfer numbers (all three methods \(\approx 0.03\) on \(\bar y_A \approx 42\) vs \(\bar y_B \approx 39\)). All data are synthetic with set.seed(45203). These worked numbers are provisional and not independently verified — treat them as targets to reproduce, not as confirmed reference values.

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded method checkpoints, weekly quizzes, homework and method reports, resampling and robustness labs, the midterm, the applied robust-methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.

Looking ahead

Next week we stop running the methods on one dataset and start running them on thousands — a simulation study of method behavior. Instead of asking “which method wins on these particular waits?” we fix a data-generating process, generate many samples from it, and measure each method’s Type I error, power, and confidence-interval coverage directly. The week’s lesson is the rigorous version of this week’s intuition: on normal data all methods hold level and the \(t\)-test is slightly most powerful; on right-skewed data the \(t\)-test’s CI under-covers while the rank-sum holds level and gains power; on heavy-tailed and contaminated data the assumption-light methods pull further ahead. No method wins everywhere — and next week we will see exactly where each one wins, by counting.