Methods glossary

The vocabulary used across the course

Keep this page open while you read the notes. One discipline runs through every table: the assumption ladder. For each method, you should always be able to say four things: what it assumes, what it resamples, ranks, or downweights, what it protects against, and what it cannot prove. Assumption-light is never assumption-free: a permutation test still assumes exchangeability, a bootstrap interval is a procedure with failure cases, and a robust estimator buys resistance by spending efficiency. Name the trade every time. All numeric values come from the synthetic Riverside Wellness Program datasets (W, S, L, D) and are provisional pending review.

Empirical distributions, order statistics & ranks

The data’s own distribution is the engine behind both ranks and the bootstrap. You read it instead of assuming a normal curve.

Term	Meaning
empirical distribution / ECDF	the data’s own distribution, \(\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}\), i.e. the fraction of the sample at or below \(x\)
order statistics	the sorted values \(x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}\); the building blocks of quantiles and the median
quantile	the value below which a given fraction of the data falls; the \(0.5\) quantile is the median, \(0.25\) and \(0.75\) are the quartiles
median \(\tilde x\)	the middle order statistic (the average of the two middle ones when \(n\) is even), a resistant center; on Dataset W the Standard median is \(18\) min while the mean \(\approx 22\) min is dragged up by two long waits (\(\approx 64\), \(88\))
rank \(R_i\)	an observation’s position \(1, 2, \dots, n\) after sorting; replacing values by ranks discards the spacing but keeps the ordering
mid-rank	the average of the ranks that tied observations would otherwise occupy, so ties share those ranks equally (e.g. two values tied for ranks \(4\) and \(5\) each get \(4.5\))

Ladder move. Ranks and the ECDF assume only that you can order the data; they protect against a heavy tail or a single huge value distorting a summary; they cannot prove anything about the spacing you threw away — a rank gap of one can hide a tiny or an enormous gap in the raw scale.
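
The terms in the table above can be tried out in a few lines of base R. This is a minimal sketch on made-up waits: the vector `x` is hypothetical and is not Dataset W.

```r
# Hypothetical waits in minutes (illustrative only, not Dataset W)
x <- c(12, 18, 18, 25, 9, 64)

Fn <- ecdf(x)     # empirical CDF: a step function built from the data
Fn(18)            # 4 of 6 values are at or below 18, so about 0.667

sort(x)           # order statistics: 9 12 18 18 25 64
median(x)         # 18, the average of the two middle order statistics
quantile(x, c(0.25, 0.75))  # quartiles (R's default interpolation rule)

r <- rank(x)      # 2.0 3.5 3.5 5.0 1.0 6.0; the tied 18s share mid-rank 3.5

# Ranks keep the ordering but discard the spacing:
x_far <- replace(x, 6, 640)   # push the largest wait ten times further out
identical(rank(x_far), r)     # TRUE: the ranks cannot see the change
```

The last two lines are the ladder move in miniature: the summary is protected from the huge value, and for the same reason it can say nothing about how far away that value is.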

Permutation & randomization

Both build a reference distribution by shuffling labels. The difference is what the shuffle stands for — exchangeability (a modelling assumption) versus the assignment mechanism (a design fact).

Term	Meaning
exchangeability	the null assumption that, were there no group effect, the group labels are interchangeable — any relabeling is equally likely
permutation test	pool the data, repeatedly shuffle the group labels under exchangeability, recompute the statistic; on Dataset W shuffle the \(50\) labels \(\approx 10{,}000\) times and recompute the median difference
randomization test	the same shuffle machinery, but justified by how treatment was actually assigned; if Express was randomly assigned, the reshuffle mimics the assignment mechanism and licenses a causal reading
reference (null) distribution	the distribution of the statistic across all the shuffles — centered at \(0\) for the median difference; the observed \(-6\) min sits in its tail
permutation / randomization \(p\)-value	the tail fraction of shuffles at least as extreme as the observed statistic; on Dataset W, two-sided \(p \approx 0.02\)
test statistic	the one number summarizing the effect (here the difference in medians, \(12 - 18 = -6\) min) that the shuffle distribution is built around

Ladder move. A permutation test assumes exchangeability under the null; it resamples by reshuffling labels; it protects against needing a normality or large-sample assumption; it cannot prove causation unless a randomization (design-based) justification supplies it. Permuting the wrong thing — shuffling labels when the units are paired or dependent — silently breaks the test.
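
The shuffle machinery can be sketched in base R as follows. The two wait vectors are hypothetical (they are not Dataset W), the seed is arbitrary, and the `+1` in the p-value is one common convention for counting the observed labeling among the shuffles, not necessarily the one used in the course notes.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical waits in minutes; NOT the Dataset W values
express  <- c(8, 10, 11, 12, 14, 15, 16, 20)
standard <- c(14, 16, 17, 19, 21, 22, 25, 64)

wait  <- c(express, standard)   # pool the data
group <- rep(c("Express", "Standard"),
             times = c(length(express), length(standard)))

# Test statistic: difference in medians, Express minus Standard
med_diff <- function(y, g) {
  median(y[g == "Express"]) - median(y[g == "Standard"])
}

observed <- med_diff(wait, group)   # 13 - 20 = -7 for these made-up data

# Reference distribution: shuffle the labels, recompute the statistic
B <- 10000
shuffled <- replicate(B, med_diff(wait, sample(group)))

# Two-sided p-value: fraction of shuffles at least as extreme as observed
p_value <- (1 + sum(abs(shuffled) >= abs(observed))) / (B + 1)
p_value
```

Note that `sample(group)` shuffles labels across all units, which is only appropriate when the units are exchangeable under the null. For paired data the analogous move would be to flip the sign of each within-pair difference instead.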

The bootstrap

Resampling with replacement samples from \(\hat F_n\) to approximate sampling variability. It is a procedure, not a guarantee, and it has named failure cases.

Term	Meaning
resample with replacement	draw \(n\) values from the data with replacement (i.e. sample from \(\hat F_n\)); some original values appear twice, some not at all
bootstrap distribution	the spread of a statistic across many resamples; for the Express median it is lumpy/discrete, taking only a few distinct order-statistic values
bootstrap SE	the standard deviation of the statistic across resamples; on Dataset W the Express-median SE \(\approx 1.2\) min and the difference-in-medians SE \(\approx 2.0\) min
percentile interval	a CI read straight off the resample quantiles; the \(95\%\) percentile CI for the difference in medians is \(\approx (-10, -2)\) min — it excludes \(0\)
basic interval	a CI that reflects the percentile interval about the observed estimate; can disagree with the percentile interval under skew
BCa interval	bias-corrected and accelerated — adjusts the percentile endpoints for bias and skew; the preferred interval when the bootstrap distribution is asymmetric
failure case	the bootstrap understates uncertainty for extreme order statistics (e.g. the sample maximum wait — it can never resample beyond what was seen), and is unreliable for very small \(n\) or dependent data

Ladder move. The bootstrap assumes the sample represents the population and (for the plain version) that observations are independent; it resamples rows to mimic sampling variability; it protects against needing a closed-form SE formula; it cannot prove the interval is valid when the statistic depends on the tail (the maximum) or when rows are dependent. A bootstrap interval is not “model-free truth.”
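
A minimal base-R sketch of the bootstrap SE, the percentile and basic intervals, and the maximum failure case. The vector `x`, the seed, and the number of resamples `B` are all assumptions made for illustration; none of them come from Dataset W.

```r
set.seed(2)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical waits in minutes; not the Dataset W values
x <- c(8, 10, 11, 12, 14, 15, 16, 20, 22, 31)
n <- length(x)
B <- 5000

# Resample n values with replacement and recompute the median each time
boot_median <- replicate(B, median(sample(x, replace = TRUE)))

boot_se  <- sd(boot_median)                          # bootstrap SE
pct_ci   <- quantile(boot_median, c(0.025, 0.975))   # percentile interval
basic_ci <- unname(2 * median(x) - rev(pct_ci))      # basic interval: percentile
                                                     # endpoints reflected about
                                                     # the observed median

length(unique(boot_median))  # only a handful of distinct values: lumpy

# Failure case: the sample maximum
boot_max <- replicate(B, max(sample(x, replace = TRUE)))
all(boot_max <= max(x))      # TRUE: a resample can never exceed what was seen
mean(boot_max == max(x))     # close to 1 - (1 - 1/n)^n, about 0.65 for n = 10
```

The BCa interval is not implemented here. If the `boot` package is available, `boot::boot.ci(..., type = "bca")` is one place to get it.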

Rank-based tests

Ranks turn a distributional question into a counting/ordering question. The three one-sample methods form a clean ladder; the two-sample method reads as a stochastic shift, not a mean difference.

Term	Meaning
sign test	uses only the signs of paired differences; on Dataset S, \(11\) of \(14\) nonzero differences are positive, counted against Binomial\((14, 0.5)\), two-sided \(p \approx 0.057\)
Wilcoxon signed-rank \(W^+\)	ranks the magnitudes \(|d_i|\) and sums the ranks of the positive differences; on Dataset S, \(W^+\) is large, \(p \approx 0.02\), sharper than the sign test because it uses magnitude as well as sign
symmetry assumption	the signed-rank test assumes the paired differences are symmetric about the median — stronger than the sign test, weaker than normality
rank-sum / Mann–Whitney \(U\)	pool two groups, rank all values, compare rank totals; on Dataset W, \(p \approx 0.01\); reads as a stochastic shift (one group tending to give larger values), not a difference in means
probability of superiority \(P(X<Y)\)	the chance a random value from one group beats a random value from the other; on Dataset W, \(\hat P(\text{Express} < \text{Standard}) \approx 0.72\) (an Express wait is usually shorter)
ties / mid-ranks	tied values share mid-ranks in the rank-sum computation, so ties do not inflate the statistic

Ladder move. The one-sample ladder is sign test \(\subset\) signed-rank \(\subset\) paired \(t\)-test: each step assumes more (signs only → symmetric differences → normality) and, when its assumption holds, gains power. The rank-sum protects against skew and outliers but cannot prove a difference in means — reading \(P(X<Y)\) as a mean gap is the classic error.
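
The rank-based tests map onto base R roughly as below. Only the sign-test count (11 of 14) comes from the glossary. The differences `d` and the groups `x` and `y` are hypothetical stand-ins, not Datasets S or W.

```r
# Sign test: 11 of 14 nonzero differences positive, against Binomial(14, 0.5)
p_sign <- binom.test(11, 14, p = 0.5)$p.value
p_sign   # about 0.057, two-sided

# Hypothetical paired differences (not Dataset S)
d <- c(3.1, -0.4, 2.2, 5.0, 1.3, -1.1, 4.4, 0.9, 2.7, 6.2)
W_plus <- sum(rank(abs(d))[d > 0])  # sum of the ranks of positive differences
W_plus                              # 51 for these made-up values
wilcox.test(d, mu = 0)              # reports the same quantity as V

# Hypothetical two-sample data (not Dataset W); note the tied 14s and 16s
x <- c(8, 10, 11, 12, 14, 15, 16, 20)    # stands in for Express
y <- c(14, 16, 17, 19, 21, 22, 25, 64)   # stands in for Standard
rs <- wilcox.test(x, y, exact = FALSE)   # ties rule out the exact p-value

# Probability of superiority P(X < Y), counting ties as one half
p_sup <- mean(outer(x, y, "<") + 0.5 * outer(x, y, "=="))
p_sup                                    # 57/64, about 0.89

# R's W counts pairs with x > y (ties as one half), so the two are linked
1 - unname(rs$statistic) / (length(x) * length(y))   # equals p_sup
```

Reporting `p_sup` alongside the p-value keeps the interpretation honest: it answers how often one group beats the other, which is what the rank-sum test is actually about.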

Ordinal & categorical outcomes

Respect the measurement scale. Ordered categories carry information that a nominal test throws away, and that you destroy if you average the labels.

Term	Meaning
ordinal vs nominal	ordinal categories have a meaningful order (\(1\)–\(5\) satisfaction); nominal categories do not (blood type) — the order is information you keep or discard
mid-ranks (ordinal)	rank the ordinal scores and average tied ranks within each category, so the rank-based test respects the ordering
averaging the labels (a trap)	the mean of numeric codes (Express \(\approx 4.12\), Standard \(\approx 3.38\) on Dataset L) treats the labels as equally spaced — is the step \(1\to2\) the same “amount” as \(4\to5\)?
nominal \(\chi^2\) test	a chi-square test of independence treats categories as unordered; on Dataset L, \(\chi^2 \approx 9.9\) on \(4\) df, \(p \approx 0.04\) — it throws away the ordering
ordinal trend / rank test	uses the ordering and is more powerful here; on Dataset L the rank-based \(p \approx 0.01\), with \(P(\text{Express} > \text{Standard}) \approx 0.66\)
median category	the resistant ordinal center; Express median category \(= 4\), Standard \(= 3\)

Ladder move. A trend/rank test assumes the categories are ordered; it ranks the scores (mid-ranks for ties); it protects against the power loss of pretending the categories are unordered; it cannot prove the distances between adjacent categories are equal — which is exactly why you do not average the labels.
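
The contrast between the nominal and ordinal views can be sketched in base R. The table of counts below is hypothetical (it is not Dataset L), so the test results it produces are not the glossary's numbers.

```r
# Hypothetical counts of 1-5 satisfaction ratings (not Dataset L)
counts <- rbind(Express  = c(3, 5, 8, 16, 18),
                Standard = c(7, 10, 14, 12, 7))
colnames(counts) <- 1:5

# Nominal view: chi-square test of independence, df = (2 - 1) * (5 - 1) = 4
nominal <- chisq.test(counts)

# Shuffling the category order leaves the chi-square statistic unchanged,
# which is exactly what "throws away the ordering" means
scrambled <- chisq.test(counts[, c(3, 1, 5, 2, 4)])
all.equal(unname(nominal$statistic), unname(scrambled$statistic))  # TRUE

# Ordinal view: expand to one score per respondent and rank (mid-ranks for ties)
express  <- rep(1:5, times = counts["Express", ])
standard <- rep(1:5, times = counts["Standard", ])
ordinal  <- wilcox.test(express, standard, exact = FALSE)

# Resistant ordinal centers
median(express)    # 4 for these made-up counts
median(standard)   # 3 for these made-up counts

# P(Express > Standard), counting ties as one half
p_sup <- mean(outer(express, standard, ">") +
              0.5 * outer(express, standard, "=="))
```

Nothing in this sketch takes the mean of the codes 1 to 5. The medians and `p_sup` use only the ordering of the categories.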

Robustness, outliers & influence

Robust summaries resist contamination; robust regression recovers structure that one bad point would otherwise distort. The price is efficiency, and the discipline is to investigate, not auto-delete.

Term	Meaning
trimmed mean \(\bar x_\alpha\)	the mean after dropping the top and bottom \(\alpha\) fraction; on Dataset D the \(10\%\) trimmed mean of the gain is \(8.3\) vs the ordinary mean \(11\)
MAD	median absolute deviation, a resistant spread; on Dataset D the MAD-based SD \(\approx 5\) vs the ordinary SD \(9\) (inflated by a \(+40\) point)
IQR	the interquartile range \(Q_3 - Q_1\), a resistant spread and the basis of the boxplot outlier rule
breakdown point	the fraction of contamination an estimator tolerates before it can be driven arbitrarily wrong — \(0\) for the mean, \(\approx 0.5\) for the median
influence	how much one observation moves the estimate; a point with high influence pulls the fit toward itself
outlier	an unusual \(y\) value (a vertical outlier — e.g. Dataset D’s gain \(= 40\) at sessions \(= 5\))
high leverage	an unusual \(x\) value (e.g. Dataset D’s point at sessions \(= 20\) with gain \(= 2\), at the edge of the \(x\) range)
OLS (least squares)	minimizes \(\sum r_i^2\), so one far point dominates; on Dataset D the leverage point flattens the slope to \(\approx 0.6\) vs the clean \(\approx 1.5\)
L1 / least absolute deviations	minimizes \(\sum |r_i|\), so it is more resistant to vertical outliers; recovers slope \(\approx 1.5\) on Dataset D
Theil–Sen	the median of all pairwise slopes; breakdown point \(\approx 0.29\), far above OLS but below the median’s \(0.5\); slope \(\approx 1.45\) on Dataset D
Huber M-estimator	downweights large residuals via a hybrid squared/absolute loss; slope \(\approx 1.4\) on Dataset D

Ladder move. A robust summary assumes most of the data are clean; it downweights or trims the contaminating minority; it protects against a few outliers or one leverage point hijacking the answer; it cannot prove a flagged point is an error. A point may be flagged by the boxplot rule, \(|z| > 3\), leverage, or Cook’s distance, but a flag is a prompt to investigate, not a licence to auto-delete. The trade is efficiency: under clean data the robust fit is slightly less precise than OLS.

A short static idiom shows the contrast on Dataset D (synthetic; seed set; not executed here):

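# Static idiom: assumes numeric vectors `gain` and `sessions` from Dataset D
# (wellbeing gain vs sessions, two contaminating points) are already loaded,
# and that the MASS package is installed.
set.seed(45203)          # only matters if lqs() falls back to random subsampling

mean(gain)               # 11   (inflated by the +40 vertical outlier)
median(gain)             #  8   (resistant)
mean(gain, trim = 0.10)  #  8.3 (10% trimmed mean)
mad(gain)                #  ~5  (MAD rescaled to estimate the SD, vs SD ~9)

coef(lm(gain ~ sessions))["sessions"]            # ~0.6  (OLS, distorted)
coef(MASS::rlm(gain ~ sessions))["sessions"]     # ~1.4  (Huber M-estimator)
# lqs() fits least trimmed squares by default. It is a resistant fit used here
# as a stand-in; it is not the Theil-Sen estimator itself.
coef(MASS::lqs(gain ~ sessions))["sessions"]     # ~1.45 (resistant LTS fit)
# robust fits recover the clean slope ~1.5; OLS does not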

The OLS slope \(\approx 0.6\) is the leverage point flattening the line; the robust slopes \(\approx 1.4\)–\(1.45\) recover the clean structure. What is assumed is that the bulk of the points follow the linear trend; what is downweighted is the contaminating minority; what cannot be proven from the fit alone is whether those two points are data-entry errors or real extreme responders.
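
Since Theil–Sen is defined in the table as the median of all pairwise slopes, it can be sketched directly in base R. The helper `theil_sen_slope` is a hypothetical name, and the data are made up: the two contaminating points echo the ones described for Dataset D (gain 40 at 5 sessions, gain 2 at 20 sessions), but the other seven points are invented to lie exactly on a line of slope 1.5.

```r
# Hypothetical data: seven clean points on gain = 1.5 * sessions,
# plus one vertical outlier (5, 40) and one leverage point (20, 2).
# These are NOT the Dataset D values.
sessions <- c(2, 4, 5, 6, 8, 10, 12, 14, 20)
gain     <- c(3, 6, 40, 9, 12, 15, 18, 21, 2)

# Theil-Sen slope: the median of the slopes over all pairs of points
theil_sen_slope <- function(x, y) {
  idx <- combn(length(x), 2)          # every pair i < j, one pair per column
  dx  <- x[idx[2, ]] - x[idx[1, ]]
  dy  <- y[idx[2, ]] - y[idx[1, ]]
  keep <- dx != 0                     # pairs with equal x have no slope
  median(dy[keep] / dx[keep])
}

ts_slope  <- theil_sen_slope(sessions, gain)
ols_slope <- unname(coef(lm(gain ~ sessions))["sessions"])

ts_slope    # 1.5: 21 of the 36 pairwise slopes come from clean pairs
ols_slope   # about -0.21: the two contaminating points even flip the sign
```

The median works here because the clean pairs are the majority of all pairs. Once contaminated pairs outnumber them, the pairwise median fails too, which is the breakdown point in action.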

Method comparison & simulation

You judge a method not by a single dataset but by how it behaves across many simulated datasets from a known data-generating process. No method wins everywhere.

Term	Meaning
data-generating process (DGP)	the known mechanism a simulation draws from (Normal, right-skewed lognormal, heavy-tailed \(t_3\), or contaminated with \(5\%\) outliers) — the “truth” you can check methods against
Type I error	the rate of rejecting a true null; the nominal level is \(0.05\), and you check whether the actual rate matches
power	the rate of detecting a real effect; on heavy-tailed \(t_3\) data, simulated power is \(\approx 0.55\) for the \(t\)-test vs \(\approx 0.70\) for Wilcoxon
coverage	the rate at which a CI contains the true value; on right-skewed data the \(t\)-interval under-covers (coverage \(\approx 0.91\) vs nominal \(0.95\)); under \(5\%\) contamination the CI for the mean covers \(\approx 0.86\) vs \(\approx 0.94\) for the trimmed-mean CI
nominal vs actual	the target level/coverage vs what the method actually delivers in simulation — a method that doesn’t hold its nominal level is not trustworthy
Monte Carlo error	the simulation’s own uncertainty from using finitely many replicates; more replicates shrink it, and reported simulation numbers carry it
bias / efficiency	how far an estimator is off on average (bias) and how variable it is (efficiency) — robust methods trade a little efficiency under clean data for resistance under contamination

Ladder move. A simulation study assumes the DGP you specify; it resamples by generating many datasets from that DGP; it protects against judging a method on one lucky (or unlucky) sample; it cannot prove how a method behaves under a DGP you did not simulate. The lesson across the locked results: on Normal data all methods hold level and the \(t\)-test is slightly best, but under skew, heavy tails, or contamination the assumption-light methods hold their level and gain power where the parametric method quietly fails — match the method to the data-generating reality, not to “which test is more correct.”
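
A simulation study of this kind can be sketched in a few lines of base R. The settings below (sample size 30 per group, a shift of 1, 2000 replicates, the seed) are assumptions chosen for illustration, not the course's locked settings, so the rates it produces will not match the glossary's numbers.

```r
set.seed(3)  # arbitrary seed, only so this sketch is reproducible

# One replicate from a heavy-tailed t_3 DGP with a location shift `delta`:
# returns whether each test rejects at the nominal 0.05 level
one_rep <- function(n, delta) {
  x <- rt(n, df = 3)
  y <- rt(n, df = 3) + delta
  c(t        = t.test(x, y)$p.value < 0.05,
    wilcoxon = wilcox.test(x, y)$p.value < 0.05)
}

reps <- 2000

# Power: rejection rate when the effect is real (delta = 1)
power <- rowMeans(replicate(reps, one_rep(n = 30, delta = 1)))

# Type I error: rejection rate when the null is true (delta = 0)
type1 <- rowMeans(replicate(reps, one_rep(n = 30, delta = 0)))

# Monte Carlo standard error of each estimated rate
mc_se_power <- sqrt(power * (1 - power) / reps)

round(rbind(power, mc_se_power, type1), 3)
```

Each reported rate is itself an estimate. A difference between two methods is only worth reading when it is large compared with the Monte Carlo standard error, and the conclusion applies to this DGP only.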

This page is a study reference. For graded specifics — deadlines, submissions, and policies — Blackboard (the LMS) is authoritative.