Methods glossary
The vocabulary used across the course
Keep this page open while you read the notes. One discipline runs down every table: the assumption ladder. For each method, you should always be able to say four things: what it assumes, what it resamples / ranks / downweights, what it protects against, and what it cannot prove. Assumption-light is never assumption-free: a permutation test still assumes exchangeability, a bootstrap interval is a procedure with failure cases, and a robust estimator buys resistance by spending efficiency. Name the trade every time. All numeric values come from the synthetic Riverside Wellness Program datasets (W, S, L, D) and are provisional pending review.
Empirical distributions, order statistics & ranks
The data’s own distribution is the engine behind both ranks and the bootstrap. You read it instead of assuming a normal curve.
| Term | Meaning |
|---|---|
| empirical distribution / ECDF | the data’s own distribution, \(\hat F_n(x) = \frac{1}{n}\sum_i \mathbf{1}\{x_i \le x\}\) — the fraction of the sample at or below \(x\); the engine behind both ranks and the bootstrap |
| order statistics | the sorted values \(x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}\); the building blocks of quantiles and the median |
| quantile | the value below which a given fraction of the data falls; the \(0.5\) quantile is the median, \(0.25\) and \(0.75\) are the quartiles |
| median \(\tilde x\) | the middle order statistic (the average of the two middle ones when \(n\) is even), a resistant center; on Dataset W the Standard median is \(18\) min while the mean \(\approx 22\) min is dragged up by two long waits (\(\approx 64\), \(88\)) |
| rank \(R_i\) | an observation’s position \(1, 2, \dots, n\) after sorting; replacing values by ranks discards the spacing but keeps the ordering |
| mid-rank | the average rank assigned to tied observations, so tied values share their ranks evenly (e.g. two values tied for ranks \(4\) and \(5\) each get \(4.5\)) |
Ladder move. Ranks and the ECDF assume only that you can order the data; they protect against a heavy tail or a single huge value distorting a summary; they cannot prove anything about the spacing you threw away — a rank gap of one can hide a tiny or an enormous gap in the raw scale.
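As a minimal sketch of the terms in this table, the base-R calls below read the ECDF, the order statistics, a quantile, and mid-ranks off a small vector. The values in `waits` are hypothetical and are not Dataset W.

```r
# Hypothetical wait times in minutes (illustrative only, not Dataset W)
waits <- c(12, 15, 15, 18, 22, 64)

Fn <- ecdf(waits)          # empirical distribution function
Fn(15)                     # fraction of the sample at or below 15

sort(waits)                # order statistics x_(1) <= ... <= x_(n)
median(waits)              # n is even, so the two middle values are averaged
quantile(waits, 0.75)      # upper quartile (R's default interpolation rule)

ranks <- rank(waits, ties.method = "average")  # mid-ranks for the tied 15s
ranks

# Ranks keep the ordering but discard spacing: 18 -> 22 and 22 -> 64
# are both a rank gap of one
diff(ranks[4:6])
```

The last line is the "cannot prove" part of the ladder in miniature: a raw gap of 4 and a raw gap of 42 both become a rank gap of one.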
Permutation & randomization
Both build a reference distribution by shuffling labels. The difference is what the shuffle stands for — exchangeability (a modelling assumption) versus the assignment mechanism (a design fact).
| Term | Meaning |
|---|---|
| exchangeability | the null assumption that, were there no group effect, the group labels are interchangeable — any relabeling is equally likely |
| permutation test | pool the data, repeatedly shuffle the group labels under exchangeability, recompute the statistic; on Dataset W shuffle the \(50\) labels \(\approx 10{,}000\) times and recompute the median difference |
| randomization test | the same shuffle machinery, but justified by how treatment was actually assigned; if Express was randomly assigned, the reshuffle mimics the assignment mechanism and licenses a causal reading |
| reference (null) distribution | the distribution of the statistic across all the shuffles — centered at \(0\) for the median difference; the observed \(-6\) min sits in its tail |
| permutation / randomization \(p\)-value | the tail fraction of shuffles at least as extreme as the observed statistic; on Dataset W, two-sided \(p \approx 0.02\) |
| test statistic | the one number summarizing the effect (here the difference in medians, \(12 - 18 = -6\) min) that the shuffle distribution is built around |
Ladder move. A permutation test assumes exchangeability under the null; it resamples by reshuffling labels; it protects against needing a normality or large-sample assumption; it cannot prove causation unless a randomization (design-based) justification supplies it. Permuting the wrong thing — shuffling labels when the units are paired or dependent — silently breaks the test.
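The shuffle machinery can be sketched in a few lines of base R. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the function name `perm_test_median`, the simulated inputs, and the `+1` adjustment in the \(p\)-value (a common convention that counts the observed labeling as one of the shuffles) are all choices made here.

```r
# Two-sample permutation test for a difference in medians.
# Assumes exchangeable labels under the null and independent (unpaired) units.
perm_test_median <- function(x, y, n_shuffles = 10000) {
  pooled   <- c(x, y)
  n_x      <- length(x)
  observed <- median(x) - median(y)

  null_stats <- replicate(n_shuffles, {
    idx <- sample(length(pooled), n_x)          # reshuffle the group labels
    median(pooled[idx]) - median(pooled[-idx])
  })

  # Two-sided tail fraction; the +1 counts the observed labeling as one shuffle
  p_value <- (sum(abs(null_stats) >= abs(observed)) + 1) / (n_shuffles + 1)
  list(observed = observed, null = null_stats, p_value = p_value)
}

set.seed(1)
# Hypothetical skewed wait times (simulated, not Dataset W)
express  <- rexp(25, rate = 1 / 12)
standard <- rexp(25, rate = 1 / 18)

res <- perm_test_median(express, standard, n_shuffles = 2000)
res$observed
res$p_value
```

For paired data this label shuffle would be the wrong permutation; the matching move there is to flip the sign of each within-pair difference.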
The bootstrap
Resampling with replacement samples from \(\hat F_n\) to approximate sampling variability. It is a procedure, not a guarantee, and it has named failure cases.
| Term | Meaning |
|---|---|
| resample with replacement | draw \(n\) values from the data with replacement (i.e. sample from \(\hat F_n\)); some original values appear twice, some not at all |
| bootstrap distribution | the spread of a statistic across many resamples; for the Express median it is lumpy/discrete, taking only a few distinct order-statistic values |
| bootstrap SE | the standard deviation of the statistic across resamples; on Dataset W the Express-median SE \(\approx 1.2\) min and the difference-in-medians SE \(\approx 2.0\) min |
| percentile interval | a CI read straight off the resample quantiles; the \(95\%\) percentile CI for the difference in medians is \(\approx (-10, -2)\) min — it excludes \(0\) |
| basic interval | a CI that reflects the percentile interval about the observed estimate; can disagree with the percentile interval under skew |
| BCa interval | bias-corrected and accelerated — adjusts the percentile endpoints for bias and skew; the preferred interval when the bootstrap distribution is asymmetric |
| failure case | the bootstrap understates uncertainty for extreme order statistics (e.g. the sample maximum wait — it can never resample beyond what was seen), and is unreliable for very small \(n\) or dependent data |
Ladder move. The bootstrap assumes the sample represents the population and (for the plain version) that observations are independent; it resamples rows to mimic sampling variability; it protects against needing a closed-form SE formula; it cannot prove the interval is valid when the statistic depends on the tail (the maximum) or when rows are dependent. A bootstrap interval is not “model-free truth.”
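A minimal base-R sketch of the plain (independent-rows) bootstrap follows, covering the SE, the percentile and basic intervals, and the sample-maximum failure case. The data are simulated for illustration and are not Dataset W; the number of resamples is an arbitrary choice.

```r
set.seed(2)
# Hypothetical skewed sample (simulated, not Dataset W)
x <- rexp(30, rate = 1 / 12)
n_boot <- 2000

# Each resample draws n values with replacement, i.e. samples from the ECDF
boot_median <- replicate(n_boot, median(sample(x, replace = TRUE)))
boot_max    <- replicate(n_boot, max(sample(x, replace = TRUE)))

se_median <- sd(boot_median)                            # bootstrap SE
pct_ci    <- quantile(boot_median, c(0.025, 0.975))     # percentile interval
basic_ci  <- 2 * median(x) - rev(unname(pct_ci))        # basic (reflected) interval

# Lumpiness: the resampled median takes only a limited set of distinct values
length(unique(boot_median))

# Failure case: no resample can exceed the observed maximum
max(boot_max) <= max(x)
mean(boot_max == max(x))   # share of resamples that just return the observed max
```

With distinct values, a resample contains the observed maximum with probability \(1 - (1 - 1/n)^n\), so a large share of the "bootstrap distribution" of the maximum is a single spike at the value already seen.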
Rank-based tests
Ranks turn a distributional question into a counting/ordering question. The three one-sample methods form a clean ladder; the two-sample method reads as a stochastic shift, not a mean difference.
| Term | Meaning |
|---|---|
| sign test | uses only the signs of paired differences; on Dataset S, \(11\) of \(14\) nonzero differences are positive, counted against Binomial\((14, 0.5)\), two-sided \(p \approx 0.057\) |
| Wilcoxon signed-rank \(W^+\) | ranks the magnitudes \(|d_i|\) and sums the positive ranks; on Dataset S, \(W^+\) is large, \(p \approx 0.02\) — sharper than the sign test because it uses magnitude and sign |
| symmetry assumption | the signed-rank test assumes the paired differences are symmetric about the median — stronger than the sign test, weaker than normality |
| rank-sum / Mann–Whitney \(U\) | pool two groups, rank all values, compare rank totals; on Dataset W, \(p \approx 0.01\); reads as a stochastic shift (one group tending toward smaller values), not a difference in means |
| probability of superiority \(P(X<Y)\) | the chance a random value from one group beats a random value from the other; on Dataset W, \(\hat P(\text{Express} < \text{Standard}) \approx 0.72\) (an Express wait is usually shorter) |
| ties / mid-ranks | tied values share mid-ranks in the rank-sum computation, so ties do not inflate the statistic |
Ladder move. The one-sample ladder is sign test \(\subset\) signed-rank \(\subset\) paired \(t\)-test: each step assumes more (signs only → symmetric differences → normality) and, when its assumption holds, gains power. The rank-sum protects against skew and outliers but cannot prove a difference in means — reading \(P(X<Y)\) as a mean gap is the classic error.
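The counting behind these tests is short enough to sketch in base R. The sign-test line uses the counts quoted in the table (\(11\) of \(14\)); the paired differences `d` and the two groups `x` and `y` are made-up values for illustration and are not Datasets S or W.

```r
# Sign test with the counts quoted in the table: 11 positive out of 14 nonzero
sign_p <- binom.test(11, 14, p = 0.5)$p.value
sign_p                                   # two-sided, about 0.057

# Signed-rank statistic W+ by hand on hypothetical paired differences
d <- c(3.1, -0.8, 2.4, 5.0, -1.2, 4.3, 0.9, 6.7)
w_plus <- sum(rank(abs(d))[d > 0])       # rank the magnitudes, sum positive ranks
w_plus == unname(wilcox.test(d)$statistic)   # agrees with R's V statistic

# Probability of superiority on hypothetical two-group data
x <- c(9, 11, 12, 14, 20)
y <- c(10, 15, 17, 18, 25, 31)
p_sup <- mean(outer(x, y, "<")) + 0.5 * mean(outer(x, y, "=="))  # ties count half

# Same number from the rank-sum statistic: R's W counts the pairs with x > y
w <- unname(wilcox.test(x, y)$statistic)
c(p_sup, 1 - w / (length(x) * length(y)))
```

The last line shows why the rank-sum test reads as \(P(X<Y)\): the statistic is a count of ordered pairs, and nothing in it measures a difference in means.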
Ordinal & categorical outcomes
Respect the measurement scale. Ordered categories carry information that a nominal test throws away, and that you destroy if you average the labels.
| Term | Meaning |
|---|---|
| ordinal vs nominal | ordinal categories have a meaningful order (\(1\)–\(5\) satisfaction); nominal categories do not (blood type) — the order is information you keep or discard |
| mid-ranks (ordinal) | rank the ordinal scores and average tied ranks within each category, so the rank-based test respects the ordering |
| averaging the labels (a trap) | the mean of numeric codes (Express \(\approx 4.12\), Standard \(\approx 3.38\) on Dataset L) treats the labels as equally spaced — is the step \(1\to2\) the same “amount” as \(4\to5\)? |
| nominal \(\chi^2\) test | a chi-square test of independence treats categories as unordered; on Dataset L, \(\chi^2 \approx 9.9\) on \(4\) df, \(p \approx 0.04\) — it throws away the ordering |
| ordinal trend / rank test | uses the ordering and is more powerful here; on Dataset L the rank-based \(p \approx 0.01\), with \(P(\text{Express} > \text{Standard}) \approx 0.66\) |
| median category | the resistant ordinal center; Express median category \(= 4\), Standard \(= 3\) |
Ladder move. A trend/rank test assumes the categories are ordered; it ranks the scores (mid-ranks for ties); it protects against the power loss of pretending the categories are unordered; it cannot prove the distances between adjacent categories are equal — which is exactly why you do not average the labels.
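The contrast between a nominal and an ordinal analysis can be sketched on a small two-by-five table. The counts below are hypothetical and are not Dataset L, and counting ties as one half in the probability of superiority is an assumption made for this sketch.

```r
# Hypothetical counts of 1-5 satisfaction ratings (illustrative, not Dataset L)
tab <- rbind(Express  = c(4,  8, 16, 36, 36),
             Standard = c(10, 18, 28, 28, 16))
colnames(tab) <- 1:5

# Nominal test: ignores the ordering of the five categories
chi <- chisq.test(tab)
chi$parameter            # degrees of freedom: (2 - 1) * (5 - 1)

# Rank-based test: expand counts to scores; ties get mid-ranks automatically
express  <- rep(1:5, times = tab["Express", ])
standard <- rep(1:5, times = tab["Standard", ])
rank_test <- wilcox.test(express, standard, exact = FALSE)

# Probability of superiority, counting ties as one half (an assumption here)
p_sup <- mean(outer(express, standard, ">")) +
  0.5 * mean(outer(express, standard, "=="))

# Resistant ordinal centers: the median category in each group
c(median(express), median(standard))

c(chisq_p = chi$p.value, rank_p = rank_test$p.value, p_sup = p_sup)
```

Nothing in this sketch takes a mean of the codes \(1\)–\(5\); the rank test and the median category use only their order.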
Robustness, outliers & influence
Robust summaries resist contamination; robust regression recovers structure that one bad point would otherwise distort. The price is efficiency, and the discipline is to investigate, not auto-delete.
| Term | Meaning |
|---|---|
| trimmed mean \(\bar x_\alpha\) | the mean after dropping the top and bottom \(\alpha\) fraction; on Dataset D the \(10\%\) trimmed mean of the gain is \(8.3\) vs the ordinary mean \(11\) |
| MAD | median absolute deviation, a resistant spread; on Dataset D the MAD-based SD \(\approx 5\) vs the ordinary SD \(9\) (inflated by a \(+40\) point) |
| IQR | the interquartile range \(Q_3 - Q_1\), a resistant spread and the basis of the boxplot outlier rule |
| breakdown point | the fraction of contamination an estimator tolerates before it can be driven arbitrarily wrong — \(0\) for the mean, \(\approx 0.5\) for the median |
| influence | how much one observation moves the estimate; a point with high influence pulls the fit toward itself |
| outlier | an unusual \(y\) value (a vertical outlier — e.g. Dataset D’s gain \(= 40\) at sessions \(= 5\)) |
| high leverage | an unusual \(x\) value (e.g. Dataset D’s point at sessions \(= 20\) with gain \(= 2\), at the edge of the \(x\) range) |
| OLS (least squares) | minimizes \(\sum r_i^2\), so one far point dominates; on Dataset D the leverage point flattens the slope to \(\approx 0.6\) vs the clean \(\approx 1.5\) |
| L1 / least absolute deviations | minimizes \(\sum |r_i|\) — more resistant to vertical outliers; recovers slope \(\approx 1.5\) on Dataset D |
| Theil–Sen | the median of all pairwise slopes — high breakdown; slope \(\approx 1.45\) on Dataset D |
| Huber M-estimator | downweights large residuals via a hybrid squared/absolute loss; slope \(\approx 1.4\) on Dataset D |
Ladder move. A robust summary assumes most of the data are clean; it downweights or trims the contaminating minority; it protects against a few outliers or one leverage point hijacking the answer; it cannot prove a flagged point is an error. Whether a point is flagged by the boxplot rule, \(|z| > 3\), leverage, or Cook’s distance, you investigate, you do not auto-delete. The trade is efficiency: under clean data the robust fit is slightly less precise than OLS.
A short static idiom shows the contrast on Dataset D (synthetic; seed set; not executed here):
```r
set.seed(45203)  # lqs() can search random subsamples, so the seed matters
# Dataset D: wellbeing gain vs sessions, two contaminating points.
# `gain` and `sessions` are assumed to be numeric vectors already loaded.
mean(gain)                                    # 11 (inflated by the +40 vertical outlier)
median(gain)                                  # 8 (resistant)
mean(gain, trim = 0.10)                       # 8.3 (10% trimmed mean)
mad(gain)                                     # ~5 (MAD-based spread, vs SD ~9)
coef(lm(gain ~ sessions))["sessions"]         # ~0.6 (OLS, distorted)
coef(MASS::rlm(gain ~ sessions))["sessions"]  # ~1.4 (Huber M-estimator)
coef(MASS::lqs(gain ~ sessions))["sessions"]  # ~1.45 (least trimmed squares; a resistant stand-in for Theil-Sen)
# robust fits recover the clean slope ~1.5; OLS does not
```

The OLS slope \(\approx 0.6\) is the leverage point flattening the line; the robust slopes \(\approx 1.4\)–\(1.45\) recover the clean structure. What is assumed is that the bulk of the points follow the linear trend; what is downweighted is the contaminating minority; what cannot be proven from the fit alone is whether those two points are data-entry errors or real extreme responders.
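The glossary defines Theil–Sen as the median of all pairwise slopes, and that definition can be coded directly in base R. The sketch below is illustrative only: `theil_sen_slope` is a name chosen here, and the data are a made-up clean trend of slope \(1.5\) with one high-leverage point, not Dataset D.

```r
# Theil-Sen slope: the median of all pairwise slopes
theil_sen_slope <- function(x, y) {
  pairs <- combn(length(x), 2)                  # all index pairs i < j
  dx    <- x[pairs[2, ]] - x[pairs[1, ]]
  dy    <- y[pairs[2, ]] - y[pairs[1, ]]
  median(dy[dx != 0] / dx[dx != 0])             # skip pairs with equal x
}

# Hypothetical data (not Dataset D): slope 1.5 plus small fixed noise,
# and one high-leverage point at sessions = 20 with a low gain
noise    <- c(0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1)
sessions <- c(1:10, 20)
gain     <- c(1.5 * (1:10) + noise, 2)

ols_slope <- unname(coef(lm(gain ~ sessions))["sessions"])
ts_slope  <- theil_sen_slope(sessions, gain)
c(OLS = ols_slope, TheilSen = ts_slope)
```

Only the ten pairs involving the leverage point give distorted slopes, so they sit in the lower tail of the \(55\) pairwise slopes and the median stays with the clean majority.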
Method comparison & simulation
You judge a method not by a single dataset but by how it behaves across many simulated datasets from a known data-generating process. No method wins everywhere.
| Term | Meaning |
|---|---|
| data-generating process (DGP) | the known mechanism a simulation draws from (Normal, right-skewed lognormal, heavy-tailed \(t_3\), or contaminated with \(5\%\) outliers) — the “truth” you can check methods against |
| Type I error | the rate of rejecting a true null; the nominal level is \(0.05\), and you check whether the actual rate matches |
| power | the rate of detecting a real effect; on heavy-tailed \(t_3\) data, simulated power is \(\approx 0.55\) for the \(t\)-test vs \(\approx 0.70\) for the Wilcoxon test |
| coverage | the rate at which a CI contains the true value; on right-skewed data the \(t\)-based CI under-covers (coverage \(\approx 0.91\) vs nominal \(0.95\)); under \(5\%\) contamination the mean-based CI covers \(\approx 0.86\) vs the trimmed-mean CI \(\approx 0.94\) |
| nominal vs actual | the target level/coverage vs what the method actually delivers in simulation — a method that doesn’t hold its nominal level is not trustworthy |
| Monte Carlo error | the simulation’s own uncertainty from using finitely many replicates; more replicates shrink it, and reported simulation numbers carry it |
| bias / efficiency | how far an estimator is off on average (bias) and how little it varies from sample to sample (efficiency: lower variance means higher efficiency); robust methods trade a little efficiency under clean data for resistance under contamination |
Ladder move. A simulation study assumes the DGP you specify; it resamples by generating many datasets from that DGP; it protects against judging a method on one lucky (or unlucky) sample; it cannot prove how a method behaves under a DGP you did not simulate. The lesson across the locked results: on Normal data all methods hold level and the \(t\)-test is slightly best, but under skew, heavy tails, or contamination the assumption-light methods hold their level and gain power where the parametric method quietly fails — match the method to the data-generating reality, not to “which test is more correct.”
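A small simulation of this kind can be sketched in base R. Everything specific below is an assumption chosen for illustration (the helper `reject_rate`, the sample size, the shift of \(1\), the number of replicates, and the use of R's default Welch \(t\)-test), so the rates it prints are not the course's reported results.

```r
set.seed(3)

# Rejection rate of the two-sample t-test (Welch, R's default) and the
# Wilcoxon rank-sum test when both groups come from `rdist` and the
# second group is shifted by `shift`
reject_rate <- function(rdist, shift, n = 20, n_rep = 2000, alpha = 0.05) {
  rejections <- replicate(n_rep, {
    x <- rdist(n)
    y <- rdist(n) + shift
    c(t        = t.test(x, y)$p.value < alpha,
      wilcoxon = wilcox.test(x, y)$p.value < alpha)
  })
  rate <- rowMeans(rejections)
  rbind(rate = rate, mc_se = sqrt(rate * (1 - rate) / n_rep))  # Monte Carlo error
}

rt3 <- function(n) rt(n, df = 3)   # heavy-tailed DGP

level_normal <- reject_rate(rnorm, shift = 0)   # Type I error under a true null
level_t3     <- reject_rate(rt3,   shift = 0)
power_t3     <- reject_rate(rt3,   shift = 1)   # power against a shift of 1

level_normal
level_t3
power_t3
```

Each rate is reported next to its Monte Carlo standard error, so a gap between two methods can be judged against the simulation's own noise before it is read as a real difference.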
This page is a study reference. For graded specifics — deadlines, submissions, and policies — Blackboard (the LMS) is authoritative.