Robustness & outliers guide

Resistant summaries, the breakdown point, and the do-not-auto-delete rule

Keep this page open while you work through the robust-methods weeks (10–11) and the report workshop. The single discipline that runs through every table below is the assumption ladder: a resistant summary still assumes something; it just assumes less about the tail of the distribution. Robust does not mean assumption-free. It means a single bad point cannot run the answer. Every numeric value here comes from the synthetic recurring study Dataset D (wellbeing gain vs sessions attended, \(n = 40\), two contaminating points) and is provisional pending review.

The contamination in Dataset D is fixed and worth holding in your head as you read: a clean linear structure, gain \(\approx 2 + 1.5 \cdot \text{sessions}\) with residual SD \(\approx 4\), is spoiled by exactly two points — a high-leverage data-entry-style point at sessions \(= 20\) with gain \(= 2\) (a “bad \(y\)” at the far edge of \(x\)), and a vertical outlier at sessions \(= 5\) with gain \(= 40\). The whole guide is about what those two points do to a summary or a fit, how you spot them, and what you are allowed to do about them.
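For readers who want something to experiment on, here is a minimal sketch of how a Dataset D-like sample could be simulated from the description above. It is an assumption, not the course's actual data: the attendance range (1 to 12 sessions), the seed, and the helper name `make_dataset_d` are hypothetical, and the summaries it yields will not reproduce the provisional values quoted in this guide.

```r
# Hypothetical generator for a Dataset D-like sample (not the course data).
make_dataset_d <- function(seed = 45203) {
  set.seed(seed)
  # 38 clean participants; the 1-12 session range is an assumption.
  sessions <- rep(1:12, length.out = 38)
  gain <- 2 + 1.5 * sessions + rnorm(38, mean = 0, sd = 4)
  # Append the two contaminating points described in the text.
  data.frame(
    sessions = c(sessions, 20, 5),
    gain     = c(gain, 2, 40),
    kind     = c(rep("clean", 38), "leverage", "vertical")
  )
}

d <- make_dataset_d()
table(d$kind)            # counts of clean vs planted points
d[d$kind != "clean", ]   # the two contaminating rows
```

Keeping a `kind` column is a convenience for experiments only; in real data nobody hands you that label, which is why the diagnostics later in the guide exist.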

Resistant summaries vs the mean and SD

A summary is resistant (robust) when moving or corrupting a small fraction of the data barely moves it. The mean and standard deviation are not resistant: each observation enters with full weight, so one extreme value drags both. The median, the trimmed mean, the MAD, and the IQR each cap how much any single point can contribute.

Summary	What it measures	Definition (sketch)	Resistant?	Dataset D value
mean \(\bar x\)	center	\(\frac{1}{n}\sum_i x_i\)	no: every point has full weight	gain mean \(= 11\)
median \(\tilde x\)	center	middle order statistic; for even \(n\), the mean of \(x_{(n/2)}\) and \(x_{(n/2+1)}\)	yes: depends on rank, not magnitude	gain median \(= 8\)
\(10\%\) trimmed mean \(\bar x_\alpha\)	center	mean after dropping the lowest and highest \(10\%\)	yes: discards the tails before averaging	gain trimmed mean \(= 8.3\)
standard deviation (SD)	spread	\(\sqrt{\frac{1}{n-1}\sum_i (x_i-\bar x)^2}\)	no: squares the deviations	gain SD \(= 9\)
MAD-based SD	spread	\(1.4826 \cdot \operatorname{median}_i |x_i - \tilde x|\)	yes: built from a median of deviations	gain MAD-SD \(\approx 5\)
IQR	spread	\(Q_3 - Q_1\)	yes: uses only the middle half	not reported

Read the two centers side by side. The gain mean \(= 11\) sits well above the median \(= 8\) because the single \(+40\) vertical outlier pulls the average up; the \(10\%\) trimmed mean \(= 8.3\) lands right next to the median because trimming removes that point before averaging. The gap between mean and median is itself a diagnostic for skew or contamination: when they disagree this much, ask which one your question actually wants. Assumption-ladder move: the median and trimmed mean assume only that “center” means a typical value. They rank or discard the extreme observations, which protects against a handful of corrupted points, but they cannot prove the \(+40\) is an error. They only refuse to let that one point set the headline.

The spread tells the same story. The ordinary SD \(= 9\) is inflated because squaring the deviation of the \(+40\) point makes that one term dominate the sum; the MAD-based SD \(\approx 5\) reports the spread of the bulk of the data because a median of absolute deviations ignores how far the worst point is. The \(1.4826\) factor rescales the MAD so that, under a normal model, it estimates the same \(\sigma\) as the SD; that constant is the one place the MAD quietly assumes a reference shape. Assumption-ladder move: the MAD-based SD assumes a typical-spread target and (for the scaling constant) a normal reference. It is built from ranked deviations, which protects against tail inflation, but it cannot prove the data are clean. It just reports the middle’s spread honestly.
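The center and spread comparison can be tried on a toy vector small enough to check by hand. This is a sketch on made-up numbers, not Dataset D: ten clean gains, and a second copy in which the largest value is replaced by 40. The helper name `summaries` is hypothetical.

```r
# Toy vector, not Dataset D: ten clean gains, then the largest replaced by 40.
clean <- c(3, 5, 6, 7, 8, 8, 9, 10, 11, 13)
dirty <- replace(clean, 10, 40)

summaries <- function(x) {
  c(mean    = mean(x),
    median  = median(x),
    trimmed = mean(x, trim = 0.10),  # drops one value from each end when n = 10
    sd      = sd(x),
    mad_sd  = mad(x),                # 1.4826 * median(|x - median(x)|)
    iqr     = IQR(x))
}

comparison <- rbind(clean = summaries(clean), dirty = summaries(dirty))
round(comparison, 2)
```

In this toy case the mean moves from 8 to 10.7 and the SD grows, while the median, trimmed mean, MAD-based SD, and IQR are identical in both rows, because the corrupted value only ever occupies the top rank.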

The breakdown point

The breakdown point of an estimator is the largest fraction of the data you can corrupt — push to \(\pm\infty\) — before the estimate itself can be driven arbitrarily far. It is the single cleanest number for “how many bad points can this summary survive.” A high breakdown point is exactly what “resistant” means, made precise.

Estimator	Breakdown point	Reading
mean \(\bar x\)	\(0\) (i.e. \(1/n \to 0\))	one point taken to \(\infty\) takes the mean with it
standard deviation	\(0\)	same — one point dominates the squared sum
\(10\%\) trimmed mean	\(\approx 0.10\)	survives corruption up to the trimmed fraction
median \(\tilde x\)	\(\approx 0.50\)	half the data must be corrupted before it breaks
MAD	\(\approx 0.50\)	a median of deviations inherits the median’s resistance

The mean’s breakdown point is \(0\): a single arbitrarily large value sends \(\bar x\) off to infinity, which is exactly why the gain mean is \(11\) and not \(8\). The median’s breakdown point is \(\approx 0.5\) — you would have to corrupt nearly half the \(40\) participants before the middle value could be forced anywhere you like. The \(10\%\) trimmed mean sits between them at \(\approx 0.10\): it tolerates contamination up to the fraction it trims and no further. Assumption-ladder move: the breakdown point assumes nothing about the distribution’s shape — it is a worst-case count, ranking estimators by how much arbitrary corruption they absorb. It protects your reasoning by naming the limit explicitly, but it cannot prove that real data are anywhere near the worst case — a low breakdown point is a warning, not a verdict that your particular mean is wrong.
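The worst-case reading of the breakdown point can be made concrete by corrupting more and more values of a toy sample and watching when each estimator gives way. This is a sketch on made-up numbers, not Dataset D; the helper `corrupt` is hypothetical, and `1e6` stands in for "arbitrarily far."

```r
# Toy illustration of breakdown: corrupt m of n = 10 values, then re-estimate.
x <- c(3, 5, 6, 7, 8, 8, 9, 10, 11, 13)

corrupt <- function(x, m, value = 1e6) {
  # Replace the m largest values with a huge number (a stand-in for infinity).
  y <- sort(x)
  if (m > 0) y[(length(y) - m + 1):length(y)] <- value
  y
}

m_values <- 0:5
result <- t(sapply(m_values, function(m) {
  y <- corrupt(x, m)
  c(m = m, mean = mean(y), trimmed = mean(y, trim = 0.10), median = median(y))
}))
result
```

With ten values, the mean is carried off by the first corrupted point, the 10% trimmed mean holds at one corrupted point and fails at two, and the median stays at 8 until five of the ten values (half the sample) are corrupted.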

A common confusion: a high breakdown point is not free. The median and trimmed mean throw information away, so on genuinely clean, symmetric data they are less efficient than the mean — wider confidence intervals for the same sample size. You buy resistance with efficiency. Name that trade every time you reach for a resistant summary; “assumption-light” is never “assumption-free.”
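The efficiency cost can be seen in a small simulation on clean data. The sketch below draws repeated normal samples of size 40 and compares how much the mean and the median vary from sample to sample. The number of replications, the seed, and the normal parameters are arbitrary choices for illustration; for normal data the large-sample value of the variance ratio is known to be \(2/\pi \approx 0.64\).

```r
# Simulation sketch: sampling variability of mean vs median on clean normal data.
set.seed(45203)   # any seed would do
n_rep <- 5000     # number of simulated studies (arbitrary)
n <- 40           # sample size matching Dataset D

samples <- matrix(rnorm(n_rep * n, mean = 8, sd = 4), nrow = n_rep)
means   <- rowMeans(samples)
medians <- apply(samples, 1, median)

# A ratio below 1 means the median is the noisier estimator on clean data.
efficiency <- var(means) / var(medians)
c(sd_of_means = sd(means), sd_of_medians = sd(medians), efficiency = efficiency)
```

The ratio is the price of resistance when nothing is wrong with the data: the median needs a larger sample to match the mean's precision.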

Outlier, high leverage, and influence — three different things

These three words are often used interchangeably and they should not be. They describe different positions in a regression of \(y\) on \(x\), they are detected by different tools, and a point can be one without being another. The two contaminating points in Dataset D were chosen precisely to separate the ideas.

Term	What is unusual	Detected by	Dataset D example
outlier	the response \(y\) (large residual)	boxplot rule; \(|z| > 3\) on residuals	the \(+40\) gain at sessions \(= 5\)
high leverage	the predictor \(x\) (far from \(\bar x\))	leverage \(h_{ii}\) (hat values)	the point at sessions \(= 20\)
influence	the fit moves if you remove it	Cook’s distance \(D_i\)	the leverage point flattening the slope

An outlier is a point with an unusual \(y\) for its \(x\), that is, a large residual. The boxplot rule flags a value beyond \(Q_1 - 1.5\,\text{IQR}\) or \(Q_3 + 1.5\,\text{IQR}\); the standardized-residual rule flags \(|z| > 3\), where \(z_i = r_i/\hat\sigma\) is the residual \(r_i\) scaled by the residual spread \(\hat\sigma\). (Software such as R’s rstandard() also divides by \(\sqrt{1 - h_{ii}}\), which matters mainly for high-leverage points.) The \(+40\) gain at sessions \(= 5\) is the textbook outlier: it sits far above the line, so its residual is huge, and both rules catch it. Assumption-ladder move: the boxplot and \(|z|\) rules assume a rough idea of “typical spread”: the IQR, or \(\hat\sigma\), which is itself non-resistant, so a severe outlier can hide other outliers by inflating it. They flag points by magnitude and protect you from missing a gross error, but they cannot prove a flagged point is wrong.

High leverage is about the predictor: a point with an unusual \(x\), far from \(\bar x\), has high leverage \(h_{ii}\) (its hat value), which measures how much its own \(y\) pulls its own fitted value. The data-entry-style point at sessions \(= 20\) sits at the far edge of the \(x\) range, so it has high leverage regardless of its \(y\). High leverage is potential influence, not actual influence: a high-leverage point that happens to fall on the line does no harm. A rough flag is \(h_{ii} > 2(p+1)/n\), where \(p\) is the number of predictors; with one predictor and \(n = 40\) the cutoff is \(0.1\). Assumption-ladder move: leverage assumes the linear model’s \(x\)-geometry and ranks points by their \(x\)-distance. It protects you by saying “this point could dominate,” but it cannot prove the point is bad, only that it is positioned to do damage.

Influence is the consequence: a point is influential when removing it changes the fit, so that the slope, intercept, or predictions move noticeably. Cook’s distance \(D_i\) combines residual size and leverage into one number: it is largest when a point is both off the line and out at the edge of \(x\), although a big enough residual can raise it even at modest leverage. In Dataset D the sessions \(= 20\) leverage point is the influential one: it has the leverage to pull the line and a \(y\) off the clean trend, so it flattens the slope. A common flag is \(D_i > 4/n\) (here \(0.1\)). Assumption-ladder move: Cook’s distance assumes the fitted model and measures the actual movement of the fit under deletion. It protects you from quietly trusting a slope that one point set, but it cannot prove the point is an error. It tells you the answer depends on that point, which is the cue to investigate, not to delete.

The one-line summary: outlier = unusual \(y\); high leverage = unusual \(x\); influence = the fit moves. A point can be a high-leverage point with no influence (it lands on the line) or an outlier with little influence (it is in the middle of the \(x\) range, so it tugs but cannot lever). Read all three before deciding anything.
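The three diagnostics can be read side by side for the two planted points. The sketch below runs on a simulated Dataset D-like sample, which is an assumption (the 1 to 12 attendance range and the seed are hypothetical), so its numbers will differ from the provisional values in this guide. It uses only base R: `rstandard()`, `hatvalues()`, and `cooks.distance()`.

```r
# Hypothetical Dataset D-like sample (attendance range and seed are assumptions).
set.seed(45203)
sessions <- c(rep(1:12, length.out = 38), 20, 5)
gain <- c(2 + 1.5 * sessions[1:38] + rnorm(38, sd = 4), 2, 40)

fit <- lm(gain ~ sessions)
n <- length(gain)
p <- 1  # one predictor

diag_table <- data.frame(
  sessions  = sessions,
  gain      = round(gain, 1),
  std_resid = round(rstandard(fit), 2),       # unusual y
  leverage  = round(hatvalues(fit), 3),       # unusual x
  cooks_d   = round(cooks.distance(fit), 3)   # fit moves under deletion
)
diag_table$flag_leverage <- hatvalues(fit) > 2 * (p + 1) / n
diag_table$flag_cooks    <- cooks.distance(fit) > 4 / n

# Rows 39 and 40 are the two planted points; the rest are clean.
diag_table[c(39, 40), ]
```

Row 39 (sessions = 20) is the only point far enough out in \(x\) to cross the leverage cutoff. Whether the vertical outlier in row 40 also crosses the Cook's distance cutoff depends on the simulated noise, so print the table rather than assume.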

Robust regression: OLS vs L1, Theil–Sen, Huber

When the summary comparison above moves to a fit, the same resistance question returns: does one point set the slope? Ordinary least squares (OLS) minimizes \(\sum_i r_i^2\), the sum of squared residuals. Squaring means a single far point contributes enormously, so the line bends toward it, and the breakdown point of OLS is \(0\). In Dataset D this is dramatic: the clean OLS slope \(\approx 1.5\) collapses to an OLS slope \(\approx 0.6\) once the high-leverage point at sessions \(= 20\) is included, because that point levers the line flatter. OLS assumes roughly constant-variance, outlier-free errors; when that holds it is efficient, which is why it remains the default, but it has no defense against contamination.
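A worked comparison, using only the numbers already given for Dataset D, shows how heavily squaring weights the leverage point. As a simplification, deviations are measured from the clean line rather than from the fitted one: the clean line predicts \(2 + 1.5 \cdot 20 = 32\) at sessions \(= 20\), so the recorded gain of \(2\) sits \(30\) units below it, while a typical clean point is about one residual SD (\(4\) units) away.

\[
\frac{r_{\text{lev}}^2}{r_{\text{typ}}^2} = \frac{(2 - 32)^2}{4^2} = \frac{900}{16} \approx 56
\]

In the squared-error sum that one point contributes about as much as \(56\) typical points, before its position at the edge of \(x\) is even taken into account.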

Least absolute deviations (LAD, also called L1 or median regression, the \(\tau = 0.5\) case of quantile regression) minimizes \(\sum_i |r_i|\) instead of \(\sum_i r_i^2\). Because it does not square, a far residual no longer dominates, and the fit tracks the bulk of the points. On Dataset D the L1 slope \(\approx 1.5\), recovering the clean structure. L1 is the regression analogue of the median: resistant to vertical outliers, though still vulnerable in general to a high-leverage point that pulls in \(x\).
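One way to see what minimizing \(\sum_i |r_i|\) does is to approximate it with repeated weighted least squares, giving each point weight \(1/|r_i|\) so that its squared residual counts like an absolute one. The sketch below is a base-R illustration under that assumption, not a production solver (quantile-regression software such as `quantreg::rq` solves the problem exactly); the helper `lad_irls`, the iteration count, and the floor `eps` are hypothetical choices, and the data are a toy set, not Dataset D.

```r
# Sketch: least absolute deviations via iteratively reweighted least squares.
# lad_irls is a hypothetical helper, not a library function.
lad_irls <- function(x, y, iters = 100, eps = 1e-6) {
  fit <- lm(y ~ x)                   # start from OLS
  for (i in seq_len(iters)) {
    r <- residuals(fit)
    w <- 1 / pmax(abs(r), eps)       # w * r^2 = |r|; eps avoids division by zero
    fit <- lm(y ~ x, weights = w)
  }
  coef(fit)
}

# Toy data, not Dataset D: six points exactly on gain = 2 + 1.5 * sessions,
# plus one bad y at the far edge of x.
x <- 1:7
y <- 2 + 1.5 * x
y[7] <- 2

ols_slope <- unname(coef(lm(y ~ x))[2])
lad_slope <- unname(lad_irls(x, y)[2])
c(ols = ols_slope, lad = lad_slope)
```

The toy is deliberately easy: six of the seven points are exactly collinear, so the absolute-loss fit settles on their line while OLS is pulled well below it. With noisier data and a strong leverage point, L1 can still be dragged.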

Theil–Sen takes the median of all pairwise slopes \(\frac{y_j - y_i}{x_j - x_i}\) over pairs of points with distinct \(x\). Using a median of slopes gives it a breakdown point of about \(29\%\) and a clean, rank-flavored interpretation. On Dataset D the Theil–Sen slope \(\approx 1.45\), essentially the clean value: each contaminating point enters \(39\) of the \(\binom{40}{2} = 780\) pairs, so at most \(77\) slopes are affected and the clean majority outvotes them. It assumes a roughly linear relationship and trades some efficiency for its resistance.
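The median-of-pairwise-slopes idea is short enough to write out directly. The sketch below is a base-R illustration on a toy data set, not Dataset D; the helper `theil_sen` is hypothetical, and taking the intercept as the median of \(y_i - \text{slope} \cdot x_i\) is one common convention assumed here.

```r
# Sketch: Theil-Sen slope as the median of all pairwise slopes (base R only).
# theil_sen is a hypothetical helper, not a library function.
theil_sen <- function(x, y) {
  pairs  <- combn(length(x), 2)          # 2 x choose(n, 2) matrix of index pairs
  dx     <- x[pairs[2, ]] - x[pairs[1, ]]
  dy     <- y[pairs[2, ]] - y[pairs[1, ]]
  slopes <- dy[dx != 0] / dx[dx != 0]    # pairs with tied x have no slope
  slope  <- median(slopes)
  c(intercept = median(y - slope * x), slope = slope)
}

# Toy data, not Dataset D: six points on gain = 2 + 1.5 * sessions, one bad y.
x <- 1:7
y <- 2 + 1.5 * x
y[7] <- 2

ts  <- theil_sen(x, y)
ols <- coef(lm(y ~ x))
rbind(theil_sen = ts, ols = unname(ols))
```

Of the 21 pairs, 15 involve only clean points and have slope exactly 1.5, so the median lands on 1.5 and the six slopes through the bad point are outvoted. OLS on the same seven points gives a slope of 0.375.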

Huber M-estimation keeps the least-squares idea but downweights large residuals: small residuals are treated quadratically (efficient, like OLS), large ones linearly (resistant, like L1), with a tuning constant setting the crossover. It is a deliberate compromise — nearly as efficient as OLS on clean data, far more resistant on contaminated data. On Dataset D the Huber slope \(\approx 1.4\), again close to the clean \(1.5\). M-estimation guards well against \(y\)-outliers; for high-leverage points you reach for high-breakdown variants.
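The downweighting can be made visible with a hand-rolled version of the fit. The sketch below approximates a Huber M-estimate by iteratively reweighted least squares in base R. It is an illustration under stated assumptions, not the reference implementation (`MASS::rlm` is the usual tool): the helper `huber_irls`, the fixed iteration count, the tuning constant \(k = 1.345\), and the simulated Dataset D-like sample are all choices made here.

```r
# Sketch: Huber M-estimation by iteratively reweighted least squares (base R).
# huber_irls is a hypothetical helper, not a library function.
huber_irls <- function(x, y, k = 1.345, iters = 50) {
  fit <- lm(y ~ x)                       # start from OLS
  for (i in seq_len(iters)) {
    r <- residuals(fit)
    s <- mad(r, center = 0)              # resistant residual scale
    u <- abs(r) / s
    w <- ifelse(u <= k, 1, k / u)        # full weight inside k, shrinking outside
    fit <- lm(y ~ x, weights = w)
  }
  list(coef = coef(fit), weights = w)
}

# Hypothetical Dataset D-like sample (attendance range and seed are assumptions).
set.seed(45203)
sessions <- c(rep(1:12, length.out = 38), 20, 5)
gain <- c(2 + 1.5 * sessions[1:38] + rnorm(38, sd = 4), 2, 40)

ols_slope   <- unname(coef(lm(gain ~ sessions))[2])
huber       <- huber_irls(sessions, gain)
huber_slope <- unname(huber$coef[2])

c(ols = ols_slope, huber = huber_slope)
round(huber$weights[c(39, 40)], 2)   # weights given to the two planted points
```

Both planted points end up with weights below 1, which is the "large residuals treated linearly" rule in action. The weights shrink but never reach zero, which is why a strong leverage point still tugs a Huber fit somewhat.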

Fit	Minimizes / uses	Resists	Dataset D slope
clean OLS (no contamination)	\(\sum_i r_i^2\)	nothing — reference value	\(\approx 1.5\)
OLS (with contamination)	\(\sum_i r_i^2\)	nothing	\(\approx 0.6\)
L1 / LAD	\(\sum_i |r_i|\)	vertical outliers	\(\approx 1.5\)
Theil–Sen	median of pairwise slopes	\(y\)-outliers (high breakdown)	\(\approx 1.45\)
Huber M-estimator	quadratic-then-linear loss	large residuals (downweighted)	\(\approx 1.4\)

The contrast is the whole lesson: OLS \(0.6\) vs robust \(\approx 1.4\)–\(1.5\). The robust fits agree with each other and with the clean slope, while OLS alone reports a badly attenuated \(0.6\). Assumption-ladder move: the robust fits assume a linear relationship over the clean data and downweight or out-vote the contaminating points. They protect the slope from a single leverage point, but they cannot prove the clean structure is the true one. If the two odd points are real signal (a genuine ceiling at high attendance, a genuine super-responder), the robust fit is now ignoring the most interesting data. Resistance and signal can look identical; only investigation tells them apart, which is the next section.

Investigate, do not auto-delete

The detection tools above tell you a point is unusual. They do not tell you to remove it. Deleting flagged points by reflex is one of the classic errors of applied analysis: it manufactures clean-looking results, hides the very structure that mattered, and is not reproducible. The rule is investigate, do not auto-delete. Work through this short decision list before you change a single row.

Is it a data-entry or measurement error? Check the source record. A sessions \(= 20\) point with gain \(= 2\) may be a transposition, a unit slip, or a miscoded missing value. A confirmed error can be corrected or removed — and you say so, with the reason, in the writeup.
Is it a real but extreme observation? A genuine super-responder (\(+40\) gain) or a genuine ceiling case is data, not noise. You do not delete real data because it is inconvenient; the extreme may be the most important finding in the study.
Can’t tell? Then report the analysis with and without the point. If the conclusion is the same either way, the point did not matter and you say so. If the conclusion flips — as the Dataset D slope flips between \(0.6\) and \(\approx 1.45\) — that fragility is itself the headline result, and hiding it by deleting the point would be the error.

A practical default that avoids the delete-or-keep dilemma: report a resistant estimate alongside the classical one. Show the OLS slope and the Theil–Sen slope; show the mean and the median. When they agree, your conclusion is robust to the tail. When they disagree, you have learned that a few points are driving the classical answer — which is exactly the information a reader needs. Assumption-ladder move: the investigate-don’t-delete rule assumes nothing statistical at all — it is a reporting discipline. It protects against silently engineered results and irreproducibility, but it cannot prove which estimate is right; it only guarantees that the reader can see how much the answer depends on the disputed points.
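The with-and-without report from the decision list can be produced without deleting anything from the data. The sketch below refits OLS on subsets of a simulated Dataset D-like sample and tabulates the slopes. The sample is an assumption (hypothetical attendance range and seed), the helper `slope_without` is hypothetical, and the row numbers 39 and 40 refer to where the two planted points sit in this simulated data frame.

```r
# Hypothetical Dataset D-like sample (attendance range and seed are assumptions).
set.seed(45203)
sessions <- c(rep(1:12, length.out = 38), 20, 5)
gain <- c(2 + 1.5 * sessions[1:38] + rnorm(38, sd = 4), 2, 40)
d <- data.frame(sessions, gain)

slope_without <- function(drop) {
  keep <- setdiff(seq_len(nrow(d)), drop)
  unname(coef(lm(gain ~ sessions, data = d[keep, ]))[2])
}

# Nothing is deleted from d; each row of the report is a separate refit.
report <- data.frame(
  analysis  = c("all 40 points", "without leverage point (row 39)",
                "without vertical outlier (row 40)", "without both"),
  ols_slope = c(slope_without(integer(0)), slope_without(39),
                slope_without(40), slope_without(c(39, 40)))
)
report$ols_slope <- round(report$ols_slope, 2)
report
```

A table like this goes into the writeup as is. If the rows agree, the disputed points did not matter; if they disagree, the reader sees exactly how much the slope depends on each one.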

A static R sketch (not executed)

The code below is shown as a teaching idiom and is not run: this page is rendered without executing R. The seed is set up front so that any step involving randomness would be reproducible.

set.seed(45203)

# Dataset D: wellbeing gain, n = 40, with two contaminating points.
# Assumes numeric vectors already in the workspace: gain (the outcome)
# and sessions (the predictor); both are synthetic.

# --- resistant vs non-resistant summaries of the gain outcome ---
mean(gain)                 # 11   (pulled up by the +40 vertical outlier)
median(gain)               #  8   (resistant center)
mean(gain, trim = 0.10)    #  8.3 (10% trimmed mean, close to the median)
sd(gain)                   #  9   (inflated by squaring the +40)
mad(gain)                  #  5   (MAD-based SD, resistant; 1.4826 * median|x - med|)

# --- outlier / leverage / influence diagnostics on the fit ---
fit <- lm(gain ~ sessions)
which(abs(rstandard(fit)) > 3)   # flags the +40 vertical outlier (unusual y)
which(hatvalues(fit) > 2 * 2/40) # flags the sessions = 20 point  (unusual x)
which(cooks.distance(fit) > 4/40)# flags the influential leverage point (fit moves)

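# --- OLS vs robust slopes (robust shown for the idea, not executed) ---
coef(fit)["sessions"]                              # 0.6  (OLS, attenuated by leverage point)
coef(quantreg::rq(gain ~ sessions, tau = 0.5))[2]  # 1.5  (L1 / LAD, median regression)
coef(MASS::rlm(gain ~ sessions))[2]                # 1.4  (Huber M-estimator, downweights)
# repeated = FALSE requests Theil-Sen; mblm's default is Siegel repeated medians
mblm::mblm(gain ~ sessions, repeated = FALSE)$coefficients[2]  # 1.45 (Theil-Sen)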

Read the output as a comment, not a result: the contrast between mean \(= 11\) and median \(= 8\), and between the OLS slope \(0.6\) and the robust slopes near \(1.4\)–\(1.5\), is the same story the tables tell. Assumption-ladder move: the code assumes the model is fit to the contaminated data and ranks or downweights via the robust calls. It protects by making the disagreement visible, but it cannot prove which slope is the truth. It surfaces the dependence so you investigate.

A common mistake

The signature error here is deleting flagged points to clean up the picture. A boxplot whisker, a \(|z| > 3\), or a large Cook’s distance is a flag, not a verdict. Removing the sessions \(= 20\) point would make the OLS slope snap from \(0.6\) back to \(\approx 1.5\) and produce a tidy, significant, and dishonest result, because the whole point of Dataset D is that the answer depends on two points. The second, quieter mistake is trusting a single non-resistant summary — reporting only the gain mean \(= 11\) and the OLS slope \(= 0.6\) without ever computing the median \(= 8\) or a robust slope, so you never notice the contamination at all. The fix for both is the same: compute a resistant summary next to the classical one, and when they disagree, investigate and report both rather than choosing the convenient number.

Evidence and verification status

verified: false. The method logic on this page — what a resistant summary is, the breakdown point, the outlier/leverage/influence distinction, and the OLS-versus-robust contrast — is course-authored. But every numeric value here is drafted, synthetic, and not independently checked: the Dataset D gain mean \(= 11\), median \(= 8\), \(10\%\) trimmed mean \(= 8.3\), SD \(= 9\), MAD-based SD \(\approx 5\), the breakdown-point figures (\(0\) for the mean, \(\approx 0.5\) for the median, \(\approx 0.10\) for the \(10\%\) trimmed mean), and the slope values (clean OLS \(\approx 1.5\), contaminated OLS \(\approx 0.6\), L1 \(\approx 1.5\), Theil–Sen \(\approx 1.45\), Huber \(\approx 1.4\)) are illustrative. They are provisional, not confirmed reference values.