Assumptions & diagnostics guide

What each method assumes, and how to check it

Keep this page open while you fit anything. It is the fourth step of the analysis blueprint — Question → Structure → Method → Assumptions & diagnostics → Estimate & uncertainty → Conclusion — pulled out into one reference. Every method in this course (the paired and two-sample \(t\), one-way and two-way ANOVA, simple and multiple regression, ANCOVA, chi-square, and logistic regression) buys you an estimate with a confidence interval only when the structure it assumes is roughly the structure your data actually have. The diagnostics below are how you earn the estimate; they are not box-checking before a verdict.

Three disciplines run through the whole page and you should be able to recite them:

Report the estimate with its uncertainty, not a bare verdict. A diagnostic that passes does not license a lone p-value — it licenses a mean difference, an effect size, a slope, an adjusted mean, or an odds ratio, each with an interval. A diagnostic that fails tells you which estimate is no longer trustworthy.
Keep statistical significance, practical significance, and causation distinct. No diagnostic upgrades an observational association into a causal claim. Levene’s test passing does not mean the support center caused the higher scores; a clean QQ plot does not make a \(6\)-point gain practically large. Assumptions buy you a valid estimate, nothing more.
Investigate, do not auto-delete. Outliers and influential points are signals to examine, never rows to silently drop. A point can be a data-entry slip, a genuine extreme student, or the very case that breaks your model. Deleting it to clean up a plot is the most common way an honest analysis goes quietly wrong. Report the analysis with and without the point and say what changed.

All numbers referenced come from the five synthetic Cypress Ridge College Student-Success datasets (seed set, set.seed(35203)) and are illustrative, not independently verified. R is shown as static, non-executed code; it is not run on this site.

How to read this guide

Each method gets the same three-column table — the assumption ladder for applied methods. Fill these three cells in your head before you read any p-value:

Column	The question it answers
Assumes	What must be (approximately) true for this method’s estimate and interval to be trustworthy?
How to check	The plot, test, or number you actually look at — what the diagnostic is.
If it fails	The remedy or the fallback method — what you do instead, never “ignore and report anyway.”

A recurring caution: a diagnostic test (Levene, Shapiro–Wilk) answers “is the assumption violated enough to detect?”, which is sensitive to sample size — it can flag a trivial departure in a huge sample and miss a real one in a tiny sample. Prefer pictures (residual plots, QQ plots, leverage plots) as your primary evidence and use the formal test as a supporting number, not the verdict.

The five datasets, by structure

The same world appears under five structures, and structure is what selects the method — and therefore the assumptions. Synthetic; seed set (set.seed(35203)).

Dataset	Structure	Method(s) it carries	Where it teaches
P	paired pre/post, same \(n = 30\)	paired \(t\)	wk 4
G	two independent groups, \(n_1 = n_2 = 45\)	two-sample / Welch \(t\)	wk 5
F	four formats, \(n = 25\) each (\(+\) pretest covariate)	one-way ANOVA; ANCOVA	wk 6–8, wk 11
X	\(2 \times 2\) factorial, \(n = 20\) per cell	two-way ANOVA	wk 9
R	\(n = 120\) students; predictors → score & pass/fail	regression; chi-square; logistic	wk 10, wk 12, wk 13

Paired and two-sample \(t\)-tests

Paired \(t\) (Dataset P, wk 4)

The structure is the assumption that matters most: you have one measurement twice on each unit (the same \(n = 30\) students, pre and post), so you analyze the \(30\) paired differences \(d_i = \text{post} - \text{pre}\), never the two columns as independent groups. With \(\bar d = +6.0\), \(s_d = 9.0\), \(\mathrm{SE} = 9/\sqrt{30} \approx 1.64\), the paired \(t = 6.0/1.64 \approx 3.65\) on \(29\) df gives the 95% CI \((2.6, 9.4)\) points and effect size \(d_z = 6/9 \approx 0.67\). The interval is the deliverable, not the p-value.

Assumes	How to check	If it fails
The \(30\) differences are independent across students	Design: one row per student, no shared rooms/sections driving scores	Clustered data needs a mixed model; not covered here — flag it
The differences are roughly normal (not the raw pre/post columns)	QQ plot of the \(d_i\); histogram; the difference, not each column, must be near-normal	Wilcoxon signed-rank as a fallback; with \(n = 30\) the CLT makes the \(t\) fairly robust
The pairing is correctly preserved	Confirm each post is matched to its own pre	Mismatched pairs invalidate everything — recheck the merge

set.seed(35203)
# Dataset P: pre/post on the same 30 students -> analyze the differences
t.test(P$post, P$pre, paired = TRUE)        # paired, NOT two independent samples
#  t = 3.65, df = 29, p-value = 0.001
#  95 percent confidence interval:  2.6  9.4   (points)
#  mean of the differences:  6.0
d <- P$post - P$pre                         # the 30 paired differences
qqnorm(d); qqline(d)                        # check normality of the DIFFERENCES

The locked contrast. If you wrongly treated pre and post as two independent samples, the SE would be \(\sqrt{12^2/30 + 11^2/30} \approx 2.97\), nearly double the paired \(1.64\). Using the wrong design is itself an assumption failure: it ignores the pairing, so the between-student variation that pairing would have removed stays in the SE and inflates your interval.
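The paired arithmetic above can be reproduced from the summary statistics alone. A minimal sketch, assuming only the figures quoted in this section (\(\bar d = 6.0\), \(s_d = 9.0\), \(n = 30\), and the column SDs of 12 and 11 that appear in the independent-samples SE); the variable names are illustrative and this is not the course's own script:

```r
# Paired t from summary statistics (Dataset P figures quoted in this section)
n     <- 30
d_bar <- 6.0                                  # mean of the paired differences (post - pre)
s_d   <- 9.0                                  # SD of the paired differences

se_paired <- s_d / sqrt(n)                    # SE of the mean difference
t_stat    <- d_bar / se_paired                # paired t on n - 1 df
t_crit    <- qt(0.975, df = n - 1)            # two-sided 95% critical value
ci        <- d_bar + c(-1, 1) * t_crit * se_paired
d_z       <- d_bar / s_d                      # standardized mean difference

# The wrong design: pre and post treated as independent samples (SDs 12 and 11)
se_indep <- sqrt(12^2 / n + 11^2 / n)

round(c(se_paired = se_paired, t = t_stat, lower = ci[1], upper = ci[2],
        d_z = d_z, se_indep = se_indep), 2)
```

The ratio `se_indep / se_paired` is the price of ignoring the pairing: the same 6-point gain would come with an interval roughly 1.8 times as wide.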

Two-sample / Welch \(t\) (Dataset G, wk 5)

Here the structure is two independent groups (Support \(n_1 = 45\) vs Self-guided \(n_2 = 45\); different students), and the estimate is the difference in means \(78 - 72 = 6.0\) points, Welch \(\mathrm{SE} \approx 2.38\), \(t \approx 2.53\), 95% CI \((1.3, 10.7)\), Cohen’s \(d = 6/11.27 \approx 0.53\) (medium).

Assumes	How to check	If it fails
The two samples are independent (different units, no pairing)	Design: distinct students per group	If paired, switch to the paired \(t\) (wk 4) — the designs are not interchangeable
Each group’s values are roughly normal (or \(n\) large enough for the CLT)	QQ plot / histogram within each group	Welch \(t\) tolerates mild non-normality at \(n = 45\); severe skew → rank-sum
Equal variance — only if you use the pooled \(t\)	Compare group SDs (\(10.5\) vs \(12.0\)); a side-by-side boxplot; Levene’s test	Use Welch (the safe default), which does not assume equal variance — here pooled and Welch nearly agree because \(n_1 = n_2\)

set.seed(35203)
# Dataset G: two independent groups -> Welch is the safe default
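t.test(final ~ group, data = G)             # Welch by default (var.equal = FALSE)
#  sign of the difference = first factor level minus second; relevel group if you want Support - Self-guided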
#  t = 2.53, df = 86, p-value = 0.013
#  95 percent confidence interval:  1.3  10.7   (points)

Reach for Welch by default. Equal-variance (pooled) \(t\) is only marginally more efficient and fails badly under unequal variance with unequal \(n\). And remember the conclusion ceiling: because students chose the support center, this is association, not causation — no \(t\)-test assumption removes that selection.
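The Welch figures for Dataset G can be rebuilt from the group summaries. A minimal sketch, assuming the means, SDs, and sample sizes quoted in this section; which SD belongs to which group is not stated in the text, so the assignment below is an assumption (with \(n_1 = n_2\) it does not change any result):

```r
# Welch two-sample t from summary statistics (Dataset G figures quoted above)
n1 <- 45;   n2 <- 45
m1 <- 78;   m2 <- 72                 # Support vs Self-guided means
s1 <- 10.5; s2 <- 12.0               # ASSUMPTION: SD-to-group assignment is not given

v1 <- s1^2 / n1                      # squared SE contribution of each group
v2 <- s2^2 / n2
delta    <- m1 - m2
se_welch <- sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # Welch-Satterthwaite df
t_stat   <- delta / se_welch
ci       <- delta + c(-1, 1) * qt(0.975, df_welch) * se_welch

# Cohen's d uses the pooled SD as the yardstick
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
cohen_d  <- delta / s_pooled

round(c(se = se_welch, df = df_welch, t = t_stat,
        lower = ci[1], upper = ci[2], d = cohen_d), 2)
```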

One-way ANOVA (Dataset F, wk 6–8)

One-way ANOVA compares the four format means (\(L = 74\), \(LL = 81\), \(O = 70\), \(H = 79\); grand mean \(76\)) and estimates how much of the score variance format explains: \(F = 616.7/81 \approx 7.61\) on \((3, 96)\), \(\eta^2 = 1850/9626 \approx 0.19\). The assumptions are checked on the residuals (each score minus its group mean), because that is what the F-test’s denominator \(\mathrm{MSE} = 81\) summarizes.

Assumes	How to check	If it fails
Independence of observations	Design: different students, no shared influence within a format	Dependence (shared sections) needs a mixed model — flag it
Normal residuals	QQ plot of residuals; residuals-vs-fitted for symmetry	Mild non-normality is fine at these \(n\); heavy skew → transform or Kruskal–Wallis
Equal variance across the four groups (homoscedasticity)	Residuals-vs-fitted (constant band); Levene’s test	Welch one-way ANOVA, or a variance-stabilizing transform

For Dataset F the residuals are roughly normal — a near-linear QQ plot with one mild low outlier, an Online student near \(45\). Investigate, do not auto-delete: check whether that score is a recording error or a real struggling student before deciding anything. Levene’s test gives \(p \approx 0.40\), so equal variance is reasonable (Online is slightly more spread, not alarmingly so), and independence holds by design.

set.seed(35203)
fit <- aov(final ~ format, data = F)
summary(fit)                                 # F = 7.61 on (3, 96), p = 0.0001
plot(fit, which = 1)                         # residuals vs fitted -> equal-variance band
plot(fit, which = 2)                         # QQ plot of residuals -> normality
car::leveneTest(final ~ format, data = F)    # Levene's test  p = 0.40 (equal var OK)

Report \(F\) with \(\eta^2 \approx 0.19\) — format explains about \(19\%\) of score variance — not the F-statistic alone. A significant \(F\) says “the means are not all equal”; it does not say which differ (that is wk 8) or that format caused the gap.
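The \(F\) and \(\eta^2\) quoted for Dataset F follow directly from the four group means, the per-group \(n\), and the MSE. A minimal sketch under those stated figures (variable names are illustrative):

```r
# One-way ANOVA pieces from group summaries (Dataset F figures quoted above)
means <- c(L = 74, LL = 81, O = 70, H = 79)
n_per <- 25
mse   <- 81

k          <- length(means)
grand      <- mean(means)                       # balanced design: mean of the group means
ss_between <- n_per * sum((means - grand)^2)    # 1850
df_between <- k - 1                             # 3
df_within  <- k * (n_per - 1)                   # 96
ss_within  <- mse * df_within                   # 7776
F_stat     <- (ss_between / df_between) / mse
eta_sq     <- ss_between / (ss_between + ss_within)
p_value    <- pf(F_stat, df_between, df_within, lower.tail = FALSE)

round(c(F = F_stat, eta_sq = eta_sq), 3)
```

Seeing \(\eta^2\) as `ss_between / (ss_between + ss_within)` makes the reporting rule concrete: the effect size is a share of total variation, which the F-statistic alone does not convey.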

Two-way ANOVA (Dataset X, wk 9)

The \(2 \times 2\) design (Delivery × Background, \(n = 20\) per cell, \(\mathrm{MSE} = 81\) on \(76\) df) carries the same residual assumptions as one-way ANOVA, plus one reading rule that is really an interpretation discipline.

Assumes	How to check	If it fails
Independence of observations	Design: distinct students per cell	mixed model if clustered
Normal residuals, equal variance across the four cells	Residuals-vs-fitted; QQ of residuals; Levene across cells	transform; Welch-type correction
Adequate, ideally balanced cell sizes	Cell counts (\(20\) each here, so balanced)	unbalanced designs need Type II/III sums of squares — name which you used

set.seed(35203)
fit2 <- aov(final ~ delivery * background, data = X)
summary(fit2)
#  delivery               F = 10.4   p = 0.002
#  background             F = 67.2   p < 0.001
#  delivery:background    F = 5.0    p = 0.028   <- read this FIRST
interaction.plot(X$background, X$delivery, X$final)   # non-parallel lines

The diagnostic that is really a reading rule: the interaction \(F \approx 5.0\) (\(p \approx 0.028\)) is significant, so read the interaction first. The In-person advantage is \(73 - 62 = 11\) points for weak-background students but only \(85 - 83 = 2\) for strong-background students. With a real interaction the main effects are conditional — do not report “Online is \(6.5\) points worse” as if it applied uniformly. Look at the interaction plot (non-parallel lines) before the main-effect table.
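The reading rule is easier to trust once the interaction is computed as a difference of differences. A minimal sketch using the four cell means quoted above, \(n = 20\) per cell, and \(\mathrm{MSE} = 81\); the contrast helper `contrast_ss` is a hypothetical name introduced here, and the shortcut applies only because the design is balanced:

```r
# 2 x 2 cell means (Dataset X figures quoted above); matrix() fills column by column
cell <- matrix(c(73, 62, 85, 83), nrow = 2,
               dimnames = list(delivery   = c("In-person", "Online"),
                               background = c("Weak", "Strong")))
n_cell   <- 20
mse      <- 81
df_error <- 4 * (n_cell - 1)                               # 76

# Simple effects of delivery within each background, then their difference
simple_effects <- cell["In-person", ] - cell["Online", ]   # Weak: 11, Strong: 2
diff_of_diffs  <- unname(simple_effects["Weak"] - simple_effects["Strong"])

# Each 1-df effect in a balanced 2 x 2 is a +/-1 contrast on the cell means
contrast_ss  <- function(coefs) n_cell * sum(coefs * cell)^2 / sum(coefs^2)
c_delivery   <- matrix(c( 1, -1, 1, -1), nrow = 2)
c_background <- matrix(c(-1, -1, 1,  1), nrow = 2)
c_interact   <- c_delivery * c_background

F_vals <- c(delivery    = contrast_ss(c_delivery),
            background  = contrast_ss(c_background),
            interaction = contrast_ss(c_interact)) / mse
p_vals <- pf(F_vals, 1, df_error, lower.tail = FALSE)

round(rbind(F = F_vals, p = p_vals), 3)
```

The averaged delivery effect, `mean(simple_effects)`, is the 6.5 points that should not be reported as uniform: it sits between an 11-point gap and a 2-point gap and describes neither group well.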

Simple and multiple regression (Dataset R, wk 10)

Simple regression gives \(\widehat{\text{final}} = 55 + 1.6\cdot\text{hours}\) (\(R^2 \approx 0.30\), slope \(\mathrm{SE} \approx 0.22\), 95% CI \((1.16, 2.04)\)); multiple regression gives \(\widehat{\text{final}} = 30 + 1.1\cdot\text{hours} + 0.25\cdot\text{att} + 0.30\cdot\text{pretest}\) (\(R^2 \approx 0.46\)). The hours slope drops \(1.6 \to 1.1\) after adjustment — that drop is confounding, and the partial slope means “holding attendance and pretest fixed.” Regression has the richest diagnostic kit in the course.

Assumes	How to check	If it fails
Linearity — \(Y\) is linear in each predictor	Residuals-vs-fitted (no curve); component-plus-residual plots	add a term, transform \(X\), or use a nonlinear fit
Independent residuals	Design; for ordered data, residual-vs-order plot	dependence needs time-series / mixed methods
Constant-variance (homoscedastic) residuals	Residuals-vs-fitted (even band, no funnel); scale–location plot	transform \(Y\) (e.g. log); robust/weighted SEs
Roughly normal residuals (for the \(t\)-based tests and intervals)	QQ plot of residuals	large \(n\) helps; transform if severe
No severe multicollinearity	VIF for each predictor	drop/combine redundant predictors; here hours–attendance \(r \approx 0.45\), VIF \(\approx 1.3\), fine
No single point distorting the fit (influence)	residual-vs-leverage plot; Cook’s distance; hat values	investigate; refit with and without — do not auto-delete

set.seed(35203)
fit <- lm(final ~ hours + attendance + pretest, data = R)
summary(fit)                                 # R^2 = 0.46; hours slope 1.1 (was 1.6 simple)
plot(fit, which = 1)                         # residuals vs fitted -> linearity + equal variance
plot(fit, which = 2)                         # QQ -> residual normality
plot(fit, which = 5)                         # residuals vs leverage -> influence (Cook's D)
car::vif(fit)                                # all VIF ~ 1.3, well below the rule-of-thumb 5

For Dataset R the residuals are roughly normal with mild heteroscedasticity, VIFs are near \(1.3\) (no collinearity problem), and there is one high-leverage student. Investigate, do not drop — a high-leverage point sits at an extreme \(x\) and can swing the slope, but it may be a perfectly real high-effort student. Refit with and without it and report whether the slope and its interval move. A high-leverage point (\(x\) unusual) is not the same as a large-residual outlier (\(y\) far from the line); an influential point is one that actually moves the fit (high leverage and a large residual) — Cook’s distance is the single number that combines them.
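The refit-with-and-without workflow can be sketched on a small made-up example. The data below are NOT Dataset R: they are 20 deterministic points on a line with slope 1.6 plus one planted case at an extreme \(x\) with a response far below the line, chosen only so the diagnostics have something to find. All object names are illustrative.

```r
# ILLUSTRATIVE toy data only (not Dataset R): 20 well-behaved points + 1 planted case
toy <- data.frame(
  hours = c(1:20, 40),
  final = c(55 + 1.6 * (1:20) + rep(c(1, -1), 10), 60)
)

fit_all  <- lm(final ~ hours, data = toy)          # every row kept
fit_drop <- lm(final ~ hours, data = toy[-21, ])   # planted case set aside for comparison

lev     <- hatvalues(fit_all)                      # leverage: how unusual each x is
cooks   <- cooks.distance(fit_all)                 # influence: leverage and residual combined
flagged <- unname(which.max(cooks))                # row to investigate, not to delete

# Report both fits side by side, with intervals for the slope
rbind(with_point = coef(fit_all), without_point = coef(fit_drop))
rbind(with_point    = confint(fit_all)["hours", ],
      without_point = confint(fit_drop)["hours", ])
```

The output of interest is the pair of slopes and their intervals, not a decision rule: if the two fits tell the same story the point is harmless, and if they do not, the write-up should show both and say which case drives the difference.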

Report the slope with its 95% CI \((1.16, 2.04)\) for hours in the simple fit, and read the partial slope as adjusted — “each extra study-hour per week is associated with about \(+1.1\) final points, holding attendance and pretest fixed.” Observational predictors buy association; “holding fixed” is a modeling statement, not a controlled experiment.

ANCOVA (Dataset F + pretest, wk 11)

ANCOVA compares the format means adjusted for the pretest covariate — putting the formats “at the same baseline.” Adjusting (common slope \(b \approx 0.45\)) shrinks the gaps: adjusted means \(L\,74.5\), \(LL\,80.6\), \(O\,70.9\), \(H\,78.1\); the format effect after adjustment is \(F \approx 6.2\) on \((3, 95)\), \(\eta^2_{\text{partial}} \approx 0.16\) (down from the unadjusted \(0.19\) — some apparent format advantage was baseline). ANCOVA carries all the regression/ANOVA residual assumptions plus two of its own.

Assumes	How to check	If it fails
All ANOVA residual assumptions (independence, normal residuals, equal variance)	residual plots, QQ, Levene — as above	as above
Parallel slopes / homogeneity of regression — the covariate’s slope is the same in every group	Test the format × pretest interaction; plot fitted lines per group	If slopes differ, a single adjusted mean is misleading — report group-specific slopes instead
The covariate is measured before treatment (pre-treatment), not affected by it	Timeline: pretest is baseline readiness, taken first	Adjusting for a post-treatment covariate removes part of the effect you wanted — do not do it

set.seed(35203)
# Parallel-slopes check FIRST: is the format x pretest interaction needed?
anova(lm(final ~ format * pretest, data = F))   # interaction NS, p = 0.5  -> slopes parallel
# Valid ANCOVA: common-slope adjustment
fitc <- lm(final ~ pretest + format, data = F)
anova(fitc)                                      # covariate F = 30 (p<0.001); format F = 6.2
emmeans::emmeans(fitc, "format")                 # adjusted means: 74.5, 80.6, 70.9, 78.1

For Dataset F the format × pretest interaction is non-significant (\(p \approx 0.5\)), so the parallel-slopes assumption holds and the single common slope is valid; the covariate is genuinely pre-treatment (baseline readiness). The pre-treatment rule is load-bearing: adjusting for something measured after the formats acted would erase part of the very format effect you are estimating. Report the adjusted means with their intervals, and read them as “formats compared at the same baseline” — still observational, still association.
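The partial \(\eta^2\) quoted for the adjusted format effect can be recovered from the \(F\) statistic and its degrees of freedom, since \(\eta^2_{\text{partial}} = SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})\). A minimal sketch using only the figures quoted in this guide; `partial_eta_sq` is a helper name introduced here:

```r
# Partial eta-squared from an F statistic: F * df1 / (F * df1 + df2)
partial_eta_sq <- function(F_stat, df_effect, df_error) {
  (F_stat * df_effect) / (F_stat * df_effect + df_error)
}

# ANCOVA: 100 students - 4 format means - 1 common slope = 95 error df
df_error_ancova <- 100 - 4 - 1

adjusted   <- partial_eta_sq(6.2,  df_effect = 3, df_error = df_error_ancova)
unadjusted <- partial_eta_sq(7.61, df_effect = 3, df_error = 96)   # one-way ANOVA, for comparison

round(c(unadjusted = unadjusted, adjusted = adjusted), 2)
```

In the one-way model the partial and ordinary \(\eta^2\) coincide, which is why the second call returns the unadjusted value reported earlier.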

Chi-square test of independence (Dataset R, wk 12)

The \(3 \times 2\) table (support program × pass/fail) has pass/fail counts None \(18/22\), Drop-in \(24/16\), Structured \(30/10\), giving \(\chi^2 = 7.5\) on \(2\) df (\(p \approx 0.024\)), with expected passes per program \(= 40(0.6) = 24\) and expected fails \(= 40(0.4) = 16\). The chi-square has fewer assumptions than the mean-based methods, but they are easy to violate in small tables.

Assumes	How to check	If it fails
Expected counts \(\ge 5\) in (essentially) every cell	Inspect the expected-count table, not the observed	Combine sparse categories, or use Fisher’s exact test
Independence of observations — each student counted once	Design: one row per student, no double-counting	Repeated measures invalidate the \(\chi^2\) — restructure the data
The table holds counts, not percentages or means	Confirm cells are frequencies	Re-tabulate from raw counts

set.seed(35203)
tab <- table(R$program, R$pass)
chisq.test(tab)                              # X-squared = 7.5, df = 2, p = 0.024
chisq.test(tab)$expected                     # all expected counts >= 5 (24 pass, 16 fail per program) -> OK

Here every expected count is comfortably above \(5\) (\(24\) expected passes and \(16\) expected fails per program), so the approximation is safe, and each student appears once. Report an effect, not just the \(\chi^2\): Structured vs None gives a risk difference \(0.75 - 0.45 = 0.30\), a relative risk \(0.75/0.45 \approx 1.67\), and an odds ratio \(\approx 3.67\). And because students self-select into programs, a significant association is not evidence the program caused passing: the conclusion ceiling again.
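The test, the expected-count check, and the three effect measures can all be rebuilt from the six counts. A minimal sketch that types the counts quoted in this section into a matrix rather than reading them from the dataset; object names are illustrative:

```r
# Pass/fail counts by program, as quoted in this section
tab <- matrix(c(18, 24, 30,      # Pass column
                22, 16, 10),     # Fail column
              ncol = 2,
              dimnames = list(program = c("None", "Drop-in", "Structured"),
                              outcome = c("Pass", "Fail")))

chi <- chisq.test(tab)           # no continuity correction is applied beyond 2 x 2 tables
chi$expected                     # the table to inspect for the >= 5 rule

# Effect measures, Structured vs None
risk       <- tab[, "Pass"] / rowSums(tab)
odds       <- risk / (1 - risk)
risk_diff  <- unname(risk["Structured"] - risk["None"])
rel_risk   <- unname(risk["Structured"] / risk["None"])
odds_ratio <- unname(odds["Structured"] / odds["None"])

round(c(chi_sq = unname(chi$statistic), p = chi$p.value,
        RD = risk_diff, RR = rel_risk, OR = odds_ratio), 3)
```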

Logistic regression (Dataset R, wk 13)

For a binary outcome (pass \(= \text{final} \ge 70\)), logistic regression models the log-odds: \(\mathrm{logit}(\hat p) = b_0 + 0.22\cdot\text{hours} + 0.04\cdot\text{pretest} + 0.6\,[\text{Drop-in}] + 1.0\,[\text{Structured}]\). The OR per study-hour is \(e^{0.22} \approx 1.25\); the adjusted Structured-vs-None OR is \(e^{1.0} \approx 2.72\) — shrunk from the raw \(3.67\) once hours and pretest are held fixed (confounding, yet again). Its assumptions differ from the linear models’.

Assumes	How to check	If it fails
Correct link and model form: the log-odds is additive in the predictors, with no needed interaction left out	binned-residual plot; compare to a flexible fit	add interactions/splines; try a different link
Independence of observations	Design: one row per student	clustered → mixed/GEE logistic
Linearity of the logit in each continuous predictor	plot smoothed log-odds vs the predictor (e.g. hours)	transform the predictor; add a quadratic term
No severe separation — no predictor perfectly splits pass/fail	Watch for huge coefficients with enormous SEs; warnings on fit	penalized (Firth) logistic; combine categories
No severe multicollinearity	VIF on the predictors	drop/combine, as in linear regression

set.seed(35203)
fitg <- glm(pass ~ hours + pretest + program, data = R, family = binomial)
summary(fitg)                                # watch for huge coef + huge SE (separation)
exp(coef(fitg))                              # ORs: hours 1.25; Structured vs None 2.72 (adjusted)
exp(confint(fitg))                           # report each OR WITH its interval

Two locked discipline points live in the diagnostics. First, coefficients are on the log-odds scale — exponentiate to an odds ratio (\(e^{0.22} \approx 1.25\) per hour), and read a predicted probability (the S-curve \(p = 1/(1+e^{-\eta})\), e.g. \(\approx 0.56\) for a high-effort Structured student vs \(\approx 0.05\) for a low-effort None student) as the conclusion — never the raw logit. Second, \(\mathrm{OR} \ne \mathrm{RR}\): do not report the \(2.72\) odds ratio as if it were a \(2.72\)-times risk. And separation is the silent killer — a coefficient that blows up with an enormous standard error means a predictor perfectly separated the outcome, so the estimate is unstable; switch to a penalized fit rather than trusting it.
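Why \(\mathrm{OR} \ne \mathrm{RR}\) becomes clearer when the same coefficient is pushed through the S-curve at different starting probabilities. A minimal sketch: the two coefficients are the ones quoted above, while the baseline pass probabilities are hypothetical values chosen for illustration (the intercept \(b_0\) is not given here, so no fitted probabilities are reproduced):

```r
# Coefficients quoted in this section (log-odds scale)
b_hours      <- 0.22
b_structured <- 1.0

or_hours      <- exp(b_hours)        # odds ratio per extra study-hour
or_structured <- exp(b_structured)   # adjusted Structured-vs-None odds ratio

# HYPOTHETICAL baseline pass probabilities for a None student
baseline_p <- c(low = 0.10, middle = 0.45, high = 0.80)

# Add the coefficient on the logit scale, then map back to a probability
p1 <- plogis(qlogis(baseline_p) + b_structured)
rr <- p1 / baseline_p                # implied risk ratio at each baseline

round(rbind(baseline = baseline_p, structured = p1, risk_ratio = rr), 3)
round(c(or_hours = or_hours, or_structured = or_structured), 2)
```

The odds ratio is the same in every column, yet the implied risk ratio changes with the baseline and always stays below the odds ratio, which is why a predicted probability is the safer thing to report.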

The investigate-do-not-delete rule, in one place

Every method above can produce an unusual point, and the response is always the same workflow, never deletion:

Point type	What it is	The diagnostic	The honest move
Outlier (\(y\))	response far from the model’s prediction	large standardized residual; QQ plot tail	examine the case; report with and without
High-leverage (\(x\))	predictor value far from the others	hat value; residual-vs-leverage plot	check it is real; see if the slope moves
Influential	actually changes the fit (leverage \(+\) residual)	Cook’s distance	refit both ways; state what changed

A point that is unusual but correct and real stays in — it is part of the population you are describing. A point that is a demonstrable error can be corrected or removed with a documented reason. What you never do is delete a point because it spoils a plot or a p-value. The data cannot tell you which kind it is; your investigation does.

A few drifts to resist

Drift	The disciplined move
trusting Levene/Shapiro over the picture	read the plot first; use the test as a supporting number, mindful that its power to flag a departure grows with \(n\)
checking normality of the raw columns in a paired test	check normality of the differences
pooled \(t\) by default	use Welch unless equal variance is genuinely justified
reporting main effects when the interaction is significant	read the interaction first; main effects are conditional
reading a logit coefficient as a probability	exponentiate to an OR, and report a predicted probability
deleting an influential point	investigate, do not auto-delete; refit both ways
treating a passed diagnostic as a causal license	assumptions buy a valid estimate, not causation

When the assumptions genuinely hold, the parametric method is the efficient, correct choice and you report its estimate with its interval plainly. When a diagnostic fails, it is telling you which estimate to stop trusting and which fallback to reach for — that is the whole point of step 4 of the blueprint.

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded applied-methods checkpoints, weekly quizzes, homework and analysis memos, applied analysis labs, the midterm, the applied methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.