Method chooser (decision guide)
From a data shape and a question to a defensible method
Keep this page open while you read the notes. It is a decision guide, not a flowchart that picks “the” test. The habit it teaches runs through every table below and is the spine of the whole course — the analysis blueprint, six steps walked for every method: (1) Question — are you comparing, explaining, or predicting? (2) Structure — the unit of analysis, the response versus the explanatory / grouping / covariate variables, the outcome type (quantitative, categorical, binary), and the design (paired vs independent, one factor vs two, observational vs experimental). (3) Method — the analysis that matches that structure, and why this one and not a neighbor. (4) Assumptions & diagnostics — what it assumes and how you check. (5) Estimate & uncertainty — what the model estimates (a mean difference, an effect size, a slope, an adjusted mean, an odds ratio), reported with a confidence interval, never as a bare p-value. (6) Conclusion — statistical versus practical significance, association versus causation, and what the analysis cannot support.
This guide deliberately lays out candidates, with what each assumes and what each estimates, rather than naming “the” test, because the same numbers can call for different methods when the question changes. Two disciplines run inside every cell and recur on every page: report the estimate with its uncertainty, not just a verdict; and keep statistical significance, practical importance, and causal claims distinct, because observational data buy association, not causation. All numeric values referenced come from the synthetic Cypress Ridge College Student-Success datasets (simulated with set.seed(35203)) and are provisional pending review. R is shown only as static, non-executed code.
The five recurring datasets are referenced throughout by their structure, because structure — not subject matter — is what drives method choice:
| Dataset | What it holds | Its structure | Where it teaches |
|---|---|---|---|
| P | pre/post readiness on the same \(n = 30\) students | paired, one quantitative response measured twice | one-sample & paired (wk 4) |
| G | final scores, Support vs Self-guided, \(n_1 = n_2 = 45\) | two independent groups, quantitative response, observational | two-group (wk 5) |
| F | final score by Format (L, LL, O, H), \(n = 25\) each, plus a pretest covariate | one factor with 4 levels (+ covariate) | one-way ANOVA (wk 6–8), ANCOVA (wk 11) |
| X | final score, Delivery × Background \(2\times 2\), \(n = 20\) per cell | two crossed factors, quantitative response | two-way ANOVA (wk 9) |
| R | hours / attendance / pretest / program → final score & pass/fail, \(n = 120\) | quantitative predictors and a categorical predictor, with a quantitative response (final score) and a binary outcome (pass/fail) | regression (wk 10), categorical (wk 12), logistic (wk 13) |
How to read this guide
For every candidate method below, fill in the same blueprint columns before you compare any p-values. These are steps 3–6 of the blueprint, laid out so you can see the trade-offs side by side:
| Column | The blueprint question it answers |
|---|---|
| Method | Which analysis matches this structure (step 3)? |
| Key assumption | What must be true (or approximately true) for the claim to hold, and how do you check it (step 4)? |
| Estimate (with uncertainty) | What does the model estimate, reported with a CI / effect size — not a bare p (step 5)? |
| Conclusion it can / can’t support | Statistical vs practical vs causal — what would it oversell to claim (step 6)? |
A method is well chosen when you can write a sentence in each cell and the estimate cell holds a quantity with an interval, not a verdict. If your estimate cell says only “significant” or “\(p < 0.05\),” you have stopped one step short of the conclusion the course asks for.
Steps 1–2: name the question and the structure
Two analyses of the same numbers can call for different methods because they ask different questions. Pin down the question (compare? explain? predict?) and the structure (outcome type × design) before choosing. The grid below maps the structure to the cell of this guide that fits it.
| Outcome type | Design / structure | Question | Go to |
|---|---|---|---|
| quantitative | one group, or the same unit measured twice (like P) | did the typical value, or the typical change, move? | One group / paired |
| quantitative | two independent groups (like G) | do the two group means differ? | Two independent groups |
| quantitative | one factor, three or more groups (like F) | do any of the group means differ — and which? | Many groups, one factor |
| quantitative | two crossed factors (like X) | does each factor matter, and do they interact? | Two factors |
| quantitative | a quantitative predictor (like R) | how does the response change with the predictor? | A quantitative predictor |
| quantitative | groups plus a quantitative covariate (like F + pretest) | do groups differ after adjusting for the covariate? | Groups plus a covariate |
| categorical | two categorical variables in a table (like R: pass × program) | are the two categorical variables associated? | Two categorical variables |
| binary | a binary outcome with one or more predictors (like R: pass) | how do the predictors change the odds of the outcome? | A binary outcome |
Each cell below opens with the question and the structure, then lays out the candidate(s), what each assumes, and — the point of the course — what each estimates. Choose for a purpose; do not run every test and report the smallest p.
One group, or paired — a single quantitative response (structure: Dataset P)
Question and structure. Dataset P measures a readiness diagnostic on the same \(n = 30\) students before and after a support module. The structure that must be respected is the pairing: each student is their own control, so you analyze the \(30\) paired differences \(d_i = \text{post} - \text{pre}\), never the two columns as if they were independent samples. The question — did the typical change move off zero? — is a one-sample question about the differences.
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Paired \(t\)-test (one-sample \(t\) on the differences) | the \(30\) differences are roughly normal (check a QQ plot of \(d_i\)); pairs independent | mean difference \(\bar d = +6.0\) pts; \(\mathrm{SE} = 9/\sqrt{30} \approx 1.64\); \(t \approx 3.65\) on \(29\) df, \(p \approx 0.001\); 95% CI \((2.6, 9.4)\) pts; \(d_z = 6/9 \approx 0.67\) | that readiness rose on average over the module; not that the module caused it (single arm, no control) and not whether \(+6\) pts is practically meaningful — that is a judgment on the scale |
| One-sample \(t\) against a fixed target | the single sample is roughly normal | a mean with a CI relative to a benchmark (e.g. “is post-readiness above \(65\)?”) | a comparison to a known standard, not a before/after change |
What this says. The paired analysis reports a \(+6\)-point gain with a 95% CI of \((2.6, 9.4)\) — an estimate with its uncertainty, not “\(p < 0.05\).” Pairing is what makes it powerful: if you wrongly treated pre and post as two independent samples of \(30\), the SE would be \(\sqrt{12^2/30 + 11^2/30} \approx 2.97\) — nearly double the paired SE of \(1.64\) — because pairing removes between-student variation. The classic error is exactly that: running an independent two-sample test on paired data, discarding the pairing and the power it buys. Practical vs statistical: a \(+6\)-point gain on a \(100\)-point scale is modest-to-meaningful; significance does not settle importance. See week 4.
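The arithmetic behind the paired row can be sketched in base R from the summary numbers alone. This is a minimal sketch, not the course's own script: the inputs are the provisional values quoted above (\(\bar d = 6\), \(s_d = 9\), \(n = 30\), and the column SDs \(12\) and \(11\)), and every variable name is illustrative.

```r
# Provisional summary numbers quoted for Dataset P (synthetic, not verified)
n     <- 30     # paired students
d_bar <- 6.0    # mean of the differences, post - pre
s_d   <- 9.0    # SD of the differences

se_paired <- s_d / sqrt(n)                      # SE of the mean difference
df        <- n - 1
t_stat    <- d_bar / se_paired                  # one-sample t on the differences
p_value   <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
ci        <- d_bar + c(-1, 1) * qt(0.975, df) * se_paired
d_z       <- d_bar / s_d                        # standardized mean difference

# SE if the pairing were (wrongly) ignored, using the column SDs 12 and 11
se_indep <- sqrt(12^2 / n + 11^2 / n)

round(c(se_paired = se_paired, t = t_stat, p = p_value), 3)
round(ci, 1)
round(c(d_z = d_z, se_indep = se_indep, ratio = se_indep / se_paired), 2)
```

The last line puts the two standard errors side by side, which is the whole argument for respecting the pairing: the same data, a much wider interval if the design is ignored.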
Two independent groups — comparing two means (structure: Dataset G)
Question and structure. Dataset G compares final scores for Support (\(n_1 = 45\)) versus Self-guided (\(n_2 = 45\)) students. The structure is independent groups (different students), and the data are observational — students self-selected into the support center. The question is whether the two group means differ.
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Welch two-sample \(t\) (the safe default) | approximate normality (or large \(n\) via the CLT); does not assume equal variances | mean difference \(78 - 72 = 6.0\) pts; \(\mathrm{SE} \approx 2.38\), df \(\approx 86\), \(t \approx 2.53\), \(p \approx 0.013\); 95% CI \((1.3, 10.7)\) pts; Cohen’s \(d = 6/11.27 \approx 0.53\) (medium) | that the Support mean is higher; not that the support center caused higher scores — motivated students self-select |
| Pooled two-sample \(t\) | additionally that the two variances are equal (\(s_1 = 10.5\), \(s_2 = 12.0\) here, which is close) | with \(n_1 = n_2\) the SE is identical to Welch’s (\(\mathrm{SE} \approx 2.38\)); only the df differ (\(88\) vs \(\approx 86\)), so the CI is nearly the same | the same comparison, but only when equal variances are justified; otherwise prefer Welch |
What this says. Report the \(6\)-point difference with its CI \((1.3, 10.7)\) and \(d \approx 0.53\), not the lone \(p\). Prefer Welch unless equal variances are clearly justified — it costs almost nothing here and protects you when spreads differ. The deepest point is step 6: because students chose the support center, this is association, not causation; a confound (motivation) plausibly drives both the choice and the score. A \(6\)-point gap is about half a standard deviation — medium, not trivial, but not dramatic. Contrast the paired design of week 4; see week 5 and its lab.
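The Welch row can be reproduced from the group summaries with a few lines of base R. The sketch below assumes only the provisional numbers quoted in the table (means \(78\) and \(72\), SDs \(10.5\) and \(12.0\), \(n = 45\) per group); the variable names are illustrative.

```r
# Provisional group summaries quoted for Dataset G (synthetic, not verified)
n1 <- 45; m1 <- 78; s1 <- 10.5   # Support
n2 <- 45; m2 <- 72; s2 <- 12.0   # Self-guided

v1 <- s1^2 / n1                   # squared SE contribution of each group
v2 <- s2^2 / n2
mean_diff <- m1 - m2
se_welch  <- sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

t_stat  <- mean_diff / se_welch
p_value <- 2 * pt(abs(t_stat), df_welch, lower.tail = FALSE)
ci      <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_welch

# Cohen's d uses the pooled SD as the standardizer
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
cohen_d  <- mean_diff / s_pooled

round(c(se = se_welch, df = df_welch, t = t_stat, p = p_value), 3)
round(ci, 1)
round(c(s_pooled = s_pooled, d = cohen_d), 2)
```

Nothing in this calculation touches the design question: the interval is an interval for an association, whatever its width.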
Many groups, one factor — one-way ANOVA (structure: Dataset F)
Question and structure. Dataset F compares final scores across four instructional formats — Lecture (L), Lecture+Lab (LL), Online (O), Hybrid (H), \(n = 25\) each. With more than two groups, running all pairwise \(t\)-tests inflates the family-wise error rate; the question “do any means differ?” is answered by one omnibus test, and “which differ?” by controlled follow-ups.
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| One-way ANOVA (omnibus \(F\)) | roughly normal residuals; equal variances across groups (Levene’s test \(p \approx 0.40\) here — fine); independence | means \(L\,74, LL\,81, O\,70, H\,79\); \(F = 616.7/81 \approx 7.61\) on \((3, 96)\), \(p \approx 0.0001\); effect size \(\eta^2 = 1850/9626 \approx 0.19\) (format explains \(\approx 19\%\) of variance) | that some format means differ; not which pairs differ, and not causation (formats may enroll different students) |
| Tukey HSD (all pairwise, error-rate controlled) | as ANOVA; controls family-wise error across all 6 pairs | critical difference \(\approx 6.64\); significant: \(LL-O = 11\), \(H-O = 9\), \(LL-L = 7\); not: \(H-L = 5\), \(L-O = 4\), \(LL-H = 2\) | which pairs differ with the family-wise error held at 5% |
| Planned contrast (pre-specified question) | a single contrast chosen before looking; \(\sum c_j = 0\) | “hands-on (LL,H) vs delivered-only (L,O)”: \(\hat\psi = 80 - 72 = 8\) pts; \(\mathrm{SE} = 1.8\); \(t \approx 4.44\), \(p < 0.001\) | a pre-specified comparison, more powerful than post-hoc; it cannot answer questions you did not plan |
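What this says. The omnibus \(F \approx 7.61\) with \(\eta^2 \approx 0.19\) says format matters and roughly how much. Running all six pairwise \(t\)-tests unadjusted at \(\alpha = 0.05\) each inflates the family-wise error rate and lowers the bar for a “difference”: with \(\mathrm{MSE} = 81\) and \(n = 25\) the unadjusted threshold is about \(5.05\) points (\(t_{0.975,\,96}\sqrt{2 \cdot 81/25}\)), so \(H-L = 5\) sits right at its edge, whereas the Tukey threshold of \(\approx 6.64\) leaves it clearly non-significant. Multiplicity control (Tukey / Bonferroni) guards against such borderline calls, and a pre-specified contrast is more powerful than post-hoc snooping for a planned question. The common error is reporting a bare omnibus \(p\) with no effect size, or chasing every pairwise difference without error-rate control. See week 6, week 7, week 8, and the ANOVA lab.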
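With equal group sizes, the omnibus \(F\), \(\eta^2\), the Tukey critical difference, and the planned contrast all follow from the four means, \(n = 25\), and \(\mathrm{MSE} = 81\). The base-R sketch below assumes exactly those provisional numbers; the variable names and the contrast weights \(\pm 0.5\) (which give the “average of two versus average of two” comparison) are illustrative.

```r
# Provisional summaries quoted for Dataset F (synthetic, not verified)
means <- c(L = 74, LL = 81, O = 70, H = 79)
n     <- 25     # per group
mse   <- 81     # within-group mean square

k          <- length(means)
df_between <- k - 1
df_error   <- k * (n - 1)
grand      <- mean(means)                 # valid because group sizes are equal

ss_between <- n * sum((means - grand)^2)
ss_error   <- mse * df_error
f_stat     <- (ss_between / df_between) / mse
p_value    <- pf(f_stat, df_between, df_error, lower.tail = FALSE)
eta_sq     <- ss_between / (ss_between + ss_error)

# Tukey HSD critical difference for equal n
hsd       <- qtukey(0.95, nmeans = k, df = df_error) * sqrt(mse / n)
pair_diff <- abs(outer(means, means, "-"))   # all pairwise gaps
exceeds   <- pair_diff > hsd                 # TRUE where a gap clears the HSD

# Planned contrast: hands-on (LL, H) versus delivered-only (L, O)
w      <- c(L = -0.5, LL = 0.5, O = -0.5, H = 0.5)   # weights sum to zero
psi    <- sum(w * means)
se_psi <- sqrt(mse * sum(w^2) / n)
t_psi  <- psi / se_psi

round(c(F = f_stat, p = p_value, eta_sq = eta_sq, hsd = hsd), 4)
exceeds
round(c(psi = psi, se = se_psi, t = t_psi), 2)
```

The computed critical difference should land close to the \(\approx 6.64\) quoted in the table; small differences in the last digit come from the studentized-range quantile used.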
Two factors — two-way ANOVA and interaction (structure: Dataset X)
Question and structure. Dataset X is a \(2\times 2\) design: Delivery {In-person, Online} crossed with Background {Weak, Strong}, \(n = 20\) per cell. Two crossed factors raise a question one factor cannot: do the factors interact — does the effect of one depend on the level of the other?
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Two-way ANOVA with interaction | normal residuals; equal cell variances (\(\mathrm{MSE} = 81\)); independence | Delivery \(F \approx 10.4\), \(p \approx 0.002\); Background \(F \approx 67.2\), \(p < 0.001\); Interaction \(F \approx 5.0\), \(p \approx 0.028\); cell means In-person/Weak \(73\), In-person/Strong \(85\), Online/Weak \(62\), Online/Strong \(83\) | that the In-person advantage depends on background (\(11\) pts for Weak, \(2\) pts for Strong) — read the interaction first |
| Two separate one-way ANOVAs | (a tempting shortcut) | each factor alone — but this hides the interaction | nothing about whether the factors interact — it cannot see the \(11\)-vs-\(2\) pattern |
What this says. When the interaction is significant, the main effects are conditional: do not report “Online is \(6.5\) points worse” as if it applied uniformly — it costs weak-background students \(11\) points but strong-background students only \(2\). Read the interaction plot (non-parallel lines) before the main-effect table. The classic error is reporting marginal main effects while a real interaction is present, which misstates what the data show. See week 9.
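In a balanced \(2\times 2\) design each effect is a single-df contrast of the cell means, so the three \(F\) statistics can be rebuilt from the four cell means, \(n = 20\), and \(\mathrm{MSE} = 81\). This is a sketch under those provisional numbers, with illustrative names; it relies on the balance of the design and would not carry over to unequal cell sizes.

```r
# Provisional cell means quoted for Dataset X (synthetic, not verified)
cell <- matrix(c(73, 85,
                 62, 83),
               nrow = 2, byrow = TRUE,
               dimnames = list(delivery   = c("In-person", "Online"),
                               background = c("Weak", "Strong")))
n        <- 20            # per cell
mse      <- 81
df_error <- 4 * (n - 1)

# Simple effects of delivery within each background level
delivery_gap <- cell["In-person", ] - cell["Online", ]

main_delivery   <- mean(delivery_gap)                          # marginal gap
main_background <- mean(cell[, "Strong"] - cell[, "Weak"])
int_contrast    <- unname(delivery_gap["Weak"] - delivery_gap["Strong"])

# F = estimate^2 / Var(estimate) for a 1-df contrast
f_delivery    <- main_delivery^2   / (mse / n)       # each margin averages 2n scores
f_background  <- main_background^2 / (mse / n)
f_interaction <- int_contrast^2    / (4 * mse / n)   # difference of differences
p_interaction <- pf(f_interaction, 1, df_error, lower.tail = FALSE)

delivery_gap
round(c(delivery = f_delivery, background = f_background,
        interaction = f_interaction, p_int = p_interaction), 3)
```

Printing `delivery_gap` before the \(F\) values mirrors the reading order the guide recommends: the conditional gaps first, the marginal summary second.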
A quantitative predictor — regression (structure: Dataset R)
Question and structure. Dataset R relates study hours/week to final score for \(n = 120\) students, with attendance and a pretest also recorded. The question is explanatory: how does the response change with the predictor — and does that change survive adjustment for other predictors?
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Simple linear regression | linearity; roughly constant-variance, normal residuals; no overly influential point | \(\widehat{\text{final}} = 55 + 1.6\cdot\text{hours}\); slope SE \(\approx 0.22\), \(t \approx 7.3\), \(p < 0.001\), 95% CI \((1.16, 2.04)\); \(R^2 \approx 0.30\) | that each extra study-hour is associated with \(+1.6\) final points; not that studying causes it (observational) |
| Multiple regression (adjusting for attendance, pretest) | as above, plus low multicollinearity (VIF \(\approx 1.3\) — fine) | \(\widehat{\text{final}} = 30 + 1.1\cdot\text{hours} + 0.25\cdot\text{att} + 0.30\cdot\text{pretest}\); \(R^2 \approx 0.46\); the hours slope drops \(1.6 \to 1.1\) | the partial slope, “holding attendance and pretest fixed”; the drop reveals confounding, not causation |
What this says. Report the slope with its CI and \(R^2\), and notice the headline move: the hours slope drops from \(1.6\) to \(1.1\) after adjustment, because students who study more also attend more and start higher — confounding. The partial slope answers a different question (“hold the others fixed”) than the simple slope (“ignore them”). Watch for an influential high-leverage point — investigate, do not auto-delete. This bridges directly to ANCOVA: adjustment changes the estimate. See week 10 and its lab.
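Two pieces of this cell can be sketched in base R. The first rebuilds the slope interval from the quoted estimate and SE. The second illustrates why a slope drops under adjustment, using data simulated from a made-up model: the coefficients, sample size, and seed below are assumptions chosen for illustration only and are not the course dataset.

```r
# (a) Slope CI from the provisional numbers quoted for Dataset R
slope <- 1.6; se_slope <- 0.22; n_r <- 120
ci <- slope + c(-1, 1) * qt(0.975, df = n_r - 2) * se_slope
round(ci, 2)

# (b) Hypothetical simulation: hours and pretest are correlated, both raise final
set.seed(1)                       # arbitrary seed, NOT the course seed
n       <- 2000
pretest <- rnorm(n, mean = 60, sd = 10)
hours   <- 5 + 0.1 * pretest + rnorm(n, sd = 2)
final   <- 30 + 1.1 * hours + 0.3 * pretest + rnorm(n, sd = 8)

simple   <- coef(lm(final ~ hours))["hours"]       # ignores pretest
adjusted <- coef(lm(final ~ hours + pretest))      # holds pretest fixed
aux      <- coef(lm(pretest ~ hours))["hours"]     # how pretest tracks hours

# Omitted-variable identity: simple = partial + b_pretest * aux (exact in OLS)
rebuilt <- adjusted["hours"] + adjusted["pretest"] * aux
round(c(simple = unname(simple), partial = unname(adjusted["hours"]),
        rebuilt = unname(rebuilt)), 3)
```

The identity in the last step makes the gap between the simple and partial slopes concrete: it is the omitted predictor's own slope times how strongly that predictor moves with hours.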
Groups plus a covariate — ANCOVA (structure: Dataset F + pretest)
Question and structure. Take the four-format comparison of Dataset F and add a pretest covariate that correlates with the final score (\(r \approx 0.50\) within group). If the formats started at slightly different baselines, a raw comparison confounds format with baseline. The question becomes: do the formats differ at the same baseline?
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| ANCOVA (group means adjusted for the covariate) | the usual ANOVA assumptions, plus parallel slopes (homogeneity of regression: format × pretest interaction NS, \(p \approx 0.5\) — valid) | adjusted means \(L\,74.5, LL\,80.6, O\,70.9, H\,78.1\) (gaps shrink); covariate \(F \approx 30\), \(p < 0.001\); format after adjustment \(F \approx 6.2\) on \((3,95)\), \(p \approx 0.0007\), \(\eta^2_{\text{partial}} \approx 0.16\) (down from \(0.19\)) | the format effect adjusted for baseline readiness; not causation (formats still observational) |
| Unadjusted one-way ANOVA | ignores the covariate | the raw means (\(\eta^2 \approx 0.19\)) | a comparison that confounds format with baseline differences |
What this says. Adjustment shrinks the format gaps and the effect size (\(\eta^2\,0.19 \to 0.16\)) — some of the apparent format advantage was really baseline advantage. ANCOVA is only valid when the parallel-slopes assumption holds; check the format × covariate interaction first. The comparison is now “formats at the same baseline,” a cleaner estimate — but still association, not causation. See week 11.
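The adjusted effect size and the shrinking gaps can be checked from the quoted summaries. The sketch below assumes the provisional \(F \approx 6.2\) on \((3, 95)\) and the raw and adjusted means from the table; it uses the standard conversion \(\eta^2_{\text{partial}} = F\,df_1 / (F\,df_1 + df_2)\), and the variable names are illustrative.

```r
# Provisional ANCOVA summaries quoted for Dataset F + pretest (not verified)
f_format <- 6.2
df1      <- 3
df2      <- 95     # one error df is spent on the covariate slope

eta_partial <- f_format * df1 / (f_format * df1 + df2)
p_format    <- pf(f_format, df1, df2, lower.tail = FALSE)

# Raw versus covariate-adjusted means
raw      <- c(L = 74,   LL = 81,   O = 70,   H = 79)
adjusted <- c(L = 74.5, LL = 80.6, O = 70.9, H = 78.1)

round(c(eta_partial = eta_partial, p = p_format), 4)
adjusted - raw                                  # direction of each adjustment
c(raw_spread = diff(range(raw)), adjusted_spread = diff(range(adjusted)))
```

The sign of `adjusted - raw` shows which formats started ahead: a mean that moves down after adjustment was getting some credit for a higher baseline.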
Two categorical variables — the contingency table (structure: Dataset R, pass × program)
Question and structure. Cross-tabulate pass/fail against support program {None, Drop-in, Structured} (\(40\) each) — a \(3 \times 2\) table. Both variables are categorical, so means and slopes do not apply; the question is whether the two variables are associated.
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Chi-square test of independence | expected counts not too small (all \(\ge 5\) here — expected pass \(= 40(0.6) = 24\)); independent observations | pass rates None \(45\%\), Drop-in \(60\%\), Structured \(75\%\); \(\chi^2 = 3.75 + 0 + 3.75 = 7.5\) on \(2\) df, \(p \approx 0.024\) | that pass rate and program are associated; not the direction or size by itself, and not causation |
| Effect measures (report alongside the test) | a chosen reference comparison | Structured vs None: risk difference \(= 0.30\), relative risk \(\approx 1.67\), odds ratio \(\approx 3.67\) | the magnitude of association — the part the bare \(\chi^2\) omits |
What this says. A significant \(\chi^2\) alone is a verdict, not an estimate. Pair it with an effect measure — the risk difference \(0.30\), RR \(\approx 1.67\), or OR \(\approx 3.67\) for Structured vs None — so you report how much, not just whether. Note \(\mathrm{OR} \ne \mathrm{RR}\); say which you mean. And because students self-select into programs, a significant association is not proof the program caused passing. See week 12.
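The test and the three effect measures can be computed from the cell counts implied by the quoted pass rates (\(45\%\), \(60\%\), \(75\%\) of \(40\) students each, so \(18\), \(24\), and \(30\) passes). A minimal base-R sketch under that assumption, with illustrative names:

```r
# Counts implied by the provisional pass rates, 40 students per program
tab <- matrix(c(18, 22,
                24, 16,
                30, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(program = c("None", "Drop-in", "Structured"),
                              result  = c("pass", "fail")))

test <- chisq.test(tab)   # continuity correction applies only to 2 x 2 tables
test$expected             # check that every expected count is at least 5
chi_sq  <- unname(test$statistic)
p_value <- test$p.value

# Effect measures for Structured versus None
p_pass <- tab[, "pass"] / rowSums(tab)
odds   <- p_pass / (1 - p_pass)
risk_diff  <- unname(p_pass["Structured"] - p_pass["None"])
rel_risk   <- unname(p_pass["Structured"] / p_pass["None"])
odds_ratio <- unname(odds["Structured"] / odds["None"])

round(c(chi_sq = chi_sq, df = unname(test$parameter), p = p_value), 4)
round(c(RD = risk_diff, RR = rel_risk, OR = odds_ratio), 2)
```

Reporting all three measures from one table makes the gap between the relative risk and the odds ratio visible, which is why the text asks you to say which one you mean.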
A binary outcome — logistic regression (structure: Dataset R, pass)
Question and structure. The outcome pass \(= (\text{final} \ge 70)\) is binary, with quantitative and categorical predictors. Linear regression is wrong for a 0/1 outcome (it can predict probabilities outside \([0,1]\)); logistic regression models the log-odds and lets you adjust several predictors at once.
| Method | Key assumption | Estimate (with uncertainty) | Conclusion it can / can’t support |
|---|---|---|---|
| Logistic regression | a linear logit; independent observations; enough events per predictor | \(\mathrm{logit}(\hat p) = b_0 + 0.22\cdot\text{hours} + 0.04\cdot\text{pretest} + 0.6\,[\text{Drop-in}] + 1.0\,[\text{Structured}]\); OR per study-hour \(= e^{0.22} \approx 1.25\); OR Structured vs None (adjusted) \(= e^{1.0} \approx 2.72\) | the adjusted change in odds; a predicted probability (\(\approx 0.56\) high-effort Structured vs \(\approx 0.05\) low-effort None); not causation |
| Reading the raw logit as a probability | (a tempting error) | the coefficient \(1.0\) is on the log-odds scale | nothing on the probability scale until you exponentiate and back-transform |
What this says. Coefficients live on the log-odds scale: exponentiate to an odds ratio, and read a predicted probability (the S-curve \(p = 1/(1+e^{-\eta})\)), never the raw logit, as the conclusion. The adjusted OR for Structured vs None shrinks from the raw \(3.67\) to \(\approx 2.72\) once you adjust for hours and pretest — confounding again, the throughline of this dataset. And \(\mathrm{OR} \ne \mathrm{RR}\): an odds ratio of \(2.72\) is not “\(2.72\) times as likely to pass.” See week 13 and its lab.
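The scale changes described here can be sketched in base R. The coefficients are the provisional ones quoted in the table. The baseline probability of \(0.45\) is borrowed from the raw None pass rate purely as an illustrative starting point, and the intercept is not given on this page, so the sketch converts between scales rather than reproducing the fitted predictions.

```r
# Provisional coefficients quoted for the logistic model (not verified)
b_hours      <- 0.22
b_structured <- 1.0

or_hours      <- exp(b_hours)        # odds ratio per extra study-hour
or_structured <- exp(b_structured)   # adjusted OR, Structured vs None

# OR is not RR: apply the adjusted OR to an illustrative baseline probability
p_none          <- 0.45              # assumption for illustration only
odds_structured <- or_structured * p_none / (1 - p_none)
p_structured    <- odds_structured / (1 + odds_structured)
rel_risk        <- p_structured / p_none

# Moving between the logit and probability scales
logits <- qlogis(c(high = 0.56, low = 0.05))   # logits of the quoted probabilities
back   <- plogis(logits)                       # the S-curve 1 / (1 + exp(-eta))

round(c(or_hours = or_hours, or_structured = or_structured), 2)
round(c(p_structured = p_structured, rel_risk = rel_risk), 2)
round(rbind(logit = logits, probability = back), 3)
```

At this baseline the implied relative risk is well below the odds ratio, and the gap widens as the baseline probability rises, so an odds ratio should not be read as “times as likely.”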
A compact decision table
One screen, the whole guide. Read across: structure → method → what it estimates. The estimate column is the one the course cares about most — it is never a bare p-value.
| If the outcome is… | …and the structure is… | Method | It estimates | Key assumption | Week |
|---|---|---|---|---|---|
| quantitative | one group / paired (P) | paired (one-sample) \(t\) | mean difference + CI; \(d_z\) | normal differences | 4 |
| quantitative | two independent groups (G) | Welch two-sample \(t\) | mean difference + CI; Cohen’s \(d\) | approx. normal; unequal var OK | 5 |
| quantitative | one factor, \(\ge 3\) groups (F) | one-way ANOVA (+ Tukey / contrast) | which means differ; \(\eta^2\) | equal variances; normal residuals | 6–8 |
| quantitative | two crossed factors (X) | two-way ANOVA | main effects + interaction | equal cell variances | 9 |
| quantitative | a quantitative predictor (R) | simple / multiple regression | slope + CI; \(R^2\) | linearity; constant variance | 10 |
| quantitative | groups + covariate (F + pretest) | ANCOVA | adjusted means; partial \(\eta^2\) | parallel slopes | 11 |
| categorical | two categorical variables (R) | chi-square + effect measure | association; RD / RR / OR | expected counts \(\ge 5\) | 12 |
| binary | binary outcome + predictors (R) | logistic regression | odds ratio; predicted probability | linear logit | 13 |
A small R idiom for each, shown for the shape of the call only — not executed in this build (R is not installed; set.seed(35203) where randomness would enter):
set.seed(35203)
# paired / one-sample (P)
t.test(post, pre, paired = TRUE)
# two independent groups, Welch (G) — the safe default
t.test(final ~ group, data = G) # var.equal = FALSE by default
# one-way ANOVA + Tukey (F)
fit <- aov(final ~ format, data = F); summary(fit); TukeyHSD(fit)
# two-way ANOVA with interaction (X) — read the interaction row first
summary(aov(final ~ delivery * background, data = X))
# multiple regression (R)
summary(lm(final ~ hours + attendance + pretest, data = R))
# ANCOVA: covariate first, then the group factor (F + pretest)
summary(aov(final ~ pretest + format, data = F))
# contingency table (R: pass x program)
chisq.test(table(R$program, R$pass))
# logistic regression (R) — coefficients are log-odds; exponentiate for ORs
fit <- glm(pass ~ hours + pretest + program, family = binomial, data = R)
exp(coef(fit)); exp(confint(fit)) # odds ratios with CIs
A note on choosing: purpose over reflex
The whole guide reduces to a few sentences worth carrying. Each drift below has a disciplined move that keeps you inside the blueprint:
| Drift to resist | The disciplined move |
|---|---|
| reporting a bare p-value as the result | report the estimate with its CI / effect size — a mean difference, a slope, an OR |
| running an independent test on paired data | preserve the pairing; analyze the differences (Dataset P) |
| treating an observational association as causal | say “associated with,” not “causes”; name the likely confound |
| ignoring unequal variance in two groups | default to Welch; check spreads before pooling |
| chasing every pairwise difference | use Tukey / a planned contrast; control the family-wise error |
| reading main effects past a real interaction | read the interaction plot first; report conditional effects (Dataset X) |
| deleting an influential point silently | investigate, do not auto-delete; report with and without |
| reading a logit coefficient as a probability | exponentiate to an OR, back-transform to a predicted probability |
| confusing statistical with practical significance | judge the estimate against the scale, not against \(p < 0.05\) |
When a method’s assumptions genuinely hold and its estimate answers your question, it is the right, efficient choice — say why this one, report the estimate with its uncertainty, and bound the conclusion to what the design (observational vs experimental) can support. Match the method to the question and the structure, not to habit.
Evidence and verification status
verified: false. The decision logic and the blueprint framing on this page are course-authored, but every numeric value referenced here is drafted, synthetic, and not independently checked. That covers P’s paired mean difference \(+6\), \(t \approx 3.65\), CI \((2.6, 9.4)\), \(d_z \approx 0.67\) and the independent-SE contrast \(\approx 2.97\); G’s difference \(6\), Welch \(t \approx 2.53\), CI \((1.3, 10.7)\), \(d \approx 0.53\); F’s means \(L\,74, LL\,81, O\,70, H\,79\), \(F \approx 7.61\), \(\eta^2 \approx 0.19\), Tukey critical difference \(\approx 6.64\), contrast \(\hat\psi = 8\), and the ANCOVA adjusted means (\(74.5, 80.6, 70.9, 78.1\)) with \(F \approx 6.2\), \(\eta^2_{\text{partial}} \approx 0.16\); X’s cell means (\(73, 85, 62, 83\)) and \(F\)’s (\(\approx 10.4, 67.2, 5.0\)); and R’s slopes (\(1.6 \to 1.1\)), \(R^2\) (\(0.30, 0.46\)), \(\chi^2 = 7.5\) with RD \(0.30\) / RR \(\approx 1.67\) / OR \(\approx 3.67\), and the logistic ORs (\(e^{0.22} \approx 1.25\), \(e^{1.0} \approx 2.72\)) with predicted probabilities (\(\approx 0.56, 0.05\)). The data are simulated with set.seed(35203) and R is not executed in this build, so treat these worked numbers as provisional targets to reproduce, not as confirmed reference values.
Public vs. graded
These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded applied-methods checkpoints, weekly quizzes, homework and analysis memos, applied analysis labs, the midterm, the applied methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.
See also
- Methods glossary — the vocabulary and notation behind every term used here.
- Assumptions & diagnostics guide — what each method assumes and how to check it (normality, equal variance, parallel slopes, expected counts, the linear logit).
- Reporting & interpretation guide — effect sizes and confidence intervals, practical vs statistical significance, and association vs causation in applied reporting.