Methods glossary

The applied-methods vocabulary, organized by theme

Keep this page open while you read the weekly notes. One discipline runs down every section: the analysis blueprint — for any method, you should be able to name its place in the same six steps. (1) the question (comparing, explaining, or predicting?), (2) the structure (the unit of analysis; response vs explanatory, grouping, or covariate; the outcome type; the design), (3) the method that matches that structure, (4) the assumptions and the diagnostics that check them, (5) the estimate with its uncertainty — a mean difference, an effect size, a slope, an odds ratio, reported with a confidence interval, never as a bare p-value, and (6) the conclusion, which keeps statistical significance, practical significance, and causation distinct. Two habits live inside the blueprint and recur on every page: report the estimate, not just a verdict, and remember that observational data buy association, not causation. Name both every time.

All numeric values mentioned come from the five synthetic Cypress Ridge College Student-Success Study datasets (P, G, F, X, R), generated with set.seed(35203), and are provisional pending review (R is not executed in this build). Use them to learn the vocabulary; do not treat any figure as a confirmed reference.

The five datasets are referenced throughout by their structure, because structure — not subject matter — is what picks the method:

| Dataset | What it holds | Its structure | What it teaches (week) |
| --- | --- | --- | --- |
| P | pre/post readiness on the same \(n = 30\) students | paired (each student is their own control) | one-sample & paired comparisons (wk 4) |
| G | final scores, Support vs Self-guided (\(n_1 = n_2 = 45\)) | two independent groups, observational | two-group comparison & effect size (wk 5) |
| F | final scores under four formats (L, LL, O, H), \(N = 100\) | one factor, many groups; + a pretest covariate | one-way ANOVA, contrasts, ANCOVA (wk 6–8, 11) |
| X | final scores in a Delivery × Background \(2\times 2\), \(N = 80\) | two factors, factorial | two-way ANOVA & interaction (wk 9) |
| R | hours/attendance/pretest/program → final & pass/fail, \(n = 120\) | regression + categorical + binary | regression, contingency, logistic (wk 10, 12, 13) |

How to read this glossary

Each entry is a one-to-three-sentence plain-language gloss, organized by the stage of the blueprint it belongs to: first the question and structure vocabulary that sets up every analysis, then estimation and uncertainty (the language of step 5), then the method families in course order — group comparisons, models (regression, ANCOVA), and categorical outcomes. The recurring discipline closes each section: an estimate is reported with its interval, and a conclusion keeps statistical, practical, and causal language apart. For the decision logic — which method a given shape calls for — see the companion method chooser; for assumption checking see the assumptions and diagnostics guide; for writing results up see the reporting and interpretation guide.

Questions & structure

Before any method, the blueprint asks two things: what are you asking? and what is the shape of the data? These are steps 1 and 2, and getting them right determines everything downstream: the wrong structure picks the wrong method (an independent test on paired data; a mean on ordinal labels).

| Term | Meaning |
| --- | --- |
| statistical question | what you are actually asking: comparing groups, explaining an outcome with predictors, or predicting a new value; different questions call for different methods even on the same numbers |
| unit of analysis | the entity one row of data describes (here, one student); the level at which the response and the design are defined |
| response \(Y\) | the outcome being modeled or compared (final score, readiness gain, pass/fail) |
| explanatory variable \(X\) | a quantitative predictor of the response (study hours, pretest, attendance in Dataset R) |
| grouping factor | a categorical variable that splits units into groups to compare (Format in Dataset F; Support vs Self-guided in G) |
| covariate | a quantitative variable you adjust for rather than the variable of interest (the pretest in ANCOVA, wk 11); it accounts for baseline differences |
| outcome type | whether \(Y\) is quantitative (final score), categorical/nominal (program), ordinal (ordered ratings), or binary (pass/fail); the single biggest driver of method choice |
| design: paired vs independent | are the two sets of measurements on the same units (paired, Dataset P) or on different units (independent, Dataset G)? The wrong call wastes power or invents it |
| design: one factor vs two | one grouping factor (Format, Dataset F) vs two crossed factors (Delivery × Background, Dataset X), which lets you ask about interaction |
| design: observational vs experimental | were groups randomly assigned (experiment, supports a causal reading) or self-selected (observational, supports only association)? Datasets G, F, and R are observational |
| confounding | a third variable tangled with both the predictor and the outcome (in G, motivated students self-select into Support; in R, students who study more also attend more), which is why adjustment changes estimates |

Blueprint move. Steps 1–2 assume nothing statistical yet — they only describe the question and the data. But they protect against the most expensive errors in the whole course: choosing a method that does not match the structure, and reading association as causation. They cannot by themselves license a causal claim — only a randomized design can, and Datasets G, F, and R are observational, so they buy association, not causation.
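
To make the paired-versus-independent call concrete, here is a minimal Python sketch (Python is used only because it runs without extra packages; the course's own analyses are in R). The five pre/post scores are hypothetical, invented for this illustration, and are not drawn from Dataset P.

```python
import math
import statistics

# Hypothetical pre/post scores for five students (NOT from Dataset P).
pre = [60, 70, 80, 65, 75]
post = [66, 75, 87, 70, 82]
n = len(pre)

# Paired analysis: work with each student's own difference.
diffs = [after - before for before, after in zip(pre, post)]
mean_diff = statistics.mean(diffs)
paired_se = statistics.stdev(diffs) / math.sqrt(n)

# Independent analysis (the wrong structure here): treats the two columns
# as unrelated groups, so between-student variation stays in the SE.
indep_se = math.sqrt(statistics.variance(pre) / n + statistics.variance(post) / n)

print(f"mean difference = {mean_diff}")
print(f"paired SE       = {paired_se:.3f}")
print(f"independent SE  = {indep_se:.3f}")
```

The mean difference is the same under both readings, but the paired SE is much smaller because pairing removes between-student variation. That gap is the power the wrong structure would waste.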

Estimation & uncertainty

Step 5 of the blueprint, and the course’s loudest discipline: report the estimate with its uncertainty, not just a verdict. A p-value alone tells you almost nothing useful — a tiny effect can be “significant” with enough data, and a large effect can be “non-significant” with too little.
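
A minimal Python sketch of that point (the course itself works in R). It assumes a large-sample normal approximation for two equal-sized independent groups, where the test statistic is roughly \(d\sqrt{n/2}\) for a standardized difference \(d\). The effect sizes and sample sizes are hypothetical, not taken from any course dataset, and for the small-sample case an exact \(t\)-based p-value would be somewhat larger.

```python
import math

def approx_two_sided_p(d, n_per_group):
    """Normal-approximation p-value for a standardized mean difference d
    between two equal-sized independent groups (not an exact t-test)."""
    z = d * math.sqrt(n_per_group / 2)
    # Two-sided standard-normal tail area: P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical scenarios, for illustration only.
p_tiny_effect_big_n = approx_two_sided_p(d=0.05, n_per_group=10_000)
p_large_effect_small_n = approx_two_sided_p(d=0.80, n_per_group=5)

print(f"d = 0.05, n = 10000 per group: p ~ {p_tiny_effect_big_n:.4f}")
print(f"d = 0.80, n = 5 per group:     p ~ {p_large_effect_small_n:.3f}")
```

The verdict flips with sample size while the effect size does not, which is why the estimate and its interval carry the finding.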

| Term | Meaning |
| --- | --- |
| point estimate | the single best value of the quantity you care about: a mean \(\bar x\), a mean difference \(\bar d\) or \(\bar x_1 - \bar x_2\), a slope \(\hat\beta\), an odds ratio \(\widehat{\mathrm{OR}}\) |
| standard error (SE) | the standard deviation of the point estimate across hypothetical repeated samples (how much it would wobble); on Dataset P the paired SE is \(s_d/\sqrt n = 9/\sqrt{30} \approx 1.64\) |
| confidence interval (CI) | a range of plausible values for the true quantity being estimated; on Dataset G the difference of \(6.0\) points has a 95% CI of \((1.3, 10.7)\); report this, not the p-value alone |
| p-value | the probability of a result at least this extreme if the null were true; it is not the probability the null is true, and not a measure of effect size; on Dataset G, \(p \approx 0.013\) |
| effect size | a scale-free or scale-meaningful measure of how big the effect is (Cohen’s \(d\), \(\eta^2\), \(R^2\), a risk difference, relative risk, or odds ratio); the antidote to a bare p-value |
| Cohen’s \(d\) | a standardized mean difference, \((\bar x_1 - \bar x_2)/s_p\); on Dataset G, \(d = 6/11.27 \approx 0.53\) (a medium effect, \(\approx\) half a standard deviation) |
| statistical significance | the data are inconsistent with the null at your chosen level (\(p\) small); a statement about evidence against chance, nothing more |
| practical significance | whether the effect is big enough to matter in context; a \(+6\)-point gain on a 100-point readiness scale (Dataset P) is modest-to-meaningful; statistical \(\ne\) practical |

Blueprint move. Estimation assumes a model for the sampling variability (which the method-specific sections below spell out); it delivers a point estimate plus an interval; it protects against the false confidence of a lone p-value; it cannot tell you whether an effect matters — that judgment is practical significance, and it lives in the subject context, not in the arithmetic. The classic error this guards against: reporting “\(p < 0.05\)” as if it were the whole finding.
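
The Dataset P and Dataset G figures quoted in this section can be re-derived from their summary statistics. The sketch below is in Python rather than the course's R. The two critical \(t\) values are read from a standard \(t\) table and hard-coded as assumptions rather than computed, and every input is one of the provisional values quoted above.

```python
import math

# Dataset P (paired): provisional summary values quoted in the glossary.
n_p, mean_d, sd_d = 30, 6.0, 9.0
se_paired = sd_d / math.sqrt(n_p)
t_crit_29 = 2.045  # assumption: two-sided 95% critical t for 29 df, from a t table
ci_p = (mean_d - t_crit_29 * se_paired, mean_d + t_crit_29 * se_paired)
d_z = mean_d / sd_d

# Dataset G (two independent groups of 45): provisional summary values.
n1 = n2 = 45
diff, s_pooled = 6.0, 11.27
se_pooled = s_pooled * math.sqrt(1 / n1 + 1 / n2)
t_crit_88 = 1.987  # assumption: two-sided 95% critical t for 88 df, from a t table
ci_g = (diff - t_crit_88 * se_pooled, diff + t_crit_88 * se_pooled)
cohens_d = diff / s_pooled

print(f"P: SE = {se_paired:.2f}, 95% CI = ({ci_p[0]:.1f}, {ci_p[1]:.1f}), d_z = {d_z:.2f}")
print(f"G: SE = {se_pooled:.2f}, 95% CI = ({ci_g[0]:.1f}, {ci_g[1]:.1f}), d = {cohens_d:.2f}")
```

Each line reports an estimate, its interval, and an effect size together, which is the step-5 habit in miniature.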

Group comparisons

When the question is do these groups differ?, the structure decides the method: paired vs independent, two groups vs many, one factor vs two. Each is the same blueprint with a different estimate — a mean difference, a contrast, an \(F\)-ratio — always reported with its interval.

| Term | Meaning |
| --- | --- |
| paired \(t\)-test | compares the same units measured twice by analyzing the \(n\) paired differences \(d_i\); on Dataset P, \(\bar d = +6.0\), \(t = 6.0/1.64 \approx 3.65\) on 29 df, \(p \approx 0.001\), 95% CI \((2.6, 9.4)\) |
| \(d_z\) (paired effect size) | the standardized paired difference \(\bar d / s_d\); on Dataset P, \(6/9 \approx 0.67\) |
| independent (two-sample) \(t\)-test | compares different units in two groups via \(\bar x_1 - \bar x_2\); on Dataset G the difference is \(6.0\) points |
| pooled \(t\) | the two-sample test that assumes equal variances, using a pooled SD \(s_p\); on Dataset G, \(s_p \approx 11.27\), \(\mathrm{SE} \approx 2.38\), \(t \approx 2.53\) |
| Welch \(t\) | the two-sample test that does not assume equal variances (the safer default); on Dataset G it gives \(\mathrm{SE} \approx 2.38\), df \(\approx 86\), \(p \approx 0.013\), nearly identical to the pooled test here because the groups are the same size (\(n_1 = n_2\)) and have similar spread |
| one-way ANOVA | compares several group means at once via the \(F\)-ratio; on Dataset F (four formats), \(F = \mathrm{MS}_{\text{between}}/\mathrm{MSE} = 616.7/81 \approx 7.61\) on \((3, 96)\), \(p \approx 0.0001\) |
| \(\mathrm{MSE}\) | the within-group mean squared error: the pooled variance estimate, the ANOVA’s yardstick for noise; on Dataset F, \(\mathrm{MSE} = 81\) (so a within-group SD \(\approx 9\)) |
| \(\eta^2\) (eta-squared) | the share of total variance explained by the grouping factor; on Dataset F, \(\eta^2 = 1850/9626 \approx 0.19\), so format explains \(\approx 19\%\) of score variance |
| multiple comparisons problem | running many pairwise tests inflates the chance of a false positive somewhere (family-wise error); Dataset F’s four formats give six pairwise comparisons, so testing each at \(\alpha = 0.05\) with no adjustment pushes the family-wise rate well above \(0.05\) (about \(0.26\) if the six tests were independent) |
| Tukey HSD | a post-hoc procedure that controls family-wise error across all pairwise comparisons; on Dataset F the critical difference is \(\approx 6.64\), so \(LL{-}O = 11\), \(H{-}O = 9\), \(LL{-}L = 7\) are significant but \(H{-}L = 5\), \(L{-}O = 4\), \(LL{-}H = 2\) are not |
| Bonferroni | a simpler, more conservative multiplicity control: divide \(\alpha\) by the number of comparisons; trades power for a guaranteed family-wise rate |
| planned contrast | a pre-specified linear combination of means \(\hat\psi = \sum c_j \bar x_j\) (with \(\sum c_j = 0\)), more powerful than post-hoc for a planned question; on Dataset F the “hands-on vs delivered-only” contrast is \(\hat\psi = 80 - 72 = 8\) points, \(\mathrm{SE} = 1.8\), \(t \approx 4.44\), \(p < 0.001\) |
| main effect | the average effect of one factor across the levels of the other; on Dataset X, the Delivery margin is In-person \(79\) vs Online \(72.5\) (gap \(6.5\)) |
| interaction | when the effect of one factor depends on the level of the other (non-parallel lines); on Dataset X the In-person advantage is \(11\) points for weak-background but only \(2\) for strong-background students (interaction \(F \approx 5.0\), \(p \approx 0.028\)) |
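
The one-way ANOVA figures for Dataset F follow from the two sums of squares quoted above. Here is a Python sketch of that arithmetic (the course itself uses R). The equal group size of 25 per format and the half-weight contrast coefficients are assumptions made for this illustration; they are used because they reproduce the quoted contrast SE of \(1.8\).

```python
import math

# Provisional Dataset F summaries quoted in the glossary.
ss_between, ss_total = 1850.0, 9626.0
k, n_total = 4, 100                      # four formats, N = 100
df_between, df_within = k - 1, n_total - k

ms_between = ss_between / df_between
mse = (ss_total - ss_between) / df_within
f_ratio = ms_between / mse
eta_sq = ss_between / ss_total

# Planned contrast "hands-on vs delivered-only": psi-hat = 80 - 72 = 8.
# Assumptions: 25 students per format, coefficients (+1/2, +1/2, -1/2, -1/2).
n_per_group = 25
coefs = [0.5, 0.5, -0.5, -0.5]
psi_hat = 80.0 - 72.0
se_psi = math.sqrt(mse * sum(c ** 2 for c in coefs) / n_per_group)
t_contrast = psi_hat / se_psi

print(f"MSE = {mse:.1f}, F = {f_ratio:.2f} on ({df_between}, {df_within}) df")
print(f"eta^2 = {eta_sq:.3f}")
print(f"contrast SE = {se_psi:.2f}, t = {t_contrast:.2f}")
```

The same MSE serves as the noise yardstick for both the overall \(F\) and the contrast, which is why the two results are consistent with each other.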

Blueprint move. Each comparison assumes its own conditions: the right dependence structure (paired differences or independent groups), approximate normality of the response or the differences, and (for the pooled \(t\) and ANOVA) roughly equal variances. Each estimates a quantity you report with an interval: a mean difference, a contrast \(\hat\psi\), an \(\eta^2\). What none of them can prove is causation when the groups were not randomized; Datasets G and F are observational, so a significant difference is an association. Two locked cautions: read the interaction before the main effects (on Dataset X, “Online is \(6.5\) worse” is misleading because it depends on background), and control the error rate when you make many comparisons.
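
Both cautions reduce to short arithmetic. The Python sketch below (the course's analyses are in R) reads the Dataset X interaction as a difference of differences, using the two simple effects quoted for weak-background and strong-background students. It then shows how the family-wise error rate grows across all pairwise comparisons among four groups. Treating the six tests as independent is a simplifying assumption; real pairwise comparisons share data, so the figure is a rough guide rather than an exact rate.

```python
from math import comb

# Dataset X: In-person advantage within each background level (provisional values).
gap_weak, gap_strong = 11.0, 2.0
# With equal cell sizes (an assumption), the main effect is the average simple effect.
main_effect = (gap_weak + gap_strong) / 2
interaction = gap_weak - gap_strong  # difference of differences

# Multiplicity: all pairwise comparisons among k groups, each tested at alpha.
k, alpha = 4, 0.05
m = comb(k, 2)
fwer_if_independent = 1 - (1 - alpha) ** m  # assumption: independent tests
bonferroni_alpha = alpha / m

print(f"main effect = {main_effect}, interaction contrast = {interaction}")
print(f"{m} comparisons: family-wise rate ~ {fwer_if_independent:.3f}, "
      f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```

The averaged gap hides the fact that the In-person advantage is large for one background group and small for the other.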

Models — regression & ANCOVA

When the question shifts from which group? to how does the outcome change with a predictor? — or what is the group effect after adjusting for a covariate? — you fit a model. The estimate becomes a slope (or an adjusted mean), and adjustment is where confounding shows its hand.

| Term | Meaning |
| --- | --- |
| simple linear regression | one quantitative predictor: \(\hat Y = b_0 + b_1 X\); on Dataset R, \(\widehat{\text{final}} = 55 + 1.6\cdot\text{hours}\), so each extra study-hour/week is associated with \(+1.6\) final points |
| slope and its CI | the estimated change in \(Y\) per one-unit change in \(X\), reported with an interval; on Dataset R the hours slope is \(1.6\), \(\mathrm{SE} \approx 0.22\), \(t \approx 7.3\), 95% CI \((1.16, 2.04)\) |
| multiple regression | several predictors at once: \(\hat Y = b_0 + b_1 X_1 + b_2 X_2 + \dots\); on Dataset R, \(\widehat{\text{final}} = 30 + 1.1\cdot\text{hours} + 0.25\cdot\text{att} + 0.30\cdot\text{pretest}\) |
| partial (adjusted) slope | a coefficient in a multiple regression: the estimated change in \(Y\) per unit of one predictor, holding the others fixed; on Dataset R the hours slope drops \(1.6 \to 1.1\) after adjustment, because students who study more also attend more and start higher (confounding) |
| \(R^2\) | the share of the response’s variance the model explains; on Dataset R simple regression gives \(R^2 \approx 0.30\), the multiple model \(\approx 0.46\) |
| residual | the gap between an observed value and the fitted line, \(y_i - \hat y_i\); residual patterns are the main regression diagnostic (Dataset R residual SD \(\approx 10\)) |
| leverage | how unusual a point’s predictor values are; a high-leverage point can pull the line toward itself (Dataset R has one such student: investigate, do not drop) |
| influence | how much a single point actually moves the fitted estimates (e.g. Cook’s distance); leverage is potential, influence is realized |
| multicollinearity / VIF | predictors that are correlated with each other inflate their SEs; the variance inflation factor measures it; on Dataset R, hours and attendance correlate \(r \approx 0.45\), VIF \(\approx 1.3\) (fine) |
| ANCOVA | compares group means adjusted for a quantitative covariate; on Dataset F, adjusting final scores for the pretest gives adjusted format means \(\approx L\,74.5\), \(LL\,80.6\), \(O\,70.9\), \(H\,78.1\) (gaps shrink: some advantage was baseline) |
| adjusted means | the group means you would expect if every group had the same covariate value; the fair comparison ANCOVA produces |
| parallel-slopes assumption | ANCOVA assumes the covariate’s slope is the same in every group (no factor × covariate interaction); on Dataset F that interaction is not significant (\(p \approx 0.5\)), so the data give no evidence against parallel slopes and the ANCOVA is reasonable |
| extrapolation | predicting outside the observed range of \(X\); unsupported by the data and a common reporting error |
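
Three of the Dataset R entries above are quick to re-derive. In this Python sketch (the course's own work is in R), the critical \(t\) value is read from a table and hard-coded as an assumption. The VIF uses the two-predictor shortcut \(1/(1 - r^2)\), which ignores the pretest and is therefore only an approximation to the three-predictor model. The 10 hours/week used for the fitted value is a hypothetical input assumed to lie inside the observed range.

```python
# Provisional Dataset R values quoted in the glossary.
b0, b1, se_b1 = 55.0, 1.6, 0.22
t_crit_118 = 1.98  # assumption: two-sided 95% critical t for 118 df, from a t table

t_stat = b1 / se_b1
ci_slope = (b1 - t_crit_118 * se_b1, b1 + t_crit_118 * se_b1)

def predict_final(hours):
    """Fitted final score from the simple regression line."""
    return b0 + b1 * hours

# With exactly two predictors, VIF reduces to 1 / (1 - r^2).
# Assumption: only hours and attendance are considered (pretest is ignored).
r_hours_att = 0.45
vif_two_predictors = 1 / (1 - r_hours_att ** 2)

print(f"t = {t_stat:.1f}, 95% CI = ({ci_slope[0]:.2f}, {ci_slope[1]:.2f})")
print(f"fitted final at 10 hours/week = {predict_final(10):.1f}")
print(f"VIF = {vif_two_predictors:.2f}")
```

Calling `predict_final` with a value far outside the observed hours would still return a number, which is exactly why extrapolation has to be policed by the analyst rather than by the arithmetic.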

Blueprint move. Regression and ANCOVA assume a roughly linear relationship, roughly normal and constant-variance residuals, and (for ANCOVA) parallel slopes. They estimate a slope or an adjusted mean, each with a CI. They protect against nothing automatically: a partial slope is only “all else equal” for the variables you actually included, so an unmeasured confounder still biases it. They cannot prove causation from observational data — adjustment narrows but does not close the gap between association and causation. The locked lesson links the weeks: on Dataset R the hours slope shrinks \(1.6 \to 1.1\) under adjustment, exactly as the format effect shrinks under ANCOVA on Dataset F — adjustment changes the estimate, and that change is the finding.
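
A toy illustration of why a slope shrinks under adjustment, written in plain Python (the course uses R). The data are hypothetical and noise-free, built so that the second predictor rises with the first and the response depends on both; they are not Dataset R. The two-predictor fit is solved in closed form from the centered normal equations.

```python
def centered_cross_products(x, y):
    """Sum of products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

# Hypothetical toy data (NOT Dataset R): x2 rises with x1, and y depends on both.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 4]
y = [2 + 1.0 * a + 2.0 * b for a, b in zip(x1, x2)]  # true partial slope on x1 is 1.0

s11 = centered_cross_products(x1, x1)
s22 = centered_cross_products(x2, x2)
s12 = centered_cross_products(x1, x2)
s1y = centered_cross_products(x1, y)
s2y = centered_cross_products(x2, y)

# Simple regression of y on x1 alone: the slope absorbs part of x2's contribution.
simple_slope = s1y / s11

# Two-predictor least squares, closed form from the normal equations.
partial_slope = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

print(f"simple slope on x1   = {simple_slope:.2f}")
print(f"adjusted slope on x1 = {partial_slope:.2f}")
```

The simple slope is inflated because `x1` stands in for the omitted `x2`. Any predictor left out of the model can still do the same thing to the adjusted slope.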

Categorical outcomes

When the outcome itself is a category — which program, pass or fail — means and slopes give way to counts, rates, and odds. The estimates are a risk difference, a relative risk, an odds ratio, or a predicted probability, and the association-vs-causation discipline is sharpest here because these data are observational.

| Term | Meaning |
| --- | --- |
| contingency table | a cross-tabulation of counts for two categorical variables; on Dataset R the \(3\times 2\) program × pass/fail table has pass/fail counts None \(18/22\), Drop-in \(24/16\), Structured \(30/10\) (pass rates \(45\%\), \(60\%\), \(75\%\)) |
| expected counts | the counts you would see under independence (no association); on Dataset R, expected passes per program \(= 40 \times 0.60 = 24\); all expected counts are \(\ge 5\), so the chi-square approximation is reasonable |
| chi-square test | tests whether two categorical variables are associated, \(\chi^2 = \sum (O-E)^2/E\) on \((r-1)(c-1)\) df; on Dataset R, \(\chi^2 = 3.75 + 0 + 3.75 = 7.5\) on 2 df, \(p \approx 0.024\) |
| risk (rate) | the probability of the outcome in a group; here the pass rate (\(45\%\) for None, \(75\%\) for Structured) |
| risk difference | one rate minus another; Structured vs None on Dataset R is \(0.75 - 0.45 = 0.30\), a 30-percentage-point gap in pass rate |
| relative risk (RR) | the ratio of two risks; Structured vs None is \(0.75/0.45 \approx 1.67\), so Structured students pass at \(1.67\times\) the rate of None students |
| odds | the ratio of the chance of an event to the chance of no event, \(p/(1-p)\); distinct from a probability |
| odds ratio (OR) | the ratio of two odds; Structured vs None on Dataset R is \(3.0/0.818 \approx 3.67\); crucially, \(\mathrm{OR} \ne \mathrm{RR}\) (the OR is larger here) |
| logistic regression | a model for a binary outcome on the log-odds scale: \(\mathrm{logit}(p) = \ln\frac{p}{1-p} = b_0 + \sum b_k X_k\); on Dataset R, \(\mathrm{logit}(\hat p) = b_0 + 0.22\cdot\text{hours} + 0.04\cdot\text{pretest} + 0.6[\text{Drop-in}] + 1.0[\text{Structured}]\) |
| log-odds (logit) | the scale logistic coefficients live on, not probabilities; you exponentiate a coefficient to read it as an odds ratio |
| odds ratio from a coefficient | \(e^{b_k}\); on Dataset R, \(e^{0.22} \approx 1.25\) per study-hour, and \(e^{1.0} \approx 2.72\) for Structured vs None, which shrinks from the raw \(3.67\) after adjusting for hours and pretest (confounding again) |
| predicted probability | the S-curve \(p = 1/(1+e^{-\eta})\) that turns a log-odds back into a probability (the readable conclusion); on Dataset R a high-effort Structured student is at \(\approx 0.56\) vs a low-effort None student at \(\approx 0.05\) (illustrative) |
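
The contingency-table entries for Dataset R can be recomputed directly from the six counts. This is a Python sketch (the course uses R) built on the provisional counts quoted above. The p-value uses the closed form \(e^{-\chi^2/2}\), which holds only for exactly 2 degrees of freedom.

```python
import math

# Provisional Dataset R counts quoted in the glossary: program -> (pass, fail).
table = {"None": (18, 22), "Drop-in": (24, 16), "Structured": (30, 10)}

n_total = sum(p + f for p, f in table.values())
pass_rate_overall = sum(p for p, _ in table.values()) / n_total

chi_sq = 0.0
for passed, failed in table.values():
    row_n = passed + failed
    exp_pass = row_n * pass_rate_overall        # expected count under independence
    exp_fail = row_n * (1 - pass_rate_overall)
    chi_sq += (passed - exp_pass) ** 2 / exp_pass + (failed - exp_fail) ** 2 / exp_fail

df = (len(table) - 1) * (2 - 1)
# For exactly 2 df the chi-square upper tail has the closed form exp(-x / 2).
p_value = math.exp(-chi_sq / 2)

def risk(group):
    passed, failed = table[group]
    return passed / (passed + failed)

def odds(group):
    passed, failed = table[group]
    return passed / failed

risk_diff = risk("Structured") - risk("None")
rel_risk = risk("Structured") / risk("None")
odds_ratio = odds("Structured") / odds("None")

print(f"chi-square = {chi_sq:.2f} on {df} df, p = {p_value:.3f}")
print(f"RD = {risk_diff:.2f}, RR = {rel_risk:.2f}, OR = {odds_ratio:.2f}")
```

Printing the risk difference, relative risk, and odds ratio side by side makes it plain that the three describe the same table on different scales.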

Blueprint move. These methods assume independent observations and, for the chi-square, large-enough expected counts; logistic regression assumes the log-odds are linear in the predictors. They estimate a risk difference, a relative risk, an odds ratio, or a predicted probability — each reportable with an interval. They protect against misreading the scale only if you are careful: a logistic coefficient is a log-odds, so exponentiate to an OR and read a predicted probability as the conclusion, never the raw logit; and \(\mathrm{OR} \ne \mathrm{RR}\). What they cannot prove is that a program caused passing: on Dataset R students self-select into programs, so the significant association is not proof of cause — and the adjusted OR shrinking from \(3.67\) to \(2.72\) shows how much of the raw association was confounding.
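
To show the scale conversions in practice, here is a short Python sketch (the course itself works in R) that exponentiates the quoted Dataset R coefficients into odds ratios and turns a log-odds into a probability. The glossary does not state the intercept \(b_0\), so the value used below is a hypothetical placeholder. The student profile (10 hours, pretest 70) is also hypothetical, so the printed probability is not a Dataset R result.

```python
import math

def inv_logit(eta):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Provisional Dataset R coefficients quoted in the glossary (log-odds scale).
slopes = {"hours": 0.22, "pretest": 0.04}
program_effects = {"None": 0.0, "Drop-in": 0.6, "Structured": 1.0}  # None = reference

# Exponentiate each coefficient to read it as an odds ratio.
odds_ratios = {name: math.exp(b) for name, b in {**slopes, **program_effects}.items()}

# HYPOTHETICAL intercept: the glossary does not give b0; this value only
# lets the example run and is not an estimate from Dataset R.
b0_hypothetical = -6.0

def predicted_pass_probability(hours, pretest, program="None"):
    eta = (b0_hypothetical
           + slopes["hours"] * hours
           + slopes["pretest"] * pretest
           + program_effects[program])
    return inv_logit(eta)

p_structured = predicted_pass_probability(10, 70, "Structured")
p_none = predicted_pass_probability(10, 70, "None")

print(f"OR per study-hour = {odds_ratios['hours']:.2f}")
print(f"OR Structured vs None = {odds_ratios['Structured']:.2f}")
print(f"hypothetical student: Structured {p_structured:.2f} vs None {p_none:.2f}")
```

The odds ratios do not depend on the intercept, but every predicted probability does, so a probability can only be reported once the full fitted model is known.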

A note on using the glossary — blueprint over catalog

The temptation this glossary works against is treating its entries as a box of named tests — “use the paired \(t\) here, the chi-square there.” They are not a catalog. Every term above is one expression of the same six steps, and the point is to see the connections:

| Drift to resist | The disciplined move |
| --- | --- |
| reporting a bare p-value | report the estimate with its confidence interval (a difference, a slope, an OR), and an effect size |
| confusing statistical with practical significance | ask separately whether the effect is real (the test) and whether it matters (the context) |
| calling an observational association causal | name the design; Datasets G, F, R are observational, so they buy association, not causation |
| using the wrong design (paired vs independent) | check the structure first; pairing (Dataset P) removes between-unit variation and is more powerful |
| ignoring unequal variance | prefer Welch unless equal variances are justified |
| many comparisons without error control | use Tukey or Bonferroni, or a pre-specified contrast |
| misreading an interaction | read the interaction before the main effects (Dataset X) |
| deleting an influential point | investigate, do not auto-delete (Dataset R’s high-leverage student) |
| reading a logit as a probability | exponentiate to an odds ratio; report a predicted probability as the conclusion |

When the structure and the question are clear, the right method usually names itself, and the estimate it produces — reported with its uncertainty, bounded by what the design can support — is the whole point of the analysis. That is the blueprint, and it is the same on every page.

Evidence and verification status

verified: false. The blueprint framing, the vocabulary, and the organization on this page are course-authored. Every numeric value referenced here is drafted, synthetic (set.seed(35203)), and provisional: R is not executed in this build, so none of the worked numbers has been independently checked. That covers Dataset P’s paired difference \(+6.0\), SE \(\approx 1.64\), paired \(t \approx 3.65\), CI \((2.6, 9.4)\), and \(d_z \approx 0.67\); Dataset G’s difference \(6.0\), \(s_p \approx 11.27\), \(t \approx 2.53\), \(p \approx 0.013\), CI \((1.3, 10.7)\), and \(d \approx 0.53\); Dataset F’s \(F \approx 7.61\), \(\eta^2 \approx 0.19\), Tukey critical difference \(\approx 6.64\), contrast \(= 8\), and the adjusted means \(L\,74.5\) / \(LL\,80.6\) / \(O\,70.9\) / \(H\,78.1\); Dataset X’s cell means, the main-effect gaps, and the interaction \(F \approx 5.0\); and Dataset R’s slopes (\(1.6\), \(1.1\)), \(R^2\) (\(0.30\), \(0.46\)), \(\chi^2 = 7.5\), risk difference \(0.30\), RR \(\approx 1.67\), OR \(3.67\) raw and \(e^{1.0} \approx 2.72\) adjusted, and the predicted probabilities (\(\approx 0.56\), \(\approx 0.05\)).

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded applied-methods checkpoints, weekly quizzes, homework and analysis memos, applied analysis labs, the midterm, the applied methods project, and the final exam live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.

See also