Week 9 — Error rates, power & decisions

What are we risking when we decide, and how often will we be wrong?

The week question

Last week you ran a test and read a \(p\)-value. This week the question turns from “what does this one result say?” to “how often will a procedure like this lead me astray, and what does being wrong cost me?” A test is a rule for deciding between two stories about the world, and any decision rule applied to noisy data will sometimes decide wrong. So the week’s question is:

When I reject or fail to reject a hypothesis, what are the two ways I can be wrong, how often does each happen, and how should the consequences of each error shape the rule I use?

This shifts attention from a single number to the long-run behavior of the procedure, and then from long-run behavior to cost. By the end you should see that choosing a significance level \(\alpha\) is not a neutral statistical formality — it is a choice about which kind of error you are more willing to tolerate.

Why this matters

Every test you will ever run lives downstream of a question someone actually cares about: should the reading program be expanded, should a drug be approved, should a manufacturing line be stopped. The test does not answer that question by itself. It produces a decision under uncertainty, and that decision can fail in two different directions with two different prices.

If you only know the \(p\)-value from one study, you know nothing about how trustworthy your rule is. A rule that rejects \(H_0\) far too easily will cry wolf; a rule that almost never rejects will miss real effects. Understanding error rates lets you reason about the rule before you ever see data — and understanding loss lets you set the rule to match what is actually at stake. This is the bridge from “statistics” to “decisions,” and it is why the course is named the way it is.

It also helps correct two misreadings that often survive from Week 8. A \(p\)-value is still not the probability that \(H_0\) is true (Risk 6), and “fail to reject” is still not “accept” or “prove” \(H_0\) (Risk 7). Error rates make the reason concrete: even a good rule is wrong a fixed fraction of the time, so a single decision can never certify a hypothesis.

Learning goals

By the end of this week you should be able to:

  • Define a Type I error (rejecting a true \(H_0\)) and a Type II error (failing to reject a false \(H_0\)), and say which conditioning each one lives under.
  • State that the Type I error rate is the significance level \(\alpha\), that the Type II error rate is \(\beta\), and that power \(= 1 - \beta\) is the probability of correctly rejecting a specified false \(H_0\).
  • Explain why power is not a single number but a function of the true parameter value, the sample size \(n\), the chosen \(\alpha\), and the variability of the data.
  • Describe the \(\alpha\)-\(\beta\) trade-off: lowering \(\alpha\) (fewer false alarms) generally raises \(\beta\) (more missed effects) for a fixed \(n\), and only more data buys down both.
  • Frame a test as a decision with a loss, written \(\operatorname{Loss}(\theta, a)\), and explain why choosing \(\alpha\) is choosing a balance between the cost of a false positive and the cost of a false negative.
  • Keep the conditioning straight: power is computed under a specific alternative, not under \(H_0\), and never read it as a probability that the alternative is true.

Core vocabulary

A test sorts every possible dataset into “reject \(H_0\)” or “do not reject \(H_0\).” Cross that decision with the unknown truth and you get a \(2\times2\) table of outcomes. Two cells are correct; two are errors.

Decision | \(H_0\) is true | \(H_0\) is false
--- | --- | ---
Reject \(H_0\) | Type I error (rate \(\alpha\)) | correct decision (power \(= 1-\beta\))
Do not reject \(H_0\) | correct decision (\(1-\alpha\)) | Type II error (rate \(\beta\))

  • Type I error — rejecting \(H_0\) when \(H_0\) is in fact true (a false positive; you claim an effect that is not there). Its long-run rate is the significance level \(\alpha\), chosen before seeing data. The rate is conditional: \(\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})\).
  • Type II error — failing to reject \(H_0\) when \(H_0\) is in fact false (a false negative; you miss a real effect). Its rate is \(\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false, at a specified }\theta)\).
  • Power \(= 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false, at a specified }\theta)\) — the probability the test catches a real effect of a given size. Power is a property of the procedure at a named alternative, not a statement about which hypothesis is true.
  • Significance level \(\alpha\) — the Type I rate you are willing to accept. The rejection region is built so that, if \(H_0\) holds, you reject only a fraction \(\alpha\) of the time.
  • Effect size — how far the truth sits from \(H_0\) (here, how far \(\theta\) is from \(0.5\)). Bigger effects are easier to detect, so power rises with effect size.
  • Loss \(\operatorname{Loss}(\theta, a)\) — the cost of taking action \(a\) when the parameter is \(\theta\). We write it out or as Loss; we never write it \(L\), which is reserved for the likelihood (Risk 15).

A vocabulary trap worth naming now: \(\alpha\) and \(\beta\) are computed under different worlds. \(\alpha\) conditions on \(H_0\) being true; \(\beta\) and power condition on a specific value in \(H_a\). They are not two slices of one probability distribution, and they do not add to anything in particular.
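To see the two conditionings side by side, here is a minimal R sketch. It assumes this page's running setup (\(n = 40\), null value \(0.5\)) and, purely for illustration, the decision rule “reject when 26 or more students pass” and the alternative \(\theta = 0.65\); the variable names are hypothetical.

```r
# Two error rates, two different worlds (illustrative; the rule and the
# alternative are assumptions made for this sketch)
n        <- 40
x_reject <- 26      # assumed rule: reject H0 when x >= 26
theta0   <- 0.5     # world 1: H0 is true
theta1   <- 0.65    # world 2: one specific alternative

# alpha = P(reject | theta = theta0), computed under the null
alpha_exact <- pbinom(x_reject - 1, size = n, prob = theta0, lower.tail = FALSE)

# beta = P(fail to reject | theta = theta1), computed under the alternative
beta_exact <- pbinom(x_reject - 1, size = n, prob = theta1)

c(alpha = alpha_exact, beta = beta_exact, sum = alpha_exact + beta_exact)
```

The two numbers are tail areas of two different binomial distributions, which is why their sum has no interpretation.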

Concept development

Two errors, two conditionings

The reason there are exactly two error types is that there are two ways the world can be (\(H_0\) true or false) and two decisions you can make. Line them up and the only mistakes are: reject when you should not, or fail to reject when you should.

The crucial discipline — the one the course returns to constantly — is that each rate is a conditional probability, and the conditions differ. Hold \(H_0\) fixed as true and ask how often your rule rejects: that fraction is \(\alpha\). Now suppose instead the truth is some specific value in \(H_a\) and ask how often your rule fails to reject: that fraction is \(\beta\). You cannot compute either one without first declaring which world you are standing in. This is the same conditioning move as the \(p\)-value (which lives under \(H_0\)), and keeping it explicit is what stops “power” from collapsing into the forbidden reading “probability the alternative is true” (Risk 6).

Notice what is random and what is fixed. The parameter \(\theta\) is fixed but unknown; it does not have a probability. What is random is the data, and therefore the decision, which is a function of the data. Error rates are statements about that randomness in the decision, conditional on a fixed state of the world.
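A short simulation can make “fixed world, random decision” concrete. This is a sketch only: it fixes \(\theta = 0.5\) (so \(H_0\) is true), reuses the one-sided threshold for \(\hat p\) at \(\alpha = 0.05\) and \(n = 40\), and counts how often the rule rejects anyway. The number of simulated studies is an arbitrary choice.

```r
# Fixed state of the world, random data, random decision (illustrative)
set.seed(35103)
n     <- 40
theta <- 0.5                  # fixed: H0 is true in this world
alpha <- 0.05
crit  <- 0.5 + qnorm(1 - alpha) * sqrt(0.5 * 0.5 / n)  # threshold for p-hat

n_sim     <- 20000                                   # arbitrary number of studies
x_sim     <- rbinom(n_sim, size = n, prob = theta)   # only the data are random
reject    <- (x_sim / n) >= crit                     # so the decision is random too
type1_hat <- mean(reject)                            # long-run false-alarm fraction
type1_hat
```

Because the pass count is discrete, the simulated false-alarm fraction typically sits a little under the nominal \(0.05\) rather than exactly on it.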

Power is a function, not a number

Because \(\beta\) depends on which alternative is true, there is no single “the power” of a test. Power is a curve. The further the true \(\theta\) sits from the null value, the more the test statistic is pushed into the rejection region, and the higher the power. Four levers move it:

  • Effect size — the distance from \(H_0\) to the truth. Larger distance, higher power.
  • Sample size \(n\) — more data shrinks the standard error, sharpening the test. Larger \(n\), higher power.
  • Significance level \(\alpha\) — a larger \(\alpha\) enlarges the rejection region, so you reject more often under both worlds. Larger \(\alpha\), higher power (but also more Type I errors).
  • Variability — noisier data (larger \(\sigma\) or, here, \(\theta\) near \(0.5\)) lowers power.

Schematically, for the proportion test of \(H_0:\theta = 0.5\) against \(H_a:\theta > 0.5\) at level \(\alpha\), power at a true value \(\theta_1\) is the probability the test statistic clears the critical value when the data are generated under \(\theta_1\):

\[ \operatorname{power}(\theta_1) = P\!\left( \hat p \ge p_0 + z_{1-\alpha}\,\operatorname{SE}_0 \;\middle|\; \theta = \theta_1 \right), \]

where \(p_0 = 0.5\), \(\operatorname{SE}_0 = \sqrt{p_0(1-p_0)/n}\) is the standard error computed under the null, and \(z_{1-\alpha} = z_{0.95} \approx 1.645\) for a one-sided \(\alpha = 0.05\). The probability on the right is then evaluated using the sampling distribution of \(\hat p\) under \(\theta_1\), whose own standard error is \(\sqrt{\theta_1(1-\theta_1)/n}\). The two standard errors differ because the rejection rule is built under \(H_0\) but the data are generated under \(H_a\) — a detail that is easy to lose and worth holding onto.
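Standardizing \(\hat p\) under \(\theta_1\) turns the expression above into a normal tail area. As a sketch under the same normal approximation, writing \(\operatorname{SE}_1 = \sqrt{\theta_1(1-\theta_1)/n}\) and \(\Phi\) for the standard normal CDF (both are shorthand introduced here, not course notation):

\[ \operatorname{power}(\theta_1) \approx 1 - \Phi\!\left( \frac{p_0 + z_{1-\alpha}\,\operatorname{SE}_0 - \theta_1}{\operatorname{SE}_1} \right). \]

The same expression can be wrapped in a small R function to move the levers one at a time. The function name power_prop and the comparison settings (\(\theta_1 = 0.75\), \(n = 160\), \(\alpha = 0.10\)) are illustrative choices, not values from the study. For a proportion the variability lever is tied to \(\theta\) itself, so it is not varied separately here.

```r
# Normal-approximation power for H0: theta = p0 vs Ha: theta > p0 (sketch)
power_prop <- function(theta1, n, alpha = 0.05, p0 = 0.5) {
  se0  <- sqrt(p0 * (1 - p0) / n)          # SE under the null (sets the threshold)
  se1  <- sqrt(theta1 * (1 - theta1) / n)  # SE under the alternative
  crit <- p0 + qnorm(1 - alpha) * se0      # rejection threshold for p-hat
  pnorm(crit, mean = theta1, sd = se1, lower.tail = FALSE)
}

base        <- power_prop(theta1 = 0.65, n = 40)                # reference setting
more_effect <- power_prop(theta1 = 0.75, n = 40)                # larger effect size
more_data   <- power_prop(theta1 = 0.65, n = 160)               # larger n
more_alpha  <- power_prop(theta1 = 0.65, n = 40, alpha = 0.10)  # more lenient alpha

round(c(base = base, more_effect = more_effect,
        more_data = more_data, more_alpha = more_alpha), 3)
```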

Choosing \(\alpha\) is choosing a risk trade-off

For a fixed sample size, the two error rates pull against each other. Shrink the rejection region to make false alarms rare (\(\alpha\) down) and you also reject real effects less often (\(\beta\) up, power down). Widen it for more power and you accept more false alarms. The only way to push both errors down at once is to collect more data.
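The trade-off can be tabulated directly. The sketch below uses the normal approximation for the one-sided proportion test and assumes the alternative \(\theta_1 = 0.65\); the helper name beta_prop, the three \(\alpha\) values, and the larger sample size \(n = 160\) are illustrative.

```r
# alpha-beta trade-off at fixed n, then with more data (normal approximation)
beta_prop <- function(alpha, n, theta1 = 0.65, p0 = 0.5) {
  crit <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)
  # P(fail to reject | theta = theta1): mass below the threshold under theta1
  pnorm(crit, mean = theta1, sd = sqrt(theta1 * (1 - theta1) / n))
}

alphas   <- c(0.10, 0.05, 0.01)
tradeoff <- data.frame(
  alpha     = alphas,
  beta_n40  = beta_prop(alphas, n = 40),    # same data, stricter alpha
  beta_n160 = beta_prop(alphas, n = 160)    # four times the data
)
round(tradeoff, 3)
```

Reading down a column shows \(\beta\) rising as \(\alpha\) falls; reading across a row shows that more data lowers \(\beta\) without touching \(\alpha\).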

The conventional \(\alpha = 0.05\) is a default, not a law of nature. It encodes a particular stance: “be fairly reluctant to claim an effect.” Whether that stance is right depends on what each error costs — which is where decisions and loss enter.

A test is a decision under loss

Frame the test as choosing an action \(a\) — say \(a_1 =\) “act as if the program works” or \(a_0 =\) “act as if it does not.” Attach a loss \(\operatorname{Loss}(\theta, a)\): the cost of taking action \(a\) when the truth is \(\theta\). A correct decision costs little; the two errors cost something, and usually not the same thing.

If a false positive means rolling out an expensive program that does nothing, and a false negative means shelving a program that genuinely helps students, those are different harms. A decision rule should weigh them by their costs, not treat both errors as equally bad. In this language, \(\alpha\) is the lever that trades the frequency of false positives against the frequency of false negatives; the loss is what tells you where to set the lever. A rule that minimizes expected loss is the bridge from “a test” to “a defensible decision,” and it is exactly the idea the Bayesian decision lens in Week 12 makes formal.
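One way to see the lever and the loss together is to price each error and multiply by how often it happens, world by world. The sketch below is hedged throughout: the costs are hypothetical placeholders (a correct decision is assumed to cost 0), the alternative \(\theta_1 = 0.65\) is assumed, and the error rates use the normal approximation. It deliberately does not combine the two worlds into one number, since that would need a weight on each world, which is the Week 12 step.

```r
# Expected loss of the rule in each world = cost of the error x its rate
# (costs are HYPOTHETICAL placeholders, not values from the study)
cost_fp <- 1   # assumed cost of expanding a program that does nothing
cost_fn <- 3   # assumed cost of shelving a program that helps

n <- 40; p0 <- 0.5; theta1 <- 0.65
alphas <- c(0.01, 0.05, 0.10, 0.20)

crit <- p0 + qnorm(1 - alphas) * sqrt(p0 * (1 - p0) / n)
beta <- pnorm(crit, mean = theta1, sd = sqrt(theta1 * (1 - theta1) / n))

risk <- data.frame(
  alpha        = alphas,
  risk_if_null = cost_fp * alphas,  # expected loss when H0 is true
  risk_if_alt  = cost_fn * beta     # expected loss when theta = theta1
)
round(risk, 3)
```

Raising \(\alpha\) moves expected loss out of the alternative world and into the null world; the relative costs tell you how far that move is worth making.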

Worked examples

Worked example — power of the reading-fluency \(\theta\)-test

Recall Strand A of the recurring reading-fluency study (synthetic; seed set, set.seed(35103)). Of \(n = 40\) students, \(x = 26\) passed, \(\hat p = 0.65\). Last week we tested \(H_0:\theta = 0.5\) against \(H_a:\theta > 0.5\) at \(\alpha = 0.05\) and found \(z = 1.90\), one-sided \(p \approx 0.029\) — a borderline rejection.

Now ask a design question, before any new data: if the program’s true pass rate were \(\theta = 0.65\), how often would a study like this one actually detect it? That is the power of the test at the alternative \(\theta_1 = 0.65\), with \(n = 40\) and \(\alpha = 0.05\).

Setup. The rejection rule lives under the null: reject when \(\hat p\) clears \(p_0 + z_{0.95}\,\operatorname{SE}_0\), with \(p_0 = 0.5\), \(\operatorname{SE}_0 = \sqrt{0.5\cdot 0.5/40} \approx 0.0791\), and \(z_{0.95} \approx 1.645\). The data, however, are imagined as generated under \(\theta_1 = 0.65\), whose sampling standard error is \(\sqrt{0.65\cdot 0.35/40} \approx 0.0754\). Power is the probability that \(\hat p\) lands in the rejection region when the data come from \(\theta_1\).

Computation (illustrative, synthetic). Working it through with these numbers gives only modest power: roughly \(0.5\) to \(0.6\) for detecting \(\theta = 0.65\) at \(\alpha = 0.05\) with only \(n = 40\) (the normal approximation lands near the top of that range, and the discreteness of the pass count pulls the exact figure slightly lower). Treat this as approximate and synthetic, not a verified value: the point is the order of magnitude, not a precise figure. Power in this range means a study this size would correctly flag a true \(0.65\) pass rate only a little more than half the time, and would miss it the rest of the time (\(\beta\) roughly \(0.4\) to \(0.5\)).
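The arithmetic behind that figure can be laid out step by step. This is a sketch under the normal approximation, using only the numbers from the setup; the variable names are hypothetical.

```r
# Normal-approximation power at theta1 = 0.65, n = 40, alpha = 0.05 (illustrative)
n <- 40; p0 <- 0.5; alpha <- 0.05; theta1 <- 0.65

se0  <- sqrt(p0 * (1 - p0) / n)          # ~ 0.0791, under the null
se1  <- sqrt(theta1 * (1 - theta1) / n)  # ~ 0.0754, under the alternative
crit <- p0 + qnorm(1 - alpha) * se0      # ~ 0.630, rejection threshold for p-hat

z_alt      <- (crit - theta1) / se1      # threshold measured in alternative-SE units
power_norm <- pnorm(z_alt, lower.tail = FALSE)
beta_norm  <- 1 - power_norm

round(c(crit = crit, z_alt = z_alt, power = power_norm, beta = beta_norm), 3)
```

The threshold sits about a quarter of a standard error below \(\theta_1\), so a bit more than half of the alternative's sampling distribution clears it (about \(0.60\) by this approximation).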

Interpretation. This is sobering and instructive. The Week 8 result was a borderline rejection; the power calculation explains why the result was borderline rather than decisive. At \(n = 40\), a true effect of this size sits right at the edge of what the test can reliably catch, so the outcome is close to a coin flip between “reject” and “fail to reject.” What is random here is the data and hence the decision; what is fixed is the (hypothesized) truth \(\theta_1 = 0.65\); what is assumed is independent pass/fail outcomes (Risk 14) and a normal approximation to \(\hat p\). The lesson for design: if detecting an effect of this size matters, \(n = 40\) is underpowered, and you would want a larger sample before drawing a firm conclusion. And the lesson for reading: a failure to reject here would not mean the program doesn’t work (Risk 7) — with power near \(0.5\), a miss is entirely expected even when the effect is real.
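The design lesson can be turned into a rough sample-size search. The sketch below assumes a planning target of \(0.80\) power, which is a common convention and not a value from the study, and uses the normal approximation; power_prop is a hypothetical helper.

```r
# Smallest n reaching a target power (normal approximation; target is an ASSUMPTION)
power_prop <- function(n, theta1 = 0.65, alpha = 0.05, p0 = 0.5) {
  crit <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)
  pnorm(crit, mean = theta1, sd = sqrt(theta1 * (1 - theta1) / n),
        lower.tail = FALSE)
}

target   <- 0.80                             # assumed planning target
n_grid   <- 40:200
pw       <- power_prop(n_grid)               # power at every n on the grid
n_needed <- n_grid[which(pw >= target)[1]]   # first n that reaches the target

c(power_at_40 = power_prop(40), n_needed = n_needed,
  power_at_needed = power_prop(n_needed))
```

Under these assumptions the search lands in the high sixties, well above the \(40\) students in the study. An exact binomial calculation could shift the answer by a few students.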

A static, non-executed R sketch of the same calculation by direct simulation — the kind of thing you would run to see the power rather than look it up:

# Power of the one-sided proportion test at theta = 0.65, alpha = 0.05, n = 40
# STATIC, non-executed teaching code. Synthetic study; seed set.
set.seed(35103)

n      <- 40
p0     <- 0.5            # null value
alpha  <- 0.05
theta1 <- 0.65          # the alternative we want to detect

# Rejection threshold for p-hat, built UNDER THE NULL (one-sided, upper tail)
se0     <- sqrt(p0 * (1 - p0) / n)        # ~ 0.0791
crit_ph <- p0 + qnorm(1 - alpha) * se0    # qnorm(0.95) ~ 1.645  ->  ~ 0.630

# Simulate many studies UNDER THE ALTERNATIVE theta1, see how often we reject
n_sim   <- 10000
x_sim   <- rbinom(n_sim, size = n, prob = theta1)  # data generated at theta1
phat    <- x_sim / n
power_hat <- mean(phat >= crit_ph)        # fraction that clear the null threshold

power_hat
# ~ 0.5  (illustratively modest, SYNTHETIC / approximate — verified: false)
# Read it as: about half of size-40 studies would detect a true 0.65 pass rate.

The simulation makes the conditioning visible: the threshold crit_ph is computed under \(H_0\), but the data x_sim are drawn under \(\theta_1\). Power is the overlap between the alternative’s sampling distribution and the null’s rejection region — and at \(n = 40\) that overlap is only about half.
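The overlap can also be computed exactly, which gives a reference point for the simulated fraction. This sketch sums the alternative's binomial probabilities over the counts that fall in the rejection region, then reports the Monte Carlo standard error you would expect from \(10{,}000\) simulated studies. Variable names are hypothetical.

```r
# Exact "overlap": alternative's probability mass inside the null's rejection region
n <- 40; p0 <- 0.5; alpha <- 0.05; theta1 <- 0.65
crit_ph <- p0 + qnorm(1 - alpha) * sqrt(p0 * (1 - p0) / n)

x_all       <- 0:n
in_region   <- (x_all / n) >= crit_ph    # counts that lead to rejection
power_exact <- sum(dbinom(x_all[in_region], size = n, prob = theta1))

# Monte Carlo standard error of a simulated power estimate with n_sim draws
n_sim <- 10000
mc_se <- sqrt(power_exact * (1 - power_exact) / n_sim)

c(min_reject_count = min(x_all[in_region]),
  power_exact = power_exact, mc_se = mc_se)
```

A simulated power_hat should land within a few multiples of mc_se of power_exact; a larger gap would point to a bug rather than to chance.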

Worked example — power of a two-group test (transfer)

Now move to a fresh context: a clinic pilots a new tutoring protocol and wants to know whether it will be able to detect a meaningful difference between two groups before it commits resources. This mirrors Strand C of the study (a two-group comparison) but the numbers here are a separate illustrative scenario.

Setup. Two groups, treatment and control, \(n_T = n_C = 18\) each, outcome measured on a continuous score. The null is \(H_0:\) no difference in means; the alternative is a true mean difference of, say, \(\Delta = 3\) points against a within-group SD of roughly \(4.35\). The test statistic is the standardized difference \(t = \hat d / \operatorname{SE}(\hat d)\), with \(\operatorname{SE}(\hat d) = s_{\text{pool}}\sqrt{1/n_T + 1/n_C} \approx 4.35\sqrt{1/18 + 1/18} \approx 1.45\).

Computation (illustrative, synthetic). The same four levers apply. Power rises if the true difference \(\Delta\) is larger, if each group is bigger, if \(\alpha\) is more lenient, or if the within-group SD is smaller. With a true difference of \(3\) points and these group sizes, the effect is roughly \(\Delta / \operatorname{SE}(\hat d) \approx 3/1.45 \approx 2.07\) standard errors from the null — comparable in spirit to the proportion example, so the power is again only moderate, not high. (All values synthetic, approximate, and verified: false.)
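For a rough number, base R's power.t.test can be applied to this scenario. The sketch assumes a two-sided test at \(\alpha = 0.05\) with equal variances; the text does not specify the sidedness, so treat that as an assumption, and treat the output as illustrative.

```r
# Two-group power with base R's power.t.test (two-sided test is an ASSUMPTION)
n_per_group <- 18
delta       <- 3      # assumed true mean difference
sd_within   <- 4.35   # assumed within-group SD

se_d   <- sd_within * sqrt(1 / n_per_group + 1 / n_per_group)  # ~ 1.45
effect <- delta / se_d                                          # ~ 2.07 standard errors

pw <- power.t.test(n = n_per_group, delta = delta, sd = sd_within,
                   sig.level = 0.05, type = "two.sample",
                   alternative = "two.sided")
c(se_d = se_d, effect_in_se = effect, power = pw$power)
```

Switching to alternative = "one.sided" would raise the reported power somewhat, for the same reason a larger \(\alpha\) does.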

Interpretation. The transfer is the whole point: nothing about the reasoning changed when we moved from a proportion to a two-group mean. Power is still the chance of correctly rejecting a specified false null; it still rises with effect size and \(n\) and falls with noise; and a non-significant result here would still not prove the groups are identical. What changed is only the arithmetic of the test statistic and its standard error. The design takeaway also transfers: if the clinic needs to reliably catch a \(3\)-point difference, \(18\) per group is likely too few, and a power calculation done now — before data — tells them how many participants to recruit. The conditioning is identical to the proportion case: power is computed under the assumed alternative \(\Delta = 3\), with the data (and therefore the decision) random and the truth fixed.
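The recruiting question can be sketched the same way by leaving n out and supplying a target power instead. The \(0.80\) target and the two-sided test are assumptions for illustration, not values from the scenario.

```r
# Per-group sample size for a target power (target and sidedness are ASSUMPTIONS)
pw_n <- power.t.test(delta = 3, sd = 4.35, sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")

n_needed <- ceiling(pw_n$n)   # round up to whole participants per group
c(n_per_group_needed = n_needed, current_n_per_group = 18)
```

Under these assumptions the required group size comes out well above \(18\), which is the quantitative version of “likely too few.”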

A common mistake

The signature mistake this week is reading power, \(\alpha\), or \(\beta\) as a probability about which hypothesis is true. Power is not “the probability the alternative is correct,” and \(1 - \alpha\) is not “the probability \(H_0\) is true.” Every one of these is a conditional probability about the decision, computed by fixing a state of the world and asking how often the random data lead the rule astray. The parameter is fixed and has no probability in this frequentist frame (Risk 6).

Two companions to this trap:

  • Treating “fail to reject” as “accept” or “prove” \(H_0\) (Risk 7). When power is low — as in this week’s \(n = 40\) example — a non-significant result is exactly what you would expect even if the effect is real. Low power means absence of evidence is weak evidence of absence. Never write “the program has no effect” from a single failure to reject; write “this study did not detect an effect,” and report the power if you can.
  • Writing the loss as \(L\) (Risk 15). \(L(\theta)\) is the likelihood (Weeks 5–6), a function of \(\theta\) given the data — and emphatically not a probability distribution over \(\theta\). The decision loss is \(\operatorname{Loss}(\theta, a)\), written out, never abbreviated to \(L\). Two different ideas, two different symbols; conflating them is the kind of notation slip that quietly corrupts a derivation.

The repair for all of these is the same move you have practiced since Week 1: before you state a probability, say out loud what is conditioned on, what is random, and what is fixed.

Low-stakes self-checks (ungraded)

These are practice only — no points, no submission, no key. Work them, then check your reasoning against the sections above.

  1. In one sentence each, state which world (\(H_0\) true or false) you condition on to compute \(\alpha\), and which you condition on to compute \(\beta\).
  2. A colleague lowers \(\alpha\) from \(0.05\) to \(0.01\) to “be safer,” keeping \(n\) fixed. What happens to the Type II error rate and to power, and why?
  3. The reading-fluency \(\theta\)-test has power roughly \(0.5\) at \(\theta = 0.65\), \(n = 40\). Name two changes to the study that would raise the power, and say which lever each one moves.
  4. Rewrite this sentence to fix its error: “The test was not significant, so we proved the program does not work.” What is the correct claim, and why does low power make the original especially wrong?
  5. A false positive here means expanding a useless program; a false negative means shelving a helpful one. Sketch in words how \(\operatorname{Loss}(\theta, a)\) for the two errors would push you to set \(\alpha\) higher or lower, and explain your reasoning.
  6. True or false, with a reason: “Power is the probability that the program really works.” (It is false — say precisely what power is a probability of.)

Reading and source pointer

For this week, read the MIT OCW 18.05 treatment of Type I and Type II error and power (the hypothesis-testing readings that introduce the error table, \(\alpha\), \(\beta\), and the power of a test). For a lighter, plain-language framing of error rates and the false-positive/false-negative trade-off, the Introduction to Modern Statistics (Çetinkaya-Rundel & Hardin, CC BY-SA 3.0) discussion of decision errors is a good calibration read. These notes are the course’s own synthesis, grounded in but not copied from the sources.

The decision-and-loss framing previews the Bayesian decision-theory lens in Week 12; the error-rate vocabulary extends directly from the hypothesis-test machinery of Week 8.

Formula-verification status

verified: false. The formulas and every numeric value on this page are drafted, synthetic, and not independently checked. The load-bearing items here — the power expression \(\operatorname{power}(\theta_1) = P(\hat p \ge p_0 + z_{1-\alpha}\operatorname{SE}_0 \mid \theta = \theta_1)\), the under-null standard error \(\operatorname{SE}_0 = \sqrt{0.5\cdot 0.5/40} \approx 0.0791\), the critical multiplier \(z_{0.95} \approx 1.645\), the illustrative power \(\approx 0.5\) for detecting \(\theta = 0.65\) at \(\alpha = 0.05\), \(n = 40\), and the two-group standard error \(\operatorname{SE}(\hat d) \approx 1.45\) with \(t \approx 2.07\) — are all part of the recurring synthetic study (set.seed(35103)) and are presented “as if computed.” The course math gate is BLOCKED: do not treat any value as a confirmed reference until the human/source sign-off in _state/notation_ledger.md §5 is complete. The power figure in particular is flagged as approximate and order-of-magnitude only.

Public vs. graded

These notes, the examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded inference checkpoints, quizzes, homework, inference labs, the midterm, the project, and the final live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.

Looking ahead

Next week (Week 10) you will estimate uncertainty without a formula, by resampling the data itself. Instead of plugging into a standard-error equation, you will draw new samples from your own sample (the bootstrap) and let the spread of the resampled estimates stand in for the sampling distribution. It is the same inferential question (how variable is my estimate?) answered by computation rather than algebra, and it sets up the randomization tests that follow.
