Lab 5 — Two-group comparison & effect size

Estimate a group difference, its interval, and Cohen’s d in R

Purpose. This lab is the hands-on companion to Week 5 — Two-group comparisons. The note walks the analysis blueprint through an independent two-group comparison: a question about whether two groups differ, a structure with one quantitative response and one two-level grouping factor, a method (the Welch two-sample $t$), the assumptions it leans on, the estimate with its uncertainty (a mean difference and a confidence interval, never a lone p-value), and a conclusion that keeps statistical, practical, and causal claims apart. Here you build that comparison from the ground up on Dataset G — the Cypress Ridge final scores for students who used the support center (Support) versus those who did not (Self-guided). You will encode the synthetic slice, summarise and plot the two groups, run t.test(score ~ group) to read the difference, its 95% interval, and the p-value, and then compute Cohen’s $d$ so the result is reported as a standardized estimate, not a verdict. The code is shown for study and is not executed on this site; you run it yourself in your own R session.

The idea

A two-group comparison hands you two piles of numbers and one practical question: do students who used the support center score higher on the final than students who went it alone, and if so, by how much, and how sure can you be? The temptation is to skip straight to “is it significant?” and report a p-value. The whole point of this lab — and of the week-5 note it accompanies — is to resist that and instead build the comparison up in the order the blueprint demands: a point estimate first (the difference in means), then its standard error and a confidence interval, then the test, then a standardized effect size, and only then a conclusion that is careful about what kind of claim the data can carry.

Here is the structure, named before any code runs, because the method follows from it. The unit of analysis is the individual student. The response $Y$ is the final exam score, a quantitative outcome on a $0$–$100$ scale. The explanatory variable is a single two-level grouping factor, $\text{group} \in \{\text{Support}, \text{Self-guided}\}$. The two groups are made of different students — nobody appears in both — so the design is independent, not paired. (That contrast with the paired pre/post design of week 4 is the hinge of the whole week: pairing would let each student be their own control, but here there is no natural pairing, so each group’s variation stays in the comparison.) And crucially, the design is observational: students chose whether to use the support center. No one assigned them. That single fact governs the conclusion at the very end, no matter how small the p-value turns out to be.

Given that structure, the matching method is the two-sample $t$, and the safe default is the Welch form, which does not assume the two groups share a common variance. The estimate it is built around is the difference in sample means, \[ \bar x_{\text{Support}} - \bar x_{\text{Self-guided}} = 78.0 - 72.0 = 6.0 \text{ points}. \] That $+6$ is the number the entire analysis is about. Everything else — the standard error, the interval, the t-statistic, the p-value, Cohen’s $d$ — is there to say how precise that $6$ is and how large it is on a scale you can interpret. The Welch standard error for the difference combines the two groups’ spreads, \[ \mathrm{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{10.5^2}{45} + \frac{12.0^2}{45}} \approx 2.38 \text{ points}, \] the t-statistic is $t = 6.0 / 2.38 \approx 2.53$ on about $86$ degrees of freedom (Welch’s fractional df), and the two-sided p-value is about $0.013$. The 95% confidence interval for the difference is \[ 6.0 \pm t_{.975}\,(2.38) \approx 6.0 \pm 4.7 = (1.3, 10.7) \text{ points}. \] Read that interval as the estimate-with-uncertainty the course keeps asking for: the data are consistent with a true advantage anywhere from about $1$ point to about $11$ points; they rule out no difference (the interval excludes $0$), but they do not pin the size down tightly.
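The arithmetic in the paragraph above can be reproduced from the six summary numbers alone, with no raw data. The sketch below is illustrative only: the variable names (v1, v2, df_welch, and so on) are assumptions introduced here, and the degrees of freedom use the standard Welch-Satterthwaite formula, which is what the fractional df in the text refers to.

```r
# Locked summary statistics for the two groups (no raw data needed).
m1 <- 78.0; s1 <- 10.5; n1 <- 45   # Support
m2 <- 72.0; s2 <- 12.0; n2 <- 45   # Self-guided

diff_means <- m1 - m2              # point estimate, in points
v1 <- s1^2 / n1                    # variance of the Support mean
v2 <- s2^2 / n2                    # variance of the Self-guided mean
se <- sqrt(v1 + v2)                # Welch standard error of the difference
t_stat <- diff_means / se

# Welch-Satterthwaite degrees of freedom (fractional).
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for the difference.
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se

round(c(diff = diff_means, se = se, t = t_stat, df = df_welch, p = p_two_sided), 3)
round(ci, 1)
```

Working from summary statistics keeps the focus on how each quantity is built from the one before it: the interval is the estimate plus or minus a t-quantile times the standard error, and the p-value is a tail area for the same t-statistic.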

The last piece is the standardized effect size, Cohen’s $d$, which re-expresses the $6$-point gap in standard-deviation units so you can judge whether it is practically big, not just detectably nonzero: \[ d = \frac{\bar x_1 - \bar x_2}{s_p} = \frac{6.0}{11.27} \approx 0.53, \] where $s_p \approx 11.27$ is the pooled within-group SD. A $d$ of about half a standard deviation is a medium effect by the usual rough convention — neither trivial nor enormous. Holding the p-value and the effect size side by side is the discipline the lab is teaching: the small p-value says the difference is unlikely to be zero; the $d \approx 0.53$ says it is moderately large; and neither of them, because the data are observational, says the support center caused the higher scores. Motivated students self-select into help. That is the association-not-causation caveat, and it survives every line of R below.
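For readers who want the arithmetic behind $s_p$, here is a short worked version using only the group SDs and sizes quoted above. It assumes the usual pooled-variance definition with $(n_1 - 1)$ and $(n_2 - 1)$ weights, the same one Step 4 codes by hand.

\[ s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{44\,(10.5^2) + 44\,(12.0^2)}{88}} = \sqrt{\frac{110.25 + 144}{2}} = \sqrt{127.125} \approx 11.27, \qquad d = \frac{6.0}{11.27} \approx 0.53. \]

Because the two groups are the same size, the pooled variance here is simply the average of the two group variances.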

Goal

By the end of this lab you should be able to:

Encode the Support-vs-Self-guided final-score slice of Dataset G in R with the course seed, and name the student as the unit, the final score as the quantitative response, and group as the two-level independent factor.
Summarise the two groups — means, SDs, and counts — and read a side-by-side boxplot to see the structure (center, spread, overlap) before fitting anything.
Run t.test(score ~ group) (the Welch two-sample $t$) and read off the three things that matter: the difference in means ($+6$ points), its 95% confidence interval $(1.3, 10.7)$, and the p-value ($\approx 0.013$) — reporting the estimate and interval, not a bare verdict.
Compute Cohen’s $d$ ($\approx 0.53$), either by hand from the pooled SD or with the effsize package, and say in a sentence what a medium standardized effect means in context.
State the conclusion in the blueprint’s three registers — statistical significance, practical significance, and the observational (association-not-causation) caveat — without letting the small p-value do work it cannot do.

The point of the lab is the estimate, not the test. A two-sample $t$-test is one instance of the analysis blueprint, and its job is to attach an honest uncertainty to a difference you already care about — not to deliver a yes/no answer. Name the blueprint move as you go: you are choosing the method that matches an independent two-group quantitative structure, checking what it assumes (independence, roughly normal-or-large groups, and — for the pooled form — equal variance, which Welch sidesteps), producing an estimate with its uncertainty, and writing a conclusion that keeps statistical, practical, and causal claims separate.

Setup

You need only base R for the comparison itself; the optional Cohen’s $d$ shortcut uses the small effsize package, but Step 4 also shows the by-hand computation so you can run the whole lab with no packages at all. Every chunk that draws randomness starts with set.seed(35203), the course-wide seed, so the individual synthetic scores are reproducible; the re-centering lines in the build then pin each group’s mean and SD to the locked numbers the week-5 note reports. The data are synthetic and seeded: these are not real student records, so treat every value below as a drafted target, not a confirmed reference.

Fix the design facts before any code runs, because the method and every number follow from them:

Unit of analysis: the individual student. There are $n_1 = 45$ Support students and $n_2 = 45$ Self-guided students, $90$ in all. Different students in each group — nobody is counted twice.
Response: the final exam score $Y$, quantitative, $0$–$100$.
Explanatory variable: group, a two-level factor $\{\text{Support}, \text{Self-guided}\}$.
Design: independent (not paired — there is no student-to-student matching) and observational (students chose to use the support center; no random assignment).
The locked slice: Support mean $= 78.0$, SD $= 10.5$, $n_1 = 45$; Self-guided mean $= 72.0$, SD $= 12.0$, $n_2 = 45$; difference $= 6.0$ points; Welch $t \approx 2.53$ on df $\approx 86$, $p \approx 0.013$; 95% CI for the difference $(1.3, 10.7)$; Cohen’s $d \approx 0.53$ (medium).

set.seed(35203)

# --- Dataset G: final exam scores, Support vs Self-guided (Cypress Ridge) ------
# Observational two-group slice: students SELF-SELECTED into the support center,
# so the design buys ASSOCIATION, not causation. Unit = student; response = final
# score (0-100); factor = group (2 levels). Synthetic; LOCKED summary stats.
n1 <- 45                                       # Support students
n2 <- 45                                       # Self-guided students

# Draw each group to its LOCKED mean and SD, then re-center/re-scale so the sample
# mean and SD hit the locked targets exactly (synthetic data, reproducible slice).
support     <- rnorm(n1, mean = 78, sd = 10.5)
support     <- 78 + (support - mean(support)) / sd(support) * 10.5
selfguided  <- rnorm(n2, mean = 72, sd = 12.0)
selfguided  <- 72 + (selfguided - mean(selfguided)) / sd(selfguided) * 12.0

# Tidy long form: one row per student, two columns (group, score).
g <- data.frame(
  group = factor(rep(c("Support", "Self-guided"), times = c(n1, n2)),
                 levels = c("Self-guided", "Support")),
  score = c(support, selfguided)
)

head(g)                                        # one row per student: group, score
table(g$group)                                 # -> Self-guided 45, Support 45

The re-centering lines are a deliberate convenience: they force the synthetic sample to land on the locked summary statistics (mean $78$, SD $10.5$ for Support; mean $72$, SD $12$ for Self-guided) so your output matches the week-5 note to the digit. In a real analysis you would never re-scale observed data — you would take the numbers as they came. Here the data are invented to teach the method, so we pin them to the locked slice. The structure that matters is the same either way: $90$ rows, one per student, a quantitative score, and a two-level group factor with no student in both arms.
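To see what the re-centering lines do and do not depend on, the sketch below wraps them in a small function and runs it under several seeds. The helper name pin_sample and the extra seeds 1 and 99 are hypothetical, introduced only for this illustration; they are not part of the lab's code.

```r
# Hypothetical helper (not part of the lab): rescale a sample so that its
# sample mean and sample SD equal the targets exactly.
pin_sample <- function(x, target_mean, target_sd) {
  target_mean + (x - mean(x)) / sd(x) * target_sd
}

# Try a few seeds: the pinned mean and SD are identical every time,
# while the individual scores differ from seed to seed.
for (seed in c(1L, 35203L, 99L)) {
  set.seed(seed)
  x <- pin_sample(rnorm(45, mean = 78, sd = 10.5), 78, 10.5)
  cat(sprintf("seed %5d: mean = %.4f, sd = %.4f, first score = %.2f\n",
              seed, mean(x), sd(x), x[1]))
}
```

Every seed should print a mean of 78.0000 and an SD of 10.5000. The seed therefore controls which individual scores you get (and so the exact look of the boxplot), while the re-scaling controls the summary statistics that feed the test.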

Steps

You will build the comparison in four moves: encode the slice (Step 1), summarise and see the two groups before fitting (Step 2), run the Welch $t$-test and read the estimate, interval, and p-value (Step 3), and standardize the difference with Cohen’s $d$ before writing the three-register conclusion (Step 4).

Step 1 — Seed and build the synthetic Dataset G

Start the comparison by fixing the seed and confirming the slice is encoded the way the design says it should be: two independent groups of $45$ students, a quantitative score, and a group factor. The sanity checks below recover the locked group means and SDs before any test runs — if these are off, every later number will be off too.

set.seed(35203)

# Rebuild Dataset G (same construction as Setup) so this chunk stands alone.
n1 <- 45; n2 <- 45
support    <- rnorm(n1, 78, 10.5); support    <- 78 + (support    - mean(support))    / sd(support)    * 10.5
selfguided <- rnorm(n2, 72, 12.0); selfguided <- 72 + (selfguided - mean(selfguided)) / sd(selfguided) * 12.0

g <- data.frame(
  group = factor(rep(c("Support", "Self-guided"), times = c(n1, n2)),
                 levels = c("Self-guided", "Support")),
  score = c(support, selfguided)
)

# Confirm the LOCKED structure and summary stats before fitting anything:
nrow(g)                                        # -> 90 students total
tapply(g$score, g$group, mean)                 # -> Self-guided 72, Support 78
tapply(g$score, g$group, sd)                   # -> Self-guided 12.0, Support 10.5
diff(tapply(g$score, g$group, mean))           # -> 6  (Support minus Self-guided)

The design move here is to make the structure literal before reaching for a test. nrow(g) confirms $90$ rows, one per student; tapply(..., mean) recovers the locked group means $72$ and $78$; tapply(..., sd) recovers the locked SDs $12.0$ and $10.5$; and diff() returns the $6$-point gap that the whole analysis is about. Setting levels = c("Self-guided", "Support") fixes the factor ordering so “Self-guided” is the reference level and diff() of the group means reads as Support minus Self-guided, a positive $+6$. (t.test() subtracts in level order, Self-guided minus Support, so Step 3 flips that sign when reporting.) Name the blueprint move: this step is Structure, not yet Method. You have not assumed anything about the shape of the scores yet; you have only confirmed that the data have the independent two-group form the Welch $t$ is built for.

Step 2 — Summarise the groups and draw a side-by-side boxplot

Before fitting, look. A side-by-side boxplot shows the two groups’ centers, spreads, and overlap at a glance, and a summary table puts numbers on what the picture shows. This is the Assumptions & diagnostics stop on the blueprint done the cheap way first: you are checking, by eye, whether the groups are roughly symmetric (so a $t$ is reasonable) and whether their spreads look similar enough to matter (the spreads differ a little here — Self-guided is more variable — which is exactly why Welch, not the pooled $t$, is the safe default).

set.seed(35203)

# Numeric summary: means, SDs, and counts per group (the estimate's raw material).
aggregate(score ~ group, data = g,
          FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
#   group        score.mean score.sd score.n
#   Self-guided  72.0       12.0     45
#   Support      78.0       10.5     45

# See the structure before fitting: side-by-side boxplot of score by group.
boxplot(score ~ group, data = g,
        ylab = "Final exam score (0-100)",
        xlab = "Group",
        main = "Final score by support-center use (Dataset G, synthetic)")

The summary recovers the locked means ($72$, $78$) and SDs ($12.0$, $10.5$), and the boxplot shows two overlapping distributions whose medians sit about $6$ points apart, with the Self-guided box a little taller (more spread). Read three things off this picture before you fit. First, the overlap is substantial — these are not two cleanly separated clouds, which is your visual warning that a $6$-point gap, while real-looking, is modest relative to the within-group spread. Second, the shapes look roughly symmetric with no wild outliers, so the $t$’s near-normality assumption is reasonable for groups this size. Third, the spreads differ a little, which is the concrete reason to prefer Welch: the pooled $t$ would assume equal variances, and there is no reason to force that here. Looking first is what keeps you from running a test whose assumptions the data quietly violate.
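The eyeball checks described above can be backed with two quick numbers and a pair of normal Q-Q plots. This is a minimal sketch, not a required part of the lab: pin() is a hypothetical helper that repeats the Setup re-scaling, and sd_ratio and center_gap are names chosen here for illustration.

```r
set.seed(35203)
n1 <- 45; n2 <- 45
# pin() is a hypothetical helper: rescale a draw to an exact mean and SD.
pin <- function(x, m, s) m + (x - mean(x)) / sd(x) * s
support    <- pin(rnorm(n1, 78, 10.5), 78, 10.5)
selfguided <- pin(rnorm(n2, 72, 12.0), 72, 12.0)
g <- data.frame(
  group = factor(rep(c("Support", "Self-guided"), times = c(n1, n2)),
                 levels = c("Self-guided", "Support")),
  score = c(support, selfguided)
)

# Spread: ratio of the larger group SD to the smaller one (12.0 / 10.5 here).
sds <- tapply(g$score, g$group, sd)
sd_ratio <- max(sds) / min(sds)
sd_ratio

# Symmetry: a mean far from the median hints at skew within a group.
center_gap <- tapply(g$score, g$group, function(x) mean(x) - median(x))
center_gap

# Shape: points hugging the line suggest near-normal scores in that group.
op <- par(mfrow = c(1, 2))
for (lv in levels(g$group)) {
  x <- g$score[g$group == lv]
  qqnorm(x, main = paste("Normal Q-Q:", lv))
  qqline(x)
}
par(op)
```

None of these replaces looking at the boxplot; they only put a number or a second picture on the same three questions (overlap, shape, spread) before the test is run.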

Two side-by-side box plots of final exam score for Dataset G. The Support group box is centered near a mean of 78 with a diamond marker labeled mean 78 and a spread labeled SD 10.5; the Self-guided group box is centered near a mean of 72 with a diamond marker labeled mean 72 and a wider spread labeled SD 12.0. The two boxes overlap substantially across the middle of the score range, and each group has 45 students. — Figure 1: **Look before you fit — the two groups side by side.** Boxplots of the synthetic Dataset G final scores: Support (mean $78$, SD $10.5$, $n=45$) sits about $6$ points above Self-guided (mean $72$, SD $12.0$, $n=45$), but the boxes overlap heavily and Self-guided is a little more spread. That visible overlap is the warning that a $6$-point gap is modest relative to within-group variation, and the slightly unequal spreads are exactly why **Welch**, not the pooled $t$, is the safe default. Synthetic; seed 35203.

Step 3 — Run the Welch two-sample t-test and read the estimate

Now fit the method that matches the structure. t.test(score ~ group) runs the Welch two-sample $t$-test by default (R does not assume equal variances unless you pass var.equal = TRUE). The formula score ~ group reads as “model the score as a function of group” — the same response-tilde-explanatory syntax you will reuse for ANOVA and regression later in the course.

set.seed(35203)

# Welch two-sample t-test (the safe default: does NOT assume equal variances).
fit <- t.test(score ~ group, data = g)         # var.equal = FALSE by default = Welch
fit
# Welch Two Sample t-test
# data:  score by group
# t = -2.53, df = 86, p-value = 0.013
# alternative hypothesis: true difference in means ... is not equal to 0
# 95 percent confidence interval:
#  -10.7  -1.3
# sample estimates:
#   mean in group Self-guided   mean in group Support
#                        72.0                    78.0

# Pull the pieces out by name so you report the ESTIMATE, not a bare verdict:
diff(fit$estimate)                              # -> 6.0   (Support minus Self-guided)
fit$conf.int                                    # -> (-10.7, -1.3)  for Self-guided - Support
fit$p.value                                     # -> ~ 0.013

Read the output as an estimate with its uncertainty, in the blueprint’s words. The two group means are $72.0$ and $78.0$, so the estimated difference is $6.0$ points in Support’s favor. The t-statistic is about $-2.53$ on about $86$ degrees of freedom (the sign is just R’s group ordering; the magnitude $2.53$ is what matters), and the two-sided p-value is about $0.013$ — the difference is statistically distinguishable from zero at the usual $5\%$ level. The 95% confidence interval R prints runs from about $-10.7$ to $-1.3$ for Self-guided minus Support; flip the sign and it is $(1.3, 10.7)$ points for Support minus Self-guided, exactly the interval the week-5 note locks. That interval is the headline: the data are consistent with a true Support advantage of anywhere from about $1$ to about $11$ points, and they exclude $0$. Report the $6$-point difference with that interval — that is the estimate-with-uncertainty discipline. A bare “$p = 0.013$, significant” would throw away the two most useful numbers (how big and how precise) and is exactly the classic error the course warns against.
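The sign flip described above can be done in code rather than by eye. The sketch below shows two routes that should agree: negating and reversing the reported interval, or refitting with Support as the first factor level. The names pin, est_flip, ci_flip, group_s, and fit_s are assumptions introduced for this illustration.

```r
set.seed(35203)
n1 <- 45; n2 <- 45
# pin() is a hypothetical helper: rescale a draw to an exact mean and SD.
pin <- function(x, m, s) m + (x - mean(x)) / sd(x) * s
support    <- pin(rnorm(n1, 78, 10.5), 78, 10.5)
selfguided <- pin(rnorm(n2, 72, 12.0), 72, 12.0)
g <- data.frame(
  group = factor(rep(c("Support", "Self-guided"), times = c(n1, n2)),
                 levels = c("Self-guided", "Support")),
  score = c(support, selfguided)
)

fit <- t.test(score ~ group, data = g)       # reports Self-guided minus Support

# Route 1: flip the reported pieces by hand.
est_flip <- unname(diff(fit$estimate))       # Support mean minus Self-guided mean
ci_flip  <- -rev(as.numeric(fit$conf.int))   # negate, then swap the endpoints

# Route 2: make Support the first level and refit.
g$group_s <- relevel(g$group, ref = "Support")
fit_s <- t.test(score ~ group_s, data = g)   # reports Support minus Self-guided

est_flip
round(ci_flip, 1)
round(as.numeric(fit_s$conf.int), 1)
```

Either route reports the same interval in the direction the write-up uses. Route 2 is the less error-prone choice when the result feeds a report, because the printed output already reads in the intended direction.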

A bar chart of two mean final exam scores: Support at 78, drawn with diagonal hatching as the higher bar, and Self-guided at 72, each bar carrying its printed value label. A double-headed arrow on the right spans from 72 up to 78 and is labeled difference equals plus 6 points. A caption below states the Welch t equals 2.53 on 86 degrees of freedom, p equals 0.013, and standard error of the difference 2.38. — Figure 2: **The Welch two-sample $t$-test as a picture: report the difference, not a verdict.** The two group means, Support $78$ and Self-guided $72$, differ by $+6$ points — the estimate the whole analysis is about. The Welch test gives $t=2.53$ on about $86$ degrees of freedom with $p=0.013$ and a standard error of the difference of $2.38$, so the $+6$ is statistically distinguishable from zero. Read the $+6$ (with its interval, next) rather than a bare significant-or-not verdict. Synthetic; seed 35203.

Step 4 — Compute Cohen’s d and write the three-register conclusion

The interval tells you the difference in the response’s own units (points); Cohen’s $d$ re-expresses it in standard-deviation units so you can judge whether $6$ points is practically large. Compute it two ways — by hand from the pooled SD, and with effsize::cohen.d() — so you can run the lab with or without the package and confirm the two agree.

set.seed(35203)

# --- Cohen's d by hand: standardized mean difference using the POOLED SD --------
m1 <- mean(g$score[g$group == "Support"]);     s1 <- sd(g$score[g$group == "Support"])
m2 <- mean(g$score[g$group == "Self-guided"]); s2 <- sd(g$score[g$group == "Self-guided"])

s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
s_pooled                                       # -> ~ 11.27  (pooled within-group SD)

cohen_d <- (m1 - m2) / s_pooled
cohen_d                                        # -> ~ 0.53   (medium standardized effect)

# --- The same d from the effsize package (optional; same answer) ----------------
# install.packages("effsize")                  # run once if you do not have it
library(effsize)
cohen.d(score ~ group, data = g)               # -> |estimate| ~ 0.53, magnitude: medium (sign follows level order, so it may print negative)

Two overlapping bell-shaped distribution curves of final exam score with a common pooled standard deviation of 11.27. The Self-guided curve is a solid slate-blue line centered at 72; the Support curve is a dashed brown line with diagonal hatching centered at 78. A double-headed arrow between the two means is labeled d equals 6 over 11.27 equals 0.53 standard deviations, medium. The curves overlap across most of their range. — Figure 3: **Cohen’s $d\approx0.53$ is a medium separation, not a clean split.** Re-expressing the $+6$-point gap in standard-deviation units gives $d = 6/11.27 \approx 0.53$ (pooled SD $11.27$). Drawn as two score distributions — Self-guided centered at $72$, Support at $78$ — the curves overlap heavily: half a standard deviation apart is a moderate, not dramatic, effect. The small $p$-value says the gap is unlikely to be zero; this picture says it is only *medium* in size. Synthetic; seed 35203.

The pooled within-group SD comes out $s_p \approx 11.27$, and the standardized difference is $d = 6.0 / 11.27 \approx 0.53$, about half a standard deviation and a medium effect by the usual rough convention. The effsize call should return the same magnitude and label it “medium” (its sign depends on the factor’s level order, so a value near $-0.53$ is the same effect read as Self-guided minus Support), a useful cross-check that your by-hand pooled SD is right. Now write the conclusion in the blueprint’s three registers, keeping them explicitly separate:

Statistical significance: with $t \approx 2.53$, $p \approx 0.013$, and a 95% interval $(1.3, 10.7)$ that excludes $0$, the $6$-point difference is unlikely to be a fluke of sampling — the data are inconsistent with no difference.
Practical significance: $d \approx 0.53$ says the gap is a medium effect — about half a standard deviation, roughly $6$ points on a $100$-point final. That is meaningful but not dramatic, and the wide interval ($1$ to $11$ points) means the data do not promise a large advantage. Statistical significance and practical importance are different questions, and you should answer both.
Association, not causation: because students chose whether to use the support center, this is an observational comparison. Motivated, better-prepared students plausibly self-select into help, so the $6$-point gap may reflect who uses the center as much as what the center does. The small p-value does not upgrade the claim to causal — only random assignment could do that, and there was none here. The honest sentence is “students who used the support center scored about $6$ points higher (95% CI $1.3$ to $10.7$, $d \approx 0.53$), but this is an association, not evidence that the center caused the gain.”

Verify

This is the moment where your simulation and the week-5 reasoning are supposed to meet. Check each item against the companion note, and treat any mismatch as a bug in your code, not a discovery about the world — every target value is the locked, synthetic Dataset G slice (illustrative, R not executed on this site).

A horizontal number line showing the difference in mean final score, Support minus Self-guided, in points. A diamond marker at plus 6 is labeled estimate plus 6, with a 95 percent confidence interval error bar running from 1.3 on the left to 10.7 on the right. A red dashed vertical line at 0 is labeled 0 equals no difference, the null; the confidence interval lies entirely to the right of it. A caption notes the interval excludes 0 so p equals 0.013 is significant, but d equals 0.53 is only medium and the design is observational. — Figure 4: **Significant is not the same as large, and neither is causal.** The estimated Support-minus-Self-guided difference is $+6$ points with a $95\%$ confidence interval of $(1.3, 10.7)$. Because the interval excludes $0$ (the dashed null line), $p=0.013$ is statistically significant — but the interval is wide, Cohen’s $d\approx0.53$ is only *medium*, and the design is observational, so this is an **association**, not evidence the support center *caused* the gain. Synthetic; seed 35203.

The group means are $78$ and $72$. tapply(g$score, g$group, mean) should return Support $78.0$ and Self-guided $72.0$. If they drift, the re-centering lines in Setup were altered: the seed changes the individual scores but cannot move means that are re-centered exactly. Restore those lines and re-run the chunk from set.seed(35203).
The group SDs are $10.5$ and $12.0$. tapply(g$score, g$group, sd) should return Support $10.5$ and Self-guided $12.0$. A wrong SD propagates into the SE, the interval, and Cohen’s $d$, so fix it before reading anything else.
The difference is $+6.0$ points. diff(tapply(g$score, g$group, mean)) (or diff(fit$estimate)) should return $6$. If you see $-6$, your factor levels are ordered the other way; set levels = c("Self-guided", "Support") so Support is the second level and the difference reads positive.
The Welch $t$ is about $2.53$ on df $\approx 86$. fit$statistic should be about $2.53$ in magnitude (R prints it negative when Self-guided is the first level) and fit$parameter about $86$. With the summary statistics pinned exactly, $6/\sqrt{5.65} \approx 2.524$, so a printed $2.52$ is a last-digit rounding difference from the note, not a bug. A df of exactly $88$ instead of about $86$ means you ran the pooled test (var.equal = TRUE); leave it at the default for Welch. Because $n_1 = n_2$, the pooled and Welch forms give the same $t$-statistic here and differ only in their degrees of freedom.
The p-value is about $0.013$. fit$p.value should land near $0.013$. A wildly different value usually means the means or SDs drifted (see the first two checks) — the test is downstream of them.
The 95% CI for the difference is $(1.3, 10.7)$. fit$conf.int prints $(-10.7, -1.3)$ for Self-guided minus Support; flip the sign to get $(1.3, 10.7)$ for Support minus Self-guided. If your interval includes $0$, re-check the SDs and the df — an interval that excludes $0$ is what matches the note.
Cohen’s $d$ is about $0.53$ (medium). The by-hand (m1 - m2) / s_pooled and effsize::cohen.d() should agree at about $0.53$ in magnitude; effsize may print the estimate with the opposite sign because it subtracts in factor-level order. If your by-hand magnitude disagrees with effsize, check that s_pooled uses the $(n_1 - 1)$, $(n_2 - 1)$ weighting, not a plain average of the two SDs.
The conclusion is stated in three registers. In one sentence: Support students scored about $6$ points higher (95% CI $1.3$ to $10.7$, $d \approx 0.53$, $p \approx 0.013$); the difference is statistically distinguishable from zero and a medium practical effect, but because students self-selected into the support center this is an association, not evidence the center caused the gain. If your written conclusion reports a bare p-value, calls the effect “significant” without saying how big, or uses a causal verb (“the center raised scores”), the reasoning — not the code — needs fixing.
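The degrees-of-freedom check in the list above can be run directly by fitting both forms side by side. This is a sketch under the same synthetic construction as Setup; pin() is a hypothetical helper, and the object names welch and pooled are chosen here for illustration.

```r
set.seed(35203)
n1 <- 45; n2 <- 45
# pin() is a hypothetical helper: rescale a draw to an exact mean and SD.
pin <- function(x, m, s) m + (x - mean(x)) / sd(x) * s
support    <- pin(rnorm(n1, 78, 10.5), 78, 10.5)
selfguided <- pin(rnorm(n2, 72, 12.0), 72, 12.0)
g <- data.frame(
  group = factor(rep(c("Support", "Self-guided"), times = c(n1, n2)),
                 levels = c("Self-guided", "Support")),
  score = c(support, selfguided)
)

welch  <- t.test(score ~ group, data = g)                    # default: var.equal = FALSE
pooled <- t.test(score ~ group, data = g, var.equal = TRUE)  # pooled-variance t

# Side-by-side comparison of the two fits.
rbind(
  welch  = c(t = unname(welch$statistic),  df = unname(welch$parameter),  p = welch$p.value),
  pooled = c(t = unname(pooled$statistic), df = unname(pooled$parameter), p = pooled$p.value)
)
```

With equal group sizes the two rows share the same t and differ only in df (a fractional value between 86 and 87 for Welch, exactly 88 for pooled), so the p-values and intervals differ only slightly. With unequal group sizes and unequal spreads the two forms can diverge, which is the case the Welch default protects against.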

A small honest note: the re-centered synthetic data are pinned to the locked summary statistics on purpose, so your t.test() output should reproduce $t \approx 2.53$, $p \approx 0.013$, and the $(1.3, 10.7)$ interval almost exactly. Real data would not cooperate so neatly; you would take the means, SDs, and interval as they came and report those. The lab fixes the numbers only so the method’s machinery is what you are studying, not the noise.

AI use note

If you use an AI assistant on this lab, record it briefly. The load-bearing column is Verification — how you confirmed the output yourself, against the week-5 note and the locked numbers, rather than trusting the tool. Computation, AI-assisted or not, does not turn an observational association into a causal claim, manufacture an effect size, or certify a confidence interval.

Tool	Purpose	Verification
LLM chat assistant	Explain why `t.test(score ~ group)` runs the Welch test by default and when to use `var.equal = TRUE`	Re-ran the fit with and without `var.equal = TRUE`, confirmed the default gives df $\approx 86$ (Welch) and the pooled gives df $\approx 88$, and reconciled with the week-5 note’s “prefer Welch” guidance
LLM chat assistant	Confirm the pooled-SD formula for Cohen’s $d$ and the $(n_1-1),(n_2-1)$ weighting	Computed `s_pooled` by hand, cross-checked the resulting $d \approx 0.53$ against `effsize::cohen.d()`, and confirmed both report “medium”
LLM chat assistant	Help phrase the 95% CI for the difference as estimate-with-uncertainty rather than a bare verdict	Read `fit$conf.int`, flipped the sign to $(1.3, 10.7)$ for Support minus Self-guided, and confirmed it excludes $0$ and matches the note
Code formatter / linter	Tidy the Setup and Step 1 data-construction blocks for readability	Diffed before/after to confirm only whitespace changed, then re-ran to confirm the locked means ($78$, $72$), SDs ($10.5$, $12.0$), and difference ($6$) were unchanged