Simulation and reproducibility

Week 9 — first reproducible simulation in a Quarto report: seed, generate, summarize, interpret

A short conceptual reading to continue Module B — R / computation, visualization, simulation, reporting. The companion hands-on walkthrough is Lab 7 — A small reproducible simulation in R. The exact assignment prompt, due dates, and submission details for the Week 9 simulation report live in the Assignments/LMS space.

Week 7 opened Module B by adding R chunks to the same Quarto container students had used since Week 1: load a small tidy dataset, inspect it, compute a few summaries, and write a sentence of prose under each piece of output. Week 8 added one more piece of substance — a figure, produced by a ggplot chunk that runs during the render. Week 9 adds one more: the data itself becomes computed, generated by your own R code under a stated seed rather than loaded from a built-in dataset.

There is still no new editor, no new render engine, and no new portfolio convention this week. You stay in VS Code, you render with Quarto to PDF, and your weekly artifact lives next to hw01/–hw04/, hw07/, hw08/, and latex-project/ in math-software-portfolio/. There is also no new package this week: set.seed(), sample(), table(), mean(), sum(), length(), and barplot() are all base R. What’s new is that you write a small piece of R code that generates the data, and the reproducibility of that data under a stated seed is the load-bearing skill.

One container, four substances now

Modules A and B share the same Quarto-to-PDF render chain, the same editor, and the same “render then read” verification habit. Week 7 added R chunks + loaded data; Week 8 added a figure; Week 9 adds simulated data — data the report generates rather than loads. The container did not change; the substance got richer again.

The arc of the weekly document, at the end of Week 9, looks like this from the top of the rendered PDF down: title block, short intro paragraph naming the simulation and the question and the seed, a setup chunk that calls set.seed(...), a simulation chunk that runs the random process, a summary chunk with a sentence under it, optionally one supporting summary or visualization, and a short interpretation paragraph that ties the simulation back to the question — under the stated seed. The same edit → render → look → re-render loop you have used since Week 1 still applies, with one new check: confirm that the seed is set, and that re-rendering produces the same numbers.

A simulation answers a question

Before you write sample(...) or any other random call, you should be able to say in one sentence what question your simulation will help answer. A simulation without a question is a number dump; a simulation with a question is evidence.

Examples of questions a small simulation can help answer:

Under this seed, how often does a simulated fair coin come up heads in 1000 flips?
Under this seed, how are 500 fair-die rolls distributed across the six faces?
Under this seed, what does the running average of repeated simulated trials look like as the number of trials grows?

If you cannot state the question, the simulation is not ready to write yet — pick a question first.
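As one hedged illustration, the running-average question could be sketched as follows. The seed value, the number of flips, the object names, and the use of cumsum() and seq_along() are assumptions made for this example; they are not taken from Lab 7.

```{r}
set.seed(2026)                          # seed first; 2026 is an arbitrary illustrative choice
n_flips <- 1000                         # hypothetical number of trials
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)

# running proportion of heads after 1, 2, ..., n_flips flips
running_avg <- cumsum(flips == "H") / seq_along(flips)

running_avg[c(10, 100, 1000)]           # the values to read in the render and describe in prose
```

The question comes first here too: the chunk only exists because there is a sentence asking how the running average behaves as the number of trials grows.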

`set.seed()` is the load-bearing function

R’s random number generator is pseudo-random: it produces numbers that look random but are completely determined by a starting point called a seed. Calling set.seed(N) once before any random call fixes the entire sequence of random numbers that follows. Two students running the same code with the same seed, on the same version of R with its default random-number settings, get the same numbers every time.

A simulation without set.seed() is not reproducible. The next render produces different numbers; the prose interpretation you wrote last night silently becomes wrong. In a report, unreproducible output is not evidence; it is noise.

A simulation with set.seed() is reproducible. The rendered PDF, the simulated values, the summary statistics, and the sentences of prose all line up — and they will still line up the next time someone runs your .qmd.
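A quick way to see this for yourself, sketched with an arbitrary seed (2026) and hypothetical object names: resetting the same seed replays the same draws, and identical() confirms it.

```{r}
set.seed(2026)
first_run <- sample(1:6, size = 20, replace = TRUE)    # 20 simulated die rolls

set.seed(2026)                                         # reset to the same starting point
second_run <- sample(1:6, size = 20, replace = TRUE)

identical(first_run, second_run)                       # TRUE: same seed, same draws
```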

The minimum-viable simulation in two chunks

The smallest useful simulation has two pieces: a call that sets the seed and a call that generates the data. In a full report these usually sit in two chunks, a setup chunk and a generation chunk; the illustrative example below puts both in a single chunk to show the shape (Lab 7 walks the same arc on a larger simulation with summary and prose):

```{r}
set.seed(2026)
sample(c("H", "T"), size = 10, replace = TRUE)
```

Two pieces in one chunk: a seed and a generation call. The seed must come first; once it is set, every subsequent random call is deterministic given that seed. Lab 7 walks the larger 1000-flip version end to end.

Code, output, and prose around every simulation chunk

Module A had a lesson: an equation needs surrounding prose. An equation sitting alone, with no setup and no interpretation, is an equation-dump, and a document made of equation-dumps is hard to read.

Week 7 had the same lesson, applied to code: a chunk that runs is not a chunk that is understood; a summary that appears in the PDF without a sentence saying what it means is a code-dump.

Week 8 had the same lesson, applied to figures: a plot inside a report needs one short sentence of prose underneath it saying what the rendered figure shows.

Week 9 has the same lesson, applied to simulated output: a chunk that produces a simulated value or summary needs one short sentence of prose underneath naming what the simulation under the stated seed produced. A column of simulated numbers with no prose is a simulation-dump, and a report whose simulation chunks stand on their own is hard to grade and harder to read.

The shape to write toward, every time you add a simulation chunk:

1. A short sentence of context: what is this chunk simulating, and what question does the simulation help answer?
2. The simulation chunk itself (preceded somewhere above by the seed).
3. A short sentence of interpretation: what does the rendered output actually show under the stated seed?

Step 3 is non-negotiable. The sentence describes what the rendered numbers show, not the simulation’s shape (“this is 1000 coin flips”) and not a guess from the code (“this should produce roughly 500 heads”). Under the stated seed, the rendered output produced a specific count — the prose names that count.
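The three-part shape above might look like this for a coin-flip summary. This is a sketch, not the Lab 7 solution: the seed, the object names, and the choice of summaries are illustrative assumptions. The sentences of context and interpretation go in the prose around the chunk, and the interpretation should quote whatever count the render actually prints.

```{r}
set.seed(2026)                                   # illustrative seed, set before any random call
flips <- sample(c("H", "T"), size = 1000, replace = TRUE)

table(flips)                                     # counts of H and T under this seed
n_heads <- sum(flips == "H")                     # number of heads
prop_heads <- mean(flips == "H")                 # proportion of heads
c(heads = n_heads, proportion = prop_heads)      # the two numbers the prose should name
```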

The “render then read” habit, applied to simulated output

Rendering has been verification since Week 1: edit the source, render, look at the PDF, fix anything that does not match what you intended, re-render. Week 7 extended the habit to computed output. Week 8 extended it to figures.

Week 9 extends the habit to simulation with a Week-9-specific discipline: render the document twice from unchanged source and confirm the numbers are the same. If the seed is set and runs before the first random call, the second render produces the same simulated values, the same counts, the same summary stats, and the same prose alignment. If the second render produces different numbers, the seed is missing or runs after a random call. A seed that differs from the one named in your prose will still reproduce, so also check that the value in set.seed(...) is the one you state.

This is a more concrete verification than Weeks 7 or 8 offered. Reproducibility under a stated seed is something you can test in under a minute.
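To see why seed placement matters, here is a small sketch with a hypothetical helper, run_report(), that stands in for one render of a report whose seed is misplaced. The draw made before set.seed() is not controlled by the seed; the draw made after it is. The helper name, the seed value, and the sizes are assumptions for illustration only.

```{r}
# hypothetical stand-in for one render of a report with a misplaced seed
run_report <- function() {
  before <- sample(1:6, size = 50, replace = TRUE)  # random call BEFORE the seed: not controlled
  set.seed(2026)                                    # seed arrives too late for the line above
  after <- sample(1:6, size = 50, replace = TRUE)   # random call AFTER the seed: reproducible
  list(before = before, after = after)
}

render_1 <- run_report()
render_2 <- run_report()

identical(render_1$after, render_2$after)    # TRUE: these draws follow the seed
identical(render_1$before, render_2$before)  # almost certainly FALSE: these draws do not
```

Calling the helper twice in one session only imitates two renders. The real check remains rendering the .qmd twice and comparing the PDFs.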

Repeated trials with `replicate()` (optional extension)

Sometimes the question is not “what did this one simulation produce” but “what would many of these simulations look like together?” The base-R function replicate(n, expr) evaluates the expression n times and, when each run returns a single number, collects the results into a vector. For example, the lab’s optional extension uses replicate() to repeat a 100-flip coin simulation many times and look at the distribution of the proportion of heads across those repetitions. The proportions cluster around 1/2. More repetitions make the shape of that cluster clearer; the cluster itself gets narrower when each repetition uses more flips. This kind of clustering shows up everywhere in statistics; it is sometimes called sampling behavior, and the formal name and proof live in a probability course you may take later. Week 9 only demonstrates the clustering; it does not derive it.
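A minimal sketch of the repeated-trials idea, assuming an arbitrary seed, 2000 repetitions, and a hypothetical helper one_run(). None of these choices come from Lab 7.

```{r}
set.seed(2026)                                   # one seed, set once, before any random call

# hypothetical helper: proportion of heads in one 100-flip simulation
one_run <- function(n_flips = 100) {
  mean(sample(c("H", "T"), size = n_flips, replace = TRUE) == "H")
}

props <- replicate(2000, one_run())              # numeric vector of 2000 proportions
length(props)                                    # 2000
summary(props)                                   # where the proportions cluster
sd(props)                                        # how spread out they are
```

The seed is set once at the top rather than inside one_run(). Reseeding inside the helper would make every repetition return the same proportion.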

Where the data comes from

For Week 9, the data is simulated. That puts the week in category 2 of the four acceptable data sources on the Data guidelines page — simulated data generated in-script with set.seed(). Built-in R datasets (category 1) remain available for prior weeks; public datasets (category 3) and instructor-approved student-selected datasets (category 4) are not used in Week 9.

What this week’s lab does

Lab 7 — A small reproducible simulation in R walks the whole arc — set a seed → generate 1000 coin flips → summarize the counts → optionally visualize → optionally extend with replicate() → render → read — using base R only. It shows the cleanest possible first reproducible simulation inside a Quarto report on a small toy process students already intuit. Do the lab on your own machine, in your own portfolio folder.

AI in Week 9

AI assistants are useful in the same ways they have been since Week 1: explanation, debugging, syntax lookup, drafting. In Week 9 specifically, they help with R syntax lookup (which random-generation function does X, what replace = TRUE means), debugging an erroring chunk, explaining what a piece of simulation code does, and rephrasing prose under the simulation chunks.

Two things AI cannot do for you in Week 9:

Read your simulation output for you. What your prose says about the simulation must match what the rendered output actually shows, not what an assistant told you the simulation would produce. AI assistants do not actually run R, so they sometimes hallucinate the exact numeric output of a set.seed(N) call followed by sample(...); their predicted numbers are a guess, not a reading.
Generate “simulation output” you did not actually run. If your .qmd claims that set.seed(2026) produced “503 heads,” that number must come from a chunk that actually ran during the render. Do not paste an assistant’s predicted count as if it were output your code produced.

The three-line AI Use Note (Tool / Purpose / Verification) applies. This week the Verification line should describe how you confirmed that what your prose says about the simulation is what the simulation under the stated seed actually produced — the cleanest verification is to re-render the document with the unchanged source and confirm the numbers match. See the AI use guidelines for the full pattern.

What you’ll do this week

In one paragraph: you will set a random seed at the top of a Quarto document, generate a small repeated random process in R, summarize the simulated outcomes with one or two basic summaries, optionally visualize the simulated distribution with a small plot, render everything inside a Quarto-to-PDF report, and write prose that explains what the simulation under the stated seed produced without overclaiming. You follow the same edit → render → read habit you have used since Week 1; the only new thing is that the numbers in the PDF come from R running your code under a known seed. The exact assignment prompt, due dates, and submission details for the Week 9 simulation report live in the Assignments/LMS space.

Looking ahead to the R Project

Week 10 is the R Project, and the R Project offers two tracks. Track A — data analysis and visualization with ggplot2 — is the natural extension of Week 8. Track B — simulation, sampling behavior, or a CLT-style investigation — is the direct extension of Week 9. The seed-first, simulate, summarize, interpret-under-the-seed workflow you build this week is the workflow you would extend if you choose Track B in Week 10. The exact project prompt, the track-selection mechanics, the conference sign-up, and the submission details live in the course LMS.