Lab 4 — Build and validate a DATA step
Cleaning the participants table and proving it is analysis-ready
Purpose. This lab is the hands-on companion to Week 4 — DATA step logic. The note develops the DATA step and the program data vector — read a row, run your logic, write a row, loop — and the cleaning decisions that turn a raw table into one you can trust. Here you build that cleaning as a runnable program, step by step, and then prove the result is analysis-ready by reading the log and reconciling the counts. Every SAS block below is shown as static, non-executed code; the logs and output are hand-authored synthetic listings — SAS is proprietary and is not run in this build.
The idea
A DATA step is where a messy export becomes a table the rest of the course can rely on. The raw participants table of the wellness-program study arrives with 210 rows and a known set of quality problems — duplicate ids, internal test rows, an impossible age, blank sex on rows that turn out to be the test and duplicate records, a date stored as text, impossible zero BMIs. The job is not just to write the cleaning logic but to leave a trace that it did what you intended: the log’s input and output counts, the NMISS columns, and a PROC FREQ you can check against the expected frequencies. The discipline this lab drills is the course’s recurring test — would someone else be able to rerun this and get the same 200 unique participants, with the same missing counts? The answer is “yes” only when the counts reconcile and nothing vanished unannounced.
The recurring data here are synthetic; seed streaminit(20260824). The wellness-program study stands in for a community screening program; it is not real health data, and any pattern you see is observational, not causal.
Goal
Take the raw rivercity.participants_raw table (210 rows) and, in DATA step logic, produce the clean work.participants_clean table of 200 unique participants — flagging the age = 199 typo and the two impossible baseline_bmi = 0 values to missing, recoding the 10 blank sex values to an explicit "U" so they stay visible, parsing the character enroll_date into a real SAS date, deleting the 2 internal test rows, and removing the 8 duplicate participant_id rows. Then validate: reconcile the row counts (210 → 208 → 200), confirm the cleaned frequencies (sex F 104 / M 96; arm coaching 100 / usual_care 100; site North 70 / Central 66 / South 64), and confirm the deliberate missingness (age NMISS = 1, baseline_bmi NMISS = 2). Finish by checking that every load-bearing number matches the companion week note.
Setup
Open SAS Studio in your course-designated environment (SAS OnDemand for Academics, SAS Viya for Learners, or a university-supported SAS installation — see the SAS access setup page). Create one program file, top to bottom, with a libname pointing at the project folder of permanent datasets, the options you rely on, and the seed call — the seed plays no role in the deterministic cleaning here, but fixing it keeps every lab reproducible by the same convention. Keep WORK for scratch outputs and write only the final cleaned table where you need it permanently.
/* Lab 4 — build and validate a DATA step on the wellness-program study.
Synthetic data; seed call streaminit(20260824). NOT executed in this build. */
options validvarname=v7 nodate nonumber;
libname rivercity "/home/u_course/rivercity"; /* permanent project library */
data _null_;
call streaminit(20260824); /* lab seed convention (no randomness used here) */
run;
There is nothing to read off this block yet — it only points a libref at the folder and sets options. The workflow move is organize before you compute: a reader who opens your program sees the library, the options, and the seed up front, so the run is reproducible from the first line.
Steps
Build the cleaning in four steps, and read the log and check a count after each one. A DATA step that renders without error is not a DATA step that did the right thing — the counts and the NMISS are the evidence.
Step 1 — read the raw table and look before you clean
Before changing anything, read the raw table and take its measure: how many rows, how many variables, and where the trouble is. A quick PROC FREQ with the / missing option surfaces the 10 blank sex values instead of hiding them, and PROC MEANS with n nmiss min max exposes the impossible age = 199 and baseline_bmi = 0 as out-of-range extremes.
/* Step 1: read the raw table; profile it before touching a value */
data work.participants_raw;
set rivercity.participants_raw; /* 210 rows, as exported */
run;
proc freq data=work.participants_raw;
tables sex / missing; /* show blanks, do not hide them */
run;
proc means data=work.participants_raw n nmiss min max;
var age baseline_bmi; /* expose the impossible extremes */
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 210 observations read from the data set RIVERCITY.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_RAW has 210 observations and 8 variables.
NOTE: There were 210 observations read from the data set WORK.PARTICIPANTS_RAW.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.03 seconds
NOTE: There were 210 observations read from the data set WORK.PARTICIPANTS_RAW.
The synthetic profiling output:
Output (synthetic, not executed) — PROC FREQ of sex, PROC MEANS of age and baseline_bmi
sex Frequency Percent
----- --------- -------
(blank) 10 4.8
F 104 49.5
M 96 45.7
Variable N NMISS Min Max
------------ ---- ----- ------ ------
age 210 0 19 199
baseline_bmi 210 0 0 44
CHECK. The log reads 210 observations and 8 variables — that is the table you expected to start from. The profiling confirms the locked problems are present before cleaning: PROC FREQ shows 10 blank sex rows alongside F 104 / M 96, and the three counts sum to the raw 210. Those 10 blank-sex rows are the same internal test and duplicate records you will remove in Steps 2 and 3, which is why the raw F/M counts (104 / 96) already equal the final cleaned F/M split — no F or M row is dropped or recoded. PROC MEANS shows age ranging up to the impossible 199 and baseline_bmi down to the impossible 0. The workflow move is profile first: you have read the raw table and documented exactly what is wrong, so every later change is a decision you can point to, not a guess.
Step 2 — clean values row by row in the DATA step
Now make each cleaning decision explicitly, one row at a time. Flag the impossible values to missing rather than deleting them — deleting destroys information and hides the data-quality problem, while a missing value keeps the row, shows up in NMISS, and lets a later reviewer decide. Recode the 10 blank sex values to a visible "U", parse the character enroll_date through the MMDDYY10. informat into a real SAS date, and delete the two internal test rows.
/* Step 2: clean values per row -- flag (do not silently drop) impossible values */
data work.participants_step;
set work.participants_raw;
if age = 199 then age = .; /* impossible age -> missing + later review */
if missing(sex) then sex = "U"; /* 10 blanks -> explicit unknown */
if baseline_bmi = 0 then baseline_bmi = .; /* impossible BMI -> missing */
enroll_dt = input(enroll_date, mmddyy10.); /* character "08/24/2026" -> SAS date (numeric) */
format enroll_dt date9.;
if region = "TEST" then delete; /* drop the 2 internal test rows */
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 210 observations read from the data set WORK.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_STEP has 208 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
CHECK. The log reads 210 and writes 208 — exactly the 2 test rows removed by if region = "TEST" then delete;, and no others, since flagging a value to missing keeps its row. Those 2 test rows were 2 of the 10 blank-sex rows you recoded to "U", so after this step the table holds F 104, M 96, and U 8 (= 208). The variable count rose from 8 to 9 because enroll_dt is new. There is no NOTE: Invalid data for enroll_dt line, so the MMDDYY10. informat parsed every date string; if it had failed on a row you would see that NOTE and a missing enroll_dt. Name the types: enroll_date is character, enroll_dt is numeric (a SAS date is a count of days since 1960-01-01). The workflow move is recode and account — 210 in, 208 out, and you can say which 2 rows left and why.
Step 3 — remove the duplicate ids and land on 200
The eight duplicate participant_id rows are not a per-value edit — they are a key problem, so handle them with a sort that keeps the first row per key. PROC SORT … NODUPKEY BY participant_id; orders the table by the key and drops any row whose key repeats. This is the move that turns 208 into the locked 200.
/* Step 3: remove the 8 duplicate participant_id rows (sort, keep first per key) */
proc sort data=work.participants_step
out=work.participants_clean
nodupkey;
by participant_id;
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 208 observations read from the data set WORK.PARTICIPANTS_STEP.
NOTE: 8 observations with duplicate key values were deleted.
NOTE: The data set WORK.PARTICIPANTS_CLEAN has 200 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.02 seconds
CHECK. The log says it deleted 8 observations with duplicate key values and wrote 200 — reconcile the whole path: 210 raw − 2 test rows = 208 (Step 2), then 208 − 8 duplicate ids = 200 (Step 3). That is the exact arithmetic the companion note locks. Those 8 duplicate-participant_id rows were the 8 remaining U (unknown-sex) records from Step 2, so removing them lands the final table on F 104 / M 96 with no surviving U — which is why the cleaned sex split exactly matches the raw F/M counts. A caution: NODUPKEY removes rows that repeat the key, which is what you want for a duplicated participant_id; it is not the same as NODUP (which removes rows that are identical across all variables) — confusing the two is a common bug. The workflow move is de-duplicate on the key and confirm the count — the table is now one row per participant.
Step 4 — validate the cleaned table
Cleaning is only half the job; the other half is proving it. Run PROC FREQ on the categorical variables and PROC MEANS with n nmiss min max on the numeric ones, and check every number against what you expected. This is the validation step the rest of the course leans on.
/* Step 4: validate -- frequencies, ranges, and deliberate missingness */
proc freq data=work.participants_clean;
tables sex arm site / missing;
run;
proc means data=work.participants_clean n nmiss min max;
var age baseline_bmi;
run;
The synthetic validation output:
Output (synthetic, not executed) — PROC FREQ and PROC MEANS on WORK.PARTICIPANTS_CLEAN
sex Frequency arm Frequency site Frequency
----- --------- ---------- --------- ------- ---------
F 104 coaching 100 North 70
M 96 usual_care 100 Central 66
South 64
Variable N NMISS Min Max
------------ ---- ----- ------- -----
age 199 1 21 78
baseline_bmi 198 2 17 44
CHECK. The frequencies match the locked cleaned values exactly — sex F 104 / M 96, arm coaching 100 / usual_care 100, site North 70 / Central 66 / South 64 — and each set sums to 200, so no participant was dropped or double-counted. The / missing option confirms there are no remaining blank sex rows and no U row: the 10 blanks were recoded to the visible "U" in Step 2 (2 of them on the deleted test rows), and the 8 that remained were precisely the duplicate-participant_id records cleared in Step 3 — so the unknown-sex rows are fully accounted for and the cleaned table is F/M only. That is why the final F/M split equals the raw F/M counts: no F or M row was ever dropped or recoded. The NMISS column shows the cleaning was deliberate, not lossy: age has 1 missing (the flagged 199, so N = 199) and baseline_bmi has 2 missing (the flagged zeros, so N = 198), and the surviving ranges are now plausible (age 21–78, baseline_bmi 17–44). The workflow move is validate before you trust — the table is analysis-ready because the counts, frequencies, and NMISS all check out, not because the code rendered.
Verify
Walk the whole result against the companion Week 4 note and confirm each load-bearing number agrees.
- The row-count path reconciles to 200. 210 raw rows − 2 test rows = 208 (Step 2 log), then 208 − 8 duplicate ids = 200 (Step 3 log). The log lines
8 observations with duplicate key values were deletedand200 observationsare the proof. This is the locked 210 → 208 → 200 path from the note. If your final count is not 200, do not proceed — find the step where the count diverged. - The cleaned frequencies match the note exactly.
sexF 104 / M 96,armcoaching 100 / usual_care 100,siteNorth 70 / Central 66 / South 64 — and each sums to 200. These are the locked PROC FREQ values; a mismatch means a recode or a drop did something you did not intend. The 10 blank-sexrows were recoded to the visible"U"and then fully removed as the 2 test rows and 8 duplicate-id rows, so the final table is F/M only and the cleaned F/M split equals the raw F/M counts. - The missingness is deliberate and accounted for.
ageNMISS = 1 (the flagged199) andbaseline_bmiNMISS = 2 (the flagged zeros), with N = 199 and 198 respectively. Nothing vanished unannounced — the impossible values are visible inNMISS, not silently deleted. This is the note’s “flag, do not silently drop” discipline. - The types are right.
enroll_dtis numeric (a SAS date), not the characterenroll_date; the absence of anyInvalid data for enroll_dtNOTE confirms theMMDDYY10.informat read every row. A number stored as text would block PROC MEANS — so confirming the type now prevents a silent failure later. - The log is clean of unexpected signals. No
ERROR, and the only NOTEs are the expected counts and (in the note’s longer cleaning step) theMissing values were generatedline from flagging. An unexpectedWARNING— for example a many-to-many merge note — would mean the logic did something other than what you read.
All five checks passing is the lab’s definition of analysis-ready, and every number here is the synthetic locked value of the wellness-program study. Because these are hand-authored and not run, agreement with the note is internal-consistency evidence only — it is not confirmation against a real SAS execution. The SAS execution/output gate is BLOCKED; the page is verified: false until the human/SAS-run sign-off in the course’s private notation and verification ledger §5. Treat the cleaned table as a teaching object on synthetic, observational data, never a real health finding.
AI use note
You may use an AI assistant to explain a function or help debug your own SAS, but the verification line is yours: rerun the program in your own SAS session, read the log, and reconcile the counts against this lab’s Verify section before you trust anything the assistant produced. Record an AI Use Note on any graded work.
| Tool | Purpose | Verification |
|---|---|---|
| which assistant you used, with approximate date or version | what you used it for (e.g. explaining the MMDDYY10. informat, or why NODUPKEY differs from NODUP) |
how you checked it: confirmed the final count is 200 in the log, reconciled 210 → 208 → 200 by hand, matched the PROC FREQ to F 104 / M 96 and the NMISS to 1 and 2, and confirmed enroll_dt is numeric |
Verification is load-bearing: an AI can write a PROC SORT … NODUPKEY for you, but you confirm it deleted exactly 8 duplicate-key rows and landed on 200 — and that you can say why the impossible values were flagged to missing rather than deleted.
See also
- Companion note: Week 4 — DATA step logic — the PDV, IF-THEN/ELSE, the subsetting IF, and the missing-value traps this lab applies.
- Next: Week 5 — Importing, cleaning, and validating — the full import-and-validate pipeline that wraps this cleaning.
- SAS workflow glossary — DATA step, PDV, character vs numeric, missing values, informat vs format.
- Log and verification guide — reading NOTE / WARNING / ERROR and checking counts, types, and
NMISS. - SAS access setup — getting into the course-designated SAS environment.
The graded deliverable, its rubric, and due date live in Blackboard (the LMS) — this page is study and practice only. All SAS, logs, output, and numbers here are synthetic and verified: false; the SAS execution/output gate is blocked pending sign-off.