Week 4 — DATA step logic
Creating, transforming, subsetting, and cleaning data row by row
The week question
In Week 3 you learned what a SAS dataset is — a library of rows and columns, each variable with a type, a label, and a format — and you read a participant table that someone else had already cleaned. Real analytic data almost never arrives that tidy. This week’s question is the first true data-preparation question of the course: how do you take a raw table and, row by row, build the analysis-ready version you can trust? That “row by row” is the heart of it. The DATA step is SAS’s per-observation engine: it reads one row, runs your logic on it, writes a row, and loops to the next. To use it well you have to picture the program data vector (the PDV) — the single-row workspace SAS fills, edits, and writes on every pass. Once you can see the PDV, creating a variable, recoding a value, dropping a bad row, and deciding what to do with a missing value all become the same small set of moves. We will use those moves to turn the raw participants table of the wellness-program study into the cleaned table the rest of the course depends on — and, as always, we will read the log and check the counts to confirm we did what we think we did.
Why this matters
The DATA step is the workhorse you will return to every week after this one — importing (Week 5), reshaping and merging (Week 12), preparing the modeling slices (Weeks 9–11), and assembling the final reproducible report (Week 14) all run through DATA step logic. It matters here for three reasons. First, the PDV is a threshold concept: until you can picture SAS holding one row at a time, half of SAS’s behavior — why a retain is needed, why if x < 5 quietly keeps missing rows, why a variable is “uninitialized” — looks like magic. Once you see the one-row workspace, those stop being surprises. Second, this is where data cleaning becomes deliberate. A typo of age = 199, a blank sex, an impossible baseline_bmi = 0, two internal test rows — each is a decision, not an accident to ignore, and the DATA step is where you make the decision and leave a trace of it. Third, it is where the course’s two load-bearing disciplines first bite: character versus numeric (a number stored as text blocks every later PROC) and missing-value handling (missing propagates through arithmetic and behaves treacherously in comparisons). Get these right in Week 4 and the rest of the workflow rests on solid ground; get them wrong and every downstream count, mean, and model inherits the error silently.
Learning goals
By the end of this week you should be able to:
- Describe what a DATA step does on each iteration — read a row into the PDV, run your statements, write a row, loop — and explain why that “one row at a time” picture predicts SAS’s behavior.
- Create and transform variables with the assignment statement, and convert between character and numeric with
input()/put(), naming the type at every step. - Use IF-THEN/ELSE, a subsetting IF, and WHERE to recode values and keep or drop rows, and say which rows each one keeps — including what happens to missing values.
- Handle missing values deliberately: set an implausible value to missing, recode blanks, and use
nmiss/missing()to confirm what you did instead of letting rows vanish unannounced. - Clean the raw
participantsslice from 210 rows to 200 unique participants — flag theage = 199typo to missing, address the 12 blanksexvalues, drop the 8 duplicate ids and 2 test rows — and verify the result by reading the log and checking counts, types, andNMISS. - Read the DATA step’s log (NOTE / WARNING / ERROR) and treat the row counts and any type-conversion or missing-value notes as part of the output, not noise to skip.
Core vocabulary
The week’s SAS terms, defined plainly. These mirror the SAS workflow glossary; keep character vs numeric and missing sharp in both words and code.
- DATA step — a SAS program block that begins with
data <out>;and ends withrun;. It reads input row by row, runs your statements on each row, and writes rows to the output dataset. It is SAS’s per-observation programming engine, distinct from aPROC. - Program data vector (PDV) — the in-memory single-row workspace SAS builds for a DATA step. On each iteration SAS clears the PDV (for read-in variables), reads or computes values into it, then writes the row to the output. Picture one row of slots being filled, edited, and emptied over and over.
- Assignment statement —
new_var = expression;— computes a value and stores it in a variable in the PDV, creating the variable if it does not exist. The type ofnew_varis set the first time SAS sees it. - IF-THEN / ELSE — conditional logic:
if <condition> then <action>;with an optionalelse. Used to recode, flag, or set values per row. It does not, by itself, drop rows. - Subsetting IF —
if <condition>;with nothen— keeps only rows where the condition is true and silently drops the rest. It runs as the row passes through the PDV. - WHERE statement —
where <condition>;— selects rows before they enter the PDV, so it cannot reference variables you create inside the step. Often more efficient than a subsetting IF for raw filtering. - Character vs numeric — a variable is one or the other for its whole life.
participant_idis numeric;sex/site/armare character. Numbers stored as text must be converted withinput(text, informat.)before any math;put(number, format.)goes the other way. - Missing value — numeric missing is
.; character missing is blank" ". Test it withmissing(x); count it withnmissin PROC MEANS. Missing propagates in arithmetic and sorts as less than any number in comparisons — the week’s central trap.
Concept development
The DATA step and the PDV — one row at a time
Every DATA step is a loop. SAS sets up a program data vector, a one-row workspace with a slot for each variable, then repeats: read (or compute) the next row’s values into the PDV, run your statements top to bottom, write the PDV to the output dataset, and start over. When the input is exhausted, the step stops. That single picture explains most of what a DATA step does — including the things that look strange before you see it. Here is the smallest possible step that creates a variable from existing ones.
/* Synthetic wellness-program study; seed call streaminit(20260824) used elsewhere.
This step reads the raw participants table and derives one new variable. */
options validvarname=v7;
data work.participants_demo;
set rivercity.participants_raw; /* read one row at a time into the PDV */
bmi_flag = (baseline_bmi = 0); /* assignment: 1 if impossible BMI, else 0 */
run;
The synthetic log for this step:
SAS log (synthetic)
NOTE: There were 210 observations read from the data set RIVERCITY.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_DEMO has 210 observations and 9 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
What to check: the input count (210 observations read) and the output count (210 observations) match — this step created a column, it did not add or drop rows — and the variable count rose from 8 to 9 because bmi_flag is new. The workflow move here is read and create: you read the raw table and created one derived variable, and the log’s two counts confirm the row count is unchanged and exactly one variable was added.
Creating and transforming variables — and the type that bites
The assignment statement builds and transforms columns. The one thing you cannot ignore is type: a variable is character or numeric for life, and a number that arrived as text will block every later PROC MEANS until you convert it. In the raw study, enroll_date came in as the character string "08/24/2026" — it looks like a date but SAS stores it as text, so no date arithmetic works until you read it through an informat.
data work.participants_typed;
set rivercity.participants_raw;
/* enroll_date arrived as character "08/24/2026"; read it into a real SAS date */
enroll_dt = input(enroll_date, mmddyy10.); /* character -> numeric date */
format enroll_dt date9.; /* display the numeric date readably */
days_enrolled = today() - enroll_dt; /* now date math is legal */
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 210 observations read from the data set RIVERCITY.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_TYPED has 210 observations and 10 variables.
What to check: there is no NOTE: Invalid data for enroll_date line, which is what you would see if the mmddyy10. informat failed to read the string — so every row parsed. The workflow move is convert and verify: enroll_dt is now numeric (a SAS date is a number of days since 1960-01-01) displayed with a date9. format, and days_enrolled only computes because the conversion succeeded. Had you skipped the input() and written today() - enroll_date directly, the log would have warned NOTE: Character values have been converted to numeric — a silent coercion you never want to rely on. Name the type every time it matters: enroll_date is character, enroll_dt is numeric.
IF-THEN/ELSE, subsetting, and missing values — keeping the right rows
Three statements decide what happens to a row, and they are easy to confuse. IF-THEN/ELSE recodes or flags a value but keeps the row. A bare subsetting IF keeps only matching rows and drops the rest. WHERE filters rows before they reach the PDV, so it cannot see variables you create in the step. The trap lives in how all three treat missing: a numeric missing is ., and SAS orders it below every real number, so if age < 30 quietly includes the age = 199-turned-missing rows, while if age > 30 excludes them.
data work.participants_clean;
set rivercity.participants_raw;
/* 1. Flag the impossible age typo (199) to missing -- deliberately, not silently */
if age = 199 then age = .;
/* 2. Recode blank sex (12 rows) to an explicit "U" so it is visible, not invisible */
if missing(sex) then sex = "U";
/* 3. Drop the 2 internal test rows by their region label */
if region = "TEST" then delete;
/* 4. Keep only adult participants -- watch how missing age behaves here */
if age >= 18; /* subsetting IF: missing age is < 18, so it is DROPPED */
run;
The synthetic log:
SAS log (synthetic)
NOTE: Missing values were generated as a result of performing an operation on missing values.
NOTE: There were 210 observations read from the data set RIVERCITY.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_CLEAN has 207 observations and 8 variables.
What to check: 210 read, 207 written — the 2 TEST rows were deleted and the single age = 199 row was dropped by if age >= 18 (its age is now missing, and missing sorts below 18). That is the behavior you wanted here, but it is also the trap: if you had written if age <= 65 instead, the same missing-age row would have survived, because missing is less than 65. Always decide consciously whether a comparison should include or exclude missing, and confirm the count. The workflow move is recode, drop, and account: every row that left the table left for a reason you can name, and the log’s 207 lets you reconcile it against the 210 you started with. (The 8 duplicate-id rows are removed in the worked example below, which is the move that lands the final 200.)
Worked examples
Worked example — cleaning the wellness-program participants table (the recurring slice)
The task. RiverCity Wellness exported 210 raw rows of enrolled participants, and the table has the locked quality problems: 8 duplicate participant_id rows, 2 internal test rows, one age typo of 199, 12 blank sex values, enroll_date stored as character, and 2 impossible baseline_bmi = 0 values. Build the analysis-ready table of 200 unique participants in DATA step logic, then verify it. Data are synthetic; seed streaminit(20260824); this is not real health data.
options validvarname=v7;
/* Step A: clean values row by row (type fixes, missing-value decisions, drop test rows) */
data work.participants_step;
set rivercity.participants_raw;
if age = 199 then age = .; /* impossible age -> missing + later review */
if missing(sex) then sex = "U"; /* 12 blanks -> explicit unknown */
if baseline_bmi = 0 then baseline_bmi = .; /* impossible BMI -> missing */
enroll_dt = input(enroll_date, mmddyy10.); /* character "08/24/2026" -> SAS date */
format enroll_dt date9.;
if region = "TEST" then delete; /* drop the 2 internal test rows */
run;
/* Step B: remove the 8 duplicate participant_id rows (sort, then keep first per key) */
proc sort data=work.participants_step out=work.participants_clean nodupkey;
by participant_id;
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 210 observations read from the data set RIVERCITY.PARTICIPANTS_RAW.
NOTE: The data set WORK.PARTICIPANTS_STEP has 208 observations and 9 variables.
NOTE: There were 208 observations read from the data set WORK.PARTICIPANTS_STEP.
NOTE: 8 observations with duplicate key values were deleted.
NOTE: The data set WORK.PARTICIPANTS_CLEAN has 200 observations and 9 variables.
The synthetic verification output:
Output (synthetic, not executed) — PROC FREQ and PROC MEANS on WORK.PARTICIPANTS_CLEAN
sex Frequency arm Frequency site Frequency
----- --------- ---------- --------- ------- ---------
F 104 coaching 100 North 70
M 96 usual_care 100 Central 66
South 64
Variable N NMISS Mean Min Max
------------ ---- ----- ------- ----- -----
age 199 1 46.2 21 78
baseline_bmi 198 2 28.7 17 44
The verification check. Reconcile the counts: 210 read − 2 test rows = 208 (Step A), then 208 − 8 duplicate ids = 200 (Step B). The log’s 8 observations with duplicate key values were deleted and final 200 observations confirm the arithmetic. The PROC FREQ matches the locked cleaned frequencies exactly — sex F 104 / M 96, arm coaching 100 / usual_care 100, site North 70 / Central 66 / South 64 — and the NMISS column shows the cleaning was deliberate, not lossy: age has 1 missing (the flagged 199) and baseline_bmi has 2 missing (the flagged zeros). Nothing vanished unannounced.
Interpretation. You read 210 raw rows, made each cleaning decision explicitly, and produced 200 unique analysis-ready participants — and you can prove it from the log and the NMISS counts rather than on trust. The workflow move is the course’s recurring test: would someone else be able to rerun this and get 200, with the same 1 and 2 missing? Yes — because the steps are in code, top to bottom, and the counts reconcile. Two cautions hold: the flagged-missing age and baseline_bmi values are now ., so any later mean must report N and NMISS, and the sex = "U" recode makes the unknowns visible for a later decision rather than hiding them inside a blank.
Worked example — transfer: deriving a BMI category in a new dataset
The task. Switch context to a different (synthetic) table — a campus fitness-center intake sheet, not the wellness study — with one row per member and a numeric bmi column. Derive a categorical bmi_category variable with the standard cut points, handling missing bmi deliberately so it does not get silently bucketed. Data are synthetic; seed streaminit(20260824).
data work.intake_categorized;
set campus.fitness_intake; /* one row per member; numeric bmi */
length bmi_category $ 12; /* set the new char variable's length first */
if missing(bmi) then bmi_category = "Unknown"; /* handle missing FIRST */
else if bmi < 18.5 then bmi_category = "Underweight";
else if bmi < 25 then bmi_category = "Normal";
else if bmi < 30 then bmi_category = "Overweight";
else bmi_category = "Obese";
run;
proc freq data=work.intake_categorized;
tables bmi_category / missing;
run;
The synthetic log:
SAS log (synthetic)
NOTE: There were 150 observations read from the data set CAMPUS.FITNESS_INTAKE.
NOTE: The data set WORK.INTAKE_CATEGORIZED has 150 observations and 4 variables.
The synthetic output:
Output (synthetic, not executed) — PROC FREQ of bmi_category
bmi_category Frequency Percent
------------ --------- -------
Underweight 9 6.0
Normal 61 40.7
Overweight 47 31.3
Obese 28 18.7
Unknown 5 3.3
---- -----
Total 150 100.0
The verification check. The frequencies sum to 150, matching the rows read — no member was dropped or double-counted. Crucially, the 5 Unknown rows appear in the table because the if missing(bmi) clause ran before the numeric comparisons; had you tested if bmi < 18.5 first, those 5 missing-bmi rows would have fallen into Underweight (since missing < 18.5 is true), silently corrupting the smallest category. The length bmi_category $ 12; statement guarantees the longest label, "Underweight", is not truncated.
Interpretation. The same DATA step moves — assignment, IF-THEN/ELSE cascade, explicit missing handling — transfer cleanly to a brand-new table; nothing about the logic was specific to the wellness study. The lesson that carries is order your conditions so missing is caught on purpose: in SAS, missing is not “no category,” it is “a value smaller than every threshold,” so it must be handled first or it lands wherever the first true comparison sends it. This is a derived display variable on synthetic data, not a clinical classification of anyone.
A common mistake
The week’s signature trap is letting missing values slip through a comparison and then through your counts. It shows up in three linked ways:
- The comparison trap. Because numeric missing sorts below every number,
if x < cutoffkeeps missing rows andif x > cutoffdrops them. A subsettingif age >= 18looks like it only filters out children, but it also silently removes every row whoseageis missing — sometimes what you want, sometimes a quiet data loss. Decide which, then confirm the row count. - Silent type coercion. Writing arithmetic on a character-stored number (like the raw
enroll_date) does not error; SAS converts it and logs onlyNOTE: Character values have been converted to numeric. If the string is not a clean number the result is missing, and your means drift with no ERROR to warn you. Convert explicitly withinput(..., informat.)and read the log for that NOTE. - Dropping instead of flagging. Deleting every row with a bad value (
if age = 199 then delete;) destroys information and hides the data-quality problem. Setting it to missing (if age = 199 then age = .;) keeps the row, makes the issue visible inNMISS, and lets a later reviewer decide. Flag, do not silently drop.
The defense is the same each time: after any DATA step that recodes or subsets, run a quick proc means data=… n nmiss min max; (or proc freq … / missing;), reconcile the row count against what you started with, and read the log for Missing values were generated, converted to numeric, and the input/output counts. A rendered, syntax-highlighted DATA step is not evidence it did the right thing — the counts and the NMISS are.
Low-stakes self-checks (ungraded)
These are for self-study only — ungraded, no submission.
- In your own words, describe one full iteration of a DATA step in terms of the PDV: what gets read, what your statements do, and what gets written. Why does this picture explain that a subsetting
ifcan drop a row mid-step? - The raw
participantstable has 210 rows. Trace the path to 200: which statement removes the 2 test rows, which removes the 8 duplicate ids, and why doesage = 199 then age = .;not change the final count? - You write
if baseline_bmi < 15;to drop implausibly low BMIs. After flagging the two zeros to missing withif baseline_bmi = 0 then baseline_bmi = .;, do those two rows survive this subsettingif? Explain using how SAS orders missing. - Distinguish WHERE from a subsetting IF in one sentence each. Which one can reference a variable you create earlier in the same DATA step, and why?
enroll_dateholds the text"08/24/2026". Write the assignment that turns it into a real SAS date, name the informat, and say what the log lineNOTE: Invalid data for enroll_dtwould mean if you saw it.- A classmate cleans the table with
if age = 199 then delete;. Name two things that go wrong compared with settingageto missing, in this week’s vocabulary (thinkNMISSand traceability).
Reading and source pointer
For this week’s procedures, read the SAS documentation on the DATA step and the program data vector (how a step reads input and builds each row), on IF-THEN/ELSE, the subsetting IF, and the WHERE statement (conditional logic and row selection), and on the assignment statement and the INPUT/PUT functions and the MISSING function (creating and transforming variables and testing missing values). Use these as syntax-and-option reference pointers — what an informat does, how nodupkey behaves, how missing() reads — not as a source to copy. These notes are the course’s own synthesis: grounded in the SAS documentation and open statistics references, but not copied from them. SAS® and all SAS Institute product names are the property of SAS Institute Inc.
For the hands-on build, work the companion Lab 4 — Build and validate a DATA step, which carries the fuller cleaning sequence step by step.
Verification & reproducibility status
verified: false. The SAS code, log excerpts, and every numeric value on this page are hand-authored, synthetic, and were NOT run — SAS is proprietary and is not executed in this build. The course SAS execution/output gate is BLOCKED; a rendered code block or typed listing is not evidence the code runs or the numbers are right. The load-bearing numbers here — the 210 raw rows cleaning to 200 unique participants (via −2 test rows to 208, then −8 duplicate ids); the cleaned frequencies sex F 104 / M 96, arm coaching 100 / usual_care 100, site North 70 / Central 66 / South 64; and the NMISS counts of 1 for age and 2 for baseline_bmi — are the locked synthetic values of the wellness-program study (seed streaminit(20260824)), drafted “as if run” for this draft site and cross-checked only for internal and narrative consistency. The transfer example’s counts (150 members, 5 Unknown) are likewise synthetic and notional. These are not real health data, and the wellness study is observational. Do not treat any value here as a confirmed reference until the human/SAS-run sign-off in the course’s private notation and verification ledger §5 is complete.
Public vs. graded
These notes, the SAS examples, and the practice here are public and ungraded — study material only. No graded prompts, answer keys, rubrics, point values, or due dates appear on this site. Graded SAS workflow checkpoints, skill checks, homework, analytics labs, the midterm practical, the final analytics project, and the final practical live in Blackboard (the LMS), which is authoritative for due dates, submissions, and grades. If this page and Blackboard ever disagree, follow Blackboard.
Looking ahead
Next week we widen the lens from logic to the full import-and-validate pipeline: Week 5 brings the raw file in with PROC IMPORT, runs the same cleaning you wrote here, and validates the result with PROC CONTENTS and a deliberate row-count and type review — turning the locked 210 → 200 path into a repeatable, documented import you could hand to someone else. The DATA step logic from this week is the engine inside that pipeline.
See also
- Previous week: Week 3 — Libraries, datasets, variables, formats — the variable attributes and the
enroll_dateinformat/format this week’s logic builds on. - Next week: Week 5 — Importing, cleaning, and validating — the full import-and-validate pipeline that wraps this week’s DATA step.
- Companion lab: Lab 4 — Build and validate a DATA step.
- SAS workflow glossary — DATA step, PDV, character vs numeric, missing values, informat vs format.
- Log and verification guide — reading NOTE / WARNING / ERROR and checking counts, types, and
NMISS.