metasurvey offers two ways to create recipes from existing STATA code:

- `transpile_stata()` – converts `.do` files to recipes in seconds. Great for migrating legacy code quickly (see `vignette("stata-transpiler")`).
- Hand-crafted recipes – built step by step in R, as this vignette demonstrates.

The transpiler is a pragmatic shortcut: it reads hundreds of lines of
STATA and produces a working recipe, but the output inherits the
original code’s structure – long gen/replace
chains become long step_recode calls, temporary variables
survive, and STATA-specific patterns (like mvencode) get
translated literally rather than rethought.
A hand-crafted recipe, on the other hand, lets you redesign the logic in R from the ground up. You pick meaningful variable names, combine related transformations into a single step, and skip intermediate variables that only existed because STATA needed them. The result is shorter, easier to read, and easier to maintain.
This vignette builds a demographics recipe from scratch in about 20
lines of R. A transpiled version of the same pipeline would take 80+
steps and carry over variable names like bc_pe2 and
bc_pe3 that mean nothing outside the original
.do file.
We start with an empty Survey object. This declares the survey type and edition without loading any data yet – the recipe will work on whatever data we feed it later.
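Using the constructor that appears later in this vignette, creating the empty survey looks like:

```r
library(metasurvey)

# Declare the survey type and edition; no microdata is attached yet
svy <- survey_empty(type = "ech", edition = "2023")
```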
Now let’s attach some sample data. In production this would come from
anda_download_microdata("2023") or a local file; here we
simulate it.
```r
set.seed(42)
n <- 200
dt <- data.table::data.table(
  id = rep(1:50, each = 4),
  nper = rep(1:4, 50),
  pesoano = runif(n, 50, 300),
  e26 = sample(1:2, n, replace = TRUE),
  e27 = sample(0:90, n, replace = TRUE),
  e30 = sample(1:7, n, replace = TRUE),
  e51_2 = sample(c(0:6, -9), n, replace = TRUE),
  region_4 = sample(1:4, n, replace = TRUE)
)

svy <- svy |> set_data(dt)
```

Every transformation is a step. By default, steps are lazy: they record what to do without executing it. This lets you inspect and modify the pipeline before materializing the results.
Compare this with the transpiler approach:
transpile_stata() would produce one step per STATA command,
faithfully preserving every gen and replace.
Here we think in terms of the output variables we want, not the
commands we need to type.
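For example, sex can be derived from `e26` in one declarative step. The labels match the comparison table at the end of this vignette; treat the exact call as a sketch following the `step_recode` pattern shown below:

```r
svy <- svy |>
  step_recode(sex,
    e26 == 1 ~ "Male",
    e26 == 2 ~ "Female",
    comment = "Sex from e26"
  )
```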
Nothing has happened to the data yet:
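One way to check, assuming `get_data()` returns the attached `data.table` (as used later in this vignette):

```r
# Still the raw column names -- the step is recorded, not executed
names(get_data(svy))
```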
The original column names are still there because the step is pending. Let’s keep adding steps.
In STATA this would be a gen + replace +
replace sequence (3 commands). With
step_recode it’s a single, declarative mapping that
produces human-readable labels:
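As a sketch, relationship to the household head might be recoded from `e30` like this. The codes and labels here are illustrative assumptions, not the official ECH coding:

```r
svy <- svy |>
  step_recode(relationship,
    e30 == 1 ~ "Head",             # assumed code for household head
    e30 == 2 ~ "Spouse/partner",   # assumed code for spouse
    e30 %in% 3:7 ~ "Other member", # remaining codes collapsed
    comment = "Relationship to household head from e30"
  )
```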
The STATA equivalent uses five replace lines with
inrange(). Here we write the same logic as a single recode
with readable conditions:
```r
svy <- svy |>
  step_recode(area,
    region_4 == 1 ~ "Montevideo",
    region_4 == 2 ~ "Urban >5k",
    region_4 == 3 ~ "Urban <5k",
    region_4 == 4 ~ "Rural",
    .default = NA_character_,
    comment = "Geographic area from region_4"
  )
```

Notice that all our output variables have meaningful labels instead of numeric codes. A transpiled recipe would keep the original integer codes (1, 2, 3…) because that’s what the STATA code used. Building from scratch lets you choose the representation that makes analysis easier.
At this point we have seven pending steps. Let’s see what was recorded:
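Assuming `get_steps()` (used below when building the recipe) returns the recorded steps, we can print them:

```r
# Seven pending steps, none executed yet
get_steps(svy)
```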
This is one of the key advantages of building from scratch: 7 steps that each do one clear thing. A transpiled version of the full IECON demographics module has 80+ steps because it preserves every intermediate STATA command.
The pipeline is a DAG (directed acyclic graph) of transformations.
view_graph() renders it as an interactive network – each
node is a step, and edges show variable dependencies:
The interactive DAG is not rendered in this vignette to keep the
package size small. Run view_graph() in your R session to
explore it. With only 7 nodes the graph is clean and navigable. Compare
that with a transpiled recipe where the DAG can have 100+ nodes – still
useful for auditing, but much harder to read at a glance.
For static output we can inspect the step list:
A recipe bundles the steps with metadata so anyone can reproduce the same pipeline on different data. We create the recipe before baking – the lazy steps are the pipeline:
```r
rec <- steps_to_recipe(
  name = "ECH Demographics (minimal)",
  user = "research_team",
  svy = svy,
  steps = get_steps(svy),
  description = paste(
    "Harmonized demographics: sex, age group, relationship,",
    "education level, and geographic area."
  ),
  topic = "demographics"
)
rec
```

The recipe auto-generates documentation from the steps:
Now let’s execute the steps. bake_steps() runs all
pending steps in order and returns the transformed survey:
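A minimal sketch of that call:

```r
# Execute every pending step and keep the transformed survey
svy <- bake_steps(svy)
```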
The data has the new columns with readable labels:
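For example (the column selection mirrors the cross-edition check at the end of this vignette):

```r
head(get_data(svy)[, .(sex, age_group, area)])
```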
The raw variables are gone:
Recipes serialize to JSON for version control and sharing:
The JSON is human-readable and diffable in git:
The same recipe works on any edition. Load the recipe from JSON, attach it to new data, and bake:
```r
rec_loaded <- read_recipe(f)

svy_2024 <- survey_empty(type = "ech", edition = "2024") |>
  set_data(data.table::data.table(
    id = rep(1:30, each = 3),
    nper = rep(1:3, 30),
    pesoano = runif(90, 50, 300),
    e26 = sample(1:2, 90, replace = TRUE),
    e27 = sample(0:90, 90, replace = TRUE),
    e30 = sample(1:7, 90, replace = TRUE),
    e51_2 = sample(c(0:6, -9), 90, replace = TRUE),
    region_4 = sample(1:4, 90, replace = TRUE)
  )) |>
  add_recipe(rec_loaded) |>
  bake_recipes()

head(get_data(svy_2024)[, .(hh_id, person_id, sex, age_group, area)])
```

No code changes needed. The recipe encodes the logic, not the data.
| | Transpiler (`transpile_stata()`) | Hand-crafted recipe |
|---|---|---|
| Speed | Seconds – instant migration | Hours – requires understanding the logic |
| Steps | 80-200 per module (one per STATA line) | 5-20 (one per concept) |
| Variable names | Inherits STATA names (`bc_pe2`, `bc_pe3`) | Your own names (`sex`, `age_group`) |
| Labels | Numeric codes (1, 2, 3) | Readable labels ("Male", "Female") |
| Readability | Faithful to original, verbose | Clean, self-documenting |
| Maintenance | Hard to modify individual steps | Easy to change any mapping |
| DAG visualization | Large, hard to read | Compact, meaningful nodes |
| Best for | Migrating legacy code fast | New projects, critical pipelines |
Recommended workflow: use
transpile_stata() to migrate your existing .do
files immediately so you have a working baseline. Then gradually replace
transpiled recipes with hand-crafted ones as you review each module. The
transpiled version keeps you running; the hand-crafted version is where
you want to end up.
| Manual STATA scripts | metasurvey recipe |
|---|---|
| Copy-paste `.do` files per year | One recipe, any edition |
| Undocumented variable names | Auto-generated input/output docs |
| No dependency tracking | DAG visualization with `view_graph()` |
| Flat scripts, no validation | `validate()` checks required variables |
| Email `.do` files to colleagues | `publish_recipe()` to shared registry |
| Re-run entire script to test | Lazy steps: inspect before baking |
- `depends_on_recipes`
- `certify_recipe()` to mark a recipe as reviewed or official
- `publish_recipe(rec)` uploads to the shared registry, where others can find it with `search_recipes(topic = "demographics")`
- If you have existing `.do` files, use `transpile_stata()` to generate a working baseline immediately – see `vignette("stata-transpiler")` – then refine the output into a hand-crafted recipe like the one in this vignette