Many Latin American research groups maintain decades of STATA
.do files that process household survey microdata. These
scripts encode institutional knowledge about variable harmonization,
income decomposition, and indicator construction – but they are locked
in a format that is hard to version, share, or integrate with modern R
workflows.
The metasurvey transpiler converts STATA
.do files into metasurvey Recipe objects. This enables:
- data.table-backed Survey object

The transpiler handles the most common STATA patterns found in survey processing scripts: variable generation, conditional replacement, recoding, aggregation, loops, missing-value encoding, and label extraction.

library(metasurvey)
# Transpile a single .do file
result <- transpile_stata("demographics.do")
result$steps[1:3]
#> [1] "step_rename(svy, hh_id = \"id\", person_id = \"nper\")"
#> [2] "step_compute(svy, weight_yr = pesoano)"
#> [3] "step_compute(svy, sex = e26)"

The result is a list with four elements:

| Element | Description |
|---|---|
| steps | Character vector of metasurvey step calls |
| labels | Variable and value labels extracted from the .do file |
| warnings | Any commands that required manual review |
| stats | Counts of translated, skipped, and manual-review commands |

The transpiler works in four passes:
.do file
|
v
[1] parse_do_file() -- tokenize lines into command objects
|
v
[2] translate_commands() -- map each STATA command to metasurvey steps
|
v
[3] optimize_steps() -- consolidate consecutive renames, drops, etc.
|
v
[4] Recipe / JSON -- bundle steps with metadata
parse_do_file() reads a .do file and produces a list of command objects. It handles:

- *, //, and /* */ block comments
- /// and /* */ used as continuation markers
- foreach and forvalues loops, which are unrolled by substituting backtick macros (`var') with each iteration value
- capture, bysort group:, and command abbreviations (g for gen, cap for capture)

Each parsed command is mapped to one or more metasurvey step strings. The following examples show the supported STATA commands and their translations.

Simple variable creation translates to step_compute:
step_compute(svy, sex = q01)
step_compute(svy, is_urban = (region < 3))
step_compute(svy, age_group = -9L)

When a gen includes an if clause, the condition is wrapped in fifelse.
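
To make the semantics of a conditional gen concrete, here is a minimal base R sketch. It is not the transpiler's literal output: the variable names are made up, and base ifelse() is used as a dependency-free stand-in for data.table::fifelse.

```r
# Hypothetical data: age in years, with one missing value
age <- c(12, 30, NA, 70)

# STATA: gen senior = 1 if age >= 65
# Rows where the condition is false receive a missing value.
senior <- ifelse(age >= 65, 1, NA_real_)

# Caveat: STATA treats missing as larger than any number, so
# `age >= 65` is true for a missing age in STATA, while R returns NA
# for that comparison. Conditions on possibly-missing variables are
# worth reviewing after translation.
print(senior)
```
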
The most common pattern in survey .do files is
initializing a variable and then filling it with conditional
replacements:
gen relationship = -9
replace relationship = 1 if q05 == 1
replace relationship = 2 if q05 == 2
replace relationship = 3 if inrange(q05, 3, 5)
replace relationship = 4 if q05 == 6

When all right-hand sides are constants, the transpiler emits a single step_recode:

step_recode(svy, relationship,
  q05 == 1 ~ 1L,
  q05 == 2 ~ 2L,
  q05 >= 3 & q05 <= 5 ~ 3L,
  q05 == 6 ~ 4L,
  .default = -9L)

When any right-hand side is an expression, the transpiler emits a chain of step_compute with fifelse.
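
The chained form can be pictured with a small base R sketch. This is an illustration of the semantics only, under the assumption that each replace becomes one conditional update applied in order; the data are invented, and ifelse() stands in for data.table::fifelse so the snippet has no dependencies.

```r
# Hypothetical data
provider <- c(1, 2, 1, 3)
amount   <- c(100, 250, 80, 40)

# STATA:
#   gen contrib1 = 0
#   replace contrib1 = amount if provider == 1
#   replace contrib1 = contrib1 * 2 if amount > 90
contrib1 <- rep(0, length(provider))

# Each replace keeps the current value wherever its condition is false,
# so later replacements see the result of earlier ones.
contrib1 <- ifelse(provider == 1, amount, contrib1)
contrib1 <- ifelse(amount > 90, contrib1 * 2, contrib1)

print(contrib1)  # 200 0 80 0
```

Because the right-hand sides reference other columns (and the variable itself), the chain cannot be collapsed into a single lookup of constants, which is why the order of the updates matters.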
STATA recode is supported with either parenthesized mappings or inline
syntax:
recode urban_filter (0=2)
recode edu_level (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)
recode var1 var2 var3 .=0

step_compute(svy, urban_filter = data.table::fifelse(
  urban_filter == 0, 2, urban_filter))
step_compute(svy, edu_compat = edu_level)
step_compute(svy, edu_compat = data.table::fifelse(
  edu_compat == 2, 2, edu_compat))
# ... one fifelse per mapping
# Multi-variable recode: one step per variable
step_compute(svy, var1 = data.table::fifelse(is.na(var1), 0, var1))
step_compute(svy, var2 = data.table::fifelse(is.na(var2), 0, var2))
step_compute(svy, var3 = data.table::fifelse(is.na(var3), 0, var3))

egen with a bysort prefix translates to step_compute with a .by argument:

step_compute(svy, hh_income = sum(income, na.rm = TRUE),
  .by = "household")
step_compute(svy, max_age = max(age, na.rm = TRUE),
  .by = "household")

Supported egen functions: sum, max, min, mean, count, sd, median, total, rowtotal, rowmean.
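
What a grouped egen computes can be sketched in base R with ave(), which returns one value per row rather than one per group. The data frame below is invented for illustration, and this is not how metasurvey evaluates .by internally.

```r
# Hypothetical person-level data
households <- data.frame(
  household = c(1, 1, 2, 2, 2),
  income    = c(100, NA, 50, 70, 30),
  age       = c(40, 12, 65, 60, 8)
)

# STATA: bysort household: egen hh_income = sum(income)
# The group total is repeated on every row of the group.
households$hh_income <- ave(
  households$income, households$household,
  FUN = function(x) sum(x, na.rm = TRUE)
)

# STATA: bysort household: egen max_age = max(age)
households$max_age <- ave(
  households$age, households$household,
  FUN = function(x) max(x, na.rm = TRUE)
)

# data.table equivalent of the first statement:
#   dt[, hh_income := sum(income, na.rm = TRUE), by = household]
print(households)
```
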
Loops are expanded during parsing. The transpiler unrolls
foreach with both in lists and
of numlist ranges, including nested loops:
Expands to 4 pairs of gen+replace, each transpiled independently:
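
The unrolling idea can be sketched in a few lines of base R. The helper unroll_foreach() is hypothetical (it is not a metasurvey function), it takes the iteration values already parsed, and it covers only a single, non-nested loop; the example uses a loop over 1/3, which yields three gen+replace pairs.

```r
# Hypothetical helper: substitute the loop macro (`i') into each body line,
# once per iteration value.
unroll_foreach <- function(var, values, body) {
  macro <- paste0("`", var, "'")
  unlist(lapply(values, function(v) {
    gsub(macro, as.character(v), body, fixed = TRUE)
  }))
}

# Body of: foreach i of numlist 1/3 { ... }
body <- c(
  "gen contrib`i' = 0",
  "replace contrib`i' = amount if provider == `i'"
)

unrolled <- unroll_foreach("i", 1:3, body)
print(unrolled)
```

Nested loops would apply the same substitution recursively, innermost loop last, so that each combination of macro values produces its own copy of the body.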
Consecutive renames are consolidated into a single
step_rename call, and consecutive drops are merged into one
step_remove.
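
One way to picture this consolidation is to group consecutive commands that share a verb and then merge each group of renames. The sketch below works on raw command strings and uses a hypothetical merge_renames() helper; the real optimizer operates on parsed command objects and may differ.

```r
# Hypothetical command stream
cmds <- c(
  "rename id hh_id",
  "rename nper person_id",
  "gen sex = q01",
  "drop region",
  "drop q01 q05"
)

# Group consecutive commands with the same verb using run-length encoding
verbs    <- sub(" .*", "", cmds)
runs     <- rle(verbs)
group_id <- rep(seq_along(runs$lengths), runs$lengths)
groups   <- split(cmds, group_id)

# Merge a run of `rename old new` lines into one step_rename call string
merge_renames <- function(lines) {
  parts <- strsplit(lines, " +")
  args  <- vapply(
    parts,
    function(p) sprintf('%s = "%s"', p[3], p[2]),
    character(1)
  )
  sprintf("step_rename(svy, %s)", paste(args, collapse = ", "))
}

rename_step <- merge_renames(groups[[1]])
print(rename_step)
```
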
STATA-specific syntax in expressions is automatically translated:
| STATA | R (data.table) |
|---|---|
| inrange(x, a, b) | (x >= a & x <= b) |
| inlist(x, 1, 2, 3) | (x %in% c(1, 2, 3)) |
| var == . | is.na(var) |
| var != . | !is.na(var) |
| . (as value) | NA |
| string(var) | as.character(var) |
| var[_n-1] | data.table::shift(var, 1, type = "lag") |
| var[_n+1] | data.table::shift(var, 1, type = "lead") |
| _N | .N |

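
A few of these rewrites can be approximated with regular expressions. The function translate_expr() below is a hypothetical sketch, not the package's translator: it covers only inrange, inlist, and the two missing-value comparisons, and it does not handle nested parentheses.

```r
# Hypothetical, simplified expression translator (regex-based)
translate_expr <- function(expr) {
  # var != .  ->  !is.na(var); the lookahead avoids matching numbers like .5
  expr <- gsub("(\\w+)\\s*!=\\s*\\.(?![\\w.])", "!is.na(\\1)", expr, perl = TRUE)
  # var == .  ->  is.na(var)
  expr <- gsub("(\\w+)\\s*==\\s*\\.(?![\\w.])", "is.na(\\1)", expr, perl = TRUE)
  # inrange(x, a, b)  ->  (x >= a & x <= b)
  expr <- gsub(
    "inrange\\(\\s*(\\w+)\\s*,\\s*([^,()]+?)\\s*,\\s*([^,()]+?)\\s*\\)",
    "(\\1 >= \\2 & \\1 <= \\3)", expr, perl = TRUE
  )
  # inlist(x, v1, v2, ...)  ->  (x %in% c(v1, v2, ...))
  expr <- gsub(
    "inlist\\(\\s*(\\w+)\\s*,\\s*([^()]+?)\\s*\\)",
    "(\\1 %in% c(\\2))", expr, perl = TRUE
  )
  expr
}

translated <- translate_expr("inrange(q05, 3, 5) & income != .")
print(translated)

# The translated string is ordinary R and can be evaluated against data
data <- list(q05 = c(1, 4, 5), income = c(10, NA, 20))
keep <- eval(parse(text = translated), data)
print(keep)  # FALSE FALSE TRUE
```
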
STATA allows variable ranges like aux1-aux4 meaning
aux1 aux2 aux3 aux4. The transpiler expands these in
drop, recode, and mvencode
commands:
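
For ranges of the form prefixN-prefixM, the expansion can be sketched as follows. Both helpers are hypothetical. Note the assumption: in STATA a range generally follows the column order of the dataset, which this sketch cannot know, so it handles only the shared-prefix, numeric-suffix case described above and leaves anything else untouched.

```r
# Hypothetical helper: expand "aux1-aux4" into aux1 aux2 aux3 aux4
expand_var_range <- function(token) {
  m <- regmatches(
    token,
    regexec("^([A-Za-z_]+)(\\d+)-([A-Za-z_]+)(\\d+)$", token)
  )[[1]]
  # No match, or the two ends use different prefixes: return unchanged
  if (length(m) == 0 || m[2] != m[4]) {
    return(token)
  }
  paste0(m[2], seq(as.integer(m[3]), as.integer(m[5])))
}

# Expand every token of a space-separated variable list
expand_var_list <- function(varlist) {
  tokens <- strsplit(varlist, " +")[[1]]
  unlist(lapply(tokens, expand_var_range))
}

expanded <- expand_var_list("aux1-aux4 region")
print(expanded)
```
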
Variable and value labels are extracted and stored in the recipe metadata:
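
As an illustration of the extraction step, a label define line can be parsed into a named character vector with base R regular expressions. The helper parse_label_define() and the shape of its result are assumptions made for this sketch; the structure metasurvey actually stores in the recipe metadata may differ.

```r
# Hypothetical helper: parse `lab def name code "text" code "text" ...`
parse_label_define <- function(line) {
  # Each pair is an integer code followed by a double-quoted label
  pairs <- regmatches(line, gregexpr('-?\\d+\\s+"[^"]*"', line))[[1]]
  codes <- as.integer(sub('\\s+".*$', "", pairs))
  texts <- sub('^-?\\d+\\s+"(.*)"$', "\\1", pairs)
  setNames(texts, codes)
}

sex_lbl <- parse_label_define('lab def sex_lbl 1 "Male" 2 "Female"')
print(sex_lbl)
```
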
Commands that do not modify survey data are silently skipped during transpilation. These include:

- use, save, import, export, insheet, outsheet
- tabulate, summarize, describe, list, browse, display
- if/else, while, program, exit
- set, sort, order, compress, format
- global, local, scalar, matrix

The $stats element of the result reports how many commands fell into each category.

The following .do file is a simplified version of a
typical survey demographics module. It is not a real production script,
but it uses the same patterns found in actual ECH processing
pipelines.
Save this as demo_module.do:
* ──────────────────────────────────────────────
* Demographics module -- simplified example
* ──────────────────────────────────────────────
rename id hh_id
rename nper person_id
gen weight_yr = pesoano
gen weight_qt = pesotri
* ── Sex ──
gen sex = q01
* ── Relationship to head ──
g relationship = -9
replace relationship = 1 if q05 == 1
replace relationship = 2 if q05 == 2
replace relationship = 3 if inrange(q05, 3, 5)
replace relationship = 4 if q05 == 6
replace relationship = 5 if q05 == 7
* ── Area ──
gen area = .
replace area = 1 if region == 1
replace area = 2 if region == 2
replace area = 3 if region == 3
* ── Education level (harmonized) ──
recode q20 (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)
* ── Household-level age stats ──
bysort hh_id: egen max_age = max(edad)
bysort hh_id: egen n_members = count(person_id)
* ── Initialize health insurance contributions ──
foreach i of numlist 1/3 {
    gen contrib`i' = 0
    replace contrib`i' = amount if provider == `i'
}
* ── Encode missing values ──
mvencode contrib1 contrib2 contrib3, mv(0)
* ── Clean up ──
drop region q01 q05 q20
* ── Labels ──
lab var sex "Sex"
lab var relationship "Relationship to household head"
lab def sex_lbl 1 "Male" 2 "Female"
lab val sex sex_lbl
lab def rel_lbl 1 "Head" 2 "Spouse" 3 "Child" 4 "Other relative" 5 "Non-relative"
lab val relationship rel_lbl

The R code below writes this file to a temporary location and transpiles it:

library(metasurvey)
# Write the example do-file to a temp location
# Note: STATA macros use backtick-quote (`var') which we build with paste0
bt <- "`" # backtick
sq <- "'" # single quote
do_lines <- c(
"rename id hh_id",
"rename nper person_id",
"gen weight_yr = pesoano",
"gen weight_qt = pesotri",
"gen sex = q01",
"g relationship = -9",
"replace relationship = 1 if q05 == 1",
"replace relationship = 2 if q05 == 2",
"replace relationship = 3 if inrange(q05, 3, 5)",
"replace relationship = 4 if q05 == 6",
"replace relationship = 5 if q05 == 7",
"gen area = .",
"replace area = 1 if region == 1",
"replace area = 2 if region == 2",
"replace area = 3 if region == 3",
"recode q20 (2=2) (3=-9) (4=3) (5=4), gen(edu_compat)",
"bysort hh_id: egen max_age = max(edad)",
"bysort hh_id: egen n_members = count(person_id)",
"foreach i of numlist 1/3 {",
paste0("gen contrib", bt, "i", sq, " = 0"),
paste0("replace contrib", bt, "i", sq, " = amount if provider == ", bt, "i", sq),
"}",
"mvencode contrib1 contrib2 contrib3, mv(0)",
"drop region q01 q05 q20",
'lab var sex "Sex"',
'lab var relationship "Relationship to household head"',
'lab def sex_lbl 1 "Male" 2 "Female"',
"lab val sex sex_lbl",
'lab def rel_lbl 1 "Head" 2 "Spouse" 3 "Child" 4 "Other relative" 5 "Non-relative"',
"lab val relationship rel_lbl"
)
do_file <- tempfile(fileext = ".do")
writeLines(do_lines, do_file)
result <- transpile_stata(do_file)

The resulting steps and labels can then be bundled into a Recipe:

rec <- Recipe$new(
  id = "example_demographics",
  name = "Demographics (transpiled)",
  user = "research_team",
  edition = "2022",
  survey_type = "ech",
  default_engine = "data.table",
  depends_on = character(0),
  description = "Harmonized demographics from STATA transpilation",
  steps = result$steps,
  labels = result$labels
)
# Save as JSON
save_recipe(rec, "demographics_recipe.json")
# Apply to survey data
svy <- survey_empty(type = "ech", edition = "2022") |>
  set_data(my_data) |>
  add_recipe(rec) |>
  bake_recipes()

For projects that organize .do files by year and thematic module, transpile_stata_module() processes an entire year directory and groups the results into separate Recipe objects:

recipes <- transpile_stata_module(
  year_dir = "do_files/2022",
  year = 2022,
  user = "research_team",
  output_dir = "recipes/"
)
names(recipes)
#> [1] "data_prep" "demographics" "income_detail"
#> [4] "income_aggregate" "cleanup"
# Each recipe has inter-module dependencies
recipes$income_detail$depends_on_recipes
#> [1] "ech_2022_data_prep" "ech_2022_demographics"

transpile_coverage() reports how many commands in a .do file (or directory) can be automatically transpiled:

transpile_coverage("do_files/")
#>                       file total translated skipped manual coverage
#> 1   2022/2_correc_datos.do    82         60      22      0   100.00
#> 2 2022/3_compatibiliz...do   420        380      40      0   100.00
#> 3 2022/4_ingreso_ht11...do   310        270      40      0   100.00

The coverage_pct column reports the percentage of data-transforming commands that were translated (excluding skipped non-data commands). A value below 100% means some commands need manual review – look at the $warnings element for details.
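
Read literally, that description corresponds to the arithmetic below. The function name and the formula are inferred from the text, not taken from the package source, so the exact computation in transpile_coverage() may differ.

```r
# Hypothetical reconstruction of the coverage percentage:
# translated commands as a share of data-transforming commands,
# where data-transforming = total - skipped.
coverage_pct <- function(total, translated, skipped) {
  data_cmds <- total - skipped
  ifelse(data_cmds > 0, 100 * translated / data_cmds, NA_real_)
}

# The three files from the output above
full <- coverage_pct(
  total      = c(82, 420, 310),
  translated = c(60, 380, 270),
  skipped    = c(22, 40, 40)
)
print(full)  # 100 100 100

# A made-up file with 10 commands left for manual review
partial <- coverage_pct(total = 100, translated = 70, skipped = 20)
print(partial)  # 87.5
```
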
The transpiler does not handle:

- merge commands (these depend on external files and are translated as comments with # MANUAL_REVIEW)
- collapse / reshape (structural data transformations)
- program definitions
- mata blocks or plugin calls
- /* */ block comments that contain /* */ line continuations internally (rare; only seen in commented-out legacy code)

Commands that fall outside the transpiler’s scope are flagged with # MANUAL_REVIEW in the output and counted in $stats$manual_review.

| Feature | Status |
|---|---|
| gen / generate | Fully supported |
| replace (conditional) | Fully supported |
| gen + replace chains | Auto-grouped into step_recode or step_compute |
| recode (single & multi-var) | Fully supported |
| egen with by-groups | Fully supported |
| foreach / forvalues | Expanded during parsing |
| mvencode | Fully supported |
| destring / tostring | Fully supported |
| rename / drop / keep | Fully supported |
| Variable & value labels | Extracted to recipe metadata |
| STATA expressions | inrange, inlist, missing, lag/lead, _N |
| Variable ranges | Expanded (e.g., var1-var4) |
| Nested loops | Recursive expansion |
| Line continuation (///) | Joined during parsing |
| capture prefix | Handled (errors suppressed) |
| bysort prefix | Converted to .by parameter |