Estimation Workflows

What is a Workflow?

After transforming survey data with steps and recipes, the next task is estimation: computing means, totals, ratios, and their standard errors while accounting for the complex survey design.

The workflow() function wraps the estimators from the survey package (svymean, svytotal, svyratio, svyby) and returns tidy results as a data.table that include:

Initial Setup

We use the Academic Performance Index (API) dataset from the survey package, which contains real data from stratified schools in California.

library(metasurvey)
library(survey)
library(data.table)

data(api, package = "survey")
dt <- data.table(apistrat)

svy <- Survey$new(
  data    = dt,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  engine  = "data.table",
  weight  = add_weight(annual = "pw")
)

Basic Estimation

Mean

We estimate the population mean of the API score in the year 2000:

result <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  estimation_type = "annual"
)

result
#>                      stat    value       se         cv confint_lower
#>                    <char>    <num>    <num>      <num>         <num>
#> 1: survey::svymean: api00 662.2874 9.585429 0.01447322      643.5003
#>    confint_upper
#>            <num>
#> 1:      681.0745

Total

We estimate total enrollment across all schools:

result_total <- workflow(
  list(svy),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

result_total
#>                        stat   value       se         cv confint_lower
#>                      <char>   <num>    <num>      <num>         <num>
#> 1: survey::svytotal: enroll 3687178 164532.3 0.04462283       3364700
#>    confint_upper
#>            <num>
#> 1:       4009655

Multiple Estimates at Once

You can pass multiple estimation calls to workflow() to compute them in a single step:

results <- workflow(
  list(svy),
  survey::svymean(~api00, na.rm = TRUE),
  survey::svytotal(~enroll, na.rm = TRUE),
  estimation_type = "annual"
)

results
#>                        stat        value           se         cv confint_lower
#>                      <char>        <num>        <num>      <num>         <num>
#> 1:   survey::svymean: api00     662.2874 9.585429e+00 0.01447322      643.5003
#> 2: survey::svytotal: enroll 3687177.5324 1.645323e+05 0.04462283  3364700.1537
#>    confint_upper
#>            <num>
#> 1:      681.0745
#> 2:  4009654.9112

Domain Estimation

We use survey::svyby() to compute estimates by subpopulations (domains):

# Mean API score by school type
api_by_type <- workflow(
  list(svy),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

api_by_type
#>                              stat  value       se         cv confint_lower
#>                            <char>  <num>    <num>      <num>         <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443      649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309      595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270      604.2559
#>    confint_upper  stype
#>            <num> <fctr>
#> 1:      698.9167      E
#> 2:      655.8874      H
#> 3:      668.9441      M
# Mean enrollment by awards status
enroll_by_award <- workflow(
  list(svy),
  survey::svyby(~enroll, ~awards, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

enroll_by_award
#>                                  stat    value       se         cv
#>                                <char>    <num>    <num>      <num>
#> 1:  survey::svyby: enroll [awards=No] 727.5958 57.54094 0.07908366
#> 2: survey::svyby: enroll [awards=Yes] 520.5114 25.51854 0.04902590
#>    confint_lower confint_upper awards
#>            <num>         <num> <fctr>
#> 1:      614.8177      840.3740     No
#> 2:      470.4960      570.5269    Yes

Quality Assessment

The coefficient of variation (CV) measures the precision of an estimate. You can use evaluate_cv() to classify quality following standard guidelines:

CV Range Quality Recommendation
< 5% Excellent Use without restrictions
5-10% Very good Use with confidence
10-15% Good Use for most purposes
15-25% Acceptable Use with caution
25-35% Poor Only for general trends
>= 35% Unreliable Do not publish
# Evaluate quality of the API score estimate
cv_pct <- results$cv[1] * 100
quality <- evaluate_cv(cv_pct)

cat("CV:", round(cv_pct, 2), "%\n")
#> CV: 1.45 %
cat("Quality:", quality, "\n")
#> Quality: Excellent

RecipeWorkflow: Publishable Estimates

A RecipeWorkflow bundles estimation calls with metadata, making the analysis reproducible and shareable. It records:

Creating a RecipeWorkflow

wf <- RecipeWorkflow$new(
  name = "API Score Analysis 2000",
  description = "Mean API score estimation by school type",
  user = "Research Team",
  survey_type = "api",
  edition = "2000",
  estimation_type = "annual",
  recipe_ids = character(0),
  calls = list(
    "survey::svymean(~api00, na.rm = TRUE)",
    "survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)"
  )
)

wf
#> 
#> ── Workflow: API Score Analysis 2000 ──
#> Author:  Research Team
#> Survey:  api / 2000
#> Version: 1.0.0
#> Description: Mean API score estimation by school type
#> Certification: community
#> Estimation types: annual
#> 
#> ── Calls (2) ──
#>   1. survey::svymean(~api00, na.rm = TRUE)
#>   2. survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE)

Publishing to the Registry

We publish the workflow so that others can discover and reuse it:

# Configure a local backend
wf_path <- tempfile(fileext = ".json")
set_workflow_backend("local", path = wf_path)

# Publish
publish_workflow(wf)

# Discover workflows
all_wf <- list_workflows()
length(all_wf)
#> [1] 1

# Search by text
found <- search_workflows("income")
length(found)
#> [1] 0

# Filter by survey type
ech_wf <- filter_workflows(survey_type = "ech")
length(ech_wf)
#> [1] 0

Finding Workflows Associated with a Recipe

If you have a recipe and want to know which estimates have been published for it, you can use find_workflows_for_recipe():

# Create a workflow that references a recipe
wf2 <- RecipeWorkflow$new(
  name            = "Labor Market Estimates",
  user            = "Team",
  survey_type     = "ech",
  edition         = "2023",
  estimation_type = "annual",
  recipe_ids      = c("labor_force_recipe_001"),
  calls           = list("survey::svymean(~employed, na.rm = TRUE)")
)

publish_workflow(wf2)

# Find all workflows that use this recipe
related <- find_workflows_for_recipe("labor_force_recipe_001")
length(related)
#> [1] 1
if (length(related) > 0) cat("Found:", related[[1]]$name, "\n")
#> Found: Labor Market Estimates

Sharing via the Remote API

For broader dissemination, you can publish workflows to the metasurvey API:

# Requires authentication
api_login("you@example.com", "password")

# Publish
api_publish_workflow(wf)

# Browse
all <- api_list_workflows(survey_type = "ech")
specific <- api_get_workflow("workflow_id_here")

Full Pipeline

Below is a complete pipeline from raw data to publishable estimation, using the API dataset:

# 1. Create survey from real data
dt_full <- data.table(apistrat)

svy_full <- Survey$new(
  data    = dt_full,
  edition = "2000",
  type    = "api",
  psu     = NULL,
  engine  = "data.table",
  weight  = add_weight(annual = "pw")
)

# 2. Apply steps: compute derived variables
svy_full <- step_compute(svy_full,
  api_growth = api00 - api99,
  high_growth = ifelse(api00 - api99 > 50, 1L, 0L),
  comment = "API score growth indicators"
)

svy_full <- step_recode(svy_full, school_level,
  stype == "E" ~ "Elementary",
  stype == "M" ~ "Middle",
  stype == "H" ~ "High",
  .default = "Other",
  comment = "School level classification"
)

# 3. Estimate means
estimates <- workflow(
  list(svy_full),
  survey::svymean(~api_growth, na.rm = TRUE),
  survey::svymean(~high_growth, na.rm = TRUE),
  estimation_type = "annual"
)

estimates
#>                            stat      value        se         cv confint_lower
#>                          <char>      <num>     <num>      <num>         <num>
#> 1:  survey::svymean: api_growth 32.8925184 2.1583789 0.06561914    28.6621734
#> 2: survey::svymean: high_growth  0.2938489 0.0363651 0.12375443     0.2225746
#>    confint_upper
#>            <num>
#> 1:    37.1228633
#> 2:     0.3651232
# 4. Domain estimation (by school type)
by_school <- workflow(
  list(svy_full),
  survey::svyby(~api00, ~stype, survey::svymean, na.rm = TRUE),
  estimation_type = "annual"
)

by_school
#>                              stat  value       se         cv confint_lower
#>                            <char>  <num>    <num>      <num>         <num>
#> 1: survey::svyby: api00 [stype=E] 674.43 12.49343 0.01852443      649.9433
#> 2: survey::svyby: api00 [stype=H] 625.82 15.34078 0.02451309      595.7526
#> 3: survey::svyby: api00 [stype=M] 636.60 16.50239 0.02592270      604.2559
#>    confint_upper  stype
#>            <num> <fctr>
#> 1:      698.9167      E
#> 2:      655.8874      H
#> 3:      668.9441      M
# 5. Assess quality
for (i in seq_len(nrow(estimates))) {
  cv_val <- estimates$cv[i] * 100
  cat(
    estimates$stat[i], ":",
    round(cv_val, 1), "% CV -",
    evaluate_cv(cv_val), "\n"
  )
}
#> survey::svymean: api_growth : 6.6 % CV - Very good 
#> survey::svymean: high_growth : 12.4 % CV - Good

Provenance: Data Lineage

Every Survey object records provenance metadata: where the data came from, which steps were applied, how many rows survived each step, and which versions of R and metasurvey were used. This makes it possible to trace any estimate back to the raw data.

# Provenance is populated automatically after bake_steps()
prov <- provenance(svy_full)
prov
#> ── Data Provenance ─────────────────────────────────────────────────────────────
#> Loaded: 2026-02-20T14:09:33 
#> Initial rows: 200 
#> 
#> Environment:
#>   metasurvey: 0.0.21 
#>   R: 4.4.3 
#>   survey: 4.4.8

Provenance is also attached to workflow() results, so you can always inspect the full lineage of an estimate:

prov_wf <- provenance(estimates)
cat("metasurvey version:", prov_wf$environment$metasurvey_version, "\n")
#> metasurvey version: 0.0.21
cat("Steps applied:", length(prov_wf$steps), "\n")
#> Steps applied: 0

For audit trails, export provenance to JSON:

provenance_to_json(prov, "audit_trail.json")

To compare two runs (e.g., different editions), use provenance_diff():

diff <- provenance_diff(prov_2022, prov_2023)
diff$steps_changed
diff$n_final_changed

Publication-Quality Tables

workflow_table() formats estimation results as publication-ready tables using the gt package. It adds confidence intervals, CV quality classification with color coding, and provenance-based source notes.

workflow_table(estimates)
Survey Estimation Results
Statistic Estimate SE CI Lower CI Upper CV (%) Quality
:svymean: api_growth 32.89 2.158 28.66 37.12 6.6 Very good
:svymean: high_growth 0.29 0.036 0.22 0.37 12.4 Good
metasurvey 0.0.21 | CI: 95% | 2026-02-20

You can customize the output:

# Spanish locale, hide SE, custom title
workflow_table(
  estimates,
  locale = "es",
  show_se = FALSE,
  title = "API Growth Indicators",
  subtitle = "California Schools, 2000"
)
API Growth Indicators
California Schools, 2000
Statistic Estimate CI Lower CI Upper CV (%) Quality
:svymean: api_growth 32,89 28,66 37,12 6,6 Very good
:svymean: high_growth 0,29 0,22 0,37 12,4 Good
metasurvey 0.0.21 | CI: 95% | 2026-02-20

For domain estimates, the table detects group columns automatically:

workflow_table(by_school)
Survey Estimation Results
Statistic stype Estimate SE CI Lower CI Upper CV (%) Quality
:svyby: api00 E 674.43 12.493 649.94 698.92 1.9 Excellent
:svyby: api00 H 625.82 15.341 595.75 655.89 2.5 Excellent
:svyby: api00 M 636.60 16.502 604.26 668.94 2.6 Excellent
metasurvey 0.0.21 | CI: 95% | 2026-02-20

Export to any format supported by gt::gtsave():

tbl <- workflow_table(estimates)
gt::gtsave(tbl, "estimates.html")
gt::gtsave(tbl, "estimates.docx")
gt::gtsave(tbl, "estimates.png")

Next Steps

mirror server hosted at Truenetwork, Russian Federation.