Step 01 · Source
Find a real published study.
Every workflow begins with a peer-reviewed paper, a government survey, or an industry benchmark report. We extract the verbatim question, the cited population, and the published outcome — those three are the ground truth we will calibrate against. The full list of papers we've indexed is on the benchmark page.
Step 02 · Population
Build the population from Census data, not guesses.
We start with the U.S. Census Bureau's American Community Survey (ACS), its Public Use Microdata Sample (PUMS), and data from the Bureau of Labor Statistics (BLS). Each simulated agent is a draw from these distributions (age, household composition, income, ethnicity, geography, education), with the same joint frequencies as the published study's population. No synthetic personas, no LLM-imagined demographics.
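To make the idea of drawing agents from joint frequencies concrete, here is a minimal sketch. It assumes the microdata is already loaded as rows that each carry a survey weight, and it draws whole records rather than sampling each attribute independently, which is what keeps the joint distribution intact. The rows, field names, `draw_agents`, and `in_population` are all hypothetical and are not taken from the actual pipeline.

```python
import random

# Hypothetical microdata rows. Real PUMS records carry many more fields;
# these names and weights are illustrative only.
MICRODATA = [
    {"age_band": "18-34", "income_band": "low", "region": "South", "weight": 120},
    {"age_band": "18-34", "income_band": "high", "region": "West", "weight": 45},
    {"age_band": "35-54", "income_band": "mid", "region": "South", "weight": 200},
    {"age_band": "55+", "income_band": "mid", "region": "Northeast", "weight": 90},
]


def draw_agents(rows, n, in_population=lambda row: True, seed=0):
    """Draw n agents as whole records, weighted by survey weight.

    Picking complete rows (instead of sampling age, income, and region
    separately) preserves the joint frequencies of the source data.
    """
    eligible = [row for row in rows if in_population(row)]
    if not eligible:
        raise ValueError("no rows match the study population")
    rng = random.Random(seed)  # seeded so a draw can be reproduced
    weights = [row["weight"] for row in eligible]
    picks = rng.choices(eligible, weights=weights, k=n)
    # Drop the weight: it describes the survey row, not the agent.
    return [{k: v for k, v in row.items() if k != "weight"} for row in picks]


# Example: a study whose cited population is respondents in the South.
southern_agents = draw_agents(
    MICRODATA, 500, in_population=lambda row: row["region"] == "South", seed=42
)
```

The `in_population` filter stands in for whatever restriction the published study placed on its respondents; in this sketch it is just a predicate over one row.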
Step 03 · Ask the question
Run the published stimulus, unchanged.
The agents see the question the way real respondents saw it. Word order and framing are preserved verbatim. We do not re-prompt or paraphrase the stimulus to nudge accuracy; every run is judged against the literal published wording.
Step 04 · Calibrate
Three independent runs. Take the worst.
Each workflow is run three times against the same population (K=3). For each run we compute per-KPI accuracy from MAPE (mean absolute percentage error), and both the mean and the minimum across the three runs must clear a 90% floor before the workflow is allowed to publish. A single lucky run never makes it through. Variance is real, and the gate is honest about it.
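The gate can be sketched in a few lines. This is an illustration under stated assumptions, not the production check: it assumes accuracy is defined as 1 minus MAPE, that each run's accuracy is averaged across its KPIs, and that published KPI values are nonzero. The function names and the KPI figures are hypothetical.

```python
from statistics import mean


def run_accuracy(simulated, published):
    """Accuracy for one run, assumed here to be 1 - MAPE across KPIs.

    Both arguments map KPI name -> value. Published values must be
    nonzero, since percent error divides by them.
    """
    errors = [
        abs(simulated[kpi] - actual) / abs(actual)
        for kpi, actual in published.items()
    ]
    return 1.0 - mean(errors)


def passes_gate(accuracies, floor=0.90):
    """Require both the mean and the worst run to clear the floor."""
    return mean(accuracies) >= floor and min(accuracies) >= floor


# Hypothetical published outcome and three simulated runs (K=3).
published = {"purchase_intent": 0.40, "awareness": 0.60}
runs = [
    {"purchase_intent": 0.38, "awareness": 0.63},
    {"purchase_intent": 0.42, "awareness": 0.57},
    {"purchase_intent": 0.30, "awareness": 0.60},
]

accuracies = [run_accuracy(run, published) for run in runs]
publishable = passes_gate(accuracies)
```

In this made-up example the mean accuracy clears 90% but the third run does not, so the workflow is held back. Because the mean can never be lower than the minimum, the worst run is the binding condition.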
Step 05 · Publish
Wins and misses, same prominence.
Every workflow that clears the gate gets an auto-generated case study with the per-agent transcripts, segment breakdown, top objections, and round-by-round sentiment. Every workflow that misses gets the same treatment on the benchmark page — we publish the miss, the delta, and the methodology version that produced it. The miss disappears from the customer surface only after a new calibration loop closes it.