Simulix

Methodology · in the clear

Census-grounded. Inspectable. Replayable.

Two hundred strangers don't represent a country. Two hundred thousand Census-grounded agents do. The methodology is the agreement. The benchmark is the audit.

Step 01 · Source

Find a real published study.

Every workflow begins with a peer-reviewed paper, a government survey, or an industry benchmark report. We extract the verbatim question, the cited population, and the published outcome — those three are the ground truth we will calibrate against. The full list of papers we've indexed is on the benchmark page.

Step 02 · Population

Build the population from Census data, not guesses.

We start with the U.S. Census American Community Survey (ACS), its Public Use Microdata Sample (PUMS), and data from the Bureau of Labor Statistics (BLS). Each simulated agent is a draw from these distributions (age, household composition, income, ethnicity, geography, education), weighted to match the joint frequencies of the published study's population. No synthetic personas, no LLM-imagined demographics.
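To make "a draw from these distributions" concrete, here is a minimal sketch in Python. It is not the Simulix pipeline: the `RECORDS` rows, their weights, and the `draw_agents` helper are hypothetical, invented for illustration. The point it shows is that sampling whole weighted records, rather than sampling each attribute separately, keeps the joint frequencies (age with income with geography) intact.

```python
import random
from collections import Counter

# Hypothetical microdata rows: (age_band, income_band, region, weight).
# Real microdata carries many more fields; these values are invented.
RECORDS = [
    ("18-34", "low", "South", 120.0),
    ("18-34", "high", "West", 80.0),
    ("35-64", "low", "Midwest", 150.0),
    ("35-64", "high", "Northeast", 100.0),
    ("65+", "low", "South", 50.0),
]


def draw_agents(records, n, seed=0):
    """Draw n agents; each is one whole record, picked in proportion to its weight."""
    rng = random.Random(seed)  # fixed seed so the draw can be replayed
    weights = [row[3] for row in records]
    picks = rng.choices(records, weights=weights, k=n)
    return [{"age": a, "income": i, "region": g} for a, i, g, _ in picks]


agents = draw_agents(RECORDS, 10_000)
# Joint counts over (age, income); their shares track the record weights.
shares = Counter((agent["age"], agent["income"]) for agent in agents)
```

Because each agent is a complete record, an attribute combination that never occurs in the source data can never appear in the simulated population.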

Step 03 · Ask the question

Run the published stimulus, unchanged.

The agents see the question the way real respondents saw it. Word-order and framing are preserved verbatim. We do not re-prompt or paraphrase the stimulus to nudge accuracy — every run is judged against the literal published wording.

Step 04 · Calibrate

Three independent runs. Take the worst.

Each workflow is run three times against the same population (K=3). For each run we compute per-KPI accuracy from MAPE (mean absolute percent error: the lower the error, the higher the accuracy). Before the workflow is allowed to publish, the mean of the three runs must clear a 90% floor and the worst run must clear the floor minus 2 percentage points. A single lucky run never makes it through. Variance is real and the gate is honest about it.
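The accuracy figure for a single run can be sketched as follows. This is an illustration only. It assumes accuracy is reported as 100 minus the mean absolute percent error across KPIs, which is one common reading and not a confirmed definition. The function names and the KPI values are hypothetical.

```python
def kpi_errors_pct(predicted, published):
    """Absolute percent error for each KPI, keyed by KPI name."""
    errors = {}
    for name, actual in published.items():
        if actual == 0:
            # Percent error is undefined against a published value of zero.
            raise ValueError(f"published value for {name!r} is zero")
        errors[name] = 100.0 * abs(predicted[name] - actual) / abs(actual)
    return errors


def run_accuracy_pct(predicted, published):
    """Assumed definition: 100 minus the mean absolute percent error over KPIs."""
    errors = kpi_errors_pct(predicted, published)
    return 100.0 - sum(errors.values()) / len(errors)


# Invented figures: published outcomes versus one simulated run.
published = {"approve": 40.0, "oppose": 50.0}
predicted = {"approve": 42.0, "oppose": 47.5}
accuracy = run_accuracy_pct(predicted, published)  # each KPI is 5% off, so about 95
```

Repeating this for each of the three runs gives the three accuracy figures that the gate then judges.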

Step 05 · Publish

Wins and misses, same prominence.

Every workflow that clears the gate gets an auto-generated case study with the per-agent transcripts, segment breakdown, top objections, and round-by-round sentiment. Every workflow that misses gets the same treatment on the benchmark page — we publish the miss, the delta, and the methodology version that produced it. The miss disappears from the customer surface only after a new calibration loop closes it.

Accuracy, in three measures.

Every claim about accuracy on this site can be traced back to one of the three rows below. There is no fourth number we are hiding.

Measure | Plain English | Source
--- | --- | ---
MAPE (mean absolute percent error) | Per-KPI distance between predicted and published, expressed as a percent. | Mathematically equivalent to the CLI exporter's _mape_pct.
K-sample gate (K=3) | Mean of three independent runs must clear floor; minimum must clear floor − 2 percentage points. | CLAUDE.md I-4.
Default accuracy floor | 90% per published KPI; configurable per workflow when justified. | docs/launch/LAUNCH_READINESS.md.
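The K-sample gate row above can be read as a small predicate. The sketch below follows that row's wording. The function name `passes_gate`, its parameter names, and the choice to treat "clear" as greater than or equal to are assumptions for illustration.

```python
from statistics import mean


def passes_gate(run_accuracies, floor=90.0, min_slack=2.0, k=3):
    """Mean of the K runs must reach the floor; the worst run must reach floor - min_slack."""
    if len(run_accuracies) != k:
        raise ValueError(f"expected {k} runs, got {len(run_accuracies)}")
    mean_ok = mean(run_accuracies) >= floor
    worst_ok = min(run_accuracies) >= floor - min_slack
    return mean_ok and worst_ok


# Invented accuracy figures for three workflows, in percent.
steady = passes_gate([93.0, 91.0, 88.5])     # mean above 90, worst run above 88
one_bad_run = passes_gate([95.0, 95.0, 87.0])  # mean is fine, worst run is below 88
low_mean = passes_gate([91.0, 89.0, 89.0])   # worst run is fine, mean is below 90
```

Checking the minimum as well as the mean is what stops two strong runs from carrying one poor run past the gate.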

The sources, exactly.

The population layer pulls from public Census data; the question layer comes from the cited paper. Both are linked from every case study.

Source | What we use it for
--- | ---
U.S. Census ACS | Joint demographics: age × income × geography × household composition.
U.S. Census PUMS | Public Use Microdata Sample: anonymised individual records used for finer joint estimates.
BLS | Occupation × wage × industry distributions for workplace and economic scenarios.
Source paper / survey | Published stimulus and the cited target population (defines the weighting on top of ACS/PUMS/BLS).

Known limitations

What this approach is not.

  • Simulix is not a substitute for IRB-reviewed primary research. It is a way to ask one careful question a thousand ways before you commission a panel.
  • Predictions are only as good as the demographic distribution the workflow targets. If a study's population is not representable in ACS/PUMS/BLS, the workflow is rejected at calibration time.
  • LLM behaviour drifts as upstream model versions change. We run K=3 every week against the same gate and re-publish when accuracy moves more than 2 percentage points.
  • We do not simulate non-U.S. populations at launch. EU coverage is on the roadmap; see the changelog.
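The weekly drift check in the list above can be sketched as a comparison against the last published accuracy figure. This is a hypothetical illustration: the text does not say what baseline the 2-point movement is measured from, so the sketch assumes it is the most recently published value, and `republish_weeks` is an invented name.

```python
def republish_weeks(weekly_accuracy, threshold_pp=2.0):
    """Return the week indices whose accuracy moved more than threshold_pp
    from the last published value (assumed baseline)."""
    if not weekly_accuracy:
        return []
    published = weekly_accuracy[0]  # week 0 is treated as the first published figure
    triggered = []
    for week, accuracy in enumerate(weekly_accuracy[1:], start=1):
        if abs(accuracy - published) > threshold_pp:
            triggered.append(week)
            published = accuracy  # the re-published figure becomes the new baseline
    return triggered


# Invented weekly accuracy readings, in percent.
weeks = republish_weeks([93.0, 92.5, 90.5, 91.0, 94.0])
```

With these invented readings, weeks 2 and 4 trigger a re-publish; the half-point moves in weeks 1 and 3 do not.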

The proof

Public benchmark ledger.

Every prediction, every outcome, every miss recorded with the same prominence as every hit.

The receipts

Auto-generated case studies.

One per workflow that clears the gate. Per-agent transcripts, segment breakdowns, round-by-round sentiment.
