Case studies · auto-generated

Case studies, asked a thousand ways.

Every entry below is a real calibration run against a cited published source. Each row links to the workflow that produced the prediction, the per-agent transcripts, and the published benchmark used as ground truth.

We do not write these by hand. The auto-publish pipeline (THE FLOW step 8) writes a new entry every time a calibration loop clears the 90% mean-accuracy floor (K=3 samples; see I-4 in the public hard rules).

Calibrating

First batch publishing soon.

The auto-publish pipeline writes one entry per workflow that clears the K=3 accuracy gate (per CLAUDE.md I-4). Until the first batch lands here, the live benchmark ledger is the most granular public proof we ship.

View benchmark ledger