Case studies · auto-generated
Case studies, asked a thousand ways.
Every entry below is a real calibration run against a cited published source. Each row links to the workflow that produced the prediction, the per-agent transcripts, and the published benchmark used as ground truth.
We do not write these by hand. The auto-publish pipeline (THE FLOW step 8) writes a new entry every time a calibration loop clears the 90% mean-accuracy floor (K=3 samples; see I-4 in the public hard rules).
Calibrating
First batch publishing soon.
The auto-publish pipeline writes one entry per workflow that clears the K=3 accuracy gate (per CLAUDE.md I-4). Until the first batch lands here, the live benchmark ledger is the most granular public proof we ship.