Evals
Unified Reasoning Engine™ evaluation science
Ada's modular reasoning architecture was replaced with dual reasoning, along with a re-base eval harness and a Legitimacy Classifier. Adversarial pass rate rose from 88% to 97%.
Delta evaluation: Production replay pipeline
Delta evaluation (DE) replays production conversations through modified prompts and models, using an LLM-as-judge to compare the results. Verdicts aggregate into win-rate metrics; traces aggregate into themes.
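A minimal sketch of how judge verdicts might roll up into a win-rate metric. Everything here is assumed for illustration and is not the pipeline's actual schema or scoring rule: the `Verdict` record, the `candidate` / `baseline` / `tie` labels, and the choice to count a tie as half a win.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical verdict labels; the real judge output format is not specified.
VALID_WINNERS = {"candidate", "baseline", "tie"}


@dataclass(frozen=True)
class Verdict:
    """One LLM-as-judge decision for a single replayed conversation."""
    conversation_id: str
    winner: str  # one of VALID_WINNERS


def win_rate(verdicts, ties_count_half=True):
    """Return the fraction of replayed conversations won by the candidate.

    Assumption: when ties_count_half is True, a tie counts as half a win,
    a common convention for pairwise comparisons.
    """
    counts = Counter(v.winner for v in verdicts)
    unknown = set(counts) - VALID_WINNERS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    wins = counts["candidate"]
    if ties_count_half:
        wins += 0.5 * counts["tie"]
    return wins / total


# Illustrative data only, not real results.
verdicts = [
    Verdict("c1", "candidate"),
    Verdict("c2", "baseline"),
    Verdict("c3", "candidate"),
    Verdict("c4", "tie"),
]
rate = win_rate(verdicts)
print(f"candidate win rate: {rate:.3f}")  # prints "candidate win rate: 0.625"
```

Counting ties as half a win keeps the metric symmetric: the candidate's and the baseline's win rates sum to 1. Passing `ties_count_half=False` instead reports strict wins only.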