Methodology
Benchmarks are run weekly on standardized hardware (AWS c7g.xlarge, 4 vCPU, 8GB RAM, Ubuntu 24.04).
Token efficiency is measured using equivalent prompts across Claude 4.5 Sonnet. All results are averages
of 10 runs with cold cache. Last updated: Feb 17, 2026.