AI Model Benchmarks
Verified scores across LLM, Image & Video models — sourced from official leaderboards. Charts auto-refresh daily via ISR.
Models Tracked
20
Top Intelligence
Gemini 3.1 Pro
Score: 57
Top Arena ELO
Claude Opus 4.6
ELO: 1504
Open Source
9
of 20 models
Run Locally
9
via Ollama / llama.cpp
SWE-bench Verified
% of real GitHub issues resolved by mini-SWE-agent — standardised agentic coding eval.
Arena ELO vs Intelligence Index
Human preference (Arena) vs composite intelligence (AA Index). Models in the top-right corner excel at both.
GPQA Diamond: 198 PhD-level science MCQs (biology, chemistry, physics). Expert human accuracy ≈65 %. Source: Artificial Analysis / model reports.
SWE-bench Verified: % of real GitHub issues resolved by mini-SWE-agent — standardised agentic coding eval. Source: swebench.com official leaderboard.
ARC-AGI 2: Abstract visual grid reasoning. Adversarially designed against neural nets. Human ≈98 %. Source: arcprize.org leaderboard.
Chatbot Arena ELO: Bradley-Terry ELO from 5.4M+ blind human-preference votes on arena.ai. Higher = preferred. Source: arena.ai Text Arena.
AA Intelligence Index: Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent. Source: Artificial Analysis.
LiveCodeBench: Contamination-free coding benchmark using problems from competitive programming contests. Continuously updated. Source: livecodebench.github.io.
TerminalBench 2.0: 89 real-world tasks in containerised terminal environments — software eng, ML, security, data science. Source: tbench.ai leaderboard.
τ-Bench (Retail): Tool-agent-user interaction benchmark. Tests policy adherence and tool use in real-world retail scenarios. Source: taubench.com.
SciCode: 338 sub-tasks from 80 real research problems across 16 scientific disciplines. Background-assisted evaluation. Source: scicode-bench.github.io.
Cost per 1M Tokens: Blended average cost for 1M input + output tokens. Lower = more economical. Source: Provider Pricing.
Throughput: Average generation speed in tokens per second. Higher = faster responses. Source: Artificial Analysis.
Humanity's Last Exam: 2,500 PhD-level multidisciplinary questions. Designed to be unsearchable. Source: Epoch AI / Center for AI Safety.
FrontierMath: Research-level mathematical problems on which top models previously scored ~0%. Source: Epoch AI.
GDPval (Economic Value): ELO rating based on performance in 44 knowledge-work occupations vs human experts. Source: OpenAI / Artificial Analysis.
A dash (—) means the score is not officially benchmarked or not publicly reported for that model. Data verified 16 March 2026. Page auto-refreshes via ISR every 24 hours. SWE-bench and Arena scores are agentic/interactive and may vary by setup.
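The daily refresh described above maps to a time-based revalidation setting. Assuming the site is a Next.js App Router deployment (the framework is not named on the page, only "ISR"), the 24-hour interval would be a one-line route segment config:

```typescript
// Next.js App Router route segment config (assumption: this site uses Next.js).
// Regenerate the static page in the background at most once every 24 hours.
export const revalidate = 86400; // seconds: 24 * 60 * 60
```

With this setting, visitors always get a cached static page, and the first request after the window expires triggers a background rebuild with fresh benchmark data.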