AI Model Benchmarks
Verified scores for 20 frontier models across 5 key benchmarks — sourced from official leaderboards. Charts auto-refresh daily via ISR.
Models Tracked
20
Top Intelligence
Gemini 3.1 Pro
Score: 57
Top Arena ELO
Claude Opus 4.6
ELO: 1504
Open Source
9
of 20 models
Run Locally
9
via Ollama / llama.cpp
AA Intelligence Index
Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent.
Arena ELO vs Intelligence Index
Human preference (Arena) vs composite intelligence (AA Index). Models in the top-right corner excel at both.
Benchmark Details
GPQA Diamond
198 PhD-level science MCQs (biology, chemistry, physics). Expert human accuracy ≈65 %.
SWE-bench Verified
% of real GitHub issues resolved by mini-SWE-agent — standardised agentic coding eval.
ARC-AGI 2
Abstract visual grid reasoning. Adversarially designed against neural nets. Human ≈98 %.
Chatbot Arena ELO
Bradley-Terry ELO from 5.4M+ blind human-preference votes on arena.ai. Higher = preferred.
AA Intelligence Index
Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent.
Sources
GPQA Diamond: Artificial Analysis / model reports.
SWE-bench Verified: swebench.com official leaderboard.
ARC-AGI 2: arcprize.org leaderboard.
Chatbot Arena ELO: arena.ai Text Arena.
AA Intelligence Index: Artificial Analysis.
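The Arena ELO column is derived from pairwise blind votes. As a rough illustration of how pairwise preferences turn into ratings, here is a minimal Elo-style update in TypeScript; the function names, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, not Arena's actual Bradley-Terry pipeline.

```typescript
type Ratings = Record<string, number>;

// Expected win probability of a rating `ra` player over `rb`
// under the standard logistic (Elo) model.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Apply one blind-vote outcome: `winner` beat `loser`.
// K controls how far a single vote moves the ratings.
function updateElo(r: Ratings, winner: string, loser: string, k = 32): void {
  const ew = expectedScore(r[winner], r[loser]);
  r[winner] += k * (1 - ew);
  r[loser] -= k * (1 - ew);
}

// Start two models at 1000 and record one vote.
const ratings: Ratings = { "model-a": 1000, "model-b": 1000 };
updateElo(ratings, "model-a", "model-b");
console.log(ratings["model-a"] > ratings["model-b"]); // true
```

Arena's published scores fit a Bradley-Terry model over all votes jointly rather than applying sequential updates, but the intuition is the same: each preference vote nudges the winner's rating up and the loser's down.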
"—" indicates a score not officially benchmarked or not publicly reported for that model. Data verified March 2026. The page auto-refreshes via ISR every 24 hours. SWE-bench and Arena scores come from agentic/interactive evaluations and may vary by harness and setup.
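The 24-hour refresh described above maps to Next.js Incremental Static Regeneration. A minimal sketch of how such a page might opt into ISR follows; the route name, data URL, and loader function are hypothetical placeholders, since the page's real data source isn't shown here.

```typescript
// Exporting `revalidate` from a Next.js App Router route segment enables ISR:
// the cached page is re-generated in the background at most once per window.
export const revalidate = 86400; // 24 hours, in seconds

// Hypothetical data loader for the benchmark table. In a real route this
// would be awaited inside the page component during regeneration.
export async function getBenchmarks(): Promise<unknown> {
  const res = await fetch("https://example.com/api/benchmarks");
  if (!res.ok) {
    throw new Error(`benchmark fetch failed: ${res.status}`);
  }
  return res.json();
}
```

With this setup, visitors always receive a statically cached page, and Next.js rebuilds it server-side once the 24-hour window expires, matching the "auto-refreshes via ISR every 24 hours" note.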