AI Model Benchmarks

Verified scores for 20 frontier models across 5 key benchmarks — sourced from official leaderboards. Charts auto-refresh daily via ISR.

Models Tracked: 20

Top Intelligence: Gemini 3.1 Pro (score 57)

Top Arena ELO: Claude Opus 4.6 (ELO 1504)

Open Source: 9 of 20 models

Run Locally: 9 (via Ollama / llama.cpp)

Categories:

AA Intelligence Index: Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent. Source: Artificial Analysis.

Vendors shown: Google, Anthropic, OpenAI, Z.ai, Moonshot, Alibaba, DeepSeek, MiniMax, xAI, Mistral, Meta.
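For readers curious how a composite like the one above could be assembled, here is a minimal sketch: an equal-weight mean over per-eval scores. Artificial Analysis's actual weighting and normalisation are not documented on this page, so the equal weights, the 0–100 score assumption, and all identifiers below are illustrative assumptions.

```typescript
// Hedged sketch of a composite intelligence index: equal-weight mean over
// per-eval scores. The equal weighting and the 0-100 score range are
// assumptions; Artificial Analysis's real methodology is not shown here.

interface EvalResult {
  name: string;  // e.g. "GPQA", "SciCode"
  score: number; // assumed to be on a 0-100 scale
}

function compositeIndex(results: EvalResult[]): number {
  if (results.length === 0) throw new Error("no eval results");
  const mean = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  return Math.round(mean);
}
```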

Arena ELO vs Intelligence Index

Human preference (Arena) vs composite intelligence (AA Index). Models in the top-right corner excel at both.

Scale: Top tier · Strong · Good · Average · Weak · Poor. Arena ELO normalised to 1250–1550; AA Index to 0–60 for colour.
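As a rough sketch of the colour mapping just described, the snippet below normalises each axis to its stated range and buckets the result into the six tiers. Only the ranges and tier labels come from the page; the evenly spaced bucket boundaries are an assumption.

```typescript
// Map a raw score onto the six colour tiers used above. The stated ranges
// (Arena ELO 1250-1550, AA Index 0-60) come from the page; the evenly
// spaced bucket boundaries are an assumption for illustration.

const TIERS = ["Poor", "Weak", "Average", "Good", "Strong", "Top tier"] as const;

function tierFor(value: number, min: number, max: number): string {
  const t = Math.min(1, Math.max(0, (value - min) / (max - min))); // clamp to [0, 1]
  return TIERS[Math.min(TIERS.length - 1, Math.floor(t * TIERS.length))];
}

tierFor(1504, 1250, 1550); // "Top tier" (Claude Opus 4.6's Arena ELO)
tierFor(57, 0, 60);        // "Top tier" (Gemini 3.1 Pro's AA Index)
```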

GPQA Diamond: 198 PhD-level science MCQs (biology, chemistry, physics). Expert human accuracy ≈65%. Source: Artificial Analysis / model reports.

SWE-bench Verified: Percentage of real GitHub issues resolved using the mini-SWE-agent harness (standardised agentic coding eval). Source: swebench.com official leaderboard.

ARC-AGI 2: Abstract visual grid reasoning, adversarially designed against neural nets. Human ≈98%. Source: arcprize.org leaderboard.

Chatbot Arena ELO: Bradley-Terry ELO fitted to 5.4M+ blind human-preference votes on arena.ai. Higher = preferred. Source: arena.ai Text Arena. A fitting sketch follows after these notes.

AA Intelligence Index: Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent. Source: Artificial Analysis.
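As promised above, here is a minimal sketch of how Elo-style ratings can be fitted from pairwise preference votes using the classic Bradley-Terry MM (Zermelo) update. The 1500 anchor, the 400-point log slope, and all identifiers are assumptions; arena.ai's production pipeline is certainly more involved.

```typescript
// Hedged sketch: fit Bradley-Terry strengths to pairwise win counts with
// the classic MM (Zermelo) update, then map them onto an Elo-like scale.
// All names, the 1500 anchor, and the 400-point log slope are assumptions.

type WinCounts = Map<string, Map<string, number>>; // wins.get(a)?.get(b) = times a beat b

function bradleyTerryElo(models: string[], wins: WinCounts, iters = 200): Map<string, number> {
  const w = (a: string, b: string) => wins.get(a)?.get(b) ?? 0;
  const p = new Map<string, number>(models.map((m) => [m, 1])); // initial strengths

  for (let t = 0; t < iters; t++) {
    for (const i of models) {
      let winsI = 0;
      let denom = 0;
      for (const j of models) {
        if (i === j) continue;
        const games = w(i, j) + w(j, i);
        if (games === 0) continue;
        winsI += w(i, j);
        denom += games / (p.get(i)! + p.get(j)!);
      }
      if (denom > 0) p.set(i, winsI / denom); // MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
    }
  }

  // Anchor the log-strengths so the mean rating is 1500, Elo-style.
  const logs = models.map((m) => Math.log10(p.get(m)!));
  const mean = logs.reduce((a, b) => a + b, 0) / logs.length;
  return new Map<string, number>(models.map((m, k) => [m, Math.round(1500 + 400 * (logs[k] - mean))]));
}
```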

A dash (—) marks scores not officially benchmarked or not publicly reported for that model. Data verified March 2026. Page auto-refreshes via ISR every 24 hours. SWE-bench and Arena scores are agentic/interactive and may vary by setup.
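For reference, the 24-hour ISR refresh mentioned above corresponds to a route-level revalidate window in Next.js. A minimal sketch, assuming the App Router; the file path and data URL are placeholders, not this site's actual source.

```tsx
// app/benchmarks/page.tsx (hypothetical path): Next.js ISR re-generates
// this static page at most once every 24 hours, matching the note above.
export const revalidate = 86400; // seconds

export default async function BenchmarksPage() {
  // This fetch is cached and revalidated on the route's 24h schedule.
  // The URL is a placeholder for whatever leaderboard API feeds the page.
  const res = await fetch("https://example.com/api/benchmark-scores");
  const models: unknown = await res.json();
  return <pre>{JSON.stringify(models, null, 2)}</pre>;
}
```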