AI Model Benchmarks

Verified scores across LLM, image & video models — sourced from official leaderboards. Charts auto-refresh daily via ISR (Incremental Static Regeneration).
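For context, a daily refresh like this is a one-line segment config in the Next.js App Router. A minimal sketch — the route path and data endpoint below are hypothetical, not this site's actual source:

```tsx
// app/benchmarks/page.tsx (hypothetical route)
// `revalidate` is the App Router segment config: Next.js re-generates
// this page in the background at most once per 86,400 s (24 h).
export const revalidate = 86400;

export default async function BenchmarksPage() {
  // Hypothetical data endpoint; the fetch is cached and refreshed
  // together with the page on the ISR schedule above.
  const res = await fetch("https://example.com/api/benchmarks");
  const models: { name: string; arenaElo: number }[] = await res.json();
  return (
    <ul>
      {models.map((m) => (
        <li key={m.name}>
          {m.name}: {m.arenaElo}
        </li>
      ))}
    </ul>
  );
}
```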

Models Tracked: 20
Top Intelligence: Gemini 3.1 Pro (score 57)
Top Arena ELO: Claude Opus 4.6 (ELO 1504)
Open Source: 9 of 20 models
Run Locally: 9 (via Ollama / llama.cpp)
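The locally runnable models can be served through Ollama's local REST API. A minimal query sketch, assuming Ollama is running on its default port and a model tag such as "llama3" has already been pulled (both assumptions, not taken from the stats above):

```ts
// Query a locally served model via Ollama's REST API (default port 11434).
// The model tag "llama3" is a placeholder; substitute any pulled model.
async function askLocalModel(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3", prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
  const data: { response: string } = await res.json();
  return data.response;
}

askLocalModel("Summarise SWE-bench Verified in one sentence.").then(console.log);
```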

Categories: SWE-bench Verified (selected)

% of real GitHub issues resolved by mini-SWE-agent — standardised agentic coding eval. Source: swebench.com.

Providers shown: Anthropic · Google · MiniMax · OpenAI · Z.ai · Moonshot · DeepSeek

Arena ELO vs Intelligence Index

Human preference (Arena) vs composite intelligence (AA Index). Models in the top-right corner excel at both.

Scale: Top tier · Strong · Good · Average · Weak · Poor. For colour, Arena ELO is normalised to 1250–1550, AA Index to 0–60, and SciCode to 0–30.
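The legend maps each metric onto a common 0–1 range before bucketing it into the six tiers. A sketch of one plausible implementation — only the normalisation ranges come from the legend; the even tier cut-offs are an assumption:

```ts
// Normalise a raw score into [0, 1] using the per-metric range from the legend.
const normalise = (x: number, min: number, max: number) =>
  Math.min(1, Math.max(0, (x - min) / (max - min)));

// Ranges from the legend above; tier boundaries are illustrative guesses.
const RANGES = { arenaElo: [1250, 1550], aaIndex: [0, 60], sciCode: [0, 30] } as const;
const TIERS = ["Poor", "Weak", "Average", "Good", "Strong", "Top tier"];

function tier(metric: keyof typeof RANGES, value: number): string {
  const [min, max] = RANGES[metric];
  const t = normalise(value, min, max);
  return TIERS[Math.min(TIERS.length - 1, Math.floor(t * TIERS.length))];
}

console.log(tier("arenaElo", 1504)); // "Top tier" (0.85 of the range)
```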

GPQA Diamond: 198 PhD-level science MCQs (biology, chemistry, physics). Expert human accuracy ≈65 %. Source: Artificial Analysis / model reports.

SWE-bench Verified: % of real GitHub issues resolved by mini-SWE-agent — standardised agentic coding eval. Source: swebench.com official leaderboard.

ARC-AGI 2: Abstract visual grid reasoning. Adversarially designed against neural nets. Human ≈98 %. Source: arcprize.org leaderboard.

Chatbot Arena ELO: Bradley-Terry ELO from 5.4M+ blind human-preference votes on arena.ai. Higher = preferred. Source: arena.ai Text Arena. (A toy rating-fit sketch follows this glossary.)

AA Intelligence Index: Composite of 10 evals (GDPval, Terminal-Bench, SciCode, HLE, GPQA, etc.). Independent. Source: Artificial Analysis.

LiveCodeBench: Contamination-free coding benchmark using problems from competitive programming contests. Continuously updated. Source: livecodebench.github.io.

TerminalBench 2.0: 89 real-world tasks in containerised terminal environments — software eng, ML, security, data science. Source: tbench.ai leaderboard.

τ-Bench (Retail): Tool-agent-user interaction benchmark. Tests policy adherence and tool use in real-world retail scenarios. Source: taubench.com.

SciCode: 338 sub-tasks from 80 real research problems across 16 scientific disciplines. Background-assisted evaluation. Source: scicode-bench.github.io.

Cost per 1M Tokens: Blended average cost for 1M input + output tokens. Lower = more economical. Source: Provider Pricing. (A blending sketch follows this glossary.)

Throughput: Average generation speed in tokens per second. Higher = faster responses. Source: Artificial Analysis.

Humanity's Last Exam: 2,500 PhD-level multidisciplinary questions. Designed to be unsearchable. Source: Epoch AI / Center for AI Safety.

FrontierMath: Research-level mathematical problems that top models previously scored ~0%. Source: Epoch AI.

GDPval (Economic Value): ELO rating based on performance in 44 knowledge-work occupations vs human experts. Source: OpenAI / Artificial Analysis.
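As referenced in the Chatbot Arena entry above, Arena-style ratings come from fitting a Bradley-Terry model to pairwise preference votes. A toy fit using the standard minorise-maximise update — the vote matrix and the Elo-style rescaling constants are illustrative, not Arena's actual pipeline:

```ts
// wins[i][j] = how many blind votes preferred model i over model j.
function bradleyTerryElo(wins: number[][], iters = 200): number[] {
  const n = wins.length;
  let p: number[] = new Array(n).fill(1); // latent strengths, equal at start

  for (let t = 0; t < iters; t++) {
    const next = p.slice();
    for (let i = 0; i < n; i++) {
      let totalWins = 0;
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        totalWins += wins[i][j];
        const games = wins[i][j] + wins[j][i];
        if (games > 0) denom += games / (p[i] + p[j]);
      }
      // MM update; skipped for degenerate rows (a model with no wins).
      if (denom > 0 && totalWins > 0) next[i] = totalWins / denom;
    }
    // Fix the scale: keep the geometric mean of strengths at 1.
    const logMean = next.reduce((s, v) => s + Math.log(v), 0) / n;
    p = next.map((v) => v / Math.exp(logMean));
  }
  // Map strengths onto an Elo-like scale (offset/slope are conventional choices).
  return p.map((v) => Math.round(1000 + 400 * Math.log10(v)));
}

// Toy example: model 0 usually beats 1 and 2; model 2 edges out model 1.
const votes = [
  [0, 70, 60],
  [30, 0, 45],
  [40, 55, 0],
];
console.log(bradleyTerryElo(votes)); // ordering: model 0 > model 2 > model 1
```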
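Likewise, the Cost per 1M Tokens entry blends separate input and output prices into one figure. A minimal sketch, assuming a 3:1 input:output token mix (the mix ratio is an assumption; the glossary only says "blended average"):

```ts
// Blend separate input/output prices (USD per 1M tokens) into one figure.
// inputShare = fraction of tokens that are input; 0.75 models a 3:1 mix.
function blendedCostPer1M(
  inputUsd: number,
  outputUsd: number,
  inputShare = 0.75,
): number {
  return inputUsd * inputShare + outputUsd * (1 - inputShare);
}

// Hypothetical pricing: $3 in / $15 out per 1M tokens → $6 blended.
console.log(blendedCostPer1M(3, 15)); // 6
```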

"—" = not officially benchmarked / not publicly reported for that model. Data verified 16 March 2026. Page auto-refreshes via ISR every 24 hours. SWE-bench and Arena scores are agentic/interactive and may vary by setup.