Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Ruby)

Tags = [ Ruby, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, and the ReScript benchmark, I ran the same AoC 2025 Days 1–5 puzzles in Ruby.

Same setup as before — the question is whether the leaderboard reshuffles when the target language changes.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

None in the final tally: all 10 models solved all 10 parts correctly on the first attempt.

zai/glm-5 was originally ejected on D1P1 due to persistent HTTP 429 errors from ZAI's API. It was re-run solo after the API stabilized and completed all 10 parts without issues. Its results are included below.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   9s
alibaba/qwen3-coder-next     11s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
anthropic/claude-sonnet-4-6  13s
anthropic/claude-opus-4-6    18s
alibaba/qwen3.5-plus         18s
zai/glm-5                    27s
minimax/MiniMax-M2.5         48s
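
For flavor, here is the kind of Ruby the models were asked to produce. This is my own hypothetical sketch, not a model's solution, and not the real puzzle spec: I'm assuming moves like "R30" / "L15" on a dial with positions 0..99, and that the answer involves tracking the pointer's position.

```ruby
# Hypothetical sketch, NOT the actual puzzle: moves like "R30" / "L15"
# rotate a dial with positions 0..99; we track the final position.
def final_position(moves, size: 100)
  moves.reduce(0) do |pos, move|
    dir = move.start_with?("R") ? 1 : -1
    (pos + dir * move[1..].to_i) % size  # Ruby's % keeps this in 0...size
  end
end
```

Ruby's modulo returning a non-negative result for a positive modulus is what keeps the left-rotation branch honest here.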



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                        Time
anthropic/claude-haiku-4-5   8s
openai-codex/gpt-5.3-codex   10s
anthropic/claude-sonnet-4-6  17s
mistral/devstral-2512        22s
anthropic/claude-opus-4-6    26s
alibaba/qwen3.5-plus         29s
kimi-coding/k2p5             35s
zai/glm-5                    63s
alibaba/qwen3-coder-next     164s
minimax/MiniMax-M2.5         187s
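
The zero-crossing twist is where the slower models burned most of their time. A hedged sketch of what I take the task to be, assuming signed rotation amounts on a 0..99 dial; the real input format isn't shown in this post.

```ruby
# Hypothetical sketch: given signed rotation amounts on a 0..99 dial,
# count how many times the pointer lands on position 0 mid-rotation.
def zero_crossings(deltas, size: 100)
  pos = 0
  count = 0
  deltas.each do |delta|
    step = delta.positive? ? 1 : -1
    delta.abs.times do
      pos = (pos + step) % size
      count += 1 if pos.zero?   # crossing detected one step at a time
    end
  end
  count
end
```

Stepping one tick at a time is deliberately naive; it sidesteps the off-by-one traps of computing crossings arithmetically.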



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                        Time
mistral/devstral-2512        10s
alibaba/qwen3-coder-next     10s
openai-codex/gpt-5.3-codex   13s
kimi-coding/k2p5             14s
anthropic/claude-haiku-4-5   17s
anthropic/claude-sonnet-4-6  21s
alibaba/qwen3.5-plus         23s
zai/glm-5                    25s
anthropic/claude-opus-4-6    27s
minimax/MiniMax-M2.5         49s
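
If "repeated-digit IDs" means IDs whose digit string is some block repeated exactly twice (my assumption from the title, not the official spec), the brute-force Ruby is short:

```ruby
# Hypothetical reading: within each inclusive ID range, sum the IDs
# whose decimal digits are a block repeated exactly twice
# (e.g. 55, 1212, 123123). The input format is assumed.
def doubled?(n)
  s = n.to_s
  s.length.even? && s[0, s.length / 2] == s[s.length / 2..]
end

def sum_doubled(ranges)
  ranges.sum { |lo, hi| (lo..hi).select { |n| doubled?(n) }.sum }
end
```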



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                        Time
mistral/devstral-2512        7s
anthropic/claude-haiku-4-5   11s
kimi-coding/k2p5             13s
alibaba/qwen3.5-plus         17s
openai-codex/gpt-5.3-codex   20s
anthropic/claude-opus-4-6    25s
zai/glm-5                    26s
anthropic/claude-sonnet-4-6  27s
alibaba/qwen3-coder-next     28s
minimax/MiniMax-M2.5         31s
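
Assuming Part 2 asks for IDs whose digit string is some block repeated two or more times (again my reading of the title), the check is a short predicate:

```ruby
# Hypothetical generalization: an ID qualifies when its digit string
# is some block repeated two or more times ("777", "1212", "123123123").
def repeated_pattern?(n)
  s = n.to_s
  (1..s.length / 2).any? do |width|
    (s.length % width).zero? && s == s[0, width] * (s.length / width)
  end
end
```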



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                        Time
openai-codex/gpt-5.3-codex   11s
anthropic/claude-haiku-4-5   14s
anthropic/claude-sonnet-4-6  17s
anthropic/claude-opus-4-6    24s
alibaba/qwen3.5-plus         27s
alibaba/qwen3-coder-next     28s
zai/glm-5                    34s
kimi-coding/k2p5             36s
mistral/devstral-2512        48s
minimax/MiniMax-M2.5         165s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                        Time
openai-codex/gpt-5.3-codex   11s
mistral/devstral-2512        11s
anthropic/claude-haiku-4-5   13s
anthropic/claude-sonnet-4-6  14s
kimi-coding/k2p5             15s
anthropic/claude-opus-4-6    19s
zai/glm-5                    20s
alibaba/qwen3-coder-next     23s
alibaba/qwen3.5-plus         24s
minimax/MiniMax-M2.5         36s
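
The 12-digit version is where a naive search would blow up. If the task is, as the title suggests, picking k digits in order to form the largest number, the standard greedy stack solves it in linear time. The function name and I/O shape here are mine, not the puzzle's:

```ruby
# Hypothetical sketch: choose k digits from the string, preserving
# order, to form the largest k-digit number. Greedy with a stack:
# pop a smaller digit while enough digits remain to still reach k.
def max_joltage(digits, k)
  stack = []
  digits.each_char.with_index do |d, i|
    while !stack.empty? && stack.last < d &&
          stack.length + (digits.length - i) > k
      stack.pop
    end
    stack << d if stack.length < k
  end
  stack.join
end
```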



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                        Time
mistral/devstral-2512        10s
anthropic/claude-haiku-4-5   11s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   15s
kimi-coding/k2p5             15s
anthropic/claude-opus-4-6    19s
alibaba/qwen3.5-plus         22s
alibaba/qwen3-coder-next     25s
minimax/MiniMax-M2.5         29s
zai/glm-5                    32s



Day 4 Part 2 — Iterative grid removal simulation

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   12s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   14s
kimi-coding/k2p5             14s
anthropic/claude-opus-4-6    16s
alibaba/qwen3.5-plus         19s
zai/glm-5                    25s
minimax/MiniMax-M2.5         27s
alibaba/qwen3-coder-next     35s
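
The removal simulation, sketched under assumed rules: '@' marks a roll, a roll is accessible when fewer than four of its eight neighbors are rolls, and each round removes every accessible roll at once. The threshold, the grid encoding, and the 8-neighborhood are all guesses on my part.

```ruby
# Hypothetical rules: '@' is a roll; a roll is "accessible" when fewer
# than 4 of its 8 neighbors are rolls. Each round removes all accessible
# rolls simultaneously; repeat until stable and count the removals.
OFFSETS = [-1, 0, 1].product([-1, 0, 1]) - [[0, 0]]

def total_removed(grid)
  rows = grid.map(&:dup)
  removed = 0
  loop do
    batch = []
    rows.each_with_index do |row, r|
      row.each_char.with_index do |ch, c|
        next unless ch == '@'
        occupied = OFFSETS.count do |dr, dc|
          nr, nc = r + dr, c + dc
          nr.between?(0, rows.size - 1) &&
            nc.between?(0, row.size - 1) && rows[nr][nc] == '@'
        end
        batch << [r, c] if occupied < 4
      end
    end
    break if batch.empty?
    batch.each { |r, c| rows[r][c] = '.' }  # remove the whole batch at once
    removed += batch.size
  end
  removed
end
```

Collecting the batch before mutating the grid is the part models tend to get wrong in simulations like this; removing in-place mid-scan changes neighbor counts within the same round.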



Day 5 Part 1 — Range membership checking

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
alibaba/qwen3-coder-next     13s
anthropic/claude-sonnet-4-6  14s
anthropic/claude-opus-4-6    15s
alibaba/qwen3.5-plus         16s
zai/glm-5                    28s
minimax/MiniMax-M2.5         34s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                        Time
mistral/devstral-2512        6s
anthropic/claude-haiku-4-5   8s
anthropic/claude-sonnet-4-6  12s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
anthropic/claude-opus-4-6    15s
zai/glm-5                    29s
minimax/MiniMax-M2.5         34s
alibaba/qwen3.5-plus         36s
alibaba/qwen3-coder-next     39s
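
Counting IDs covered by overlapping ranges smells like the classic interval-union count. A hedged Ruby sketch, assuming inclusive integer ranges as input; the actual input parsing is omitted:

```ruby
# Hypothetical sketch: count the distinct IDs covered by the union of
# inclusive integer ranges, by sorting and merging overlaps.
def covered_count(ranges)
  total = 0
  cur_lo = cur_hi = nil
  ranges.sort.each do |lo, hi|
    if cur_hi && lo <= cur_hi + 1          # overlaps or touches current run
      cur_hi = hi if hi > cur_hi
    else
      total += cur_hi - cur_lo + 1 if cur_hi
      cur_lo, cur_hi = lo, hi
    end
  end
  total + (cur_hi ? cur_hi - cur_lo + 1 : 0)
end
```

Merging before counting avoids double-counting IDs that appear in several ranges, which is presumably the trap in the naive sum-of-lengths approach.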

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 9s 8s 17s 11s 14s 13s 11s 12s 10s 8s 113s
openai-codex/gpt-5.3-codex 12s 10s 13s 20s 11s 11s 15s 14s 12s 12s 130s
mistral/devstral-2512 8s 22s 10s 7s 48s 11s 10s 8s 8s 6s 138s
anthropic/claude-sonnet-4-6 13s 17s 21s 27s 17s 14s 14s 14s 14s 12s 163s
kimi-coding/k2p5 12s 35s 14s 13s 36s 15s 15s 14s 12s 12s 178s
anthropic/claude-opus-4-6 18s 26s 27s 25s 24s 19s 19s 16s 15s 15s 204s
alibaba/qwen3.5-plus 18s 29s 23s 17s 27s 24s 22s 19s 16s 36s 231s
zai/glm-5 27s 63s 25s 26s 34s 20s 32s 25s 28s 29s 309s
alibaba/qwen3-coder-next 11s 164s 10s 28s 28s 23s 25s 35s 13s 39s 376s
minimax/MiniMax-M2.5 48s 187s 49s 31s 165s 36s 29s 27s 34s 34s 640s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 319 382 367 377 337 396 437 671 330 379 3,995
kimi-coding/k2p5 408 837 522 621 406 532 542 583 398 444 5,293
zai/glm-5 543 1,291 485 507 605 379 542 498 464 474 5,788
anthropic/claude-sonnet-4-6 598 823 1,049 1,197 658 743 689 698 577 590 7,622
anthropic/claude-opus-4-6 565 1,221 1,349 1,405 958 828 732 725 566 572 8,921
anthropic/claude-haiku-4-5 915 792 1,384 1,000 1,295 906 903 837 813 755 9,600
mistral/devstral-2512 538 3,040 608 510 4,973 730 600 651 428 497 12,575
alibaba/qwen3-coder-next 688 6,720 744 719 766 768 802 1,135 707 1,227 14,276
alibaba/qwen3.5-plus 1,364 3,504 2,031 1,031 2,150 1,980 1,266 1,121 1,117 1,588 17,152
minimax/MiniMax-M2.5 822 9,448 1,390 868 6,678 1,102 809 820 788 1,189 23,914

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0028 .0047 .0031 .0036 .0025 .0032 .0097 .0085 .0024 .0028 $0.04
mistral/devstral-2512 .0063 .0217 .0069 .0065 .0308 .0206 .0044 .0094 .0061 .0080 $0.12
alibaba/qwen3.5-plus .0108 .0194 .0129 .0145 .0125 .0160 .0101 .0146 .0093 .0188 $0.14
zai/glm-5 .0177 .0227 .0056 .0066 .0071 .0063 .0225 .0180 .0295 .0221 $0.16
openai-codex/gpt-5.3-codex .0146 .0254 .0162 .0120 .0131 .0148 .0356 .0262 .0138 .0151 $0.19
anthropic/claude-haiku-4-5 .0228 .0122 .0331 .0106 .0333 .0140 .0317 .0164 .0261 .0155 $0.22
minimax/MiniMax-M2.5 .0138 .0945 .0080 .0147 .0443 .0251 .0034 .0058 .0159 .0207 $0.25
anthropic/claude-sonnet-4-6 .0325 .0257 .0406 .0331 .0325 .0235 .0330 .0252 .0302 .0192 $0.30
alibaba/qwen3-coder-next .0243 .1128 .0081 .0117 .0502 .0599 .0258 .0409 .0506 .0945 $0.48
anthropic/claude-opus-4-6 .1078 .0927 .1421 .0628 .1359 .0827 .1080 .0786 .0985 .1340 $0.94

Observations

All 10 models solved all 10 parts correctly on the first attempt — matching Python's clean sweep.

claude-haiku-4-5 — fastest overall at 113s. Fastest or near-fastest on 7 of 10 parts.

devstral-2512 — fastest or tied-fastest on more individual parts than any other model, with five finishes under 10 seconds, but a 48-second D3P1 spike pushes its total to 138s. The token data explains the spike: 4,973 output tokens on D3P1 vs. a 428–651 range on most other parts.

gpt-5.3-codex — fewest tokens: 3,995 total, under 400 per part on average.

kimi-coding/k2p5 — cheapest at $0.04 for all 10 parts. Fifth in speed (178s), second in token count (5,293).

claude-opus-4-6 — $0.94 total, the most expensive at ~$0.09 per part. 204s total, 8,921 tokens.

qwen3-coder-next — 164 seconds on D1P2, with 6,720 output tokens on that single part. Every other part was 10–39s.

minimax/MiniMax-M2.5 — 640s total, slowest but correct on every part across all benchmarks so far.

zai/glm-5 completed all 10 parts in a solo re-run (309s total). Originally ejected due to API 429 errors, it was re-run after ZAI's service stabilized. 5,788 tokens and $0.16 total — mid-pack on speed, but third in token efficiency behind codex and k2p5.

Cross-language comparison

With five benchmarks now complete, some patterns are emerging:

Language          Models completing all 10 parts
Python            10/10
Ruby              10/10
Haskell           7/11
OCaml             5/9
ReScript (run 2)  2/10

Completion rates may correlate with how widely each language is represented in public codebases.

The speed rankings shift across languages. haiku is fastest in Ruby (113s) and OCaml (124s). devstral was fastest in Python (205s) but was ejected in Haskell and OCaml. opus is one of only two models that completed the ReScript benchmark.

What's next

Both scripting-language benchmarks now show the same pattern: every model passes, and the differences are mainly in speed, cost, and token efficiency. The next runs will test whether less mainstream target languages keep reshuffling the leaderboard.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.