Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Ruby)

Tags = [ Ruby, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, and the ReScript benchmark, I ran the same AoC 2025 Days 1–5 puzzles in Ruby.

Same setup as before — the question is whether the leaderboard reshuffles when the target language changes.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

None in the final tally: all 10 models solved all 10 parts correctly on the first attempt.

zai/glm-5 was originally ejected on D1P1 due to persistent HTTP 429 errors from ZAI's API. It was re-run solo after the API stabilized and completed all 10 parts without issues. Its results are included below.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   9s
alibaba/qwen3-coder-next     11s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
anthropic/claude-sonnet-4-6  13s
anthropic/claude-opus-4-6    18s
alibaba/qwen3.5-plus         18s
zai/glm-5                    27s
minimax/MiniMax-M2.5         48s
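
For flavor, here is the kind of Ruby the models were asked to produce. This is my own hypothetical sketch, not a model's solution, and not the real puzzle spec: I'm assuming moves like "R30" / "L15" on a dial with positions 0..99, and that the answer involves tracking the pointer's position.

```ruby
# Hypothetical sketch, NOT the actual puzzle: moves like "R30" / "L15"
# rotate a dial with positions 0..99; we track the final position.
def final_position(moves, size: 100)
  moves.reduce(0) do |pos, move|
    dir = move.start_with?("R") ? 1 : -1
    (pos + dir * move[1..].to_i) % size  # Ruby's % keeps this in 0...size
  end
end
```

Ruby's modulo returning a non-negative result for a positive modulus is what keeps the left-rotation branch honest here.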



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                        Time
anthropic/claude-haiku-4-5   8s
openai-codex/gpt-5.3-codex   10s
anthropic/claude-sonnet-4-6  17s
mistral/devstral-2512        22s
anthropic/claude-opus-4-6    26s
alibaba/qwen3.5-plus         29s
kimi-coding/k2p5             35s
zai/glm-5                    63s
alibaba/qwen3-coder-next     164s
minimax/MiniMax-M2.5         187s
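
The zero-crossing twist is where the slower models burned most of their time. A hedged sketch of what I take the task to be, assuming signed rotation amounts on a 0..99 dial; the real input format isn't shown in this post.

```ruby
# Hypothetical sketch: given signed rotation amounts on a 0..99 dial,
# count how many times the pointer lands on position 0 mid-rotation.
def zero_crossings(deltas, size: 100)
  pos = 0
  count = 0
  deltas.each do |delta|
    step = delta.positive? ? 1 : -1
    delta.abs.times do
      pos = (pos + step) % size
      count += 1 if pos.zero?   # crossing detected one step at a time
    end
  end
  count
end
```

Stepping one tick at a time is deliberately naive; it sidesteps the off-by-one traps of computing crossings arithmetically.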



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                        Time
mistral/devstral-2512        10s
alibaba/qwen3-coder-next     10s
openai-codex/gpt-5.3-codex   13s
kimi-coding/k2p5             14s
anthropic/claude-haiku-4-5   17s
anthropic/claude-sonnet-4-6  21s
alibaba/qwen3.5-plus         23s
zai/glm-5                    25s
anthropic/claude-opus-4-6    27s
minimax/MiniMax-M2.5         49s
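
If "repeated-digit IDs" means IDs whose digit string is some block repeated exactly twice (my assumption from the title, not the official spec), the brute-force Ruby is short:

```ruby
# Hypothetical reading: within each inclusive ID range, sum the IDs
# whose decimal digits are a block repeated exactly twice
# (e.g. 55, 1212, 123123). The input format is assumed.
def doubled?(n)
  s = n.to_s
  s.length.even? && s[0, s.length / 2] == s[s.length / 2..]
end

def sum_doubled(ranges)
  ranges.sum { |lo, hi| (lo..hi).select { |n| doubled?(n) }.sum }
end
```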



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                        Time
mistral/devstral-2512        7s
anthropic/claude-haiku-4-5   11s
kimi-coding/k2p5             13s
alibaba/qwen3.5-plus         17s
openai-codex/gpt-5.3-codex   20s
anthropic/claude-opus-4-6    25s
zai/glm-5                    26s
anthropic/claude-sonnet-4-6  27s
alibaba/qwen3-coder-next     28s
minimax/MiniMax-M2.5         31s
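
Assuming Part 2 asks for IDs whose digit string is some block repeated two or more times (again my reading of the title), the check is a short predicate:

```ruby
# Hypothetical generalization: an ID qualifies when its digit string
# is some block repeated two or more times ("777", "1212", "123123123").
def repeated_pattern?(n)
  s = n.to_s
  (1..s.length / 2).any? do |width|
    (s.length % width).zero? && s == s[0, width] * (s.length / width)
  end
end
```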



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                        Time
openai-codex/gpt-5.3-codex   11s
anthropic/claude-haiku-4-5   14s
anthropic/claude-sonnet-4-6  17s
anthropic/claude-opus-4-6    24s
alibaba/qwen3.5-plus         27s
alibaba/qwen3-coder-next     28s
zai/glm-5                    34s
kimi-coding/k2p5             36s
mistral/devstral-2512        48s
minimax/MiniMax-M2.5         165s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                        Time
openai-codex/gpt-5.3-codex   11s
mistral/devstral-2512        11s
anthropic/claude-haiku-4-5   13s
anthropic/claude-sonnet-4-6  14s
kimi-coding/k2p5             15s
anthropic/claude-opus-4-6    19s
zai/glm-5                    20s
alibaba/qwen3-coder-next     23s
alibaba/qwen3.5-plus         24s
minimax/MiniMax-M2.5         36s
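
The 12-digit version is where a naive search would blow up. If the task is, as the title suggests, picking k digits in order to form the largest number, the standard greedy stack solves it in linear time. The function name and I/O shape here are mine, not the puzzle's:

```ruby
# Hypothetical sketch: choose k digits from the string, preserving
# order, to form the largest k-digit number. Greedy with a stack:
# pop a smaller digit while enough digits remain to still reach k.
def max_joltage(digits, k)
  stack = []
  digits.each_char.with_index do |d, i|
    while !stack.empty? && stack.last < d &&
          stack.length + (digits.length - i) > k
      stack.pop
    end
    stack << d if stack.length < k
  end
  stack.join
end
```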



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                        Time
mistral/devstral-2512        10s
anthropic/claude-haiku-4-5   11s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   15s
kimi-coding/k2p5             15s
anthropic/claude-opus-4-6    19s
alibaba/qwen3.5-plus         22s
alibaba/qwen3-coder-next     25s
minimax/MiniMax-M2.5         29s
zai/glm-5                    32s



Day 4 Part 2 — Iterative grid removal simulation

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   12s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   14s
kimi-coding/k2p5             14s
anthropic/claude-opus-4-6    16s
alibaba/qwen3.5-plus         19s
zai/glm-5                    25s
minimax/MiniMax-M2.5         27s
alibaba/qwen3-coder-next     35s
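
The removal simulation, sketched under assumed rules: '@' marks a roll, a roll is accessible when fewer than four of its eight neighbors are rolls, and each round removes every accessible roll at once. The threshold, the grid encoding, and the 8-neighborhood are all guesses on my part.

```ruby
# Hypothetical rules: '@' is a roll; a roll is "accessible" when fewer
# than 4 of its 8 neighbors are rolls. Each round removes all accessible
# rolls simultaneously; repeat until stable and count the removals.
OFFSETS = [-1, 0, 1].product([-1, 0, 1]) - [[0, 0]]

def total_removed(grid)
  rows = grid.map(&:dup)
  removed = 0
  loop do
    batch = []
    rows.each_with_index do |row, r|
      row.each_char.with_index do |ch, c|
        next unless ch == '@'
        occupied = OFFSETS.count do |dr, dc|
          nr, nc = r + dr, c + dc
          nr.between?(0, rows.size - 1) &&
            nc.between?(0, row.size - 1) && rows[nr][nc] == '@'
        end
        batch << [r, c] if occupied < 4
      end
    end
    break if batch.empty?
    batch.each { |r, c| rows[r][c] = '.' }  # remove the whole batch at once
    removed += batch.size
  end
  removed
end
```

Collecting the batch before mutating the grid is the part models tend to get wrong in simulations like this; removing in-place mid-scan changes neighbor counts within the same round.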



Day 5 Part 1 — Range membership checking

Model                        Time
mistral/devstral-2512        8s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
alibaba/qwen3-coder-next     13s
anthropic/claude-sonnet-4-6  14s
anthropic/claude-opus-4-6    15s
alibaba/qwen3.5-plus         16s
zai/glm-5                    28s
minimax/MiniMax-M2.5         34s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                        Time
mistral/devstral-2512        6s
anthropic/claude-haiku-4-5   8s
anthropic/claude-sonnet-4-6  12s
kimi-coding/k2p5             12s
openai-codex/gpt-5.3-codex   12s
anthropic/claude-opus-4-6    15s
zai/glm-5                    29s
minimax/MiniMax-M2.5         34s
alibaba/qwen3.5-plus         36s
alibaba/qwen3-coder-next     39s
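
Counting IDs covered by overlapping ranges smells like the classic interval-union count. A hedged Ruby sketch, assuming inclusive integer ranges as input; the actual input parsing is omitted:

```ruby
# Hypothetical sketch: count the distinct IDs covered by the union of
# inclusive integer ranges, by sorting and merging overlaps.
def covered_count(ranges)
  total = 0
  cur_lo = cur_hi = nil
  ranges.sort.each do |lo, hi|
    if cur_hi && lo <= cur_hi + 1          # overlaps or touches current run
      cur_hi = hi if hi > cur_hi
    else
      total += cur_hi - cur_lo + 1 if cur_hi
      cur_lo, cur_hi = lo, hi
    end
  end
  total + (cur_hi ? cur_hi - cur_lo + 1 : 0)
end
```

Merging before counting avoids double-counting IDs that appear in several ranges, which is presumably the trap in the naive sum-of-lengths approach.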

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 9s 8s 17s 11s 14s 13s 11s 12s 10s 8s 113s
openai-codex/gpt-5.3-codex 12s 10s 13s 20s 11s 11s 15s 14s 12s 12s 130s
mistral/devstral-2512 8s 22s 10s 7s 48s 11s 10s 8s 8s 6s 138s
anthropic/claude-sonnet-4-6 13s 17s 21s 27s 17s 14s 14s 14s 14s 12s 163s
kimi-coding/k2p5 12s 35s 14s 13s 36s 15s 15s 14s 12s 12s 178s
anthropic/claude-opus-4-6 18s 26s 27s 25s 24s 19s 19s 16s 15s 15s 204s
alibaba/qwen3.5-plus 18s 29s 23s 17s 27s 24s 22s 19s 16s 36s 231s
zai/glm-5 27s 63s 25s 26s 34s 20s 32s 25s 28s 29s 309s
alibaba/qwen3-coder-next 11s 164s 10s 28s 28s 23s 25s 35s 13s 39s 376s
minimax/MiniMax-M2.5 48s 187s 49s 31s 165s 36s 29s 27s 34s 34s 640s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 319 382 367 377 337 396 437 671 330 379 3,995
kimi-coding/k2p5 408 837 522 621 406 532 542 583 398 444 5,293
zai/glm-5 543 1,291 485 507 605 379 542 498 464 474 5,788
anthropic/claude-sonnet-4-6 598 823 1,049 1,197 658 743 689 698 577 590 7,622
anthropic/claude-opus-4-6 565 1,221 1,349 1,405 958 828 732 725 566 572 8,921
anthropic/claude-haiku-4-5 915 792 1,384 1,000 1,295 906 903 837 813 755 9,600
mistral/devstral-2512 538 3,040 608 510 4,973 730 600 651 428 497 12,575
alibaba/qwen3-coder-next 688 6,720 744 719 766 768 802 1,135 707 1,227 14,276
alibaba/qwen3.5-plus 1,364 3,504 2,031 1,031 2,150 1,980 1,266 1,121 1,117 1,588 17,152
minimax/MiniMax-M2.5 822 9,448 1,390 868 6,678 1,102 809 820 788 1,189 23,914

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0028 .0047 .0031 .0036 .0025 .0032 .0097 .0085 .0024 .0028 $0.04
mistral/devstral-2512 .0063 .0217 .0069 .0065 .0308 .0206 .0044 .0094 .0061 .0080 $0.12
alibaba/qwen3.5-plus .0108 .0194 .0129 .0145 .0125 .0160 .0101 .0146 .0093 .0188 $0.14
zai/glm-5 .0177 .0227 .0056 .0066 .0071 .0063 .0225 .0180 .0295 .0221 $0.16
openai-codex/gpt-5.3-codex .0146 .0254 .0162 .0120 .0131 .0148 .0356 .0262 .0138 .0151 $0.19
anthropic/claude-haiku-4-5 .0228 .0122 .0331 .0106 .0333 .0140 .0317 .0164 .0261 .0155 $0.22
minimax/MiniMax-M2.5 .0138 .0945 .0080 .0147 .0443 .0251 .0034 .0058 .0159 .0207 $0.25
anthropic/claude-sonnet-4-6 .0325 .0257 .0406 .0331 .0325 .0235 .0330 .0252 .0302 .0192 $0.30
alibaba/qwen3-coder-next .0243 .1128 .0081 .0117 .0502 .0599 .0258 .0409 .0506 .0945 $0.48
anthropic/claude-opus-4-6 .1078 .0927 .1421 .0628 .1359 .0827 .1080 .0786 .0985 .1340 $0.94

Observations

All 10 models solved all 10 parts correctly on the first attempt — matching Python's clean sweep.

claude-haiku-4-5 — fastest overall at 113s. Fastest or near-fastest on 7 of 10 parts.

devstral-2512 — fastest or tied-fastest on more individual parts than any other model, with five finishes under 10 seconds, but a 48-second D3P1 spike pushes its total to 138s. The token data explains the spike: 4,973 output tokens on D3P1 vs. a 428–651 range on most other parts.

gpt-5.3-codex — fewest tokens: 3,995 total, under 400 per part on average.

kimi-coding/k2p5 — cheapest at $0.04 for all 10 parts. Fifth in speed (178s), second in token count (5,293).

claude-opus-4-6 — $0.94 total, the most expensive at ~$0.09 per part. 204s total, 8,921 tokens.

qwen3-coder-next — 164 seconds on D1P2, with 6,720 output tokens on that single part. Every other part was 10–39s.

minimax/MiniMax-M2.5 — 640s total, slowest but correct on every part across all benchmarks so far.

zai/glm-5 completed all 10 parts in a solo re-run (309s total). Originally ejected due to API 429 errors, it was re-run after ZAI's service stabilized. 5,788 tokens and $0.16 total — mid-pack on speed, but third in token efficiency behind codex and k2p5.

Cross-language comparison

With five benchmarks now complete, some patterns are emerging:

Language          Models completing all 10 parts
Python            10/10
Ruby              10/10
Haskell           7/11
OCaml             5/9
ReScript (run 2)  2/10

Completion rates may correlate with how widely each language is represented in public codebases.

The speed rankings shift across languages. haiku is fastest in Ruby (113s) and OCaml (124s). devstral was fastest in Python (205s) but was ejected in Haskell and OCaml. opus is one of only two models that completed the ReScript benchmark.

What's next

Both scripting-language benchmarks now show the same pattern: every model passes, and the differences are mainly in speed, cost, and token efficiency. The next runs will test whether less mainstream target languages keep reshuffling the leaderboard.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.