Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, and the ReScript benchmark, I ran the same AoC 2025 Days 1–5 puzzles in Ruby.
Same setup as before — the question is whether the leaderboard reshuffles when the target language changes.
## The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
## Ejections
None, in the end: all 10 models solved all 10 parts correctly on the first attempt. zai/glm-5 was initially ejected on D1P1 due to persistent HTTP 429 errors from ZAI's API; once the API stabilized, it was re-run solo, completed all 10 parts without issue, and its results are included below.
## Results (Days 1–5)
### Per-task leaderboards
#### Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| mistral/devstral-2512 | 8s |
| anthropic/claude-haiku-4-5 | 9s |
| alibaba/qwen3-coder-next | 11s |
| kimi-coding/k2p5 | 12s |
| openai-codex/gpt-5.3-codex | 12s |
| anthropic/claude-sonnet-4-6 | 13s |
| anthropic/claude-opus-4-6 | 18s |
| alibaba/qwen3.5-plus | 18s |
| zai/glm-5 | 27s |
| minimax/MiniMax-M2.5 | 48s |
#### Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 8s |
| openai-codex/gpt-5.3-codex | 10s |
| anthropic/claude-sonnet-4-6 | 17s |
| mistral/devstral-2512 | 22s |
| anthropic/claude-opus-4-6 | 26s |
| alibaba/qwen3.5-plus | 29s |
| kimi-coding/k2p5 | 35s |
| zai/glm-5 | 63s |
| alibaba/qwen3-coder-next | 164s |
| minimax/MiniMax-M2.5 | 187s |
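For illustration only, since the puzzle text isn't reproduced here: assuming the dial is a 0–99 ring and each move is a signed click count (my reading of the titles; the function name and dial size are hypothetical), counting how often the pointer lands on zero could look like this in Ruby:

```ruby
# Hypothetical sketch, not the actual puzzle spec: a dial with positions
# 0..size-1 starts at `start`; each move is a signed number of clicks.
# Count how many times the pointer lands exactly on position 0.
def count_zero_hits(start, moves, size = 100)
  pos = start
  moves.sum do |delta|
    step = delta <=> 0 # +1 for clockwise, -1 for counterclockwise
    delta.abs.times.count do
      pos = (pos + step) % size
      pos.zero?
    end
  end
end

count_zero_hits(50, [50])     # lands on 0 once
count_zero_hits(10, [5, -20]) # passes 0 once on the way down
```

The click-by-click walk is O(total rotation), which is fine at AoC input sizes; a constant-time version per move is possible but less obvious to get right.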
#### Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| mistral/devstral-2512 | 10s |
| alibaba/qwen3-coder-next | 10s |
| openai-codex/gpt-5.3-codex | 13s |
| kimi-coding/k2p5 | 14s |
| anthropic/claude-haiku-4-5 | 17s |
| anthropic/claude-sonnet-4-6 | 21s |
| alibaba/qwen3.5-plus | 23s |
| zai/glm-5 | 25s |
| anthropic/claude-opus-4-6 | 27s |
| minimax/MiniMax-M2.5 | 49s |
#### Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 7s |
| anthropic/claude-haiku-4-5 | 11s |
| kimi-coding/k2p5 | 13s |
| alibaba/qwen3.5-plus | 17s |
| openai-codex/gpt-5.3-codex | 20s |
| anthropic/claude-opus-4-6 | 25s |
| zai/glm-5 | 26s |
| anthropic/claude-sonnet-4-6 | 27s |
| alibaba/qwen3-coder-next | 28s |
| minimax/MiniMax-M2.5 | 31s |
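As an illustration of what the Part 2 check might involve (paraphrased from the title alone; the method names are mine), a Ruby predicate for "digits are some block repeated two or more times", plus the sum over ranges, could be:

```ruby
# Hypothetical sketch: an ID matches when its decimal digits are one block
# repeated at least twice, e.g. 1212 ("12" x 2) or 777 ("7" x 3).
def repeated_pattern?(id)
  s = id.to_s
  (1..s.length / 2).any? do |len|
    s.length % len == 0 && s == s[0, len] * (s.length / len)
  end
end

# Sum all matching IDs over inclusive [lo, hi] ranges.
def sum_repeated(ranges)
  ranges.sum { |lo, hi| (lo..hi).select { |id| repeated_pattern?(id) }.sum }
end
```

Brute force over the ranges is enough at typical AoC input sizes; most of the models' solutions presumably did something similar.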
#### Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 11s |
| anthropic/claude-haiku-4-5 | 14s |
| anthropic/claude-sonnet-4-6 | 17s |
| anthropic/claude-opus-4-6 | 24s |
| alibaba/qwen3.5-plus | 27s |
| alibaba/qwen3-coder-next | 28s |
| zai/glm-5 | 34s |
| kimi-coding/k2p5 | 36s |
| mistral/devstral-2512 | 48s |
| minimax/MiniMax-M2.5 | 165s |
#### Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 11s |
| mistral/devstral-2512 | 11s |
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 14s |
| kimi-coding/k2p5 | 15s |
| anthropic/claude-opus-4-6 | 19s |
| zai/glm-5 | 20s |
| alibaba/qwen3-coder-next | 23s |
| alibaba/qwen3.5-plus | 24s |
| minimax/MiniMax-M2.5 | 36s |
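Assuming each bank is a digit string and the goal is the largest k-digit subsequence in original order (my paraphrase of the titles), this is the classic stack-based greedy, which handles Part 1 (k = 2) and Part 2 (k = 12) identically:

```ruby
# Hypothetical sketch: from one bank's digit string, keep k digits in their
# original order so the resulting number is as large as possible.
# Greedy: pop smaller digits off a stack while we can still afford to drop.
def max_joltage(digits, k)
  drop = digits.length - k # how many digits we are allowed to discard
  stack = []
  digits.each_char do |d|
    while drop > 0 && !stack.empty? && stack.last < d
      stack.pop
      drop -= 1
    end
    stack << d
  end
  stack.first(k).join.to_i
end

max_joltage("2736", 2) # => 76
```

That both parts reduce to one parameter may explain why D3P2 times were mostly flat compared with the D3P1 spikes.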
#### Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 10s |
| anthropic/claude-haiku-4-5 | 11s |
| anthropic/claude-sonnet-4-6 | 14s |
| openai-codex/gpt-5.3-codex | 15s |
| kimi-coding/k2p5 | 15s |
| anthropic/claude-opus-4-6 | 19s |
| alibaba/qwen3.5-plus | 22s |
| alibaba/qwen3-coder-next | 25s |
| minimax/MiniMax-M2.5 | 29s |
| zai/glm-5 | 32s |
#### Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| mistral/devstral-2512 | 8s |
| anthropic/claude-haiku-4-5 | 12s |
| anthropic/claude-sonnet-4-6 | 14s |
| openai-codex/gpt-5.3-codex | 14s |
| kimi-coding/k2p5 | 14s |
| anthropic/claude-opus-4-6 | 16s |
| alibaba/qwen3.5-plus | 19s |
| zai/glm-5 | 25s |
| minimax/MiniMax-M2.5 | 27s |
| alibaba/qwen3-coder-next | 35s |
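Part 2's removal loop is a straightforward fixed-point simulation. A sketch under assumed rules (a grid of `@` rolls, where a roll is removable when it has fewer than four occupied 8-neighbours; the threshold and symbols are guesses, not the real spec):

```ruby
# Hypothetical sketch: count occupied 8-neighbours of cell (r, c).
def neighbours(grid, r, c)
  [-1, 0, 1].product([-1, 0, 1]).count do |dr, dc|
    next false if dr.zero? && dc.zero?
    rr, cc = r + dr, c + dc
    rr.between?(0, grid.length - 1) &&
      cc.between?(0, grid[0].length - 1) &&
      grid[rr][cc] == '@'
  end
end

# Repeatedly remove every '@' cell with fewer than four occupied neighbours
# until nothing changes; return the total number removed.
def simulate_removals(lines)
  grid = lines.map(&:chars)
  removed = 0
  loop do
    # Collect first, then remove, so each pass is simultaneous.
    doomed = grid.each_index.flat_map do |r|
      grid[r].each_index
             .select { |c| grid[r][c] == '@' && neighbours(grid, r, c) < 4 }
             .map { |c| [r, c] }
    end
    break if doomed.empty?
    doomed.each { |r, c| grid[r][c] = '.' }
    removed += doomed.size
  end
  removed
end
```

The collect-then-remove split is the part models most often get wrong in simulations like this; removing in place mid-scan changes the neighbour counts within a pass.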
#### Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| mistral/devstral-2512 | 8s |
| anthropic/claude-haiku-4-5 | 10s |
| kimi-coding/k2p5 | 12s |
| openai-codex/gpt-5.3-codex | 12s |
| alibaba/qwen3-coder-next | 13s |
| anthropic/claude-sonnet-4-6 | 14s |
| anthropic/claude-opus-4-6 | 15s |
| alibaba/qwen3.5-plus | 16s |
| zai/glm-5 | 28s |
| minimax/MiniMax-M2.5 | 34s |
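Part 1 is the simplest task of the five: test each ID against the ranges. An illustrative Ruby sketch (names mine):

```ruby
# Hypothetical sketch: an ID is "fresh" if it falls in any inclusive range.
def fresh?(id, ranges)
  ranges.any? { |lo, hi| id.between?(lo, hi) }
end

def count_fresh_ids(ids, ranges)
  ids.count { |id| fresh?(id, ranges) }
end
```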
#### Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| mistral/devstral-2512 | 6s |
| anthropic/claude-haiku-4-5 | 8s |
| anthropic/claude-sonnet-4-6 | 12s |
| kimi-coding/k2p5 | 12s |
| openai-codex/gpt-5.3-codex | 12s |
| anthropic/claude-opus-4-6 | 15s |
| zai/glm-5 | 29s |
| minimax/MiniMax-M2.5 | 34s |
| alibaba/qwen3.5-plus | 36s |
| alibaba/qwen3-coder-next | 39s |
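Part 2, counting every ID covered by overlapping ranges, is the classic interval-union problem: sort by start, merge, and sum the merged lengths. A Ruby sketch under that assumption:

```ruby
# Hypothetical sketch: count distinct IDs covered by possibly-overlapping
# inclusive ranges, via sort-and-merge so no ID is counted twice.
def count_covered(ranges)
  total = 0
  cur_lo = cur_hi = nil
  ranges.sort_by(&:first).each do |lo, hi|
    if cur_hi && lo <= cur_hi + 1
      cur_hi = [cur_hi, hi].max # overlaps or touches: extend current run
    else
      total += cur_hi - cur_lo + 1 if cur_hi
      cur_lo, cur_hi = lo, hi
    end
  end
  total += cur_hi - cur_lo + 1 if cur_hi
  total
end

count_covered([[1, 5], [3, 7], [10, 12]]) # => 10
```

Merging ranges that merely touch (`lo == cur_hi + 1`) is safe here because IDs are integers, so adjacent ranges cover a contiguous block.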
## Summary tables
### Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | 9s | 8s | 17s | 11s | 14s | 13s | 11s | 12s | 10s | 8s | 113s |
| openai-codex/gpt-5.3-codex | 12s | 10s | 13s | 20s | 11s | 11s | 15s | 14s | 12s | 12s | 130s |
| mistral/devstral-2512 | 8s | 22s | 10s | 7s | 48s | 11s | 10s | 8s | 8s | 6s | 138s |
| anthropic/claude-sonnet-4-6 | 13s | 17s | 21s | 27s | 17s | 14s | 14s | 14s | 14s | 12s | 163s |
| kimi-coding/k2p5 | 12s | 35s | 14s | 13s | 36s | 15s | 15s | 14s | 12s | 12s | 178s |
| anthropic/claude-opus-4-6 | 18s | 26s | 27s | 25s | 24s | 19s | 19s | 16s | 15s | 15s | 204s |
| alibaba/qwen3.5-plus | 18s | 29s | 23s | 17s | 27s | 24s | 22s | 19s | 16s | 36s | 231s |
| zai/glm-5 | 27s | 63s | 25s | 26s | 34s | 20s | 32s | 25s | 28s | 29s | 309s |
| alibaba/qwen3-coder-next | 11s | 164s | 10s | 28s | 28s | 23s | 25s | 35s | 13s | 39s | 376s |
| minimax/MiniMax-M2.5 | 48s | 187s | 49s | 31s | 165s | 36s | 29s | 27s | 34s | 34s | 640s |
### Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 319 | 382 | 367 | 377 | 337 | 396 | 437 | 671 | 330 | 379 | 3,995 |
| kimi-coding/k2p5 | 408 | 837 | 522 | 621 | 406 | 532 | 542 | 583 | 398 | 444 | 5,293 |
| zai/glm-5 | 543 | 1,291 | 485 | 507 | 605 | 379 | 542 | 498 | 464 | 474 | 5,788 |
| anthropic/claude-sonnet-4-6 | 598 | 823 | 1,049 | 1,197 | 658 | 743 | 689 | 698 | 577 | 590 | 7,622 |
| anthropic/claude-opus-4-6 | 565 | 1,221 | 1,349 | 1,405 | 958 | 828 | 732 | 725 | 566 | 572 | 8,921 |
| anthropic/claude-haiku-4-5 | 915 | 792 | 1,384 | 1,000 | 1,295 | 906 | 903 | 837 | 813 | 755 | 9,600 |
| mistral/devstral-2512 | 538 | 3,040 | 608 | 510 | 4,973 | 730 | 600 | 651 | 428 | 497 | 12,575 |
| alibaba/qwen3-coder-next | 688 | 6,720 | 744 | 719 | 766 | 768 | 802 | 1,135 | 707 | 1,227 | 14,276 |
| alibaba/qwen3.5-plus | 1,364 | 3,504 | 2,031 | 1,031 | 2,150 | 1,980 | 1,266 | 1,121 | 1,117 | 1,588 | 17,152 |
| minimax/MiniMax-M2.5 | 822 | 9,448 | 1,390 | 868 | 6,678 | 1,102 | 809 | 820 | 788 | 1,189 | 23,914 |
### API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0028 | .0047 | .0031 | .0036 | .0025 | .0032 | .0097 | .0085 | .0024 | .0028 | $0.04 |
| mistral/devstral-2512 | .0063 | .0217 | .0069 | .0065 | .0308 | .0206 | .0044 | .0094 | .0061 | .0080 | $0.12 |
| alibaba/qwen3.5-plus | .0108 | .0194 | .0129 | .0145 | .0125 | .0160 | .0101 | .0146 | .0093 | .0188 | $0.14 |
| zai/glm-5 | .0177 | .0227 | .0056 | .0066 | .0071 | .0063 | .0225 | .0180 | .0295 | .0221 | $0.16 |
| openai-codex/gpt-5.3-codex | .0146 | .0254 | .0162 | .0120 | .0131 | .0148 | .0356 | .0262 | .0138 | .0151 | $0.19 |
| anthropic/claude-haiku-4-5 | .0228 | .0122 | .0331 | .0106 | .0333 | .0140 | .0317 | .0164 | .0261 | .0155 | $0.22 |
| minimax/MiniMax-M2.5 | .0138 | .0945 | .0080 | .0147 | .0443 | .0251 | .0034 | .0058 | .0159 | .0207 | $0.25 |
| anthropic/claude-sonnet-4-6 | .0325 | .0257 | .0406 | .0331 | .0325 | .0235 | .0330 | .0252 | .0302 | .0192 | $0.30 |
| alibaba/qwen3-coder-next | .0243 | .1128 | .0081 | .0117 | .0502 | .0599 | .0258 | .0409 | .0506 | .0945 | $0.48 |
| anthropic/claude-opus-4-6 | .1078 | .0927 | .1421 | .0628 | .1359 | .0827 | .1080 | .0786 | .0985 | .1340 | $0.94 |
## Observations
All 10 models solved all 10 parts correctly on the first attempt, matching Python's clean sweep.

- claude-haiku-4-5 — fastest overall at 113s; fastest or near-fastest on 7 of 10 parts.
- devstral-2512 — fastest on the most individual parts (five sub-10-second finishes), but a 48-second spike on D3P1 pushed its total to 138s; the token data shows 4,973 output tokens on D3P1 vs. a 428–651 range on most other parts.
- gpt-5.3-codex — fewest output tokens at 3,995 total, under 400 per part on average.
- kimi-coding/k2p5 — cheapest at $0.04 for all 10 parts; fifth in speed (178s) and second in token count (5,293).
- claude-opus-4-6 — most expensive at $0.94 total (~$0.09 per part); 204s and 8,921 tokens overall.
- qwen3-coder-next — 164 seconds and 6,720 output tokens on D1P2 alone; every other part took 10–39s.
- minimax/MiniMax-M2.5 — slowest at 640s total, but correct on every part, as it has been across all benchmarks so far.
- zai/glm-5 — completed all 10 parts in a solo re-run (309s total) after its initial ejection on API 429 errors; 5,788 tokens and $0.16 put it mid-pack on speed but third in token efficiency, behind codex and k2p5.
## Cross-language comparison
With five benchmarks now complete, some patterns are emerging:
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Completion rates may correlate with how widely each language is represented in public codebases.
The speed rankings shift across languages: haiku is fastest in Ruby (113s) and was fastest in OCaml (124s); devstral was fastest in Python (205s) but was ejected in both Haskell and OCaml; and opus was one of only two models to complete the ReScript benchmark.
## What's next
With both scripting-language benchmarks now showing the same pattern (every model passes; the differences are mainly speed, cost, and token efficiency), the remaining question is whether upcoming target languages reshuffle the leaderboard again.
Benchmarked on 2026-02-26 using pi as the agent harness.
This post was written with AI assistance.