Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, the Ruby benchmark, and the Elixir benchmark, I ran the same AoC 2025 Days 1–5 setup in Java.

## The contestants

| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |

## Ejections

| Model | Ejected at | Reason |
|---|---|---|
| alibaba/qwen3-coder-next | D1P2 | Wrong answer after 3 clean retries |

The remaining 9 models solved all 10 parts, though two needed retries on Day 1 Part 2: glm-5 passed on its 2nd attempt and MiniMax-M2.5 on its 3rd.

## Results (Days 1–5)

### Per-task leaderboards

#### Day 1 Part 1 — Dial rotation counting

| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 11s |
| mistral/devstral-2512 | 11s |
| alibaba/qwen3-coder-next | 12s |
| anthropic/claude-sonnet-4-6 | 14s |
| openai-codex/gpt-5.3-codex | 15s |
| kimi-coding/k2p5 | 16s |
| alibaba/qwen3.5-plus | 16s |
| anthropic/claude-opus-4-6 | 17s |
| zai/glm-5 | 26s |
| minimax/MiniMax-M2.5 | 44s |

#### Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 10s | ✓ |
| anthropic/claude-opus-4-6 | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 19s | ✓ |
| openai-codex/gpt-5.3-codex | 27s | ✓ |
| kimi-coding/k2p5 | 36s | ✓ |
| alibaba/qwen3.5-plus | 47s | ✓ |
| mistral/devstral-2512 | 111s | ✓ |
| zai/glm-5 | 282s | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 816s | ✓ (3rd try) |
| alibaba/qwen3-coder-next | — | ✗ (ejected) |
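
The puzzle text isn't reproduced here, but judging by the title, this part likely reduces to simulating the dial tick by tick and counting passes through zero. Here is a minimal sketch of that idea; the dial size (100), start position (0), and move format (`"R13"` / `"L5"`) are all assumptions, not details from the actual puzzle:

```java
public class DialCrossings {
    // Count how often the pointer passes position 0.
    // Assumptions: 100-position dial, pointer starts at 0,
    // moves look like "R13" (clockwise) or "L5" (counter-clockwise).
    static long countZeroCrossings(int size, String[] moves) {
        int pos = 0;
        long crossings = 0;
        for (String m : moves) {
            int dir = m.charAt(0) == 'R' ? 1 : -1;
            int amount = Integer.parseInt(m.substring(1));
            for (int i = 0; i < amount; i++) {      // step one tick at a time
                pos = ((pos + dir) % size + size) % size;
                if (pos == 0) crossings++;          // count every pass through 0
            }
        }
        return crossings;
    }

    public static void main(String[] args) {
        // Two full clockwise revolutions pass 0 twice; "L1" just steps to 99.
        System.out.println(countZeroCrossings(100, new String[] {"R200", "L1"}));
    }
}
```

Stepping one tick at a time is O(total rotation), which is comfortably fast for typical AoC input sizes. A closed-form count per move is possible but easy to get subtly wrong at the boundaries, which may be the kind of thing that tripped up the slower models here.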

#### Day 2 Part 1 — Summing repeated-digit IDs in ranges

| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 11s |
| mistral/devstral-2512 | 12s |
| kimi-coding/k2p5 | 13s |
| openai-codex/gpt-5.3-codex | 23s |
| anthropic/claude-sonnet-4-6 | 25s |
| anthropic/claude-opus-4-6 | 27s |
| minimax/MiniMax-M2.5 | 32s |
| zai/glm-5 | 34s |
| alibaba/qwen3.5-plus | 48s |

#### Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

| Model | Time |
|---|---|
| mistral/devstral-2512 | 9s |
| anthropic/claude-haiku-4-5 | 10s |
| kimi-coding/k2p5 | 13s |
| openai-codex/gpt-5.3-codex | 16s |
| alibaba/qwen3.5-plus | 23s |
| anthropic/claude-opus-4-6 | 25s |
| minimax/MiniMax-M2.5 | 31s |
| zai/glm-5 | 39s |
| anthropic/claude-sonnet-4-6 | 62s |
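
The "any repeat count" wording suggests the core check is whether an ID's decimal representation is some digit block repeated two or more times. A hedged sketch of that check; this reading of the task is an assumption, the real puzzle may define the pattern differently:

```java
public class RepeatedPattern {
    // Assumed condition: the ID's decimal string is a shorter block of
    // digits repeated >= 2 times (e.g. 1212, 777, 123123123).
    static boolean isRepeatedPattern(long id) {
        String s = Long.toString(id);
        int n = s.length();
        for (int len = 1; len <= n / 2; len++) {
            if (n % len != 0) continue;             // block must tile the string
            String block = s.substring(0, len);
            boolean ok = true;
            for (int i = len; i < n; i += len) {
                if (!s.regionMatches(i, block, 0, len)) { ok = false; break; }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPattern(123123));  // "123" repeated twice
        System.out.println(isRepeatedPattern(1231231)); // no block tiles this
    }
}
```

Checking every block length that divides the string is O(d²) in the digit count, which is negligible per ID; the expensive part of a task like this is usually iterating over large ranges, not the per-ID test.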

#### Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time |
|---|---|
| kimi-coding/k2p5 | 14s |
| openai-codex/gpt-5.3-codex | 16s |
| anthropic/claude-sonnet-4-6 | 20s |
| anthropic/claude-opus-4-6 | 21s |
| mistral/devstral-2512 | 21s |
| anthropic/claude-haiku-4-5 | 36s |
| alibaba/qwen3.5-plus | 39s |
| zai/glm-5 | 83s |
| minimax/MiniMax-M2.5 | 110s |

#### Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

| Model | Time |
|---|---|
| mistral/devstral-2512 | 10s |
| anthropic/claude-haiku-4-5 | 14s |
| anthropic/claude-sonnet-4-6 | 15s |
| kimi-coding/k2p5 | 18s |
| anthropic/claude-opus-4-6 | 19s |
| openai-codex/gpt-5.3-codex | 20s |
| minimax/MiniMax-M2.5 | 25s |
| zai/glm-5 | 43s |
| alibaba/qwen3.5-plus | 68s |
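
If the task is what the title hints at — pick k digits from a bank's digit string, keeping their order, to form the largest k-digit number — it maps onto the classic greedy-with-stack algorithm. That mapping is an assumption; the sketch below shows the technique, not the puzzle's actual rules:

```java
public class MaxJoltage {
    // Greedy monotonic stack: pop a smaller digit whenever a larger one
    // arrives and enough digits remain to still reach length k.
    static String maxDigits(String digits, int k) {
        StringBuilder stack = new StringBuilder();
        int n = digits.length();
        for (int i = 0; i < n; i++) {
            char c = digits.charAt(i);
            while (stack.length() > 0
                    && stack.charAt(stack.length() - 1) < c
                    && stack.length() - 1 + (n - i) >= k) {
                stack.setLength(stack.length() - 1);    // discard the smaller digit
            }
            if (stack.length() < k) stack.append(c);
        }
        return stack.toString();
    }

    public static void main(String[] args) {
        System.out.println(maxDigits("2736", 2));   // best 2-digit subsequence: 76
    }
}
```

This is linear in the string length, so going from a 2-digit target (Part 1) to a 12-digit one (Part 2) costs essentially nothing — assuming the greedy reading is correct.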

#### Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 13s |
| openai-codex/gpt-5.3-codex | 16s |
| mistral/devstral-2512 | 16s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-opus-4-6 | 20s |
| alibaba/qwen3.5-plus | 20s |
| zai/glm-5 | 30s |
| kimi-coding/k2p5 | 45s |
| minimax/MiniMax-M2.5 | 53s |

#### Day 4 Part 2 — Iterative grid removal simulation

| Model | Time |
|---|---|
| mistral/devstral-2512 | 9s |
| anthropic/claude-haiku-4-5 | 13s |
| alibaba/qwen3.5-plus | 17s |
| anthropic/claude-opus-4-6 | 18s |
| anthropic/claude-sonnet-4-6 | 21s |
| openai-codex/gpt-5.3-codex | 21s |
| kimi-coding/k2p5 | 32s |
| minimax/MiniMax-M2.5 | 38s |
| zai/glm-5 | 41s |
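
An iterative removal simulation of this shape usually has one classic pitfall: removals within a round must be applied simultaneously, not while scanning. A sketch of the loop structure; the specific rule (remove `'@'` cells with fewer than 4 of 8 occupied neighbors) and the cell symbols are invented placeholders, not the puzzle's definition:

```java
import java.util.ArrayList;
import java.util.List;

public class GridRemoval {
    // Assumed rules: each round, every occupied cell '@' with fewer than 4
    // occupied neighbors (8-directional) is removed all at once; repeat
    // until the grid stops changing. Returns the total cells removed.
    static int removeUntilStable(char[][] grid) {
        int removedTotal = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            List<int[]> toRemove = new ArrayList<>();
            for (int r = 0; r < grid.length; r++) {
                for (int c = 0; c < grid[r].length; c++) {
                    if (grid[r][c] != '@') continue;
                    int neighbors = 0;
                    for (int dr = -1; dr <= 1; dr++) {
                        for (int dc = -1; dc <= 1; dc++) {
                            if (dr == 0 && dc == 0) continue;
                            int nr = r + dr, nc = c + dc;
                            if (nr >= 0 && nr < grid.length
                                    && nc >= 0 && nc < grid[nr].length
                                    && grid[nr][nc] == '@') neighbors++;
                        }
                    }
                    if (neighbors < 4) toRemove.add(new int[] {r, c});
                }
            }
            for (int[] p : toRemove) {          // apply removals all at once
                grid[p[0]][p[1]] = '.';
                removedTotal++;
                changed = true;
            }
        }
        return removedTotal;
    }

    public static void main(String[] args) {
        char[][] grid = {
            "@@@".toCharArray(),
            "@@@".toCharArray(),
            "@@@".toCharArray(),
        };
        // Corners fall first (3 neighbors), then the arms, then the center.
        System.out.println(removeUntilStable(grid));
    }
}
```

Collecting candidates into `toRemove` before mutating the grid is what makes each round well-defined; scanning and removing in place gives order-dependent answers.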

#### Day 5 Part 1 — Range membership checking

| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 14s |
| mistral/devstral-2512 | 17s |
| anthropic/claude-opus-4-6 | 18s |
| openai-codex/gpt-5.3-codex | 21s |
| anthropic/claude-sonnet-4-6 | 24s |
| alibaba/qwen3.5-plus | 28s |
| kimi-coding/k2p5 | 29s |
| zai/glm-5 | 41s |
| minimax/MiniMax-M2.5 | 66s |

#### Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

| Model | Time |
|---|---|
| mistral/devstral-2512 | 9s |
| anthropic/claude-haiku-4-5 | 10s |
| kimi-coding/k2p5 | 13s |
| anthropic/claude-sonnet-4-6 | 15s |
| openai-codex/gpt-5.3-codex | 15s |
| anthropic/claude-opus-4-6 | 17s |
| alibaba/qwen3.5-plus | 19s |
| zai/glm-5 | 41s |
| minimax/MiniMax-M2.5 | 56s |
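
Counting distinct IDs covered by overlapping ranges is the standard sort-and-merge interval problem. A minimal sketch, assuming inclusive `[lo, hi]` ranges (how the actual puzzle input maps onto this is an assumption):

```java
import java.util.Arrays;

public class RangeUnion {
    // Sort ranges by start, merge overlapping/adjacent ones, and sum the
    // lengths of the merged intervals. Ranges are inclusive [lo, hi].
    static long countCovered(long[][] ranges) {
        long[][] sorted = ranges.clone();
        Arrays.sort(sorted, (a, b) -> Long.compare(a[0], b[0]));
        long total = 0, curLo = 0, curHi = 0;
        boolean open = false;
        for (long[] r : sorted) {
            if (!open) {
                curLo = r[0]; curHi = r[1]; open = true;
            } else if (r[0] <= curHi + 1) {
                curHi = Math.max(curHi, r[1]);      // overlap/adjacent: extend
            } else {
                total += curHi - curLo + 1;         // close the merged interval
                curLo = r[0]; curHi = r[1];
            }
        }
        if (open) total += curHi - curLo + 1;
        return total;
    }

    public static void main(String[] args) {
        // [3,7] and [5,10] merge to [3,10] (8 IDs); [20,20] adds 1 more.
        System.out.println(countCovered(new long[][] {{3, 7}, {5, 10}, {20, 20}}));
    }
}
```

Merging first, rather than iterating every ID, keeps the cost proportional to the number of ranges instead of the size of the ID space — the usual difference between a Part 2 that finishes instantly and one that never does.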

## Summary tables

### Wall-clock time (seconds)

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | 11s | 10s | 11s | 10s | 36s | 14s | 13s | 13s | 14s | 10s | 142s |
| openai-codex/gpt-5.3-codex | 15s | 27s | 23s | 16s | 16s | 20s | 16s | 21s | 21s | 15s | 190s |
| anthropic/claude-opus-4-6 | 17s | 16s | 27s | 25s | 21s | 19s | 20s | 18s | 18s | 17s | 198s |
| mistral/devstral-2512 | 11s | 111s | 12s | 9s | 21s | 10s | 16s | 9s | 17s | 9s | 225s |
| kimi-coding/k2p5 | 16s | 36s | 13s | 13s | 14s | 18s | 45s | 32s | 29s | 13s | 229s |
| anthropic/claude-sonnet-4-6 | 14s | 19s | 25s | 62s | 20s | 15s | 19s | 21s | 24s | 15s | 234s |
| alibaba/qwen3.5-plus | 16s | 47s | 48s | 23s | 39s | 68s | 20s | 17s | 28s | 19s | 325s |
| zai/glm-5 | 26s | 282s | 34s | 39s | 83s | 43s | 30s | 41s | 41s | 41s | 660s |
| minimax/MiniMax-M2.5 | 44s | 816s | 32s | 31s | 110s | 25s | 53s | 38s | 66s | 56s | 1271s |
| — ejected — | | | | | | | | | | | |
| alibaba/qwen3-coder-next | 12s | ✗ | — | — | — | — | — | — | — | — | — |

### Output tokens per part

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 525 | 654 | 621 | 576 | 555 | 573 | 584 | 855 | 899 | 683 | 6,525 |
| zai/glm-5 | 620 | 1,638 | 771 | 653 | 1,590 | 830 | 704 | 744 | 945 | 749 | 9,244 |
| kimi-coding/k2p5 | 673 | 3,081 | 636 | 745 | 582 | 814 | 825 | 932 | 1,112 | 685 | 10,085 |
| anthropic/claude-opus-4-6 | 793 | 755 | 1,541 | 1,586 | 957 | 918 | 985 | 1,083 | 870 | 922 | 10,410 |
| anthropic/claude-sonnet-4-6 | 772 | 1,071 | 1,703 | 4,317 | 923 | 919 | 997 | 1,042 | 1,495 | 858 | 14,097 |
| anthropic/claude-haiku-4-5 | 1,030 | 1,023 | 1,061 | 994 | 4,382 | 1,540 | 1,391 | 1,378 | 1,271 | 1,213 | 15,283 |
| mistral/devstral-2512 | 548 | 12,994 | 794 | 687 | 1,547 | 815 | 942 | 1,043 | 1,549 | 903 | 21,822 |
| alibaba/qwen3.5-plus | 1,329 | 6,333 | 4,610 | 1,791 | 4,252 | 7,334 | 1,314 | 1,129 | 1,980 | 1,525 | 31,597 |
| minimax/MiniMax-M2.5 | 1,091 | 28,159 | 1,273 | 1,336 | 3,970 | 928 | 1,123 | 1,071 | 1,866 | 900 | 41,717 |
| — ejected — | | | | | | | | | | | |
| alibaba/qwen3-coder-next | 608 | 10,607 | — | — | — | — | — | — | — | — | 11,215 |

### API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0086 | .0118 | .0030 | .0038 | .0089 | .0151 | .0139 | .0100 | .0049 | .0039 | $0.08 |
| alibaba/qwen3.5-plus | .0108 | .0272 | .0272 | .0200 | .0178 | .0294 | .0086 | .0121 | .0171 | .0158 | $0.19 |
| zai/glm-5 | .0181 | .0314 | .0070 | .0079 | .0387 | .0239 | .0063 | .0085 | .0317 | .0240 | $0.20 |
| openai-codex/gpt-5.3-codex | .0199 | .0199 | .0233 | .0179 | .0212 | .0164 | .0176 | .0257 | .0274 | .0163 | $0.21 |
| anthropic/claude-haiku-4-5 | .0293 | .0153 | .0286 | .0178 | .0556 | .0153 | .0365 | .0222 | .0315 | .0117 | $0.26 |
| mistral/devstral-2512 | .0084 | .1834 | .0090 | .0096 | .0103 | .0109 | .0101 | .0112 | .0137 | .0115 | $0.28 |
| anthropic/claude-sonnet-4-6 | .0359 | .0308 | .0527 | .0949 | .0376 | .0274 | .0390 | .0324 | .0508 | .0248 | $0.43 |
| minimax/MiniMax-M2.5 | .0170 | .2968 | .0060 | .0076 | .0257 | .0231 | .0059 | .0082 | .0342 | .0261 | $0.45 |
| anthropic/claude-opus-4-6 | .1174 | .0803 | .1318 | .0694 | .1188 | .0841 | .1196 | .0938 | .1118 | .0974 | $1.02 |
| — ejected — | | | | | | | | | | | |
| alibaba/qwen3-coder-next | .0241 | .2475 | — | — | — | — | — | — | — | — | $0.27 |

## Observations

claude-haiku-4-5 is the fastest overall at 142s. It never needed a retry, and 9 of its 10 parts came in at 14 seconds or less. The D3P1 spike (36s, 4,382 tokens) is the only outlier — still not slow, but notably wordier than its usual output.

gpt-5.3-codex is the most token-efficient: 6,525 tokens total, under 900 tokens per part across the board. This matches the pattern from previous benchmarks — codex consistently writes the most compact solutions.

devstral-2512 spent 111 seconds and 12,994 tokens on D1P2 alone — over half its total token output. On the remaining 9 parts it averaged about 13 seconds.

sonnet-4-6 spent 62 seconds and 4,317 tokens on D2P2. Every other part took 14–25s.

MiniMax-M2.5 needed three attempts on Day 1 Part 2: 816 seconds and 28,159 tokens on that single part — 67% of its total token output across all 10 parts. It then solved Days 2–5 without trouble, though consistently slower than the other models (25–110s per part).

qwen3.5-plus produced the most tokens among the completers: 31,597. The D3P2 spike (7,334 tokens, 68s) stands out. Total cost: ~$0.19.

claude-opus-4-6 is the most expensive at ~$1.02 total, while placing third in speed (198s) with 10,410 tokens.

kimi-coding/k2p5 is the cheapest at ~$0.08, fifth in speed (229s), and third in tokens (10,085).

qwen3-coder-next was the only ejection. It solved D1P1 fast (12s) but couldn't produce the correct D1P2 answer in three clean attempts, spending 10,607 tokens and $0.25 on that part. The same model was also ejected on D1P2 in the Haskell run and the Elixir run.

## Cross-language snapshot

| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |

Java sits alongside Python and Ruby in the high-completion tier. The single ejection (qwen3-coder-next on D1P2) matches a pattern — the same model also failed D1P2 in Haskell and Elixir.

Benchmarked on 2026-02-26 using pi as the agent harness.

This post was written with AI assistance.