
Benchmarking LLMs on Advent of Code 2025 (Java)

Tags = [ Java, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, ReScript, Ruby, and Elixir benchmarks, I ran the same AoC 2025 Days 1–5 setup in Java.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

Model                     Ejected at  Reason
alibaba/qwen3-coder-next  D1P2        Wrong answer after 3 clean retries

The remaining 9 models solved all 10 parts, though two needed retries on Day 1 Part 2: glm-5 passed on its 2nd attempt and MiniMax-M2.5 on its 3rd.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                        Time
anthropic/claude-haiku-4-5   11s
mistral/devstral-2512        11s
alibaba/qwen3-coder-next     12s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   15s
kimi-coding/k2p5             16s
alibaba/qwen3.5-plus         16s
anthropic/claude-opus-4-6    17s
zai/glm-5                    26s
minimax/MiniMax-M2.5         44s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                        Time  Result
anthropic/claude-haiku-4-5   10s
anthropic/claude-opus-4-6    16s
anthropic/claude-sonnet-4-6  19s
openai-codex/gpt-5.3-codex   27s
kimi-coding/k2p5             36s
alibaba/qwen3.5-plus         47s
mistral/devstral-2512        111s
zai/glm-5                    282s  ✓ (2nd try)
minimax/MiniMax-M2.5         816s  ✓ (3rd try)
alibaba/qwen3-coder-next     —     ✗ (ejected)
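
The puzzle text isn't reproduced here, but a zero-crossing count of this flavor can be sketched roughly as follows. Note that the 100-position dial, the starting position of 50, and the signed-move input format are all my assumptions for illustration, not the actual puzzle spec:

```java
import java.util.List;

public class Dial {
    // Step the dial one position at a time and count every landing on 0.
    public static int countZeroCrossings(List<Integer> moves) {
        int pos = 50, crossings = 0;
        for (int move : moves) {
            int step = Integer.signum(move);
            for (int i = 0; i < Math.abs(move); i++) {
                pos = Math.floorMod(pos + step, 100); // wrap around the dial
                if (pos == 0) crossings++;
            }
        }
        return crossings;
    }

    public static void main(String[] args) {
        // 50 -> 10 (passes 0 once), then -> 80 (again), then -> 25 (a third time)
        System.out.println(countZeroCrossings(List.of(60, -30, 45))); // prints 3
    }
}
```

The stepwise simulation is deliberately naive; the long retry loops some models hit on this part suggest the subtlety is in counting passes versus landings, which a per-step check like this sidesteps.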



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                        Time
anthropic/claude-haiku-4-5   11s
mistral/devstral-2512        12s
kimi-coding/k2p5             13s
openai-codex/gpt-5.3-codex   23s
anthropic/claude-sonnet-4-6  25s
anthropic/claude-opus-4-6    27s
minimax/MiniMax-M2.5         32s
zai/glm-5                    34s
alibaba/qwen3.5-plus         48s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             13s
openai-codex/gpt-5.3-codex   16s
alibaba/qwen3.5-plus         23s
anthropic/claude-opus-4-6    25s
minimax/MiniMax-M2.5         31s
zai/glm-5                    39s
anthropic/claude-sonnet-4-6  62s
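
Going by the task title alone, the core of Part 2 is deciding whether an ID's decimal string is some block of digits repeated two or more times (so 1212 or 777 qualify, 123 does not); Part 1's fixed "repeated twice" case falls out by pinning the repeat count. A minimal check, with the qualifying rule assumed from the title:

```java
public class RepeatedPattern {
    // An ID qualifies if it equals a prefix of length <= n/2 tiled across it.
    public static boolean isRepeatedPattern(String id) {
        int n = id.length();
        for (int len = 1; len <= n / 2; len++) {
            if (n % len != 0) continue; // the block must tile the string exactly
            if (id.equals(id.substring(0, len).repeat(n / len))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPattern("1212")); // true
        System.out.println(isRepeatedPattern("123"));  // false
    }
}
```

Summing such IDs over the input ranges is then a straight loop, which may explain why nearly every model cleared both parts in well under a minute.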



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                        Time
kimi-coding/k2p5             14s
openai-codex/gpt-5.3-codex   16s
anthropic/claude-sonnet-4-6  20s
anthropic/claude-opus-4-6    21s
mistral/devstral-2512        21s
anthropic/claude-haiku-4-5   36s
alibaba/qwen3.5-plus         39s
zai/glm-5                    83s
minimax/MiniMax-M2.5         110s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                        Time
mistral/devstral-2512        10s
anthropic/claude-haiku-4-5   14s
anthropic/claude-sonnet-4-6  15s
kimi-coding/k2p5             18s
anthropic/claude-opus-4-6    19s
openai-codex/gpt-5.3-codex   20s
minimax/MiniMax-M2.5         25s
zai/glm-5                    43s
alibaba/qwen3.5-plus         68s
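
If "maximizing k-digit joltage" means picking k digits from a bank in order to form the largest possible number (my reading of the title, not a quote from the puzzle), the standard tool is a greedy monotonic stack, and it handles the 2-digit and 12-digit variants with the same code:

```java
public class Joltage {
    // Greedy max subsequence of length k: pop a smaller stacked digit whenever
    // a larger one arrives, as long as enough digits remain to reach length k.
    public static String maxSubsequence(String digits, int k) {
        StringBuilder stack = new StringBuilder();
        int n = digits.length();
        for (int i = 0; i < n; i++) {
            char c = digits.charAt(i);
            while (stack.length() > 0
                    && stack.charAt(stack.length() - 1) < c
                    && stack.length() - 1 + (n - i) >= k) {
                stack.setLength(stack.length() - 1); // drop a smaller digit
            }
            if (stack.length() < k) stack.append(c);
        }
        return stack.toString();
    }

    public static void main(String[] args) {
        System.out.println(maxSubsequence("392815", 2)); // prints 98
    }
}
```

This runs in O(n) per bank, so even a brute-force-averse model shouldn't need the 68–110s some of them spent here.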



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                        Time
anthropic/claude-haiku-4-5   13s
openai-codex/gpt-5.3-codex   16s
mistral/devstral-2512        16s
anthropic/claude-sonnet-4-6  19s
anthropic/claude-opus-4-6    20s
alibaba/qwen3.5-plus         20s
zai/glm-5                    30s
kimi-coding/k2p5             45s
minimax/MiniMax-M2.5         53s



Day 4 Part 2 — Iterative grid removal simulation

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   13s
alibaba/qwen3.5-plus         17s
anthropic/claude-opus-4-6    18s
anthropic/claude-sonnet-4-6  21s
openai-codex/gpt-5.3-codex   21s
kimi-coding/k2p5             32s
minimax/MiniMax-M2.5         38s
zai/glm-5                    41s
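
A removal simulation of this shape can be sketched as below. The specific rules here, a roll '@' being "accessible" when fewer than 4 of its 8 neighbors are rolls, and each round removing all accessible rolls at once until nothing changes, are assumptions for illustration, not the actual puzzle definition:

```java
import java.util.ArrayList;
import java.util.List;

public class PaperRolls {
    // Repeatedly remove every accessible roll in one batch; return the total.
    public static int totalRemoved(char[][] grid) {
        int removed = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            List<int[]> batch = new ArrayList<>();
            for (int r = 0; r < grid.length; r++)
                for (int c = 0; c < grid[r].length; c++)
                    if (grid[r][c] == '@' && neighbors(grid, r, c) < 4)
                        batch.add(new int[]{r, c});
            for (int[] cell : batch) {
                grid[cell[0]][cell[1]] = '.';
                changed = true;
            }
            removed += batch.size();
        }
        return removed;
    }

    static int neighbors(char[][] g, int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue;
                int nr = r + dr, nc = c + dc;
                if (nr >= 0 && nr < g.length && nc >= 0 && nc < g[nr].length
                        && g[nr][nc] == '@') count++;
            }
        return count;
    }

    public static void main(String[] args) {
        char[][] g = {"@@@".toCharArray(), "@@@".toCharArray(), "@@@".toCharArray()};
        System.out.println(totalRemoved(g)); // a 3x3 block empties in 3 rounds: 9
    }
}
```

Collecting each round's removals into a batch before mutating the grid is the key detail: removing in place while scanning would let earlier removals unlock later cells within the same round.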



Day 5 Part 1 — Range membership checking

Model                        Time
anthropic/claude-haiku-4-5   14s
mistral/devstral-2512        17s
anthropic/claude-opus-4-6    18s
openai-codex/gpt-5.3-codex   21s
anthropic/claude-sonnet-4-6  24s
alibaba/qwen3.5-plus         28s
kimi-coding/k2p5             29s
zai/glm-5                    41s
minimax/MiniMax-M2.5         66s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             13s
anthropic/claude-sonnet-4-6  15s
openai-codex/gpt-5.3-codex   15s
anthropic/claude-opus-4-6    17s
alibaba/qwen3.5-plus         19s
zai/glm-5                    41s
minimax/MiniMax-M2.5         56s
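
Counting distinct IDs covered by overlapping ranges is the classic sort-and-sweep interval merge; summing raw range lengths double-counts the overlaps. A sketch, assuming inclusive [lo, hi] ranges (the input format itself isn't shown in this post):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FreshIds {
    // Sort by lower bound, merge overlapping intervals while sweeping,
    // and sum the lengths of the merged intervals.
    public static long countCovered(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(r -> r[0]));
        long total = 0, curLo = 0, curHi = 0;
        boolean open = false;
        for (long[] r : sorted) {
            if (!open) {
                curLo = r[0]; curHi = r[1]; open = true;
            } else if (r[0] <= curHi) {
                curHi = Math.max(curHi, r[1]); // overlap: extend the interval
            } else {
                total += curHi - curLo + 1;    // close the finished interval
                curLo = r[0]; curHi = r[1];
            }
        }
        if (open) total += curHi - curLo + 1;
        return total;
    }

    public static void main(String[] args) {
        List<long[]> ranges = List.of(
                new long[]{3, 5}, new long[]{10, 14}, new long[]{4, 8});
        // [3,5] and [4,8] merge to [3,8] (6 IDs); [10,14] adds 5 more
        System.out.println(countCovered(ranges)); // prints 11
    }
}
```

Using long bounds matters here: AoC range puzzles routinely exceed int range, and the merged total can as well.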

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 11s 10s 11s 10s 36s 14s 13s 13s 14s 10s 142s
openai-codex/gpt-5.3-codex 15s 27s 23s 16s 16s 20s 16s 21s 21s 15s 190s
anthropic/claude-opus-4-6 17s 16s 27s 25s 21s 19s 20s 18s 18s 17s 198s
mistral/devstral-2512 11s 111s 12s 9s 21s 10s 16s 9s 17s 9s 225s
kimi-coding/k2p5 16s 36s 13s 13s 14s 18s 45s 32s 29s 13s 229s
anthropic/claude-sonnet-4-6 14s 19s 25s 62s 20s 15s 19s 21s 24s 15s 234s
alibaba/qwen3.5-plus 16s 47s 48s 23s 39s 68s 20s 17s 28s 19s 325s
zai/glm-5 26s 282s 34s 39s 83s 43s 30s 41s 41s 41s 660s
minimax/MiniMax-M2.5 44s 816s 32s 31s 110s 25s 53s 38s 66s 56s 1271s
— ejected —
alibaba/qwen3-coder-next 12s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 525 654 621 576 555 573 584 855 899 683 6,525
zai/glm-5 620 1,638 771 653 1,590 830 704 744 945 749 9,244
kimi-coding/k2p5 673 3,081 636 745 582 814 825 932 1,112 685 10,085
anthropic/claude-opus-4-6 793 755 1,541 1,586 957 918 985 1,083 870 922 10,410
anthropic/claude-sonnet-4-6 772 1,071 1,703 4,317 923 919 997 1,042 1,495 858 14,097
anthropic/claude-haiku-4-5 1,030 1,023 1,061 994 4,382 1,540 1,391 1,378 1,271 1,213 15,283
mistral/devstral-2512 548 12,994 794 687 1,547 815 942 1,043 1,549 903 21,822
alibaba/qwen3.5-plus 1,329 6,333 4,610 1,791 4,252 7,334 1,314 1,129 1,980 1,525 31,597
minimax/MiniMax-M2.5 1,091 28,159 1,273 1,336 3,970 928 1,123 1,071 1,866 900 41,717
— ejected —
alibaba/qwen3-coder-next 608 10,607 11,215

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0086 .0118 .0030 .0038 .0089 .0151 .0139 .0100 .0049 .0039 $0.08
alibaba/qwen3.5-plus .0108 .0272 .0272 .0200 .0178 .0294 .0086 .0121 .0171 .0158 $0.19
zai/glm-5 .0181 .0314 .0070 .0079 .0387 .0239 .0063 .0085 .0317 .0240 $0.20
openai-codex/gpt-5.3-codex .0199 .0199 .0233 .0179 .0212 .0164 .0176 .0257 .0274 .0163 $0.21
anthropic/claude-haiku-4-5 .0293 .0153 .0286 .0178 .0556 .0153 .0365 .0222 .0315 .0117 $0.26
mistral/devstral-2512 .0084 .1834 .0090 .0096 .0103 .0109 .0101 .0112 .0137 .0115 $0.28
anthropic/claude-sonnet-4-6 .0359 .0308 .0527 .0949 .0376 .0274 .0390 .0324 .0508 .0248 $0.43
minimax/MiniMax-M2.5 .0170 .2968 .0060 .0076 .0257 .0231 .0059 .0082 .0342 .0261 $0.45
anthropic/claude-opus-4-6 .1174 .0803 .1318 .0694 .1188 .0841 .1196 .0938 .1118 .0974 $1.02
— ejected —
alibaba/qwen3-coder-next .0241 .2475 $0.27

Observations

claude-haiku-4-5 is the fastest overall at 142s. It never needed a retry, and 9 of its 10 parts came in under 15 seconds. The D3P1 spike (36s, 4,382 tokens) is the only outlier: still not slow, but notably wordier than its usual output.

gpt-5.3-codex is the most token-efficient: 6,525 tokens total. Under 900 tokens per part across the board. This matches the pattern from previous benchmarks — codex consistently writes the most compact solutions.

devstral-2512 spent 111 seconds and 12,994 tokens on D1P2 alone, over half its total token output. Across the remaining 9 parts it averaged about 13 seconds.

sonnet-4-6 — 62 seconds and 4,317 tokens on D2P2. Every other part was 14–25s.

MiniMax-M2.5 needed three attempts on Day 1 Part 2. 816 seconds and 28,159 tokens on that single part — 67% of its total token output across all 10 parts. It then solved Days 2–5 without trouble, though consistently slower than other models (25–110s per part).

qwen3.5-plus — most tokens among completers: 31,597. The D3P2 spike (7,334 tokens, 68s) stands out. Total cost ~$0.19.

claude-opus-4-6 — ~$1.02 total. Third fastest at 198s, 10,410 tokens.

kimi-coding/k2p5 — cheapest at ~$0.08. Fifth in speed (229s), third in tokens (10,085).

qwen3-coder-next was the only ejection. It solved D1P1 fast (12s) but couldn't produce the correct D1P2 answer in three clean attempts, spending 10,607 tokens and $0.25 in the process. The same model was also ejected on D1P2 in the Haskell run and the Elixir run.

Cross-language snapshot

Language          Models completing all 10 parts
Python            10/10
Ruby              10/10
Java              9/10
Elixir            7/10
Haskell           7/11
OCaml             5/9
ReScript (run 2)  2/10

Java sits alongside Python and Ruby in the high-completion tier. The single ejection (qwen3-coder-next on D1P2) matches a pattern — the same model also failed D1P2 in Haskell and Elixir.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.