
Benchmarking LLMs on Advent of Code 2025 (Java)

Tags = [ Java, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, ReScript, Ruby, and Elixir benchmarks, I ran the same AoC 2025 Days 1–5 setup in Java.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

Model                     Ejected at  Reason
alibaba/qwen3-coder-next  D1P2        Wrong answer after 3 clean retries

The remaining 9 models solved all 10 parts, though two needed retries on Day 1 Part 2: glm-5 passed on its 2nd attempt and MiniMax-M2.5 on its 3rd.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                        Time
anthropic/claude-haiku-4-5   11s
mistral/devstral-2512        11s
alibaba/qwen3-coder-next     12s
anthropic/claude-sonnet-4-6  14s
openai-codex/gpt-5.3-codex   15s
kimi-coding/k2p5             16s
alibaba/qwen3.5-plus         16s
anthropic/claude-opus-4-6    17s
zai/glm-5                    26s
minimax/MiniMax-M2.5         44s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                        Time  Result
anthropic/claude-haiku-4-5   10s
anthropic/claude-opus-4-6    16s
anthropic/claude-sonnet-4-6  19s
openai-codex/gpt-5.3-codex   27s
kimi-coding/k2p5             36s
alibaba/qwen3.5-plus         47s
mistral/devstral-2512        111s
zai/glm-5                    282s  ✓ (2nd try)
minimax/MiniMax-M2.5         816s  ✓ (3rd try)
alibaba/qwen3-coder-next     —     ✗ (ejected)
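
The puzzle text isn't reproduced here, but a zero-crossing count of this flavor can be sketched roughly as follows. Note that the 100-position dial, the starting position of 50, and the signed-move input format are all my assumptions for illustration, not the actual puzzle spec:

```java
import java.util.List;

public class Dial {
    // Step the dial one position at a time and count every landing on 0.
    public static int countZeroCrossings(List<Integer> moves) {
        int pos = 50, crossings = 0;
        for (int move : moves) {
            int step = Integer.signum(move);
            for (int i = 0; i < Math.abs(move); i++) {
                pos = Math.floorMod(pos + step, 100); // wrap around the dial
                if (pos == 0) crossings++;
            }
        }
        return crossings;
    }

    public static void main(String[] args) {
        // 50 -> 10 (passes 0 once), then -> 80 (again), then -> 25 (a third time)
        System.out.println(countZeroCrossings(List.of(60, -30, 45))); // prints 3
    }
}
```

The stepwise simulation is deliberately naive; the long retry loops some models hit on this part suggest the subtlety is in counting passes versus landings, which a per-step check like this sidesteps.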



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                        Time
anthropic/claude-haiku-4-5   11s
mistral/devstral-2512        12s
kimi-coding/k2p5             13s
openai-codex/gpt-5.3-codex   23s
anthropic/claude-sonnet-4-6  25s
anthropic/claude-opus-4-6    27s
minimax/MiniMax-M2.5         32s
zai/glm-5                    34s
alibaba/qwen3.5-plus         48s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             13s
openai-codex/gpt-5.3-codex   16s
alibaba/qwen3.5-plus         23s
anthropic/claude-opus-4-6    25s
minimax/MiniMax-M2.5         31s
zai/glm-5                    39s
anthropic/claude-sonnet-4-6  62s
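
Going by the task title alone, the core of Part 2 is deciding whether an ID's decimal string is some block of digits repeated two or more times (so 1212 or 777 qualify, 123 does not); Part 1's fixed "repeated twice" case falls out by pinning the repeat count. A minimal check, with the qualifying rule assumed from the title:

```java
public class RepeatedPattern {
    // An ID qualifies if it equals a prefix of length <= n/2 tiled across it.
    public static boolean isRepeatedPattern(String id) {
        int n = id.length();
        for (int len = 1; len <= n / 2; len++) {
            if (n % len != 0) continue; // the block must tile the string exactly
            if (id.equals(id.substring(0, len).repeat(n / len))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPattern("1212")); // true
        System.out.println(isRepeatedPattern("123"));  // false
    }
}
```

Summing such IDs over the input ranges is then a straight loop, which may explain why nearly every model cleared both parts in well under a minute.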



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                        Time
kimi-coding/k2p5             14s
openai-codex/gpt-5.3-codex   16s
anthropic/claude-sonnet-4-6  20s
anthropic/claude-opus-4-6    21s
mistral/devstral-2512        21s
anthropic/claude-haiku-4-5   36s
alibaba/qwen3.5-plus         39s
zai/glm-5                    83s
minimax/MiniMax-M2.5         110s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                        Time
mistral/devstral-2512        10s
anthropic/claude-haiku-4-5   14s
anthropic/claude-sonnet-4-6  15s
kimi-coding/k2p5             18s
anthropic/claude-opus-4-6    19s
openai-codex/gpt-5.3-codex   20s
minimax/MiniMax-M2.5         25s
zai/glm-5                    43s
alibaba/qwen3.5-plus         68s
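
If "maximizing k-digit joltage" means picking k digits from a bank in order to form the largest possible number (my reading of the title, not a quote from the puzzle), the standard tool is a greedy monotonic stack, and it handles the 2-digit and 12-digit variants with the same code:

```java
public class Joltage {
    // Greedy max subsequence of length k: pop a smaller stacked digit whenever
    // a larger one arrives, as long as enough digits remain to reach length k.
    public static String maxSubsequence(String digits, int k) {
        StringBuilder stack = new StringBuilder();
        int n = digits.length();
        for (int i = 0; i < n; i++) {
            char c = digits.charAt(i);
            while (stack.length() > 0
                    && stack.charAt(stack.length() - 1) < c
                    && stack.length() - 1 + (n - i) >= k) {
                stack.setLength(stack.length() - 1); // drop a smaller digit
            }
            if (stack.length() < k) stack.append(c);
        }
        return stack.toString();
    }

    public static void main(String[] args) {
        System.out.println(maxSubsequence("392815", 2)); // prints 98
    }
}
```

This runs in O(n) per bank, so even a brute-force-averse model shouldn't need the 68–110s some of them spent here.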



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                        Time
anthropic/claude-haiku-4-5   13s
openai-codex/gpt-5.3-codex   16s
mistral/devstral-2512        16s
anthropic/claude-sonnet-4-6  19s
anthropic/claude-opus-4-6    20s
alibaba/qwen3.5-plus         20s
zai/glm-5                    30s
kimi-coding/k2p5             45s
minimax/MiniMax-M2.5         53s



Day 4 Part 2 — Iterative grid removal simulation

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   13s
alibaba/qwen3.5-plus         17s
anthropic/claude-opus-4-6    18s
anthropic/claude-sonnet-4-6  21s
openai-codex/gpt-5.3-codex   21s
kimi-coding/k2p5             32s
minimax/MiniMax-M2.5         38s
zai/glm-5                    41s
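
A removal simulation of this shape can be sketched as below. The specific rules here, a roll '@' being "accessible" when fewer than 4 of its 8 neighbors are rolls, and each round removing all accessible rolls at once until nothing changes, are assumptions for illustration, not the actual puzzle definition:

```java
import java.util.ArrayList;
import java.util.List;

public class PaperRolls {
    // Repeatedly remove every accessible roll in one batch; return the total.
    public static int totalRemoved(char[][] grid) {
        int removed = 0;
        boolean changed = true;
        while (changed) {
            changed = false;
            List<int[]> batch = new ArrayList<>();
            for (int r = 0; r < grid.length; r++)
                for (int c = 0; c < grid[r].length; c++)
                    if (grid[r][c] == '@' && neighbors(grid, r, c) < 4)
                        batch.add(new int[]{r, c});
            for (int[] cell : batch) {
                grid[cell[0]][cell[1]] = '.';
                changed = true;
            }
            removed += batch.size();
        }
        return removed;
    }

    static int neighbors(char[][] g, int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue;
                int nr = r + dr, nc = c + dc;
                if (nr >= 0 && nr < g.length && nc >= 0 && nc < g[nr].length
                        && g[nr][nc] == '@') count++;
            }
        return count;
    }

    public static void main(String[] args) {
        char[][] g = {"@@@".toCharArray(), "@@@".toCharArray(), "@@@".toCharArray()};
        System.out.println(totalRemoved(g)); // a 3x3 block empties in 3 rounds: 9
    }
}
```

Collecting each round's removals into a batch before mutating the grid is the key detail: removing in place while scanning would let earlier removals unlock later cells within the same round.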



Day 5 Part 1 — Range membership checking

Model                        Time
anthropic/claude-haiku-4-5   14s
mistral/devstral-2512        17s
anthropic/claude-opus-4-6    18s
openai-codex/gpt-5.3-codex   21s
anthropic/claude-sonnet-4-6  24s
alibaba/qwen3.5-plus         28s
kimi-coding/k2p5             29s
zai/glm-5                    41s
minimax/MiniMax-M2.5         66s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                        Time
mistral/devstral-2512        9s
anthropic/claude-haiku-4-5   10s
kimi-coding/k2p5             13s
anthropic/claude-sonnet-4-6  15s
openai-codex/gpt-5.3-codex   15s
anthropic/claude-opus-4-6    17s
alibaba/qwen3.5-plus         19s
zai/glm-5                    41s
minimax/MiniMax-M2.5         56s
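
Counting distinct IDs covered by overlapping ranges is the classic sort-and-sweep interval merge; summing raw range lengths double-counts the overlaps. A sketch, assuming inclusive [lo, hi] ranges (the input format itself isn't shown in this post):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FreshIds {
    // Sort by lower bound, merge overlapping intervals while sweeping,
    // and sum the lengths of the merged intervals.
    public static long countCovered(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(r -> r[0]));
        long total = 0, curLo = 0, curHi = 0;
        boolean open = false;
        for (long[] r : sorted) {
            if (!open) {
                curLo = r[0]; curHi = r[1]; open = true;
            } else if (r[0] <= curHi) {
                curHi = Math.max(curHi, r[1]); // overlap: extend the interval
            } else {
                total += curHi - curLo + 1;    // close the finished interval
                curLo = r[0]; curHi = r[1];
            }
        }
        if (open) total += curHi - curLo + 1;
        return total;
    }

    public static void main(String[] args) {
        List<long[]> ranges = List.of(
                new long[]{3, 5}, new long[]{10, 14}, new long[]{4, 8});
        // [3,5] and [4,8] merge to [3,8] (6 IDs); [10,14] adds 5 more
        System.out.println(countCovered(ranges)); // prints 11
    }
}
```

Using long bounds matters here: AoC range puzzles routinely exceed int range, and the merged total can as well.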

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 11s 10s 11s 10s 36s 14s 13s 13s 14s 10s 142s
openai-codex/gpt-5.3-codex 15s 27s 23s 16s 16s 20s 16s 21s 21s 15s 190s
anthropic/claude-opus-4-6 17s 16s 27s 25s 21s 19s 20s 18s 18s 17s 198s
mistral/devstral-2512 11s 111s 12s 9s 21s 10s 16s 9s 17s 9s 225s
kimi-coding/k2p5 16s 36s 13s 13s 14s 18s 45s 32s 29s 13s 229s
anthropic/claude-sonnet-4-6 14s 19s 25s 62s 20s 15s 19s 21s 24s 15s 234s
alibaba/qwen3.5-plus 16s 47s 48s 23s 39s 68s 20s 17s 28s 19s 325s
zai/glm-5 26s 282s 34s 39s 83s 43s 30s 41s 41s 41s 660s
minimax/MiniMax-M2.5 44s 816s 32s 31s 110s 25s 53s 38s 66s 56s 1271s
— ejected —
alibaba/qwen3-coder-next 12s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 525 654 621 576 555 573 584 855 899 683 6,525
zai/glm-5 620 1,638 771 653 1,590 830 704 744 945 749 9,244
kimi-coding/k2p5 673 3,081 636 745 582 814 825 932 1,112 685 10,085
anthropic/claude-opus-4-6 793 755 1,541 1,586 957 918 985 1,083 870 922 10,410
anthropic/claude-sonnet-4-6 772 1,071 1,703 4,317 923 919 997 1,042 1,495 858 14,097
anthropic/claude-haiku-4-5 1,030 1,023 1,061 994 4,382 1,540 1,391 1,378 1,271 1,213 15,283
mistral/devstral-2512 548 12,994 794 687 1,547 815 942 1,043 1,549 903 21,822
alibaba/qwen3.5-plus 1,329 6,333 4,610 1,791 4,252 7,334 1,314 1,129 1,980 1,525 31,597
minimax/MiniMax-M2.5 1,091 28,159 1,273 1,336 3,970 928 1,123 1,071 1,866 900 41,717
— ejected —
alibaba/qwen3-coder-next 608 10,607 11,215

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0086 .0118 .0030 .0038 .0089 .0151 .0139 .0100 .0049 .0039 $0.08
alibaba/qwen3.5-plus .0108 .0272 .0272 .0200 .0178 .0294 .0086 .0121 .0171 .0158 $0.19
zai/glm-5 .0181 .0314 .0070 .0079 .0387 .0239 .0063 .0085 .0317 .0240 $0.20
openai-codex/gpt-5.3-codex .0199 .0199 .0233 .0179 .0212 .0164 .0176 .0257 .0274 .0163 $0.21
anthropic/claude-haiku-4-5 .0293 .0153 .0286 .0178 .0556 .0153 .0365 .0222 .0315 .0117 $0.26
mistral/devstral-2512 .0084 .1834 .0090 .0096 .0103 .0109 .0101 .0112 .0137 .0115 $0.28
anthropic/claude-sonnet-4-6 .0359 .0308 .0527 .0949 .0376 .0274 .0390 .0324 .0508 .0248 $0.43
minimax/MiniMax-M2.5 .0170 .2968 .0060 .0076 .0257 .0231 .0059 .0082 .0342 .0261 $0.45
anthropic/claude-opus-4-6 .1174 .0803 .1318 .0694 .1188 .0841 .1196 .0938 .1118 .0974 $1.02
— ejected —
alibaba/qwen3-coder-next .0241 .2475 $0.27

Observations

claude-haiku-4-5 is the fastest overall at 142s. It never needed a retry, and 9 of its 10 parts came in under 15 seconds. The D3P1 spike (36s, 4,382 tokens) is the only outlier: still not slow, but notably wordier than its usual output.

gpt-5.3-codex is the most token-efficient: 6,525 tokens total. Under 900 tokens per part across the board. This matches the pattern from previous benchmarks — codex consistently writes the most compact solutions.

devstral-2512 spent 111 seconds and 12,994 tokens on D1P2 alone, over half its total token output. Across the remaining 9 parts it averaged about 13 seconds.

sonnet-4-6 — 62 seconds and 4,317 tokens on D2P2. Every other part was 14–25s.

MiniMax-M2.5 needed three attempts on Day 1 Part 2. 816 seconds and 28,159 tokens on that single part — 67% of its total token output across all 10 parts. It then solved Days 2–5 without trouble, though consistently slower than other models (25–110s per part).

qwen3.5-plus — most tokens among completers: 31,597. The D3P2 spike (7,334 tokens, 68s) stands out. Total cost ~$0.19.

claude-opus-4-6 — ~$1.02 total. Third fastest at 198s, 10,410 tokens.

kimi-coding/k2p5 — cheapest at ~$0.08. Fifth in speed (229s), third in tokens (10,085).

qwen3-coder-next was the only ejection. It solved D1P1 fast (12s) but couldn't produce the correct D1P2 answer in three clean attempts, spending 10,607 tokens and $0.25 in the process. The same model was also ejected on D1P2 in the Haskell run and the Elixir run.

Cross-language snapshot

Language          Models completing all 10 parts
Python            10/10
Ruby              10/10
Java              9/10
Elixir            7/10
Haskell           7/11
OCaml             5/9
ReScript (run 2)  2/10

Java sits alongside Python and Ruby in the high-completion tier. The single ejection (qwen3-coder-next on D1P2) matches a pattern — the same model also failed D1P2 in Haskell and Elixir.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.