Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, and Java benchmarks, I ran the same AoC 2025 Days 1–5 setup in Elm.
Elm is the most niche language in this series. It's a pure functional language that compiles
to JavaScript, has no native CLI story, and sees relatively little use outside its frontend
niche. Each model received a pre-built scaffold — run.mjs, elm.json, and a
Day00.elm template — that compiles and runs Elm modules via Node.js. The question was
whether models would handle Elm's strict type system, lack of escape hatches, and unfamiliar
idioms (e.g. Debug.log for output, Platform.worker for headless programs).
The answer: every single one of them did.
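For readers unfamiliar with the pattern, a headless Elm program in the shape the scaffold implies looks roughly like this. This is a sketch, not the actual Day00.elm template: the module name, `solve` function, and `"answer"` log label are illustrative.

```elm
module Day00 exposing (main)

-- Headless AoC runner sketch: Platform.worker for a program with no UI,
-- Debug.log to print the answer (works in non-optimized builds only).


solve : String -> String
solve input =
    -- placeholder solution: count the lines of puzzle input
    String.fromInt (List.length (String.lines input))


main : Program String () Never
main =
    Platform.worker
        { init =
            \input ->
                let
                    -- Debug.log writes "answer: ..." to the console
                    _ =
                        Debug.log "answer" (solve input)
                in
                ( (), Cmd.none )
        , update = \_ model -> ( model, Cmd.none )
        , subscriptions = \_ -> Sub.none
        }
```

A runner script like run.mjs would compile this with the Elm compiler, load the emitted JavaScript in Node.js, and pass the puzzle input in as flags.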
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models completed all 10 parts. This ties Elm with Python and Ruby for the best completion rate in the series.
That said, the path was rocky for some. Several models needed multiple retries, and two
(devstral-2512 on Day 3 Part 1, MiniMax-M2.5 on Day 3 Part 2) went through costly
runaway loops requiring dirty restarts.
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 43s |
| openai-codex/gpt-5.3-codex | 52s |
| alibaba/qwen3-coder-next | 52s |
| anthropic/claude-opus-4-6 | 53s |
| anthropic/claude-sonnet-4-6 | 54s |
| alibaba/qwen3.5-plus | 58s |
| kimi-coding/k2p5 | 60s |
| mistral/devstral-2512 | 65s |
| zai/glm-5 | 66s |
| minimax/MiniMax-M2.5 | 97s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 22s | ✓ |
| openai-codex/gpt-5.3-codex | 23s | ✓ |
| anthropic/claude-opus-4-6 | 41s | ✓ |
| anthropic/claude-sonnet-4-6 | 44s | ✓ |
| alibaba/qwen3-coder-next | 52s | ✓ |
| alibaba/qwen3.5-plus | 86s | ✓ |
| minimax/MiniMax-M2.5 | 127s | ✓ |
| zai/glm-5 | 137s | ✓ |
| kimi-coding/k2p5 | 239s | ✓ (2nd try) |
| mistral/devstral-2512 | 364s | ✓ (3rd try) |
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 32s | ✓ |
| openai-codex/gpt-5.3-codex | 33s | ✓ |
| anthropic/claude-haiku-4-5 | 37s | ✓ |
| alibaba/qwen3-coder-next | 39s | ✓ |
| anthropic/claude-sonnet-4-6 | 49s | ✓ |
| zai/glm-5 | 54s | ✓ |
| mistral/devstral-2512 | 61s | ✓ |
| anthropic/claude-opus-4-6 | 80s | ✓ |
| minimax/MiniMax-M2.5 | 213s | ✓ (2nd try) |
| alibaba/qwen3.5-plus | 363s | ✓ (2nd try) |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 11s |
| kimi-coding/k2p5 | 16s |
| alibaba/qwen3-coder-next | 16s |
| anthropic/claude-haiku-4-5 | 19s |
| openai-codex/gpt-5.3-codex | 23s |
| anthropic/claude-opus-4-6 | 29s |
| zai/glm-5 | 48s |
| minimax/MiniMax-M2.5 | 79s |
| anthropic/claude-sonnet-4-6 | 239s |
| alibaba/qwen3.5-plus | 260s |
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 19s | ✓ |
| openai-codex/gpt-5.3-codex | 19s | ✓ |
| alibaba/qwen3.5-plus | 22s | ✓ |
| anthropic/claude-opus-4-6 | 27s | ✓ |
| alibaba/qwen3-coder-next | 27s | ✓ |
| kimi-coding/k2p5 | 30s | ✓ |
| anthropic/claude-haiku-4-5 | 53s | ✓ |
| zai/glm-5 | 63s | ✓ |
| minimax/MiniMax-M2.5 | 67s | ✓ |
| mistral/devstral-2512 | 1164s | ✓ (dirty retry) |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| anthropic/claude-haiku-4-5 | 45s | ✓ |
| zai/glm-5 | 45s | ✓ |
| anthropic/claude-opus-4-6 | 48s | ✓ |
| alibaba/qwen3.5-plus | 62s | ✓ |
| mistral/devstral-2512 | 163s | ✓ |
| alibaba/qwen3-coder-next | 216s | ✓ |
| openai-codex/gpt-5.3-codex | 952s | ✓ (nudge) |
| minimax/MiniMax-M2.5 | — | ✓ (dirty retry ×3) |
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 15s | ✓ |
| kimi-coding/k2p5 | 17s | ✓ |
| anthropic/claude-sonnet-4-6 | 21s | ✓ |
| anthropic/claude-opus-4-6 | 22s | ✓ |
| mistral/devstral-2512 | 22s | ✓ |
| alibaba/qwen3-coder-next | 23s | ✓ |
| zai/glm-5 | 28s | ✓ |
| alibaba/qwen3.5-plus | 37s | ✓ |
| minimax/MiniMax-M2.5 | 53s | ✓ |
| openai-codex/gpt-5.3-codex | — | ✓ (2nd try) |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 24s | ✓ |
| anthropic/claude-opus-4-6 | 24s | ✓ |
| kimi-coding/k2p5 | 38s | ✓ |
| mistral/devstral-2512 | 47s | ✓ |
| alibaba/qwen3.5-plus | 48s | ✓ |
| minimax/MiniMax-M2.5 | 58s | ✓ |
| zai/glm-5 | 68s | ✓ |
| anthropic/claude-haiku-4-5 | 245s | ✓ |
| alibaba/qwen3-coder-next | 272s | ✓ |
| openai-codex/gpt-5.3-codex | 971s | ✓ (2nd try) |
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| anthropic/claude-sonnet-4-6 | 16s |
| openai-codex/gpt-5.3-codex | 17s |
| alibaba/qwen3.5-plus | 19s |
| anthropic/claude-haiku-4-5 | 20s |
| mistral/devstral-2512 | 20s |
| anthropic/claude-opus-4-6 | 21s |
| zai/glm-5 | 24s |
| minimax/MiniMax-M2.5 | 30s |
| kimi-coding/k2p5 | 44s |
| alibaba/qwen3-coder-next | 47s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 9s |
| mistral/devstral-2512 | 11s |
| anthropic/claude-sonnet-4-6 | 13s |
| alibaba/qwen3.5-plus | 14s |
| openai-codex/gpt-5.3-codex | 15s |
| kimi-coding/k2p5 | 22s |
| alibaba/qwen3-coder-next | 22s |
| anthropic/claude-opus-4-6 | 24s |
| zai/glm-5 | 31s |
| minimax/MiniMax-M2.5 | 108s |
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-opus-4-6 | 53s | 41s | 80s | 29s | 27s | 48s | 22s | 24s | 21s | 24s | 369s |
| anthropic/claude-haiku-4-5 | 43s | 22s | 37s | 19s | 53s | 45s | 15s | 245s | 20s | 9s | 508s |
| anthropic/claude-sonnet-4-6 | 54s | 44s | 49s | 239s | 19s | 32s | 21s | 24s | 16s | 13s | 511s |
| kimi-coding/k2p5 | 60s | 239s | 32s | 16s | 30s | 16s | 17s | 38s | 44s | 22s | 514s |
| zai/glm-5 | 66s | 137s | 54s | 48s | 63s | 45s | 28s | 68s | 24s | 31s | 564s |
| alibaba/qwen3-coder-next | 52s | 52s | 39s | 16s | 27s | 216s | 23s | 272s | 47s | 22s | 766s |
| minimax/MiniMax-M2.5 | 97s | 127s | 213s | 79s | 67s | —* | 53s | 58s | 30s | 108s | 832s* |
| alibaba/qwen3.5-plus | 58s | 86s | 363s | 260s | 22s | 62s | 37s | 48s | 19s | 14s | 969s |
| mistral/devstral-2512 | 65s | 364s | 61s | 11s | 1164s | 163s | 22s | 47s | 20s | 11s | 1928s |
| openai-codex/gpt-5.3-codex | 52s | 23s | 33s | 23s | 19s | 952s | —† | 971s | 17s | 15s | 2105s† |
* MiniMax-M2.5's D3P2 required three dirty restarts; its wall-clock time is not directly comparable, so the total excludes D3P2.
† Codex D4P1 needed a retry; the time for that part was not captured cleanly, so the total excludes it.
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 750 | 486 | 774 | 1,074 | 746 | 1,015 | 1,039 | 3,042 | 779 | 542 | 10,247 |
| zai/glm-5 | 853 | 4,144 | 970 | 950 | 1,442 | 979 | 919 | 1,618 | 817 | 709 | 13,401 |
| kimi-coding/k2p5 | 978 | 4,470 | 923 | 736 | 1,904 | 863 | 1,009 | 2,505 | 1,676 | 668 | 15,732 |
| anthropic/claude-opus-4-6 | 1,246 | 1,600 | 2,627 | 1,876 | 1,328 | 2,061 | 1,282 | 1,390 | 1,024 | 1,351 | 15,785 |
| anthropic/claude-sonnet-4-6 | 1,325 | 1,673 | 2,143 | 16,394 | 1,207 | 1,980 | 1,393 | 939 | 1,068 | 770 | 28,892 |
| anthropic/claude-haiku-4-5 | 1,299 | 1,197 | 1,808 | 1,964 | 6,141 | 5,755 | 1,764 | 19,336 | 2,006 | 960 | 42,230 |
| alibaba/qwen3-coder-next | 2,224 | 6,930 | 1,955 | 1,855 | 1,610 | 22,069 | 2,055 | 25,465 | 3,080 | 1,393 | 68,636 |
| alibaba/qwen3.5-plus | 2,169 | 7,427 | 32,452 | 14,674 | 2,194 | 4,576 | 2,852 | 2,267 | 1,622 | 1,305 | 71,538 |
| minimax/MiniMax-M2.5 | 1,877 | 4,445 | 5,885 | 3,525 | 2,606 | 87,094 | 1,972 | 2,260 | 1,063 | 1,144 | 111,871 |
| mistral/devstral-2512 | 3,825 | 16,758 | 1,418 | 1,074 | 115,225 | 13,010 | 2,255 | 2,811 | 1,576 | 807 | 158,759 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0096 | .0273 | .0043 | .0049 | .0164 | .0112 | .0109 | .0178 | .0152 | .0116 | $0.13 |
| zai/glm-5 | .0196 | .0330 | .0097 | .0104 | .0121 | .0136 | .0083 | .0170 | .0073 | .0096 | $0.14 |
| openai-codex/gpt-5.3-codex | .0300 | .0158 | .0197 | .0389 | .0260 | .0277 | .0296 | .0864 | .0190 | .0199 | $0.31 |
| anthropic/claude-haiku-4-5 | .0289 | .0119 | .0272 | .0186 | .0650 | .0586 | .0253 | .2179 | .0413 | .0155 | $0.51 |
| alibaba/qwen3.5-plus | .0200 | .0603 | .2671 | .2750 | .0138 | .0658 | .0228 | .0339 | .0125 | .0161 | $0.79 |
| anthropic/claude-sonnet-4-6 | .0517 | .0512 | .0590 | .4163 | .0406 | .0567 | .0443 | .0325 | .0374 | .0251 | $0.81 |
| anthropic/claude-opus-4-6 | .0988 | .0736 | .1614 | .0856 | .1123 | .0998 | .1273 | .0692 | .1338 | .0607 | $1.02 |
| alibaba/qwen3-coder-next | .0518 | .0644 | .0302 | .0319 | .0715 | .7594 | .0249 | .5769 | .1729 | .1039 | $1.89 |
| minimax/MiniMax-M2.5 | .0192 | .0346 | .0445 | .0290 | .0285 | 1.6985 | .0095 | .0160 | .0205 | .0262 | $1.93 |
| mistral/devstral-2512 | .0487 | .2439 | .0254 | .0204 | 2.0172 | .3556 | .0190 | .0525 | .0198 | .0131 | $2.82 |
Observations
10/10 completers — zero ejections. Elm joins Python and Ruby as the only languages in this series where every model solved every part.
claude-opus-4-6 — fastest at 369s total. No single part over 80s, never needed a
retry. ~$1.02 total.
kimi-coding/k2p5 — cheapest at ~$0.13. Fourth fastest at 514s. On 8 of 10 parts
it was 44s or under.
Day 3 was rough for two models: devstral-2512 hit a runaway loop on D3P1 (~$1.91 and
~105K tokens before being killed), and MiniMax-M2.5 needed three dirty restarts on D3P2
(87K tokens, ~$1.70).
gpt-5.3-codex — fewest tokens: 10,247 total. But also the slowest overall (2,105s),
due to D3P2 (952s) and D4P2 (971s).
claude-sonnet-4-6 — 239s and 16,394 tokens on D2P2. Every other part was 13–54s.
qwen3.5-plus — 32,452 tokens on D2P1 alone. Both Alibaba models completed
everything but used a lot of tokens getting there.
glm-5 — second cheapest at ~$0.14, fifth fastest at 564s, 13,401 tokens. No dirty
retries needed.
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Elm's 10/10 completion was unexpected — it has a smaller training corpus than any other
language tested. The provided template (Day00.elm with Platform.worker and Debug.log)
may have helped by giving every model a clear starting point.
ReScript (2/10) is also a niche compile-to-JS functional language, but its toolchain gave models a much harder time. The scaffold and Elm's stable API may explain the difference.
Benchmarked on 2026-02-26 using pi as the agent harness.
This post was written with AI assistance.