Developers, developers, developers!

Blog about programming, programming, and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Elm)

Tags = [ Elm, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, the Ruby benchmark, the Elixir benchmark, and the Java benchmark, I ran the same AoC 2025 Days 1–5 setup in Elm.

Elm is the most niche language in this series. It's a pure functional language that compiles to JavaScript, has no native CLI story, and sees relatively little use outside its frontend niche. Each model received a pre-built scaffold — run.mjs, elm.json, and a Day00.elm template — that compiles and runs Elm modules via Node.js. The question was whether models would handle Elm's strict type system, lack of escape hatches, and unfamiliar idioms (e.g. Debug.log for output, Platform.worker for headless programs).
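The scaffold's exact contents aren't reproduced here, but a headless Elm program of the kind described above — `Platform.worker` with `Debug.log` as the only output channel — looks roughly like this (module name, flag wiring, and `solve` are illustrative; the real Day00.elm template may differ):

```elm
module Day00 exposing (main)

import Platform


-- Puzzle logic goes here; the template just echoes the input.
solve : String -> String
solve input =
    String.trim input


-- A headless program: run.mjs would pass the puzzle input as a
-- String flag and capture the Debug.log line from stderr.
main : Program String () ()
main =
    Platform.worker
        { init =
            \input ->
                let
                    _ =
                        Debug.log "answer" (solve input)
                in
                ( (), Cmd.none )
        , update = \_ model -> ( model, Cmd.none )
        , subscriptions = \_ -> Sub.none
        }
```

Note that `Debug.log` only compiles without `--optimize`, which constrains how the scaffold invokes the compiler.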

The answer: every single one of them did.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

None. All 10 models completed all 10 parts. This ties Elm with Python and Ruby for the best completion rate in the series.

That said, the path was rocky for some. Several models needed multiple retries, and two (devstral-2512 on Day 3 Part 1, MiniMax-M2.5 on Day 3 Part 2) went through costly runaway loops requiring dirty restarts.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                        Time
anthropic/claude-haiku-4-5   43s
openai-codex/gpt-5.3-codex   52s
alibaba/qwen3-coder-next     52s
anthropic/claude-opus-4-6    53s
anthropic/claude-sonnet-4-6  54s
alibaba/qwen3.5-plus         58s
kimi-coding/k2p5             60s
mistral/devstral-2512        65s
zai/glm-5                    66s
minimax/MiniMax-M2.5         97s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                        Time  Result
anthropic/claude-haiku-4-5   22s
openai-codex/gpt-5.3-codex   23s
anthropic/claude-opus-4-6    41s
anthropic/claude-sonnet-4-6  44s
alibaba/qwen3-coder-next     52s
alibaba/qwen3.5-plus         86s
minimax/MiniMax-M2.5         127s
zai/glm-5                    137s
kimi-coding/k2p5             239s  ✓ (2nd try)
mistral/devstral-2512        364s  ✓ (3rd try)



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                        Time  Result
kimi-coding/k2p5             32s
openai-codex/gpt-5.3-codex   33s
anthropic/claude-haiku-4-5   37s
alibaba/qwen3-coder-next     39s
anthropic/claude-sonnet-4-6  49s
zai/glm-5                    54s
mistral/devstral-2512        61s
anthropic/claude-opus-4-6    80s
minimax/MiniMax-M2.5         213s  ✓ (2nd try)
alibaba/qwen3.5-plus         363s  ✓ (2nd try)



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                        Time
mistral/devstral-2512        11s
kimi-coding/k2p5             16s
alibaba/qwen3-coder-next     16s
anthropic/claude-haiku-4-5   19s
openai-codex/gpt-5.3-codex   23s
anthropic/claude-opus-4-6    29s
zai/glm-5                    48s
minimax/MiniMax-M2.5         79s
anthropic/claude-sonnet-4-6  239s
alibaba/qwen3.5-plus         260s
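The puzzle text isn't reproduced here, but "repeated-pattern IDs (any repeat count)" suggests the classic check for whether a string is some shorter pattern repeated two or more times. A sketch in Elm (the function name is mine, not from any model's solution):

```elm
-- True if s is a block of length k repeated at least twice,
-- for any k up to half the string's length.
isRepeatedPattern : String -> Bool
isRepeatedPattern s =
    let
        n =
            String.length s

        repeatsWith k =
            (modBy k n == 0)
                && (String.repeat (n // k) (String.left k s) == s)
    in
    List.any repeatsWith (List.range 1 (n // 2))
```

For example, `isRepeatedPattern "121212"` is `True` (pattern "12" three times), while `isRepeatedPattern "123"` is `False`.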



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                        Time   Result
anthropic/claude-sonnet-4-6  19s
openai-codex/gpt-5.3-codex   19s
alibaba/qwen3.5-plus         22s
anthropic/claude-opus-4-6    27s
alibaba/qwen3-coder-next     27s
kimi-coding/k2p5             30s
anthropic/claude-haiku-4-5   53s
zai/glm-5                    63s
minimax/MiniMax-M2.5         67s
mistral/devstral-2512        1164s  ✓ (dirty retry)



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                        Time  Result
kimi-coding/k2p5             16s
anthropic/claude-sonnet-4-6  32s
anthropic/claude-haiku-4-5   45s
zai/glm-5                    45s
anthropic/claude-opus-4-6    48s
alibaba/qwen3.5-plus         62s
mistral/devstral-2512        163s
alibaba/qwen3-coder-next     216s
openai-codex/gpt-5.3-codex   952s  ✓ (nudge)
minimax/MiniMax-M2.5         —     ✓ (dirty retry ×3)



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                        Time  Result
anthropic/claude-haiku-4-5   15s
kimi-coding/k2p5             17s
anthropic/claude-sonnet-4-6  21s
anthropic/claude-opus-4-6    22s
mistral/devstral-2512        22s
alibaba/qwen3-coder-next     23s
zai/glm-5                    28s
alibaba/qwen3.5-plus         37s
minimax/MiniMax-M2.5         53s
openai-codex/gpt-5.3-codex   —     ✓ (2nd try)



Day 4 Part 2 — Iterative grid removal simulation

Model                        Time  Result
anthropic/claude-sonnet-4-6  24s
anthropic/claude-opus-4-6    24s
kimi-coding/k2p5             38s
mistral/devstral-2512        47s
alibaba/qwen3.5-plus         48s
minimax/MiniMax-M2.5         58s
zai/glm-5                    68s
anthropic/claude-haiku-4-5   245s
alibaba/qwen3-coder-next     272s
openai-codex/gpt-5.3-codex   971s  ✓ (2nd try)



Day 5 Part 1 — Range membership checking

Model                        Time
anthropic/claude-sonnet-4-6  16s
openai-codex/gpt-5.3-codex   17s
alibaba/qwen3.5-plus         19s
anthropic/claude-haiku-4-5   20s
mistral/devstral-2512        20s
anthropic/claude-opus-4-6    21s
zai/glm-5                    24s
minimax/MiniMax-M2.5         30s
kimi-coding/k2p5             44s
alibaba/qwen3-coder-next     47s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                        Time
anthropic/claude-haiku-4-5   9s
mistral/devstral-2512        11s
anthropic/claude-sonnet-4-6  13s
alibaba/qwen3.5-plus         14s
openai-codex/gpt-5.3-codex   15s
kimi-coding/k2p5             22s
alibaba/qwen3-coder-next     22s
anthropic/claude-opus-4-6    24s
zai/glm-5                    31s
minimax/MiniMax-M2.5         108s
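The puzzle spec isn't reproduced here, but counting the distinct IDs covered by overlapping inclusive ranges is a standard sort-and-merge sweep. A hedged Elm sketch of that technique (the function name is mine):

```elm
-- Total number of integers covered by at least one inclusive range.
-- Sort by lower bound, then fold, extending the most recent merged
-- range whenever the next one overlaps or touches it.
countCovered : List ( Int, Int ) -> Int
countCovered ranges =
    let
        merge ( lo, hi ) acc =
            case acc of
                ( plo, phi ) :: rest ->
                    if lo <= phi + 1 then
                        -- Overlaps or is adjacent: extend the previous range.
                        ( plo, max phi hi ) :: rest

                    else
                        ( lo, hi ) :: acc

                [] ->
                    [ ( lo, hi ) ]
    in
    ranges
        |> List.sortBy Tuple.first
        |> List.foldl merge []
        |> List.map (\( lo, hi ) -> hi - lo + 1)
        |> List.sum
```

For example, `countCovered [ ( 1, 5 ), ( 3, 9 ), ( 12, 12 ) ]` yields `10`: the first two ranges merge into 1–9 (nine IDs) plus the singleton 12.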

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-opus-4-6 53s 41s 80s 29s 27s 48s 22s 24s 21s 24s 369s
anthropic/claude-haiku-4-5 43s 22s 37s 19s 53s 45s 15s 245s 20s 9s 508s
anthropic/claude-sonnet-4-6 54s 44s 49s 239s 19s 32s 21s 24s 16s 13s 511s
kimi-coding/k2p5 60s 239s 32s 16s 30s 16s 17s 38s 44s 22s 514s
zai/glm-5 66s 137s 54s 48s 63s 45s 28s 68s 24s 31s 564s
alibaba/qwen3-coder-next 52s 52s 39s 16s 27s 216s 23s 272s 47s 22s 766s
minimax/MiniMax-M2.5 97s 127s 213s 79s 67s —* 53s 58s 30s 108s 832s*
alibaba/qwen3.5-plus 58s 86s 363s 260s 22s 62s 37s 48s 19s 14s 969s
mistral/devstral-2512 65s 364s 61s 11s 1164s 163s 22s 47s 20s 11s 1928s
openai-codex/gpt-5.3-codex 52s 23s 33s 23s 19s 952s —† 971s 17s 15s 2105s†

* MiniMax D3P2 required three dirty restarts; wall-clock time not directly comparable. Total excludes D3P2.
† Codex D4P1 needed a retry; time for that part not captured cleanly.


Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 750 486 774 1,074 746 1,015 1,039 3,042 779 542 10,247
zai/glm-5 853 4,144 970 950 1,442 979 919 1,618 817 709 13,401
kimi-coding/k2p5 978 4,470 923 736 1,904 863 1,009 2,505 1,676 668 15,732
anthropic/claude-opus-4-6 1,246 1,600 2,627 1,876 1,328 2,061 1,282 1,390 1,024 1,351 15,785
anthropic/claude-sonnet-4-6 1,325 1,673 2,143 16,394 1,207 1,980 1,393 939 1,068 770 28,892
anthropic/claude-haiku-4-5 1,299 1,197 1,808 1,964 6,141 5,755 1,764 19,336 2,006 960 42,230
alibaba/qwen3-coder-next 2,224 6,930 1,955 1,855 1,610 22,069 2,055 25,465 3,080 1,393 68,636
alibaba/qwen3.5-plus 2,169 7,427 32,452 14,674 2,194 4,576 2,852 2,267 1,622 1,305 71,538
minimax/MiniMax-M2.5 1,877 4,445 5,885 3,525 2,606 87,094 1,972 2,260 1,063 1,144 111,871
mistral/devstral-2512 3,825 16,758 1,418 1,074 115,225 13,010 2,255 2,811 1,576 807 158,759

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0096 .0273 .0043 .0049 .0164 .0112 .0109 .0178 .0152 .0116 $0.13
zai/glm-5 .0196 .0330 .0097 .0104 .0121 .0136 .0083 .0170 .0073 .0096 $0.14
openai-codex/gpt-5.3-codex .0300 .0158 .0197 .0389 .0260 .0277 .0296 .0864 .0190 .0199 $0.31
anthropic/claude-haiku-4-5 .0289 .0119 .0272 .0186 .0650 .0586 .0253 .2179 .0413 .0155 $0.51
alibaba/qwen3.5-plus .0200 .0603 .2671 .2750 .0138 .0658 .0228 .0339 .0125 .0161 $0.79
anthropic/claude-sonnet-4-6 .0517 .0512 .0590 .4163 .0406 .0567 .0443 .0325 .0374 .0251 $0.81
anthropic/claude-opus-4-6 .0988 .0736 .1614 .0856 .1123 .0998 .1273 .0692 .1338 .0607 $1.02
alibaba/qwen3-coder-next .0518 .0644 .0302 .0319 .0715 .7594 .0249 .5769 .1729 .1039 $1.89
minimax/MiniMax-M2.5 .0192 .0346 .0445 .0290 .0285 1.6985 .0095 .0160 .0205 .0262 $1.93
mistral/devstral-2512 .0487 .2439 .0254 .0204 2.0172 .3556 .0190 .0525 .0198 .0131 $2.82

Observations

10/10 completers — zero ejections. Elm joins Python and Ruby as the only languages in this series where every model solved every part.

claude-opus-4-6 — fastest at 369s total. No single part over 80s, never needed a retry. ~$1.02 total.

kimi-coding/k2p5 — cheapest at ~$0.13. Fourth fastest at 514s. On 8 of 10 parts it finished in 44s or less.

Day 3 was rough for two models. devstral-2512 on D3P1 (runaway loop, ~$1.91 and 105K tokens before being killed) and MiniMax-M2.5 on D3P2 (three dirty restarts, 87K tokens, ~$1.70).

gpt-5.3-codex — fewest tokens: 10,247 total. But also the slowest overall (2,105s), due to D3P2 (952s) and D4P2 (971s).

claude-sonnet-4-6 — 239s and 16,394 tokens on D2P2. Every other part was 13–54s.

qwen3.5-plus — 32,452 tokens on D2P1 alone. Both Alibaba models completed everything but used a lot of tokens getting there.

glm-5 — second cheapest at ~$0.14, fifth fastest at 564s, 13,401 tokens. No dirty retries needed.

Cross-language snapshot

Language          Models completing all 10 parts
Python            10/10
Ruby              10/10
Elm               10/10
Java              9/10
Elixir            7/10
Haskell           7/11
OCaml             5/9
ReScript (run 2)  2/10

Elm's 10/10 completion was unexpected, since Elm presumably has the smallest training corpus of any language tested here. The provided template (Day00.elm with Platform.worker and Debug.log) may have helped by giving every model a clear starting point.

ReScript (2/10) is also a niche compile-to-JS functional language, but its toolchain gave models a much harder time. The scaffold and Elm's stable API may explain the difference.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.