Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, and Java benchmarks, I ran the same AoC 2025 Days 1–5 setup in Elm.
Elm is the most niche language in this series. It's a pure functional language that compiles
to JavaScript, has no native CLI story, and sees relatively little use outside its frontend
niche. Each model received a pre-built scaffold — run.mjs, elm.json, and a
Day00.elm template — that compiles and runs Elm modules via Node.js. The question was
whether models would handle Elm's strict type system, lack of escape hatches, and unfamiliar
idioms (e.g. Debug.log for output, Platform.worker for headless programs).
The answer: every single one of them did.
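For readers unfamiliar with the pattern, a headless Elm program in the shape the scaffold implies looks roughly like this. This is a sketch, not the actual Day00.elm template: the module name, `solve` function, and `"answer"` log label are illustrative.

```elm
module Day00 exposing (main)

-- Headless AoC runner sketch: Platform.worker for a program with no UI,
-- Debug.log to print the answer (works in non-optimized builds only).


solve : String -> String
solve input =
    -- placeholder solution: count the lines of puzzle input
    String.fromInt (List.length (String.lines input))


main : Program String () Never
main =
    Platform.worker
        { init =
            \input ->
                let
                    -- Debug.log writes "answer: ..." to the console
                    _ =
                        Debug.log "answer" (solve input)
                in
                ( (), Cmd.none )
        , update = \_ model -> ( model, Cmd.none )
        , subscriptions = \_ -> Sub.none
        }
```

A runner script like run.mjs would compile this with the Elm compiler, load the emitted JavaScript in Node.js, and pass the puzzle input in as flags.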
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models completed all 10 parts. This ties Elm with Python and Ruby for the best completion rate in the series.
That said, the path was rocky for some. Several models needed multiple retries, and two
(devstral-2512 on Day 3 Part 1, MiniMax-M2.5 on Day 3 Part 2) went through costly
runaway loops requiring dirty restarts.
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 43s |
| openai-codex/gpt-5.3-codex | 52s |
| alibaba/qwen3-coder-next | 52s |
| anthropic/claude-opus-4-6 | 53s |
| anthropic/claude-sonnet-4-6 | 54s |
| alibaba/qwen3.5-plus | 58s |
| kimi-coding/k2p5 | 60s |
| mistral/devstral-2512 | 65s |
| zai/glm-5 | 66s |
| minimax/MiniMax-M2.5 | 97s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 22s | ✓ |
| openai-codex/gpt-5.3-codex | 23s | ✓ |
| anthropic/claude-opus-4-6 | 41s | ✓ |
| anthropic/claude-sonnet-4-6 | 44s | ✓ |
| alibaba/qwen3-coder-next | 52s | ✓ |
| alibaba/qwen3.5-plus | 86s | ✓ |
| minimax/MiniMax-M2.5 | 127s | ✓ |
| zai/glm-5 | 137s | ✓ |
| kimi-coding/k2p5 | 239s | ✓ (2nd try) |
| mistral/devstral-2512 | 364s | ✓ (3rd try) |
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 32s | ✓ |
| openai-codex/gpt-5.3-codex | 33s | ✓ |
| anthropic/claude-haiku-4-5 | 37s | ✓ |
| alibaba/qwen3-coder-next | 39s | ✓ |
| anthropic/claude-sonnet-4-6 | 49s | ✓ |
| zai/glm-5 | 54s | ✓ |
| mistral/devstral-2512 | 61s | ✓ |
| anthropic/claude-opus-4-6 | 80s | ✓ |
| minimax/MiniMax-M2.5 | 213s | ✓ (2nd try) |
| alibaba/qwen3.5-plus | 363s | ✓ (2nd try) |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 11s |
| kimi-coding/k2p5 | 16s |
| alibaba/qwen3-coder-next | 16s |
| anthropic/claude-haiku-4-5 | 19s |
| openai-codex/gpt-5.3-codex | 23s |
| anthropic/claude-opus-4-6 | 29s |
| zai/glm-5 | 48s |
| minimax/MiniMax-M2.5 | 79s |
| anthropic/claude-sonnet-4-6 | 239s |
| alibaba/qwen3.5-plus | 260s |
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 19s | ✓ |
| openai-codex/gpt-5.3-codex | 19s | ✓ |
| alibaba/qwen3.5-plus | 22s | ✓ |
| anthropic/claude-opus-4-6 | 27s | ✓ |
| alibaba/qwen3-coder-next | 27s | ✓ |
| kimi-coding/k2p5 | 30s | ✓ |
| anthropic/claude-haiku-4-5 | 53s | ✓ |
| zai/glm-5 | 63s | ✓ |
| minimax/MiniMax-M2.5 | 67s | ✓ |
| mistral/devstral-2512 | 1164s | ✓ (dirty retry) |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| anthropic/claude-haiku-4-5 | 45s | ✓ |
| zai/glm-5 | 45s | ✓ |
| anthropic/claude-opus-4-6 | 48s | ✓ |
| alibaba/qwen3.5-plus | 62s | ✓ |
| mistral/devstral-2512 | 163s | ✓ |
| alibaba/qwen3-coder-next | 216s | ✓ |
| openai-codex/gpt-5.3-codex | 952s | ✓ (nudge) |
| minimax/MiniMax-M2.5 | — | ✓ (dirty retry ×3) |
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 15s | ✓ |
| kimi-coding/k2p5 | 17s | ✓ |
| anthropic/claude-sonnet-4-6 | 21s | ✓ |
| anthropic/claude-opus-4-6 | 22s | ✓ |
| mistral/devstral-2512 | 22s | ✓ |
| alibaba/qwen3-coder-next | 23s | ✓ |
| zai/glm-5 | 28s | ✓ |
| alibaba/qwen3.5-plus | 37s | ✓ |
| minimax/MiniMax-M2.5 | 53s | ✓ |
| openai-codex/gpt-5.3-codex | — | ✓ (2nd try) |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 24s | ✓ |
| anthropic/claude-opus-4-6 | 24s | ✓ |
| kimi-coding/k2p5 | 38s | ✓ |
| mistral/devstral-2512 | 47s | ✓ |
| alibaba/qwen3.5-plus | 48s | ✓ |
| minimax/MiniMax-M2.5 | 58s | ✓ |
| zai/glm-5 | 68s | ✓ |
| anthropic/claude-haiku-4-5 | 245s | ✓ |
| alibaba/qwen3-coder-next | 272s | ✓ |
| openai-codex/gpt-5.3-codex | 971s | ✓ (2nd try) |
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| anthropic/claude-sonnet-4-6 | 16s |
| openai-codex/gpt-5.3-codex | 17s |
| alibaba/qwen3.5-plus | 19s |
| anthropic/claude-haiku-4-5 | 20s |
| mistral/devstral-2512 | 20s |
| anthropic/claude-opus-4-6 | 21s |
| zai/glm-5 | 24s |
| minimax/MiniMax-M2.5 | 30s |
| kimi-coding/k2p5 | 44s |
| alibaba/qwen3-coder-next | 47s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 9s |
| mistral/devstral-2512 | 11s |
| anthropic/claude-sonnet-4-6 | 13s |
| alibaba/qwen3.5-plus | 14s |
| openai-codex/gpt-5.3-codex | 15s |
| kimi-coding/k2p5 | 22s |
| alibaba/qwen3-coder-next | 22s |
| anthropic/claude-opus-4-6 | 24s |
| zai/glm-5 | 31s |
| minimax/MiniMax-M2.5 | 108s |
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-opus-4-6 | 53s | 41s | 80s | 29s | 27s | 48s | 22s | 24s | 21s | 24s | 369s |
| anthropic/claude-haiku-4-5 | 43s | 22s | 37s | 19s | 53s | 45s | 15s | 245s | 20s | 9s | 508s |
| anthropic/claude-sonnet-4-6 | 54s | 44s | 49s | 239s | 19s | 32s | 21s | 24s | 16s | 13s | 511s |
| kimi-coding/k2p5 | 60s | 239s | 32s | 16s | 30s | 16s | 17s | 38s | 44s | 22s | 514s |
| zai/glm-5 | 66s | 137s | 54s | 48s | 63s | 45s | 28s | 68s | 24s | 31s | 564s |
| alibaba/qwen3-coder-next | 52s | 52s | 39s | 16s | 27s | 216s | 23s | 272s | 47s | 22s | 766s |
| minimax/MiniMax-M2.5 | 97s | 127s | 213s | 79s | 67s | —* | 53s | 58s | 30s | 108s | 832s* |
| alibaba/qwen3.5-plus | 58s | 86s | 363s | 260s | 22s | 62s | 37s | 48s | 19s | 14s | 969s |
| mistral/devstral-2512 | 65s | 364s | 61s | 11s | 1164s | 163s | 22s | 47s | 20s | 11s | 1928s |
| openai-codex/gpt-5.3-codex | 52s | 23s | 33s | 23s | 19s | 952s | —† | 971s | 17s | 15s | 2105s† |
* MiniMax-M2.5's D3P2 required three dirty restarts; its wall-clock time is not directly comparable, so the total excludes D3P2.
† Codex D4P1 needed a retry; the time for that part was not captured cleanly, so the total excludes it.
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 750 | 486 | 774 | 1,074 | 746 | 1,015 | 1,039 | 3,042 | 779 | 542 | 10,247 |
| zai/glm-5 | 853 | 4,144 | 970 | 950 | 1,442 | 979 | 919 | 1,618 | 817 | 709 | 13,401 |
| kimi-coding/k2p5 | 978 | 4,470 | 923 | 736 | 1,904 | 863 | 1,009 | 2,505 | 1,676 | 668 | 15,732 |
| anthropic/claude-opus-4-6 | 1,246 | 1,600 | 2,627 | 1,876 | 1,328 | 2,061 | 1,282 | 1,390 | 1,024 | 1,351 | 15,785 |
| anthropic/claude-sonnet-4-6 | 1,325 | 1,673 | 2,143 | 16,394 | 1,207 | 1,980 | 1,393 | 939 | 1,068 | 770 | 28,892 |
| anthropic/claude-haiku-4-5 | 1,299 | 1,197 | 1,808 | 1,964 | 6,141 | 5,755 | 1,764 | 19,336 | 2,006 | 960 | 42,230 |
| alibaba/qwen3-coder-next | 2,224 | 6,930 | 1,955 | 1,855 | 1,610 | 22,069 | 2,055 | 25,465 | 3,080 | 1,393 | 68,636 |
| alibaba/qwen3.5-plus | 2,169 | 7,427 | 32,452 | 14,674 | 2,194 | 4,576 | 2,852 | 2,267 | 1,622 | 1,305 | 71,538 |
| minimax/MiniMax-M2.5 | 1,877 | 4,445 | 5,885 | 3,525 | 2,606 | 87,094 | 1,972 | 2,260 | 1,063 | 1,144 | 111,871 |
| mistral/devstral-2512 | 3,825 | 16,758 | 1,418 | 1,074 | 115,225 | 13,010 | 2,255 | 2,811 | 1,576 | 807 | 158,759 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0096 | .0273 | .0043 | .0049 | .0164 | .0112 | .0109 | .0178 | .0152 | .0116 | $0.13 |
| zai/glm-5 | .0196 | .0330 | .0097 | .0104 | .0121 | .0136 | .0083 | .0170 | .0073 | .0096 | $0.14 |
| openai-codex/gpt-5.3-codex | .0300 | .0158 | .0197 | .0389 | .0260 | .0277 | .0296 | .0864 | .0190 | .0199 | $0.31 |
| anthropic/claude-haiku-4-5 | .0289 | .0119 | .0272 | .0186 | .0650 | .0586 | .0253 | .2179 | .0413 | .0155 | $0.51 |
| alibaba/qwen3.5-plus | .0200 | .0603 | .2671 | .2750 | .0138 | .0658 | .0228 | .0339 | .0125 | .0161 | $0.79 |
| anthropic/claude-sonnet-4-6 | .0517 | .0512 | .0590 | .4163 | .0406 | .0567 | .0443 | .0325 | .0374 | .0251 | $0.81 |
| anthropic/claude-opus-4-6 | .0988 | .0736 | .1614 | .0856 | .1123 | .0998 | .1273 | .0692 | .1338 | .0607 | $1.02 |
| alibaba/qwen3-coder-next | .0518 | .0644 | .0302 | .0319 | .0715 | .7594 | .0249 | .5769 | .1729 | .1039 | $1.89 |
| minimax/MiniMax-M2.5 | .0192 | .0346 | .0445 | .0290 | .0285 | 1.6985 | .0095 | .0160 | .0205 | .0262 | $1.93 |
| mistral/devstral-2512 | .0487 | .2439 | .0254 | .0204 | 2.0172 | .3556 | .0190 | .0525 | .0198 | .0131 | $2.82 |
Observations
10/10 completers — zero ejections. Elm joins Python and Ruby as the only languages in this series where every model solved every part.
claude-opus-4-6 — fastest at 369s total. No single part over 80s, never needed a
retry. ~$1.02 total.
kimi-coding/k2p5 — cheapest at ~$0.13. Fourth fastest at 514s. On 8 of 10 parts
it was 44s or under.
Day 3 was rough for two models: devstral-2512 hit a runaway loop on D3P1 (~$1.91 and
~105K tokens before being killed), and MiniMax-M2.5 needed three dirty restarts on D3P2
(87K tokens, ~$1.70).
gpt-5.3-codex — fewest tokens: 10,247 total. But also the slowest overall (2,105s),
due to D3P2 (952s) and D4P2 (971s).
claude-sonnet-4-6 — 239s and 16,394 tokens on D2P2. Every other part was 13–54s.
qwen3.5-plus — 32,452 tokens on D2P1 alone. Both Alibaba models completed
everything but used a lot of tokens getting there.
glm-5 — second cheapest at ~$0.14, fifth fastest at 564s, 13,401 tokens. No dirty
retries needed.
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Elm's 10/10 completion was unexpected — it has a smaller training corpus than any other
language tested. The provided template (Day00.elm with Platform.worker and Debug.log)
may have helped by giving every model a clear starting point.
ReScript (2/10) is also a niche compile-to-JS functional language, but its toolchain gave models a much harder time. The scaffold and Elm's stable API may explain the difference.
Benchmarked on 2026-02-26 using pi as the agent harness.
This post was written with AI assistance.