Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Elixir)

Tags = [ Elixir, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, and the Ruby benchmark, I ran the same AoC 2025 Days 1–5 setup in Elixir.

Elixir is dynamic like Ruby/Python, but with its own ecosystem and idioms that models don't always handle cleanly.
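As a toy illustration of what "its own idioms" means here (nothing puzzle-specific, just the style these tasks reward): pipelines, pattern matching, and `Enum` instead of the loop-and-mutate style that Ruby/Python training data encourages.

```elixir
defmodule Style do
  # Sum the numbers on each line of a blob of puzzle-like input,
  # written the way idiomatic Elixir wants it: one pipeline, no mutation.
  def line_sums(input) do
    input
    |> String.split("\n", trim: true)
    |> Enum.map(fn line ->
      line
      |> String.split()
      |> Enum.map(&String.to_integer/1)
      |> Enum.sum()
    end)
  end
end
```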

The contestants

| # | Model |
|---|-------|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |

Ejections

| Model | Ejected at | Reason |
|-------|------------|--------|
| mistral/devstral-2512 | D1P1 | Wrong answer after 3 clean retries |
| alibaba/qwen3-coder-next | D1P2 | Wrong answer after 3 clean retries |
| openai-codex/gpt-5.3-codex | D3P1 | Brain-dead/no-progress loop after retry nudge |

So this run finished with 7/10 full completers.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

| Model | Time | Result |
|-------|------|--------|
| anthropic/claude-haiku-4-5 | 24s | ✓ |
| openai-codex/gpt-5.3-codex | 24s | ✓ |
| anthropic/claude-opus-4-6 | 26s | ✓ |
| alibaba/qwen3.5-plus | 39s | ✓ |
| kimi-coding/k2p5 | 44s | ✓ |
| alibaba/qwen3-coder-next | 79s | ✓ |
| minimax/MiniMax-M2.5 | 109s | ✓ |
| anthropic/claude-sonnet-4-6 | 135s | ✓ |
| zai/glm-5 | 172s | ✓ |
| mistral/devstral-2512 | 416s | ✗ (ejected) |



Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Result |
|-------|------|--------|
| openai-codex/gpt-5.3-codex | 13s | ✓ |
| anthropic/claude-opus-4-6 | 25s | ✓ |
| anthropic/claude-sonnet-4-6 | 31s | ✓ |
| kimi-coding/k2p5 | 49s | ✓ |
| alibaba/qwen3.5-plus | 78s | ✓ |
| anthropic/claude-haiku-4-5 | 403s | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 477s | ✓ (2nd try) |
| zai/glm-5 | 526s | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 870s | ✗ (ejected) |
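For flavor, here is a minimal sketch of the kind of Elixir this part calls for, under my assumed reading of the task (signed rotation steps, with a "crossing" meaning the running dial position changes sign); the actual puzzle spec may differ.

```elixir
defmodule Dial do
  # Hypothetical sketch: track the running dial position over signed
  # rotation steps and count how often it crosses zero (sign flip).
  def zero_crossings(steps) do
    positions = Enum.scan(steps, 0, &(&1 + &2))

    [0 | positions]
    |> Enum.chunk_every(2, 1, :discard)  # consecutive position pairs
    |> Enum.count(fn [a, b] -> a * b < 0 end)
  end
end
```

The retry pattern in the table suggests the tricky bit was an edge case in exactly this kind of boundary handling (e.g. landing on zero versus passing through it).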



Day 2 Part 1 — Summing repeated-digit IDs in ranges

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 12s |
| kimi-coding/k2p5 | 21s |
| alibaba/qwen3.5-plus | 24s |
| openai-codex/gpt-5.3-codex | 27s |
| anthropic/claude-sonnet-4-6 | 33s |
| anthropic/claude-opus-4-6 | 45s |
| minimax/MiniMax-M2.5 | 50s |
| zai/glm-5 | 92s |



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

| Model | Time |
|-------|------|
| openai-codex/gpt-5.3-codex | 11s |
| anthropic/claude-opus-4-6 | 33s |
| anthropic/claude-sonnet-4-6 | 35s |
| anthropic/claude-haiku-4-5 | 36s |
| kimi-coding/k2p5 | 36s |
| alibaba/qwen3.5-plus | 85s |
| zai/glm-5 | 106s |
| minimax/MiniMax-M2.5 | 112s |
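A sketch of the core predicate, assuming "repeated-pattern ID" means the ID's decimal string is some shorter block repeated two or more times (my guess at the spec, e.g. "123123" or "7777"):

```elixir
defmodule Ids do
  # Hypothetical reading of the task: an ID qualifies if its decimal string
  # is a shorter block repeated at least twice.
  def repeated_pattern?(id) do
    s = Integer.to_string(id)
    n = String.length(s)

    # Try every block length up to half the string; `//1` keeps the
    # range empty for single-digit IDs instead of counting down.
    Enum.any?(1..div(n, 2)//1, fn d ->
      rem(n, d) == 0 and
        String.duplicate(String.slice(s, 0, d), div(n, d)) == s
    end)
  end
end
```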



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Result |
|-------|------|--------|
| anthropic/claude-haiku-4-5 | 19s | ✓ |
| alibaba/qwen3.5-plus | 27s | ✓ |
| anthropic/claude-sonnet-4-6 | 29s | ✓ |
| kimi-coding/k2p5 | 29s | ✓ |
| anthropic/claude-opus-4-6 | 31s | ✓ |
| minimax/MiniMax-M2.5 | 39s | ✓ |
| zai/glm-5 | 125s | ✓ |
| openai-codex/gpt-5.3-codex | | ✗ (ejected) |



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

| Model | Time |
|-------|------|
| kimi-coding/k2p5 | 15s |
| alibaba/qwen3.5-plus | 18s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-opus-4-6 | 22s |
| minimax/MiniMax-M2.5 | 34s |
| anthropic/claude-haiku-4-5 | 59s |
| zai/glm-5 | 229s |
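If the task is what the title suggests (pick 12 digits in order from a bank so the resulting number is maximal; that reading is my assumption), the classic greedy-stack trick applies, and it translates neatly to Elixir:

```elixir
defmodule Joltage do
  # Hypothetical sketch: keep k digits of a digit string, in order,
  # so the resulting k-digit number is maximal (classic greedy stack).
  def max_subsequence(digits, k) do
    chars = String.graphemes(digits)
    n = length(chars)

    chars
    |> Enum.with_index()
    |> Enum.reduce([], fn {c, i}, stack ->
      # Pop smaller digits while enough input remains to still reach k picks.
      stack = drop_smaller(stack, c, n - i, k)
      if length(stack) < k, do: [c | stack], else: stack
    end)
    |> Enum.reverse()
    |> Enum.join()
  end

  defp drop_smaller([top | rest] = stack, c, remaining, k) do
    if top < c and length(stack) - 1 + remaining >= k do
      drop_smaller(rest, c, remaining, k)
    else
      stack
    end
  end

  defp drop_smaller([], _c, _remaining, _k), do: []
end
```

Single-grapheme comparison (`top < c`) works here because all the characters are digits, so lexicographic and numeric order coincide.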



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 15s |
| kimi-coding/k2p5 | 17s |
| alibaba/qwen3.5-plus | 23s |
| anthropic/claude-opus-4-6 | 28s |
| minimax/MiniMax-M2.5 | 38s |
| zai/glm-5 | 56s |
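The natural Elixir shape for this kind of grid problem is a `MapSet` of coordinates rather than a 2D array. A sketch, with the 8-neighborhood and the "fewer than 4 occupied neighbors" threshold both invented for illustration (the real puzzle rule is not reproduced here):

```elixir
defmodule Grid do
  # Hypothetical sketch: parse '#' cells into a MapSet of {x, y} and count
  # cells whose occupied-neighbor count is below an assumed threshold.
  @neighbors for dx <- -1..1, dy <- -1..1, {dx, dy} != {0, 0}, do: {dx, dy}

  def parse(input) do
    for {line, y} <- Enum.with_index(String.split(input, "\n", trim: true)),
        {?#, x} <- Enum.with_index(String.to_charlist(line)),
        into: MapSet.new(),
        do: {x, y}
  end

  def accessible_count(cells) do
    Enum.count(cells, fn {x, y} ->
      occupied =
        Enum.count(@neighbors, fn {dx, dy} ->
          MapSet.member?(cells, {x + dx, y + dy})
        end)

      occupied < 4
    end)
  end
end
```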



Day 4 Part 2 — Iterative grid removal simulation

| Model | Time |
|-------|------|
| kimi-coding/k2p5 | 13s |
| anthropic/claude-sonnet-4-6 | 17s |
| alibaba/qwen3.5-plus | 17s |
| anthropic/claude-opus-4-6 | 18s |
| anthropic/claude-haiku-4-5 | 19s |
| zai/glm-5 | 34s |
| minimax/MiniMax-M2.5 | 51s |
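"Iterative removal" suggests a fixed-point loop: remove everything removable, recompute, repeat until a pass removes nothing. A sketch of that skeleton, where `removable?` is a stand-in predicate for whatever rule the actual puzzle uses:

```elixir
defmodule Removal do
  # Hypothetical sketch of the fixed-point loop: repeatedly remove the
  # cells a caller-supplied rule marks removable, until nothing changes.
  # Returns the total number of cells removed across all passes.
  def simulate(cells, removable?) do
    doomed = Enum.filter(cells, &removable?.(&1, cells))

    case doomed do
      [] ->
        0

      _ ->
        length(doomed) +
          simulate(MapSet.difference(cells, MapSet.new(doomed)), removable?)
    end
  end
end
```

Passing the rule in as a function keeps the loop reusable; the part-1 predicate plugs straight in.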



Day 5 Part 1 — Range membership checking

| Model | Time |
|-------|------|
| alibaba/qwen3.5-plus | 13s |
| anthropic/claude-sonnet-4-6 | 15s |
| anthropic/claude-haiku-4-5 | 18s |
| kimi-coding/k2p5 | 21s |
| anthropic/claude-opus-4-6 | 28s |
| minimax/MiniMax-M2.5 | 34s |
| zai/glm-5 | 65s |



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 13s |
| kimi-coding/k2p5 | 15s |
| anthropic/claude-opus-4-6 | 16s |
| minimax/MiniMax-M2.5 | 27s |
| alibaba/qwen3.5-plus | 32s |
| zai/glm-5 | 71s |
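Counting distinct IDs covered by overlapping ranges is the standard sort-and-merge technique; a sketch assuming inclusive `{lo, hi}` integer ranges (the input format is my assumption):

```elixir
defmodule Ranges do
  # Sketch of the standard technique: sort the ranges, merge overlapping
  # (or adjacent) ones, then sum the merged lengths to count distinct IDs.
  def covered_count(ranges) do
    ranges
    |> Enum.sort()
    |> Enum.reduce([], fn
      {lo, hi}, [] ->
        [{lo, hi}]

      {lo, hi}, [{mlo, mhi} | rest] when lo <= mhi + 1 ->
        [{mlo, max(hi, mhi)} | rest]

      {lo, hi}, acc ->
        [{lo, hi} | acc]
    end)
    |> Enum.map(fn {lo, hi} -> hi - lo + 1 end)
    |> Enum.sum()
  end
end
```

Because the list is sorted first, each range only ever needs comparing against the most recently merged one at the head of the accumulator.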

Summary tables

Wall-clock time (seconds)

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 44s | 49s | 21s | 36s | 29s | 15s | 17s | 13s | 21s | 15s | 260s |
| anthropic/claude-opus-4-6 | 26s | 25s | 45s | 33s | 31s | 22s | 28s | 18s | 28s | 16s | 272s |
| anthropic/claude-sonnet-4-6 | 135s | 31s | 33s | 35s | 29s | 19s | 15s | 17s | 15s | 13s | 342s |
| alibaba/qwen3.5-plus | 39s | 78s | 24s | 85s | 27s | 18s | 23s | 17s | 13s | 32s | 356s |
| anthropic/claude-haiku-4-5 | 24s | 403s | 12s | 36s | 19s | 59s | 13s | 19s | 18s | 13s | 616s |
| minimax/MiniMax-M2.5 | 109s | 477s | 50s | 112s | 39s | 34s | 38s | 51s | 34s | 27s | 971s |
| zai/glm-5 | 172s | 526s | 92s | 106s | 125s | 229s | 56s | 34s | 65s | 71s | 1476s |
| mistral/devstral-2512 | ✗ | | | | | | | | | | DNF |
| alibaba/qwen3-coder-next | 79s | ✗ | | | | | | | | | DNF |
| openai-codex/gpt-5.3-codex | 24s | 13s | 27s | 11s | ✗ (—) | | | | | | DNF |

Output tokens per part

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 1,219 | 3,300 | 949 | 1,839 | 1,371 | 812 | 764 | 804 | 616 | 594 | 12,268 |
| anthropic/claude-opus-4-6 | 1,022 | 1,139 | 2,355 | 1,858 | 1,360 | 1,122 | 1,183 | 952 | 1,153 | 813 | 12,957 |
| anthropic/claude-sonnet-4-6 | 1,320 | 1,513 | 1,787 | 2,050 | 1,368 | 967 | 790 | 883 | 686 | 704 | 12,068 |
| alibaba/qwen3.5-plus | 2,456 | 9,866 | 1,935 | 6,754 | 2,413 | 1,540 | 1,824 | 1,146 | 899 | 1,850 | 30,683 |
| anthropic/claude-haiku-4-5 | 2,008 | 4,244 | 1,007 | 3,508 | 2,106 | 6,349 | 1,322 | 1,780 | 1,602 | 1,011 | 24,937 |
| minimax/MiniMax-M2.5 | 2,994 | 13,508 | 2,118 | 5,278 | 1,709 | 1,323 | 1,037 | 1,347 | 1,040 | 874 | 31,228 |
| zai/glm-5 | 753 | 3,728 | 1,594 | 1,847 | 2,338 | 4,327 | 775 | 662 | 531 | 1,033 | 17,588 |
| mistral/devstral-2512 | 12,277 | | | | | | | | | | 12,277 |
| alibaba/qwen3-coder-next | 2,100 | 32,845 | | | | | | | | | 34,945 |
| openai-codex/gpt-5.3-codex | 691 | 482 | 864 | 359 | 935 | | | | | | 3,331 |

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 0.0371 | 0.0242 | 0.0050 | 0.0091 | 0.0150 | 0.0090 | 0.0101 | 0.0092 | 0.0100 | 0.0082 | 0.1370 |
| anthropic/claude-opus-4-6 | 0.1569 | 0.0554 | 0.1720 | 0.0933 | 0.1478 | 0.0552 | 0.1732 | 0.0526 | 0.1868 | 0.0421 | 1.1352 |
| anthropic/claude-sonnet-4-6 | 0.0546 | 0.0419 | 0.0615 | 0.0602 | 0.0552 | 0.0306 | 0.0350 | 0.0290 | 0.0324 | 0.0216 | 0.4220 |
| alibaba/qwen3.5-plus | 0.0315 | 0.0418 | 0.0133 | 0.0310 | 0.0172 | 0.0162 | 0.0152 | 0.0167 | 0.0088 | 0.0202 | 0.2119 |
| anthropic/claude-haiku-4-5 | 0.0392 | 0.0437 | 0.0247 | 0.0435 | 0.0330 | 0.0667 | 0.0284 | 0.0178 | 0.0351 | 0.0151 | 0.3472 |
| minimax/MiniMax-M2.5 | 0.0579 | 0.1287 | 0.0128 | 0.0450 | 0.0056 | 0.0077 | 0.0185 | 0.0290 | 0.0165 | 0.0206 | 0.3425 |
| zai/glm-5 | 0.0315 | 0.0486 | 0.0151 | 0.0191 | 0.0516 | 0.0787 | 0.0282 | 0.0193 | 0.0298 | 0.0322 | 0.3542 |
| mistral/devstral-2512 | 0.2548 | | | | | | | | | | 0.2548 |
| alibaba/qwen3-coder-next | 0.1052 | 0.9827 | | | | | | | | | 1.0878 |
| openai-codex/gpt-5.3-codex | 0.0220 | 0.0202 | 0.0351 | 0.0121 | 0.0422 | | | | | | 0.1316 |

Observations

7/10 completers. Fewer than Python (10/10) and Ruby (10/10), but more than ReScript run 2 (2/10).

kimi-coding/k2p5 wins the full-completion speed race. 260s total across all 10 parts, beating claude-opus-4-6 by 12 seconds.

claude-opus-4-6 is fast but expensive. 272s total (second place), but $1.1352 total cost — more than 8× k2p5.

claude-sonnet-4-6 is the token-efficiency winner among completers. 12,068 output tokens total, slightly lower than k2p5's 12,268.

qwen3.5-plus is fast-ish but verbose. 356s total is solid (4th), but 30,683 output tokens is over 2.5× sonnet and k2p5.

Day 1 Part 2 was the slowest part overall. Three models passed only on a second clean attempt (haiku, MiniMax-M2.5, glm-5), producing times of 403–526s.

gpt-5.3-codex — fast early (D1P2: 13s, D2P2: 11s), then ejected on D3P1 for no-progress behavior after a retry nudge.

qwen3-coder-next — spent ~$0.98 on one failed puzzle part (D1P2). Solved D1P1 in 79s, then failed D1P2 retries and was ejected.

Cross-language snapshot

| Language | Models completing all 10 parts |
|----------|--------------------------------|
| Python | 10/10 |
| Ruby | 10/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |

Elixir lands very close to Haskell in completion rate on this benchmark, but with a very different shape of failures (more behavioral/retry-loop failures, fewer pure language/tooling barriers).

What's next

If I extend this Elixir run beyond Day 5 in a follow-up benchmark, it'll be interesting to see whether the same seven-model pack holds through later, trickier puzzles — or whether another wave of ejections appears.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.