Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Elixir)

Tags = [ Elixir, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, and the Ruby benchmark, I ran the same AoC 2025 Days 1–5 setup in Elixir.

Elixir is dynamic like Ruby/Python, but with its own ecosystem and idioms that models don't always handle cleanly.
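As a toy illustration of what "its own idioms" means here (nothing puzzle-specific, just the style these tasks reward): pipelines, pattern matching, and `Enum` instead of the loop-and-mutate style that Ruby/Python training data encourages.

```elixir
defmodule Style do
  # Sum the numbers on each line of a blob of puzzle-like input,
  # written the way idiomatic Elixir wants it: one pipeline, no mutation.
  def line_sums(input) do
    input
    |> String.split("\n", trim: true)
    |> Enum.map(fn line ->
      line
      |> String.split()
      |> Enum.map(&String.to_integer/1)
      |> Enum.sum()
    end)
  end
end
```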

The contestants

| # | Model |
|---|-------|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |

Ejections

| Model | Ejected at | Reason |
|-------|------------|--------|
| mistral/devstral-2512 | D1P1 | Wrong answer after 3 clean retries |
| alibaba/qwen3-coder-next | D1P2 | Wrong answer after 3 clean retries |
| openai-codex/gpt-5.3-codex | D3P1 | Brain-dead/no-progress loop after retry nudge |

So this run finished with 7/10 full completers.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

| Model | Time | Result |
|-------|------|--------|
| anthropic/claude-haiku-4-5 | 24s | ✓ |
| openai-codex/gpt-5.3-codex | 24s | ✓ |
| anthropic/claude-opus-4-6 | 26s | ✓ |
| alibaba/qwen3.5-plus | 39s | ✓ |
| kimi-coding/k2p5 | 44s | ✓ |
| alibaba/qwen3-coder-next | 79s | ✓ |
| minimax/MiniMax-M2.5 | 109s | ✓ |
| anthropic/claude-sonnet-4-6 | 135s | ✓ |
| zai/glm-5 | 172s | ✓ |
| mistral/devstral-2512 | 416s | ✗ (ejected) |



Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Result |
|-------|------|--------|
| openai-codex/gpt-5.3-codex | 13s | ✓ |
| anthropic/claude-opus-4-6 | 25s | ✓ |
| anthropic/claude-sonnet-4-6 | 31s | ✓ |
| kimi-coding/k2p5 | 49s | ✓ |
| alibaba/qwen3.5-plus | 78s | ✓ |
| anthropic/claude-haiku-4-5 | 403s | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 477s | ✓ (2nd try) |
| zai/glm-5 | 526s | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 870s | ✗ (ejected) |
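For flavor, here is a minimal sketch of the kind of Elixir this part calls for, under my assumed reading of the task (signed rotation steps, with a "crossing" meaning the running dial position changes sign); the actual puzzle spec may differ.

```elixir
defmodule Dial do
  # Hypothetical sketch: track the running dial position over signed
  # rotation steps and count how often it crosses zero (sign flip).
  def zero_crossings(steps) do
    positions = Enum.scan(steps, 0, &(&1 + &2))

    [0 | positions]
    |> Enum.chunk_every(2, 1, :discard)  # consecutive position pairs
    |> Enum.count(fn [a, b] -> a * b < 0 end)
  end
end
```

The retry pattern in the table suggests the tricky bit was an edge case in exactly this kind of boundary handling (e.g. landing on zero versus passing through it).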



Day 2 Part 1 — Summing repeated-digit IDs in ranges

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 12s |
| kimi-coding/k2p5 | 21s |
| alibaba/qwen3.5-plus | 24s |
| openai-codex/gpt-5.3-codex | 27s |
| anthropic/claude-sonnet-4-6 | 33s |
| anthropic/claude-opus-4-6 | 45s |
| minimax/MiniMax-M2.5 | 50s |
| zai/glm-5 | 92s |



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

| Model | Time |
|-------|------|
| openai-codex/gpt-5.3-codex | 11s |
| anthropic/claude-opus-4-6 | 33s |
| anthropic/claude-sonnet-4-6 | 35s |
| anthropic/claude-haiku-4-5 | 36s |
| kimi-coding/k2p5 | 36s |
| alibaba/qwen3.5-plus | 85s |
| zai/glm-5 | 106s |
| minimax/MiniMax-M2.5 | 112s |
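A sketch of the core predicate, assuming "repeated-pattern ID" means the ID's decimal string is some shorter block repeated two or more times (my guess at the spec, e.g. "123123" or "7777"):

```elixir
defmodule Ids do
  # Hypothetical reading of the task: an ID qualifies if its decimal string
  # is a shorter block repeated at least twice.
  def repeated_pattern?(id) do
    s = Integer.to_string(id)
    n = String.length(s)

    # Try every block length up to half the string; `//1` keeps the
    # range empty for single-digit IDs instead of counting down.
    Enum.any?(1..div(n, 2)//1, fn d ->
      rem(n, d) == 0 and
        String.duplicate(String.slice(s, 0, d), div(n, d)) == s
    end)
  end
end
```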



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Result |
|-------|------|--------|
| anthropic/claude-haiku-4-5 | 19s | ✓ |
| alibaba/qwen3.5-plus | 27s | ✓ |
| anthropic/claude-sonnet-4-6 | 29s | ✓ |
| kimi-coding/k2p5 | 29s | ✓ |
| anthropic/claude-opus-4-6 | 31s | ✓ |
| minimax/MiniMax-M2.5 | 39s | ✓ |
| zai/glm-5 | 125s | ✓ |
| openai-codex/gpt-5.3-codex | | ✗ (ejected) |



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

| Model | Time |
|-------|------|
| kimi-coding/k2p5 | 15s |
| alibaba/qwen3.5-plus | 18s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-opus-4-6 | 22s |
| minimax/MiniMax-M2.5 | 34s |
| anthropic/claude-haiku-4-5 | 59s |
| zai/glm-5 | 229s |
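If the task is what the title suggests (pick 12 digits in order from a bank so the resulting number is maximal; that reading is my assumption), the classic greedy-stack trick applies, and it translates neatly to Elixir:

```elixir
defmodule Joltage do
  # Hypothetical sketch: keep k digits of a digit string, in order,
  # so the resulting k-digit number is maximal (classic greedy stack).
  def max_subsequence(digits, k) do
    chars = String.graphemes(digits)
    n = length(chars)

    chars
    |> Enum.with_index()
    |> Enum.reduce([], fn {c, i}, stack ->
      # Pop smaller digits while enough input remains to still reach k picks.
      stack = drop_smaller(stack, c, n - i, k)
      if length(stack) < k, do: [c | stack], else: stack
    end)
    |> Enum.reverse()
    |> Enum.join()
  end

  defp drop_smaller([top | rest] = stack, c, remaining, k) do
    if top < c and length(stack) - 1 + remaining >= k do
      drop_smaller(rest, c, remaining, k)
    else
      stack
    end
  end

  defp drop_smaller([], _c, _remaining, _k), do: []
end
```

Single-grapheme comparison (`top < c`) works here because all the characters are digits, so lexicographic and numeric order coincide.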



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 15s |
| kimi-coding/k2p5 | 17s |
| alibaba/qwen3.5-plus | 23s |
| anthropic/claude-opus-4-6 | 28s |
| minimax/MiniMax-M2.5 | 38s |
| zai/glm-5 | 56s |
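The natural Elixir shape for this kind of grid problem is a `MapSet` of coordinates rather than a 2D array. A sketch, with the 8-neighborhood and the "fewer than 4 occupied neighbors" threshold both invented for illustration (the real puzzle rule is not reproduced here):

```elixir
defmodule Grid do
  # Hypothetical sketch: parse '#' cells into a MapSet of {x, y} and count
  # cells whose occupied-neighbor count is below an assumed threshold.
  @neighbors for dx <- -1..1, dy <- -1..1, {dx, dy} != {0, 0}, do: {dx, dy}

  def parse(input) do
    for {line, y} <- Enum.with_index(String.split(input, "\n", trim: true)),
        {?#, x} <- Enum.with_index(String.to_charlist(line)),
        into: MapSet.new(),
        do: {x, y}
  end

  def accessible_count(cells) do
    Enum.count(cells, fn {x, y} ->
      occupied =
        Enum.count(@neighbors, fn {dx, dy} ->
          MapSet.member?(cells, {x + dx, y + dy})
        end)

      occupied < 4
    end)
  end
end
```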



Day 4 Part 2 — Iterative grid removal simulation

| Model | Time |
|-------|------|
| kimi-coding/k2p5 | 13s |
| anthropic/claude-sonnet-4-6 | 17s |
| alibaba/qwen3.5-plus | 17s |
| anthropic/claude-opus-4-6 | 18s |
| anthropic/claude-haiku-4-5 | 19s |
| zai/glm-5 | 34s |
| minimax/MiniMax-M2.5 | 51s |
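"Iterative removal" suggests a fixed-point loop: remove everything removable, recompute, repeat until a pass removes nothing. A sketch of that skeleton, where `removable?` is a stand-in predicate for whatever rule the actual puzzle uses:

```elixir
defmodule Removal do
  # Hypothetical sketch of the fixed-point loop: repeatedly remove the
  # cells a caller-supplied rule marks removable, until nothing changes.
  # Returns the total number of cells removed across all passes.
  def simulate(cells, removable?) do
    doomed = Enum.filter(cells, &removable?.(&1, cells))

    case doomed do
      [] ->
        0

      _ ->
        length(doomed) +
          simulate(MapSet.difference(cells, MapSet.new(doomed)), removable?)
    end
  end
end
```

Passing the rule in as a function keeps the loop reusable; the part-1 predicate plugs straight in.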



Day 5 Part 1 — Range membership checking

| Model | Time |
|-------|------|
| alibaba/qwen3.5-plus | 13s |
| anthropic/claude-sonnet-4-6 | 15s |
| anthropic/claude-haiku-4-5 | 18s |
| kimi-coding/k2p5 | 21s |
| anthropic/claude-opus-4-6 | 28s |
| minimax/MiniMax-M2.5 | 34s |
| zai/glm-5 | 65s |



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

| Model | Time |
|-------|------|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 13s |
| kimi-coding/k2p5 | 15s |
| anthropic/claude-opus-4-6 | 16s |
| minimax/MiniMax-M2.5 | 27s |
| alibaba/qwen3.5-plus | 32s |
| zai/glm-5 | 71s |
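Counting distinct IDs covered by overlapping ranges is the standard sort-and-merge technique; a sketch assuming inclusive `{lo, hi}` integer ranges (the input format is my assumption):

```elixir
defmodule Ranges do
  # Sketch of the standard technique: sort the ranges, merge overlapping
  # (or adjacent) ones, then sum the merged lengths to count distinct IDs.
  def covered_count(ranges) do
    ranges
    |> Enum.sort()
    |> Enum.reduce([], fn
      {lo, hi}, [] ->
        [{lo, hi}]

      {lo, hi}, [{mlo, mhi} | rest] when lo <= mhi + 1 ->
        [{mlo, max(hi, mhi)} | rest]

      {lo, hi}, acc ->
        [{lo, hi} | acc]
    end)
    |> Enum.map(fn {lo, hi} -> hi - lo + 1 end)
    |> Enum.sum()
  end
end
```

Because the list is sorted first, each range only ever needs comparing against the most recently merged one at the head of the accumulator.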

Summary tables

Wall-clock time (seconds)

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 44s | 49s | 21s | 36s | 29s | 15s | 17s | 13s | 21s | 15s | 260s |
| anthropic/claude-opus-4-6 | 26s | 25s | 45s | 33s | 31s | 22s | 28s | 18s | 28s | 16s | 272s |
| anthropic/claude-sonnet-4-6 | 135s | 31s | 33s | 35s | 29s | 19s | 15s | 17s | 15s | 13s | 342s |
| alibaba/qwen3.5-plus | 39s | 78s | 24s | 85s | 27s | 18s | 23s | 17s | 13s | 32s | 356s |
| anthropic/claude-haiku-4-5 | 24s | 403s | 12s | 36s | 19s | 59s | 13s | 19s | 18s | 13s | 616s |
| minimax/MiniMax-M2.5 | 109s | 477s | 50s | 112s | 39s | 34s | 38s | 51s | 34s | 27s | 971s |
| zai/glm-5 | 172s | 526s | 92s | 106s | 125s | 229s | 56s | 34s | 65s | 71s | 1476s |
| mistral/devstral-2512 | ✗ | | | | | | | | | | DNF |
| alibaba/qwen3-coder-next | 79s | ✗ | | | | | | | | | DNF |
| openai-codex/gpt-5.3-codex | 24s | 13s | 27s | 11s | ✗ (—) | | | | | | DNF |

Output tokens per part

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 1,219 | 3,300 | 949 | 1,839 | 1,371 | 812 | 764 | 804 | 616 | 594 | 12,268 |
| anthropic/claude-opus-4-6 | 1,022 | 1,139 | 2,355 | 1,858 | 1,360 | 1,122 | 1,183 | 952 | 1,153 | 813 | 12,957 |
| anthropic/claude-sonnet-4-6 | 1,320 | 1,513 | 1,787 | 2,050 | 1,368 | 967 | 790 | 883 | 686 | 704 | 12,068 |
| alibaba/qwen3.5-plus | 2,456 | 9,866 | 1,935 | 6,754 | 2,413 | 1,540 | 1,824 | 1,146 | 899 | 1,850 | 30,683 |
| anthropic/claude-haiku-4-5 | 2,008 | 4,244 | 1,007 | 3,508 | 2,106 | 6,349 | 1,322 | 1,780 | 1,602 | 1,011 | 24,937 |
| minimax/MiniMax-M2.5 | 2,994 | 13,508 | 2,118 | 5,278 | 1,709 | 1,323 | 1,037 | 1,347 | 1,040 | 874 | 31,228 |
| zai/glm-5 | 753 | 3,728 | 1,594 | 1,847 | 2,338 | 4,327 | 775 | 662 | 531 | 1,033 | 17,588 |
| mistral/devstral-2512 | 12,277 | | | | | | | | | | 12,277 |
| alibaba/qwen3-coder-next | 2,100 | 32,845 | | | | | | | | | 34,945 |
| openai-codex/gpt-5.3-codex | 691 | 482 | 864 | 359 | 935 | | | | | | 3,331 |

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|-------|------|------|------|------|------|------|------|------|------|------|-------|
| kimi-coding/k2p5 | 0.0371 | 0.0242 | 0.0050 | 0.0091 | 0.0150 | 0.0090 | 0.0101 | 0.0092 | 0.0100 | 0.0082 | 0.1370 |
| anthropic/claude-opus-4-6 | 0.1569 | 0.0554 | 0.1720 | 0.0933 | 0.1478 | 0.0552 | 0.1732 | 0.0526 | 0.1868 | 0.0421 | 1.1352 |
| anthropic/claude-sonnet-4-6 | 0.0546 | 0.0419 | 0.0615 | 0.0602 | 0.0552 | 0.0306 | 0.0350 | 0.0290 | 0.0324 | 0.0216 | 0.4220 |
| alibaba/qwen3.5-plus | 0.0315 | 0.0418 | 0.0133 | 0.0310 | 0.0172 | 0.0162 | 0.0152 | 0.0167 | 0.0088 | 0.0202 | 0.2119 |
| anthropic/claude-haiku-4-5 | 0.0392 | 0.0437 | 0.0247 | 0.0435 | 0.0330 | 0.0667 | 0.0284 | 0.0178 | 0.0351 | 0.0151 | 0.3472 |
| minimax/MiniMax-M2.5 | 0.0579 | 0.1287 | 0.0128 | 0.0450 | 0.0056 | 0.0077 | 0.0185 | 0.0290 | 0.0165 | 0.0206 | 0.3425 |
| zai/glm-5 | 0.0315 | 0.0486 | 0.0151 | 0.0191 | 0.0516 | 0.0787 | 0.0282 | 0.0193 | 0.0298 | 0.0322 | 0.3542 |
| mistral/devstral-2512 | 0.2548 | | | | | | | | | | 0.2548 |
| alibaba/qwen3-coder-next | 0.1052 | 0.9827 | | | | | | | | | 1.0878 |
| openai-codex/gpt-5.3-codex | 0.0220 | 0.0202 | 0.0351 | 0.0121 | 0.0422 | | | | | | 0.1316 |

Observations

7/10 completers. Fewer than Python (10/10) and Ruby (10/10), but more than ReScript run 2 (2/10).

kimi-coding/k2p5 wins the full-completion speed race. 260s total across all 10 parts, beating claude-opus-4-6 by 12 seconds.

claude-opus-4-6 is fast but expensive. 272s total (second place), but $1.1352 total cost — more than 8× k2p5.

claude-sonnet-4-6 is the token-efficiency winner among completers. 12,068 output tokens total, slightly lower than k2p5's 12,268.

qwen3.5-plus is fast-ish but verbose. 356s total is solid (4th), but 30,683 output tokens is over 2.5× sonnet and k2p5.

Day 1 Part 2 was the slowest part overall. Three models passed only on a second clean attempt (haiku, MiniMax-M2.5, glm-5), producing times of 403–526s.

gpt-5.3-codex — fast early (D1P2: 13s, D2P2: 11s), then ejected on D3P1 for no-progress behavior after a retry nudge.

qwen3-coder-next — spent ~$0.98 on one failed puzzle part (D1P2). Solved D1P1 in 79s, then failed D1P2 retries and was ejected.

Cross-language snapshot

| Language | Models completing all 10 parts |
|----------|--------------------------------|
| Python | 10/10 |
| Ruby | 10/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |

Elixir lands very close to Haskell in completion rate on this benchmark, but with a very different shape of failures (more behavioral/retry-loop failures, fewer pure language/tooling barriers).

What's next

If I extend this Elixir run beyond Day 5 in a follow-up benchmark, it'll be interesting to see whether the same seven-model pack holds through later, trickier puzzles — or whether another wave of ejections appears.

Benchmarked on 2026-02-26 using pi as the agent harness.


This post was written with AI assistance.