Developers, developers, developers!

Blog about programming, programming and, ah more programming!

Benchmarking LLMs on Advent of Code 2025 (Python)

Tags = [ Python, AI, Advent of Code ]

Following up on the Haskell benchmark and the OCaml benchmark, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Python.

This is also the first run with full token usage and API cost tracking per part, which adds a new angle beyond raw wall-clock time.

The contestants

#  Model
1  anthropic/claude-haiku-4-5
2  anthropic/claude-sonnet-4-6
3  anthropic/claude-opus-4-6
4  openai-codex/gpt-5.3-codex
5  zai/glm-5
6  minimax/MiniMax-M2.5
7  kimi-coding/k2p5
8  mistral/devstral-2512
9  alibaba/qwen3.5-plus
10 alibaba/qwen3-coder-next

A wrong model ID (claude-3-5-haiku-latest) accidentally made it into the enabled model list at the start of the run. It was caught immediately, killed, and replaced with claude-haiku-4-5, which missed only Day 1 Part 1 of the original session. That part was run separately afterwards and produced the correct answer in 9s.

Ejections

None. All 10 models solved all 10 parts correctly on the first attempt. This is the first benchmark in this series with a perfect sweep.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model Time
anthropic/claude-haiku-4-5 9s
mistral/devstral-2512 23s
kimi-coding/k2p5 27s
alibaba/qwen3-coder-next 28s
anthropic/claude-sonnet-4-6 29s
anthropic/claude-opus-4-6 30s
alibaba/qwen3.5-plus 30s
openai-codex/gpt-5.3-codex 36s
zai/glm-5 37s
minimax/MiniMax-M2.5 60s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model Time
mistral/devstral-2512 13s
anthropic/claude-haiku-4-5 18s
openai-codex/gpt-5.3-codex 20s
kimi-coding/k2p5 21s
anthropic/claude-sonnet-4-6 26s
anthropic/claude-opus-4-6 26s
alibaba/qwen3-coder-next 42s
zai/glm-5 56s
alibaba/qwen3.5-plus 73s
minimax/MiniMax-M2.5 81s
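I haven't reproduced the puzzle text here, but "counting zero-crossings during dial rotation" suggests simulating a circular dial one step at a time. A minimal sketch of that idea, assuming a 100-position dial and hypothetical instructions like "R30" (clockwise) or "L15" (counter-clockwise); the function name and input format are my inventions, not the actual puzzle spec:

```python
def count_zero_crossings(start: int, moves: list[str], size: int = 100) -> int:
    """Count how often the dial passes or lands on position 0."""
    pos = start
    crossings = 0
    for move in moves:
        step = 1 if move[0] == "R" else -1
        # Step one tick at a time so every pass over 0 is observed.
        for _ in range(int(move[1:])):
            pos = (pos + step) % size
            if pos == 0:
                crossings += 1
    return crossings
```

Stepping tick by tick is O(total rotation), which is plenty fast for typical AoC input sizes and avoids the off-by-one traps of computing crossings arithmetically.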



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model Time
alibaba/qwen3-coder-next 22s
mistral/devstral-2512 23s
anthropic/claude-haiku-4-5 24s
openai-codex/gpt-5.3-codex 26s
kimi-coding/k2p5 28s
alibaba/qwen3.5-plus 30s
zai/glm-5 38s
anthropic/claude-sonnet-4-6 43s
anthropic/claude-opus-4-6 50s
minimax/MiniMax-M2.5 67s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model Time
mistral/devstral-2512 14s
alibaba/qwen3-coder-next 17s
anthropic/claude-haiku-4-5 18s
kimi-coding/k2p5 24s
anthropic/claude-sonnet-4-6 25s
openai-codex/gpt-5.3-codex 27s
zai/glm-5 30s
minimax/MiniMax-M2.5 33s
anthropic/claude-opus-4-6 41s
alibaba/qwen3.5-plus 48s
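"Repeated-pattern IDs (any repeat count)" maps onto a classic string trick: a string consists of some smaller pattern repeated two or more times exactly when it occurs inside its own doubling with the first and last characters trimmed. A sketch under that reading of the task (the function name and the exact ID rules are my assumptions):

```python
def is_repeated_pattern(s: str) -> bool:
    """True if s is a shorter pattern repeated >= 2 times, e.g. '123123'."""
    # If s = p * k with k >= 2, then s reappears inside (s + s) at an
    # offset of len(p); trimming the endpoints rules out the trivial match.
    return s in (s + s)[1:-1]
```

This runs in linear time (substring search) and sidesteps enumerating every divisor of the length by hand.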



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model Time
kimi-coding/k2p5 21s
mistral/devstral-2512 24s
alibaba/qwen3-coder-next 25s
anthropic/claude-sonnet-4-6 28s
anthropic/claude-haiku-4-5 30s
anthropic/claude-opus-4-6 31s
openai-codex/gpt-5.3-codex 31s
alibaba/qwen3.5-plus 31s
zai/glm-5 71s
minimax/MiniMax-M2.5 72s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model Time
kimi-coding/k2p5 19s
anthropic/claude-haiku-4-5 21s
mistral/devstral-2512 21s
alibaba/qwen3-coder-next 21s
alibaba/qwen3.5-plus 24s
anthropic/claude-sonnet-4-6 25s
openai-codex/gpt-5.3-codex 25s
anthropic/claude-opus-4-6 28s
zai/glm-5 33s
minimax/MiniMax-M2.5 56s
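"Maximizing a k-digit joltage" from a bank of digits sounds like the classic largest-subsequence-of-length-k problem, which has a greedy monotonic-stack solution; Part 1 would be k=2 and Part 2 k=12. A sketch under that assumption (max_joltage and the exact input shape are hypothetical, not from the puzzle text):

```python
def max_joltage(digits: str, k: int) -> int:
    """Largest k-digit number formable as a subsequence of `digits`."""
    stack: list[str] = []
    drop = len(digits) - k  # how many digits we are allowed to discard
    for d in digits:
        # Pop smaller digits while discarding still leaves k digits available.
        while stack and drop > 0 and stack[-1] < d:
            stack.pop()
            drop -= 1
        stack.append(d)
    return int("".join(stack[:k]))
```

The greedy invariant is that each kept digit is the largest possible given how many digits remain, so the whole thing is O(n) per bank.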



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model Time
anthropic/claude-haiku-4-5 21s
mistral/devstral-2512 22s
alibaba/qwen3-coder-next 23s
kimi-coding/k2p5 27s
anthropic/claude-opus-4-6 29s
openai-codex/gpt-5.3-codex 31s
zai/glm-5 31s
alibaba/qwen3.5-plus 40s
anthropic/claude-sonnet-4-6 51s
minimax/MiniMax-M2.5 59s



Day 4 Part 2 — Iterative grid removal simulation

Model Time
mistral/devstral-2512 14s
alibaba/qwen3-coder-next 18s
anthropic/claude-haiku-4-5 19s
openai-codex/gpt-5.3-codex 21s
anthropic/claude-sonnet-4-6 23s
anthropic/claude-opus-4-6 23s
kimi-coding/k2p5 30s
alibaba/qwen3.5-plus 47s
minimax/MiniMax-M2.5 52s
zai/glm-5 217s
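An "iterative grid removal simulation" is structurally a fixed-point loop: remove everything currently removable, recompute, and repeat until nothing changes. The removal rule below (fewer than four occupied 8-neighbors) is purely my guess at the kind of condition involved; only the loop shape is the point:

```python
def total_removed(grid: set[tuple[int, int]], limit: int = 4) -> int:
    """Repeatedly remove cells meeting the (assumed) rule; count removals."""
    occupied = set(grid)
    removed = 0
    while True:
        # Collect this round's removals before mutating, so all cells in a
        # round are judged against the same snapshot of the grid.
        removable = {
            (r, c) for (r, c) in occupied
            if sum((r + dr, c + dc) in occupied
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)) < limit
        }
        if not removable:
            break
        occupied -= removable
        removed += len(removable)
    return removed
```

Batching each round's removals (rather than deleting cells one by one mid-scan) is the usual correctness pitfall in this kind of simulation.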



Day 5 Part 1 — Range membership checking

Model Time
kimi-coding/k2p5 22s
mistral/devstral-2512 22s
openai-codex/gpt-5.3-codex 23s
anthropic/claude-haiku-4-5 25s
anthropic/claude-sonnet-4-6 25s
alibaba/qwen3.5-plus 26s
anthropic/claude-opus-4-6 27s
alibaba/qwen3-coder-next 27s
zai/glm-5 43s
minimax/MiniMax-M2.5 53s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model Time
anthropic/claude-haiku-4-5 21s
kimi-coding/k2p5 21s
anthropic/claude-sonnet-4-6 22s
anthropic/claude-opus-4-6 23s
openai-codex/gpt-5.3-codex 26s
alibaba/qwen3-coder-next 28s
mistral/devstral-2512 29s
alibaba/qwen3.5-plus 30s
minimax/MiniMax-M2.5 41s
zai/glm-5 46s
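"Counting total fresh IDs from overlapping ranges" is presumably the standard interval-union count: sort the ranges, merge overlapping (and adjacent) ones, and sum merged lengths instead of materializing every ID. A sketch, assuming inclusive [lo, hi] ranges; the function name is mine:

```python
def count_covered(ranges: list[tuple[int, int]]) -> int:
    """Number of distinct integers covered by a union of inclusive ranges."""
    total = 0
    cur_lo, cur_hi = None, None
    for lo, hi in sorted(ranges):
        if cur_hi is None or lo > cur_hi + 1:
            # Gap before this range: flush the finished merged block.
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or adjacent: extend the current block.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total
```

Merging keeps the work at O(n log n) in the number of ranges, which matters when the ranges span billions of IDs.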

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
mistral/devstral-2512 23s 13s 23s 14s 24s 21s 22s 14s 22s 29s 205s
anthropic/claude-haiku-4-5 9s 18s 24s 18s 30s 21s 21s 19s 25s 21s 206s
kimi-coding/k2p5 27s 21s 28s 24s 21s 19s 27s 30s 22s 21s 240s
alibaba/qwen3-coder-next 28s 42s 22s 17s 25s 21s 23s 18s 27s 28s 251s
openai-codex/gpt-5.3-codex 36s 20s 26s 27s 31s 25s 31s 21s 23s 26s 266s
anthropic/claude-sonnet-4-6 29s 26s 43s 25s 28s 25s 51s 23s 25s 22s 297s
anthropic/claude-opus-4-6 30s 26s 50s 41s 31s 28s 29s 23s 27s 23s 308s
alibaba/qwen3.5-plus 30s 73s 30s 48s 31s 24s 40s 47s 26s 30s 379s
minimax/MiniMax-M2.5 60s 81s 67s 33s 72s 56s 59s 52s 53s 41s 574s
zai/glm-5 37s 56s 38s 30s 71s 33s 31s 217s 43s 46s 602s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 420 560 482 476 563 617 583 807 417 501 5,426
kimi-coding/k2p5 582 798 550 583 438 571 511 612 500 568 5,713
zai/glm-5 454 1,316 603 536 1,626 507 521 576 788 785 7,712
mistral/devstral-2512 528 849 683 539 860 988 664 826 728 1,460 8,125
anthropic/claude-sonnet-4-6 671 897 1,508 1,031 780 776 783 794 682 658 8,580
anthropic/claude-opus-4-6 592 852 2,165 1,882 728 763 737 742 663 650 9,774
anthropic/claude-haiku-4-5 843 798 1,034 897 1,940 1,099 907 970 1,410 1,365 11,263
alibaba/qwen3-coder-next 956 4,823 1,152 953 800 737 907 1,019 718 1,022 13,087
minimax/MiniMax-M2.5 1,305 2,853 1,333 947 1,973 2,249 1,068 1,046 1,049 940 14,763
alibaba/qwen3.5-plus 1,388 7,840 1,537 3,484 2,188 1,031 1,414 1,056 1,141 1,582 22,661

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0049 .0019 .0020 .0017 .0015 .0015 .0016 .0018 .0016 .0015 $0.02
mistral/devstral-2512 .0079 .0092 .0078 .0093 .0072 .0092 .0064 .0104 .0076 .0155 $0.09
zai/glm-5 .0060 .0132 .0072 .0074 .0385 .0226 .0055 .0074 .0072 .0087 $0.12
minimax/MiniMax-M2.5 .0224 .0297 .0060 .0065 .0234 .0491 .0053 .0089 .0168 .0211 $0.19
anthropic/claude-haiku-4-5 .0224 .0122 .0273 .0092 .0288 .0116 .0233 .0179 .0330 .0172 $0.20
alibaba/qwen3.5-plus .0110 .0308 .0119 .0273 .0127 .0141 .0336 .0425 .0094 .0161 $0.21
openai-codex/gpt-5.3-codex .0178 .0185 .0163 .0167 .0518 .0240 .0340 .0465 .0153 .0124 $0.25
anthropic/claude-sonnet-4-6 .0340 .0272 .0519 .0311 .0349 .0243 .0348 .0272 .0323 .0207 $0.32
alibaba/qwen3-coder-next .0273 .0560 .0104 .0123 .0482 .0599 .0261 .0344 .0525 .0741 $0.40
anthropic/claude-opus-4-6 .1090 .0809 .1351 .0803 .1089 .0942 .1088 .0794 .1029 .1030 $1.00

Observations

All 10 models passed all 10 parts on the first attempt. In the OCaml run, 5 of 9 models failed at Day 1 Part 2. Here, nobody failed anything — no retries needed across the board.

devstral-2512 is the fastest overall at 205s. Fastest or joint-fastest on 4 of 10 parts (D1P2, D2P2, D4P2, and joint on D5P1). 8,125 output tokens total.

claude-haiku-4-5 — 206s total, close behind. Higher token count (11,263) relative to its speed.

gpt-5.3-codex — fewest output tokens: 5,426 total across 10 parts. $0.25 total cost, 266s total time.

kimi-coding/k2p5 — cheapest at ~$0.02. 5,713 tokens, 240s total.

qwen3.5-plus — most tokens: 22,661 total. The D1P2 spike (7,840 tokens for a single part) stands out. Total cost of $0.21, kept low by its cheap per-token rates.

glm-5 — 217s on D4P2, while others solved it in 14–52s. Token usage on that part (576 tok) was normal, so the time was spent elsewhere (execution retries, perhaps).

claude-opus-4-6 — $1.00 total across all 10 parts. Not the slowest (308s), not the most verbose (9,774 tok), but the most expensive at roughly $0.10 per part.

qwen3-coder-next — 251s total, but $0.40 (second-highest cost). The D1P2 token spike (4,823) accounts for much of that.

What's next

Future runs in other languages should show whether these results hold or whether the leaderboard reshuffles when the target language changes.

Token and cost tracking will continue across all future benchmarks.

Benchmarked on 2026-02-25 using pi as the agent harness.


This post was written with AI assistance.