Developers, developers, developers!

Blog about programming, programming and, ah more programming!

Benchmarking LLMs on Advent of Code 2025 (Rust)

Tags = [ Rust, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, Elixir, Elm, Java, ReScript, and Ruby benchmarks, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Rust.

Rust is a compiled systems language with strict ownership rules and a demanding compiler. Models have to deal with borrow-checking, lifetime annotations, and explicit error handling just to get a solution that compiles.
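
To make that concrete, here's the kind of boilerplate every solution starts from: parsing input with explicit error handling instead of unwrap-and-pray. This is just a sketch with a made-up input format; real AoC inputs vary by day.

```rust
use std::num::ParseIntError;

/// Parse one whitespace-separated line of integers, propagating
/// parse failures instead of panicking. The input format here is
/// hypothetical; real AoC inputs vary by day.
fn parse_line(line: &str) -> Result<Vec<i64>, ParseIntError> {
    line.split_whitespace().map(str::parse).collect()
}

fn main() {
    // Happy path: every token parses.
    println!("{:?}", parse_line("3 -1 42"));
    // Error path: the bad token surfaces as a ParseIntError.
    println!("{:?}", parse_line("3 x 42"));
}
```

Collecting an iterator of `Result`s into a `Result<Vec<_>, _>` short-circuits on the first parse error, which is the idiomatic way to push error handling through without a single `unwrap`.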

The contestants

#  Model
1  anthropic/claude-haiku-4-5
2  anthropic/claude-sonnet-4-6
3  anthropic/claude-opus-4-6
4  openai-codex/gpt-5.3-codex
5  zai/glm-5
6  minimax/MiniMax-M2.5
7  kimi-coding/k2p5
8  mistral/devstral-2512
9  alibaba/qwen3.5-plus
10 alibaba/qwen3-coder-next

Ejections

None. All 10 models solved all 10 parts and survived the full benchmark. Two models needed a second attempt on Day 1 Part 2, but nobody was ejected.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model Time
mistral/devstral-2512 12s
anthropic/claude-haiku-4-5 16s
openai-codex/gpt-5.3-codex 16s
anthropic/claude-sonnet-4-6 17s
anthropic/claude-opus-4-6 19s
alibaba/qwen3.5-plus 22s
zai/glm-5 31s
kimi-coding/k2p5 31s
alibaba/qwen3-coder-next 52s
minimax/MiniMax-M2.5 63s



Day 1 Part 2 — Counting zero-crossings during dial rotation

glm-5 and qwen3.5-plus both gave wrong answers on their first attempt. Both got it right on the second try.

Model Time Note
anthropic/claude-haiku-4-5 10s
openai-codex/gpt-5.3-codex 15s
mistral/devstral-2512 18s
anthropic/claude-opus-4-6 21s
anthropic/claude-sonnet-4-6 28s
kimi-coding/k2p5 60s
alibaba/qwen3-coder-next 86s
minimax/MiniMax-M2.5 93s
alibaba/qwen3.5-plus 206s 2nd try
zai/glm-5 212s 2nd try
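
I haven't published the models' code, but assuming a 100-position dial where a crossing means the pointer landing on 0 at any point mid-rotation (both guesses on my part from the puzzle title), the core computation is only a few lines:

```rust
/// Count the times the dial pointer lands on position 0 while
/// applying signed rotations (positive = clockwise). The 100-position
/// dial and the "lands on 0 mid-rotation counts" rule are both
/// assumptions; the real puzzle mechanics may differ.
fn zero_crossings(start: i64, rotations: &[i64]) -> u64 {
    const DIAL: i64 = 100;
    let mut pos = start;
    let mut count = 0;
    for &delta in rotations {
        let step = if delta >= 0 { 1 } else { -1 };
        // Walk one tick at a time so every pass over 0 is observed.
        for _ in 0..delta.abs() {
            pos = (pos + step).rem_euclid(DIAL);
            if pos == 0 {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    // From 98, rotating +5 passes 0 once (99 -> 0 -> 1 -> 2 -> 3).
    println!("{}", zero_crossings(98, &[5]));
}
```

Walking tick by tick is deliberately naive; a closed-form count via `div_euclid` would be faster but much easier to get subtly wrong at the boundaries, which is presumably where the first-attempt failures came from.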



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model Time
anthropic/claude-haiku-4-5 16s
kimi-coding/k2p5 18s
openai-codex/gpt-5.3-codex 20s
alibaba/qwen3.5-plus 20s
anthropic/claude-sonnet-4-6 30s
zai/glm-5 30s
mistral/devstral-2512 30s
anthropic/claude-opus-4-6 32s
minimax/MiniMax-M2.5 36s
alibaba/qwen3-coder-next 197s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model Time
anthropic/claude-haiku-4-5 10s
mistral/devstral-2512 11s
alibaba/qwen3-coder-next 14s
openai-codex/gpt-5.3-codex 16s
alibaba/qwen3.5-plus 21s
kimi-coding/k2p5 22s
zai/glm-5 33s
anthropic/claude-sonnet-4-6 37s
anthropic/claude-opus-4-6 43s
minimax/MiniMax-M2.5 51s
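
My reading of Part 2 is that an ID qualifies when its digit string is some shorter block repeated, whatever the repeat count. Under that assumption (which may not match the real puzzle exactly), the check is short:

```rust
/// True when the ID's digit string is a shorter block repeated at
/// least twice ("1212", "777", "123123123"). This is my reading of
/// the puzzle; the exact rule may differ.
fn is_repeated_pattern(s: &str) -> bool {
    let b = s.as_bytes();
    let n = b.len();
    // Try every block length that divides the string evenly.
    (1..=n / 2)
        .filter(|d| n % d == 0)
        .any(|d| b.chunks(d).all(|chunk| chunk == &b[..d]))
}

fn main() {
    for id in ["1212", "777", "123123123", "1213", "7"] {
        println!("{id}: {}", is_repeated_pattern(id));
    }
}
```

Only block lengths up to half the string need checking, since a block repeated at least twice can never be longer than that.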



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model Time
kimi-coding/k2p5 15s
anthropic/claude-haiku-4-5 16s
anthropic/claude-sonnet-4-6 20s
alibaba/qwen3.5-plus 23s
openai-codex/gpt-5.3-codex 24s
anthropic/claude-opus-4-6 28s
zai/glm-5 30s
mistral/devstral-2512 35s
minimax/MiniMax-M2.5 67s
alibaba/qwen3-coder-next 103s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model Time
mistral/devstral-2512 13s
openai-codex/gpt-5.3-codex 15s
kimi-coding/k2p5 16s
alibaba/qwen3.5-plus 18s
anthropic/claude-sonnet-4-6 19s
anthropic/claude-opus-4-6 24s
zai/glm-5 28s
alibaba/qwen3-coder-next 29s
anthropic/claude-haiku-4-5 33s
minimax/MiniMax-M2.5 41s
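
If "maximizing 12-digit joltage" means picking 12 digits from a bank, in order, to form the largest possible number (my guess from the title), the standard monotonic-stack greedy solves it in linear time:

```rust
/// Pick `k` digits from `digits`, preserving order, to form the
/// largest possible number: the classic monotonic-stack greedy.
/// That this is what the puzzle asks is my guess from the title.
fn max_joltage(digits: &[u8], k: usize) -> Vec<u8> {
    let mut stack: Vec<u8> = Vec::with_capacity(k);
    for (i, &d) in digits.iter().enumerate() {
        // Pop smaller digits while enough input remains to fill k slots.
        while let Some(&top) = stack.last() {
            if top < d && stack.len() + (digits.len() - i) > k {
                stack.pop();
            } else {
                break;
            }
        }
        if stack.len() < k {
            stack.push(d);
        }
    }
    stack
}

fn main() {
    let bank = [3, 1, 4, 1, 5, 9, 2, 6];
    println!("{:?}", max_joltage(&bank, 2)); // best 2-digit pick: [9, 6]
}
```

The `stack.len() + remaining > k` guard is the whole trick: a digit is only discarded when enough input remains to still fill all `k` slots.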



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model Time
mistral/devstral-2512 16s
openai-codex/gpt-5.3-codex 17s
kimi-coding/k2p5 17s
anthropic/claude-haiku-4-5 19s
anthropic/claude-sonnet-4-6 19s
anthropic/claude-opus-4-6 20s
alibaba/qwen3-coder-next 20s
alibaba/qwen3.5-plus 21s
zai/glm-5 38s
minimax/MiniMax-M2.5 64s



Day 4 Part 2 — Iterative grid removal simulation

Model Time
anthropic/claude-haiku-4-5 12s
anthropic/claude-sonnet-4-6 16s
alibaba/qwen3.5-plus 16s
kimi-coding/k2p5 17s
mistral/devstral-2512 18s
anthropic/claude-opus-4-6 22s
openai-codex/gpt-5.3-codex 22s
alibaba/qwen3-coder-next 28s
minimax/MiniMax-M2.5 31s
zai/glm-5 34s
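
Assuming Part 2 means repeatedly removing every accessible roll until none remain, with "accessible" meaning fewer than four occupied neighbors (the threshold, the 8-neighborhood, and the grid glyphs are all my assumptions), a straightforward simulation looks like this:

```rust
/// Repeatedly remove occupied cells ('@') that have fewer than four
/// occupied 8-neighbors, in simultaneous rounds, counting removals.
/// The threshold, the neighborhood, and the glyphs are assumptions
/// based on the puzzle titles, not the real puzzle spec.
fn simulate_removals(grid: &mut [Vec<u8>]) -> usize {
    let (h, w) = (grid.len() as i32, grid[0].len() as i32);
    let mut removed = 0;
    loop {
        // Collect this round's removals first so the scan sees a
        // consistent snapshot of the grid.
        let mut batch = Vec::new();
        for r in 0..h {
            for c in 0..w {
                if grid[r as usize][c as usize] != b'@' {
                    continue;
                }
                let mut occupied = 0;
                for dr in -1..=1 {
                    for dc in -1..=1 {
                        let (nr, nc) = (r + dr, c + dc);
                        if (dr, dc) != (0, 0)
                            && (0..h).contains(&nr)
                            && (0..w).contains(&nc)
                            && grid[nr as usize][nc as usize] == b'@'
                        {
                            occupied += 1;
                        }
                    }
                }
                if occupied < 4 {
                    batch.push((r as usize, c as usize));
                }
            }
        }
        if batch.is_empty() {
            return removed;
        }
        removed += batch.len();
        for (r, c) in batch {
            grid[r][c] = b'.';
        }
    }
}

fn main() {
    let mut grid: Vec<Vec<u8>> =
        ["@@.", "@@@", ".@@"].iter().map(|s| s.bytes().collect()).collect();
    println!("removed {}", simulate_removals(&mut grid));
}
```

Batching removals per round (rather than removing in scan order) is the usual pitfall in this kind of puzzle; whether the real Day 4 wanted simultaneous or sequential removal, I can't say.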



Day 5 Part 1 — Range membership checking

qwen3-coder-next initially submitted a stale answer left over from the previous day while it was still working. Once the dirty stop was cleared, it produced the correct answer.

Model Time
mistral/devstral-2512 14s
anthropic/claude-sonnet-4-6 18s
openai-codex/gpt-5.3-codex 18s
kimi-coding/k2p5 18s
anthropic/claude-haiku-4-5 19s
anthropic/claude-opus-4-6 22s
alibaba/qwen3.5-plus 33s
zai/glm-5 35s
minimax/MiniMax-M2.5 39s
alibaba/qwen3-coder-next 162s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model Time
kimi-coding/k2p5 14s
anthropic/claude-sonnet-4-6 16s
anthropic/claude-haiku-4-5 17s
openai-codex/gpt-5.3-codex 18s
anthropic/claude-opus-4-6 19s
mistral/devstral-2512 25s
minimax/MiniMax-M2.5 29s
alibaba/qwen3.5-plus 30s
zai/glm-5 47s
alibaba/qwen3-coder-next 60s
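
The classic approach here is to sort the ranges, merge overlapping runs, and sum the merged lengths. A sketch, assuming "fresh IDs" means the union of the inclusive ranges:

```rust
/// Count the distinct IDs covered by a set of inclusive ranges by
/// sorting, merging overlapping (or adjacent) runs, and summing run
/// lengths. That "fresh IDs" means this union is my reading of the title.
fn count_covered(mut ranges: Vec<(u64, u64)>) -> u64 {
    ranges.sort();
    let mut total = 0;
    let mut current: Option<(u64, u64)> = None;
    for (lo, hi) in ranges {
        current = match current {
            // Overlapping or adjacent: extend the open run.
            Some((clo, chi)) if lo <= chi + 1 => Some((clo, chi.max(hi))),
            // Disjoint: close the finished run and start a new one.
            Some((clo, chi)) => {
                total += chi - clo + 1;
                Some((lo, hi))
            }
            None => Some((lo, hi)),
        };
    }
    if let Some((clo, chi)) = current {
        total += chi - clo + 1;
    }
    total
}

fn main() {
    // 3..=7 and 5..=10 merge into 3..=10 (8 IDs); 20..=22 adds 3 more.
    println!("{}", count_covered(vec![(3, 7), (20, 22), (5, 10)])); // 11
}
```

Merging adjacent runs (`lo <= chi + 1`) as well as overlapping ones keeps the count correct when two ranges abut without sharing an ID.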

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 16s 10s 16s 10s 16s 33s 19s 12s 19s 17s 168s
openai-codex/gpt-5.3-codex 16s 15s 20s 16s 24s 15s 17s 22s 18s 18s 181s
mistral/devstral-2512 12s 18s 30s 11s 35s 13s 16s 18s 14s 25s 192s
anthropic/claude-sonnet-4-6 17s 28s 30s 37s 20s 19s 19s 16s 18s 16s 220s
kimi-coding/k2p5 31s 60s 18s 22s 15s 16s 17s 17s 18s 14s 228s
anthropic/claude-opus-4-6 19s 21s 32s 43s 28s 24s 20s 22s 22s 19s 250s
alibaba/qwen3.5-plus 22s 206s 20s 21s 23s 18s 21s 16s 33s 30s 410s
minimax/MiniMax-M2.5 63s 93s 36s 51s 67s 41s 64s 31s 39s 29s 514s
zai/glm-5 31s 212s 30s 33s 30s 28s 38s 34s 35s 47s 518s
alibaba/qwen3-coder-next 52s 86s 197s 14s 103s 29s 20s 28s 162s 60s 751s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 448 689 575 667 562 630 627 975 617 638 6,428
zai/glm-5 636 1,251 720 783 547 695 896 745 774 837 7,884
anthropic/claude-opus-4-6 732 1,038 1,658 2,416 1,112 1,064 979 979 933 850 11,761
kimi-coding/k2p5 766 4,725 1,026 1,307 585 793 692 1,076 723 743 12,436
anthropic/claude-sonnet-4-6 1,023 1,575 1,984 2,522 1,024 1,121 1,135 1,094 993 952 13,423
minimax/MiniMax-M2.5 1,583 3,010 1,172 1,449 1,913 1,140 1,359 1,041 1,094 923 14,684
anthropic/claude-haiku-4-5 1,315 1,150 1,692 937 1,339 3,084 1,544 1,547 1,667 1,559 15,834
mistral/devstral-2512 610 2,746 2,625 1,574 2,616 1,298 1,225 1,709 568 1,752 16,723
alibaba/qwen3.5-plus 1,198 5,507 1,693 1,550 2,044 1,506 1,244 1,231 1,447 1,400 18,820
alibaba/qwen3-coder-next 1,500 8,238 7,040 808 3,046 1,948 1,203 1,265 5,264 1,991 32,303

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0236 .0299 .0042 .0059 .0099 .0104 .0112 .0111 .0111 .0100 $0.13
zai/glm-5 .0201 .0292 .0086 .0095 .0062 .0084 .0105 .0105 .0342 .0279 $0.17
mistral/devstral-2512 .0101 .0245 .0320 .0274 .0310 .0161 .0111 .0276 .0077 .0187 $0.21
openai-codex/gpt-5.3-codex .0205 .0251 .0262 .0196 .0220 .0173 .0219 .0235 .0193 .0169 $0.21
minimax/MiniMax-M2.5 .0257 .0492 .0047 .0080 .0259 .0295 .0290 .0321 .0169 .0212 $0.24
anthropic/claude-haiku-4-5 .0301 .0163 .0284 .0099 .0361 .0352 .0438 .0237 .0410 .0205 $0.29
alibaba/qwen3.5-plus .0158 .0454 .0140 .0195 .0170 .0194 .0141 .0191 .0909 .0940 $0.35
anthropic/claude-sonnet-4-6 .0415 .0447 .0600 .0633 .0407 .0331 .0426 .0337 .0394 .0271 $0.43
anthropic/claude-opus-4-6 .1148 .0888 .1203 .1003 .1430 .0911 .1193 .0904 .1305 .0964 $1.09
alibaba/qwen3-coder-next .0725 .1456 .1978 .0586 .2160 .1075 .0421 .0488 .4686 .1493 $1.51

Observations

All 10 models solved all 10 parts. Two needed a second attempt on Day 1 Part 2, and one had a dirty stop on Day 5 Part 1, but nobody was ejected.

claude-haiku-4-5 — fastest overall at 168s. Six parts solved in ≤16s. In the Python benchmark it placed second (206s); here it placed first.

gpt-5.3-codex — 181s total, 6,428 tokens. Fewest output tokens of any model (next lowest: glm-5 at 7,884). Also the most token-efficient in the Python run.

devstral-2512 — 192s, third place. Won the Python benchmark (205s). Tied with gpt-5.3-codex for cheapest at $0.21.

kimi-coding/k2p5 — $0.13 total cost, 228s total time. Cheapest model. Was also cheapest in Python ($0.02).

qwen3-coder-next — most expensive at $1.51, slowest at 751s. D5P1 alone cost $0.47 due to a dirty stop. 32,303 total output tokens — 5× more than gpt-5.3-codex.

claude-opus-4-6 — $1.09, second-most expensive. Per-part cost consistently around $0.10. 250s total.

Day 1 Part 2 was the only part where any model needed a retry. Both glm-5 and qwen3.5-plus recovered on the second try, but the retries added 200+ seconds each.

No model got stuck on Rust-specific issues. No borrow-checker loops, no lifetime annotation struggles across the full 10 parts.

Comparison with Python

Model Python time Rust time Δ
claude-haiku-4-5 206s 168s −38s
gpt-5.3-codex 266s 181s −85s
devstral-2512 205s 192s −13s
claude-sonnet-4-6 297s 220s −77s
k2p5 240s 228s −12s
claude-opus-4-6 308s 250s −58s
qwen3.5-plus 379s 410s +31s
MiniMax-M2.5 574s 514s −60s
glm-5 602s 518s −84s
qwen3-coder-next 251s 751s +500s

8 of 10 models were faster in Rust than Python. The two exceptions — qwen3.5-plus and qwen3-coder-next — both lost time to retries and dirty stops.

What's next

The benchmark stopped at Day 5 because Day 6+ inputs and descriptions weren't available yet.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.