Developers, developers, developers!

Blog about programming, programming and, ah more programming!

Benchmarking LLMs on Advent of Code 2025 (Rust)

Tags = [ Rust, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, Elixir, Elm, Java, ReScript, and Ruby benchmarks, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Rust.

Rust is a compiled systems language with strict ownership rules and a demanding compiler. Models have to deal with borrow-checking, lifetime annotations, and explicit error handling just to get a solution that compiles.
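
To make that concrete, here's the kind of boilerplate every solution starts from: parsing input with explicit error handling instead of unwrap-and-pray. This is just a sketch with a made-up input format; real AoC inputs vary by day.

```rust
use std::num::ParseIntError;

/// Parse one whitespace-separated line of integers, propagating
/// parse failures instead of panicking. The input format here is
/// hypothetical; real AoC inputs vary by day.
fn parse_line(line: &str) -> Result<Vec<i64>, ParseIntError> {
    line.split_whitespace().map(str::parse).collect()
}

fn main() {
    // Happy path: every token parses.
    println!("{:?}", parse_line("3 -1 42"));
    // Error path: the bad token surfaces as a ParseIntError.
    println!("{:?}", parse_line("3 x 42"));
}
```

Collecting an iterator of `Result`s into a `Result<Vec<_>, _>` short-circuits on the first parse error, which is the idiomatic way to push error handling through without a single `unwrap`.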

The contestants

#  Model
1  anthropic/claude-haiku-4-5
2  anthropic/claude-sonnet-4-6
3  anthropic/claude-opus-4-6
4  openai-codex/gpt-5.3-codex
5  zai/glm-5
6  minimax/MiniMax-M2.5
7  kimi-coding/k2p5
8  mistral/devstral-2512
9  alibaba/qwen3.5-plus
10 alibaba/qwen3-coder-next

Ejections

None. All 10 models solved all 10 parts and survived the full benchmark. Two models needed a second attempt on Day 1 Part 2, but nobody was ejected.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model Time
mistral/devstral-2512 12s
anthropic/claude-haiku-4-5 16s
openai-codex/gpt-5.3-codex 16s
anthropic/claude-sonnet-4-6 17s
anthropic/claude-opus-4-6 19s
alibaba/qwen3.5-plus 22s
zai/glm-5 31s
kimi-coding/k2p5 31s
alibaba/qwen3-coder-next 52s
minimax/MiniMax-M2.5 63s



Day 1 Part 2 — Counting zero-crossings during dial rotation

glm-5 and qwen3.5-plus both gave wrong answers on their first attempt. Both got it right on the second try.

Model Time Note
anthropic/claude-haiku-4-5 10s
openai-codex/gpt-5.3-codex 15s
mistral/devstral-2512 18s
anthropic/claude-opus-4-6 21s
anthropic/claude-sonnet-4-6 28s
kimi-coding/k2p5 60s
alibaba/qwen3-coder-next 86s
minimax/MiniMax-M2.5 93s
alibaba/qwen3.5-plus 206s 2nd try
zai/glm-5 212s 2nd try
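
I haven't published the models' code, but assuming a 100-position dial where a crossing means the pointer landing on 0 at any point mid-rotation (both guesses on my part from the puzzle title), the core computation is only a few lines:

```rust
/// Count the times the dial pointer lands on position 0 while
/// applying signed rotations (positive = clockwise). The 100-position
/// dial and the "lands on 0 mid-rotation counts" rule are both
/// assumptions; the real puzzle mechanics may differ.
fn zero_crossings(start: i64, rotations: &[i64]) -> u64 {
    const DIAL: i64 = 100;
    let mut pos = start;
    let mut count = 0;
    for &delta in rotations {
        let step = if delta >= 0 { 1 } else { -1 };
        // Walk one tick at a time so every pass over 0 is observed.
        for _ in 0..delta.abs() {
            pos = (pos + step).rem_euclid(DIAL);
            if pos == 0 {
                count += 1;
            }
        }
    }
    count
}

fn main() {
    // From 98, rotating +5 passes 0 once (99 -> 0 -> 1 -> 2 -> 3).
    println!("{}", zero_crossings(98, &[5]));
}
```

Walking tick by tick is deliberately naive; a closed-form count via `div_euclid` would be faster but much easier to get subtly wrong at the boundaries, which is presumably where the first-attempt failures came from.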



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model Time
anthropic/claude-haiku-4-5 16s
kimi-coding/k2p5 18s
openai-codex/gpt-5.3-codex 20s
alibaba/qwen3.5-plus 20s
anthropic/claude-sonnet-4-6 30s
zai/glm-5 30s
mistral/devstral-2512 30s
anthropic/claude-opus-4-6 32s
minimax/MiniMax-M2.5 36s
alibaba/qwen3-coder-next 197s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model Time
anthropic/claude-haiku-4-5 10s
mistral/devstral-2512 11s
alibaba/qwen3-coder-next 14s
openai-codex/gpt-5.3-codex 16s
alibaba/qwen3.5-plus 21s
kimi-coding/k2p5 22s
zai/glm-5 33s
anthropic/claude-sonnet-4-6 37s
anthropic/claude-opus-4-6 43s
minimax/MiniMax-M2.5 51s
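
My reading of Part 2 is that an ID qualifies when its digit string is some shorter block repeated, whatever the repeat count. Under that assumption (which may not match the real puzzle exactly), the check is short:

```rust
/// True when the ID's digit string is a shorter block repeated at
/// least twice ("1212", "777", "123123123"). This is my reading of
/// the puzzle; the exact rule may differ.
fn is_repeated_pattern(s: &str) -> bool {
    let b = s.as_bytes();
    let n = b.len();
    // Try every block length that divides the string evenly.
    (1..=n / 2)
        .filter(|d| n % d == 0)
        .any(|d| b.chunks(d).all(|chunk| chunk == &b[..d]))
}

fn main() {
    for id in ["1212", "777", "123123123", "1213", "7"] {
        println!("{id}: {}", is_repeated_pattern(id));
    }
}
```

Only block lengths up to half the string need checking, since a block repeated at least twice can never be longer than that.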



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model Time
kimi-coding/k2p5 15s
anthropic/claude-haiku-4-5 16s
anthropic/claude-sonnet-4-6 20s
alibaba/qwen3.5-plus 23s
openai-codex/gpt-5.3-codex 24s
anthropic/claude-opus-4-6 28s
zai/glm-5 30s
mistral/devstral-2512 35s
minimax/MiniMax-M2.5 67s
alibaba/qwen3-coder-next 103s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model Time
mistral/devstral-2512 13s
openai-codex/gpt-5.3-codex 15s
kimi-coding/k2p5 16s
alibaba/qwen3.5-plus 18s
anthropic/claude-sonnet-4-6 19s
anthropic/claude-opus-4-6 24s
zai/glm-5 28s
alibaba/qwen3-coder-next 29s
anthropic/claude-haiku-4-5 33s
minimax/MiniMax-M2.5 41s
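
If "maximizing 12-digit joltage" means picking 12 digits from a bank, in order, to form the largest possible number (my guess from the title), the standard monotonic-stack greedy solves it in linear time:

```rust
/// Pick `k` digits from `digits`, preserving order, to form the
/// largest possible number: the classic monotonic-stack greedy.
/// That this is what the puzzle asks is my guess from the title.
fn max_joltage(digits: &[u8], k: usize) -> Vec<u8> {
    let mut stack: Vec<u8> = Vec::with_capacity(k);
    for (i, &d) in digits.iter().enumerate() {
        // Pop smaller digits while enough input remains to fill k slots.
        while let Some(&top) = stack.last() {
            if top < d && stack.len() + (digits.len() - i) > k {
                stack.pop();
            } else {
                break;
            }
        }
        if stack.len() < k {
            stack.push(d);
        }
    }
    stack
}

fn main() {
    let bank = [3, 1, 4, 1, 5, 9, 2, 6];
    println!("{:?}", max_joltage(&bank, 2)); // best 2-digit pick: [9, 6]
}
```

The `stack.len() + remaining > k` guard is the whole trick: a digit is only discarded when enough input remains to still fill all `k` slots.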



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model Time
mistral/devstral-2512 16s
openai-codex/gpt-5.3-codex 17s
kimi-coding/k2p5 17s
anthropic/claude-haiku-4-5 19s
anthropic/claude-sonnet-4-6 19s
anthropic/claude-opus-4-6 20s
alibaba/qwen3-coder-next 20s
alibaba/qwen3.5-plus 21s
zai/glm-5 38s
minimax/MiniMax-M2.5 64s



Day 4 Part 2 — Iterative grid removal simulation

Model Time
anthropic/claude-haiku-4-5 12s
anthropic/claude-sonnet-4-6 16s
alibaba/qwen3.5-plus 16s
kimi-coding/k2p5 17s
mistral/devstral-2512 18s
anthropic/claude-opus-4-6 22s
openai-codex/gpt-5.3-codex 22s
alibaba/qwen3-coder-next 28s
minimax/MiniMax-M2.5 31s
zai/glm-5 34s
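
Assuming Part 2 means repeatedly removing every accessible roll until none remain, with "accessible" meaning fewer than four occupied neighbors (the threshold, the 8-neighborhood, and the grid glyphs are all my assumptions), a straightforward simulation looks like this:

```rust
/// Repeatedly remove occupied cells ('@') that have fewer than four
/// occupied 8-neighbors, in simultaneous rounds, counting removals.
/// The threshold, the neighborhood, and the glyphs are assumptions
/// based on the puzzle titles, not the real puzzle spec.
fn simulate_removals(grid: &mut [Vec<u8>]) -> usize {
    let (h, w) = (grid.len() as i32, grid[0].len() as i32);
    let mut removed = 0;
    loop {
        // Collect this round's removals first so the scan sees a
        // consistent snapshot of the grid.
        let mut batch = Vec::new();
        for r in 0..h {
            for c in 0..w {
                if grid[r as usize][c as usize] != b'@' {
                    continue;
                }
                let mut occupied = 0;
                for dr in -1..=1 {
                    for dc in -1..=1 {
                        let (nr, nc) = (r + dr, c + dc);
                        if (dr, dc) != (0, 0)
                            && (0..h).contains(&nr)
                            && (0..w).contains(&nc)
                            && grid[nr as usize][nc as usize] == b'@'
                        {
                            occupied += 1;
                        }
                    }
                }
                if occupied < 4 {
                    batch.push((r as usize, c as usize));
                }
            }
        }
        if batch.is_empty() {
            return removed;
        }
        removed += batch.len();
        for (r, c) in batch {
            grid[r][c] = b'.';
        }
    }
}

fn main() {
    let mut grid: Vec<Vec<u8>> =
        ["@@.", "@@@", ".@@"].iter().map(|s| s.bytes().collect()).collect();
    println!("removed {}", simulate_removals(&mut grid));
}
```

Batching removals per round (rather than removing in scan order) is the usual pitfall in this kind of puzzle; whether the real Day 4 wanted simultaneous or sequential removal, I can't say.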



Day 5 Part 1 — Range membership checking

qwen3-coder-next initially submitted a stale answer left over from the previous day while it was still working. Once the dirty stop was cleared, it produced the correct answer.

Model Time
mistral/devstral-2512 14s
anthropic/claude-sonnet-4-6 18s
openai-codex/gpt-5.3-codex 18s
kimi-coding/k2p5 18s
anthropic/claude-haiku-4-5 19s
anthropic/claude-opus-4-6 22s
alibaba/qwen3.5-plus 33s
zai/glm-5 35s
minimax/MiniMax-M2.5 39s
alibaba/qwen3-coder-next 162s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model Time
kimi-coding/k2p5 14s
anthropic/claude-sonnet-4-6 16s
anthropic/claude-haiku-4-5 17s
openai-codex/gpt-5.3-codex 18s
anthropic/claude-opus-4-6 19s
mistral/devstral-2512 25s
minimax/MiniMax-M2.5 29s
alibaba/qwen3.5-plus 30s
zai/glm-5 47s
alibaba/qwen3-coder-next 60s
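
The classic approach here is to sort the ranges, merge overlapping runs, and sum the merged lengths. A sketch, assuming "fresh IDs" means the union of the inclusive ranges:

```rust
/// Count the distinct IDs covered by a set of inclusive ranges by
/// sorting, merging overlapping (or adjacent) runs, and summing run
/// lengths. That "fresh IDs" means this union is my reading of the title.
fn count_covered(mut ranges: Vec<(u64, u64)>) -> u64 {
    ranges.sort();
    let mut total = 0;
    let mut current: Option<(u64, u64)> = None;
    for (lo, hi) in ranges {
        current = match current {
            // Overlapping or adjacent: extend the open run.
            Some((clo, chi)) if lo <= chi + 1 => Some((clo, chi.max(hi))),
            // Disjoint: close the finished run and start a new one.
            Some((clo, chi)) => {
                total += chi - clo + 1;
                Some((lo, hi))
            }
            None => Some((lo, hi)),
        };
    }
    if let Some((clo, chi)) = current {
        total += chi - clo + 1;
    }
    total
}

fn main() {
    // 3..=7 and 5..=10 merge into 3..=10 (8 IDs); 20..=22 adds 3 more.
    println!("{}", count_covered(vec![(3, 7), (20, 22), (5, 10)])); // 11
}
```

Merging adjacent runs (`lo <= chi + 1`) as well as overlapping ones keeps the count correct when two ranges abut without sharing an ID.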

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-haiku-4-5 16s 10s 16s 10s 16s 33s 19s 12s 19s 17s 168s
openai-codex/gpt-5.3-codex 16s 15s 20s 16s 24s 15s 17s 22s 18s 18s 181s
mistral/devstral-2512 12s 18s 30s 11s 35s 13s 16s 18s 14s 25s 192s
anthropic/claude-sonnet-4-6 17s 28s 30s 37s 20s 19s 19s 16s 18s 16s 220s
kimi-coding/k2p5 31s 60s 18s 22s 15s 16s 17s 17s 18s 14s 228s
anthropic/claude-opus-4-6 19s 21s 32s 43s 28s 24s 20s 22s 22s 19s 250s
alibaba/qwen3.5-plus 22s 206s 20s 21s 23s 18s 21s 16s 33s 30s 410s
minimax/MiniMax-M2.5 63s 93s 36s 51s 67s 41s 64s 31s 39s 29s 514s
zai/glm-5 31s 212s 30s 33s 30s 28s 38s 34s 35s 47s 518s
alibaba/qwen3-coder-next 52s 86s 197s 14s 103s 29s 20s 28s 162s 60s 751s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 448 689 575 667 562 630 627 975 617 638 6,428
zai/glm-5 636 1,251 720 783 547 695 896 745 774 837 7,884
anthropic/claude-opus-4-6 732 1,038 1,658 2,416 1,112 1,064 979 979 933 850 11,761
kimi-coding/k2p5 766 4,725 1,026 1,307 585 793 692 1,076 723 743 12,436
anthropic/claude-sonnet-4-6 1,023 1,575 1,984 2,522 1,024 1,121 1,135 1,094 993 952 13,423
minimax/MiniMax-M2.5 1,583 3,010 1,172 1,449 1,913 1,140 1,359 1,041 1,094 923 14,684
anthropic/claude-haiku-4-5 1,315 1,150 1,692 937 1,339 3,084 1,544 1,547 1,667 1,559 15,834
mistral/devstral-2512 610 2,746 2,625 1,574 2,616 1,298 1,225 1,709 568 1,752 16,723
alibaba/qwen3.5-plus 1,198 5,507 1,693 1,550 2,044 1,506 1,244 1,231 1,447 1,400 18,820
alibaba/qwen3-coder-next 1,500 8,238 7,040 808 3,046 1,948 1,203 1,265 5,264 1,991 32,303

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0236 .0299 .0042 .0059 .0099 .0104 .0112 .0111 .0111 .0100 $0.13
zai/glm-5 .0201 .0292 .0086 .0095 .0062 .0084 .0105 .0105 .0342 .0279 $0.17
mistral/devstral-2512 .0101 .0245 .0320 .0274 .0310 .0161 .0111 .0276 .0077 .0187 $0.21
openai-codex/gpt-5.3-codex .0205 .0251 .0262 .0196 .0220 .0173 .0219 .0235 .0193 .0169 $0.21
minimax/MiniMax-M2.5 .0257 .0492 .0047 .0080 .0259 .0295 .0290 .0321 .0169 .0212 $0.24
anthropic/claude-haiku-4-5 .0301 .0163 .0284 .0099 .0361 .0352 .0438 .0237 .0410 .0205 $0.29
alibaba/qwen3.5-plus .0158 .0454 .0140 .0195 .0170 .0194 .0141 .0191 .0909 .0940 $0.35
anthropic/claude-sonnet-4-6 .0415 .0447 .0600 .0633 .0407 .0331 .0426 .0337 .0394 .0271 $0.43
anthropic/claude-opus-4-6 .1148 .0888 .1203 .1003 .1430 .0911 .1193 .0904 .1305 .0964 $1.09
alibaba/qwen3-coder-next .0725 .1456 .1978 .0586 .2160 .1075 .0421 .0488 .4686 .1493 $1.51

Observations

All 10 models solved all 10 parts. Two needed a second attempt on Day 1 Part 2, and one had a dirty stop on Day 5 Part 1, but nobody was ejected.

claude-haiku-4-5 — fastest overall at 168s. Six parts solved in ≤16s. In the Python benchmark it placed second (206s); here it placed first.

gpt-5.3-codex — 181s total, 6,428 tokens. Fewest output tokens of any model (next lowest: glm-5 at 7,884). Also the most token-efficient in the Python run.

devstral-2512 — 192s, third place. Won the Python benchmark (205s). Tied with gpt-5.3-codex for cheapest at $0.21.

kimi-coding/k2p5 — $0.13 total cost, 228s total time. Cheapest model. Was also cheapest in Python ($0.02).

qwen3-coder-next — most expensive at $1.51, slowest at 751s. D5P1 alone cost $0.47 due to a dirty stop. 32,303 total output tokens — 5× more than gpt-5.3-codex.

claude-opus-4-6 — $1.09, second-most expensive. Per-part cost consistently around $0.10. 250s total.

Day 1 Part 2 was the only part where any model needed a retry. Both glm-5 and qwen3.5-plus recovered on the second try, but the retries added 200+ seconds each.

No model got stuck on Rust-specific issues. No borrow-checker loops, no lifetime annotation struggles across the full 10 parts.

Comparison with Python

Model Python time Rust time Δ
claude-haiku-4-5 206s 168s −38s
gpt-5.3-codex 266s 181s −85s
devstral-2512 205s 192s −13s
claude-sonnet-4-6 297s 220s −77s
k2p5 240s 228s −12s
claude-opus-4-6 308s 250s −58s
qwen3.5-plus 379s 410s +31s
MiniMax-M2.5 574s 514s −60s
glm-5 602s 518s −84s
qwen3-coder-next 251s 751s +500s

8 of 10 models were faster in Rust than Python. The two exceptions — qwen3.5-plus and qwen3-coder-next — both lost time to retries and dirty stops.

What's next

The benchmark stopped at Day 5 because Day 6+ inputs and descriptions weren't available yet.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.