Following up on the Haskell, OCaml, Python, Elixir, Elm, Java, ReScript, and Ruby benchmarks, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Rust.
Rust is a compiled systems language with strict ownership rules and a demanding compiler. Models have to deal with borrow-checking, lifetime annotations, and explicit error handling just to get a solution that compiles.
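As a taste of what "just compiles" means in practice, here is a minimal sketch of the explicit error handling a typical solution needs simply to read its input. The line-per-integer format is illustrative, not taken from any specific puzzle:

```rust
use std::num::ParseIntError;

/// Parse one signed integer per line. Collecting an iterator of
/// `Result`s stops at the first failure and returns it, so bad input
/// surfaces as an `Err` instead of a panic.
fn parse_lines(input: &str) -> Result<Vec<i64>, ParseIntError> {
    input.lines().map(|line| line.trim().parse::<i64>()).collect()
}
```

A caller can then `?`-propagate or match on the error — ceremony that solutions in dynamic languages skip entirely.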
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models solved all 10 parts and survived the full benchmark. Two models needed a second attempt on Day 1 Part 2, but nobody was ejected.
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| mistral/devstral-2512 | 12s |
| anthropic/claude-haiku-4-5 | 16s |
| openai-codex/gpt-5.3-codex | 16s |
| anthropic/claude-sonnet-4-6 | 17s |
| anthropic/claude-opus-4-6 | 19s |
| alibaba/qwen3.5-plus | 22s |
| zai/glm-5 | 31s |
| kimi-coding/k2p5 | 31s |
| alibaba/qwen3-coder-next | 52s |
| minimax/MiniMax-M2.5 | 63s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
glm-5 and qwen3.5-plus both gave wrong answers on their first attempt. Both
got it right on the second try.
| Model | Time | Note |
|---|---|---|
| anthropic/claude-haiku-4-5 | 10s | |
| openai-codex/gpt-5.3-codex | 15s | |
| mistral/devstral-2512 | 18s | |
| anthropic/claude-opus-4-6 | 21s | |
| anthropic/claude-sonnet-4-6 | 28s | |
| kimi-coding/k2p5 | 60s | |
| alibaba/qwen3-coder-next | 86s | |
| minimax/MiniMax-M2.5 | 93s | |
| alibaba/qwen3.5-plus | 206s | 2nd try |
| zai/glm-5 | 212s | 2nd try |
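The puzzle text isn't restated here, but a likely trap in this part is counting every time the dial passes zero mid-rotation, not just when it ends there. A hedged sketch, assuming a circular dial and signed rotation amounts (both assumptions from the task title):

```rust
/// Count how many times the pointer crosses position 0 while applying
/// each signed rotation. Dial size and crossing semantics are
/// assumptions, not taken from the actual puzzle statement.
fn count_zero_crossings(start: i64, rotations: &[i64], dial: i64) -> i64 {
    let mut pos = start.rem_euclid(dial);
    let mut crossings = 0;
    for &r in rotations {
        let dir = r.signum();
        // Step one notch at a time: simple and obviously correct,
        // fast enough for modest rotation amounts.
        for _ in 0..r.abs() {
            pos = (pos + dir).rem_euclid(dial);
            if pos == 0 {
                crossings += 1;
            }
        }
    }
    crossings
}
```

A closed-form version using division would avoid the inner loop for huge rotations, at the cost of fiddlier boundary logic.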
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 16s |
| kimi-coding/k2p5 | 18s |
| openai-codex/gpt-5.3-codex | 20s |
| alibaba/qwen3.5-plus | 20s |
| anthropic/claude-sonnet-4-6 | 30s |
| zai/glm-5 | 30s |
| mistral/devstral-2512 | 30s |
| anthropic/claude-opus-4-6 | 32s |
| minimax/MiniMax-M2.5 | 36s |
| alibaba/qwen3-coder-next | 197s |
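Reading "repeated-digit IDs" as IDs whose decimal string is some digit block written exactly twice (an assumption from the task title, not the puzzle text), the check is a string comparison:

```rust
/// True if the decimal representation of `id` is some digit block
/// written exactly twice (e.g. 1212 = "12" + "12"). Whether this is
/// the puzzle's exact rule is a guess from the task title.
fn is_doubled(id: u64) -> bool {
    let s = id.to_string();
    let n = s.len();
    n % 2 == 0 && s[..n / 2] == s[n / 2..]
}

/// Sum all matching IDs in an inclusive range.
fn sum_doubled(lo: u64, hi: u64) -> u64 {
    (lo..=hi).filter(|&id| is_doubled(id)).sum()
}
```

Brute-forcing each range is fine for small inputs; large ranges would call for generating the doubled numbers directly instead.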
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 10s |
| mistral/devstral-2512 | 11s |
| alibaba/qwen3-coder-next | 14s |
| openai-codex/gpt-5.3-codex | 16s |
| alibaba/qwen3.5-plus | 21s |
| kimi-coding/k2p5 | 22s |
| zai/glm-5 | 33s |
| anthropic/claude-sonnet-4-6 | 37s |
| anthropic/claude-opus-4-6 | 43s |
| minimax/MiniMax-M2.5 | 51s |
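Part 2's "any repeat count" generalizes the same idea: try every block length that divides the total length. Again, that this is the exact rule is an assumption from the task title:

```rust
/// True if the decimal string of `id` is some block repeated k >= 2
/// times (e.g. 121212 = "12" x 3). The rule is inferred from the task
/// title, not the puzzle text.
fn is_repeated_pattern(id: u64) -> bool {
    let s = id.to_string();
    let bytes = s.as_bytes();
    let n = bytes.len();
    // Check every proper block length that divides the total length.
    (1..n)
        .filter(|&len| n % len == 0)
        .any(|len| bytes.chunks(len).all(|chunk| chunk == &bytes[..len]))
}
```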
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| kimi-coding/k2p5 | 15s |
| anthropic/claude-haiku-4-5 | 16s |
| anthropic/claude-sonnet-4-6 | 20s |
| alibaba/qwen3.5-plus | 23s |
| openai-codex/gpt-5.3-codex | 24s |
| anthropic/claude-opus-4-6 | 28s |
| zai/glm-5 | 30s |
| mistral/devstral-2512 | 35s |
| minimax/MiniMax-M2.5 | 67s |
| alibaba/qwen3-coder-next | 103s |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| mistral/devstral-2512 | 13s |
| openai-codex/gpt-5.3-codex | 15s |
| kimi-coding/k2p5 | 16s |
| alibaba/qwen3.5-plus | 18s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-opus-4-6 | 24s |
| zai/glm-5 | 28s |
| alibaba/qwen3-coder-next | 29s |
| anthropic/claude-haiku-4-5 | 33s |
| minimax/MiniMax-M2.5 | 41s |
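If "maximizing joltage" means picking k digits from each bank in their original order to form the largest number (an assumption from the titles — the real constraints may differ), the classic tool is a greedy monotonic stack:

```rust
/// Largest number formed by keeping `k` digits of `digits` in their
/// original order, via a greedy monotonic stack. Assumes k <= digits.len().
/// That this matches the puzzle's rule is a guess from the task title.
fn max_subsequence(digits: &str, k: usize) -> String {
    let bytes = digits.as_bytes();
    let mut stack: Vec<u8> = Vec::with_capacity(k);
    for (i, &d) in bytes.iter().enumerate() {
        // Pop smaller digits while enough digits remain to fill k slots.
        while let Some(&top) = stack.last() {
            if top < d && stack.len() + (bytes.len() - i) > k {
                stack.pop();
            } else {
                break;
            }
        }
        if stack.len() < k {
            stack.push(d);
        }
    }
    String::from_utf8(stack).unwrap()
}
```

With `k = 2` the same routine would cover Part 1's 2-digit variant.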
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 16s |
| openai-codex/gpt-5.3-codex | 17s |
| kimi-coding/k2p5 | 17s |
| anthropic/claude-haiku-4-5 | 19s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-opus-4-6 | 20s |
| alibaba/qwen3-coder-next | 20s |
| alibaba/qwen3.5-plus | 21s |
| zai/glm-5 | 38s |
| minimax/MiniMax-M2.5 | 64s |
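The precise "accessible" criterion isn't restated here, so this sketch guesses at the shape of the task: count occupied cells with fewer than some threshold of occupied 8-neighbors. The `'@'` marker and the threshold are assumptions:

```rust
/// Count grid cells holding a roll ('@' here, an assumed marker) whose
/// occupied 8-neighbor count is below `max_neighbors`. The exact
/// accessibility rule is a guess, not taken from the puzzle statement.
fn count_accessible(grid: &[&str], max_neighbors: usize) -> usize {
    let cells: Vec<&[u8]> = grid.iter().map(|r| r.as_bytes()).collect();
    let (h, w) = (cells.len() as i64, cells[0].len() as i64);
    let mut count = 0;
    for r in 0..h {
        for c in 0..w {
            if cells[r as usize][c as usize] != b'@' {
                continue;
            }
            let mut occupied: usize = 0;
            for dr in -1..=1i64 {
                for dc in -1..=1i64 {
                    if dr == 0 && dc == 0 {
                        continue;
                    }
                    let (nr, nc) = (r + dr, c + dc);
                    if nr >= 0 && nr < h && nc >= 0 && nc < w
                        && cells[nr as usize][nc as usize] == b'@'
                    {
                        occupied += 1;
                    }
                }
            }
            if occupied < max_neighbors {
                count += 1;
            }
        }
    }
    count
}
```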
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 12s |
| anthropic/claude-sonnet-4-6 | 16s |
| alibaba/qwen3.5-plus | 16s |
| kimi-coding/k2p5 | 17s |
| mistral/devstral-2512 | 18s |
| anthropic/claude-opus-4-6 | 22s |
| openai-codex/gpt-5.3-codex | 22s |
| alibaba/qwen3-coder-next | 28s |
| minimax/MiniMax-M2.5 | 31s |
| zai/glm-5 | 34s |
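Part 2's simulation is the usual fixed-point loop: remove everything removable, repeat until nothing changes. This sketch assumes Part 1's "accessible" rule means fewer than four occupied 8-neighbors (a guess); the key detail is batching each round so every check sees the same snapshot:

```rust
/// Repeatedly remove every occupied cell with fewer than 4 occupied
/// 8-neighbors until nothing changes, returning the total removed.
/// The removal criterion is assumed, not taken from the puzzle text.
fn simulate_removal(grid: &[&str]) -> usize {
    let mut cells: Vec<Vec<u8>> =
        grid.iter().map(|r| r.as_bytes().to_vec()).collect();
    let (h, w) = (cells.len() as i64, cells[0].len() as i64);
    let neighbors = |cells: &Vec<Vec<u8>>, r: i64, c: i64| -> usize {
        let mut n = 0;
        for dr in -1..=1i64 {
            for dc in -1..=1i64 {
                if dr == 0 && dc == 0 {
                    continue;
                }
                let (nr, nc) = (r + dr, c + dc);
                if nr >= 0 && nr < h && nc >= 0 && nc < w
                    && cells[nr as usize][nc as usize] == b'@'
                {
                    n += 1;
                }
            }
        }
        n
    };
    let mut removed = 0;
    loop {
        // Collect this round's removals first, then apply them together,
        // so removals within a round don't influence each other.
        let mut batch = Vec::new();
        for r in 0..h {
            for c in 0..w {
                if cells[r as usize][c as usize] == b'@'
                    && neighbors(&cells, r, c) < 4
                {
                    batch.push((r as usize, c as usize));
                }
            }
        }
        if batch.is_empty() {
            return removed;
        }
        removed += batch.len();
        for (r, c) in batch {
            cells[r][c] = b'.';
        }
    }
}
```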
Day 5 Part 1 — Range membership checking
qwen3-coder-next initially submitted a stale answer left over from the previous day while it was still working (a dirty stop). Once that was cleared, it produced the correct answer.
| Model | Time |
|---|---|
| mistral/devstral-2512 | 14s |
| anthropic/claude-sonnet-4-6 | 18s |
| openai-codex/gpt-5.3-codex | 18s |
| kimi-coding/k2p5 | 18s |
| anthropic/claude-haiku-4-5 | 19s |
| anthropic/claude-opus-4-6 | 22s |
| alibaba/qwen3.5-plus | 33s |
| zai/glm-5 | 35s |
| minimax/MiniMax-M2.5 | 39s |
| alibaba/qwen3-coder-next | 162s |
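Part 1's membership check is the simplest task of the set. A hedged sketch, assuming the input is a list of inclusive ranges and a set of IDs to test (inferred from the task titles):

```rust
/// True if `id` lies in any inclusive (lo, hi) range. A plain linear
/// scan; the "fresh ID" framing is assumed from the task titles.
fn in_any_range(ranges: &[(u64, u64)], id: u64) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= id && id <= hi)
}
```

For many queries over many ranges, sorting the ranges and binary-searching each ID would be the natural upgrade.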
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| kimi-coding/k2p5 | 14s |
| anthropic/claude-sonnet-4-6 | 16s |
| anthropic/claude-haiku-4-5 | 17s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-opus-4-6 | 19s |
| mistral/devstral-2512 | 25s |
| minimax/MiniMax-M2.5 | 29s |
| alibaba/qwen3.5-plus | 30s |
| zai/glm-5 | 47s |
| alibaba/qwen3-coder-next | 60s |
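Counting the union of overlapping ranges is a sort-and-merge problem; enumerating IDs one by one would be far too slow if the ranges are large. A sketch, assuming the answer is the number of integers covered by at least one range (inferred from the task title):

```rust
/// Count integers covered by at least one inclusive range, by sorting
/// the ranges and merging overlapping or adjacent ones.
fn count_covered(mut ranges: Vec<(u64, u64)>) -> u64 {
    ranges.sort_unstable();
    let mut total = 0;
    let mut current: Option<(u64, u64)> = None;
    for (lo, hi) in ranges {
        match current {
            // Overlapping or adjacent: extend the merged range.
            Some((clo, chi)) if lo <= chi + 1 => {
                current = Some((clo, chi.max(hi)));
            }
            // Disjoint: bank the finished range and start a new one.
            Some((clo, chi)) => {
                total += chi - clo + 1;
                current = Some((lo, hi));
            }
            None => current = Some((lo, hi)),
        }
    }
    if let Some((clo, chi)) = current {
        total += chi - clo + 1;
    }
    total
}
```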
Speed vs accuracy
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | 16s | 10s | 16s | 10s | 16s | 33s | 19s | 12s | 19s | 17s | 168s |
| openai-codex/gpt-5.3-codex | 16s | 15s | 20s | 16s | 24s | 15s | 17s | 22s | 18s | 18s | 181s |
| mistral/devstral-2512 | 12s | 18s | 30s | 11s | 35s | 13s | 16s | 18s | 14s | 25s | 192s |
| anthropic/claude-sonnet-4-6 | 17s | 28s | 30s | 37s | 20s | 19s | 19s | 16s | 18s | 16s | 220s |
| kimi-coding/k2p5 | 31s | 60s | 18s | 22s | 15s | 16s | 17s | 17s | 18s | 14s | 228s |
| anthropic/claude-opus-4-6 | 19s | 21s | 32s | 43s | 28s | 24s | 20s | 22s | 22s | 19s | 250s |
| alibaba/qwen3.5-plus | 22s | 206s | 20s | 21s | 23s | 18s | 21s | 16s | 33s | 30s | 410s |
| minimax/MiniMax-M2.5 | 63s | 93s | 36s | 51s | 67s | 41s | 64s | 31s | 39s | 29s | 514s |
| zai/glm-5 | 31s | 212s | 30s | 33s | 30s | 28s | 38s | 34s | 35s | 47s | 518s |
| alibaba/qwen3-coder-next | 52s | 86s | 197s | 14s | 103s | 29s | 20s | 28s | 162s | 60s | 751s |
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 448 | 689 | 575 | 667 | 562 | 630 | 627 | 975 | 617 | 638 | 6,428 |
| zai/glm-5 | 636 | 1,251 | 720 | 783 | 547 | 695 | 896 | 745 | 774 | 837 | 7,884 |
| anthropic/claude-opus-4-6 | 732 | 1,038 | 1,658 | 2,416 | 1,112 | 1,064 | 979 | 979 | 933 | 850 | 11,761 |
| kimi-coding/k2p5 | 766 | 4,725 | 1,026 | 1,307 | 585 | 793 | 692 | 1,076 | 723 | 743 | 12,436 |
| anthropic/claude-sonnet-4-6 | 1,023 | 1,575 | 1,984 | 2,522 | 1,024 | 1,121 | 1,135 | 1,094 | 993 | 952 | 13,423 |
| minimax/MiniMax-M2.5 | 1,583 | 3,010 | 1,172 | 1,449 | 1,913 | 1,140 | 1,359 | 1,041 | 1,094 | 923 | 14,684 |
| anthropic/claude-haiku-4-5 | 1,315 | 1,150 | 1,692 | 937 | 1,339 | 3,084 | 1,544 | 1,547 | 1,667 | 1,559 | 15,834 |
| mistral/devstral-2512 | 610 | 2,746 | 2,625 | 1,574 | 2,616 | 1,298 | 1,225 | 1,709 | 568 | 1,752 | 16,723 |
| alibaba/qwen3.5-plus | 1,198 | 5,507 | 1,693 | 1,550 | 2,044 | 1,506 | 1,244 | 1,231 | 1,447 | 1,400 | 18,820 |
| alibaba/qwen3-coder-next | 1,500 | 8,238 | 7,040 | 808 | 3,046 | 1,948 | 1,203 | 1,265 | 5,264 | 1,991 | 32,303 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0236 | .0299 | .0042 | .0059 | .0099 | .0104 | .0112 | .0111 | .0111 | .0100 | $0.13 |
| zai/glm-5 | .0201 | .0292 | .0086 | .0095 | .0062 | .0084 | .0105 | .0105 | .0342 | .0279 | $0.17 |
| mistral/devstral-2512 | .0101 | .0245 | .0320 | .0274 | .0310 | .0161 | .0111 | .0276 | .0077 | .0187 | $0.21 |
| openai-codex/gpt-5.3-codex | .0205 | .0251 | .0262 | .0196 | .0220 | .0173 | .0219 | .0235 | .0193 | .0169 | $0.21 |
| minimax/MiniMax-M2.5 | .0257 | .0492 | .0047 | .0080 | .0259 | .0295 | .0290 | .0321 | .0169 | .0212 | $0.24 |
| anthropic/claude-haiku-4-5 | .0301 | .0163 | .0284 | .0099 | .0361 | .0352 | .0438 | .0237 | .0410 | .0205 | $0.29 |
| alibaba/qwen3.5-plus | .0158 | .0454 | .0140 | .0195 | .0170 | .0194 | .0141 | .0191 | .0909 | .0940 | $0.35 |
| anthropic/claude-sonnet-4-6 | .0415 | .0447 | .0600 | .0633 | .0407 | .0331 | .0426 | .0337 | .0394 | .0271 | $0.43 |
| anthropic/claude-opus-4-6 | .1148 | .0888 | .1203 | .1003 | .1430 | .0911 | .1193 | .0904 | .1305 | .0964 | $1.09 |
| alibaba/qwen3-coder-next | .0725 | .1456 | .1978 | .0586 | .2160 | .1075 | .0421 | .0488 | .4686 | .1493 | $1.51 |
Observations
All 10 models solved all 10 parts. Two needed a second attempt on Day 1 Part 2, and one had a dirty stop on Day 5 Part 1, but nobody was ejected.
claude-haiku-4-5 — fastest overall at 168s. Six parts solved in ≤16s. In the
Python benchmark it placed second (206s); here it placed first.
gpt-5.3-codex — 181s total, 6,428 tokens. Fewest output tokens of any model
(next lowest: glm-5 at 7,884). Also the most token-efficient in the Python run.
devstral-2512 — 192s, third place. Won the Python benchmark (205s). Tied with
gpt-5.3-codex for third-cheapest at $0.21.
kimi-coding/k2p5 — $0.13 total cost, 228s total time. Cheapest model. Was also
cheapest in Python ($0.02).
qwen3-coder-next — most expensive at $1.51, slowest at 751s. D5P1 alone cost
$0.47 due to a dirty stop. 32,303 total output tokens — 5× more than gpt-5.3-codex.
claude-opus-4-6 — $1.09, second-most expensive. Per-part cost consistently
around $0.10. 250s total.
Day 1 Part 2 was the only part where any model needed a retry. Both glm-5 and
qwen3.5-plus recovered on the second try, but the retries added 200+ seconds each.
No model got stuck on Rust-specific issues. No borrow-checker loops, no lifetime annotation struggles across the full 10 parts.
Comparison with Python
| Model | Python time | Rust time | Δ |
|---|---|---|---|
| claude-haiku-4-5 | 206s | 168s | −38s |
| gpt-5.3-codex | 266s | 181s | −85s |
| devstral-2512 | 205s | 192s | −13s |
| claude-sonnet-4-6 | 297s | 220s | −77s |
| k2p5 | 240s | 228s | −12s |
| claude-opus-4-6 | 308s | 250s | −58s |
| qwen3.5-plus | 379s | 410s | +31s |
| MiniMax-M2.5 | 574s | 514s | −60s |
| glm-5 | 602s | 518s | −84s |
| qwen3-coder-next | 251s | 751s | +500s |
8 of 10 models were faster in Rust than in Python. The two exceptions, qwen3.5-plus
and qwen3-coder-next, lost time to a Day 1 Part 2 retry and a Day 5 Part 1 dirty
stop respectively; qwen3-coder-next was also the slowest model on several clean parts.
What's next
The benchmark stopped at Day 5 because Day 6+ inputs and descriptions weren't available yet.
Benchmarked on 2026-02-27 using pi as the agent harness.
This post was written with AI assistance.