Following up on the Haskell benchmark and the OCaml benchmark, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Python.
This is also the first run with full token usage and API cost tracking per part, which adds a new angle beyond raw wall-clock time.
## The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
A wrong model ID (claude-3-5-haiku-latest) had accidentally been left in the enabled-model list at the start. It was caught immediately and killed, and claude-haiku-4-5 was launched as a replacement, missing only Day 1 Part 1 of the original session. Its D1P1 was run separately and produced the correct answer in 9s.
## Ejections
None. All 10 models solved all 10 parts correctly on the first attempt. This is the first benchmark in this series with a perfect sweep.
## Results (Days 1–5)
### Per-task leaderboards
#### Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 9s |
| mistral/devstral-2512 | 23s |
| kimi-coding/k2p5 | 27s |
| alibaba/qwen3-coder-next | 28s |
| anthropic/claude-sonnet-4-6 | 29s |
| anthropic/claude-opus-4-6 | 30s |
| alibaba/qwen3.5-plus | 30s |
| openai-codex/gpt-5.3-codex | 36s |
| zai/glm-5 | 37s |
| minimax/MiniMax-M2.5 | 60s |
#### Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time |
|---|---|
| mistral/devstral-2512 | 13s |
| anthropic/claude-haiku-4-5 | 18s |
| openai-codex/gpt-5.3-codex | 20s |
| kimi-coding/k2p5 | 21s |
| anthropic/claude-sonnet-4-6 | 26s |
| anthropic/claude-opus-4-6 | 26s |
| alibaba/qwen3-coder-next | 42s |
| zai/glm-5 | 56s |
| alibaba/qwen3.5-plus | 73s |
| minimax/MiniMax-M2.5 | 81s |
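The post doesn't reproduce the puzzle statements, so as a purely illustrative sketch of what a "zero-crossings during dial rotation" task tends to look like: assume a dial with positions 0–99 and signed click counts, counting each time the pointer lands on or passes 0. The dial size and crossing rule here are assumptions, not the official spec.

```python
def count_zero_crossings(start: int, moves: list[int], size: int = 100) -> int:
    """Count how often the dial pointer lands on or passes position 0.

    `moves` are signed click counts (positive = clockwise). The dial
    size of 100 and the crossing rule are assumptions for illustration,
    not the official puzzle statement.
    """
    pos = start
    crossings = 0
    for step in moves:
        direction = 1 if step > 0 else -1
        for _ in range(abs(step)):  # advance one click at a time
            pos = (pos + direction) % size
            if pos == 0:
                crossings += 1
    return crossings
```

With these assumptions, `count_zero_crossings(95, [10])` crosses zero once on the way from 95 up to 5.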
#### Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| alibaba/qwen3-coder-next | 22s |
| mistral/devstral-2512 | 23s |
| anthropic/claude-haiku-4-5 | 24s |
| openai-codex/gpt-5.3-codex | 26s |
| kimi-coding/k2p5 | 28s |
| alibaba/qwen3.5-plus | 30s |
| zai/glm-5 | 38s |
| anthropic/claude-sonnet-4-6 | 43s |
| anthropic/claude-opus-4-6 | 50s |
| minimax/MiniMax-M2.5 | 67s |
#### Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 14s |
| alibaba/qwen3-coder-next | 17s |
| anthropic/claude-haiku-4-5 | 18s |
| kimi-coding/k2p5 | 24s |
| anthropic/claude-sonnet-4-6 | 25s |
| openai-codex/gpt-5.3-codex | 27s |
| zai/glm-5 | 30s |
| minimax/MiniMax-M2.5 | 33s |
| anthropic/claude-opus-4-6 | 41s |
| alibaba/qwen3.5-plus | 48s |
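The "(any repeat count)" in the Part 2 task name suggests checking whether an ID's decimal form is a shorter block repeated two or more times. A minimal sketch of that check, under the assumption that this is the qualifying rule (the official statement isn't quoted in this post):

```python
def is_repeated_pattern(n: int) -> bool:
    """True if str(n) is some block repeated two or more times,
    e.g. 1212 == "12" * 2, 777 == "7" * 3. The qualifying rule is
    an assumption about the puzzle, not the official statement."""
    s = str(n)
    for block_len in range(1, len(s) // 2 + 1):
        if len(s) % block_len == 0 and s == s[:block_len] * (len(s) // block_len):
            return True
    return False

def sum_repeated_in_range(lo: int, hi: int) -> int:
    """Sum all qualifying IDs in the inclusive range [lo, hi]."""
    return sum(n for n in range(lo, hi + 1) if is_repeated_pattern(n))
```

For input ranges of AoC scale, a brute-force scan like `sum_repeated_in_range` is usually fast enough; larger ranges would call for constructing the repeated numbers directly instead of testing every candidate.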
#### Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| kimi-coding/k2p5 | 21s |
| mistral/devstral-2512 | 24s |
| alibaba/qwen3-coder-next | 25s |
| anthropic/claude-sonnet-4-6 | 28s |
| anthropic/claude-haiku-4-5 | 30s |
| anthropic/claude-opus-4-6 | 31s |
| openai-codex/gpt-5.3-codex | 31s |
| alibaba/qwen3.5-plus | 31s |
| zai/glm-5 | 71s |
| minimax/MiniMax-M2.5 | 72s |
#### Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| kimi-coding/k2p5 | 19s |
| anthropic/claude-haiku-4-5 | 21s |
| mistral/devstral-2512 | 21s |
| alibaba/qwen3-coder-next | 21s |
| alibaba/qwen3.5-plus | 24s |
| anthropic/claude-sonnet-4-6 | 25s |
| openai-codex/gpt-5.3-codex | 25s |
| anthropic/claude-opus-4-6 | 28s |
| zai/glm-5 | 33s |
| minimax/MiniMax-M2.5 | 56s |
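If the Day 3 task is what its name suggests, namely picking k digits from a bank in their original order to form the largest k-digit number, then the standard greedy monotonic-stack approach solves both parts; only k changes from 2 to 12. A hedged sketch of that technique (the real puzzle rules may differ):

```python
def max_joltage(bank: str, k: int) -> int:
    """Largest k-digit number formed from k digits of `bank`, keeping
    their original order. Greedy: pop smaller digits off a stack while
    we can still afford to discard them. (Assumes the puzzle is an
    ordered-subsequence maximization; the real statement may differ.)
    """
    drop = len(bank) - k          # how many digits we may discard
    stack = []
    for d in bank:
        while stack and drop > 0 and stack[-1] < d:
            stack.pop()
            drop -= 1
        stack.append(d)
    return int("".join(stack[:k]))
```

This runs in linear time per bank, which would explain why Part 2 (k = 12) was no slower than Part 1 for most models.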
#### Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 21s |
| mistral/devstral-2512 | 22s |
| alibaba/qwen3-coder-next | 23s |
| kimi-coding/k2p5 | 27s |
| anthropic/claude-opus-4-6 | 29s |
| openai-codex/gpt-5.3-codex | 31s |
| zai/glm-5 | 31s |
| alibaba/qwen3.5-plus | 40s |
| anthropic/claude-sonnet-4-6 | 51s |
| minimax/MiniMax-M2.5 | 59s |
#### Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| mistral/devstral-2512 | 14s |
| alibaba/qwen3-coder-next | 18s |
| anthropic/claude-haiku-4-5 | 19s |
| openai-codex/gpt-5.3-codex | 21s |
| anthropic/claude-sonnet-4-6 | 23s |
| anthropic/claude-opus-4-6 | 23s |
| kimi-coding/k2p5 | 30s |
| alibaba/qwen3.5-plus | 47s |
| minimax/MiniMax-M2.5 | 52s |
| zai/glm-5 | 217s |
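The Part 2 task name points at a fixed-point simulation: repeatedly remove grid cells that satisfy some neighbor condition until nothing changes. A sketch of that loop, assuming `@` marks a roll and that a roll is removable when at most 3 of its 8 neighbors are rolls (both the marker and the threshold are guesses, not the official spec):

```python
def remove_until_stable(grid: list[str], max_neighbors: int = 3) -> int:
    """Repeatedly remove '@' cells with at most `max_neighbors`
    occupied 8-neighbors until no cell qualifies; return the total
    removed. The '@' marker and the threshold are assumptions."""
    rolls = {(r, c) for r, row in enumerate(grid)
             for c, ch in enumerate(row) if ch == "@"}
    deltas = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]
    removed = 0
    while True:
        batch = {p for p in rolls
                 if sum((p[0] + dr, p[1] + dc) in rolls
                        for dr, dc in deltas) <= max_neighbors}
        if not batch:
            return removed
        rolls -= batch            # remove the whole batch at once
        removed += len(batch)
```

Removing cells in whole batches per iteration (rather than one at a time) is the usual convention for this kind of puzzle; a one-at-a-time variant can give different answers, which is exactly the sort of ambiguity that costs a model extra execution attempts.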
#### Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| kimi-coding/k2p5 | 22s |
| mistral/devstral-2512 | 22s |
| openai-codex/gpt-5.3-codex | 23s |
| anthropic/claude-haiku-4-5 | 25s |
| anthropic/claude-sonnet-4-6 | 25s |
| alibaba/qwen3.5-plus | 26s |
| anthropic/claude-opus-4-6 | 27s |
| alibaba/qwen3-coder-next | 27s |
| zai/glm-5 | 43s |
| minimax/MiniMax-M2.5 | 53s |
#### Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 21s |
| kimi-coding/k2p5 | 21s |
| anthropic/claude-sonnet-4-6 | 22s |
| anthropic/claude-opus-4-6 | 23s |
| openai-codex/gpt-5.3-codex | 26s |
| alibaba/qwen3-coder-next | 28s |
| mistral/devstral-2512 | 29s |
| alibaba/qwen3.5-plus | 30s |
| minimax/MiniMax-M2.5 | 41s |
| zai/glm-5 | 46s |
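Counting distinct IDs covered by overlapping ranges is a classic interval-merge sweep: sort the ranges, coalesce anything that overlaps or touches, and sum the merged lengths. A minimal sketch of that pattern (the actual puzzle input format isn't shown in this post):

```python
def count_covered(ranges: list[tuple[int, int]]) -> int:
    """Total distinct integer IDs covered by inclusive [lo, hi]
    ranges, merging overlaps so nothing is double-counted."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(ranges):
        if cur_hi is not None and lo <= cur_hi + 1:
            cur_hi = max(cur_hi, hi)        # extend the merged block
        else:
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total
```

The sweep is O(n log n) in the number of ranges, so it stays fast even when the IDs themselves are far too large to enumerate one by one.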
## Summary tables
### Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mistral/devstral-2512 | 23s | 13s | 23s | 14s | 24s | 21s | 22s | 14s | 22s | 29s | 205s |
| anthropic/claude-haiku-4-5 | 9s | 18s | 24s | 18s | 30s | 21s | 21s | 19s | 25s | 21s | 206s |
| kimi-coding/k2p5 | 27s | 21s | 28s | 24s | 21s | 19s | 27s | 30s | 22s | 21s | 240s |
| alibaba/qwen3-coder-next | 28s | 42s | 22s | 17s | 25s | 21s | 23s | 18s | 27s | 28s | 251s |
| openai-codex/gpt-5.3-codex | 36s | 20s | 26s | 27s | 31s | 25s | 31s | 21s | 23s | 26s | 266s |
| anthropic/claude-sonnet-4-6 | 29s | 26s | 43s | 25s | 28s | 25s | 51s | 23s | 25s | 22s | 297s |
| anthropic/claude-opus-4-6 | 30s | 26s | 50s | 41s | 31s | 28s | 29s | 23s | 27s | 23s | 308s |
| alibaba/qwen3.5-plus | 30s | 73s | 30s | 48s | 31s | 24s | 40s | 47s | 26s | 30s | 379s |
| minimax/MiniMax-M2.5 | 60s | 81s | 67s | 33s | 72s | 56s | 59s | 52s | 53s | 41s | 574s |
| zai/glm-5 | 37s | 56s | 38s | 30s | 71s | 33s | 31s | 217s | 43s | 46s | 602s |
### Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 420 | 560 | 482 | 476 | 563 | 617 | 583 | 807 | 417 | 501 | 5,426 |
| kimi-coding/k2p5 | 582 | 798 | 550 | 583 | 438 | 571 | 511 | 612 | 500 | 568 | 5,713 |
| zai/glm-5 | 454 | 1,316 | 603 | 536 | 1,626 | 507 | 521 | 576 | 788 | 785 | 7,712 |
| mistral/devstral-2512 | 528 | 849 | 683 | 539 | 860 | 988 | 664 | 826 | 728 | 1,460 | 8,125 |
| anthropic/claude-sonnet-4-6 | 671 | 897 | 1,508 | 1,031 | 780 | 776 | 783 | 794 | 682 | 658 | 8,580 |
| anthropic/claude-opus-4-6 | 592 | 852 | 2,165 | 1,882 | 728 | 763 | 737 | 742 | 663 | 650 | 9,774 |
| anthropic/claude-haiku-4-5 | 843 | 798 | 1,034 | 897 | 1,940 | 1,099 | 907 | 970 | 1,410 | 1,365 | 11,263 |
| alibaba/qwen3-coder-next | 956 | 4,823 | 1,152 | 953 | 800 | 737 | 907 | 1,019 | 718 | 1,022 | 13,087 |
| minimax/MiniMax-M2.5 | 1,305 | 2,853 | 1,333 | 947 | 1,973 | 2,249 | 1,068 | 1,046 | 1,049 | 940 | 14,763 |
| alibaba/qwen3.5-plus | 1,388 | 7,840 | 1,537 | 3,484 | 2,188 | 1,031 | 1,414 | 1,056 | 1,141 | 1,582 | 22,661 |
### API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0049 | .0019 | .0020 | .0017 | .0015 | .0015 | .0016 | .0018 | .0016 | .0015 | $0.02 |
| mistral/devstral-2512 | .0079 | .0092 | .0078 | .0093 | .0072 | .0092 | .0064 | .0104 | .0076 | .0155 | $0.09 |
| zai/glm-5 | .0060 | .0132 | .0072 | .0074 | .0385 | .0226 | .0055 | .0074 | .0072 | .0087 | $0.12 |
| minimax/MiniMax-M2.5 | .0224 | .0297 | .0060 | .0065 | .0234 | .0491 | .0053 | .0089 | .0168 | .0211 | $0.19 |
| anthropic/claude-haiku-4-5 | .0224 | .0122 | .0273 | .0092 | .0288 | .0116 | .0233 | .0179 | .0330 | .0172 | $0.20 |
| alibaba/qwen3.5-plus | .0110 | .0308 | .0119 | .0273 | .0127 | .0141 | .0336 | .0425 | .0094 | .0161 | $0.21 |
| openai-codex/gpt-5.3-codex | .0178 | .0185 | .0163 | .0167 | .0518 | .0240 | .0340 | .0465 | .0153 | .0124 | $0.25 |
| anthropic/claude-sonnet-4-6 | .0340 | .0272 | .0519 | .0311 | .0349 | .0243 | .0348 | .0272 | .0323 | .0207 | $0.32 |
| alibaba/qwen3-coder-next | .0273 | .0560 | .0104 | .0123 | .0482 | .0599 | .0261 | .0344 | .0525 | .0741 | $0.40 |
| anthropic/claude-opus-4-6 | .1090 | .0809 | .1351 | .0803 | .1089 | .0942 | .1088 | .0794 | .1029 | .1030 | $1.00 |
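The cost approximation itself is simple arithmetic: token counts multiplied by the provider's per-million-token rates. A sketch of that calculation, with placeholder prices (the $3/$15 figures below are hypothetical, not any provider's real rates):

```python
def estimate_cost(tokens_in: int, tokens_out: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Rough USD cost from token counts and per-million-token prices.
    Prices are placeholders, not any provider's published rates."""
    return (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000

# e.g. a part that consumed 30,000 input and 650 output tokens at a
# hypothetical $3 / $15 per million tokens:
cost = estimate_cost(30_000, 650, 3.0, 15.0)   # ≈ $0.0998
```

Note this ignores prompt-caching discounts, which several providers apply automatically, so real per-part costs for a long orchestration session can be lower than this naive estimate.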
## Observations
- All 10 models passed all 10 parts on the first attempt. In the OCaml run, 5 of 9 models failed at Day 1 Part 2; here, nobody failed anything, and no retries were needed across the board.
- devstral-2512 was the fastest overall at 205s, fastest or joint-fastest on 6 of 10 parts, with 8,125 output tokens in total.
- claude-haiku-4-5 finished close behind at 206s, though with a high token count (11,263) relative to its speed.
- gpt-5.3-codex produced the fewest output tokens, 5,426 across all 10 parts, at $0.25 total cost and 266s total time.
- kimi-coding/k2p5 was the cheapest at roughly $0.02, with 5,713 tokens and 240s total.
- qwen3.5-plus produced the most tokens, 22,661 in total; the D1P2 spike (7,840 tokens for a single part) stands out. Its $0.21 total cost stayed low thanks to cheap per-token pricing.
- glm-5 spent 217s on D4P2, while every other model solved it in 14–52s. Its token usage on that part (576) was normal, so the time went elsewhere, perhaps to execution retries.
- claude-opus-4-6 cost $1.00 across all 10 parts. It was neither the slowest (308s) nor the most verbose (9,774 tokens), but it was the most expensive at roughly $0.10 per part.
- qwen3-coder-next finished in 251s but cost $0.40, the second-highest; the D1P2 token spike (4,823) accounts for much of that.
## What's next
Future runs in other languages should show whether these results hold or whether the leaderboard reshuffles when the target language changes.
Token and cost tracking will continue across all future benchmarks.
Benchmarked on 2026-02-25 using pi as the agent harness.
This post was written with AI assistance.