Developers, developers, developers!

Blog about programming, programming and, ah more programming!

Benchmarking LLMs on Advent of Code 2025 (Python)

Tags = [ Python, AI, Advent of Code ]

Following up on the Haskell benchmark and the OCaml benchmark, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Python.

This is also the first run with full token usage and API cost tracking per part, which adds a new angle beyond raw wall-clock time.

The contestants

#  Model
1  anthropic/claude-haiku-4-5
2  anthropic/claude-sonnet-4-6
3  anthropic/claude-opus-4-6
4  openai-codex/gpt-5.3-codex
5  zai/glm-5
6  minimax/MiniMax-M2.5
7  kimi-coding/k2p5
8  mistral/devstral-2512
9  alibaba/qwen3.5-plus
10 alibaba/qwen3-coder-next

A wrong model ID (claude-3-5-haiku-latest) accidentally made it into the enabled model list at the start of the run. It was caught immediately, killed, and replaced with claude-haiku-4-5, which missed only Day 1 Part 1 of the original session. That part was run separately afterwards and produced the correct answer in 9s.

Ejections

None. All 10 models solved all 10 parts correctly on the first attempt. This is the first benchmark in this series with a perfect sweep.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model Time
anthropic/claude-haiku-4-5 9s
mistral/devstral-2512 23s
kimi-coding/k2p5 27s
alibaba/qwen3-coder-next 28s
anthropic/claude-sonnet-4-6 29s
anthropic/claude-opus-4-6 30s
alibaba/qwen3.5-plus 30s
openai-codex/gpt-5.3-codex 36s
zai/glm-5 37s
minimax/MiniMax-M2.5 60s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model Time
mistral/devstral-2512 13s
anthropic/claude-haiku-4-5 18s
openai-codex/gpt-5.3-codex 20s
kimi-coding/k2p5 21s
anthropic/claude-sonnet-4-6 26s
anthropic/claude-opus-4-6 26s
alibaba/qwen3-coder-next 42s
zai/glm-5 56s
alibaba/qwen3.5-plus 73s
minimax/MiniMax-M2.5 81s
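I haven't reproduced the puzzle text here, but "counting zero-crossings during dial rotation" suggests simulating a circular dial one step at a time. A minimal sketch of that idea, assuming a 100-position dial and hypothetical instructions like "R30" (clockwise) or "L15" (counter-clockwise); the function name and input format are my inventions, not the actual puzzle spec:

```python
def count_zero_crossings(start: int, moves: list[str], size: int = 100) -> int:
    """Count how often the dial passes or lands on position 0."""
    pos = start
    crossings = 0
    for move in moves:
        step = 1 if move[0] == "R" else -1
        # Step one tick at a time so every pass over 0 is observed.
        for _ in range(int(move[1:])):
            pos = (pos + step) % size
            if pos == 0:
                crossings += 1
    return crossings
```

Stepping tick by tick is O(total rotation), which is plenty fast for typical AoC input sizes and avoids the off-by-one traps of computing crossings arithmetically.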



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model Time
alibaba/qwen3-coder-next 22s
mistral/devstral-2512 23s
anthropic/claude-haiku-4-5 24s
openai-codex/gpt-5.3-codex 26s
kimi-coding/k2p5 28s
alibaba/qwen3.5-plus 30s
zai/glm-5 38s
anthropic/claude-sonnet-4-6 43s
anthropic/claude-opus-4-6 50s
minimax/MiniMax-M2.5 67s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model Time
mistral/devstral-2512 14s
alibaba/qwen3-coder-next 17s
anthropic/claude-haiku-4-5 18s
kimi-coding/k2p5 24s
anthropic/claude-sonnet-4-6 25s
openai-codex/gpt-5.3-codex 27s
zai/glm-5 30s
minimax/MiniMax-M2.5 33s
anthropic/claude-opus-4-6 41s
alibaba/qwen3.5-plus 48s
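"Repeated-pattern IDs (any repeat count)" maps onto a classic string trick: a string consists of some smaller pattern repeated two or more times exactly when it occurs inside its own doubling with the first and last characters trimmed. A sketch under that reading of the task (the function name and the exact ID rules are my assumptions):

```python
def is_repeated_pattern(s: str) -> bool:
    """True if s is a shorter pattern repeated >= 2 times, e.g. '123123'."""
    # If s = p * k with k >= 2, then s reappears inside (s + s) at an
    # offset of len(p); trimming the endpoints rules out the trivial match.
    return s in (s + s)[1:-1]
```

This runs in linear time (substring search) and sidesteps enumerating every divisor of the length by hand.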



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model Time
kimi-coding/k2p5 21s
mistral/devstral-2512 24s
alibaba/qwen3-coder-next 25s
anthropic/claude-sonnet-4-6 28s
anthropic/claude-haiku-4-5 30s
anthropic/claude-opus-4-6 31s
openai-codex/gpt-5.3-codex 31s
alibaba/qwen3.5-plus 31s
zai/glm-5 71s
minimax/MiniMax-M2.5 72s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model Time
kimi-coding/k2p5 19s
anthropic/claude-haiku-4-5 21s
mistral/devstral-2512 21s
alibaba/qwen3-coder-next 21s
alibaba/qwen3.5-plus 24s
anthropic/claude-sonnet-4-6 25s
openai-codex/gpt-5.3-codex 25s
anthropic/claude-opus-4-6 28s
zai/glm-5 33s
minimax/MiniMax-M2.5 56s
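"Maximizing a k-digit joltage" from a bank of digits sounds like the classic largest-subsequence-of-length-k problem, which has a greedy monotonic-stack solution; Part 1 would be k=2 and Part 2 k=12. A sketch under that assumption (max_joltage and the exact input shape are hypothetical, not from the puzzle text):

```python
def max_joltage(digits: str, k: int) -> int:
    """Largest k-digit number formable as a subsequence of `digits`."""
    stack: list[str] = []
    drop = len(digits) - k  # how many digits we are allowed to discard
    for d in digits:
        # Pop smaller digits while discarding still leaves k digits available.
        while stack and drop > 0 and stack[-1] < d:
            stack.pop()
            drop -= 1
        stack.append(d)
    return int("".join(stack[:k]))
```

The greedy invariant is that each kept digit is the largest possible given how many digits remain, so the whole thing is O(n) per bank.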



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model Time
anthropic/claude-haiku-4-5 21s
mistral/devstral-2512 22s
alibaba/qwen3-coder-next 23s
kimi-coding/k2p5 27s
anthropic/claude-opus-4-6 29s
openai-codex/gpt-5.3-codex 31s
zai/glm-5 31s
alibaba/qwen3.5-plus 40s
anthropic/claude-sonnet-4-6 51s
minimax/MiniMax-M2.5 59s



Day 4 Part 2 — Iterative grid removal simulation

Model Time
mistral/devstral-2512 14s
alibaba/qwen3-coder-next 18s
anthropic/claude-haiku-4-5 19s
openai-codex/gpt-5.3-codex 21s
anthropic/claude-sonnet-4-6 23s
anthropic/claude-opus-4-6 23s
kimi-coding/k2p5 30s
alibaba/qwen3.5-plus 47s
minimax/MiniMax-M2.5 52s
zai/glm-5 217s
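An "iterative grid removal simulation" is structurally a fixed-point loop: remove everything currently removable, recompute, and repeat until nothing changes. The removal rule below (fewer than four occupied 8-neighbors) is purely my guess at the kind of condition involved; only the loop shape is the point:

```python
def total_removed(grid: set[tuple[int, int]], limit: int = 4) -> int:
    """Repeatedly remove cells meeting the (assumed) rule; count removals."""
    occupied = set(grid)
    removed = 0
    while True:
        # Collect this round's removals before mutating, so all cells in a
        # round are judged against the same snapshot of the grid.
        removable = {
            (r, c) for (r, c) in occupied
            if sum((r + dr, c + dc) in occupied
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)) < limit
        }
        if not removable:
            break
        occupied -= removable
        removed += len(removable)
    return removed
```

Batching each round's removals (rather than deleting cells one by one mid-scan) is the usual correctness pitfall in this kind of simulation.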



Day 5 Part 1 — Range membership checking

Model Time
kimi-coding/k2p5 22s
mistral/devstral-2512 22s
openai-codex/gpt-5.3-codex 23s
anthropic/claude-haiku-4-5 25s
anthropic/claude-sonnet-4-6 25s
alibaba/qwen3.5-plus 26s
anthropic/claude-opus-4-6 27s
alibaba/qwen3-coder-next 27s
zai/glm-5 43s
minimax/MiniMax-M2.5 53s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model Time
anthropic/claude-haiku-4-5 21s
kimi-coding/k2p5 21s
anthropic/claude-sonnet-4-6 22s
anthropic/claude-opus-4-6 23s
openai-codex/gpt-5.3-codex 26s
alibaba/qwen3-coder-next 28s
mistral/devstral-2512 29s
alibaba/qwen3.5-plus 30s
minimax/MiniMax-M2.5 41s
zai/glm-5 46s
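"Counting total fresh IDs from overlapping ranges" is presumably the standard interval-union count: sort the ranges, merge overlapping (and adjacent) ones, and sum merged lengths instead of materializing every ID. A sketch, assuming inclusive [lo, hi] ranges; the function name is mine:

```python
def count_covered(ranges: list[tuple[int, int]]) -> int:
    """Number of distinct integers covered by a union of inclusive ranges."""
    total = 0
    cur_lo, cur_hi = None, None
    for lo, hi in sorted(ranges):
        if cur_hi is None or lo > cur_hi + 1:
            # Gap before this range: flush the finished merged block.
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping or adjacent: extend the current block.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total
```

Merging keeps the work at O(n log n) in the number of ranges, which matters when the ranges span billions of IDs.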

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
mistral/devstral-2512 23s 13s 23s 14s 24s 21s 22s 14s 22s 29s 205s
anthropic/claude-haiku-4-5 9s 18s 24s 18s 30s 21s 21s 19s 25s 21s 206s
kimi-coding/k2p5 27s 21s 28s 24s 21s 19s 27s 30s 22s 21s 240s
alibaba/qwen3-coder-next 28s 42s 22s 17s 25s 21s 23s 18s 27s 28s 251s
openai-codex/gpt-5.3-codex 36s 20s 26s 27s 31s 25s 31s 21s 23s 26s 266s
anthropic/claude-sonnet-4-6 29s 26s 43s 25s 28s 25s 51s 23s 25s 22s 297s
anthropic/claude-opus-4-6 30s 26s 50s 41s 31s 28s 29s 23s 27s 23s 308s
alibaba/qwen3.5-plus 30s 73s 30s 48s 31s 24s 40s 47s 26s 30s 379s
minimax/MiniMax-M2.5 60s 81s 67s 33s 72s 56s 59s 52s 53s 41s 574s
zai/glm-5 37s 56s 38s 30s 71s 33s 31s 217s 43s 46s 602s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 420 560 482 476 563 617 583 807 417 501 5,426
kimi-coding/k2p5 582 798 550 583 438 571 511 612 500 568 5,713
zai/glm-5 454 1,316 603 536 1,626 507 521 576 788 785 7,712
mistral/devstral-2512 528 849 683 539 860 988 664 826 728 1,460 8,125
anthropic/claude-sonnet-4-6 671 897 1,508 1,031 780 776 783 794 682 658 8,580
anthropic/claude-opus-4-6 592 852 2,165 1,882 728 763 737 742 663 650 9,774
anthropic/claude-haiku-4-5 843 798 1,034 897 1,940 1,099 907 970 1,410 1,365 11,263
alibaba/qwen3-coder-next 956 4,823 1,152 953 800 737 907 1,019 718 1,022 13,087
minimax/MiniMax-M2.5 1,305 2,853 1,333 947 1,973 2,249 1,068 1,046 1,049 940 14,763
alibaba/qwen3.5-plus 1,388 7,840 1,537 3,484 2,188 1,031 1,414 1,056 1,141 1,582 22,661

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0049 .0019 .0020 .0017 .0015 .0015 .0016 .0018 .0016 .0015 $0.02
mistral/devstral-2512 .0079 .0092 .0078 .0093 .0072 .0092 .0064 .0104 .0076 .0155 $0.09
zai/glm-5 .0060 .0132 .0072 .0074 .0385 .0226 .0055 .0074 .0072 .0087 $0.12
minimax/MiniMax-M2.5 .0224 .0297 .0060 .0065 .0234 .0491 .0053 .0089 .0168 .0211 $0.19
anthropic/claude-haiku-4-5 .0224 .0122 .0273 .0092 .0288 .0116 .0233 .0179 .0330 .0172 $0.20
alibaba/qwen3.5-plus .0110 .0308 .0119 .0273 .0127 .0141 .0336 .0425 .0094 .0161 $0.21
openai-codex/gpt-5.3-codex .0178 .0185 .0163 .0167 .0518 .0240 .0340 .0465 .0153 .0124 $0.25
anthropic/claude-sonnet-4-6 .0340 .0272 .0519 .0311 .0349 .0243 .0348 .0272 .0323 .0207 $0.32
alibaba/qwen3-coder-next .0273 .0560 .0104 .0123 .0482 .0599 .0261 .0344 .0525 .0741 $0.40
anthropic/claude-opus-4-6 .1090 .0809 .1351 .0803 .1089 .0942 .1088 .0794 .1029 .1030 $1.00

Observations

All 10 models passed all 10 parts on the first attempt. In the OCaml run, 5 of 9 models failed at Day 1 Part 2. Here, nobody failed anything — no retries needed across the board.

devstral-2512 is the fastest overall at 205s. Fastest or joint-fastest on 4 of 10 parts (D1P2, D2P2, D4P2, and joint on D5P1). 8,125 output tokens total.

claude-haiku-4-5 — 206s total, close behind. Higher token count (11,263) relative to its speed.

gpt-5.3-codex — fewest output tokens: 5,426 total across 10 parts. $0.25 total cost, 266s total time.

kimi-coding/k2p5 — cheapest at ~$0.02. 5,713 tokens, 240s total.

qwen3.5-plus — most tokens: 22,661 total. The D1P2 spike (7,840 tokens for a single part) stands out. Total cost of $0.21, kept low by its cheap per-token rates.

glm-5 — 217s on D4P2, while others solved it in 14–52s. Token usage on that part (576 tok) was normal, so the time was spent elsewhere (execution retries, perhaps).

claude-opus-4-6 — $1.00 total across all 10 parts. Not the slowest (308s), not the most verbose (9,774 tok), but the most expensive at roughly $0.10 per part.

qwen3-coder-next — 251s total, but $0.40 (second-highest cost). The D1P2 token spike (4,823) accounts for much of that.

What's next

Future runs in other languages should show whether these results hold or whether the leaderboard reshuffles when the target language changes.

Token and cost tracking will continue across all future benchmarks.

Benchmarked on 2026-02-25 using pi as the agent harness.


This post was written with AI assistance.