Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, and Rust benchmarks, I ran the same AoC 2025 Days 1–5 setup in Racket.
Racket is a Lisp dialect from the Scheme family. It's well-known in the programming
languages community and widely used in education (How to Design Programs, SICP variants),
but it's not a mainstream production language. Models need to handle S-expressions,
#lang racket conventions, and functional idioms with mutable state available but
discouraged. No scaffolding was provided — each model started from scratch.
The result: another clean sweep. Every model solved every part.
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models completed all 10 parts.
Only four retries (plus one API-timeout restart) were needed across all 100 model-parts:
- kimi-coding/k2p5 — Day 1 Part 2 (wrong answer, fixed on 2nd try)
- alibaba/qwen3-coder-next — Day 1 Part 2 (wrong answer, fixed on 2nd try)
- anthropic/claude-haiku-4-5 — Day 3 Part 1 (wrong answer, fixed on 2nd try)
- minimax/MiniMax-M2.5 — Day 5 Part 1 (API timeout, dirty restart) and Day 5 Part 2 (wrong answer, fixed on 2nd try)
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 39s |
| anthropic/claude-sonnet-4-6 | 40s |
| kimi-coding/k2p5 | 40s |
| openai-codex/gpt-5.3-codex | 41s |
| anthropic/claude-opus-4-6 | 44s |
| zai/glm-5 | 44s |
| mistral/devstral-2512 | 44s |
| alibaba/qwen3-coder-next | 50s |
| alibaba/qwen3.5-plus | 62s |
| minimax/MiniMax-M2.5 | 78s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 11s | ✓ |
| openai-codex/gpt-5.3-codex | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 20s | ✓ |
| anthropic/claude-opus-4-6 | 23s | ✓ |
| zai/glm-5 | 25s | ✓ |
| mistral/devstral-2512 | 49s | ✓ |
| alibaba/qwen3.5-plus | 70s | ✓ |
| minimax/MiniMax-M2.5 | 386s | ✓ |
| kimi-coding/k2p5 | 492s | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 540s | ✓ (2nd try) |
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| alibaba/qwen3-coder-next | 11s |
| anthropic/claude-haiku-4-5 | 12s |
| kimi-coding/k2p5 | 14s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 23s |
| mistral/devstral-2512 | 25s |
| zai/glm-5 | 28s |
| anthropic/claude-opus-4-6 | 30s |
| alibaba/qwen3.5-plus | 31s |
| minimax/MiniMax-M2.5 | 37s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 21s |
| anthropic/claude-haiku-4-5 | 24s |
| openai-codex/gpt-5.3-codex | 26s |
| kimi-coding/k2p5 | 27s |
| alibaba/qwen3-coder-next | 30s |
| alibaba/qwen3.5-plus | 34s |
| anthropic/claude-sonnet-4-6 | 36s |
| anthropic/claude-opus-4-6 | 41s |
| zai/glm-5 | 50s |
| minimax/MiniMax-M2.5 | 94s |
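The part name suggests checking whether an ID's digit string is some block repeated two or more times. A minimal Racket sketch of that check, purely illustrative since the actual ID semantics and input format aren't reproduced here:

```racket
#lang racket
;; Does digit string s consist of some block repeated at least twice?
;; Illustrative only; the real task's ID rules may differ.
(define (repeated-pattern? s)
  (define n (string-length s))
  (for/or ([len (in-range 1 (add1 (quotient n 2)))])
    (and (zero? (remainder n len))
         (let ([block (substring s 0 len)])
           ;; Every subsequent len-sized chunk must equal the first block.
           (for/and ([i (in-range len n len)])
             (string=? block (substring s i (+ i len))))))))

(repeated-pattern? "121212") ; => #t
(repeated-pattern? "123123") ; => #t
(repeated-pattern? "121213") ; => #f
```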
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| anthropic/claude-opus-4-6 | 32s | ✓ |
| alibaba/qwen3.5-plus | 34s | ✓ |
| zai/glm-5 | 37s | ✓ |
| mistral/devstral-2512 | 38s | ✓ |
| kimi-coding/k2p5 | 40s | ✓ |
| openai-codex/gpt-5.3-codex | 48s | ✓ |
| alibaba/qwen3-coder-next | 50s | ✓ |
| minimax/MiniMax-M2.5 | 110s | ✓ |
| anthropic/claude-haiku-4-5 | 229s | ✓ (2nd try) |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 24s |
| alibaba/qwen3-coder-next | 29s |
| anthropic/claude-sonnet-4-6 | 31s |
| anthropic/claude-opus-4-6 | 34s |
| openai-codex/gpt-5.3-codex | 38s |
| kimi-coding/k2p5 | 40s |
| alibaba/qwen3.5-plus | 86s |
| mistral/devstral-2512 | 156s |
| zai/glm-5 | 171s |
| minimax/MiniMax-M2.5 | 229s |
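Both Day 3 parts read like the classic problem of picking k digits in order to form the largest number (k = 2, then k = 12). That is my reading of the task names, not the actual spec; under that assumption, the standard greedy (take the largest digit that still leaves enough digits behind) is compact in Racket:

```racket
#lang racket
;; Greedy: choose k digits from `digits` (preserving order) to form the
;; largest possible number. My guess at the technique behind "joltage".
(define (max-k-digits digits k)
  (let loop ([ds digits] [k k] [acc '()])
    (if (zero? k)
        (reverse acc)
        (let* ([slack (- (length ds) k)]          ; how many digits we may skip
               [window (take ds (add1 slack))]    ; candidates for next pick
               [best (apply max window)]
               [idx (index-of window best)])
          (loop (drop ds (add1 idx)) (sub1 k) (cons best acc))))))

(max-k-digits '(3 9 2 5 8) 2) ; => '(9 8)
```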
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 17s |
| openai-codex/gpt-5.3-codex | 17s |
| anthropic/claude-opus-4-6 | 21s |
| kimi-coding/k2p5 | 22s |
| zai/glm-5 | 24s |
| alibaba/qwen3-coder-next | 40s |
| alibaba/qwen3.5-plus | 41s |
| minimax/MiniMax-M2.5 | 63s |
| mistral/devstral-2512 | 255s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 27s |
| anthropic/claude-sonnet-4-6 | 30s |
| alibaba/qwen3.5-plus | 30s |
| anthropic/claude-opus-4-6 | 31s |
| kimi-coding/k2p5 | 33s |
| zai/glm-5 | 41s |
| anthropic/claude-haiku-4-5 | 49s |
| minimax/MiniMax-M2.5 | 64s |
| mistral/devstral-2512 | 84s |
| alibaba/qwen3-coder-next | 290s |
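Day 4's two parts look like a neighbor-count predicate (Part 1) followed by an iterate-until-stable loop over it (Part 2). Here is a sketch of that fixed-point pattern with the grid as a set of occupied cells; the removal rule (fewer than four occupied neighbors) is a placeholder of mine, not the puzzle's:

```racket
#lang racket
;; Fixed-point removal sketch. Grid = set of occupied (x . y) cells.
;; The "< 4 neighbors" rule below is illustrative, not the real condition.
(define (neighbors cell)
  (match-define (cons x y) cell)
  (for*/list ([dx '(-1 0 1)] [dy '(-1 0 1)]
              #:unless (and (zero? dx) (zero? dy)))
    (cons (+ x dx) (+ y dy))))

(define (step grid)
  ;; Keep only cells with at least 4 occupied neighbors.
  (for/set ([cell (in-set grid)]
            #:when (>= (count (lambda (n) (set-member? grid n))
                              (neighbors cell))
                       4))
    cell))

(define (run-until-stable grid)
  (define next (step grid))
  (if (equal? next grid) grid (run-until-stable next)))
```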
Day 5 Part 1 — Range membership checking
| Model | Time | Result |
|---|---|---|
| openai-codex/gpt-5.3-codex | 14s | ✓ |
| anthropic/claude-haiku-4-5 | 15s | ✓ |
| mistral/devstral-2512 | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 17s | ✓ |
| zai/glm-5 | 18s | ✓ |
| alibaba/qwen3-coder-next | 19s | ✓ |
| anthropic/claude-opus-4-6 | 21s | ✓ |
| alibaba/qwen3.5-plus | 21s | ✓ |
| kimi-coding/k2p5 | 28s | ✓ |
| minimax/MiniMax-M2.5 | —* | ✓ (dirty retry) |
* API timeout forced a fresh relaunch.
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 21s | ✓ |
| alibaba/qwen3-coder-next | 24s | ✓ |
| openai-codex/gpt-5.3-codex | 27s | ✓ |
| anthropic/claude-opus-4-6 | 29s | ✓ |
| anthropic/claude-haiku-4-5 | 30s | ✓ |
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| zai/glm-5 | 36s | ✓ |
| alibaba/qwen3.5-plus | 44s | ✓ |
| mistral/devstral-2512 | 59s | ✓ |
| minimax/MiniMax-M2.5 | 789s | ✓ (2nd try) |
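The task name points at classic interval merging: sort the ranges, coalesce overlaps, sum the coverage. A minimal Racket sketch of that technique, assuming inclusive integer ranges represented as `(lo . hi)` pairs (the real input format isn't shown here):

```racket
#lang racket
;; Merge overlapping or adjacent inclusive ranges, then count covered IDs.
;; The (lo . hi) representation is an assumption for illustration.
(define (total-covered ranges)
  (define sorted (sort ranges < #:key car))
  (define merged
    (for/fold ([acc '()]) ([r (in-list sorted)])
      (match acc
        [(cons (cons lo hi) rest)
         #:when (<= (car r) (add1 hi))             ; overlaps or touches
         (cons (cons lo (max hi (cdr r))) rest)]   ; extend the current range
        [_ (cons r acc)])))
  (for/sum ([r (in-list merged)])
    (add1 (- (cdr r) (car r)))))

(total-covered '((1 . 3) (2 . 6) (10 . 12))) ; => 9
```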
Speed vs accuracy
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 41s | 16s | 18s | 26s | 48s | 38s | 17s | 27s | 14s | 27s | 272s |
| anthropic/claude-sonnet-4-6 | 40s | 20s | 23s | 36s | 32s | 31s | 17s | 30s | 17s | 32s | 278s |
| anthropic/claude-opus-4-6 | 44s | 23s | 30s | 41s | 32s | 34s | 21s | 31s | 21s | 29s | 306s |
| anthropic/claude-haiku-4-5 | 39s | 11s | 12s | 24s | 229s | 24s | 13s | 49s | 15s | 30s | 446s |
| alibaba/qwen3.5-plus | 62s | 70s | 31s | 34s | 34s | 86s | 41s | 30s | 21s | 44s | 453s |
| zai/glm-5 | 44s | 25s | 28s | 50s | 37s | 171s | 24s | 41s | 18s | 36s | 474s |
| mistral/devstral-2512 | 44s | 49s | 25s | 21s | 38s | 156s | 255s | 84s | 16s | 59s | 747s |
| kimi-coding/k2p5 | 40s | 492s | 14s | 27s | 40s | 40s | 22s | 33s | 28s | 21s | 757s |
| alibaba/qwen3-coder-next | 50s | 540s | 11s | 30s | 50s | 29s | 40s | 290s | 19s | 24s | 1,083s |
| minimax/MiniMax-M2.5 | 78s | 386s | 37s | 94s | 110s | 229s | 63s | 64s | —* | 789s | 1,850s* |
* MiniMax D5P1 required a dirty restart after API timeout; wall-clock time not captured cleanly. Total excludes D5P1.
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 427 | 570 | 674 | 540 | 1,109 | 893 | 656 | 740 | 561 | 541 | 6,711 |
| kimi-coding/k2p5 | 575 | 2,515 | 782 | 830 | 614 | 774 | 636 | 793 | 886 | 585 | 7,990 |
| zai/glm-5 | 465 | 518 | 660 | 766 | 489 | 4,134 | 560 | 671 | 475 | 505 | 8,243 |
| anthropic/claude-opus-4-6 | 704 | 1,158 | 1,453 | 1,435 | 797 | 958 | 901 | 930 | 865 | 813 | 10,014 |
| anthropic/claude-sonnet-4-6 | 718 | 1,045 | 1,318 | 1,414 | 902 | 947 | 990 | 1,096 | 807 | 844 | 10,081 |
| anthropic/claude-haiku-4-5 | 1,172 | 1,040 | 1,125 | 1,074 | 5,350 | 1,092 | 1,332 | 5,061 | 1,274 | 2,823 | 21,343 |
| alibaba/qwen3-coder-next | 1,053 | 6,960 | 768 | 1,421 | 1,345 | 1,679 | 4,761 | 10,162 | 1,430 | 818 | 30,397 |
| alibaba/qwen3.5-plus | 2,831 | 9,323 | 3,407 | 1,790 | 2,579 | 6,137 | 1,926 | 1,177 | 1,654 | 2,102 | 32,926 |
| minimax/MiniMax-M2.5 | 1,005 | 6,380 | 1,392 | 3,668 | 3,208 | 7,825 | 2,038 | 1,184 | 1,316 | 29,365 | 57,381 |
| mistral/devstral-2512 | 2,215 | 7,035 | 2,306 | 800 | 2,344 | 14,380 | 23,047 | 6,962 | 1,368 | 5,165 | 65,622 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0084 | .0165 | .0035 | .0043 | .0090 | .0081 | .0029 | .0042 | .0040 | .0040 | $0.06 |
| zai/glm-5 | .0061 | .0068 | .0072 | .0104 | .0271 | .0761 | .0222 | .0190 | .0052 | .0073 | $0.19 |
| anthropic/claude-haiku-4-5 | .0347 | .0104 | .0245 | .0104 | .0664 | .0137 | .0292 | .0469 | .0317 | .0231 | $0.29 |
| alibaba/qwen3.5-plus | .0237 | .0460 | .0213 | .0185 | .0137 | .0538 | .0553 | .0388 | .0108 | .0216 | $0.30 |
| openai-codex/gpt-5.3-codex | .0155 | .0220 | .0271 | .0171 | .0340 | .0352 | .0386 | .0410 | .0389 | .0390 | $0.31 |
| anthropic/claude-sonnet-4-6 | .0349 | .0303 | .0457 | .0377 | .0373 | .0279 | .0388 | .0335 | .0348 | .0246 | $0.35 |
| anthropic/claude-opus-4-6 | .1142 | .0926 | .1286 | .0646 | .1117 | .0834 | .1160 | .0879 | .1117 | .0939 | $1.00 |
| alibaba/qwen3-coder-next | .0360 | .1663 | .0092 | .0198 | .0993 | .0962 | .1017 | .4650 | .0828 | .0647 | $1.14 |
| minimax/MiniMax-M2.5 | .0144 | .0334 | .0057 | .0259 | .0356 | .1261 | .0452 | .0395 | .0208 | .8727 | $1.22 |
| mistral/devstral-2512 | .0233 | .0987 | .0171 | .0147 | .0286 | .3224 | .5733 | .1688 | .0142 | .0667 | $1.33 |
Observations
10/10 completers — zero ejections. Racket joins Python, Ruby, Elm, and Rust as the fifth language in this series where every model solved every part.
gpt-5.3-codex — fastest overall at 272s. Also the most token-efficient at 6,711
tokens total. Never needed a retry. No single part over 48s.
claude-sonnet-4-6 — second fastest at 278s, remarkably consistent. Every part
landed between 17s and 40s.
kimi-coding/k2p5 — cheapest at $0.06 total, second fewest tokens at 7,990. The
Day 1 Part 2 retry inflated its total time to 757s, but on 9 of 10 parts it finished
in 40s or less.
minimax/MiniMax-M2.5 — slowest overall at 1,850s. Hit an API timeout on Day 5
Part 1, then gave a wrong answer on Day 5 Part 2 that took 789s and 29K tokens to fix.
Day 1 Part 2 also took 386s. The model completed everything, but was consistently the
bottleneck.
devstral-2512 — spiked on two parts: D3P2 (156s, 14K tokens) and D4P1 (255s,
23K tokens, $0.57). Every other part was routine.
Day 1 Part 2 was the stumbling block. Two models (k2p5, qwen3-coder-next)
needed a second try, and MiniMax-M2.5 took 386s on its first try. Part 2 was otherwise
straightforward — most first-try solves landed under 50s.
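I won't reproduce the puzzle, but the likely trap is wraparound counting of the kind sketched below. The dial size of 100 and signed click deltas are my assumptions, not the actual spec; the subtlety is whether landing exactly on zero counts as a crossing:

```racket
#lang racket
;; Hypothetical sketch of the D1P2 pattern: count how many times the dial
;; passes (or lands on) position 0 while rotating from `pos` by `delta`
;; clicks. Dial size 100 and signed deltas are illustrative assumptions.
(define DIAL-SIZE 100)

(define (zero-crossings pos delta)
  (define step (if (negative? delta) -1 1))
  (for/fold ([p pos] [crossings 0] #:result crossings)
            ([_ (in-range (abs delta))])
    (define next (modulo (+ p step) DIAL-SIZE))
    (values next (+ crossings (if (zero? next) 1 0)))))

(zero-crossings 50 120) ; => 1  (passes 0 once, ends at 70)
```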
No Racket-specific struggles. No model got stuck on S-expression syntax,
#lang racket conventions, or Racket-specific library APIs. The parentheses didn't slow
anyone down.
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| Rust | 10/10 |
| Racket | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Racket's perfect completion rate is notable. It's not a mainstream language, but it has
clear semantics, good documentation, and a REPL-friendly workflow. Unlike Elm (which
needed a scaffold) or Rust (which demands satisfying the borrow checker), Racket
lets you write a quick script with #lang racket and go — and that simplicity may
have helped.
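For a sense of how little ceremony that involves, here is a hypothetical minimal skeleton (file name and input format are illustrative), run as `racket solve.rkt < input.txt`:

```racket
#lang racket
;; Hypothetical minimal AoC skeleton: read stdin, parse, print an answer.
(define lines (port->lines (current-input-port)))
(define nums (filter number? (map string->number lines)))  ; drop blank lines
(displayln (apply + nums))
```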
Benchmarked on 2026-02-27 using pi as the agent harness.
This post was written with AI assistance.