Developers, developers, developers!

Blog about programming, programming, and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Racket)

Tags = [ Racket, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, and Rust benchmarks, I ran the same AoC 2025 Days 1–5 setup in Racket.

Racket is a Lisp dialect from the Scheme family. It's well-known in the programming languages community and widely used in education (How to Design Programs, SICP variants), but it's not a mainstream production language. Models need to handle S-expressions, #lang racket conventions, and functional idioms with mutable state available but discouraged. No scaffolding was provided — each model started from scratch.
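To make that concrete, here is a sketch of the kind of idiomatic Racket the models had to produce from scratch. The function is mine, invented for illustration; it is not taken from any model's solution:

```racket
#lang racket
;; Illustrative only: everything is an S-expression, the default style
;; is immutable and expression-oriented, and iteration is usually a
;; for-comprehension or a fold rather than a mutating loop.
(define (sum-of-squares xs)
  (for/sum ([x (in-list xs)])
    (* x x)))

(sum-of-squares '(1 2 3)) ; => 14
```

A script like this needs nothing beyond the #lang line to run with `racket file.rkt`, which keeps the setup burden on the models close to zero.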

The result: another clean sweep. Every model solved every part.

The contestants

#    Model
1    anthropic/claude-haiku-4-5
2    anthropic/claude-sonnet-4-6
3    anthropic/claude-opus-4-6
4    openai-codex/gpt-5.3-codex
5    zai/glm-5
6    minimax/MiniMax-M2.5
7    kimi-coding/k2p5
8    mistral/devstral-2512
9    alibaba/qwen3.5-plus
10   alibaba/qwen3-coder-next

Ejections

None. All 10 models completed all 10 parts.

Interventions were rare across all 100 model-parts: four wrong-answer retries and one timeout-forced restart.

  • kimi-coding/k2p5 — Day 1 Part 2 (wrong answer, fixed on 2nd try)
  • alibaba/qwen3-coder-next — Day 1 Part 2 (wrong answer, fixed on 2nd try)
  • anthropic/claude-haiku-4-5 — Day 3 Part 1 (wrong answer, fixed on 2nd try)
  • minimax/MiniMax-M2.5 — Day 5 Part 1 (API timeout, dirty restart) and Day 5 Part 2 (wrong answer, fixed on 2nd try)

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
anthropic/claude-haiku-4-5     39s
anthropic/claude-sonnet-4-6    40s
kimi-coding/k2p5               40s
openai-codex/gpt-5.3-codex     41s
anthropic/claude-opus-4-6      44s
zai/glm-5                      44s
mistral/devstral-2512          44s
alibaba/qwen3-coder-next       50s
alibaba/qwen3.5-plus           62s
minimax/MiniMax-M2.5           78s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time   Result
anthropic/claude-haiku-4-5     11s
openai-codex/gpt-5.3-codex     16s
anthropic/claude-sonnet-4-6    20s
anthropic/claude-opus-4-6      23s
zai/glm-5                      25s
mistral/devstral-2512          49s
alibaba/qwen3.5-plus           70s
minimax/MiniMax-M2.5           386s
kimi-coding/k2p5               492s   ✓ (2nd try)
alibaba/qwen3-coder-next       540s   ✓ (2nd try)
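The task titles here are my paraphrases, not the official puzzle statements, so take the following as a guess at the problem's shape only. Assuming a 100-position dial and signed click counts per move, counting how often the pointer lands on position 0 is a small exercise in modular arithmetic; the names and the dial size are my assumptions:

```racket
#lang racket
;; Hedged sketch: 100-position dial, each move is a signed number of
;; clicks; count every click that lands the pointer on position 0.
;; The real puzzle statement is not reproduced in this post.
(define dial-size 100)

;; How many clicks of a single move land on position 0.
(define (zero-hits pos clicks)
  (define step (if (negative? clicks) -1 1))
  (for/sum ([k (in-range 1 (add1 (abs clicks)))])
    (if (zero? (modulo (+ pos (* step k)) dial-size)) 1 0)))

(define (count-zero-crossings start moves)
  (for/fold ([pos start] [total 0] #:result total)
            ([m (in-list moves)])
    (values (modulo (+ pos m) dial-size)
            (+ total (zero-hits pos m)))))
```

The per-click simulation is deliberately naive; a constant-time version with quotients is the obvious optimization once the brute-force answer is confirmed.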



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
alibaba/qwen3-coder-next       11s
anthropic/claude-haiku-4-5     12s
kimi-coding/k2p5               14s
openai-codex/gpt-5.3-codex     18s
anthropic/claude-sonnet-4-6    23s
mistral/devstral-2512          25s
zai/glm-5                      28s
anthropic/claude-opus-4-6      30s
alibaba/qwen3.5-plus           31s
minimax/MiniMax-M2.5           37s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time
mistral/devstral-2512          21s
anthropic/claude-haiku-4-5     24s
openai-codex/gpt-5.3-codex     26s
kimi-coding/k2p5               27s
alibaba/qwen3-coder-next       30s
alibaba/qwen3.5-plus           34s
anthropic/claude-sonnet-4-6    36s
anthropic/claude-opus-4-6      41s
zai/glm-5                      50s
minimax/MiniMax-M2.5           94s



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time   Result
anthropic/claude-sonnet-4-6    32s
anthropic/claude-opus-4-6      32s
alibaba/qwen3.5-plus           34s
zai/glm-5                      37s
mistral/devstral-2512          38s
kimi-coding/k2p5               40s
openai-codex/gpt-5.3-codex     48s
alibaba/qwen3-coder-next       50s
minimax/MiniMax-M2.5           110s
anthropic/claude-haiku-4-5     229s   ✓ (2nd try)



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
anthropic/claude-haiku-4-5     24s
alibaba/qwen3-coder-next       29s
anthropic/claude-sonnet-4-6    31s
anthropic/claude-opus-4-6      34s
openai-codex/gpt-5.3-codex     38s
kimi-coding/k2p5               40s
alibaba/qwen3.5-plus           86s
mistral/devstral-2512          156s
zai/glm-5                      171s
minimax/MiniMax-M2.5           229s



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
anthropic/claude-haiku-4-5     13s
anthropic/claude-sonnet-4-6    17s
openai-codex/gpt-5.3-codex     17s
anthropic/claude-opus-4-6      21s
kimi-coding/k2p5               22s
zai/glm-5                      24s
alibaba/qwen3-coder-next       40s
alibaba/qwen3.5-plus           41s
minimax/MiniMax-M2.5           63s
mistral/devstral-2512          255s
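For a sense of the code this task calls for, here is a hedged sketch of grid neighbor counting. The marker character, the 8-cell neighborhood, and the accessibility threshold are my assumptions for illustration, not the actual puzzle spec:

```racket
#lang racket
;; Hedged sketch: '#' marks a paper roll; a roll counts as
;; "accessible" when fewer than 4 of its 8 neighbors are also rolls.
;; Marker, neighborhood, and threshold are guesses, not the real spec.
(define (count-accessible grid)          ; grid: list of equal-length strings
  (define rows (length grid))
  (define cols (string-length (first grid)))
  (define (roll? r c)
    (and (<= 0 r (sub1 rows)) (<= 0 c (sub1 cols))
         (char=? (string-ref (list-ref grid r) c) #\#)))
  (define (neighbors r c)
    (for*/sum ([dr '(-1 0 1)] [dc '(-1 0 1)]
               #:unless (and (zero? dr) (zero? dc)))
      (if (roll? (+ r dr) (+ c dc)) 1 0)))
  (for*/sum ([r (in-range rows)] [c (in-range cols)])
    (if (and (roll? r c) (< (neighbors r c) 4)) 1 0)))
```

Part 2's iterative removal is the natural extension: repeatedly remove accessible rolls and recount until the grid stops changing.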



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
openai-codex/gpt-5.3-codex     27s
anthropic/claude-sonnet-4-6    30s
alibaba/qwen3.5-plus           30s
anthropic/claude-opus-4-6      31s
kimi-coding/k2p5               33s
zai/glm-5                      41s
anthropic/claude-haiku-4-5     49s
minimax/MiniMax-M2.5           64s
mistral/devstral-2512          84s
alibaba/qwen3-coder-next       290s



Day 5 Part 1 — Range membership checking

Model                          Time   Result
openai-codex/gpt-5.3-codex     14s
anthropic/claude-haiku-4-5     15s
mistral/devstral-2512          16s
anthropic/claude-sonnet-4-6    17s
zai/glm-5                      18s
alibaba/qwen3-coder-next       19s
anthropic/claude-opus-4-6      21s
alibaba/qwen3.5-plus           21s
kimi-coding/k2p5               28s
minimax/MiniMax-M2.5           —*     ✓ (dirty retry)

* API timeout forced a fresh relaunch.



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time   Result
kimi-coding/k2p5               21s
alibaba/qwen3-coder-next       24s
openai-codex/gpt-5.3-codex     27s
anthropic/claude-opus-4-6      29s
anthropic/claude-haiku-4-5     30s
anthropic/claude-sonnet-4-6    32s
zai/glm-5                      36s
alibaba/qwen3.5-plus           44s
mistral/devstral-2512          59s
minimax/MiniMax-M2.5           789s   ✓ (2nd try)
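The title suggests the classic interval-merge pattern: sort ranges by start, coalesce overlapping or touching runs, and sum the lengths of the merged runs. A minimal sketch, assuming inclusive [lo, hi] ranges (the actual puzzle input format isn't reproduced here):

```racket
#lang racket
;; Hedged sketch: count how many distinct IDs are covered by at least
;; one range, assuming inclusive (lo . hi) integer pairs.
(define (covered-count ranges)
  (define sorted (sort ranges < #:key car))
  (for/fold ([lo #f] [hi #f] [total 0]
             #:result (+ total (if lo (add1 (- hi lo)) 0)))
            ([r (in-list sorted)])
    (cond
      [(not lo) (values (car r) (cdr r) total)]       ; first range
      [(<= (car r) (add1 hi))                         ; overlaps/touches
       (values lo (max hi (cdr r)) total)]
      [else                                           ; gap: bank the run
       (values (car r) (cdr r) (+ total (add1 (- hi lo))))])))
```

This runs in O(n log n) regardless of how wide the ranges are, which is the usual trap in this puzzle type: enumerating individual IDs blows up when ranges span billions of values.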

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 41s 16s 18s 26s 48s 38s 17s 27s 14s 27s 272s
anthropic/claude-sonnet-4-6 40s 20s 23s 36s 32s 31s 17s 30s 17s 32s 278s
anthropic/claude-opus-4-6 44s 23s 30s 41s 32s 34s 21s 31s 21s 29s 306s
anthropic/claude-haiku-4-5 39s 11s 12s 24s 229s 24s 13s 49s 15s 30s 446s
alibaba/qwen3.5-plus 62s 70s 31s 34s 34s 86s 41s 30s 21s 44s 453s
zai/glm-5 44s 25s 28s 50s 37s 171s 24s 41s 18s 36s 474s
mistral/devstral-2512 44s 49s 25s 21s 38s 156s 255s 84s 16s 59s 747s
kimi-coding/k2p5 40s 492s 14s 27s 40s 40s 22s 33s 28s 21s 757s
alibaba/qwen3-coder-next 50s 540s 11s 30s 50s 29s 40s 290s 19s 24s 1,083s
minimax/MiniMax-M2.5 78s 386s 37s 94s 110s 229s 63s 64s —* 789s 1,850s*

* MiniMax D5P1 required a dirty restart after API timeout; wall-clock time not captured cleanly. Total excludes D5P1.


Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 427 570 674 540 1,109 893 656 740 561 541 6,711
kimi-coding/k2p5 575 2,515 782 830 614 774 636 793 886 585 8,990
zai/glm-5 465 518 660 766 489 4,134 560 671 475 505 9,243
anthropic/claude-opus-4-6 704 1,158 1,453 1,435 797 958 901 930 865 813 10,014
anthropic/claude-sonnet-4-6 718 1,045 1,318 1,414 902 947 990 1,096 807 844 10,081
anthropic/claude-haiku-4-5 1,172 1,040 1,125 1,074 5,350 1,092 1,332 5,061 1,274 2,823 21,343
alibaba/qwen3-coder-next 1,053 6,960 768 1,421 1,345 1,679 4,761 10,162 1,430 818 30,397
alibaba/qwen3.5-plus 2,831 9,323 3,407 1,790 2,579 6,137 1,926 1,177 1,654 2,102 32,926
minimax/MiniMax-M2.5 1,005 6,380 1,392 3,668 3,208 7,825 2,038 1,184 1,316 29,365 57,381
mistral/devstral-2512 2,215 7,035 2,306 800 2,344 14,380 23,047 6,962 1,368 5,165 65,622

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0084 .0165 .0035 .0043 .0090 .0081 .0029 .0042 .0040 .0040 $0.06
zai/glm-5 .0061 .0068 .0072 .0104 .0271 .0761 .0222 .0190 .0052 .0073 $0.19
anthropic/claude-haiku-4-5 .0347 .0104 .0245 .0104 .0664 .0137 .0292 .0469 .0317 .0231 $0.29
alibaba/qwen3.5-plus .0237 .0460 .0213 .0185 .0137 .0538 .0553 .0388 .0108 .0216 $0.30
openai-codex/gpt-5.3-codex .0155 .0220 .0271 .0171 .0340 .0352 .0386 .0410 .0389 .0390 $0.31
anthropic/claude-sonnet-4-6 .0349 .0303 .0457 .0377 .0373 .0279 .0388 .0335 .0348 .0246 $0.35
anthropic/claude-opus-4-6 .1142 .0926 .1286 .0646 .1117 .0834 .1160 .0879 .1117 .0939 $1.00
alibaba/qwen3-coder-next .0360 .1663 .0092 .0198 .0993 .0962 .1017 .4650 .0828 .0647 $1.14
minimax/MiniMax-M2.5 .0144 .0334 .0057 .0259 .0356 .1261 .0452 .0395 .0208 .8727 $1.22
mistral/devstral-2512 .0233 .0987 .0171 .0147 .0286 .3224 .5733 .1688 .0142 .0667 $1.33

Observations

10/10 completers — zero ejections. Racket joins Python, Ruby, Elm, and Rust as the fifth language in this series where every model solved every part.

gpt-5.3-codex — fastest overall at 272s. Also the most token-efficient at 6,711 tokens total. Never needed a retry. No single part over 48s.

claude-sonnet-4-6 — second fastest at 278s, and remarkably consistent: every part landed between 17s and 40s.

kimi-coding/k2p5 — cheapest at $0.06 total, second fewest tokens at 8,990. The Day 1 Part 2 retry inflated its total time to 757s, but on 9 of 10 parts it finished in 40s or less.

minimax/MiniMax-M2.5 — slowest overall at 1,850s. Hit an API timeout on Day 5 Part 1, then gave a wrong answer on Day 5 Part 2 that took 789s and 29K tokens to fix. Day 1 Part 2 also took 386s. The model completed everything, but was consistently the bottleneck.

devstral-2512 — spiked on two parts: D3P2 (156s, 14K tokens) and D4P1 (255s, 23K tokens, $0.57). Every other part was routine.

Day 1 Part 2 was the stumbling block. Two models (k2p5, qwen3-coder-next) needed a second try, and MiniMax-M2.5 took 386s on its first attempt. For the rest of the field it was straightforward — most first-try solves landed under 50s.

No Racket-specific struggles. No model got stuck on S-expression syntax, #lang racket conventions, or Racket-specific library APIs. The parentheses didn't slow anyone down.

Cross-language snapshot

Language            Models completing all 10 parts
Python              10/10
Ruby                10/10
Elm                 10/10
Rust                10/10
Racket              10/10
Java                9/10
Elixir              7/10
Haskell             7/11
OCaml               5/9
ReScript (run 2)    2/10

Racket's perfect completion rate is notable. It's not a mainstream language, but it has clear semantics, good documentation, and a REPL-friendly workflow. Unlike Elm (which needed a scaffold) or Rust (which demands borrow-checking), Racket lets you write a quick script with #lang racket and go — and that simplicity may have helped.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.