Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, and Rust benchmarks, I ran the same AoC 2025 Days 1–5 setup in Racket.
Racket is a Lisp dialect from the Scheme family. It's well-known in the programming
languages community and widely used in education (How to Design Programs, SICP variants),
but it's not a mainstream production language. Models need to handle S-expressions,
#lang racket conventions, and functional idioms with mutable state available but
discouraged. No scaffolding was provided — each model started from scratch.
The result: another clean sweep. Every model solved every part.
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models completed all 10 parts.
Only four retries (plus one API-timeout restart) were needed across all 100 model-parts:
- kimi-coding/k2p5 — Day 1 Part 2 (wrong answer, fixed on 2nd try)
- alibaba/qwen3-coder-next — Day 1 Part 2 (wrong answer, fixed on 2nd try)
- anthropic/claude-haiku-4-5 — Day 3 Part 1 (wrong answer, fixed on 2nd try)
- minimax/MiniMax-M2.5 — Day 5 Part 1 (API timeout, dirty restart) and Day 5 Part 2 (wrong answer, fixed on 2nd try)
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 39s |
| anthropic/claude-sonnet-4-6 | 40s |
| kimi-coding/k2p5 | 40s |
| openai-codex/gpt-5.3-codex | 41s |
| anthropic/claude-opus-4-6 | 44s |
| zai/glm-5 | 44s |
| mistral/devstral-2512 | 44s |
| alibaba/qwen3-coder-next | 50s |
| alibaba/qwen3.5-plus | 62s |
| minimax/MiniMax-M2.5 | 78s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 11s | ✓ |
| openai-codex/gpt-5.3-codex | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 20s | ✓ |
| anthropic/claude-opus-4-6 | 23s | ✓ |
| zai/glm-5 | 25s | ✓ |
| mistral/devstral-2512 | 49s | ✓ |
| alibaba/qwen3.5-plus | 70s | ✓ |
| minimax/MiniMax-M2.5 | 386s | ✓ |
| kimi-coding/k2p5 | 492s | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 540s | ✓ (2nd try) |
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| alibaba/qwen3-coder-next | 11s |
| anthropic/claude-haiku-4-5 | 12s |
| kimi-coding/k2p5 | 14s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 23s |
| mistral/devstral-2512 | 25s |
| zai/glm-5 | 28s |
| anthropic/claude-opus-4-6 | 30s |
| alibaba/qwen3.5-plus | 31s |
| minimax/MiniMax-M2.5 | 37s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| mistral/devstral-2512 | 21s |
| anthropic/claude-haiku-4-5 | 24s |
| openai-codex/gpt-5.3-codex | 26s |
| kimi-coding/k2p5 | 27s |
| alibaba/qwen3-coder-next | 30s |
| alibaba/qwen3.5-plus | 34s |
| anthropic/claude-sonnet-4-6 | 36s |
| anthropic/claude-opus-4-6 | 41s |
| zai/glm-5 | 50s |
| minimax/MiniMax-M2.5 | 94s |
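The part name suggests checking whether an ID's digit string is some block repeated two or more times. A minimal Racket sketch of that check, purely illustrative since the actual ID semantics and input format aren't reproduced here:

```racket
#lang racket
;; Does digit string s consist of some block repeated at least twice?
;; Illustrative only; the real task's ID rules may differ.
(define (repeated-pattern? s)
  (define n (string-length s))
  (for/or ([len (in-range 1 (add1 (quotient n 2)))])
    (and (zero? (remainder n len))
         (let ([block (substring s 0 len)])
           ;; Every subsequent len-sized chunk must equal the first block.
           (for/and ([i (in-range len n len)])
             (string=? block (substring s i (+ i len))))))))

(repeated-pattern? "121212") ; => #t
(repeated-pattern? "123123") ; => #t
(repeated-pattern? "121213") ; => #f
```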
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| anthropic/claude-opus-4-6 | 32s | ✓ |
| alibaba/qwen3.5-plus | 34s | ✓ |
| zai/glm-5 | 37s | ✓ |
| mistral/devstral-2512 | 38s | ✓ |
| kimi-coding/k2p5 | 40s | ✓ |
| openai-codex/gpt-5.3-codex | 48s | ✓ |
| alibaba/qwen3-coder-next | 50s | ✓ |
| minimax/MiniMax-M2.5 | 110s | ✓ |
| anthropic/claude-haiku-4-5 | 229s | ✓ (2nd try) |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 24s |
| alibaba/qwen3-coder-next | 29s |
| anthropic/claude-sonnet-4-6 | 31s |
| anthropic/claude-opus-4-6 | 34s |
| openai-codex/gpt-5.3-codex | 38s |
| kimi-coding/k2p5 | 40s |
| alibaba/qwen3.5-plus | 86s |
| mistral/devstral-2512 | 156s |
| zai/glm-5 | 171s |
| minimax/MiniMax-M2.5 | 229s |
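Both Day 3 parts read like the classic problem of picking k digits in order to form the largest number (k = 2, then k = 12). That is my reading of the task names, not the actual spec; under that assumption, the standard greedy (take the largest digit that still leaves enough digits behind) is compact in Racket:

```racket
#lang racket
;; Greedy: choose k digits from `digits` (preserving order) to form the
;; largest possible number. My guess at the technique behind "joltage".
(define (max-k-digits digits k)
  (let loop ([ds digits] [k k] [acc '()])
    (if (zero? k)
        (reverse acc)
        (let* ([slack (- (length ds) k)]          ; how many digits we may skip
               [window (take ds (add1 slack))]    ; candidates for next pick
               [best (apply max window)]
               [idx (index-of window best)])
          (loop (drop ds (add1 idx)) (sub1 k) (cons best acc))))))

(max-k-digits '(3 9 2 5 8) 2) ; => '(9 8)
```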
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 13s |
| anthropic/claude-sonnet-4-6 | 17s |
| openai-codex/gpt-5.3-codex | 17s |
| anthropic/claude-opus-4-6 | 21s |
| kimi-coding/k2p5 | 22s |
| zai/glm-5 | 24s |
| alibaba/qwen3-coder-next | 40s |
| alibaba/qwen3.5-plus | 41s |
| minimax/MiniMax-M2.5 | 63s |
| mistral/devstral-2512 | 255s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 27s |
| anthropic/claude-sonnet-4-6 | 30s |
| alibaba/qwen3.5-plus | 30s |
| anthropic/claude-opus-4-6 | 31s |
| kimi-coding/k2p5 | 33s |
| zai/glm-5 | 41s |
| anthropic/claude-haiku-4-5 | 49s |
| minimax/MiniMax-M2.5 | 64s |
| mistral/devstral-2512 | 84s |
| alibaba/qwen3-coder-next | 290s |
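Day 4's two parts look like a neighbor-count predicate (Part 1) followed by an iterate-until-stable loop over it (Part 2). Here is a sketch of that fixed-point pattern with the grid as a set of occupied cells; the removal rule (fewer than four occupied neighbors) is a placeholder of mine, not the puzzle's:

```racket
#lang racket
;; Fixed-point removal sketch. Grid = set of occupied (x . y) cells.
;; The "< 4 neighbors" rule below is illustrative, not the real condition.
(define (neighbors cell)
  (match-define (cons x y) cell)
  (for*/list ([dx '(-1 0 1)] [dy '(-1 0 1)]
              #:unless (and (zero? dx) (zero? dy)))
    (cons (+ x dx) (+ y dy))))

(define (step grid)
  ;; Keep only cells with at least 4 occupied neighbors.
  (for/set ([cell (in-set grid)]
            #:when (>= (count (lambda (n) (set-member? grid n))
                              (neighbors cell))
                       4))
    cell))

(define (run-until-stable grid)
  (define next (step grid))
  (if (equal? next grid) grid (run-until-stable next)))
```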
Day 5 Part 1 — Range membership checking
| Model | Time | Result |
|---|---|---|
| openai-codex/gpt-5.3-codex | 14s | ✓ |
| anthropic/claude-haiku-4-5 | 15s | ✓ |
| mistral/devstral-2512 | 16s | ✓ |
| anthropic/claude-sonnet-4-6 | 17s | ✓ |
| zai/glm-5 | 18s | ✓ |
| alibaba/qwen3-coder-next | 19s | ✓ |
| anthropic/claude-opus-4-6 | 21s | ✓ |
| alibaba/qwen3.5-plus | 21s | ✓ |
| kimi-coding/k2p5 | 28s | ✓ |
| minimax/MiniMax-M2.5 | —* | ✓ (dirty retry) |
* API timeout forced a fresh relaunch.
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time | Result |
|---|---|---|
| kimi-coding/k2p5 | 21s | ✓ |
| alibaba/qwen3-coder-next | 24s | ✓ |
| openai-codex/gpt-5.3-codex | 27s | ✓ |
| anthropic/claude-opus-4-6 | 29s | ✓ |
| anthropic/claude-haiku-4-5 | 30s | ✓ |
| anthropic/claude-sonnet-4-6 | 32s | ✓ |
| zai/glm-5 | 36s | ✓ |
| alibaba/qwen3.5-plus | 44s | ✓ |
| mistral/devstral-2512 | 59s | ✓ |
| minimax/MiniMax-M2.5 | 789s | ✓ (2nd try) |
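The task name points at classic interval merging: sort the ranges, coalesce overlaps, sum the coverage. A minimal Racket sketch of that technique, assuming inclusive integer ranges represented as `(lo . hi)` pairs (the real input format isn't shown here):

```racket
#lang racket
;; Merge overlapping or adjacent inclusive ranges, then count covered IDs.
;; The (lo . hi) representation is an assumption for illustration.
(define (total-covered ranges)
  (define sorted (sort ranges < #:key car))
  (define merged
    (for/fold ([acc '()]) ([r (in-list sorted)])
      (match acc
        [(cons (cons lo hi) rest)
         #:when (<= (car r) (add1 hi))             ; overlaps or touches
         (cons (cons lo (max hi (cdr r))) rest)]   ; extend the current range
        [_ (cons r acc)])))
  (for/sum ([r (in-list merged)])
    (add1 (- (cdr r) (car r)))))

(total-covered '((1 . 3) (2 . 6) (10 . 12))) ; => 9
```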
Speed vs accuracy
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 41s | 16s | 18s | 26s | 48s | 38s | 17s | 27s | 14s | 27s | 272s |
| anthropic/claude-sonnet-4-6 | 40s | 20s | 23s | 36s | 32s | 31s | 17s | 30s | 17s | 32s | 278s |
| anthropic/claude-opus-4-6 | 44s | 23s | 30s | 41s | 32s | 34s | 21s | 31s | 21s | 29s | 306s |
| anthropic/claude-haiku-4-5 | 39s | 11s | 12s | 24s | 229s | 24s | 13s | 49s | 15s | 30s | 446s |
| alibaba/qwen3.5-plus | 62s | 70s | 31s | 34s | 34s | 86s | 41s | 30s | 21s | 44s | 453s |
| zai/glm-5 | 44s | 25s | 28s | 50s | 37s | 171s | 24s | 41s | 18s | 36s | 474s |
| mistral/devstral-2512 | 44s | 49s | 25s | 21s | 38s | 156s | 255s | 84s | 16s | 59s | 747s |
| kimi-coding/k2p5 | 40s | 492s | 14s | 27s | 40s | 40s | 22s | 33s | 28s | 21s | 757s |
| alibaba/qwen3-coder-next | 50s | 540s | 11s | 30s | 50s | 29s | 40s | 290s | 19s | 24s | 1,083s |
| minimax/MiniMax-M2.5 | 78s | 386s | 37s | 94s | 110s | 229s | 63s | 64s | —* | 789s | 1,850s* |
* MiniMax D5P1 required a dirty restart after API timeout; wall-clock time not captured cleanly. Total excludes D5P1.
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 427 | 570 | 674 | 540 | 1,109 | 893 | 656 | 740 | 561 | 541 | 6,711 |
| kimi-coding/k2p5 | 575 | 2,515 | 782 | 830 | 614 | 774 | 636 | 793 | 886 | 585 | 7,990 |
| zai/glm-5 | 465 | 518 | 660 | 766 | 489 | 4,134 | 560 | 671 | 475 | 505 | 8,243 |
| anthropic/claude-opus-4-6 | 704 | 1,158 | 1,453 | 1,435 | 797 | 958 | 901 | 930 | 865 | 813 | 10,014 |
| anthropic/claude-sonnet-4-6 | 718 | 1,045 | 1,318 | 1,414 | 902 | 947 | 990 | 1,096 | 807 | 844 | 10,081 |
| anthropic/claude-haiku-4-5 | 1,172 | 1,040 | 1,125 | 1,074 | 5,350 | 1,092 | 1,332 | 5,061 | 1,274 | 2,823 | 21,343 |
| alibaba/qwen3-coder-next | 1,053 | 6,960 | 768 | 1,421 | 1,345 | 1,679 | 4,761 | 10,162 | 1,430 | 818 | 30,397 |
| alibaba/qwen3.5-plus | 2,831 | 9,323 | 3,407 | 1,790 | 2,579 | 6,137 | 1,926 | 1,177 | 1,654 | 2,102 | 32,926 |
| minimax/MiniMax-M2.5 | 1,005 | 6,380 | 1,392 | 3,668 | 3,208 | 7,825 | 2,038 | 1,184 | 1,316 | 29,365 | 57,381 |
| mistral/devstral-2512 | 2,215 | 7,035 | 2,306 | 800 | 2,344 | 14,380 | 23,047 | 6,962 | 1,368 | 5,165 | 65,622 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0084 | .0165 | .0035 | .0043 | .0090 | .0081 | .0029 | .0042 | .0040 | .0040 | $0.06 |
| zai/glm-5 | .0061 | .0068 | .0072 | .0104 | .0271 | .0761 | .0222 | .0190 | .0052 | .0073 | $0.19 |
| anthropic/claude-haiku-4-5 | .0347 | .0104 | .0245 | .0104 | .0664 | .0137 | .0292 | .0469 | .0317 | .0231 | $0.29 |
| alibaba/qwen3.5-plus | .0237 | .0460 | .0213 | .0185 | .0137 | .0538 | .0553 | .0388 | .0108 | .0216 | $0.30 |
| openai-codex/gpt-5.3-codex | .0155 | .0220 | .0271 | .0171 | .0340 | .0352 | .0386 | .0410 | .0389 | .0390 | $0.31 |
| anthropic/claude-sonnet-4-6 | .0349 | .0303 | .0457 | .0377 | .0373 | .0279 | .0388 | .0335 | .0348 | .0246 | $0.35 |
| anthropic/claude-opus-4-6 | .1142 | .0926 | .1286 | .0646 | .1117 | .0834 | .1160 | .0879 | .1117 | .0939 | $1.00 |
| alibaba/qwen3-coder-next | .0360 | .1663 | .0092 | .0198 | .0993 | .0962 | .1017 | .4650 | .0828 | .0647 | $1.14 |
| minimax/MiniMax-M2.5 | .0144 | .0334 | .0057 | .0259 | .0356 | .1261 | .0452 | .0395 | .0208 | .8727 | $1.22 |
| mistral/devstral-2512 | .0233 | .0987 | .0171 | .0147 | .0286 | .3224 | .5733 | .1688 | .0142 | .0667 | $1.33 |
Observations
10/10 completers — zero ejections. Racket joins Python, Ruby, Elm, and Rust as the fifth language in this series where every model solved every part.
gpt-5.3-codex — fastest overall at 272s. Also the most token-efficient at 6,711
tokens total. Never needed a retry. No single part over 48s.
claude-sonnet-4-6 — second fastest at 278s, remarkably consistent. Every part
landed between 17s and 40s.
kimi-coding/k2p5 — cheapest at $0.06 total, second fewest tokens at 7,990. The
Day 1 Part 2 retry inflated its total time to 757s, but on 9 of 10 parts it finished
in 40s or less.
minimax/MiniMax-M2.5 — slowest overall at 1,850s. Hit an API timeout on Day 5
Part 1, then gave a wrong answer on Day 5 Part 2 that took 789s and 29K tokens to fix.
Day 1 Part 2 also took 386s. The model completed everything, but was consistently the
bottleneck.
devstral-2512 — spiked on two parts: D3P2 (156s, 14K tokens) and D4P1 (255s,
23K tokens, $0.57). Every other part was routine.
Day 1 Part 2 was the stumbling block. Two models (k2p5, qwen3-coder-next)
needed a second try, and MiniMax-M2.5 took 386s on its first try. Part 2 was otherwise
straightforward — most first-try solves landed under 50s.
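I won't reproduce the puzzle, but the likely trap is wraparound counting of the kind sketched below. The dial size of 100 and signed click deltas are my assumptions, not the actual spec; the subtlety is whether landing exactly on zero counts as a crossing:

```racket
#lang racket
;; Hypothetical sketch of the D1P2 pattern: count how many times the dial
;; passes (or lands on) position 0 while rotating from `pos` by `delta`
;; clicks. Dial size 100 and signed deltas are illustrative assumptions.
(define DIAL-SIZE 100)

(define (zero-crossings pos delta)
  (define step (if (negative? delta) -1 1))
  (for/fold ([p pos] [crossings 0] #:result crossings)
            ([_ (in-range (abs delta))])
    (define next (modulo (+ p step) DIAL-SIZE))
    (values next (+ crossings (if (zero? next) 1 0)))))

(zero-crossings 50 120) ; => 1  (passes 0 once, ends at 70)
```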
No Racket-specific struggles. No model got stuck on S-expression syntax,
#lang racket conventions, or Racket-specific library APIs. The parentheses didn't slow
anyone down.
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| Rust | 10/10 |
| Racket | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Racket's perfect completion rate is notable. It's not a mainstream language, but it has
clear semantics, good documentation, and a REPL-friendly workflow. Unlike Elm (which
needed a scaffold) or Rust (which demands satisfying the borrow checker), Racket
lets you write a quick script with #lang racket and go — and that simplicity may
have helped.
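For a sense of how little ceremony that involves, here is a hypothetical minimal skeleton (file name and input format are illustrative), run as `racket solve.rkt < input.txt`:

```racket
#lang racket
;; Hypothetical minimal AoC skeleton: read stdin, parse, print an answer.
(define lines (port->lines (current-input-port)))
(define nums (filter number? (map string->number lines)))  ; drop blank lines
(displayln (apply + nums))
```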
Benchmarked on 2026-02-27 using pi as the agent harness.
This post was written with AI assistance.