Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Clojure)

Tags = [ Clojure, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, Rust, and Racket benchmarks, I ran the same AoC 2025 Days 1–5 setup in Clojure.

Clojure is a Lisp dialect that runs on the JVM. It's known for its persistent data structures, REPL-driven development, and strong concurrency primitives. For this benchmark, models needed to write standalone scripts runnable via clj. The JVM startup cost is real — one model got trapped in repeated slow clj invocations on a single part, ballooning its wall-clock time — but the language itself posed no conceptual difficulty. No scaffolding was provided.
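To make "standalone script runnable via clj" concrete, here is a minimal sketch of the script shape the models were asked to produce. This is illustrative only, not any model's actual output: the puzzle logic shown (summing integers, one per line) and the filename `input.txt` are placeholder assumptions.

```clojure
;; Minimal sketch of a standalone AoC-style script, run as: clj -M solve.clj
;; The puzzle logic and filename are hypothetical placeholders.
(require '[clojure.string :as str])

(defn solve [lines]
  ;; placeholder part 1: sum the integers, one per line
  (reduce + (map parse-long lines)))

;; Only read the input file if it exists, so the script loads cleanly anywhere.
(when (.exists (java.io.File. "input.txt"))
  (println (solve (str/split-lines (slurp "input.txt")))))
```

Each `clj` invocation pays the JVM startup cost mentioned above, which is why repeated trial runs on a single part can balloon wall-clock time.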

The result: 9 of 10 models completed all 10 parts. One ejection on Day 1 Part 2.

The contestants

#    Model
1    anthropic/claude-haiku-4-5
2    anthropic/claude-sonnet-4-6
3    anthropic/claude-opus-4-6
4    openai-codex/gpt-5.3-codex
5    zai/glm-5
6    minimax/MiniMax-M2.5
7    kimi-coding/k2p5
8    mistral/devstral-2512
9    alibaba/qwen3.5-plus
10   alibaba/qwen3-coder-next

Ejections

1 ejection:

  • alibaba/qwen3-coder-next — Day 1 Part 2: wrong answer on all 3 clean attempts (2132, 5637, 5637). Ejected.

Across the remaining 90 model-parts, three needed retries, all of which eventually succeeded:

  • anthropic/claude-haiku-4-5 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
  • mistral/devstral-2512 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
  • minimax/MiniMax-M2.5 — Day 2 Part 2 (wrong answer, fixed on 2nd try)

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
anthropic/claude-haiku-4-5     35s
openai-codex/gpt-5.3-codex     35s
mistral/devstral-2512          41s
anthropic/claude-opus-4-6      44s
alibaba/qwen3.5-plus           45s
anthropic/claude-sonnet-4-6    46s
alibaba/qwen3-coder-next       54s
kimi-coding/k2p5               58s
minimax/MiniMax-M2.5           67s
zai/glm-5                      72s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time    Result
openai-codex/gpt-5.3-codex     22s
anthropic/claude-opus-4-6      27s
anthropic/claude-sonnet-4-6    40s
alibaba/qwen3.5-plus           55s
kimi-coding/k2p5               83s
minimax/MiniMax-M2.5           111s
zai/glm-5                      141s
mistral/devstral-2512          344s    ✓ (3rd try)
anthropic/claude-haiku-4-5     365s    ✓ (3rd try)
alibaba/qwen3-coder-next               ✗ (ejected, 3/3 failed)



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
anthropic/claude-haiku-4-5     29s
openai-codex/gpt-5.3-codex     36s
mistral/devstral-2512          42s
kimi-coding/k2p5               47s
anthropic/claude-opus-4-6      57s
anthropic/claude-sonnet-4-6    64s
alibaba/qwen3.5-plus           68s
zai/glm-5                      72s
minimax/MiniMax-M2.5           84s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time    Result
openai-codex/gpt-5.3-codex     21s
mistral/devstral-2512          21s
zai/glm-5                      23s
anthropic/claude-haiku-4-5     28s
alibaba/qwen3.5-plus           33s
anthropic/claude-sonnet-4-6    34s
anthropic/claude-opus-4-6      39s
kimi-coding/k2p5               62s
minimax/MiniMax-M2.5           190s    ✓ (2nd try)



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex     27s
kimi-coding/k2p5               27s
anthropic/claude-opus-4-6      35s
anthropic/claude-haiku-4-5     42s
anthropic/claude-sonnet-4-6    48s
alibaba/qwen3.5-plus           53s
mistral/devstral-2512          55s
zai/glm-5                      70s
minimax/MiniMax-M2.5           103s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex     24s
kimi-coding/k2p5               25s
anthropic/claude-sonnet-4-6    26s
anthropic/claude-opus-4-6      31s
anthropic/claude-haiku-4-5     47s
mistral/devstral-2512          50s
zai/glm-5                      116s
minimax/MiniMax-M2.5           183s
alibaba/qwen3.5-plus           1,069s*

* qwen3.5-plus's first solution had an infinite loop that ran for over 16 minutes before being externally killed. After rewriting and fixing several subsequent bugs (runtime errors, unmatched parens), it eventually produced the correct answer.



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
anthropic/claude-haiku-4-5     29s
kimi-coding/k2p5               34s
anthropic/claude-opus-4-6      36s
mistral/devstral-2512          36s
alibaba/qwen3.5-plus           38s
openai-codex/gpt-5.3-codex     43s
anthropic/claude-sonnet-4-6    44s
minimax/MiniMax-M2.5           54s
zai/glm-5                      69s



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
mistral/devstral-2512          14s
anthropic/claude-haiku-4-5     19s
openai-codex/gpt-5.3-codex     20s
alibaba/qwen3.5-plus           22s
anthropic/claude-sonnet-4-6    23s
zai/glm-5                      30s
anthropic/claude-opus-4-6      32s
minimax/MiniMax-M2.5           35s
kimi-coding/k2p5               37s



Day 5 Part 1 — Range membership checking

Model                          Time
anthropic/claude-haiku-4-5     26s
kimi-coding/k2p5               27s
mistral/devstral-2512          32s
anthropic/claude-sonnet-4-6    34s
anthropic/claude-opus-4-6      35s
openai-codex/gpt-5.3-codex     38s
alibaba/qwen3.5-plus           41s
zai/glm-5                      60s
minimax/MiniMax-M2.5           84s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time
openai-codex/gpt-5.3-codex     18s
kimi-coding/k2p5               18s
anthropic/claude-sonnet-4-6    19s
anthropic/claude-haiku-4-5     22s
alibaba/qwen3.5-plus           23s
mistral/devstral-2512          24s
anthropic/claude-opus-4-6      27s
minimax/MiniMax-M2.5           30s
zai/glm-5                      31s

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 35s 22s 36s 21s 27s 24s 43s 20s 38s 18s 284s
anthropic/claude-opus-4-6 44s 27s 57s 39s 35s 31s 36s 32s 35s 27s 363s
anthropic/claude-sonnet-4-6 46s 40s 64s 34s 48s 26s 44s 23s 34s 19s 378s
kimi-coding/k2p5 58s 83s 47s 62s 27s 25s 34s 37s 27s 18s 418s
anthropic/claude-haiku-4-5 35s 365s 29s 28s 42s 47s 29s 19s 26s 22s 642s
mistral/devstral-2512 41s 344s 42s 21s 55s 50s 36s 14s 32s 24s 659s
zai/glm-5 72s 141s 72s 23s 70s 116s 69s 30s 60s 31s 684s
minimax/MiniMax-M2.5 67s 111s 84s 190s 103s 183s 54s 35s 84s 30s 941s
alibaba/qwen3.5-plus 45s 55s 68s 33s 53s 1,069s 38s 22s 41s 23s 1,447s
alibaba/qwen3-coder-next 54s (ejected on Day 1 Part 2)

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 474 491 608 569 451 632 526 473 945 481 5,650
anthropic/claude-opus-4-6 764 839 2,161 1,753 825 971 869 1,165 866 834 11,047
kimi-coding/k2p5 1,081 3,243 758 2,433 501 1,008 593 1,069 558 570 11,814
anthropic/claude-sonnet-4-6 1,277 1,872 2,274 1,528 1,782 980 1,865 872 1,165 692 14,307
zai/glm-5 1,413 4,696 1,663 557 1,645 4,361 1,677 594 1,024 582 18,212
anthropic/claude-haiku-4-5 970 5,811 1,061 971 3,058 4,781 1,467 1,204 1,234 1,450 22,007
mistral/devstral-2512 1,791 10,105 2,342 886 2,956 4,798 2,192 820 1,163 2,221 29,274
alibaba/qwen3.5-plus 1,948 6,586 6,076 2,183 3,467 4,784 1,972 1,152 2,126 1,420 31,714
minimax/MiniMax-M2.5 1,928 5,000 2,511 6,304 2,254 7,516 1,900 1,157 2,539 858 31,967
alibaba/qwen3-coder-next 2,158 15,431 17,589

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0142 .0181 .0038 .0112 .0087 .0108 .0097 .0122 .0097 .0082 $0.11
openai-codex/gpt-5.3-codex .0168 .0190 .0252 .0191 .0213 .0193 .0169 .0200 .0664 .0416 $0.27
anthropic/claude-haiku-4-5 .0268 .0693 .0276 .0179 .0367 .0404 .0297 .0125 .0339 .0204 $0.32
alibaba/qwen3.5-plus .0164 .0363 .0479 .0284 .0331 .0831 .0170 .0163 .0245 .0166 $0.32
minimax/MiniMax-M2.5 .0238 .0444 .0151 .0574 .0334 .0808 .0083 .0110 .0337 .0239 $0.33
mistral/devstral-2512 .0175 .1344 .0222 .0120 .0325 .0522 .0188 .0098 .0123 .0199 $0.33
zai/glm-5 .0402 .0527 .0227 .0105 .0387 .0483 .0397 .0201 .0518 .0236 $0.35
anthropic/claude-sonnet-4-6 .0498 .0480 .0805 .0439 .0671 .0316 .0622 .0309 .0453 .0225 $0.48
alibaba/qwen3-coder-next .0746 .4550 $0.53
anthropic/claude-opus-4-6 .1160 .0827 .1362 .0767 .1125 .0841 .1141 .1015 .1117 .0966 $1.03

Observations

9/10 completers — one ejection. qwen3-coder-next fell on Day 1 Part 2 after three wrong answers, while the other nine models finished all 10 parts.

gpt-5.3-codex — fastest overall at 284s, fewest tokens at 5,650, and never needed a retry. No single part over 43s. Consistent dominance, same as in Racket.

claude-opus-4-6 — second fastest at 363s, zero retries, rock-solid. The premium pricing ($1.03 total) remains its only weakness.

kimi-coding/k2p5 — cheapest at $0.11 total, with a respectable 418s. The best value proposition in the field.

Day 1 Part 2 was the graveyard. Three models needed retries and one was ejected. claude-haiku-4-5 and devstral-2512 both needed all three attempts, pushing their Day 1 Part 2 times past 340s. After clearing that hurdle, both ran clean for the remaining 8 parts.

The 17-minute outlier. qwen3.5-plus's first Day 3 Part 2 solution had an infinite loop that ran for over 16 minutes before being externally killed. The model then recognized the issue ("likely an infinite loop") and rewrote the algorithm, but hit several more bugs (runtime exceptions, unmatched parens) before finally producing the correct answer. Total wall-clock: 1,069s. Excluding that one disastrous part, qwen3.5-plus was a mid-pack performer.

MiniMax-M2.5 — completed everything but was consistently the slowest or second-slowest. Day 2 Part 2 required a retry (190s), and several other parts crossed the 100s mark. Total: 941s.

No Clojure-specific struggles. No model got stuck on S-expression syntax, Clojure's threading macros, or JVM interop beyond the startup cost. The language was accessible to all models.
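For readers unfamiliar with the idioms mentioned above, threading macros are Clojure's pipeline syntax: ->> feeds each result into the last argument of the next form. A generic sketch of this shape (not taken from any model's output; the task here is a made-up example):

```clojure
;; Generic sketch of the idiomatic pipeline style; not any model's actual code.
(require '[clojure.string :as str])

(defn sum-of-evens
  "Parse one integer per line, keep the even ones, and sum them."
  [input]
  (->> (str/split-lines input)
       (map parse-long)
       (filter even?)
       (reduce +)))

(sum-of-evens "1\n2\n3\n4\n")  ; => 6
```

Each step operates on an immutable sequence, so the pipeline reads top-to-bottom with no mutation to track.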

Cross-language snapshot

Language            Models completing all 10 parts
Python              10/10
Ruby                10/10
Elm                 10/10
Rust                10/10
Racket              10/10
Java                9/10
Clojure             9/10
Elixir              7/10
Haskell             7/11
OCaml               5/9
ReScript (run 2)    2/10

Clojure slots in alongside Java — one ejection, strong overall. The Lisp syntax was no barrier. The only real friction was runtime: JVM startup latency penalized models that didn't plan their execution strategy. As with Racket, these are languages that LLMs clearly know — the training data coverage is sufficient for correct solutions even if the languages aren't mainstream.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.