Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Clojure)

Tags = [ Clojure, AI, Advent of Code ]

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, Rust, and Racket benchmarks, I ran the same AoC 2025 Days 1–5 setup in Clojure.

Clojure is a Lisp dialect that runs on the JVM. It's known for its persistent data structures, REPL-driven development, and strong concurrency primitives. For this benchmark, models needed to write standalone scripts runnable via clj. The JVM startup cost is real — one model got trapped in repeated slow clj invocations on a single part, ballooning its wall-clock time — but the language itself posed no conceptual difficulty. No scaffolding was provided.
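To make "standalone script runnable via clj" concrete, here is a minimal sketch of the script shape the models were asked to produce. This is illustrative only, not any model's actual output: the puzzle logic shown (summing integers, one per line) and the filename `input.txt` are placeholder assumptions.

```clojure
;; Minimal sketch of a standalone AoC-style script, run as: clj -M solve.clj
;; The puzzle logic and filename are hypothetical placeholders.
(require '[clojure.string :as str])

(defn solve [lines]
  ;; placeholder part 1: sum the integers, one per line
  (reduce + (map parse-long lines)))

;; Only read the input file if it exists, so the script loads cleanly anywhere.
(when (.exists (java.io.File. "input.txt"))
  (println (solve (str/split-lines (slurp "input.txt")))))
```

Each `clj` invocation pays the JVM startup cost mentioned above, which is why repeated trial runs on a single part can balloon wall-clock time.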

The result: 9 of 10 models completed all 10 parts. One ejection on Day 1 Part 2.

The contestants

#    Model
1    anthropic/claude-haiku-4-5
2    anthropic/claude-sonnet-4-6
3    anthropic/claude-opus-4-6
4    openai-codex/gpt-5.3-codex
5    zai/glm-5
6    minimax/MiniMax-M2.5
7    kimi-coding/k2p5
8    mistral/devstral-2512
9    alibaba/qwen3.5-plus
10   alibaba/qwen3-coder-next

Ejections

1 ejection:

  • alibaba/qwen3-coder-next — Day 1 Part 2: wrong answer on all 3 clean attempts (2132, 5637, 5637). Ejected.

Across the remaining 90 model-parts, three needed retries, all of which eventually succeeded:

  • anthropic/claude-haiku-4-5 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
  • mistral/devstral-2512 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
  • minimax/MiniMax-M2.5 — Day 2 Part 2 (wrong answer, fixed on 2nd try)

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
anthropic/claude-haiku-4-5     35s
openai-codex/gpt-5.3-codex     35s
mistral/devstral-2512          41s
anthropic/claude-opus-4-6      44s
alibaba/qwen3.5-plus           45s
anthropic/claude-sonnet-4-6    46s
alibaba/qwen3-coder-next       54s
kimi-coding/k2p5               58s
minimax/MiniMax-M2.5           67s
zai/glm-5                      72s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time    Result
openai-codex/gpt-5.3-codex     22s
anthropic/claude-opus-4-6      27s
anthropic/claude-sonnet-4-6    40s
alibaba/qwen3.5-plus           55s
kimi-coding/k2p5               83s
minimax/MiniMax-M2.5           111s
zai/glm-5                      141s
mistral/devstral-2512          344s    ✓ (3rd try)
anthropic/claude-haiku-4-5     365s    ✓ (3rd try)
alibaba/qwen3-coder-next               ✗ (ejected, 3/3 failed)



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
anthropic/claude-haiku-4-5     29s
openai-codex/gpt-5.3-codex     36s
mistral/devstral-2512          42s
kimi-coding/k2p5               47s
anthropic/claude-opus-4-6      57s
anthropic/claude-sonnet-4-6    64s
alibaba/qwen3.5-plus           68s
zai/glm-5                      72s
minimax/MiniMax-M2.5           84s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time    Result
openai-codex/gpt-5.3-codex     21s
mistral/devstral-2512          21s
zai/glm-5                      23s
anthropic/claude-haiku-4-5     28s
alibaba/qwen3.5-plus           33s
anthropic/claude-sonnet-4-6    34s
anthropic/claude-opus-4-6      39s
kimi-coding/k2p5               62s
minimax/MiniMax-M2.5           190s    ✓ (2nd try)



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex     27s
kimi-coding/k2p5               27s
anthropic/claude-opus-4-6      35s
anthropic/claude-haiku-4-5     42s
anthropic/claude-sonnet-4-6    48s
alibaba/qwen3.5-plus           53s
mistral/devstral-2512          55s
zai/glm-5                      70s
minimax/MiniMax-M2.5           103s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex     24s
kimi-coding/k2p5               25s
anthropic/claude-sonnet-4-6    26s
anthropic/claude-opus-4-6      31s
anthropic/claude-haiku-4-5     47s
mistral/devstral-2512          50s
zai/glm-5                      116s
minimax/MiniMax-M2.5           183s
alibaba/qwen3.5-plus           1,069s*

* qwen3.5-plus's first solution had an infinite loop that ran for over 16 minutes before being externally killed. After rewriting and fixing several subsequent bugs (runtime errors, unmatched parens), it eventually produced the correct answer.



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
anthropic/claude-haiku-4-5     29s
kimi-coding/k2p5               34s
anthropic/claude-opus-4-6      36s
mistral/devstral-2512          36s
alibaba/qwen3.5-plus           38s
openai-codex/gpt-5.3-codex     43s
anthropic/claude-sonnet-4-6    44s
minimax/MiniMax-M2.5           54s
zai/glm-5                      69s



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
mistral/devstral-2512          14s
anthropic/claude-haiku-4-5     19s
openai-codex/gpt-5.3-codex     20s
alibaba/qwen3.5-plus           22s
anthropic/claude-sonnet-4-6    23s
zai/glm-5                      30s
anthropic/claude-opus-4-6      32s
minimax/MiniMax-M2.5           35s
kimi-coding/k2p5               37s



Day 5 Part 1 — Range membership checking

Model                          Time
anthropic/claude-haiku-4-5     26s
kimi-coding/k2p5               27s
mistral/devstral-2512          32s
anthropic/claude-sonnet-4-6    34s
anthropic/claude-opus-4-6      35s
openai-codex/gpt-5.3-codex     38s
alibaba/qwen3.5-plus           41s
zai/glm-5                      60s
minimax/MiniMax-M2.5           84s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time
openai-codex/gpt-5.3-codex     18s
kimi-coding/k2p5               18s
anthropic/claude-sonnet-4-6    19s
anthropic/claude-haiku-4-5     22s
alibaba/qwen3.5-plus           23s
mistral/devstral-2512          24s
anthropic/claude-opus-4-6      27s
minimax/MiniMax-M2.5           30s
zai/glm-5                      31s

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 35s 22s 36s 21s 27s 24s 43s 20s 38s 18s 284s
anthropic/claude-opus-4-6 44s 27s 57s 39s 35s 31s 36s 32s 35s 27s 363s
anthropic/claude-sonnet-4-6 46s 40s 64s 34s 48s 26s 44s 23s 34s 19s 378s
kimi-coding/k2p5 58s 83s 47s 62s 27s 25s 34s 37s 27s 18s 418s
anthropic/claude-haiku-4-5 35s 365s 29s 28s 42s 47s 29s 19s 26s 22s 642s
mistral/devstral-2512 41s 344s 42s 21s 55s 50s 36s 14s 32s 24s 659s
zai/glm-5 72s 141s 72s 23s 70s 116s 69s 30s 60s 31s 684s
minimax/MiniMax-M2.5 67s 111s 84s 190s 103s 183s 54s 35s 84s 30s 941s
alibaba/qwen3.5-plus 45s 55s 68s 33s 53s 1,069s 38s 22s 41s 23s 1,447s
alibaba/qwen3-coder-next 54s (ejected on Day 1 Part 2)

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 474 491 608 569 451 632 526 473 945 481 5,650
anthropic/claude-opus-4-6 764 839 2,161 1,753 825 971 869 1,165 866 834 11,047
kimi-coding/k2p5 1,081 3,243 758 2,433 501 1,008 593 1,069 558 570 11,814
anthropic/claude-sonnet-4-6 1,277 1,872 2,274 1,528 1,782 980 1,865 872 1,165 692 14,307
zai/glm-5 1,413 4,696 1,663 557 1,645 4,361 1,677 594 1,024 582 18,212
anthropic/claude-haiku-4-5 970 5,811 1,061 971 3,058 4,781 1,467 1,204 1,234 1,450 22,007
mistral/devstral-2512 1,791 10,105 2,342 886 2,956 4,798 2,192 820 1,163 2,221 29,274
alibaba/qwen3.5-plus 1,948 6,586 6,076 2,183 3,467 4,784 1,972 1,152 2,126 1,420 31,714
minimax/MiniMax-M2.5 1,928 5,000 2,511 6,304 2,254 7,516 1,900 1,157 2,539 858 31,967
alibaba/qwen3-coder-next 2,158 15,431 17,589

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0142 .0181 .0038 .0112 .0087 .0108 .0097 .0122 .0097 .0082 $0.11
openai-codex/gpt-5.3-codex .0168 .0190 .0252 .0191 .0213 .0193 .0169 .0200 .0664 .0416 $0.27
anthropic/claude-haiku-4-5 .0268 .0693 .0276 .0179 .0367 .0404 .0297 .0125 .0339 .0204 $0.32
alibaba/qwen3.5-plus .0164 .0363 .0479 .0284 .0331 .0831 .0170 .0163 .0245 .0166 $0.32
minimax/MiniMax-M2.5 .0238 .0444 .0151 .0574 .0334 .0808 .0083 .0110 .0337 .0239 $0.33
mistral/devstral-2512 .0175 .1344 .0222 .0120 .0325 .0522 .0188 .0098 .0123 .0199 $0.33
zai/glm-5 .0402 .0527 .0227 .0105 .0387 .0483 .0397 .0201 .0518 .0236 $0.35
anthropic/claude-sonnet-4-6 .0498 .0480 .0805 .0439 .0671 .0316 .0622 .0309 .0453 .0225 $0.48
alibaba/qwen3-coder-next .0746 .4550 $0.53
anthropic/claude-opus-4-6 .1160 .0827 .1362 .0767 .1125 .0841 .1141 .1015 .1117 .0966 $1.03

Observations

9/10 completers — one ejection. qwen3-coder-next fell on Day 1 Part 2 after three wrong answers, while the other nine models finished all 10 parts.

gpt-5.3-codex — fastest overall at 284s, fewest tokens at 5,650, and never needed a retry. No single part over 43s. Consistent dominance, same as in Racket.

claude-opus-4-6 — second fastest at 363s, zero retries, rock-solid. The premium pricing ($1.03 total) remains its only weakness.

kimi-coding/k2p5 — cheapest at $0.11 total, with a respectable 418s. The best value proposition in the field.

Day 1 Part 2 was the graveyard. Three models needed retries and one was ejected. claude-haiku-4-5 and devstral-2512 both needed all three attempts, pushing their Day 1 Part 2 times past 340s. After clearing that hurdle, both ran clean for the remaining 8 parts.

The 17-minute outlier. qwen3.5-plus's first Day 3 Part 2 solution had an infinite loop that ran for over 16 minutes before being externally killed. The model then recognized the issue ("likely an infinite loop") and rewrote the algorithm, but hit several more bugs (runtime exceptions, unmatched parens) before finally producing the correct answer. Total wall-clock: 1,069s. Excluding that one disastrous part, qwen3.5-plus was a mid-pack performer.

MiniMax-M2.5 — completed everything but was consistently the slowest or second-slowest. Day 2 Part 2 required a retry (190s), and several other parts crossed the 100s mark. Total: 941s.

No Clojure-specific struggles. No model got stuck on S-expression syntax, Clojure's threading macros, or JVM interop beyond the startup cost. The language was accessible to all models.
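For readers unfamiliar with the idioms mentioned above, threading macros are Clojure's pipeline syntax: ->> feeds each result into the last argument of the next form. A generic sketch of this shape (not taken from any model's output; the task here is a made-up example):

```clojure
;; Generic sketch of the idiomatic pipeline style; not any model's actual code.
(require '[clojure.string :as str])

(defn sum-of-evens
  "Parse one integer per line, keep the even ones, and sum them."
  [input]
  (->> (str/split-lines input)
       (map parse-long)
       (filter even?)
       (reduce +)))

(sum-of-evens "1\n2\n3\n4\n")  ; => 6
```

Each step operates on an immutable sequence, so the pipeline reads top-to-bottom with no mutation to track.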

Cross-language snapshot

Language            Models completing all 10 parts
Python              10/10
Ruby                10/10
Elm                 10/10
Rust                10/10
Racket              10/10
Java                9/10
Clojure             9/10
Elixir              7/10
Haskell             7/11
OCaml               5/9
ReScript (run 2)    2/10

Clojure slots in alongside Java — one ejection, strong overall. The Lisp syntax was no barrier. The only real friction was runtime: JVM startup latency penalized models that didn't plan their execution strategy. As with Racket, these are languages that LLMs clearly know — the training data coverage is sufficient for correct solutions even if the languages aren't mainstream.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.