Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, Rust, and Racket benchmarks, I ran the same AoC 2025 Days 1–5 setup in Clojure.
Clojure is a Lisp dialect that runs on the JVM. It's known for its persistent data
structures, REPL-driven development, and strong concurrency primitives. For this benchmark,
models needed to write standalone scripts runnable via clj. The JVM startup cost is real —
one model got trapped in repeated slow clj invocations on a single part, ballooning its
wall-clock time — but the language itself posed no conceptual difficulty. No scaffolding was
provided.
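For reference, the shape each model had to target looks roughly like this: one self-contained script that clj can run directly (file names here are hypothetical, not from any model's output). Every separate clj invocation boots a fresh JVM, which is where the startup tax comes from:

```clojure
;; solution.clj (hypothetical name) — run with: clj -M solution.clj
;; Each `clj` invocation boots a fresh JVM, hence the startup cost noted above.
(require '[clojure.string :as str])

(defn solve [input]
  ;; placeholder logic: count non-blank input lines
  (count (remove str/blank? (str/split-lines input))))

;; In a real script the input would come from disk, e.g.:
;; (println (solve (slurp "input.txt")))
(println (solve "line one\n\nline two"))
```

A model that re-runs a script like this on every small edit pays the JVM boot cost each time, which is exactly how one contestant's wall-clock time ballooned.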
The result: 9 of 10 models completed all 10 parts. One ejection on Day 1 Part 2.
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
One ejection:
alibaba/qwen3-coder-next — Day 1 Part 2: wrong answer on all 3 clean attempts (2132, 5637, 5637). Ejected.
Three retries succeeded across the remaining 90 model-parts:
anthropic/claude-haiku-4-5 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
mistral/devstral-2512 — Day 1 Part 2 (wrong answer, fixed on 3rd try)
minimax/MiniMax-M2.5 — Day 2 Part 2 (wrong answer, fixed on 2nd try)
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 35s |
| openai-codex/gpt-5.3-codex | 35s |
| mistral/devstral-2512 | 41s |
| anthropic/claude-opus-4-6 | 44s |
| alibaba/qwen3.5-plus | 45s |
| anthropic/claude-sonnet-4-6 | 46s |
| alibaba/qwen3-coder-next | 54s |
| kimi-coding/k2p5 | 58s |
| minimax/MiniMax-M2.5 | 67s |
| zai/glm-5 | 72s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| openai-codex/gpt-5.3-codex | 22s | ✓ |
| anthropic/claude-opus-4-6 | 27s | ✓ |
| anthropic/claude-sonnet-4-6 | 40s | ✓ |
| alibaba/qwen3.5-plus | 55s | ✓ |
| kimi-coding/k2p5 | 83s | ✓ |
| minimax/MiniMax-M2.5 | 111s | ✓ |
| zai/glm-5 | 141s | ✓ |
| mistral/devstral-2512 | 344s | ✓ (3rd try) |
| anthropic/claude-haiku-4-5 | 365s | ✓ (3rd try) |
| alibaba/qwen3-coder-next | — | ✗ (ejected, 3/3 failed) |
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 29s |
| openai-codex/gpt-5.3-codex | 36s |
| mistral/devstral-2512 | 42s |
| kimi-coding/k2p5 | 47s |
| anthropic/claude-opus-4-6 | 57s |
| anthropic/claude-sonnet-4-6 | 64s |
| alibaba/qwen3.5-plus | 68s |
| zai/glm-5 | 72s |
| minimax/MiniMax-M2.5 | 84s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time | Result |
|---|---|---|
| openai-codex/gpt-5.3-codex | 21s | ✓ |
| mistral/devstral-2512 | 21s | ✓ |
| zai/glm-5 | 23s | ✓ |
| anthropic/claude-haiku-4-5 | 28s | ✓ |
| alibaba/qwen3.5-plus | 33s | ✓ |
| anthropic/claude-sonnet-4-6 | 34s | ✓ |
| anthropic/claude-opus-4-6 | 39s | ✓ |
| kimi-coding/k2p5 | 62s | ✓ |
| minimax/MiniMax-M2.5 | 190s | ✓ (2nd try) |
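The task title suggests the classic "is this digit string one block repeated two or more times?" check. A sketch of that standard test, with the puzzle semantics assumed from the title rather than confirmed:

```clojure
;; Assumed reading of the task: an ID qualifies if its digit string is a
;; single block repeated two or more times (e.g. "1212", "777").
(defn repeated-pattern? [s]
  (let [n (count s)]
    (boolean
     (some (fn [k]
             ;; try every proper block length k that divides the string
             (and (zero? (mod n k))
                  (= s (apply str (repeat (quot n k) (subs s 0 k))))))
           (range 1 n)))))
```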
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 27s |
| kimi-coding/k2p5 | 27s |
| anthropic/claude-opus-4-6 | 35s |
| anthropic/claude-haiku-4-5 | 42s |
| anthropic/claude-sonnet-4-6 | 48s |
| alibaba/qwen3.5-plus | 53s |
| mistral/devstral-2512 | 55s |
| zai/glm-5 | 70s |
| minimax/MiniMax-M2.5 | 103s |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 24s |
| kimi-coding/k2p5 | 25s |
| anthropic/claude-sonnet-4-6 | 26s |
| anthropic/claude-opus-4-6 | 31s |
| anthropic/claude-haiku-4-5 | 47s |
| mistral/devstral-2512 | 50s |
| zai/glm-5 | 116s |
| minimax/MiniMax-M2.5 | 183s |
| alibaba/qwen3.5-plus | 1,069s* |
* qwen3.5-plus's first solution had an infinite loop that ran for over 16 minutes
before being externally killed. After rewriting and fixing several subsequent bugs
(runtime errors, unmatched parens), it eventually produced the correct answer.
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 29s |
| kimi-coding/k2p5 | 34s |
| anthropic/claude-opus-4-6 | 36s |
| mistral/devstral-2512 | 36s |
| alibaba/qwen3.5-plus | 38s |
| openai-codex/gpt-5.3-codex | 43s |
| anthropic/claude-sonnet-4-6 | 44s |
| minimax/MiniMax-M2.5 | 54s |
| zai/glm-5 | 69s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| mistral/devstral-2512 | 14s |
| anthropic/claude-haiku-4-5 | 19s |
| openai-codex/gpt-5.3-codex | 20s |
| alibaba/qwen3.5-plus | 22s |
| anthropic/claude-sonnet-4-6 | 23s |
| zai/glm-5 | 30s |
| anthropic/claude-opus-4-6 | 32s |
| minimax/MiniMax-M2.5 | 35s |
| kimi-coding/k2p5 | 37s |
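An "iterative removal simulation" like this usually reduces to iterating a removal step until nothing changes. A sketch of that fixed-point pattern in Clojure; the grid representation and removal rule here are assumptions for illustration, not taken from the actual puzzle:

```clojure
;; Iterate-until-stable pattern (grid as a set; removal rule is caller-supplied).
(defn step [grid removable?]
  ;; remove every cell flagged this round, all at once
  (reduce disj grid (filter #(removable? grid %) grid)))

(defn run-to-fixed-point [grid removable?]
  ;; persistent sets make "did anything change?" a cheap equality check
  (->> (iterate #(step % removable?) grid)
       (partition 2 1)
       (drop-while (fn [[a b]] (not= a b)))
       ffirst))
```

With cells modeled as integers and a toy rule "remove any cell with no immediate neighbor", `(run-to-fixed-point #{1 2 3 10} (fn [g c] (not-any? g [(dec c) (inc c)])))` settles to `#{1 2 3}`.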
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 26s |
| kimi-coding/k2p5 | 27s |
| mistral/devstral-2512 | 32s |
| anthropic/claude-sonnet-4-6 | 34s |
| anthropic/claude-opus-4-6 | 35s |
| openai-codex/gpt-5.3-codex | 38s |
| alibaba/qwen3.5-plus | 41s |
| zai/glm-5 | 60s |
| minimax/MiniMax-M2.5 | 84s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 18s |
| kimi-coding/k2p5 | 18s |
| anthropic/claude-sonnet-4-6 | 19s |
| anthropic/claude-haiku-4-5 | 22s |
| alibaba/qwen3.5-plus | 23s |
| mistral/devstral-2512 | 24s |
| anthropic/claude-opus-4-6 | 27s |
| minimax/MiniMax-M2.5 | 30s |
| zai/glm-5 | 31s |
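The puzzle input isn't shown here, but "counting total IDs covered by overlapping ranges" is typically solved by sorting intervals, merging overlaps, and summing inclusive lengths. A sketch of that standard approach (interval semantics assumed, not confirmed against the puzzle):

```clojure
;; Classic interval merge: sort ranges by start, fold overlapping/adjacent
;; ones together, then sum the inclusive lengths of the merged ranges.
(defn merge-ranges [ranges]
  (reduce (fn [acc [lo hi]]
            (let [[plo phi] (peek acc)]
              (if (and plo (<= lo (inc phi)))
                (conj (pop acc) [plo (max phi hi)])
                (conj acc [lo hi]))))
          []
          (sort ranges)))

(defn total-covered [ranges]
  (reduce + (map (fn [[lo hi]] (inc (- hi lo))) (merge-ranges ranges))))
```

For example, `(total-covered [[1 3] [2 5] [8 9]])` merges to `[[1 5] [8 9]]` and returns 7.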
Speed vs accuracy
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 35s | 22s | 36s | 21s | 27s | 24s | 43s | 20s | 38s | 18s | 284s |
| anthropic/claude-opus-4-6 | 44s | 27s | 57s | 39s | 35s | 31s | 36s | 32s | 35s | 27s | 363s |
| anthropic/claude-sonnet-4-6 | 46s | 40s | 64s | 34s | 48s | 26s | 44s | 23s | 34s | 19s | 378s |
| kimi-coding/k2p5 | 58s | 83s | 47s | 62s | 27s | 25s | 34s | 37s | 27s | 18s | 418s |
| anthropic/claude-haiku-4-5 | 35s | 365s | 29s | 28s | 42s | 47s | 29s | 19s | 26s | 22s | 642s |
| mistral/devstral-2512 | 41s | 344s | 42s | 21s | 55s | 50s | 36s | 14s | 32s | 24s | 659s |
| zai/glm-5 | 72s | 141s | 72s | 23s | 70s | 116s | 69s | 30s | 60s | 31s | 684s |
| minimax/MiniMax-M2.5 | 67s | 111s | 84s | 190s | 103s | 183s | 54s | 35s | 84s | 30s | 941s |
| alibaba/qwen3.5-plus | 45s | 55s | 68s | 33s | 53s | 1,069s | 38s | 22s | 41s | 23s | 1,447s |
| alibaba/qwen3-coder-next | 54s | ✗ | — | — | — | — | — | — | — | — | — |
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 474 | 491 | 608 | 569 | 451 | 632 | 526 | 473 | 945 | 481 | 5,650 |
| anthropic/claude-opus-4-6 | 764 | 839 | 2,161 | 1,753 | 825 | 971 | 869 | 1,165 | 866 | 834 | 11,047 |
| kimi-coding/k2p5 | 1,081 | 3,243 | 758 | 2,433 | 501 | 1,008 | 593 | 1,069 | 558 | 570 | 11,814 |
| anthropic/claude-sonnet-4-6 | 1,277 | 1,872 | 2,274 | 1,528 | 1,782 | 980 | 1,865 | 872 | 1,165 | 692 | 14,307 |
| zai/glm-5 | 1,413 | 4,696 | 1,663 | 557 | 1,645 | 4,361 | 1,677 | 594 | 1,024 | 582 | 18,212 |
| anthropic/claude-haiku-4-5 | 970 | 5,811 | 1,061 | 971 | 3,058 | 4,781 | 1,467 | 1,204 | 1,234 | 1,450 | 22,007 |
| mistral/devstral-2512 | 1,791 | 10,105 | 2,342 | 886 | 2,956 | 4,798 | 2,192 | 820 | 1,163 | 2,221 | 29,274 |
| alibaba/qwen3.5-plus | 1,948 | 6,586 | 6,076 | 2,183 | 3,467 | 4,784 | 1,972 | 1,152 | 2,126 | 1,420 | 31,714 |
| minimax/MiniMax-M2.5 | 1,928 | 5,000 | 2,511 | 6,304 | 2,254 | 7,516 | 1,900 | 1,157 | 2,539 | 858 | 31,967 |
| alibaba/qwen3-coder-next | 2,158 | 15,431 | — | — | — | — | — | — | — | — | 17,589 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0142 | .0181 | .0038 | .0112 | .0087 | .0108 | .0097 | .0122 | .0097 | .0082 | $0.11 |
| openai-codex/gpt-5.3-codex | .0168 | .0190 | .0252 | .0191 | .0213 | .0193 | .0169 | .0200 | .0664 | .0416 | $0.27 |
| anthropic/claude-haiku-4-5 | .0268 | .0693 | .0276 | .0179 | .0367 | .0404 | .0297 | .0125 | .0339 | .0204 | $0.32 |
| alibaba/qwen3.5-plus | .0164 | .0363 | .0479 | .0284 | .0331 | .0831 | .0170 | .0163 | .0245 | .0166 | $0.32 |
| minimax/MiniMax-M2.5 | .0238 | .0444 | .0151 | .0574 | .0334 | .0808 | .0083 | .0110 | .0337 | .0239 | $0.33 |
| mistral/devstral-2512 | .0175 | .1344 | .0222 | .0120 | .0325 | .0522 | .0188 | .0098 | .0123 | .0199 | $0.33 |
| zai/glm-5 | .0402 | .0527 | .0227 | .0105 | .0387 | .0483 | .0397 | .0201 | .0518 | .0236 | $0.35 |
| anthropic/claude-sonnet-4-6 | .0498 | .0480 | .0805 | .0439 | .0671 | .0316 | .0622 | .0309 | .0453 | .0225 | $0.48 |
| alibaba/qwen3-coder-next | .0746 | .4550 | — | — | — | — | — | — | — | — | $0.53 |
| anthropic/claude-opus-4-6 | .1160 | .0827 | .1362 | .0767 | .1125 | .0841 | .1141 | .1015 | .1117 | .0966 | $1.03 |
Observations
9/10 completers — one ejection. qwen3-coder-next fell on Day 1 Part 2 after three
wrong answers, while the other nine models finished all 10 parts.
gpt-5.3-codex — fastest overall at 284s, fewest tokens at 5,650, and never needed
a retry. No single part over 43s. Consistent dominance, same as in Racket.
claude-opus-4-6 — second fastest at 363s, zero retries, rock-solid. The premium
pricing ($1.03 total) remains its only weakness.
kimi-coding/k2p5 — cheapest at $0.11 total, with a respectable 418s. The best
value proposition in the field.
Day 1 Part 2 was the graveyard. Three models needed retries and one was ejected.
claude-haiku-4-5 and devstral-2512 both needed all three attempts, pushing their
Day 1 Part 2 times past 340s. After clearing that hurdle, both ran clean for the
remaining 8 parts.
The 17-minute outlier. qwen3.5-plus's first Day 3 Part 2 solution had an infinite
loop that ran for over 16 minutes before being externally killed. The model then
recognized the issue ("likely an infinite loop"), rewrote the algorithm, but hit
several more bugs (runtime exceptions, unmatched parens) before finally producing the
correct answer. Total wall-clock: 1,069s. Excluding that one disastrous part,
qwen3.5-plus was a mid-pack performer.
MiniMax-M2.5 — completed everything but was consistently the slowest or
second-slowest. Day 2 Part 2 required a retry (190s), and several other parts crossed
the 100s mark. Total: 941s.
No Clojure-specific struggles. No model got stuck on S-expression syntax, Clojure's threading macros, or JVM interop beyond the startup cost. The language was accessible to all models.
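For readers unfamiliar with the threading macros mentioned above, they linearize nested calls into a readable pipeline. An illustrative example of the style (not taken from any model's output):

```clojure
(require '[clojure.string :as str])

;; ->> threads the result of each step in as the last argument of the next:
(->> (str/split "3 1 4 1 5" #"\s+")  ; tokenize
     (map parse-long)                 ; parse integers
     (filter odd?)                    ; keep odd values
     (reduce +))                      ; sum them
;; => 10
```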
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| Rust | 10/10 |
| Racket | 10/10 |
| Java | 9/10 |
| Clojure | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
Clojure slots in alongside Java — one ejection, strong overall. The Lisp syntax was no barrier. The only real friction was runtime: JVM startup latency penalized models that didn't plan their execution strategy. As with Racket, these are languages that LLMs clearly know — the training data coverage is sufficient for correct solutions even if the languages aren't mainstream.
Benchmarked on 2026-02-27 using pi as the agent harness.
This post was written with AI assistance.