Following up on the Haskell benchmark, I ran the same
orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time requiring solutions in
OCaml. The methodology is identical: each model gets an isolated directory, a puzzle
description, and must write its final answer to ANSWER.txt. Wrong answer or no answer = ejection.
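The pass/fail check amounts to reading a contestant's ANSWER.txt and comparing it to the expected answer. A minimal OCaml sketch of that logic (the function names and file layout here are my own illustration, not the actual orchestration harness):

```ocaml
(* Hypothetical sketch of the ejection check; the real harness
   is not shown in this post, so these names are illustrative. *)
let read_answer path =
  try
    let ic = open_in path in
    let line = try String.trim (input_line ic) with End_of_file -> "" in
    close_in ic;
    Some line
  with Sys_error _ -> None  (* no ANSWER.txt at all *)

(* A model survives a part only if the file exists and matches exactly. *)
let survives ~expected path =
  match read_answer path with
  | Some got -> got = expected
  | None -> false
```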
The contestants
Nine models from my enabled model list competed this run, plus one added retroactively:
| # | Model |
|---|---|
| 1 | anthropic/claude-opus-4-6 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | openai-codex/gpt-5.3-codex |
| 4 | zai/glm-5 |
| 5 | minimax/MiniMax-M2.5 |
| 6 | kimi-coding/k2p5 |
| 7 | mistral/devstral-2512 |
| 8 | alibaba/qwen3.5-plus |
| 9 | alibaba/qwen3-coder-next |
| 10 ★ | anthropic/claude-haiku-4-5 |
★ Added in a separate session after the main benchmark. Since inference is remote, the timings are directly comparable.
Ejections
All 5 casualties happened at Day 1 Part 2:
| Model | Ejected at |
|---|---|
| mistral/devstral-2512 | D1P2 |
| alibaba/qwen3.5-plus | D1P2 |
| alibaba/qwen3-coder-next | D1P2 |
| kimi-coding/k2p5 | D1P2 |
| minimax/MiniMax-M2.5 | D1P2 |
Notably, mistral/devstral-2512 was the fastest model on Day 1 Part 1 (19s) but
failed Part 2, the same pattern it showed in the Haskell run.
Results (Days 1–5)
The 4 original survivors, plus claude-haiku-4-5 (★), went on a perfect streak: all 10 parts correct.
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| mistral/devstral-2512 | 19s |
| alibaba/qwen3-coder-next | 20s |
| openai-codex/gpt-5.3-codex | 24s |
| kimi-coding/k2p5 | 24s |
| anthropic/claude-sonnet-4-6 | 26s |
| alibaba/qwen3.5-plus | 34s |
| zai/glm-5 | 35s |
| anthropic/claude-opus-4-6 | 50s |
| minimax/MiniMax-M2.5 | 59s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s | ✓ |
| anthropic/claude-sonnet-4-6 | 34s | ✓ |
| openai-codex/gpt-5.3-codex | 35s | ✓ |
| anthropic/claude-opus-4-6 | 36s | ✓ |
| zai/glm-5 | 62s | ✓ |
| alibaba/qwen3.5-plus | 38s | ✗ |
| alibaba/qwen3-coder-next | 40s | ✗ |
| mistral/devstral-2512 | 51s | ✗ |
| kimi-coding/k2p5 | 68s | ✗ |
| minimax/MiniMax-M2.5 | 248s | ✗ |
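For a sense of what tripped up half the field: assuming the dial is a 0–99 wheel and Part 2 counts every time the pointer passes or lands on 0 mid-rotation (my reading of the task title, not the official spec), a direct OCaml simulation looks like:

```ocaml
(* Assumed mechanics, not taken from the actual puzzle: a dial with
   positions 0..99; each move rotates by a signed amount; we count
   every time the pointer reaches 0 during the rotation. *)
let dial = 100

(* Rotate step by step from [pos] by [delta]; return final position
   and the number of zero-crossings along the way. *)
let crossings pos delta =
  let step = if delta >= 0 then 1 else -1 in
  let count = ref 0 and p = ref pos in
  for _ = 1 to abs delta do
    p := ((!p + step) mod dial + dial) mod dial;
    if !p = 0 then incr count
  done;
  (!p, !count)
```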
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 11s |
| openai-codex/gpt-5.3-codex | 26s |
| anthropic/claude-sonnet-4-6 | 54s |
| anthropic/claude-opus-4-6 | 56s |
| zai/glm-5 | 110s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 14s |
| openai-codex/gpt-5.3-codex | 20s |
| anthropic/claude-opus-4-6 | 39s |
| anthropic/claude-sonnet-4-6 | 154s |
| zai/glm-5 | 236s |
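Assuming Part 2 generalizes Part 1 from doubled digit blocks to any repeat count (an inference from the task titles, not the puzzle text), the core predicate might look like this in OCaml:

```ocaml
(* Hypothetical reading of the task: an ID qualifies if its decimal
   representation is one block of digits repeated two or more times,
   e.g. 1212 = "12" x 2, 777 = "7" x 3. *)
let is_repeated_pattern n =
  let s = string_of_int n in
  let len = String.length s in
  let repeats_with block =
    let b = String.length block in
    len mod b = 0 && len / b >= 2
    && (let ok = ref true in
        for i = 0 to len - 1 do
          if s.[i] <> block.[i mod b] then ok := false
        done;
        !ok)
  in
  (* try every block length up to half the string *)
  let rec try_len b =
    b <= len / 2 && (repeats_with (String.sub s 0 b) || try_len (b + 1))
  in
  try_len 1
```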
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 24s |
| anthropic/claude-sonnet-4-6 | 30s |
| anthropic/claude-opus-4-6 | 37s |
| zai/glm-5 | 162s |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 24s |
| anthropic/claude-opus-4-6 | 28s |
| zai/glm-5 | 161s |
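If "maximizing 12-digit joltage" means choosing 12 digits in order from a bank to form the largest possible number (a guess at the mechanic from the title alone), the standard greedy with a drop budget does it in one pass:

```ocaml
(* Assumed task: from digit string [s], keep [k] digits in their
   original order so the resulting number is maximal. Greedy: drop a
   smaller earlier digit whenever a bigger one arrives and drops remain. *)
let max_subsequence s k =
  let n = String.length s in
  let buf = Buffer.create k in
  let drops = ref (n - k) in
  String.iter (fun c ->
    while !drops > 0 && Buffer.length buf > 0
          && Buffer.nth buf (Buffer.length buf - 1) < c do
      Buffer.truncate buf (Buffer.length buf - 1);
      decr drops
    done;
    Buffer.add_char buf c) s;
  String.sub (Buffer.contents buf) 0 k
```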
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s |
| anthropic/claude-sonnet-4-6 | 26s |
| openai-codex/gpt-5.3-codex | 26s |
| anthropic/claude-opus-4-6 | 31s |
| zai/glm-5 | 34s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 11s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-opus-4-6 | 23s |
| anthropic/claude-sonnet-4-6 | 28s |
| zai/glm-5 | 37s |
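The Part 2 simulation, as I read the title, repeats Part 1's accessibility rule until the grid stabilizes. A sketch under assumed mechanics (the removal threshold and 8-neighbourhood are guesses, not the puzzle's actual rule):

```ocaml
(* Hypothetical mechanics: each pass simultaneously removes every
   occupied cell with fewer than four occupied 8-neighbours; repeat
   until nothing changes; count total removals. *)
let simulate grid =
  let h = Array.length grid and w = Array.length grid.(0) in
  let occupied_neighbours y x =
    let c = ref 0 in
    for dy = -1 to 1 do
      for dx = -1 to 1 do
        if (dy, dx) <> (0, 0) then begin
          let ny = y + dy and nx = x + dx in
          if ny >= 0 && ny < h && nx >= 0 && nx < w && grid.(ny).(nx)
          then incr c
        end
      done
    done;
    !c
  in
  let removed = ref 0 and changed = ref true in
  while !changed do
    changed := false;
    (* collect first, then apply, so each pass is simultaneous *)
    let batch = ref [] in
    for y = 0 to h - 1 do
      for x = 0 to w - 1 do
        if grid.(y).(x) && occupied_neighbours y x < 4 then
          batch := (y, x) :: !batch
      done
    done;
    List.iter
      (fun (y, x) -> grid.(y).(x) <- false; incr removed; changed := true)
      !batch
  done;
  !removed
```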
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s |
| anthropic/claude-sonnet-4-6 | 29s |
| openai-codex/gpt-5.3-codex | 32s |
| anthropic/claude-opus-4-6 | 33s |
| zai/glm-5 | 108s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 23s |
| anthropic/claude-opus-4-6 | 24s |
| zai/glm-5 | 82s |
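Part 2 is presumably a classic interval-union count: sort the ranges, merge overlaps, sum the merged lengths. A sketch, assuming inclusive integer endpoints (my assumption, not the puzzle's stated format):

```ocaml
(* Count distinct IDs covered by a list of inclusive (lo, hi) ranges:
   sort by start, merge overlapping or adjacent ranges, sum lengths. *)
let count_covered ranges =
  let sorted = List.sort compare ranges in
  let rec go acc cur = function
    | [] ->
      (match cur with None -> acc | Some (lo, hi) -> acc + hi - lo + 1)
    | (lo, hi) :: rest ->
      (match cur with
       | None -> go acc (Some (lo, hi)) rest
       | Some (clo, chi) ->
         if lo <= chi + 1 then go acc (Some (clo, max chi hi)) rest
         else go (acc + chi - clo + 1) (Some (lo, hi)) rest)
  in
  go 0 None sorted
```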
Summary table
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 |
|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s | 12s | 11s | 14s | 13s | 13s | 12s | 11s | 12s | 13s |
| openai-codex/gpt-5.3-codex | 24s | 35s | 26s | 20s | 24s | 18s | 26s | 18s | 32s | 18s |
| anthropic/claude-sonnet-4-6 | 26s | 34s | 54s | 154s | 30s | 24s | 26s | 28s | 29s | 23s |
| anthropic/claude-opus-4-6 | 50s | 36s | 56s | 39s | 37s | 28s | 31s | 23s | 33s | 24s |
| zai/glm-5 | 35s | 62s | 110s | 236s | 162s | 161s | 34s | 37s | 108s | 82s |
| — ejected at Day 1 Part 2 — | ||||||||||
| mistral/devstral-2512 | 19s | ✗ | ||||||||
| alibaba/qwen3.5-plus | 34s | ✗ | ||||||||
| alibaba/qwen3-coder-next | 20s | ✗ | ||||||||
| kimi-coding/k2p5 | 24s | ✗ | ||||||||
| minimax/MiniMax-M2.5 | 59s | ✗ | ||||||||
Observations
- OCaml is harder: 5 of 9 models failed Part 2 of Day 1, vs. 4 of 11 in the Haskell run. The survivors were exactly the same top-tier models: the two Anthropic models, GPT-5.3-Codex, and GLM-5.
- gpt-5.3-codex led among the original 9 models: fastest on 7 of the 10 parts, with a consistent 18–35s range.
- claude-haiku-4-5 (★ retroactive): fastest on all 10 parts, never exceeding 14s.
- glm-5: always correct, almost always last. Its times were often 3–6× those of the leaders, particularly in Days 2 and 3 (110–236s). Later benchmarks with token tracking showed glm-5 uses roughly 2–3× more output tokens per part.
- claude-sonnet-4-6: the 154s on D2P2 stands out against its otherwise 23–54s range. A single run doesn't tell the full story; averaging over multiple runs would give a cleaner signal.
- mistral/devstral-2512: repeated its Haskell-run pattern of being fast on Part 1 (19s), then ejected on a wrong Part 2 answer.
What's next
Token usage and API cost tracking are now part of the benchmark. Future runs will report output token counts and per-part cost alongside wall-clock times — giving a clearer picture of solution complexity and value-for-money across models.
Benchmarked on 2026-02-25 using pi as the agent harness.
This post was written with AI assistance.