Following up on the Haskell benchmark, I ran the same
orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time requiring solutions in
OCaml. The methodology is identical: each model gets an isolated directory, a puzzle
description, and must write its final answer to ANSWER.txt. Wrong answer or no answer = ejection.
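The pass/fail check amounts to reading a contestant's ANSWER.txt and comparing it to the expected answer. A minimal OCaml sketch of that logic (the function names and file layout here are my own illustration, not the actual orchestration harness):

```ocaml
(* Hypothetical sketch of the ejection check; the real harness
   is not shown in this post, so these names are illustrative. *)
let read_answer path =
  try
    let ic = open_in path in
    let line = try String.trim (input_line ic) with End_of_file -> "" in
    close_in ic;
    Some line
  with Sys_error _ -> None  (* no ANSWER.txt at all *)

(* A model survives a part only if the file exists and matches exactly. *)
let survives ~expected path =
  match read_answer path with
  | Some got -> got = expected
  | None -> false
```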
The contestants
Nine models from my enabled model list competed this run, plus one added retroactively:
| # | Model |
|---|---|
| 1 | anthropic/claude-opus-4-6 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | openai-codex/gpt-5.3-codex |
| 4 | zai/glm-5 |
| 5 | minimax/MiniMax-M2.5 |
| 6 | kimi-coding/k2p5 |
| 7 | mistral/devstral-2512 |
| 8 | alibaba/qwen3.5-plus |
| 9 | alibaba/qwen3-coder-next |
| 10 ★ | anthropic/claude-haiku-4-5 |
★ Added in a separate session after the main benchmark. Since inference is remote, the timings are directly comparable.
Ejections
All 5 casualties happened at Day 1 Part 2:
| Model | Ejected at |
|---|---|
| mistral/devstral-2512 | D1P2 |
| alibaba/qwen3.5-plus | D1P2 |
| alibaba/qwen3-coder-next | D1P2 |
| kimi-coding/k2p5 | D1P2 |
| minimax/MiniMax-M2.5 | D1P2 |
Notably, mistral/devstral-2512 was the fastest model on Day 1 Part 1 (19s) but
failed Part 2, the same pattern it showed in the Haskell run.
Results (Days 1–5)
The 4 original survivors, plus claude-haiku-4-5 (★), went on a perfect streak: all 10 parts correct.
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| mistral/devstral-2512 | 19s |
| alibaba/qwen3-coder-next | 20s |
| openai-codex/gpt-5.3-codex | 24s |
| kimi-coding/k2p5 | 24s |
| anthropic/claude-sonnet-4-6 | 26s |
| alibaba/qwen3.5-plus | 34s |
| zai/glm-5 | 35s |
| anthropic/claude-opus-4-6 | 50s |
| minimax/MiniMax-M2.5 | 59s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s | ✓ |
| anthropic/claude-sonnet-4-6 | 34s | ✓ |
| openai-codex/gpt-5.3-codex | 35s | ✓ |
| anthropic/claude-opus-4-6 | 36s | ✓ |
| zai/glm-5 | 62s | ✓ |
| alibaba/qwen3.5-plus | 38s | ✗ |
| alibaba/qwen3-coder-next | 40s | ✗ |
| mistral/devstral-2512 | 51s | ✗ |
| kimi-coding/k2p5 | 68s | ✗ |
| minimax/MiniMax-M2.5 | 248s | ✗ |
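For a sense of what tripped up half the field: assuming the dial is a 0–99 wheel and Part 2 counts every time the pointer passes or lands on 0 mid-rotation (my reading of the task title, not the official spec), a direct OCaml simulation looks like:

```ocaml
(* Assumed mechanics, not taken from the actual puzzle: a dial with
   positions 0..99; each move rotates by a signed amount; we count
   every time the pointer reaches 0 during the rotation. *)
let dial = 100

(* Rotate step by step from [pos] by [delta]; return final position
   and the number of zero-crossings along the way. *)
let crossings pos delta =
  let step = if delta >= 0 then 1 else -1 in
  let count = ref 0 and p = ref pos in
  for _ = 1 to abs delta do
    p := ((!p + step) mod dial + dial) mod dial;
    if !p = 0 then incr count
  done;
  (!p, !count)
```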
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 11s |
| openai-codex/gpt-5.3-codex | 26s |
| anthropic/claude-sonnet-4-6 | 54s |
| anthropic/claude-opus-4-6 | 56s |
| zai/glm-5 | 110s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 14s |
| openai-codex/gpt-5.3-codex | 20s |
| anthropic/claude-opus-4-6 | 39s |
| anthropic/claude-sonnet-4-6 | 154s |
| zai/glm-5 | 236s |
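Assuming Part 2 generalizes Part 1 from doubled digit blocks to any repeat count (an inference from the task titles, not the puzzle text), the core predicate might look like this in OCaml:

```ocaml
(* Hypothetical reading of the task: an ID qualifies if its decimal
   representation is one block of digits repeated two or more times,
   e.g. 1212 = "12" x 2, 777 = "7" x 3. *)
let is_repeated_pattern n =
  let s = string_of_int n in
  let len = String.length s in
  let repeats_with block =
    let b = String.length block in
    len mod b = 0 && len / b >= 2
    && (let ok = ref true in
        for i = 0 to len - 1 do
          if s.[i] <> block.[i mod b] then ok := false
        done;
        !ok)
  in
  (* try every block length up to half the string *)
  let rec try_len b =
    b <= len / 2 && (repeats_with (String.sub s 0 b) || try_len (b + 1))
  in
  try_len 1
```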
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 24s |
| anthropic/claude-sonnet-4-6 | 30s |
| anthropic/claude-opus-4-6 | 37s |
| zai/glm-5 | 162s |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 24s |
| anthropic/claude-opus-4-6 | 28s |
| zai/glm-5 | 161s |
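If "maximizing 12-digit joltage" means choosing 12 digits in order from a bank to form the largest possible number (a guess at the mechanic from the title alone), the standard greedy with a drop budget does it in one pass:

```ocaml
(* Assumed task: from digit string [s], keep [k] digits in their
   original order so the resulting number is maximal. Greedy: drop a
   smaller earlier digit whenever a bigger one arrives and drops remain. *)
let max_subsequence s k =
  let n = String.length s in
  let buf = Buffer.create k in
  let drops = ref (n - k) in
  String.iter (fun c ->
    while !drops > 0 && Buffer.length buf > 0
          && Buffer.nth buf (Buffer.length buf - 1) < c do
      Buffer.truncate buf (Buffer.length buf - 1);
      decr drops
    done;
    Buffer.add_char buf c) s;
  String.sub (Buffer.contents buf) 0 k
```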
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s |
| anthropic/claude-sonnet-4-6 | 26s |
| openai-codex/gpt-5.3-codex | 26s |
| anthropic/claude-opus-4-6 | 31s |
| zai/glm-5 | 34s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 11s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-opus-4-6 | 23s |
| anthropic/claude-sonnet-4-6 | 28s |
| zai/glm-5 | 37s |
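The Part 2 simulation, as I read the title, repeats Part 1's accessibility rule until the grid stabilizes. A sketch under assumed mechanics (the removal threshold and 8-neighbourhood are guesses, not the puzzle's actual rule):

```ocaml
(* Hypothetical mechanics: each pass simultaneously removes every
   occupied cell with fewer than four occupied 8-neighbours; repeat
   until nothing changes; count total removals. *)
let simulate grid =
  let h = Array.length grid and w = Array.length grid.(0) in
  let occupied_neighbours y x =
    let c = ref 0 in
    for dy = -1 to 1 do
      for dx = -1 to 1 do
        if (dy, dx) <> (0, 0) then begin
          let ny = y + dy and nx = x + dx in
          if ny >= 0 && ny < h && nx >= 0 && nx < w && grid.(ny).(nx)
          then incr c
        end
      done
    done;
    !c
  in
  let removed = ref 0 and changed = ref true in
  while !changed do
    changed := false;
    (* collect first, then apply, so each pass is simultaneous *)
    let batch = ref [] in
    for y = 0 to h - 1 do
      for x = 0 to w - 1 do
        if grid.(y).(x) && occupied_neighbours y x < 4 then
          batch := (y, x) :: !batch
      done
    done;
    List.iter
      (fun (y, x) -> grid.(y).(x) <- false; incr removed; changed := true)
      !batch
  done;
  !removed
```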
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 12s |
| anthropic/claude-sonnet-4-6 | 29s |
| openai-codex/gpt-5.3-codex | 32s |
| anthropic/claude-opus-4-6 | 33s |
| zai/glm-5 | 108s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 23s |
| anthropic/claude-opus-4-6 | 24s |
| zai/glm-5 | 82s |
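Part 2 is presumably a classic interval-union count: sort the ranges, merge overlaps, sum the merged lengths. A sketch, assuming inclusive integer endpoints (my assumption, not the puzzle's stated format):

```ocaml
(* Count distinct IDs covered by a list of inclusive (lo, hi) ranges:
   sort by start, merge overlapping or adjacent ranges, sum lengths. *)
let count_covered ranges =
  let sorted = List.sort compare ranges in
  let rec go acc cur = function
    | [] ->
      (match cur with None -> acc | Some (lo, hi) -> acc + hi - lo + 1)
    | (lo, hi) :: rest ->
      (match cur with
       | None -> go acc (Some (lo, hi)) rest
       | Some (clo, chi) ->
         if lo <= chi + 1 then go acc (Some (clo, max chi hi)) rest
         else go (acc + chi - clo + 1) (Some (lo, hi)) rest)
  in
  go 0 None sorted
```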
Summary table
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 |
|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 ★ | 13s | 12s | 11s | 14s | 13s | 13s | 12s | 11s | 12s | 13s |
| openai-codex/gpt-5.3-codex | 24s | 35s | 26s | 20s | 24s | 18s | 26s | 18s | 32s | 18s |
| anthropic/claude-sonnet-4-6 | 26s | 34s | 54s | 154s | 30s | 24s | 26s | 28s | 29s | 23s |
| anthropic/claude-opus-4-6 | 50s | 36s | 56s | 39s | 37s | 28s | 31s | 23s | 33s | 24s |
| zai/glm-5 | 35s | 62s | 110s | 236s | 162s | 161s | 34s | 37s | 108s | 82s |
| — ejected at Day 1 Part 2 — | ||||||||||
| mistral/devstral-2512 | 19s | ✗ | ||||||||
| alibaba/qwen3.5-plus | 34s | ✗ | ||||||||
| alibaba/qwen3-coder-next | 20s | ✗ | ||||||||
| kimi-coding/k2p5 | 24s | ✗ | ||||||||
| minimax/MiniMax-M2.5 | 59s | ✗ | ||||||||
Observations
- OCaml is harder: 5 of 9 models failed Part 2 of Day 1, vs. 4 of 11 in the Haskell run. The survivors were exactly the same top-tier models: the two Anthropic models, GPT-5.3-Codex, and GLM-5.
- gpt-5.3-codex led among the original 9 models: fastest on 7 of the 10 parts, with a consistent 18–35s range.
- claude-haiku-4-5 (★ retroactive): fastest on all 10 parts, never exceeding 14s.
- glm-5: always correct, almost always last. Its times were often 3–6× those of the leaders, particularly in Days 2 and 3 (110–236s). Later benchmarks with token tracking showed glm-5 uses roughly 2–3× more output tokens per part.
- claude-sonnet-4-6: the 154s on D2P2 stands out against its otherwise 23–54s range. A single run doesn't tell the full story; averaging over multiple runs would give a cleaner signal.
- mistral/devstral-2512: repeated its Haskell-run pattern of being fast on Part 1 (19s), then ejected on a wrong Part 2 answer.
What's next
Token usage and API cost tracking are now part of the benchmark. Future runs will report output token counts and per-part cost alongside wall-clock times — giving a clearer picture of solution complexity and value-for-money across models.
Benchmarked on 2026-02-25 using pi as the agent harness.
This post was written with AI assistance.