Developers, developers, developers!

Blog about programming, programming and, ah more programming!

Benchmarking LLMs on Advent of Code 2025 (OCaml)

Tags = [ OCaml, AI, Advent of Code ]

Following up on the Haskell benchmark, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time requiring solutions in OCaml. The methodology is identical: each model gets an isolated directory, a puzzle description, and must write its final answer to ANSWER.txt. Wrong answer or no answer = ejection.

The contestants

Nine models from my enabled model list entered this run, plus one added retroactively:

#   Model
1   anthropic/claude-opus-4-6
2   anthropic/claude-sonnet-4-6
3   openai-codex/gpt-5.3-codex
4   zai/glm-5
5   minimax/MiniMax-M2.5
6   kimi-coding/k2p5
7   mistral/devstral-2512
8   alibaba/qwen3.5-plus
9   alibaba/qwen3-coder-next
10  ★ anthropic/claude-haiku-4-5

★ Added in a separate session after the main benchmark. Since inference is remote, the timings are directly comparable.

Ejections

All 5 casualties happened at Day 1 Part 2:

Model                       Ejected at
mistral/devstral-2512       D1P2
alibaba/qwen3.5-plus        D1P2
alibaba/qwen3-coder-next    D1P2
kimi-coding/k2p5            D1P2
minimax/MiniMax-M2.5        D1P2

Notably, mistral/devstral-2512 was the fastest model on Day 1 Part 1 (19s) but failed Part 2. Same thing happened in the Haskell run.

Results (Days 1–5)

The 4 original survivors, plus claude-haiku-4-5 (★), went on a perfect streak: all 10 parts correct.


Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
anthropic/claude-haiku-4-5 ★   13s
mistral/devstral-2512          19s
alibaba/qwen3-coder-next       20s
openai-codex/gpt-5.3-codex     24s
kimi-coding/k2p5               24s
anthropic/claude-sonnet-4-6    26s
alibaba/qwen3.5-plus           34s
zai/glm-5                      35s
anthropic/claude-opus-4-6      50s
minimax/MiniMax-M2.5           59s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time
anthropic/claude-haiku-4-5 ★   12s
anthropic/claude-sonnet-4-6    34s
openai-codex/gpt-5.3-codex     35s
anthropic/claude-opus-4-6      36s
zai/glm-5                      62s
— ejected (wrong answer) —
mistral/devstral-2512          51s
alibaba/qwen3.5-plus           38s
alibaba/qwen3-coder-next       40s
kimi-coding/k2p5               68s
minimax/MiniMax-M2.5           248s
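
The puzzle text isn't reproduced in this post, so the exact semantics are a guess from the title, but the core of a zero-crossing counter in OCaml might look like this sketch. The assumptions (a circular dial with positions 0 to size-1, a starting position, a list of signed rotation deltas, and "crossing" meaning the pointer lands on position 0) are mine, not the puzzle's:

```ocaml
(* Hypothetical sketch: the real input format and rules may differ.
   Step the dial one position at a time and count every landing on 0. *)
let count_zero_crossings ~size ~start deltas =
  let crossings = ref 0 in
  let pos = ref start in
  List.iter
    (fun delta ->
      let dir = if delta >= 0 then 1 else -1 in
      for _ = 1 to abs delta do
        (* double mod keeps the position in 0 .. size-1 for negative steps *)
        pos := ((!pos + dir) mod size + size) mod size;
        if !pos = 0 then incr crossings
      done)
    deltas;
  !crossings
```

Stepping one position at a time is O(total rotation); with modular arithmetic you could do each delta in O(1), but the simulation is the easier thing to get right under time pressure.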



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
anthropic/claude-haiku-4-5 ★   11s
openai-codex/gpt-5.3-codex     26s
anthropic/claude-sonnet-4-6    54s
anthropic/claude-opus-4-6      56s
zai/glm-5                      110s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time
anthropic/claude-haiku-4-5 ★   14s
openai-codex/gpt-5.3-codex     20s
anthropic/claude-opus-4-6      39s
anthropic/claude-sonnet-4-6    154s
zai/glm-5                      236s
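
Reading the title as "an ID qualifies when its decimal representation is some shorter pattern repeated two or more times" (e.g. 1212 = "12" twice, 777 = "7" three times), the core check is compact in OCaml. That reading is an assumption; the actual puzzle rule may differ:

```ocaml
(* Hypothetical check: does the ID's digit string consist of a pattern of
   length p repeated n/p >= 2 times, for some valid period p? *)
let is_repeated_pattern id =
  let s = string_of_int id in
  let n = String.length s in
  let period_works p =
    n mod p = 0
    && (let ok = ref true in
        for i = p to n - 1 do
          if s.[i] <> s.[i - p] then ok := false
        done;
        !ok)
  in
  (* a period longer than n/2 can't repeat at least twice *)
  let rec any p = p <= n / 2 && (period_works p || any (p + 1)) in
  any 1
```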



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time
anthropic/claude-haiku-4-5 ★   13s
openai-codex/gpt-5.3-codex     24s
anthropic/claude-sonnet-4-6    30s
anthropic/claude-opus-4-6      37s
zai/glm-5                      162s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
anthropic/claude-haiku-4-5 ★   13s
openai-codex/gpt-5.3-codex     18s
anthropic/claude-sonnet-4-6    24s
anthropic/claude-opus-4-6      28s
zai/glm-5                      161s
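
The Day 3 titles suggest the classic "pick k digits from a sequence, keeping their order, to form the largest k-digit number" problem, with k = 2 in Part 1 and k = 12 in Part 2. If that reading is right (it's a guess from the titles alone), the standard O(n) greedy uses a monotonic stack; a sketch:

```ocaml
(* Assumed formulation: from a digit string, keep k digits in order so the
   resulting number is maximal. Pop a smaller stacked digit whenever a
   larger one arrives, but only while enough digits remain to fill k slots. *)
let max_joltage digits k =
  let n = String.length digits in
  let stack = Bytes.create k in
  let top = ref 0 in  (* number of digits currently kept *)
  String.iteri
    (fun i c ->
      let remaining = n - i in
      while !top > 0
            && Bytes.get stack (!top - 1) < c
            && !top - 1 + remaining >= k do
        decr top
      done;
      if !top < k then begin
        Bytes.set stack !top c;
        incr top
      end)
    digits;
  Bytes.to_string stack
```

The `!top - 1 + remaining >= k` guard is what makes the greedy safe: it never discards a digit unless the slots it vacates can still be filled from what's left.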



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
anthropic/claude-haiku-4-5 ★   12s
anthropic/claude-sonnet-4-6    26s
openai-codex/gpt-5.3-codex     26s
anthropic/claude-opus-4-6      31s
zai/glm-5                      34s



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
anthropic/claude-haiku-4-5 ★   11s
openai-codex/gpt-5.3-codex     18s
anthropic/claude-opus-4-6      23s
anthropic/claude-sonnet-4-6    28s
zai/glm-5                      37s
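
Part 2's "iterative grid removal" is a fixpoint simulation: remove every currently-removable cell, repeat until nothing changes. The removal rule below (a roll is accessible when fewer than 4 of its 8 neighbours are occupied) is purely illustrative, since the post doesn't quote the puzzle; the loop structure is the point:

```ocaml
(* Hypothetical sketch; the real accessibility rule may differ.
   Mutates the grid, returns how many cells were removed in total. *)
let simulate grid =
  let h = Array.length grid and w = Array.length grid.(0) in
  let occupied_neighbours y x =
    let c = ref 0 in
    for dy = -1 to 1 do
      for dx = -1 to 1 do
        let ny = y + dy and nx = x + dx in
        if (dy, dx) <> (0, 0) && ny >= 0 && ny < h && nx >= 0 && nx < w
           && grid.(ny).(nx)
        then incr c
      done
    done;
    !c
  in
  let removed = ref 0 and changed = ref true in
  while !changed do
    changed := false;
    (* collect first, then remove, so each pass sees a consistent snapshot *)
    let batch = ref [] in
    for y = 0 to h - 1 do
      for x = 0 to w - 1 do
        if grid.(y).(x) && occupied_neighbours y x < 4 then
          batch := (y, x) :: !batch
      done
    done;
    List.iter
      (fun (y, x) ->
        grid.(y).(x) <- false;
        incr removed;
        changed := true)
      !batch
  done;
  !removed
```

The collect-then-remove split matters: removing cells mid-scan would let earlier removals change the neighbour counts seen later in the same pass.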



Day 5 Part 1 — Range membership checking

Model                          Time
anthropic/claude-haiku-4-5 ★   12s
anthropic/claude-sonnet-4-6    29s
openai-codex/gpt-5.3-codex     32s
anthropic/claude-opus-4-6      33s
zai/glm-5                      108s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time
anthropic/claude-haiku-4-5 ★   13s
openai-codex/gpt-5.3-codex     18s
anthropic/claude-sonnet-4-6    23s
anthropic/claude-opus-4-6      24s
zai/glm-5                      82s
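
Counting distinct IDs covered by overlapping ranges is a textbook sort-and-merge. Assuming inclusive integer ranges represented as (lo, hi) pairs (the post doesn't give the input format), a sketch in OCaml:

```ocaml
(* Sort by start, merge overlapping or adjacent ranges, sum the lengths. *)
let count_covered ranges =
  let sorted = List.sort compare ranges in
  let rec go acc cur = function
    | [] ->
      (match cur with
       | None -> acc
       | Some (lo, hi) -> acc + hi - lo + 1)
    | (lo, hi) :: rest ->
      (match cur with
       | None -> go acc (Some (lo, hi)) rest
       | Some (clo, chi) ->
         if lo <= chi + 1 then
           (* overlaps or touches the current merged range: extend it *)
           go acc (Some (clo, max chi hi)) rest
         else
           (* disjoint: bank the finished range, start a new one *)
           go (acc + chi - clo + 1) (Some (lo, hi)) rest)
  in
  go 0 None sorted
```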

Summary table

Model                          D1P1  D1P2  D2P1  D2P2  D3P1  D3P2  D4P1  D4P2  D5P1  D5P2
anthropic/claude-haiku-4-5 ★   13s   12s   11s   14s   13s   13s   12s   11s   12s   13s
openai-codex/gpt-5.3-codex     24s   35s   26s   20s   24s   18s   26s   18s   32s   18s
anthropic/claude-sonnet-4-6    26s   34s   54s   154s  30s   24s   26s   28s   29s   23s
anthropic/claude-opus-4-6      50s   36s   56s   39s   37s   28s   31s   23s   33s   24s
zai/glm-5                      35s   62s   110s  236s  162s  161s  34s   37s   108s  82s
— ejected at Day 1 Part 2 —
mistral/devstral-2512          19s
alibaba/qwen3.5-plus           34s
alibaba/qwen3-coder-next       20s
kimi-coding/k2p5               24s
minimax/MiniMax-M2.5           59s

Observations

OCaml is harder — 5 of 9 models failed Part 2 of Day 1, vs. 4 of 11 in the Haskell run. The survivors were exactly the same top-tier models: the two Anthropic models, GPT-5.3-Codex, and GLM-5.

gpt-5.3-codex led among the original 9 models — fastest on 7 of the 10 parts, with a consistent 18–35s range.

claude-haiku-4-5 (★ retroactive) — fastest on all 10 parts, never exceeding 14s.

glm-5 — always correct, almost always last. Its times were often 3–6× those of the leaders, particularly in Days 2 and 3 (110–236s). Later benchmarks with token tracking showed glm-5 uses roughly 2–3× more output tokens per part.

claude-sonnet-4-6 — 154s on D2P2 stands out against its otherwise 23–54s range. A single run doesn't tell the full story — averaging over multiple runs would give a cleaner signal.

mistral/devstral-2512 — same pattern as in the Haskell run: fast on Part 1 (19s), wrong answer on Part 2, ejected.

What's next

Token usage and API cost tracking are now part of the benchmark. Future runs will report output token counts and per-part cost alongside wall-clock times — giving a clearer picture of solution complexity and value-for-money across models.

Benchmarked on 2026-02-25 using pi as the agent harness.


This post was written with AI assistance.