I benchmarked 11 LLMs on Advent of Code 2025 Days 1–5, each solving independently in Haskell. The goal: see which models can reliably produce correct, working solutions — and how fast.
## How it works
I built a custom orchestration prompt for pi, a CLI coding agent. The prompt acts as a benchmark controller: it launches one pi agent per model in separate tmux windows, feeds them the puzzle description, waits for them to finish, collects their answers, scores them for correctness and time-to-solution, and maintains a leaderboard. Models that fail a puzzle get ejected from the competition.
Each agent works in complete isolation — its own directory, no shared state, no awareness of the others. They all get the same puzzle description and input files. The orchestrator never solves anything itself; it just dispatches, collects, and scores.
## The contestants
All 11 models came from my enabled model list:
| # | Model |
|---|---|
| 1 | anthropic/claude-opus-4-6 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | openai-codex/gpt-5.3-codex |
| 4 | zai/glm-5 |
| 5 | minimax/MiniMax-M2.5 |
| 6 | kimi-coding/k2p5 |
| 7 | mistral/devstral-2512 |
| 8 | alibaba/qwen3.5-plus |
| 9 | alibaba/qwen3-max-2026-01-23 |
| 10 | alibaba/qwen3-coder-next |
| 11 | alibaba/qwen3-coder-plus |
## Ejections
Models that failed a puzzle were put up for ejection (the orchestrator asks; I confirm). Four models didn't survive past Day 1:
| Model | Ejected at | Reason |
|---|---|---|
| alibaba/qwen3-coder-plus | D1P1 | No answer (never wrote ANSWER.txt) |
| mistral/devstral-2512 | D1P2 | Wrong answer |
| alibaba/qwen3-coder-next | D1P2 | Wrong answer |
| alibaba/qwen3-max-2026-01-23 | D1P2 | No answer (got the right answer but stopped before writing the file!) |
The qwen3-max case was particularly painful — the model computed the correct answer, said "Now I'll write the answer to ANSWER.txt:" and then... just stopped generating. The answer was right there. 😩
## Results (Days 1–5)
The surviving 7 models went on a perfect streak through Days 1–5, all producing correct answers for every part. Here's the full timing breakdown:
### Per-task leaderboards
#### Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| mistral/devstral-2512 | 33s |
| anthropic/claude-opus-4-6 | 40s |
| anthropic/claude-sonnet-4-6 | 41s |
| kimi-coding/k2p5 | 42s |
| alibaba/qwen3.5-plus | 45s |
| openai-codex/gpt-5.3-codex | 50s |
| zai/glm-5 | 70s |
| alibaba/qwen3-coder-next | 74s |
| alibaba/qwen3-max-2026-01-23 | 78s |
| minimax/MiniMax-M2.5 | 111s |
#### Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time |
|---|---|
| anthropic/claude-opus-4-6 | 32s |
| anthropic/claude-sonnet-4-6 | 39s |
| openai-codex/gpt-5.3-codex | 53s |
| kimi-coding/k2p5 | 70s |
| alibaba/qwen3.5-plus | 85s |
| zai/glm-5 | 130s |
| minimax/MiniMax-M2.5 | 972s |
#### Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 17s |
| kimi-coding/k2p5 | 41s |
| alibaba/qwen3.5-plus | 45s |
| anthropic/claude-sonnet-4-6 | 53s |
| anthropic/claude-opus-4-6 | 54s |
| zai/glm-5 | 72s |
| minimax/MiniMax-M2.5 | 113s |
#### Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 22s |
| kimi-coding/k2p5 | 22s |
| alibaba/qwen3.5-plus | 25s |
| anthropic/claude-sonnet-4-6 | 31s |
| minimax/MiniMax-M2.5 | 38s |
| anthropic/claude-opus-4-6 | 41s |
| zai/glm-5 | 63s |
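I can only go by the task title here, but "repeated-pattern IDs (any repeat count)" suggests testing whether an ID's digit string is some shorter pattern repeated two or more times. The benchmark solutions were in Haskell; as a language-neutral illustration, here is a Python sketch of that generic check using the classic string-rotation trick (the function names and the range-summing helper are my own, not from any contestant's solution):

```python
def is_repeated_pattern(s: str) -> bool:
    """True if s consists of a shorter substring repeated >= 2 times.

    Classic trick: s is a repetition of a smaller pattern exactly when
    s occurs inside s + s at an index other than 0 or len(s); slicing
    off the first and last character of s + s removes those two matches.
    """
    return s in (s + s)[1:-1]

# Hypothetical usage: sum the qualifying IDs in an inclusive range.
def sum_repeated(lo: int, hi: int) -> int:
    return sum(n for n in range(lo, hi + 1) if is_repeated_pattern(str(n)))
```

For two-digit IDs this reduces to multiples of 11 ("11", "22", ...), which is a handy sanity check.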
#### Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time |
|---|---|
| anthropic/claude-sonnet-4-6 | 28s |
| kimi-coding/k2p5 | 30s |
| anthropic/claude-opus-4-6 | 32s |
| zai/glm-5 | 39s |
| openai-codex/gpt-5.3-codex | 41s |
| alibaba/qwen3.5-plus | 42s |
| minimax/MiniMax-M2.5 | 1078s |
#### Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 19s |
| alibaba/qwen3.5-plus | 24s |
| anthropic/claude-sonnet-4-6 | 27s |
| anthropic/claude-opus-4-6 | 30s |
| zai/glm-5 | 37s |
| kimi-coding/k2p5 | 71s |
| minimax/MiniMax-M2.5 | 473s |
#### Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 27s |
| alibaba/qwen3.5-plus | 27s |
| anthropic/claude-sonnet-4-6 | 28s |
| anthropic/claude-opus-4-6 | 33s |
| kimi-coding/k2p5 | 36s |
| zai/glm-5 | 70s |
| minimax/MiniMax-M2.5 | 707s |
#### Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 23s |
| anthropic/claude-sonnet-4-6 | 24s |
| alibaba/qwen3.5-plus | 24s |
| anthropic/claude-opus-4-6 | 28s |
| kimi-coding/k2p5 | 37s |
| zai/glm-5 | 39s |
| minimax/MiniMax-M2.5 | 166s |
#### Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 27s |
| anthropic/claude-sonnet-4-6 | 28s |
| alibaba/qwen3.5-plus | 28s |
| anthropic/claude-opus-4-6 | 32s |
| kimi-coding/k2p5 | 33s |
| zai/glm-5 | 72s |
| minimax/MiniMax-M2.5 | 140s |
#### Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 20s |
| kimi-coding/k2p5 | 20s |
| anthropic/claude-sonnet-4-6 | 23s |
| anthropic/claude-opus-4-6 | 24s |
| alibaba/qwen3.5-plus | 28s |
| minimax/MiniMax-M2.5 | 44s |
| zai/glm-5 | 54s |
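Again going only by the title: counting "total fresh IDs from overlapping ranges" sounds like the classic problem of counting distinct integers covered by a union of inclusive ranges, which a sort-and-merge sweep handles in O(n log n). A Python sketch of that approach (the function name and the tuple representation of ranges are my assumptions):

```python
def count_covered(ranges: list[tuple[int, int]]) -> int:
    """Count distinct integers covered by a union of inclusive [lo, hi] ranges."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(ranges):
        if cur_hi is None or lo > cur_hi + 1:
            # Disjoint from the current merged run: flush it, start a new one.
            if cur_hi is not None:
                total += cur_hi - cur_lo + 1
            cur_lo, cur_hi = lo, hi
        else:
            # Overlapping (or adjacent) range: extend the current run.
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo + 1
    return total
```

The merge step is what makes overlapping inputs safe: each covered integer is counted exactly once regardless of how many ranges contain it.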
### Summary table
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 |
|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 50s | 53s | 17s | 22s | 41s | 19s | 27s | 23s | 27s | 20s |
| anthropic/claude-sonnet-4-6 | 41s | 39s | 53s | 31s | 28s | 27s | 28s | 24s | 28s | 23s |
| anthropic/claude-opus-4-6 | 40s | 32s | 54s | 41s | 32s | 30s | 33s | 28s | 32s | 24s |
| kimi-coding/k2p5 | 42s | 70s | 41s | 22s | 30s | 71s | 36s | 37s | 33s | 20s |
| alibaba/qwen3.5-plus | 45s | 85s | 45s | 25s | 42s | 24s | 27s | 24s | 28s | 28s |
| zai/glm-5 | 70s | 130s | 72s | 63s | 39s | 37s | 70s | 39s | 72s | 54s |
| minimax/MiniMax-M2.5 | 111s | 972s | 113s | 38s | 1078s | 473s | 707s | 166s | 140s | 44s |
| — ejected — | | | | | | | | | | |
| mistral/devstral-2512 | 33s | ✗ | | | | | | | | |
| alibaba/qwen3-coder-next | 74s | ✗ | | | | | | | | |
| alibaba/qwen3-max-2026-01-23 | 78s | ✗(—) | | | | | | | | |
| alibaba/qwen3-coder-plus | ✗(—) | | | | | | | | | |
## Observations
- **The top tier** — openai-codex/gpt-5.3-codex, anthropic/claude-sonnet-4-6, and anthropic/claude-opus-4-6 — was consistently fast and correct. GPT-5.3-Codex was often the fastest, with the Anthropic models close behind.
- **Solid mid-pack** — kimi-coding/k2p5 and alibaba/qwen3.5-plus were reliable and reasonably quick, occasionally matching the top performers.
- **Consistent but slow** — zai/glm-5 always got the right answer but typically took 2–3x longer than the leaders.
- **The outlier** — minimax/MiniMax-M2.5 always got the right answer eventually, but with wildly inconsistent timing: from 38s (competitive) to 1078s (nearly 18 minutes) on puzzles others solved in under a minute. Something about its approach leads to very expensive wrong turns before converging.
- **Early casualties** — the four ejected models all failed on the very first day. mistral/devstral-2512 was actually the fastest on D1P1 (33s!) but got Part 2 wrong. qwen3-max was the most frustrating: it computed the correct answer and then stopped generating before writing it to disk.
- **Haskell** — all surviving models produced compilable Haskell code that solved the puzzles correctly. I did not review the code quality itself.
## Methodology
### Orchestration
The whole benchmark is driven by a custom pi prompt that acts as a controller. It:
- Reads the enabled model list from pi's settings
- Creates an isolated working directory per model
- Launches each model as a separate `pi` agent in its own tmux window
- Feeds puzzle descriptions (pasted by the operator) into each model's directory
- Waits for agents to finish, then reads their `ANSWER.txt` files
- Compares answers, displays leaderboards, and handles ejections
- For Part 2, reuses the same tmux sessions so agents keep their Part 1 context (since AoC Part 2 typically builds on Part 1)
The orchestrator never reads puzzle descriptions itself and never solves anything — it only dispatches and scores.
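To make the launch bookkeeping concrete, here is a hypothetical Python sketch of it. The real controller is a prompt driving shell commands, not Python, and the helper name and return shape are mine; the tmux invocation, directory naming, and 3-second stagger follow the quoted prompt at the end of this post:

```python
def plan_launches(models: list[str], thinking: str = "off", stagger: int = 3):
    """Build the tmux launch command and stagger offset for each model.

    Hypothetical helper: returns a plan rather than executing it, so the
    naming and offset logic can be checked in isolation.
    """
    plan = []
    for i, model in enumerate(models):
        subdir = model.replace("/", "__")    # per-model work directory name
        window = subdir.replace(".", "_")    # '.' breaks tmux -t targets
        cmd = [
            "tmux", "new-window", "-n", window, "-c", subdir,
            f"nice -n 10 pi --model {model} --thinking {thinking} '<prompt>'",
        ]
        # Launch i is delayed by i * stagger seconds; that offset is later
        # subtracted from the model's raw elapsed time.
        plan.append({"model": model, "cmd": cmd, "offset": i * stagger})
    return plan
```

Executing the plan would then just be a loop of `subprocess.run(entry["cmd"])` followed by `time.sleep(stagger)`.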
### Timing
- Elapsed time = `ANSWER.txt` file modification time − launch time − stagger offset
- Agents are launched with a 3-second stagger between each to avoid a lock-file race condition in pi's settings. Each model's offset is subtracted from its raw time
- Times include the full cycle: reading the puzzle, writing Haskell code, compiling with GHC, testing against example input, running against real input, and writing the answer
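The stagger-corrected timing rule above works out to a one-liner. A hypothetical Python sketch (the real controller computes this via shell commands like `date +%s` and file mtimes):

```python
import os

def elapsed_seconds(answer_path: str, start_time: float,
                    launch_index: int, stagger: int = 3) -> float:
    """Stagger-corrected time-to-solution for one model:
    ANSWER.txt modification time - benchmark start - this model's launch offset."""
    mtime = os.path.getmtime(answer_path)
    return mtime - start_time - launch_index * stagger
```

So the model launched first (index 0) gets nothing subtracted, the second gets 3s, the third 6s, and so on.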
### Fairness controls
- Thinking/reasoning disabled (`--thinking off`) — keeps things fair across models that support extended thinking differently
- 5-second execution timeout — prevents runaway brute-force solutions from locking up the machine
- `nice -n 10` on all agent processes — prevents CPU starvation with 7+ concurrent compilations and executions
- No shared state — each model works in its own directory with no awareness of others
- Same prompt for all — every agent receives identical instructions and input paths
## Caveats
- This is a single run, not averaged over multiple attempts. Results may vary on repeated runs
- Wall-clock times could be heavily influenced by the inference platform. The same model served on different infrastructure (e.g. Cerebras vs a standard API endpoint) could produce dramatically different timings. This is why measuring solution complexity (see future ideas below) would be a more meaningful model-to-model comparison than raw elapsed time
## Future ideas
- Measure solution complexity — count the number of output lines or tool invocations each model produces. A model that solves a puzzle in 3 steps vs. 30 tells a very different story, even if wall-clock times are similar
- Post-mortem reflections — have each agent write a `CONCLUSION.txt` summarizing what went well, what went wrong, and how many attempts it took. This would give qualitative insight into each model's problem-solving approach
- Cross-model code review — have different LLMs rate each other's code for quality, readability, and idiomatic style. This raises interesting questions: would models be biased toward their own output? Should the code directories be anonymized before review? What rubric produces the most useful signal for "code quality"?
- Language comparison — run the same benchmark in different languages (e.g. Haskell vs Python vs Go) to see which models are language-specialists vs generalists
- Average over multiple runs — reduce variance from network latency and non-deterministic generation
Benchmarked on 2026-02-24 using pi as the agent harness.
## The full orchestration prompt
The prompt below is what drives the entire benchmark. It's a pi prompt — a markdown file that turns the agent into a benchmark controller. I paste puzzle descriptions into the chat, and the orchestrator handles everything else.
Click to expand the full prompt (~280 lines of markdown)
You are the **Benchmark AOC** orchestrator. You guide the user through benchmarking
multiple LLMs on Advent of Code puzzles, one part at a time. You dispatch work to agents,
collect results, maintain a leaderboard, and eject underperforming models as you go.
## Arguments
Parse:
- **Year** (e.g. `2025`) — required, ask if not provided
- **Language** (e.g. `Haskell`, `Ruby`, `Go`) — required, ask if not provided
- **Thinking level** (`off`, `minimal`, `low`, `medium`, `high`, `xhigh`) — optional, ask if
not provided. This sets the `--thinking` flag uniformly for all models, keeping the
benchmark fair. If a model doesn't support extended thinking, pi handles it gracefully.
Example invocations: `/benchmark-aoc 2025 Haskell high`, `/benchmark-aoc 2025 Ruby medium`
---
## Required filesystem layout
The inputs base is: `~/benchmark/aoc-inputs/<year>/inputs/`
Each day has a subdirectory:
```
DayNN/
  input.example   ← small example input (can be pre-staged for all days)
  input.real      ← the actual puzzle input (can be pre-staged for all days)
```
Puzzle descriptions are **not** stored in the shared inputs directory. Instead, the user
pastes them directly into the chat, and the orchestrator writes them to each active model's
subdirectory as `PART_<P>.description`. This prevents stale descriptions from leaking across
re-runs and ensures no agent sees a description before it's time.
Zero-pad the day number: `Day01`, `Day02`, ..., `Day09`, `Day10`, etc.
---
## State (track across the session)
- `active_models`: list of model names still in the benchmark (starts as all enabled models)
- `windows`: map of `model → tmux window name`
- `subdirs`: map of `model → absolute path of its work subdirectory`
- `work_dir`: the directory the prompt was launched from (captured at setup)
- `inputs_base`: `~/benchmark/aoc-inputs/<year>/inputs`
- `language`: target language
- `thinking`: thinking level (e.g. `high`)
- `current_day`: integer, starts at 1
- `leaderboard`: accumulated results across all days and parts
---
## Setup (once per session)
### 1. Get model list
```
cat ~/.pi/agent/settings.json | jq -r '.enabledModels[]'
```
This is the starting `active_models` list. Show it to the user.
### 2. Set work directory
The work directory is wherever the user launched this prompt from. Store it as `work_dir`.
Model subdirectories will be created inside this directory.
### 3. Create subdirectory per model
For each model, create `<work_dir>/<model-subdir>/` where the subdir name is the full model
name with `/` replaced by `__`.
Example: `anthropic/claude-opus-4-6` → `anthropic__claude-opus-4-6/`
---
## Main loop
Repeat for each day (1–25), parts 1 then 2, until all models are ejected or day 25 part 2
is complete.
---
### Phase A — Launch
**1. Verify input files and collect description**
Check that `input.example` and `input.real` exist. If input files are missing, tell the user
and wait.
Then ask the user to paste the puzzle description. When the user pastes it, write the
description to `<subdir>/PART_<P>.description` for **each active model's subdirectory**.
This keeps descriptions scoped per-model and per-run — no shared files that could leak
across re-runs.
**2. Clear stale ANSWER.txt files**
For each model in `active_models`:
```
rm -f <subdir>/ANSWER.txt
```
**3. Record start time**
```
date +%s
```
Store as `start_time`.
**4. Launch tmux windows**
For each model in `active_models`, open a new tmux window with a **3-second delay** between
each launch. Multiple `pi` instances starting simultaneously will fight over the global
settings lock file and crash. The stagger gives each instance time to acquire the lock, read
config, and release it.
Because of the stagger, **subtract each model's launch offset** when computing elapsed time.
Model #0 (launched first) gets 0s subtracted, model #1 gets 3s, model #2 gets 6s, etc.
```
tmux new-window -n <window-name> -c <subdir> \
  "nice -n 10 pi --model <model> --thinking <thinking> '<prompt>'"
sleep 3
```
The agent prompt tells the model to:
- Read the puzzle description from `./PART_<P>.description`
- Read example and real inputs from the shared inputs directory
- Verify against the example input first, then run against the real input
- Always run solutions with a 5-second timeout
- Write ONLY the final answer to `ANSWER.txt`
- Say DONE when finished
Window name = full model name with `/` replaced by `__` and `.` replaced by `_`.
The `.` replacement prevents tmux from interpreting `.` as a pane separator in `-t` targets.
**5. Tell the user**
Report how many agents were launched. Then wait for the user to type `done`.
---
### Phase B — Collect results
When the user types `done`:
**1. Read results**
For each model, read `ANSWER.txt` and compute elapsed time (file mtime − start_time).
**2. Ask for the correct answer**
Wait for the user's reply. Trim whitespace from both answers before comparing.
**3. Display leaderboard for this task**
Sort passing models by elapsed time (fastest first), then failing models below.
**4. Eject failing models**
For each model that gave a wrong or missing answer, ask the user whether to eject it.
For each confirmed ejection, kill its tmux window and remove it from `active_models`.
If no models remain, show the final leaderboard and stop.
---
### Phase C — Advance
#### If this was Part 1 → move to Part 2
Part 2 reuses the same tmux sessions. This is intentional — the agents keep their Part 1
context, which helps since AoC Part 2 typically builds on Part 1.
1. Ask the user to paste the Part 2 description. Write it to each surviving model's subdir.
2. Clear ANSWER.txt in each surviving model's subdir.
3. Record new `start_time`.
4. Inject Part 2 into surviving tmux windows via `tmux send-keys`.
5. Tell the user and wait for `done`. → Go to Phase B.
#### If this was Part 2 → move to next day
1. Increment `current_day`. If > 25, show the final leaderboard and stop.
2. Kill all surviving tmux windows. Fresh windows will be created by Phase A.
3. Go to Phase A.
---
## Final leaderboard
When all models are ejected or day 25 part 2 is complete, display a full summary table
showing times for passing cells and ✗ for failing cells.
---
## Rules
- NEVER solve puzzles yourself
- NEVER read puzzle descriptions — only write them to model subdirs and point agents there
- NEVER ask the user for a description for a part they haven't reached yet
- NEVER kill a tmux window without telling the user first
- ALWAYS write descriptions to each active model's subdir before launching
- ALWAYS clear ANSWER.txt before each new launch
- ALWAYS trim whitespace when comparing answers

## Transparency
This post was written with AI assistance to maximize efficiency given my time constraints.