Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Haskell)

Tags = [ Haskell, AI, Advent of Code ]

I benchmarked 11 LLMs on Advent of Code 2025 Days 1–5, each solving independently in Haskell. The goal: see which models can reliably produce correct, working solutions — and how fast.

How it works

I built a custom orchestration prompt for pi, a CLI coding agent. The prompt acts as a benchmark controller: it launches one pi agent per model in separate tmux windows, feeds them the puzzle description, waits for them to finish, collects their answers, scores them for correctness and time-to-solution, and maintains a leaderboard. Models that fail a puzzle get ejected from the competition.

Each agent works in complete isolation — its own directory, no shared state, no awareness of the others. They all get the same puzzle description and input files. The orchestrator never solves anything itself; it just dispatches, collects, and scores.

The contestants

All 11 models came from my enabled model list:

#   Model
1   anthropic/claude-opus-4-6
2   anthropic/claude-sonnet-4-6
3   openai-codex/gpt-5.3-codex
4   zai/glm-5
5   minimax/MiniMax-M2.5
6   kimi-coding/k2p5
7   mistral/devstral-2512
8   alibaba/qwen3.5-plus
9   alibaba/qwen3-max-2026-01-23
10  alibaba/qwen3-coder-next
11  alibaba/qwen3-coder-plus

Ejections

Models that failed a puzzle were put up for ejection. Four models didn't survive past Day 1:

Model                         Ejected at  Reason
alibaba/qwen3-coder-plus      D1P1        No answer (never wrote ANSWER.txt)
mistral/devstral-2512         D1P2        Wrong answer
alibaba/qwen3-coder-next      D1P2        Wrong answer
alibaba/qwen3-max-2026-01-23  D1P2        No answer (got the right answer but stopped before writing the file!)

The qwen3-max case was particularly painful — the model computed the correct answer, said "Now I'll write the answer to ANSWER.txt:" and then... just stopped generating. The answer was right there. 😩

Results (Days 1–5)

The surviving 7 models went on a perfect streak through Days 1–5, all producing correct answers for every part. Here's the full timing breakdown:


Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
mistral/devstral-2512           33s
anthropic/claude-opus-4-6       40s
anthropic/claude-sonnet-4-6     41s
kimi-coding/k2p5                42s
alibaba/qwen3.5-plus            45s
openai-codex/gpt-5.3-codex      50s
zai/glm-5                       70s
alibaba/qwen3-coder-next        74s
alibaba/qwen3-max-2026-01-23    78s
minimax/MiniMax-M2.5           111s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time
anthropic/claude-opus-4-6       32s
anthropic/claude-sonnet-4-6     39s
openai-codex/gpt-5.3-codex      53s
kimi-coding/k2p5                70s
alibaba/qwen3.5-plus            85s
zai/glm-5                      130s
minimax/MiniMax-M2.5           972s



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
openai-codex/gpt-5.3-codex      17s
kimi-coding/k2p5                41s
alibaba/qwen3.5-plus            45s
anthropic/claude-sonnet-4-6     53s
anthropic/claude-opus-4-6       54s
zai/glm-5                       72s
minimax/MiniMax-M2.5           113s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time
openai-codex/gpt-5.3-codex      22s
kimi-coding/k2p5                22s
alibaba/qwen3.5-plus            25s
anthropic/claude-sonnet-4-6     31s
minimax/MiniMax-M2.5            38s
anthropic/claude-opus-4-6       41s
zai/glm-5                       63s
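I didn't review the models' code, but going only by the task title, the core predicate for this part (an ID whose digits are one block repeated two or more times) fits in a few lines of Haskell. This is my own illustrative sketch, not any model's solution:

```haskell
-- Illustrative only: inferred from the task title, not from the puzzle text.
-- True if the decimal representation of n is one digit block repeated two
-- or more times, e.g. 1212 (12 x 2), 777 (7 x 3), 123123123 (123 x 3).
isRepeatedPattern :: Int -> Bool
isRepeatedPattern n = or
  [ concat (replicate (len `div` k) (take k s)) == s
  | k <- [1 .. len `div` 2]
  , len `mod` k == 0
  ]
  where
    s   = show n
    len = length s
```

A solution would presumably sum every ID in the input ranges for which this predicate holds.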



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time
anthropic/claude-sonnet-4-6     28s
kimi-coding/k2p5                30s
anthropic/claude-opus-4-6       32s
zai/glm-5                       39s
openai-codex/gpt-5.3-codex      41s
alibaba/qwen3.5-plus            42s
minimax/MiniMax-M2.5          1078s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex      19s
alibaba/qwen3.5-plus            24s
anthropic/claude-sonnet-4-6     27s
anthropic/claude-opus-4-6       30s
zai/glm-5                       37s
kimi-coding/k2p5                71s
minimax/MiniMax-M2.5           473s



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
openai-codex/gpt-5.3-codex      27s
alibaba/qwen3.5-plus            27s
anthropic/claude-sonnet-4-6     28s
anthropic/claude-opus-4-6       33s
kimi-coding/k2p5                36s
zai/glm-5                       70s
minimax/MiniMax-M2.5           707s



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
openai-codex/gpt-5.3-codex      23s
anthropic/claude-sonnet-4-6     24s
alibaba/qwen3.5-plus            24s
anthropic/claude-opus-4-6       28s
kimi-coding/k2p5                37s
zai/glm-5                       39s
minimax/MiniMax-M2.5           166s



Day 5 Part 1 — Range membership checking

Model                          Time
openai-codex/gpt-5.3-codex      27s
anthropic/claude-sonnet-4-6     28s
alibaba/qwen3.5-plus            28s
anthropic/claude-opus-4-6       32s
kimi-coding/k2p5                33s
zai/glm-5                       72s
minimax/MiniMax-M2.5           140s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time
openai-codex/gpt-5.3-codex      20s
kimi-coding/k2p5                20s
anthropic/claude-sonnet-4-6     23s
anthropic/claude-opus-4-6       24s
alibaba/qwen3.5-plus            28s
minimax/MiniMax-M2.5            44s
zai/glm-5                       54s
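Judging again only by the task title, the heart of this part is the classic interval sweep: sort the ranges, fuse overlaps, and sum the sizes of the fused intervals. A sketch of mine (assuming inclusive integer ranges), not taken from any model's solution:

```haskell
import Data.List (sortOn)

-- Illustrative only: count the distinct integers covered by a set of
-- inclusive, possibly overlapping ranges. Sort by lower bound, fuse
-- overlapping or adjacent intervals, then sum the fused sizes.
coveredCount :: [(Int, Int)] -> Int
coveredCount = sum . map size . fuse . sortOn fst
  where
    size (lo, hi) = hi - lo + 1
    fuse ((a, b) : (c, d) : rest)
      | c <= b + 1 = fuse ((a, max b d) : rest)   -- overlap or touch: merge
      | otherwise  = (a, b) : fuse ((c, d) : rest)
    fuse xs = xs
```

The sweep keeps the whole computation linear after the sort, which is why this part was quick for every model.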

Summary table

Model                          D1P1   D1P2   D2P1   D2P2   D3P1   D3P2   D4P1   D4P2   D5P1   D5P2
openai-codex/gpt-5.3-codex      50s    53s    17s    22s    41s    19s    27s    23s    27s    20s
anthropic/claude-sonnet-4-6     41s    39s    53s    31s    28s    27s    28s    24s    28s    23s
anthropic/claude-opus-4-6       40s    32s    54s    41s    32s    30s    33s    28s    32s    24s
kimi-coding/k2p5                42s    70s    41s    22s    30s    71s    36s    37s    33s    20s
alibaba/qwen3.5-plus            45s    85s    45s    25s    42s    24s    27s    24s    28s    28s
zai/glm-5                       70s   130s    72s    63s    39s    37s    70s    39s    72s    54s
minimax/MiniMax-M2.5           111s   972s   113s    38s  1078s   473s   707s   166s   140s    44s
— ejected —
mistral/devstral-2512           33s      ✗
alibaba/qwen3-coder-next        74s      ✗
alibaba/qwen3-max-2026-01-23    78s   ✗(—)
alibaba/qwen3-coder-plus       ✗(—)

Observations

The top tier — openai-codex/gpt-5.3-codex, anthropic/claude-sonnet-4-6, and anthropic/claude-opus-4-6 were consistently fast and correct. GPT-5.3-Codex was often the fastest, with the Anthropic models close behind.

Solid mid-pack — kimi-coding/k2p5 and alibaba/qwen3.5-plus were reliable and reasonably quick, occasionally matching the top performers.

Consistent but slow — zai/glm-5 always got the right answer but typically took 2–3x longer than the leaders.

The outlier — minimax/MiniMax-M2.5 always got the right answer eventually, but with wildly inconsistent timing. It ranged from 38s (competitive) to 1078s (18 minutes!) for puzzles others solved in under a minute. Something about its approach leads to very expensive wrong turns before converging.

Early casualties — The four ejected models all failed on the very first day. mistral/devstral-2512 was actually the fastest on D1P1 (33s!) but got Part 2 wrong. qwen3-max was the most frustrating: it computed the correct answer and then stopped generating before writing it to disk.

Haskell — All surviving models produced compilable Haskell that solved the puzzles correctly. I did not review the code quality itself.

Methodology

Orchestration

The whole benchmark is driven by a custom pi prompt that acts as a controller. It:

  • Reads the enabled model list from pi's settings
  • Creates an isolated working directory per model
  • Launches each model as a separate pi agent in its own tmux window
  • Feeds puzzle descriptions (pasted by the operator) into each model's directory
  • Waits for agents to finish, then reads their ANSWER.txt files
  • Compares answers, displays leaderboards, and handles ejections
  • For Part 2, reuses the same tmux sessions so agents keep their Part 1 context (since AoC Part 2 typically builds on Part 1)

The orchestrator never reads puzzle descriptions itself and never solves anything — it only dispatches and scores.
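The scoring comparison itself is just a whitespace-insensitive string match (the prompt instructs the orchestrator to trim both answers before comparing). As a Haskell sketch, with names of my own choosing:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Sketch of the scoring check: a model's ANSWER.txt passes if it matches
-- the known-correct answer after trimming surrounding whitespace, so a
-- trailing newline from putStrLn doesn't count as a wrong answer.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

isCorrect :: String -> String -> Bool
isCorrect expected actual = trim expected == trim actual
```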

Timing

  • Elapsed time = ANSWER.txt file modification time − launch time − stagger offset
  • Agents are launched with a 3-second stagger between each to avoid a lock file race condition in pi's settings. Each model's offset is subtracted from its raw time
  • Times include the full cycle: reading the puzzle, writing Haskell code, compiling with GHC, testing against example input, running against real input, and writing the answer
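In code form, the timing rule above reduces to a single subtraction chain. A sketch with my own names (the orchestrator does this arithmetic itself as a prompt-driven agent, not via compiled code):

```haskell
-- Elapsed time for one model, in seconds: the ANSWER.txt modification
-- time minus the shared launch timestamp, minus that model's launch
-- stagger (3 seconds per position in the launch order, 0-based).
elapsedSeconds :: Int -> Int -> Int -> Int
elapsedSeconds answerMTime launchTime launchIndex =
  answerMTime - launchTime - 3 * launchIndex
```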

Fairness controls

  • Thinking/reasoning disabled (--thinking off) — keeps things fair across models that support extended thinking differently
  • 5-second execution timeout — prevents runaway brute-force solutions from locking up the machine
  • nice -n 10 on all agent processes — prevents CPU starvation with 7+ concurrent compilations and executions
  • No shared state — each model works in its own directory with no awareness of others
  • Same prompt for all — every agent receives identical instructions and input paths

Caveats

  • This is a single run, not averaged over multiple attempts. Results may vary on repeated runs
  • Wall-clock times could be heavily influenced by the inference platform. The same model served on different infrastructure (e.g. Cerebras vs a standard API endpoint) could produce dramatically different timings. This is why measuring solution complexity (see future ideas below) would be a more meaningful model-to-model comparison than raw elapsed time

Future ideas

  • Measure solution complexity — count the number of output lines or tool invocations each model produces. A model that solves a puzzle in 3 steps vs. 30 tells a very different story, even if wall-clock times are similar
  • Post-mortem reflections — have each agent write a CONCLUSION.txt summarizing what went well, what went wrong, and how many attempts it took. This would give qualitative insight into each model's problem-solving approach
  • Cross-model code review — have different LLMs rate each other's code for quality, readability, and idiomatic style. This raises interesting questions: would models be biased toward their own output? Should the code directories be anonymized before review? What rubric produces the most useful signal for "code quality"?
  • Language comparison — run the same benchmark in different languages (e.g. Haskell vs Python vs Go) to see which models are language-specialists vs generalists
  • Average over multiple runs — reduce variance from network latency and non-deterministic generation

Benchmarked on 2026-02-24 using pi as the agent harness.

The full orchestration prompt

The prompt below is what drives the entire benchmark. It's a pi prompt — a markdown file that turns the agent into a benchmark controller. I paste puzzle descriptions into the chat, and the orchestrator handles everything else.

Click to expand the full prompt (~280 lines of markdown)
You are the **Benchmark AOC** orchestrator. You guide the user through benchmarking
multiple LLMs on Advent of Code puzzles, one part at a time. You dispatch work to agents,
collect results, maintain a leaderboard, and eject underperforming models as you go.

## Arguments

Parse:
- **Year** (e.g. `2025`) — required, ask if not provided
- **Language** (e.g. `Haskell`, `Ruby`, `Go`) — required, ask if not provided
- **Thinking level** (`off`, `minimal`, `low`, `medium`, `high`, `xhigh`) — optional, ask if
  not provided. This sets the `--thinking` flag uniformly for all models, keeping the
  benchmark fair. If a model doesn't support extended thinking, pi handles it gracefully.

Example invocations: `/benchmark-aoc 2025 Haskell high`, `/benchmark-aoc 2025 Ruby medium`

---

## Required filesystem layout

The inputs base is: `~/benchmark/aoc-inputs/<year>/inputs/`

Each day has a subdirectory:

    DayNN/
      input.example       ← small example input (can be pre-staged for all days)
      input.real          ← the actual puzzle input (can be pre-staged for all days)

Puzzle descriptions are **not** stored in the shared inputs directory. Instead, the user
pastes them directly into the chat, and the orchestrator writes them to each active model's
subdirectory as `PART_<P>.description`. This prevents stale descriptions from leaking across
re-runs and ensures no agent sees a description before it's time.

Zero-pad the day number: `Day01`, `Day02`, ..., `Day09`, `Day10`, etc.

---

## State (track across the session)

- `active_models`: list of model names still in the benchmark (starts as all enabled models)
- `windows`: map of `model → tmux window name`
- `subdirs`: map of `model → absolute path of its work subdirectory`
- `work_dir`: the directory the prompt was launched from (captured at setup)
- `inputs_base`: `~/benchmark/aoc-inputs/<year>/inputs`
- `language`: target language
- `thinking`: thinking level (e.g. `high`)
- `current_day`: integer, starts at 1
- `leaderboard`: accumulated results across all days and parts

---

## Setup (once per session)

### 1. Get model list

    cat ~/.pi/agent/settings.json | jq -r '.enabledModels[]'

This is the starting `active_models` list. Show it to the user.

### 2. Set work directory

The work directory is wherever the user launched this prompt from. Store it as `work_dir`.
Model subdirectories will be created inside this directory.

### 3. Create subdirectory per model

For each model, create `<work_dir>/<model-subdir>/` where the subdir name is the full model
name with `/` replaced by `__`.

Example: `anthropic/claude-opus-4-6` → `anthropic__claude-opus-4-6/`

---

## Main loop

Repeat for each day (1–25), parts 1 then 2, until all models are ejected or day 25 part 2
is complete.

---

### Phase A — Launch

**1. Verify input files and collect description**

Check that `input.example` and `input.real` exist. If input files are missing, tell the user
and wait.

Then ask the user to paste the puzzle description. When the user pastes it, write the
description to `<subdir>/PART_<P>.description` for **each active model's subdirectory**.
This keeps descriptions scoped per-model and per-run — no shared files that could leak
across re-runs.

**2. Clear stale ANSWER.txt files**

For each model in `active_models`:

    rm -f <subdir>/ANSWER.txt

**3. Record start time**

    date +%s

Store as `start_time`.

**4. Launch tmux windows**

For each model in `active_models`, open a new tmux window with a **3-second delay** between
each launch. Multiple `pi` instances starting simultaneously will fight over the global
settings lock file and crash. The stagger gives each instance time to acquire the lock, read
config, and release it.

Because of the stagger, **subtract each model's launch offset** when computing elapsed time.
Model #0 (launched first) gets 0s subtracted, model #1 gets 3s, model #2 gets 6s, etc.

    tmux new-window -n <window-name> -c <subdir> \
      "nice -n 10 pi --model <model> --thinking <thinking> '<prompt>'"
    sleep 3

The agent prompt tells the model to:
- Read the puzzle description from `./PART_<P>.description`
- Read example and real inputs from the shared inputs directory
- Verify against the example input first, then run against the real input
- Always run solutions with a 5-second timeout
- Write ONLY the final answer to `ANSWER.txt`
- Say DONE when finished

Window name = full model name with `/` replaced by `__` and `.` replaced by `_`.
The `.` replacement prevents tmux from interpreting `.` as a pane separator in `-t` targets.

**5. Tell the user**

Report how many agents were launched. Then wait for the user to type `done`.

---

### Phase B — Collect results

When the user types `done`:

**1. Read results**

For each model, read `ANSWER.txt` and compute elapsed time (file mtime − start_time).

**2. Ask for the correct answer**

Wait for the user's reply. Trim whitespace from both answers before comparing.

**3. Display leaderboard for this task**

Sort passing models by elapsed time (fastest first), then failing models below.

**4. Eject failing models**

For each model that gave a wrong or missing answer, ask the user whether to eject it.
For each confirmed ejection, kill its tmux window and remove it from `active_models`.

If no models remain, show the final leaderboard and stop.

---

### Phase C — Advance

#### If this was Part 1 → move to Part 2

Part 2 reuses the same tmux sessions. This is intentional — the agents keep their Part 1
context, which helps since AoC Part 2 typically builds on Part 1.

1. Ask the user to paste the Part 2 description. Write it to each surviving model's subdir.
2. Clear ANSWER.txt in each surviving model's subdir.
3. Record new `start_time`.
4. Inject Part 2 into surviving tmux windows via `tmux send-keys`.
5. Tell the user and wait for `done`. → Go to Phase B.

#### If this was Part 2 → move to next day

1. Increment `current_day`. If > 25, show the final leaderboard and stop.
2. Kill all surviving tmux windows. Fresh windows will be created by Phase A.
3. Go to Phase A.

---

## Final leaderboard

When all models are ejected or day 25 part 2 is complete, display a full summary table
showing times for passing cells and ✗ for failing cells.

---

## Rules

- NEVER solve puzzles yourself
- NEVER read puzzle descriptions — only write them to model subdirs and point agents there
- NEVER ask the user for a description for a part they haven't reached yet
- NEVER kill a tmux window without telling the user first
- ALWAYS write descriptions to each active model's subdir before launching
- ALWAYS clear ANSWER.txt before each new launch
- ALWAYS trim whitespace when comparing answers


Transparency

This post was written with AI assistance to maximize efficiency given my time constraints.