Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (Haskell)

Tags = [ Haskell, AI, Advent of Code ]

I benchmarked 11 LLMs on Advent of Code 2025 Days 1–5, each solving independently in Haskell. The goal: see which models can reliably produce correct, working solutions — and how fast.

How it works

I built a custom orchestration prompt for pi, a CLI coding agent. The prompt acts as a benchmark controller: it launches one pi agent per model in separate tmux windows, feeds them the puzzle description, waits for them to finish, collects their answers, scores them for correctness and time-to-solution, and maintains a leaderboard. Models that fail a puzzle get ejected from the competition.

Each agent works in complete isolation — its own directory, no shared state, no awareness of the others. They all get the same puzzle description and input files. The orchestrator never solves anything itself; it just dispatches, collects, and scores.

The contestants

All 11 models came from my enabled model list:

#   Model
1   anthropic/claude-opus-4-6
2   anthropic/claude-sonnet-4-6
3   openai-codex/gpt-5.3-codex
4   zai/glm-5
5   minimax/MiniMax-M2.5
6   kimi-coding/k2p5
7   mistral/devstral-2512
8   alibaba/qwen3.5-plus
9   alibaba/qwen3-max-2026-01-23
10  alibaba/qwen3-coder-next
11  alibaba/qwen3-coder-plus

Ejections

Models that failed a puzzle were put up for ejection. Four models didn't survive past Day 1:

Model                         Ejected at  Reason
alibaba/qwen3-coder-plus      D1P1        No answer (never wrote ANSWER.txt)
mistral/devstral-2512         D1P2        Wrong answer
alibaba/qwen3-coder-next      D1P2        Wrong answer
alibaba/qwen3-max-2026-01-23  D1P2        No answer (got the right answer but stopped before writing the file!)

The qwen3-max case was particularly painful — the model computed the correct answer, said "Now I'll write the answer to ANSWER.txt:" and then... just stopped generating. The answer was right there. 😩

Results (Days 1–5)

The surviving 7 models went on a perfect streak through Days 1–5, all producing correct answers for every part. Here's the full timing breakdown:


Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model                          Time
mistral/devstral-2512           33s
anthropic/claude-opus-4-6       40s
anthropic/claude-sonnet-4-6     41s
kimi-coding/k2p5                42s
alibaba/qwen3.5-plus            45s
openai-codex/gpt-5.3-codex      50s
zai/glm-5                       70s
alibaba/qwen3-coder-next        74s
alibaba/qwen3-max-2026-01-23    78s
minimax/MiniMax-M2.5           111s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model                          Time
anthropic/claude-opus-4-6       32s
anthropic/claude-sonnet-4-6     39s
openai-codex/gpt-5.3-codex      53s
kimi-coding/k2p5                70s
alibaba/qwen3.5-plus            85s
zai/glm-5                      130s
minimax/MiniMax-M2.5           972s



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model                          Time
openai-codex/gpt-5.3-codex      17s
kimi-coding/k2p5                41s
alibaba/qwen3.5-plus            45s
anthropic/claude-sonnet-4-6     53s
anthropic/claude-opus-4-6       54s
zai/glm-5                       72s
minimax/MiniMax-M2.5           113s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model                          Time
openai-codex/gpt-5.3-codex      22s
kimi-coding/k2p5                22s
alibaba/qwen3.5-plus            25s
anthropic/claude-sonnet-4-6     31s
minimax/MiniMax-M2.5            38s
anthropic/claude-opus-4-6       41s
zai/glm-5                       63s
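I didn't review the models' code, but going only by the task title, the core predicate for this part (an ID whose digits are one block repeated two or more times) fits in a few lines of Haskell. This is my own illustrative sketch, not any model's solution:

```haskell
-- Illustrative only: inferred from the task title, not from the puzzle text.
-- True if the decimal representation of n is one digit block repeated two
-- or more times, e.g. 1212 (12 x 2), 777 (7 x 3), 123123123 (123 x 3).
isRepeatedPattern :: Int -> Bool
isRepeatedPattern n = or
  [ concat (replicate (len `div` k) (take k s)) == s
  | k <- [1 .. len `div` 2]
  , len `mod` k == 0
  ]
  where
    s   = show n
    len = length s
```

A solution would presumably sum every ID in the input ranges for which this predicate holds.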



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model                          Time
anthropic/claude-sonnet-4-6     28s
kimi-coding/k2p5                30s
anthropic/claude-opus-4-6       32s
zai/glm-5                       39s
openai-codex/gpt-5.3-codex      41s
alibaba/qwen3.5-plus            42s
minimax/MiniMax-M2.5          1078s



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model                          Time
openai-codex/gpt-5.3-codex      19s
alibaba/qwen3.5-plus            24s
anthropic/claude-sonnet-4-6     27s
anthropic/claude-opus-4-6       30s
zai/glm-5                       37s
kimi-coding/k2p5                71s
minimax/MiniMax-M2.5           473s



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model                          Time
openai-codex/gpt-5.3-codex      27s
alibaba/qwen3.5-plus            27s
anthropic/claude-sonnet-4-6     28s
anthropic/claude-opus-4-6       33s
kimi-coding/k2p5                36s
zai/glm-5                       70s
minimax/MiniMax-M2.5           707s



Day 4 Part 2 — Iterative grid removal simulation

Model                          Time
openai-codex/gpt-5.3-codex      23s
anthropic/claude-sonnet-4-6     24s
alibaba/qwen3.5-plus            24s
anthropic/claude-opus-4-6       28s
kimi-coding/k2p5                37s
zai/glm-5                       39s
minimax/MiniMax-M2.5           166s



Day 5 Part 1 — Range membership checking

Model                          Time
openai-codex/gpt-5.3-codex      27s
anthropic/claude-sonnet-4-6     28s
alibaba/qwen3.5-plus            28s
anthropic/claude-opus-4-6       32s
kimi-coding/k2p5                33s
zai/glm-5                       72s
minimax/MiniMax-M2.5           140s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model                          Time
openai-codex/gpt-5.3-codex      20s
kimi-coding/k2p5                20s
anthropic/claude-sonnet-4-6     23s
anthropic/claude-opus-4-6       24s
alibaba/qwen3.5-plus            28s
minimax/MiniMax-M2.5            44s
zai/glm-5                       54s
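Judging again only by the task title, the heart of this part is the classic interval sweep: sort the ranges, fuse overlaps, and sum the sizes of the fused intervals. A sketch of mine (assuming inclusive integer ranges), not taken from any model's solution:

```haskell
import Data.List (sortOn)

-- Illustrative only: count the distinct integers covered by a set of
-- inclusive, possibly overlapping ranges. Sort by lower bound, fuse
-- overlapping or adjacent intervals, then sum the fused sizes.
coveredCount :: [(Int, Int)] -> Int
coveredCount = sum . map size . fuse . sortOn fst
  where
    size (lo, hi) = hi - lo + 1
    fuse ((a, b) : (c, d) : rest)
      | c <= b + 1 = fuse ((a, max b d) : rest)   -- overlap or touch: merge
      | otherwise  = (a, b) : fuse ((c, d) : rest)
    fuse xs = xs
```

The sweep keeps the whole computation linear after the sort, which is why this part was quick for every model.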

Summary table

Model                          D1P1   D1P2   D2P1   D2P2   D3P1   D3P2   D4P1   D4P2   D5P1   D5P2
openai-codex/gpt-5.3-codex      50s    53s    17s    22s    41s    19s    27s    23s    27s    20s
anthropic/claude-sonnet-4-6     41s    39s    53s    31s    28s    27s    28s    24s    28s    23s
anthropic/claude-opus-4-6       40s    32s    54s    41s    32s    30s    33s    28s    32s    24s
kimi-coding/k2p5                42s    70s    41s    22s    30s    71s    36s    37s    33s    20s
alibaba/qwen3.5-plus            45s    85s    45s    25s    42s    24s    27s    24s    28s    28s
zai/glm-5                       70s   130s    72s    63s    39s    37s    70s    39s    72s    54s
minimax/MiniMax-M2.5           111s   972s   113s    38s  1078s   473s   707s   166s   140s    44s
— ejected —
mistral/devstral-2512           33s      ✗
alibaba/qwen3-coder-next        74s      ✗
alibaba/qwen3-max-2026-01-23    78s   ✗(—)
alibaba/qwen3-coder-plus       ✗(—)

Observations

The top tier — openai-codex/gpt-5.3-codex, anthropic/claude-sonnet-4-6, and anthropic/claude-opus-4-6 were consistently fast and correct. GPT-5.3-Codex was often the fastest, with the Anthropic models close behind.

Solid mid-pack — kimi-coding/k2p5 and alibaba/qwen3.5-plus were reliable and reasonably quick, occasionally matching the top performers.

Consistent but slow — zai/glm-5 always got the right answer but typically took 2–3x longer than the leaders.

The outlier — minimax/MiniMax-M2.5 always got the right answer eventually, but with wildly inconsistent timing. It ranged from 38s (competitive) to 1078s (18 minutes!) for puzzles others solved in under a minute. Something about its approach leads to very expensive wrong turns before converging.

Early casualties — The four ejected models all failed on the very first day. mistral/devstral-2512 was actually the fastest on D1P1 (33s!) but got Part 2 wrong. qwen3-max was the most frustrating: it computed the correct answer and then stopped generating before writing it to disk.

Haskell — All surviving models produced compilable Haskell that solved the puzzles correctly. I did not review the code quality itself.

Methodology

Orchestration

The whole benchmark is driven by a custom pi prompt that acts as a controller. It:

  • Reads the enabled model list from pi's settings
  • Creates an isolated working directory per model
  • Launches each model as a separate pi agent in its own tmux window
  • Feeds puzzle descriptions (pasted by the operator) into each model's directory
  • Waits for agents to finish, then reads their ANSWER.txt files
  • Compares answers, displays leaderboards, and handles ejections
  • For Part 2, reuses the same tmux sessions so agents keep their Part 1 context (since AoC Part 2 typically builds on Part 1)

The orchestrator never reads puzzle descriptions itself and never solves anything — it only dispatches and scores.
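The scoring comparison itself is just a whitespace-insensitive string match (the prompt instructs the orchestrator to trim both answers before comparing). As a Haskell sketch, with names of my own choosing:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Sketch of the scoring check: a model's ANSWER.txt passes if it matches
-- the known-correct answer after trimming surrounding whitespace, so a
-- trailing newline from putStrLn doesn't count as a wrong answer.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

isCorrect :: String -> String -> Bool
isCorrect expected actual = trim expected == trim actual
```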

Timing

  • Elapsed time = ANSWER.txt file modification time − launch time − stagger offset
  • Agents are launched with a 3-second stagger between each to avoid a lock file race condition in pi's settings. Each model's offset is subtracted from its raw time
  • Times include the full cycle: reading the puzzle, writing Haskell code, compiling with GHC, testing against example input, running against real input, and writing the answer
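In code form, the timing rule above reduces to a single subtraction chain. A sketch with my own names (the orchestrator does this arithmetic itself as a prompt-driven agent, not via compiled code):

```haskell
-- Elapsed time for one model, in seconds: the ANSWER.txt modification
-- time minus the shared launch timestamp, minus that model's launch
-- stagger (3 seconds per position in the launch order, 0-based).
elapsedSeconds :: Int -> Int -> Int -> Int
elapsedSeconds answerMTime launchTime launchIndex =
  answerMTime - launchTime - 3 * launchIndex
```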

Fairness controls

  • Thinking/reasoning disabled (--thinking off) — keeps things fair across models that support extended thinking differently
  • 5-second execution timeout — prevents runaway brute-force solutions from locking up the machine
  • nice -n 10 on all agent processes — prevents CPU starvation with 7+ concurrent compilations and executions
  • No shared state — each model works in its own directory with no awareness of others
  • Same prompt for all — every agent receives identical instructions and input paths

Caveats

  • This is a single run, not averaged over multiple attempts. Results may vary on repeated runs
  • Wall-clock times could be heavily influenced by the inference platform. The same model served on different infrastructure (e.g. Cerebras vs a standard API endpoint) could produce dramatically different timings. This is why measuring solution complexity (see future ideas below) would be a more meaningful model-to-model comparison than raw elapsed time

Future ideas

  • Measure solution complexity — count the number of output lines or tool invocations each model produces. A model that solves a puzzle in 3 steps vs. 30 tells a very different story, even if wall-clock times are similar
  • Post-mortem reflections — have each agent write a CONCLUSION.txt summarizing what went well, what went wrong, and how many attempts it took. This would give qualitative insight into each model's problem-solving approach
  • Cross-model code review — have different LLMs rate each other's code for quality, readability, and idiomatic style. This raises interesting questions: would models be biased toward their own output? Should the code directories be anonymized before review? What rubric produces the most useful signal for "code quality"?
  • Language comparison — run the same benchmark in different languages (e.g. Haskell vs Python vs Go) to see which models are language-specialists vs generalists
  • Average over multiple runs — reduce variance from network latency and non-deterministic generation

Benchmarked on 2026-02-24 using pi as the agent harness.

The full orchestration prompt

The prompt below is what drives the entire benchmark. It's a pi prompt — a markdown file that turns the agent into a benchmark controller. I paste puzzle descriptions into the chat, and the orchestrator handles everything else.

Click to expand the full prompt (~280 lines of markdown)
You are the **Benchmark AOC** orchestrator. You guide the user through benchmarking
multiple LLMs on Advent of Code puzzles, one part at a time. You dispatch work to agents,
collect results, maintain a leaderboard, and eject underperforming models as you go.

## Arguments

Parse:
- **Year** (e.g. `2025`) — required, ask if not provided
- **Language** (e.g. `Haskell`, `Ruby`, `Go`) — required, ask if not provided
- **Thinking level** (`off`, `minimal`, `low`, `medium`, `high`, `xhigh`) — optional, ask if
  not provided. This sets the `--thinking` flag uniformly for all models, keeping the
  benchmark fair. If a model doesn't support extended thinking, pi handles it gracefully.

Example invocations: `/benchmark-aoc 2025 Haskell high`, `/benchmark-aoc 2025 Ruby medium`

---

## Required filesystem layout

The inputs base is: `~/benchmark/aoc-inputs/<year>/inputs/`

Each day has a subdirectory:

    DayNN/
      input.example       ← small example input (can be pre-staged for all days)
      input.real          ← the actual puzzle input (can be pre-staged for all days)

Puzzle descriptions are **not** stored in the shared inputs directory. Instead, the user
pastes them directly into the chat, and the orchestrator writes them to each active model's
subdirectory as `PART_<P>.description`. This prevents stale descriptions from leaking across
re-runs and ensures no agent sees a description before it's time.

Zero-pad the day number: `Day01`, `Day02`, ..., `Day09`, `Day10`, etc.

---

## State (track across the session)

- `active_models`: list of model names still in the benchmark (starts as all enabled models)
- `windows`: map of `model → tmux window name`
- `subdirs`: map of `model → absolute path of its work subdirectory`
- `work_dir`: the directory the prompt was launched from (captured at setup)
- `inputs_base`: `~/benchmark/aoc-inputs/<year>/inputs`
- `language`: target language
- `thinking`: thinking level (e.g. `high`)
- `current_day`: integer, starts at 1
- `leaderboard`: accumulated results across all days and parts

---

## Setup (once per session)

### 1. Get model list

    cat ~/.pi/agent/settings.json | jq -r '.enabledModels[]'

This is the starting `active_models` list. Show it to the user.

### 2. Set work directory

The work directory is wherever the user launched this prompt from. Store it as `work_dir`.
Model subdirectories will be created inside this directory.

### 3. Create subdirectory per model

For each model, create `<work_dir>/<model-subdir>/` where the subdir name is the full model
name with `/` replaced by `__`.

Example: `anthropic/claude-opus-4-6` → `anthropic__claude-opus-4-6/`

---

## Main loop

Repeat for each day (1–25), parts 1 then 2, until all models are ejected or day 25 part 2
is complete.

---

### Phase A — Launch

**1. Verify input files and collect description**

Check that `input.example` and `input.real` exist. If input files are missing, tell the user
and wait.

Then ask the user to paste the puzzle description. When the user pastes it, write the
description to `<subdir>/PART_<P>.description` for **each active model's subdirectory**.
This keeps descriptions scoped per-model and per-run — no shared files that could leak
across re-runs.

**2. Clear stale ANSWER.txt files**

For each model in `active_models`:

    rm -f <subdir>/ANSWER.txt

**3. Record start time**

    date +%s

Store as `start_time`.

**4. Launch tmux windows**

For each model in `active_models`, open a new tmux window with a **3-second delay** between
each launch. Multiple `pi` instances starting simultaneously will fight over the global
settings lock file and crash. The stagger gives each instance time to acquire the lock, read
config, and release it.

Because of the stagger, **subtract each model's launch offset** when computing elapsed time.
Model #0 (launched first) gets 0s subtracted, model #1 gets 3s, model #2 gets 6s, etc.

    tmux new-window -n <window-name> -c <subdir> \
      "nice -n 10 pi --model <model> --thinking <thinking> '<prompt>'"
    sleep 3

The agent prompt tells the model to:
- Read the puzzle description from `./PART_<P>.description`
- Read example and real inputs from the shared inputs directory
- Verify against the example input first, then run against the real input
- Always run solutions with a 5-second timeout
- Write ONLY the final answer to `ANSWER.txt`
- Say DONE when finished

Window name = full model name with `/` replaced by `__` and `.` replaced by `_`.
The `.` replacement prevents tmux from interpreting `.` as a pane separator in `-t` targets.

**5. Tell the user**

Report how many agents were launched. Then wait for the user to type `done`.

---

### Phase B — Collect results

When the user types `done`:

**1. Read results**

For each model, read `ANSWER.txt` and compute elapsed time (file mtime − start_time).

**2. Ask for the correct answer**

Wait for the user's reply. Trim whitespace from both answers before comparing.

**3. Display leaderboard for this task**

Sort passing models by elapsed time (fastest first), then failing models below.

**4. Eject failing models**

For each model that gave a wrong or missing answer, ask the user whether to eject it.
For each confirmed ejection, kill its tmux window and remove it from `active_models`.

If no models remain, show the final leaderboard and stop.

---

### Phase C — Advance

#### If this was Part 1 → move to Part 2

Part 2 reuses the same tmux sessions. This is intentional — the agents keep their Part 1
context, which helps since AoC Part 2 typically builds on Part 1.

1. Ask the user to paste the Part 2 description. Write it to each surviving model's subdir.
2. Clear ANSWER.txt in each surviving model's subdir.
3. Record new `start_time`.
4. Inject Part 2 into surviving tmux windows via `tmux send-keys`.
5. Tell the user and wait for `done`. → Go to Phase B.

#### If this was Part 2 → move to next day

1. Increment `current_day`. If > 25, show the final leaderboard and stop.
2. Kill all surviving tmux windows. Fresh windows will be created by Phase A.
3. Go to Phase A.

---

## Final leaderboard

When all models are ejected or day 25 part 2 is complete, display a full summary table
showing times for passing cells and ✗ for failing cells.

---

## Rules

- NEVER solve puzzles yourself
- NEVER read puzzle descriptions — only write them to model subdirs and point agents there
- NEVER ask the user for a description for a part they haven't reached yet
- NEVER kill a tmux window without telling the user first
- ALWAYS write descriptions to each active model's subdir before launching
- ALWAYS clear ANSWER.txt before each new launch
- ALWAYS trim whitespace when comparing answers


Transparency

This post was written with AI assistance to maximize efficiency given my time constraints.