
Benchmarking a Local LLM on Advent of Code 2025 (Ollama)

Tags = [ Python, Haskell, AI, Advent of Code, Ollama ]

The previous benchmarks in this series (Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm) all used cloud API models — big, frontier-class LLMs served by Anthropic, OpenAI, and others. But what about a local model? Can a 14-billion-parameter model running on a single machine solve the same puzzles?

This post answers that question using qwen2.5-coder:14b via Ollama, tested on AoC 2025 Day 1 in both Python and Haskell.

The hardware

Laptop:  TUXEDO Pulse 14 Gen4
CPU:     AMD Ryzen 7 8845HS (8 cores / 16 threads, up to 5.1 GHz)
RAM:     26 GB DDR5
GPU:     AMD Radeon 780M (integrated; no dedicated GPU)
OS:      TUXEDO OS 3 (Ubuntu-based), kernel 6.11

This is a general-purpose laptop, not a machine you'd pick for local AI inference. With no dedicated GPU, Ollama most likely falls back to CPU inference, which makes generation painfully slow.

The model

Model:             qwen2.5-coder:14b
Released:          November 2024
Parameters:        ~14B
Size on disk:      ~9 GB
Runtime:           Ollama 0.4.1 (most likely CPU inference)
Generation speed:  Very slow (~40–90s per response)

For comparison, the cloud models in previous benchmarks typically solve Day 1 in 10–50 seconds total — including reading the puzzle, writing code, running it, verifying, and producing the answer. The local model takes that long just to generate a single response.

Why not use pi?

The first thing I tried was running the model through pi, the same agent harness used for all the other benchmarks. That failed immediately.

I assumed Ollama's OpenAI-compatible API would let pi drive the model the same way it drives cloud models. It didn't work: instead of making actual tool calls, the model emitted JSON-shaped text with placeholder values and then declared "DONE", without reading the puzzle, writing any code, or running anything. I didn't investigate further.

{"name": "read", "arguments": {"path": "./PART_1.description"}}
{"name": "bash", "arguments": {"command": "timeout 5 python solution.py < ..."}}
{"name": "write", "arguments": {"path": "./ANSWER.txt", "content": "<paste the output here>"}}
DONE
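
For contrast, a working tool call through Ollama's OpenAI-compatible endpoint would come back in the structured tool_calls field of the response, not as JSON-shaped text in the message content. A minimal sketch of what that looks like (the tool schema below is illustrative, not pi's actual tool definition):

from openai import OpenAI

# Ollama serves an OpenAI-compatible API at /v1; the api_key is
# required by the client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5-coder:14b",
    messages=[{"role": "user", "content": "Read ./PART_1.description"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "read",
            "description": "Read a file and return its contents",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)

# A model that actually supports tool calling populates this field;
# the failure above suggests the model fell back to plain text instead.
print(resp.choices[0].message.tool_calls)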

Even if tool calls had worked, pi's background work (reading files, running commands) was very slow through Ollama on this hardware. This ruled out using pi as-is and led me to a completely different methodology.

Methodology

Instead of having the local model use pi directly, I used raw Ollama CLI calls orchestrated by a separate cloud-hosted pi instance acting as a manual agent loop (sketched in code after the list):

  1. Write a prompt containing the full puzzle description, example input, and expected output
  2. Send it to Ollama with ollama run qwen2.5-coder:14b < prompt.txt
  3. Extract the code from the response (stripping markdown fences)
  4. Run it against the example input, then the real input
  5. If wrong: build a new prompt with the puzzle description + the actual error output, retry
  6. Up to 10 attempts per part
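
In Python, that loop condenses to something like the following (a sketch: the helper names, the fence-stripping regex, and the check step are illustrative; the real orchestration was done by the cloud pi instance):

import re
import subprocess

MODEL = "qwen2.5-coder:14b"
FENCE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def generate(prompt: str) -> str:
    # One stateless Ollama call: the model sees only this prompt.
    result = subprocess.run(["ollama", "run", MODEL],
                            input=prompt, capture_output=True, text=True)
    return result.stdout

def extract_code(response: str) -> str:
    # Strip markdown fences; fall back to the raw response.
    match = FENCE.search(response)
    return match.group(1) if match else response

def solve_part(puzzle: str, example_in: str, example_out: str,
               max_attempts: int = 10) -> int | None:
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        # Every retry rebuilds the full context: puzzle, example, feedback.
        prompt = (f"{puzzle}\n\nExample input:\n{example_in}\n"
                  f"Expected output: {example_out}\n{feedback}")
        code = extract_code(generate(prompt))
        ok, feedback = check(code, example_in, example_out)  # run & compare (not shown)
        if ok:
            return attempt
    return None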

Each Ollama call is stateless — the model has no memory of previous attempts. Every retry includes the full puzzle context plus the error feedback from the previous attempt.

Error feedback rules

The feedback given on retries was deliberately minimal — similar to what a human would see:

  • Compile/runtime error → include the actual error output (traceback, compiler message)
  • Wrong answer on example → "Your code outputs X but the expected answer is Y"
  • Wrong answer on real input → "That's not the right answer"

No debugging hints, no solution suggestions, no explanation of why the code was wrong.
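
In code form, the feedback construction amounts to roughly this (a sketch; the wording follows the rules above, but the RunResult type and function are illustrative):

from dataclasses import dataclass

@dataclass
class RunResult:
    error: str = ""           # traceback / compiler message, if any
    example_output: str = ""  # what the code printed on the example input
    expected: str = ""        # expected example answer

def build_feedback(r: RunResult) -> str:
    # Deliberately minimal, mirroring what a human solver would see.
    if r.error:
        return r.error  # the actual error output, nothing more
    if r.example_output != r.expected:
        return (f"Your code outputs {r.example_output} "
                f"but the expected answer is {r.expected}")
    return "That's not the right answer"  # wrong answer on the real input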

Timing

Each Ollama call was timed independently (wall-clock, generation only). The reported time is for the successful attempt — the one that produced the correct answer. Total cumulative time across all attempts is also noted.
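
Concretely, each measurement wraps just the generation call (a sketch; the timer placement is mine, but it matches the "generation only" rule above):

import subprocess
import time

def timed_generate(prompt: str, model: str = "qwen2.5-coder:14b") -> tuple[str, float]:
    # Wall-clock time around the Ollama call only; compiling and running
    # the produced code is not included in these numbers.
    start = time.monotonic()
    out = subprocess.run(["ollama", "run", model],
                         input=prompt, capture_output=True, text=True)
    return out.stdout, time.monotonic() - start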

Key differences from the cloud benchmarks

                  Cloud benchmarks              This test
Agent harness     pi (autonomous)               Manual loop (pi as orchestrator)
Tool use          Full (read, write, bash)      None; one-shot code generation
Self-correction   Model runs & debugs own code  Orchestrator runs code, sends errors back
Memory            Full conversation context     Stateless (each attempt is fresh)
Retries           3 max                         10 max
Concurrency       Up to 11 models in parallel   1 model, sequential

Results

Python

Day 1 Part 1 — Dial rotation counting

Attempt  Time  Result  Issue
1        62s   ✗       Parsing bug
2        48s   ✗       Parsing bug
3        34s   ✗       Parsing bug
4        50s   ✗       Parsing bug
5        45s   ✗       Parsing bug
6        41s   ✓

Day 1 Part 2 — Counting zero-crossings during dial rotation

Attempt  Time  Result  Issue
1        59s   ✓

Haskell

Day 1 Part 1 — Dial rotation counting

Attempt  Time  Result  Issue
1        78s   ✗       Compiled and ran, wrong answer (normalize only adds 100 once)
2        69s   ✗       Type error: Int vs String
3        48s   ✗       Type error: Char vs String
4        62s   ✗       Non-exhaustive pattern match
5        54s   ✓       Used mod properly

Day 1 Part 2 — Counting zero-crossings during dial rotation

Attempt  Time  Result  Issue
1        68s   ✗       Wrong answer (5): iterate includes start position
2        44s   ✗       Wrong answer (5): same off-by-one
3        60s   ✗       Type error: variable not in scope
4        61s   ✗       Wrong answer (5)
5        73s   ✗       Type errors: wrong accumulator type in fold
6        96s   ✗       Same type error
7        54s   ✗       Wrong answer (5): resets position to 50 each rotation
8        62s   ✗       Type error: Int vs String
9        59s   ✗       Shadowed the lines function
10       66s   ✓       Correct on the very last attempt

Summary

         D1P1               D1P2                Attempts  Total gen time
Python   ✓ (6th try, 41s)   ✓ (1st try, 59s)    7         ~339s
Haskell  ✓ (5th try, 54s)   ✓ (10th try, 66s)   15        ~954s

Observations

The most visible weakness was input parsing. In Python, the model kept making the same parsing mistake for 5 attempts in a row, despite receiving the error output each time. It took 6 attempts to find a correct approach.

The algorithmic logic seemed mostly sound. When the code compiled and parsed input correctly, the core algorithm was often right. The Part 1 logic (modular arithmetic) and Part 2 logic (click-by-click simulation) were correct in most attempts — the model just couldn't always wire them up with correct parsing and types.
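
The puzzle text isn't reproduced in this post, so the following is only an illustrative sketch of those two techniques, assuming a 100-position dial starting at 50 (the details the attempt logs hint at); the real input format and rules differ:

def part1(rotations: list[int], start: int = 50) -> int:
    # Modular arithmetic: apply each rotation, wrap with mod 100,
    # and count how often the dial ends a rotation on 0.
    pos, zeros = start, 0
    for r in rotations:
        pos = (pos + r) % 100
        zeros += pos == 0
    return zeros

def part2(rotations: list[int], start: int = 50) -> int:
    # Click-by-click simulation: move one position at a time and count
    # every click that lands on 0. Counting from the first click, not
    # the starting position, avoids the off-by-one several Haskell
    # attempts hit when iterate included the seed value.
    pos, crossings = start, 0
    for r in rotations:
        step = 1 if r > 0 else -1
        for _ in range(abs(r)):
            pos = (pos + step) % 100
            crossings += pos == 0
    return crossings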

Haskell was significantly harder. 15 attempts vs. 7 for Python. The model struggled with Haskell's type system — confusing Char with String, Int with [Char], shadowing the lines function, and getting accumulator types wrong in folds.

Haskell Part 2 barely made it. The model solved it on attempt 10 of 10. Four attempts compiled and ran but produced the wrong answer (5 instead of 6), most of them the same off-by-one where iterate included the start position in the click list. Five attempts didn't compile at all. Only one attempt out of ten got everything right.

Python Part 2 worked on the first try. That's the part that eliminated 4 of 11 cloud models in the Haskell benchmark and 5 of 9 in the OCaml benchmark.

Cost: $0. The entire benchmark cost nothing in API fees; the only costs are electricity and patience. For a hobbyist or someone learning, getting a correct answer for free is meaningful, even if it takes 10 tries.

Conclusion

A 14B local model can solve AoC Day 1 — but it needs patience. Where cloud models solve both parts in under a minute with zero retries, qwen2.5-coder:14b needed 7 attempts for Python and 15 for Haskell, with a total generation time of ~22 minutes.

Whether that's useful depends on your tolerance for iteration. At $0 per attempt, with a local setup you control entirely, there's a case to be made — especially for easier puzzles or languages the model handles well (Python >> Haskell).

Benchmarked on 2026-02-27 using Ollama 0.4.1 for local inference and pi as the orchestrator.

Discussion

Join the conversation on the Haskell Discourse.


This post was written with AI assistance.