The previous benchmarks in this series (Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm) all used cloud API models — big, frontier-class LLMs served by Anthropic, OpenAI, and others. But what about a local model? Can a 14-billion-parameter model running on a single machine solve the same puzzles?
This post answers that question using qwen2.5-coder:14b via
Ollama, tested on AoC 2025 Day 1 in both Python and Haskell.
The hardware
| Component | Spec |
|---|---|
| Laptop | TUXEDO Pulse 14 Gen4 |
| CPU | AMD Ryzen 7 8845HS (8 cores / 16 threads, up to 5.1 GHz) |
| RAM | 26 GB DDR5 |
| GPU | AMD Radeon 780M (integrated — no dedicated GPU) |
| OS | TUXEDO OS 3 (Ubuntu-based), kernel 6.11 |
This is a general-purpose laptop, not a machine you'd pick for local AI inference. With no dedicated GPU, Ollama most likely falls back to CPU inference, which makes generation painfully slow.
The model
| Spec | Value |
|---|---|
| Model | qwen2.5-coder:14b |
| Released | November 2024 |
| Parameters | ~14B |
| Size on disk | ~9 GB |
| Runtime | Ollama 0.4.1 (CPU inference most likely) |
| Generation speed | Very slow (~40–90s per response) |
For comparison, the cloud models in previous benchmarks typically solve Day 1 in 10–50 seconds total — including reading the puzzle, writing code, running it, verifying, and producing the answer. The local model takes that long just to generate a single response.
Why not use pi?
The first thing I tried was running the model through pi, the same agent harness used for all the other benchmarks. That failed immediately.
I assumed Ollama's OpenAI-compatible API would let pi drive the model the same way it drives cloud models. It didn't: instead of making actual tool calls, the model printed JSON-shaped text with placeholder values and immediately declared "DONE", without reading the puzzle, writing any code, or running anything. I didn't investigate further.
```
{"name": "read", "arguments": {"path": "./PART_1.description"}}
{"name": "bash", "arguments": {"command": "timeout 5 python solution.py < ..."}}
{"name": "write", "arguments": {"path": "./ANSWER.txt", "content": "<paste the output here>"}}
DONE
```
Even if tool calls had worked, pi's background work (reading files, running commands) was very slow through Ollama on this hardware. This ruled out using pi as-is and led me to a completely different methodology.
Methodology
Instead of having the local model use pi directly, I used raw Ollama CLI calls orchestrated by a separate cloud-hosted pi instance acting as a manual agent loop:
- Write a prompt containing the full puzzle description, example input, and expected output
- Send it to Ollama via `ollama run qwen2.5-coder:14b < prompt.txt`
- Extract the code from the response (stripping markdown fences)
- Run it against the example input, then the real input
- If wrong: build a new prompt with the puzzle description + the actual error output, retry
- Up to 10 attempts per part
Each Ollama call is stateless — the model has no memory of previous attempts. Every retry includes the full puzzle context plus the error feedback from the previous attempt.
Error feedback rules
The feedback given on retries was deliberately minimal — similar to what a human would see:
- Compile/runtime error → include the actual error output (traceback, compiler message)
- Wrong answer on example → "Your code outputs X but the expected answer is Y"
- Wrong answer on real input → "That's not the right answer"
No debugging hints, no solution suggestions, no explanation of why the code was wrong.
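The three rules can be captured in a tiny helper (a sketch; the category names and exact wording are illustrative, not taken from the harness):

```python
def build_feedback(kind: str, detail: str = "", expected: str = "") -> str:
    """Map a failure category to the minimal feedback string sent on retry."""
    if kind == "error":           # compile/runtime error: pass the output through verbatim
        return detail
    if kind == "wrong_example":   # wrong answer on the example input
        return f"Your code outputs {detail} but the expected answer is {expected}"
    if kind == "wrong_real":      # wrong answer on the real input: no specifics at all
        return "That's not the right answer"
    raise ValueError(f"unknown failure kind: {kind}")
```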
Timing
Each Ollama call was timed independently (wall-clock, generation only). The reported time is for the successful attempt — the one that produced the correct answer. Total cumulative time across all attempts is also noted.
Key differences from the cloud benchmarks
| | Cloud benchmarks | This test |
|---|---|---|
| Agent harness | pi (autonomous) | Manual loop (pi as orchestrator) |
| Tool use | Full (read, write, bash) | None — one-shot code generation |
| Self-correction | Model can run & debug its own code | Orchestrator runs code, sends error back |
| Memory | Full conversation context | Stateless (each attempt is fresh) |
| Retries | 3 max | 10 max |
| Concurrency | Up to 11 models in parallel | 1 model, sequential |
Results
Python
Day 1 Part 1 — Dial rotation counting
| Attempt | Time | Result | Issue |
|---|---|---|---|
| 1 | 62s | ✗ | Parsing bug |
| 2 | 48s | ✗ | Parsing bug |
| 3 | 34s | ✗ | Parsing bug |
| 4 | 50s | ✗ | Parsing bug |
| 5 | 45s | ✗ | Parsing bug |
| 6 | 41s | ✓ | — |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Attempt | Time | Result | Issue |
|---|---|---|---|
| 1 | 59s | ✓ | — |
Haskell
Day 1 Part 1 — Dial rotation counting
| Attempt | Time | Result | Issue |
|---|---|---|---|
| 1 | 78s | ✗ | Compiled, ran, wrong answer — normalize only adds 100 once |
| 2 | 69s | ✗ | Type error: Int vs String |
| 3 | 48s | ✗ | Type error: Char vs String |
| 4 | 62s | ✗ | Non-exhaustive pattern match |
| 5 | 54s | ✓ | Correct — used mod properly |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Attempt | Time | Result | Issue |
|---|---|---|---|
| 1 | 68s | ✗ | Wrong answer (5) — iterate includes start position |
| 2 | 44s | ✗ | Wrong answer (5) — same off-by-one |
| 3 | 60s | ✗ | Type error: variable not in scope |
| 4 | 61s | ✗ | Wrong answer (5) |
| 5 | 73s | ✗ | Type errors: wrong accumulator type in fold |
| 6 | 96s | ✗ | Same type error |
| 7 | 54s | ✗ | Wrong answer (5) — resets position to 50 each rotation |
| 8 | 62s | ✗ | Type error: Int vs String |
| 9 | 59s | ✗ | Shadowed lines function |
| 10 | 66s | ✓ | Correct — on the very last attempt |
Summary
| | D1P1 | D1P2 | Attempts | Total gen time |
|---|---|---|---|---|
| Python | ✓ (6th try, 41s) | ✓ (1st try, 59s) | 7 | ~339s |
| Haskell | ✓ (5th try, 54s) | ✓ (10th try, 66s) | 15 | ~954s |
Observations
The most visible weakness was input parsing. In Python, the model kept making the same parsing mistake for 5 attempts in a row, despite receiving the error output each time. It took 6 attempts to find a correct approach.
The algorithmic logic seemed mostly sound. When the code compiled and parsed input correctly, the core algorithm was often right. The Part 1 logic (modular arithmetic) and Part 2 logic (click-by-click simulation) were correct in most attempts — the model just couldn't always wire them up with correct parsing and types.
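For reference, here is a minimal Python sketch of the two approaches, assuming the dial mechanics implied by the error notes in this post (a 100-position dial starting at 50, with signed rotation amounts; these specifics are my reconstruction, not quoted from the puzzle):

```python
def part1(rotations, start=50, size=100):
    """Apply each rotation with modular arithmetic; count stops on 0."""
    pos, zeros = start, 0
    for delta in rotations:          # negative = left, positive = right (assumed convention)
        pos = (pos + delta) % size   # mod handles any magnitude, unlike adding 100 once
        if pos == 0:
            zeros += 1
    return zeros

def part2(rotations, start=50, size=100):
    """Simulate click by click; count every time the dial lands on 0."""
    pos, zeros = start, 0
    for delta in rotations:
        step = 1 if delta > 0 else -1
        for _ in range(abs(delta)):  # the start position itself is not a click
            pos = (pos + step) % size
            if pos == 0:
                zeros += 1
    return zeros
```

The `part2` comment marks exactly the off-by-one that sank four of the Haskell attempts: including the start position in the click list inflates the count.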
Haskell was significantly harder: 15 attempts versus 7 for Python. The model struggled with Haskell's type system, confusing Char with String and Int with [Char], shadowing the `lines` function, and getting accumulator types wrong in folds.
Haskell Part 2 barely made it. The model solved it on attempt 10 of 10. Four attempts compiled and ran but produced the same wrong answer (5 instead of 6): a consistent off-by-one error where `iterate` included the start position in the click list. Five attempts didn't compile at all. Only one attempt out of ten got everything right.
Python Part 2 worked on the first try. That is the same part that eliminated 4 of 11 cloud models in the Haskell benchmark and 5 of 9 in the OCaml benchmark.
Cost: $0. The entire benchmark cost nothing in API fees; the only costs were electricity and patience. For a hobbyist or someone learning, getting correct answers for free is meaningful, even if it takes 10 tries.
Conclusion
A 14B local model can solve AoC Day 1 — but it needs patience. Where cloud models solve
both parts in under a minute with zero retries, qwen2.5-coder:14b needed 7 attempts for
Python and 15 for Haskell, with a total generation time of ~22 minutes.
Whether that's useful depends on your tolerance for iteration. At $0 per attempt, with a local setup you control entirely, there's a case to be made — especially for easier puzzles or languages the model handles well (Python >> Haskell).
Benchmarked on 2026-02-27 using Ollama 0.4.1 for local inference and pi as the orchestrator.
Discussion
Join the conversation on the Haskell Discourse.
This post was written with AI assistance.