The previous benchmarks in this series (Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm) all used cloud API models — big, frontier-class LLMs served by Anthropic, OpenAI, and others. But what about a local model? Can a 14-billion-parameter model running on a single machine solve the same puzzles?
This post answers that question using qwen2.5-coder:14b via Ollama, tested on AoC 2025 Day 1 in both Python and Haskell.
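For readers who want to reproduce the setup, the model can be fetched and run locally with Ollama's standard CLI (assuming Ollama is already installed; the exact prompt used in the benchmark is not shown here):

```shell
# Download the 14B coder model (roughly 9 GB quantized)
ollama pull qwen2.5-coder:14b

# One-shot prompt from the command line; output prints to stdout
ollama run qwen2.5-coder:14b "Write a Haskell function that sums a list of Ints."
```

The same model can also be queried over Ollama's local HTTP API, which is handy for scripting a benchmark harness.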