Developers, developers, developers!

A blog about programming, programming, and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 — Recap Strict Mode

This is the second run of the AoC 2025 LLM benchmark, with stricter rules. The same 10 models solved the same 5 days (10 parts) across the same 12 programming languages, but with two changes:

  1. No retries. A wrong answer or timeout results in immediate ejection from that language. No nudges, no second chances.
  2. No language-specific scaffolding. No system prompts teaching syntax (as was done for ReScript run 3 in the previous benchmark). Every model receives the same prompt regardless of language.

The previous benchmark allowed retries during the run, then applied strict scoring retroactively in the recap. This run enforces strict mode at execution time — ejected models never get the chance to try again.

Benchmarking LLMs on Advent of Code 2025 — The Recap

Over the past week I ran the same 10 AoC 2025 puzzles (Days 1–5, Parts 1 and 2) across 12 programming languages, pitting 10 LLMs against each other in complete isolation. Each model got the same puzzle, the same inputs, and had to produce a correct answer — or be ejected.

This post pulls together the results from all 12 individual benchmark posts and applies a stricter scoring rule: a retry counts as a failure. Only first-try correct answers count as passes.

I am an AI reviewer assisting with this recap. I also reviewed the generated code in the benchmark directories for indicators such as language-idiomatic usage, raw JavaScript injection in ReScript, and general implementation quality. These quality checks are heuristic and should be interpreted as qualitative signal, not ground truth.

Benchmarking LLMs on Advent of Code 2025 (Clojure)

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, Rust, and Racket benchmarks, I ran the same AoC 2025 Days 1–5 setup in Clojure.

Clojure is a Lisp dialect that runs on the JVM. It's known for its persistent data structures, REPL-driven development, and strong concurrency primitives. For this benchmark, models needed to write standalone scripts runnable via clj. The JVM startup cost is real — one model got trapped in repeated slow clj invocations on a single part, ballooning its wall-clock time — but the language itself posed no conceptual difficulty. No scaffolding was provided.

The result: 9 of 10 models completed all 10 parts. One ejection on Day 1 Part 2.

Benchmarking LLMs on Advent of Code 2025 (F#)

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, and Elm benchmarks, I ran the same AoC 2025 Days 1–5 setup in F#.

F# occupies an interesting middle ground. It's a functional-first language on .NET — strongly typed with type inference, pattern matching, and pipelines, but with full access to the imperative .NET ecosystem when needed. It sees real production use but isn't anywhere near as common as C# or Python in training data. No scaffold was provided; each model had to figure out dotnet fsi scripting or full project setup on its own.

The result: another clean sweep. Every model solved every part.

Benchmarking a Local LLM on Advent of Code 2025 (Ollama)

The previous benchmarks in this series (Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm) all used cloud API models — big, frontier-class LLMs served by Anthropic, OpenAI, and others. But what about a local model? Can a 14-billion-parameter model running on a single machine solve the same puzzles?

This post answers that question using qwen2.5-coder:14b via Ollama, tested on AoC 2025 Day 1 in both Python and Haskell.

Benchmarking LLMs on Advent of Code 2025 (Racket)

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, Elm, and Rust benchmarks, I ran the same AoC 2025 Days 1–5 setup in Racket.

Racket is a Lisp dialect from the Scheme family. It's well-known in the programming languages community and widely used in education (How to Design Programs, SICP variants), but it's not a mainstream production language. Models need to handle S-expressions, #lang racket conventions, and functional idioms with mutable state available but discouraged. No scaffolding was provided — each model started from scratch.

The result: another clean sweep. Every model solved every part.

Benchmarking LLMs on Advent of Code 2025 (Rust)

Following up on the Haskell, OCaml, Python, Elixir, Elm, Java, ReScript, and Ruby benchmarks, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time in Rust.

Rust is a compiled systems language with strict ownership rules and a demanding compiler. Models have to deal with borrow-checking, lifetime annotations, and explicit error handling just to get a solution that compiles.
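As a sketch of what that demand looks like in practice, here is a minimal, hypothetical Day 1-style parse (the input format and the absolute-difference sum are invented for illustration, not taken from any actual puzzle): the compiler forces the fallible number parse to be handled explicitly through Result rather than letting errors slip through.

```rust
// Parse whitespace-separated integer pairs, one pair per line.
// The `?` operator propagates any ParseIntError instead of panicking.
fn parse_pairs(input: &str) -> Result<Vec<(i64, i64)>, std::num::ParseIntError> {
    input
        .lines()
        .filter(|l| !l.is_empty())
        .map(|line| {
            let mut parts = line.split_whitespace();
            let a: i64 = parts.next().unwrap_or("0").parse()?;
            let b: i64 = parts.next().unwrap_or("0").parse()?;
            Ok((a, b))
        })
        .collect() // Iterator<Item = Result<_>> collects into Result<Vec<_>>
}

fn main() {
    let input = "3 4\n1 2\n";
    match parse_pairs(input) {
        Ok(pairs) => {
            // Sum of absolute differences — a typical Day 1 shape.
            let total: i64 = pairs.iter().map(|(a, b)| (a - b).abs()).sum();
            println!("{total}");
        }
        Err(e) => eprintln!("parse error: {e}"),
    }
}
```

Nothing here is hard, but every step of it — ownership of the string slices, the Result plumbing, the collect into Result — must type-check before the model gets any output at all.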

Benchmarking LLMs on Advent of Code 2025 (Elm)

Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, and Java benchmarks, I ran the same AoC 2025 Days 1–5 setup in Elm.

Elm is the most niche language in this series. It's a pure functional language that compiles to JavaScript, has no native CLI story, and sees relatively little use outside its frontend niche. Each model received a pre-built scaffold — run.mjs, elm.json, and a Day00.elm template — that compiles and runs Elm modules via Node.js. The question was whether models would handle Elm's strict type system, lack of escape hatches, and unfamiliar idioms (e.g. Debug.log for output, Platform.worker for headless programs).

The answer: every single one of them did.

Benchmarking LLMs on Advent of Code 2025 (ReScript)

Following up on the Haskell benchmark, the OCaml benchmark, and the Python benchmark, I ran AoC 2025 Days 1–5 in ReScript — a typed functional language that compiles to JavaScript with a lean standard library, a distinct syntax, and very limited LLM training data.

This post covers three runs of the same benchmark, each adding a different intervention to see what helps models cope with an unfamiliar language:

  1. Run 1 — no help at all. 3-minute timeout. 1 completer out of 10.
  2. Run 2 — overflow warning + longer timeout. 2 completers.
  3. Run 3 — a ReScript system prompt teaching syntax, stdlib, and types. 7 completers.

Benchmarking LLMs on Advent of Code 2025 (OCaml)

Following up on the Haskell benchmark, I ran the same orchestration setup on the same AoC 2025 Days 1–5 puzzles — this time requiring solutions in OCaml. The methodology is identical: each model gets an isolated directory, a puzzle description, and must write its final answer to ANSWER.txt. Wrong answer or no answer = ejection.

Generate Passwords

A quick way to generate passwords on the command line
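One common approach (among many — the length and character class here are arbitrary choices, not necessarily the post's) pulls random bytes from the kernel's CSPRNG and keeps only alphanumeric characters:

```shell
# Generate a 20-character alphanumeric password from /dev/urandom.
LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c 20; echo
```

Where OpenSSL is installed, openssl rand -base64 24 is an equally common one-liner.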

Closures

Examples in Ruby and JavaScript
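As a taste of what that post covers, here is a Ruby closure capturing and mutating a variable from its enclosing scope (a hypothetical counter, not necessarily the post's exact example):

```ruby
# Each lambda returned by make_counter closes over its own `count`,
# so repeated calls see and mutate the same captured variable.
def make_counter
  count = 0
  -> { count += 1 }
end

counter = make_counter
counter.call
counter.call
puts counter.call  # prints 3
```

A fresh call to make_counter starts a fresh count — the state lives in the closure, not in a global.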