Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (ReScript)

Tags = [ ReScript, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, and the Python benchmark, I ran AoC 2025 Days 1–5 in ReScript — a typed functional language that compiles to JavaScript with a lean standard library, a distinct syntax, and very limited LLM training data.

This post covers three runs of the same benchmark, each adding a different intervention to see what helps models cope with an unfamiliar language:

  1. Run 1 — no help at all. 3-minute timeout. 1 completer out of 10.
  2. Run 2 — overflow warning + longer timeout. 2 completers.
  3. Run 3 — a ReScript system prompt teaching syntax, stdlib, and types. 7 completers.

The contestants

Same 10 models across all three runs:

| # | Model |
|---|-------|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |

Run 2 — overflow warning + longer timeout

The first run was rough: 7 of 10 models ejected, claude-haiku-4-5 the sole completer, integer overflow the root cause of most failures. Two things changed for run 2:

  1. Overflow warning baked into every prompt. ReScript's int type is 32 bits on the JavaScript runtime. Several puzzles produce answers in the tens of billions or hundreds of trillions — well beyond that ceiling. No warning was given in run 1. In run 2 every prompt includes an explicit note that large answers require float or BigInt.

  2. Timeout increased from 3 to 5 minutes per attempt. ReScript project setup (initialising npm, configuring the build system, first compile) is non-trivial for models unfamiliar with the ecosystem. Three minutes left very little room for the actual solving.
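The overflow described in point 1 is easy to demonstrate in plain JavaScript, since that is what ReScript compiles to: int multiplication is emitted as Math.imul, which wraps at 32 bits, while float and bigint arithmetic do not. A minimal sketch, not taken from any model's solution:

```javascript
// ReScript compiles int multiplication to Math.imul — a 32-bit wrapping multiply.
const a = 100000;
console.log(Math.imul(a, a));        // 1410065408 — 10^10 wrapped into 32 bits
// The same product as a plain JS number (ReScript's float) is exact:
console.log(a * a);                  // 10000000000
// BigInt (ReScript's bigint) is arbitrary precision:
console.log(BigInt(a) * BigInt(a));  // 10000000000n
```

So a model that writes `a * b` on ReScript ints silently gets the wrapped value; the fix is to parse and compute in float or bigint from the start.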

The rules (Run 2)

Part 1 runs up to 3 attempts, each with a 5-minute window:

  • Attempt 1: plain prompt + overflow warning
  • Attempt 2 (on failure): "YOU MUST read llm-small.txt before starting"
  • Attempt 3 (still failing): "YOU MUST read llm-full.txt before starting"
  • Fail all three → ejected

The working directory contains two versions of the official ReScript llms.txt documentation: a 5,578-line condensed version and a 14,405-line full API reference. Models are not told about them on the first attempt.

Part 2 gets exactly one attempt. Any failure — wrong answer, timeout, or API error — means immediate ejection. No llms.txt hints.

API errors (quota limits, network failures) count as free retries and are not charged against the attempt limit.


Ejections (Run 2)

| Model | Ejected at | Reason |
|---|---|---|
| mistral/devstral-2512 | D1P1 | Filled ~75% of its 200k context window with no answer |
| openai-codex/gpt-5.3-codex | D1P1 | Entered a loop of echoing prompts back as "DONE" without doing any work |
| anthropic/claude-haiku-4-5 | D1P2 | Wrong answer |
| anthropic/claude-sonnet-4-6 | D1P2 | Wrong answer |
| zai/glm-5 | D1P2 | Wrong answer |
| minimax/MiniMax-M2.5 | D1P2 | Wrong answer |
| alibaba/qwen3-coder-next | D2P1 | 40+ minutes in a compile loop, 45% of context consumed — ejected on excessive cost |
| alibaba/qwen3.5-plus | D5P2 | 32-bit integer overflow on the final puzzle |

Two models completed all 10 parts: claude-opus-4-6 and kimi-coding/k2p5.


Did the overflow warning make a difference?

Yes — but with an asterisk.

In run 1, both claude-opus-4-6 and claude-sonnet-4-6 overflowed on Day 5 Part 2 and were ejected. In run 2, opus answered correctly and completed the benchmark. The warning worked exactly as intended for it.

alibaba/qwen3.5-plus tells a more complicated story. In run 1 it timed out on Day 2 Part 1 and was ejected. In run 2 it sailed through Days 2, 3, and 4 — even detecting and self-correcting an overflow mid-run on Day 2 Part 2. Then on Day 5 Part 2 it overflowed anyway and was ejected. The warning helped it reach six extra puzzle parts; it wasn't enough to protect it at the end.

The four models ejected on Day 1 Part 2 — haiku, sonnet, glm-5, minimax — gave algorithmically wrong answers rather than overflow answers. The overflow warning had nothing to do with their ejection. (In run 1, different models fell at Day 1 Part 2 — it's a tricky boundary-condition puzzle that catches different models in different sessions.)


Results (Run 2)

Day 1 Part 1 — Dial rotation counting

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 52s | 2,613 | $0.23 |
| claude-sonnet-4-6 | 109s | 6,335 | $0.29 |
| zai/glm-5 | 150s | 3,495 | $0.08 |
| claude-haiku-4-5 | 183s | 16,079 | $0.32 |
| alibaba/qwen3.5-plus | 337s | 18,512 | $1.06 |
| kimi-coding/k2p5 | 365s | 7,636 | $0.14 |
| alibaba/qwen3-coder-next | 894s | 54,900 | $6.47 |
| minimax/MiniMax-M2.5 | 1,404s | 14,027 | $0.45 |
| mistral/devstral-2512 | — | 50,158 | $3.46 |
| openai-codex/gpt-5.3-codex | — | 3,599 | $0.22 |

All 8 surviving models passed on attempt 1 — nobody needed the llms.txt reference this time. In run 1, three models only cleared Day 1 Part 1 by reading llm-full.txt on their third attempt. The extra two minutes made the difference.

mistral consumed most of its 200k context window without producing an answer — a different failure mode from run 1 (where it overflowed on every attempt), same outcome.

gpt-5.3-codex compiled a project, hit errors, then entered a loop of echoing each nudge back and responding "DONE" without doing any work.


Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 39s | 2,329 | $0.11 | ✓ |
| kimi-coding/k2p5 | 92s | 682 | $0.02 | ✓ |
| alibaba/qwen3.5-plus | 108s | 8,560 | $0.16 | ✓ |
| alibaba/qwen3-coder-next | 247s | 14,616 | $1.09 | ✓ |
| claude-sonnet-4-6 | 16s | 2,118 | $0.08 | ✗ wrong answer → EJECTED |
| zai/glm-5 | 22s | 1,323 | $0.05 | ✗ wrong answer → EJECTED |
| claude-haiku-4-5 | 46s | 5,253 | $0.15 | ✗ wrong answer → EJECTED |
| minimax/MiniMax-M2.5 | 446s | 17,084 | $0.56 | ✗ wrong answer → EJECTED |

Day 1 Part 2 ejected four models in this run. k2p5 flipped from a wrong answer in run 1 to correct in run 2; the extra time appears to have made the difference. haiku, which was run 1's sole winner, went out here.


Day 2 Part 1 — Summing repeated-digit IDs in ranges

The puzzle answer is an 11-digit number — the first real overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 142s | 6,918 | $0.47 | ✓ |
| kimi-coding/k2p5 | 146s | 5,413 | $0.03 | ✓ |
| alibaba/qwen3.5-plus | 910s | 24,027 | $2.29 | ✓ |
| alibaba/qwen3-coder-next | >2,500s | 70,017 | $3.80 | EJECTED (cost/time) |

qwen3.5-plus passed where it timed out in run 1 — the overflow warning appears to have steered it toward float arithmetic. Slow, but correct.

qwen3-coder-next spent over 40 minutes in a compile-debug loop and was ejected on cost grounds. Its total across Days 1–2: $11.36.


Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Another 11-digit answer.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 33s | 2,126 | $0.13 |
| alibaba/qwen3.5-plus | 142s | 8,758 | $0.60 |
| kimi-coding/k2p5 | 854s | 1,397 | $0.02 |

qwen3.5-plus detected an overflowed intermediate result mid-run, switched to float arithmetic, and corrected itself — all within the same attempt.
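That switch is sound: JS numbers (ReScript's float) represent every integer exactly up to 2^53 − 1, so an 11-digit result is nowhere near the limit. A quick sanity check in plain JavaScript, which is what the compiled float arithmetic runs as:

```javascript
// 64-bit floats are exact for integers up to 2^53 - 1 (Number.MAX_SAFE_INTEGER).
console.log(Number.MAX_SAFE_INTEGER);            // 9007199254740991 — 16 digits
// An 11-digit value is comfortably inside the exact range:
console.log(Number.isSafeInteger(99999999999));  // true
console.log(99999999999 + 1);                    // 100000000000 — no rounding
```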


Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 80s | 2,966 | $0.29 |
| kimi-coding/k2p5 | 85s | 3,188 | $0.04 |
| alibaba/qwen3.5-plus | 847s | 24,097 | $1.32 |

Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 17s | 1,315 | $0.09 |
| kimi-coding/k2p5 | 23s | 1,351 | $0.02 |
| alibaba/qwen3.5-plus | 155s | 3,478 | $0.26 |

The answer is a 15-digit number. All three passed.


Day 4 Part 1 — Grid neighbour counting (accessible paper rolls)

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 59s | 2,770 | $0.26 |
| kimi-coding/k2p5 | 64s | 3,544 | $0.03 |
| alibaba/qwen3.5-plus | 280s | 9,930 | $0.37 |

Day 4 Part 2 — Iterative grid removal simulation

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 19s | 1,399 | $0.10 |
| alibaba/qwen3.5-plus | 65s | 2,878 | $0.20 |
| kimi-coding/k2p5 | 79s | 4,193 | $0.06 |

Day 5 Part 1 — Range membership checking

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 52s | 2,426 | $0.24 |
| kimi-coding/k2p5 | 236s | 4,345 | $0.05 |
| alibaba/qwen3.5-plus | 400s | 18,712 | $1.29 |

Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

The answer is a 15-digit number — the final overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 16s | 1,419 | $0.09 | COMPLETE |
| kimi-coding/k2p5 | 71s | 1,798 | $0.03 | COMPLETE |
| alibaba/qwen3.5-plus | 215s | 11,300 | $0.96 | ✗ overflow → EJECTED |

opus answered in 16 seconds. In run 1, it overflowed the same puzzle. The overflow warning was the difference.

qwen3.5-plus fell to the pattern it had avoided on Days 2 and 3 — despite the warning and its earlier self-correction, it overflowed here and was ejected.


Full summary (Run 2) — all 10 models

Wall-clock seconds. DNF = ejected at that part.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4-6 | 52 | 39 | 142 | 33 | 80 | 17 | 59 | 19 | 52 | 16 | 509s |
| kimi-coding/k2p5 | 365 | 92 | 146 | 854 | 85 | 23 | 64 | 79 | 236 | 71 | 2,015s |
| alibaba/qwen3.5-plus | 337 | 108 | 910 | 142 | 847 | 155 | 280 | 65 | 400 | DNF | — |
| alibaba/qwen3-coder-next | 894 | 247 | DNF | | | | | | | | — |
| minimax/MiniMax-M2.5 | 1,404 | DNF | | | | | | | | | — |
| claude-haiku-4-5 | 183 | DNF | | | | | | | | | — |
| claude-sonnet-4-6 | 109 | DNF | | | | | | | | | — |
| zai/glm-5 | 150 | DNF | | | | | | | | | — |
| mistral/devstral-2512 | DNF | | | | | | | | | | — |
| openai-codex/gpt-5.3-codex | DNF | | | | | | | | | | — |

Token and cost breakdown for the completers and qwen3.5-plus. Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total tokens | Total cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4-6 | 2,613 | 2,329 | 6,918 | 2,126 | 2,966 | 1,315 | 2,770 | 1,399 | 2,426 | 1,419 | 26,281 | $2.01 |
| kimi-coding/k2p5 | 7,636 | 682 | 5,413 | 1,397 | 3,188 | 1,351 | 3,544 | 4,193 | 4,345 | 1,798 | 33,547 | $0.46 |
| alibaba/qwen3.5-plus † | 18,512 | 8,560 | 24,027 | 8,758 | 24,097 | 3,478 | 9,930 | 2,878 | 18,712 | 11,300 | 130,252 | $8.49 |

† ejected on D5P2; D5P2 cost includes the final wrong attempt.


Run 1 vs Run 2

| | Run 1 | Run 2 |
|---|---|---|
| Timeout per attempt | 3 min | 5 min |
| Overflow warning in prompt | ✗ | ✓ |
| Full completers | 1 (claude-haiku-4-5) | 2 (claude-opus-4-6, kimi-coding/k2p5) |
| claude-opus-4-6 on D5P2 | ✗ overflow | ✓ |
| kimi-coding/k2p5 overall | ejected D1P2 | all 10 parts ✓ |
| Session cost data | partial (Day 5 only for finalists) | complete |

The overflow warning made a real difference for opus. The longer timeout saved k2p5. But six models out of ten still couldn't produce correct answers even with those helps. The bottleneck wasn't overflow — it was the language itself.


Run 3 — ReScript system prompt

Run 2 left an obvious question: if the problem is unfamiliarity with ReScript, what happens when you teach the model the language?

For run 3 I wrote a concise ReScript system prompt — roughly 150 lines covering project setup, syntax essentials, the type system (including the fact that int is 32-bit and bigint exists), common stdlib functions, and file I/O patterns.

How it was injected

The agent harness (pi) supports --append-system-prompt, which adds content to the default system prompt without replacing it. Each model was launched with:

```shell
pi --model <model> --thinking off \
   --append-system-prompt rescript-system-prompt.md \
   --session-dir . \
   '<puzzle prompt>'
```

This means every model saw pi's standard coding-agent instructions plus the ReScript reference below, before receiving the puzzle. No other hints — no overflow warning, no llms.txt documentation files.

The system prompt

Click to expand the full system prompt (rescript-system-prompt.md)
## ReScript v12

ReScript is a typed functional language that compiles to JavaScript.
Think of it as **functional TypeScript** — if you can solve a problem in TypeScript,
you can solve it in ReScript using the same logic, just with different syntax.

### Project setup

**package.json**:
```json
{ "name": "my-project", "dependencies": { "rescript": "^12.1.0" } }
```

**rescript.json** (NOT bsconfig.json):
```json
{
  "name": "my-project",
  "sources": [{ "dir": "src", "subdirs": true }],
  "package-specs": [{ "module": "esmodule", "in-source": true }],
  "suffix": ".res.js"
}
```

Then: `npm install && npx rescript && node src/MyModule.res.js`

### Syntax essentials

```rescript
// Pipe operator (not |>)
[1, 2, 3]->Array.map(x => x * 2)->Array.filter(x => x > 2)

// Let bindings, no semicolons needed
let x = 42
let greet = (name) => "Hello " ++ name  // ++ for string concat

// Pattern matching
switch myValue {
| Some(x) => Console.log(x)
| None => Console.log("nothing")
}

// String operations
let lines = text->String.split("\n")
let trimmed = line->String.trim
let parts = line->String.split(",")
```

### Type system

```rescript
// Primitive types
let n: int = 42              // 32-bit signed integer
let f: float = 3.14          // 64-bit IEEE 754
let b: bigint = 99999999999n // arbitrary precision
let s: string = "hello"      // UTF-8 string
let c: char = 'a'            // single byte, no Unicode — prefer string
let ok: bool = true

// Float arithmetic uses distinct operators
let sum = 1.0 +. 2.5         // +.  -.  *.  /.
let converted = Int.toFloat(n)

// Modulo is a function call, not an infix operator
mod(7, 3)             // int modulo — NOT 7 mod 3, NOT 7 % 3
Float.mod(7.0, 3.0)  // float modulo

// Mutable values use ref
let counter = ref(0)
counter := counter.contents + 1  // := to set, .contents to read

// Records
type point = { x: float, y: float }
let p = { x: 1.0, y: 2.0 }

// Variants
type shape = Circle(float) | Rect(float, float)

// Option and Result
let found: option<int> = Some(42)
let parsed: result<int, string> = Ok(42)

// Arrays — main ordered data structure (like JS arrays)
let a = ["hello", "world"]
let first = a[0]              // Some("hello") — access returns option!
a[0] = "hey"                  // mutation
let b = [1, 2, ...a]          // spread

// List — immutable singly linked list
let l = list{1, 2, 3}
let l2 = list{0, ...l}        // prepend
```

### File I/O (Node.js bindings)

```rescript
@module("fs") external readFileSync: (string, string) => string = "readFileSync"

let content = readFileSync("input.txt", "utf8")
```

### Common stdlib

```rescript
// String
String.length: string => int
String.get: (string, int) => option<string>    // None if out of bounds
String.charAt: (string, int) => string        // "" if out of bounds
String.slice: (string, ~start: int, ~end: int=?) => string
String.split: (string, string) => array<string>
String.trim: string => string
String.includes: (string, string) => bool
String.startsWith: (string, string) => bool
String.replaceAll: (string, string, string) => string
String.make: 'a => string               // convert anything to string

// Array
Array.map: (array<'a>, 'a => 'b) => array<'b>
Array.filter: (array<'a>, 'a => bool) => array<'a>
Array.reduce: (array<'a>, 'b, ('b, 'a) => 'b) => 'b
Array.forEach: (array<'a>, 'a => unit) => unit
Array.length: array<'a> => int
Array.get: (array<'a>, int) => option<'a>

// Option — use getOrThrow, NOT getExn (deprecated)
Option.getOrThrow: (option<'a>, ~message: string=?) => 'a  // throws if None
Option.getOr: (option<'a>, 'a) => 'a           // default if None
Option.map: (option<'a>, 'a => 'b) => option<'b>
Option.flatMap: (option<'a>, 'a => option<'b>) => option<'b>
Option.isSome: option<'a> => bool
Option.isNone: option<'a> => bool
Option.forEach: (option<'a>, 'a => unit) => unit

// Result — use getOrThrow, NOT getExn (deprecated)
Result.getOrThrow: (result<'a, 'b>, ~message: string=?) => 'a  // throws if Error
Result.getOr: (result<'a, 'b>, 'a) => 'a
Result.map: (result<'a, 'c>, 'a => 'b) => result<'b, 'c>
Result.isOk: result<'a, 'b> => bool
Result.isError: result<'a, 'b> => bool

// Conversions
Int.fromString: (string, ~radix: int=?) => option<int>
Int.toString: (int, ~radix: int=?) => string
Int.toFloat: int => float
Float.fromString: string => option<float>
Float.toString: (float, ~radix: int=?) => string

// Output
Console.log: 'a => unit
Console.log2: ('a, 'b) => unit
```

The rules (Run 3)

Same retry logic as run 2 (up to 3 attempts on wrong answers), but:

  • No overflow warning in the puzzle prompt — the system prompt documents int as 32-bit and bigint as arbitrary precision; models have to connect those dots themselves.
  • No llms.txt escalation — models either know enough from the system prompt or they don't.
  • 15-minute timeout per part — one window for the whole part (the polling window), not a per-attempt limit.

Ejections (Run 3)

| Model | Ejected at | Reason |
|---|---|---|
| mistral/devstral-2512 | D1P2 | Wrong answer on all 3 attempts |
| openai-codex/gpt-5.3-codex | D2P1 | Brain-dead — echoed prompts as "DONE" without working |
| alibaba/qwen3-coder-next | D5P1 | Brain-dead — froze mid-sentence after dumping the input file |

Seven models completed all 10 parts.


Results (Run 3)

Day 1 Part 1 — Dial rotation counting

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| alibaba/qwen3.5-plus | 49s | 1,350 | $0.02 |
| anthropic/claude-haiku-4-5 | 59s | 2,256 | $0.03 |
| mistral/devstral-2512 | 59s | 1,488 | $0.03 |
| openai-codex/gpt-5.3-codex | 62s | 899 | $0.04 |
| anthropic/claude-sonnet-4-6 | 68s | 1,915 | $0.08 |
| anthropic/claude-opus-4-6 | 69s | 1,722 | $0.13 |
| kimi-coding/k2p5 | 73s | 1,118 | $0.03 |
| minimax/MiniMax-M2.5 | 120s | 2,728 | $0.06 |
| alibaba/qwen3-coder-next | 122s | 3,905 | $0.22 |
| zai/glm-5 | 133s | 1,515 | $0.04 |

10/10 correct on first attempt. In run 2, two models were ejected here (devstral for filling its context window, codex for going brain-dead). The system prompt got them through.


Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 26s | 752 | $0.02 | ✓ |
| anthropic/claude-sonnet-4-6 | 39s | 1,559 | $0.05 | ✓ |
| anthropic/claude-opus-4-6 | 46s | 1,809 | $0.09 | ✓ |
| zai/glm-5 | 65s | 1,244 | $0.03 | ✓ |
| alibaba/qwen3.5-plus | 90s | 9,100 | $0.04 | ✓ |
| anthropic/claude-haiku-4-5 | 327s | 11,607 | $0.13 | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 330s | 9,936 | $0.29 | ✓ (2nd try) |
| kimi-coding/k2p5 | 335s | 2,078 | $0.05 | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 615s | 19,746 | $0.22 | ✓ (2nd try) |
| mistral/devstral-2512 | — | 24,404 | $0.58 | ✗ wrong 3/3 → EJECTED |

9/10 survived. In run 2, this puzzle ejected four models (sonnet, haiku, glm-5, minimax). With the system prompt, all four completed it. Only devstral couldn't solve it — it gave three different wrong answers across three attempts, each algorithmically different but each wrong.


Day 2 Part 1 — Summing repeated-digit IDs in ranges

The puzzle answer is an 11-digit number — the first real overflow test.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 50s | 4,585 | $0.05 |
| kimi-coding/k2p5 | 65s | 2,201 | $0.01 |
| anthropic/claude-opus-4-6 | 70s | 2,680 | $0.16 |
| anthropic/claude-sonnet-4-6 | 71s | 3,349 | $0.12 |
| minimax/MiniMax-M2.5 | 210s | 5,421 | $0.06 |
| alibaba/qwen3-coder-next | 226s | 9,925 | $0.38 |
| zai/glm-5 | 580s | 11,880 | $0.17 |
| alibaba/qwen3.5-plus | 840s | 29,837 | $1.06 |
| openai-codex/gpt-5.3-codex | — | 1,987 | $0.11 |

8/8 surviving models correct on first try. codex compiled a solution, but it timed out even on the example input. After two dirty-stop retries it went brain-dead — echoing nudges and saying "DONE" without doing any work.


Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Another 11-digit answer.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 18s | 1,104 | $0.02 |
| anthropic/claude-sonnet-4-6 | 54s | 2,935 | $0.08 |
| anthropic/claude-opus-4-6 | 54s | 2,713 | $0.12 |
| kimi-coding/k2p5 | 68s | 1,109 | $0.01 |
| minimax/MiniMax-M2.5 | 75s | 2,912 | $0.04 |
| zai/glm-5 | 338s | 6,442 | $0.14 |
| alibaba/qwen3-coder-next | 365s | 13,194 | $0.51 |
| alibaba/qwen3.5-plus | 394s | 15,365 | $1.20 |

All 8 correct. haiku was fastest at 18 seconds.


Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 39s | 1,370 | $0.06 |
| alibaba/qwen3.5-plus | 50s | 2,648 | $0.03 |
| anthropic/claude-opus-4-6 | 57s | 1,697 | $0.12 |
| anthropic/claude-haiku-4-5 | 62s | 5,251 | $0.06 |
| minimax/MiniMax-M2.5 | 127s | 3,704 | $0.05 |
| alibaba/qwen3-coder-next | 158s | 5,835 | $0.40 |
| zai/glm-5 | 291s | 5,600 | $0.11 |
| kimi-coding/k2p5 | 299s | 1,865 | $0.03 |

Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

The answer is a 15-digit number.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| kimi-coding/k2p5 | 31s | 937 | $0.01 |
| anthropic/claude-opus-4-6 | 34s | 1,252 | $0.07 |
| anthropic/claude-sonnet-4-6 | 44s | 2,013 | $0.08 |
| zai/glm-5 | 106s | 1,978 | $0.06 |
| minimax/MiniMax-M2.5 | 110s | 3,173 | $0.09 |
| anthropic/claude-haiku-4-5 | 144s | 14,461 | $0.17 |
| alibaba/qwen3-coder-next | 248s | 4,409 | $0.47 |
| alibaba/qwen3.5-plus | 265s | 10,304 | $0.23 |

All 8 correct. All handled the 15-digit answer — no overflow.


Day 4 Part 1 — Grid neighbour counting (accessible paper rolls)

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 33s | 1,307 | $0.05 |
| kimi-coding/k2p5 | 33s | 974 | $0.01 |
| anthropic/claude-haiku-4-5 | 44s | 3,673 | $0.05 |
| anthropic/claude-opus-4-6 | 44s | 1,409 | $0.11 |
| minimax/MiniMax-M2.5 | 89s | 2,759 | $0.03 |
| alibaba/qwen3.5-plus | 147s | 6,383 | $0.11 |
| zai/glm-5 | 162s | 2,968 | $0.04 |
| alibaba/qwen3-coder-next | 769s | 19,731 | $0.73 |

Day 4 Part 2 — Iterative grid removal simulation

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-opus-4-6 | 22s | 1,320 | $0.08 |
| anthropic/claude-sonnet-4-6 | 24s | 1,243 | $0.04 |
| anthropic/claude-haiku-4-5 | 25s | 2,423 | $0.03 |
| kimi-coding/k2p5 | 28s | 1,340 | $0.02 |
| alibaba/qwen3.5-plus | 65s | 3,829 | $0.12 |
| alibaba/qwen3-coder-next | 213s | 4,696 | $0.29 |
| zai/glm-5 | 221s | 3,976 | $0.09 |
| minimax/MiniMax-M2.5 | 237s | 7,399 | $0.12 |

Day 5 Part 1 — Range membership checking

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 67s | 2,632 | $0.10 |
| kimi-coding/k2p5 | 74s | 3,287 | $0.03 |
| anthropic/claude-haiku-4-5 | 82s | 7,079 | $0.09 |
| anthropic/claude-opus-4-6 | 84s | 3,399 | $0.22 |
| alibaba/qwen3.5-plus | 172s | 8,297 | $0.09 |
| minimax/MiniMax-M2.5 | 293s | 8,600 | $0.27 |
| zai/glm-5 | 361s | 6,021 | $0.23 |
| alibaba/qwen3-coder-next | — | 160 | $0.01 |

qwen3-coder-next read the real input file, dumped it into the context, started writing "I understand the problem now. The task is to: 1. Parse Ingredi—" and froze mid-sentence. No recovery.


Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

The answer is a 15-digit number — the final overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 25s | 1,360 | $0.06 | COMPLETE |
| anthropic/claude-opus-4-6 | 31s | 1,269 | $0.10 | COMPLETE |
| kimi-coding/k2p5 | 73s | 4,663 | $0.04 | COMPLETE |
| anthropic/claude-haiku-4-5 | 76s | 7,877 | $0.11 | COMPLETE |
| zai/glm-5 | 171s | 3,382 | $0.15 | COMPLETE |
| minimax/MiniMax-M2.5 | 211s | 6,854 | $0.34 | COMPLETE |
| alibaba/qwen3.5-plus | 551s | 15,764 | $0.60 | COMPLETE |

Seven out of seven. In run 2, qwen3.5-plus overflowed this exact puzzle. With the system prompt documenting that int is 32-bit and bigint exists, it used bigint from the start and answered correctly. No explicit warning needed.


Full summary (Run 3) — all 10 models

Wall-clock seconds. DNF = ejected at that part.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-sonnet-4-6 | 68 | 39 | 71 | 54 | 39 | 44 | 33 | 24 | 67 | 25 | 464s |
| claude-opus-4-6 | 69 | 46 | 70 | 54 | 57 | 34 | 44 | 22 | 84 | 31 | 511s |
| claude-haiku-4-5 | 59 | 327 | 50 | 18 | 62 | 144 | 44 | 25 | 82 | 76 | 887s |
| kimi-coding/k2p5 | 73 | 335 | 65 | 68 | 299 | 31 | 33 | 28 | 74 | 73 | 1,079s |
| minimax/MiniMax-M2.5 | 120 | 615 | 210 | 75 | 127 | 110 | 89 | 237 | 293 | 211 | 2,087s |
| zai/glm-5 | 133 | 65 | 580 | 338 | 291 | 106 | 162 | 221 | 361 | 171 | 2,428s |
| alibaba/qwen3.5-plus | 49 | 90 | 840 | 394 | 50 | 265 | 147 | 65 | 172 | 551 | 2,623s |
| alibaba/qwen3-coder-next | 122 | 330 | 226 | 365 | 158 | 248 | 769 | 213 | DNF | | — |
| mistral/devstral-2512 | 59 | DNF | | | | | | | | | — |
| openai-codex/gpt-5.3-codex | 62 | 26 | DNF | | | | | | | | — |

Token and cost breakdown for all models. Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | Total tokens | Total cost |
|---|---|---|
| alibaba/qwen3.5-plus | 102,877 | $3.50 |
| minimax/MiniMax-M2.5 | 63,296 | $1.28 |
| claude-opus-4-6 | 19,270 | $1.20 |
| zai/glm-5 | 45,006 | $1.06 |
| claude-haiku-4-5 | 60,316 | $0.74 |
| claude-sonnet-4-6 | 19,683 | $0.72 |
| kimi-coding/k2p5 | 19,572 | $0.24 |
| alibaba/qwen3-coder-next † | 71,791 | $3.30 |
| mistral/devstral-2512 † | 25,892 | $0.61 |
| openai-codex/gpt-5.3-codex † | 3,638 | $0.17 |

† ejected before completing all parts.


Across all three runs

| | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Intervention | none | overflow warning | system prompt |
| Timeout | 3 min | 5 min | 15 min |
| Full completers | 1/10 | 2/10 | 7/10 |
| D1P1 first-try pass rate | 7/10 | 8/10 | 10/10 |
| D1P2 survivors | 3/10 | 4/10 | 9/10 |
| claude-opus-4-6 | ejected D5P2 | ✓ all 10 | ✓ all 10 |
| claude-sonnet-4-6 | ejected D1P2 | ejected D1P2 | ✓ fastest (464s) |
| claude-haiku-4-5 | ✓ sole completer | ejected D1P2 | ✓ all 10 |
| kimi-coding/k2p5 | ejected D1P2 | ✓ all 10 | ✓ all 10 |
| zai/glm-5 | ejected D1P2 | ejected D1P2 | ✓ all 10 |
| minimax/MiniMax-M2.5 | ejected D1P2 | ejected D1P2 | ✓ all 10 |
| alibaba/qwen3.5-plus | ejected D2P1 | ejected D5P2 | ✓ all 10 |
| alibaba/qwen3-coder-next | ejected D1P2 | ejected D2P1 | ejected D5P1 |
| mistral/devstral-2512 | ejected D1P1 | ejected D1P1 | ejected D1P2 |
| openai-codex/gpt-5.3-codex | ejected D1P1 | ejected D1P1 | ejected D2P1 |

Observations

Teaching the language was the biggest lever. The system prompt — 150 lines of syntax, types, and stdlib — took the completion rate from 2/10 to 7/10. The overflow warning in run 2 helped exactly one model (opus). The system prompt helped five more (sonnet, haiku, glm-5, minimax, qwen3.5-plus).

The overflow problem was a documentation problem. In run 2, qwen3.5-plus overflowed on Day 5 Part 2 despite an explicit overflow warning. In run 3, it used bigint from the start and answered correctly — because the system prompt documented int as 32-bit and bigint as arbitrary precision. Models don't need warnings; they need language specs.

claude-sonnet-4-6 went from run 2's Day 1 Part 2 casualty to run 3's fastest completer (464s). In run 2 it gave a wrong answer on the boundary-condition puzzle and was ejected. In run 3 it completed all 10 parts and beat opus on total time. The system prompt didn't just help with overflow — it helped models write correct algorithms in an unfamiliar language.

kimi-coding/k2p5 remains the value champion: $0.24 for all 10 parts. Down from $0.46 in run 2. The system prompt cut its token usage almost in half — fewer compile errors, fewer false starts.

Three models remain unreachable. devstral got further (past D1P1) but still couldn't solve D1P2. gpt-5.3-codex got further (past D1) but went brain-dead on D2P1. qwen3-coder-next got the furthest (8 parts) but froze mid-thought on D5P1. The system prompt can't fix models that go catatonic.

The llms.txt files were never needed. In run 2, no model proactively consulted them. In run 3, they weren't even offered. A concise system prompt covering the essentials was more effective than 14,000 lines of API reference sitting in the working directory.


Benchmarked on 2026-02-26 (run 2) and 2026-02-27 (run 3) using pi as the agent harness.


This post was written with AI assistance.