Developers, developers, developers!

Blog about programming, programming and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (ReScript)

Tags = [ ReScript, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, and the Python benchmark, I ran AoC 2025 Days 1–5 in ReScript — a typed functional language that compiles to JavaScript with a lean standard library, a distinct syntax, and very limited LLM training data.

This post covers three runs of the same benchmark, each adding a different intervention to see what helps models cope with an unfamiliar language:

  1. Run 1 — no help at all. 3-minute timeout. 1 completer out of 10.
  2. Run 2 — overflow warning + longer timeout. 2 completers.
  3. Run 3 — a ReScript system prompt teaching syntax, stdlib, and types. 7 completers.

The contestants

Same 10 models across all three runs:

| # | Model |
|---|-------|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |

Run 2 — overflow warning + longer timeout

The first run was rough: 7 of 10 models ejected, claude-haiku-4-5 the sole completer, integer overflow the root cause of most failures. Two things changed for run 2:

  1. Overflow warning baked into every prompt. ReScript's int type is 32 bits on the JavaScript runtime. Several puzzles produce answers in the tens of billions or hundreds of trillions — well beyond that ceiling. No warning was given in run 1. In run 2 every prompt includes an explicit note that large answers require float or BigInt.

  2. Timeout increased from 3 to 5 minutes per attempt. ReScript project setup (initialising npm, configuring the build system, first compile) is non-trivial for models unfamiliar with the ecosystem. Three minutes left very little room for the actual solving.
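The overflow described in point 1 is easy to demonstrate in plain JavaScript, since that is what ReScript compiles to: int multiplication is emitted as Math.imul, which wraps at 32 bits, while float and bigint arithmetic do not. A minimal sketch, not taken from any model's solution:

```javascript
// ReScript compiles int multiplication to Math.imul — a 32-bit wrapping multiply.
const a = 100000;
console.log(Math.imul(a, a));        // 1410065408 — 10^10 wrapped into 32 bits
// The same product as a plain JS number (ReScript's float) is exact:
console.log(a * a);                  // 10000000000
// BigInt (ReScript's bigint) is arbitrary precision:
console.log(BigInt(a) * BigInt(a));  // 10000000000n
```

So a model that writes `a * b` on ReScript ints silently gets the wrapped value; the fix is to parse and compute in float or bigint from the start.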

The rules (Run 2)

Part 1 runs up to 3 attempts, each with a 5-minute window:

  • Attempt 1: plain prompt + overflow warning
  • Attempt 2 (on failure): "YOU MUST read llm-small.txt before starting"
  • Attempt 3 (still failing): "YOU MUST read llm-full.txt before starting"
  • Fail all three → ejected

The working directory contains two versions of the official ReScript llms.txt documentation: a 5,578-line condensed version and a 14,405-line full API reference. Models are not told about them on the first attempt.

Part 2 gets exactly one attempt. Any failure — wrong answer, timeout, or API error — means immediate ejection. No llms.txt hints.

API errors (quota limits, network failures) count as free retries and are not charged against the attempt limit.


Ejections (Run 2)

| Model | Ejected at | Reason |
|---|---|---|
| mistral/devstral-2512 | D1P1 | Filled ~75% of its 200k context window with no answer |
| openai-codex/gpt-5.3-codex | D1P1 | Entered a loop of echoing prompts back as "DONE" without doing any work |
| anthropic/claude-haiku-4-5 | D1P2 | Wrong answer |
| anthropic/claude-sonnet-4-6 | D1P2 | Wrong answer |
| zai/glm-5 | D1P2 | Wrong answer |
| minimax/MiniMax-M2.5 | D1P2 | Wrong answer |
| alibaba/qwen3-coder-next | D2P1 | 40+ minutes in a compile loop, 45% of context consumed — ejected on excessive cost |
| alibaba/qwen3.5-plus | D5P2 | 32-bit integer overflow on the final puzzle |

Two models completed all 10 parts: claude-opus-4-6 and kimi-coding/k2p5.


Did the overflow warning make a difference?

Yes — but with an asterisk.

In run 1, both claude-opus-4-6 and claude-sonnet-4-6 overflowed on Day 5 Part 2 and were ejected. In run 2, opus answered correctly and completed the benchmark. The warning worked exactly as intended for it.

alibaba/qwen3.5-plus tells a more complicated story. In run 1 it timed out on Day 2 Part 1 and was ejected. In run 2 it sailed through Days 2, 3, and 4 — even detecting and self-correcting an overflow mid-run on Day 2 Part 2. Then on Day 5 Part 2 it overflowed anyway and was ejected. The warning helped it reach six extra puzzle parts; it wasn't enough to protect it at the end.

The four models ejected on Day 1 Part 2 — haiku, sonnet, glm-5, minimax — gave algorithmically wrong answers rather than overflow answers. The overflow warning had nothing to do with their ejection. (In run 1, different models fell at Day 1 Part 2 — it's a tricky boundary-condition puzzle that catches different models in different sessions.)


Results (Run 2)

Day 1 Part 1 — Dial rotation counting

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 52s | 2,613 | $0.23 |
| claude-sonnet-4-6 | 109s | 6,335 | $0.29 |
| zai/glm-5 | 150s | 3,495 | $0.08 |
| claude-haiku-4-5 | 183s | 16,079 | $0.32 |
| alibaba/qwen3.5-plus | 337s | 18,512 | $1.06 |
| kimi-coding/k2p5 | 365s | 7,636 | $0.14 |
| alibaba/qwen3-coder-next | 894s | 54,900 | $6.47 |
| minimax/MiniMax-M2.5 | 1,404s | 14,027 | $0.45 |
| mistral/devstral-2512 | — | 50,158 | $3.46 |
| openai-codex/gpt-5.3-codex | — | 3,599 | $0.22 |

All 8 surviving models passed on attempt 1 — nobody needed the llms.txt reference this time. In run 1, three models only cleared Day 1 Part 1 by reading llm-full.txt on their third attempt. The extra two minutes made the difference.

mistral consumed most of its 200k context window without producing an answer — a different failure mode from run 1 (where it overflowed on every attempt), same outcome.

gpt-5.3-codex compiled a project, hit errors, then entered a loop of echoing each nudge back and responding "DONE" without doing any work.


Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 39s | 2,329 | $0.11 | ✓ |
| kimi-coding/k2p5 | 92s | 682 | $0.02 | ✓ |
| alibaba/qwen3.5-plus | 108s | 8,560 | $0.16 | ✓ |
| alibaba/qwen3-coder-next | 247s | 14,616 | $1.09 | ✓ |
| claude-sonnet-4-6 | 16s | 2,118 | $0.08 | ✗ wrong answer → EJECTED |
| zai/glm-5 | 22s | 1,323 | $0.05 | ✗ wrong answer → EJECTED |
| claude-haiku-4-5 | 46s | 5,253 | $0.15 | ✗ wrong answer → EJECTED |
| minimax/MiniMax-M2.5 | 446s | 17,084 | $0.56 | ✗ wrong answer → EJECTED |

Day 1 Part 2 ejected four models in this run. k2p5 flipped from a wrong answer in run 1 to correct in run 2; the extra time appears to have made the difference. haiku, which was run 1's sole winner, went out here.


Day 2 Part 1 — Summing repeated-digit IDs in ranges

The puzzle answer is an 11-digit number — the first real overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 142s | 6,918 | $0.47 | ✓ |
| kimi-coding/k2p5 | 146s | 5,413 | $0.03 | ✓ |
| alibaba/qwen3.5-plus | 910s | 24,027 | $2.29 | ✓ |
| alibaba/qwen3-coder-next | >2,500s | 70,017 | $3.80 | EJECTED (cost/time) |

qwen3.5-plus passed where it timed out in run 1 — the overflow warning appears to have steered it toward float arithmetic. Slow, but correct.

qwen3-coder-next spent over 40 minutes in a compile-debug loop and was ejected on cost grounds. Its total across Days 1–2: $11.36.


Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Another 11-digit answer.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 33s | 2,126 | $0.13 |
| alibaba/qwen3.5-plus | 142s | 8,758 | $0.60 |
| kimi-coding/k2p5 | 854s | 1,397 | $0.02 |

qwen3.5-plus detected an overflowed intermediate result mid-run, switched to float arithmetic, and corrected itself — all within the same attempt.
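That switch is sound: JS numbers (ReScript's float) represent every integer exactly up to 2^53 − 1, so an 11-digit result is nowhere near the limit. A quick sanity check in plain JavaScript, which is what the compiled float arithmetic runs as:

```javascript
// 64-bit floats are exact for integers up to 2^53 - 1 (Number.MAX_SAFE_INTEGER).
console.log(Number.MAX_SAFE_INTEGER);            // 9007199254740991 — 16 digits
// An 11-digit value is comfortably inside the exact range:
console.log(Number.isSafeInteger(99999999999));  // true
console.log(99999999999 + 1);                    // 100000000000 — no rounding
```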


Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 80s | 2,966 | $0.29 |
| kimi-coding/k2p5 | 85s | 3,188 | $0.04 |
| alibaba/qwen3.5-plus | 847s | 24,097 | $1.32 |

Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 17s | 1,315 | $0.09 |
| kimi-coding/k2p5 | 23s | 1,351 | $0.02 |
| alibaba/qwen3.5-plus | 155s | 3,478 | $0.26 |

The answer is a 15-digit number. All three passed.


Day 4 Part 1 — Grid neighbour counting (accessible paper rolls)

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 59s | 2,770 | $0.26 |
| kimi-coding/k2p5 | 64s | 3,544 | $0.03 |
| alibaba/qwen3.5-plus | 280s | 9,930 | $0.37 |

Day 4 Part 2 — Iterative grid removal simulation

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 19s | 1,399 | $0.10 |
| alibaba/qwen3.5-plus | 65s | 2,878 | $0.20 |
| kimi-coding/k2p5 | 79s | 4,193 | $0.06 |

Day 5 Part 1 — Range membership checking

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| claude-opus-4-6 | 52s | 2,426 | $0.24 |
| kimi-coding/k2p5 | 236s | 4,345 | $0.05 |
| alibaba/qwen3.5-plus | 400s | 18,712 | $1.29 |

Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

The answer is a 15-digit number — the final overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| claude-opus-4-6 | 16s | 1,419 | $0.09 | COMPLETE |
| kimi-coding/k2p5 | 71s | 1,798 | $0.03 | COMPLETE |
| alibaba/qwen3.5-plus | 215s | 11,300 | $0.96 | ✗ overflow → EJECTED |

opus answered in 16 seconds. In run 1, it overflowed the same puzzle. The overflow warning was the difference.

qwen3.5-plus fell to the pattern it had avoided on Days 2 and 3 — despite the warning and its earlier self-correction, it overflowed here and was ejected.


Full summary (Run 2) — all 10 models

Wall-clock seconds. DNF = ejected at that part.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4-6 | 52 | 39 | 142 | 33 | 80 | 17 | 59 | 19 | 52 | 16 | 509s |
| kimi-coding/k2p5 | 365 | 92 | 146 | 854 | 85 | 23 | 64 | 79 | 236 | 71 | 2,015s |
| alibaba/qwen3.5-plus | 337 | 108 | 910 | 142 | 847 | 155 | 280 | 65 | 400 | DNF | — |
| alibaba/qwen3-coder-next | 894 | 247 | DNF | | | | | | | | — |
| minimax/MiniMax-M2.5 | 1,404 | DNF | | | | | | | | | — |
| claude-haiku-4-5 | 183 | DNF | | | | | | | | | — |
| claude-sonnet-4-6 | 109 | DNF | | | | | | | | | — |
| zai/glm-5 | 150 | DNF | | | | | | | | | — |
| mistral/devstral-2512 | DNF | | | | | | | | | | — |
| openai-codex/gpt-5.3-codex | DNF | | | | | | | | | | — |

Token and cost breakdown for the completers and qwen3.5-plus. Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total tokens | Total cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-opus-4-6 | 2,613 | 2,329 | 6,918 | 2,126 | 2,966 | 1,315 | 2,770 | 1,399 | 2,426 | 1,419 | 26,281 | $2.01 |
| kimi-coding/k2p5 | 7,636 | 682 | 5,413 | 1,397 | 3,188 | 1,351 | 3,544 | 4,193 | 4,345 | 1,798 | 33,547 | $0.46 |
| alibaba/qwen3.5-plus † | 18,512 | 8,560 | 24,027 | 8,758 | 24,097 | 3,478 | 9,930 | 2,878 | 18,712 | 11,300 | 130,252 | $8.49 |

† ejected on D5P2; D5P2 cost includes the final wrong attempt.


Run 1 vs Run 2

| | Run 1 | Run 2 |
|---|---|---|
| Timeout per attempt | 3 min | 5 min |
| Overflow warning in prompt | ✗ | ✓ |
| Full completers | 1 (claude-haiku-4-5) | 2 (claude-opus-4-6, kimi-coding/k2p5) |
| claude-opus-4-6 on D5P2 | ✗ overflow | ✓ |
| kimi-coding/k2p5 overall | ejected D1P2 | all 10 parts ✓ |
| Session cost data | partial (Day 5 only for finalists) | complete |

The overflow warning made a real difference for opus. The longer timeout saved k2p5. But six models out of ten still couldn't produce correct answers even with those helps. The bottleneck wasn't overflow — it was the language itself.


Run 3 — ReScript system prompt

Run 2 left an obvious question: if the problem is unfamiliarity with ReScript, what happens when you teach the model the language?

For run 3 I wrote a concise ReScript system prompt — roughly 150 lines covering project setup, syntax essentials, the type system (including the fact that int is 32-bit and bigint exists), common stdlib functions, and file I/O patterns.

How it was injected

The agent harness (pi) supports --append-system-prompt, which adds content to the default system prompt without replacing it. Each model was launched with:

```shell
pi --model <model> --thinking off \
   --append-system-prompt rescript-system-prompt.md \
   --session-dir . \
   '<puzzle prompt>'
```

This means every model saw pi's standard coding-agent instructions plus the ReScript reference below, before receiving the puzzle. No other hints — no overflow warning, no llms.txt documentation files.

The system prompt

Click to expand the full system prompt (rescript-system-prompt.md)
## ReScript v12

ReScript is a typed functional language that compiles to JavaScript.
Think of it as **functional TypeScript** — if you can solve a problem in TypeScript,
you can solve it in ReScript using the same logic, just with different syntax.

### Project setup

**package.json**:
```json
{ "name": "my-project", "dependencies": { "rescript": "^12.1.0" } }
```

**rescript.json** (NOT bsconfig.json):
```json
{
  "name": "my-project",
  "sources": [{ "dir": "src", "subdirs": true }],
  "package-specs": [{ "module": "esmodule", "in-source": true }],
  "suffix": ".res.js"
}
```

Then: `npm install && npx rescript && node src/MyModule.res.js`

### Syntax essentials

```rescript
// Pipe operator (not |>)
[1, 2, 3]->Array.map(x => x * 2)->Array.filter(x => x > 2)

// Let bindings, no semicolons needed
let x = 42
let greet = (name) => "Hello " ++ name  // ++ for string concat

// Pattern matching
switch myValue {
| Some(x) => Console.log(x)
| None => Console.log("nothing")
}

// String operations
let lines = text->String.split("\n")
let trimmed = line->String.trim
let parts = line->String.split(",")
```

### Type system

```rescript
// Primitive types
let n: int = 42              // 32-bit signed integer
let f: float = 3.14          // 64-bit IEEE 754
let b: bigint = 99999999999n // arbitrary precision
let s: string = "hello"      // UTF-8 string
let c: char = 'a'            // single byte, no Unicode — prefer string
let ok: bool = true

// Float arithmetic uses distinct operators
let sum = 1.0 +. 2.5         // +.  -.  *.  /.
let converted = Int.toFloat(n)

// Modulo is a function call, not an infix operator
mod(7, 3)             // int modulo — NOT 7 mod 3, NOT 7 % 3
Float.mod(7.0, 3.0)  // float modulo

// Mutable values use ref
let counter = ref(0)
counter := counter.contents + 1  // := to set, .contents to read

// Records
type point = { x: float, y: float }
let p = { x: 1.0, y: 2.0 }

// Variants
type shape = Circle(float) | Rect(float, float)

// Option and Result
let found: option<int> = Some(42)
let parsed: result<int, string> = Ok(42)

// Arrays — main ordered data structure (like JS arrays)
let a = ["hello", "world"]
let first = a[0]              // Some("hello") — access returns option!
a[0] = "hey"                  // mutation
let b = [1, 2, ...a]          // spread

// List — immutable singly linked list
let l = list{1, 2, 3}
let l2 = list{0, ...l}        // prepend
```

### File I/O (Node.js bindings)

```rescript
@module("fs") external readFileSync: (string, string) => string = "readFileSync"

let content = readFileSync("input.txt", "utf8")
```

### Common stdlib

```rescript
// String
String.length: string => int
String.get: (string, int) => option<string>    // None if out of bounds
String.charAt: (string, int) => string        // "" if out of bounds
String.slice: (string, ~start: int, ~end: int=?) => string
String.split: (string, string) => array<string>
String.trim: string => string
String.includes: (string, string) => bool
String.startsWith: (string, string) => bool
String.replaceAll: (string, string, string) => string
String.make: 'a => string               // convert anything to string

// Array
Array.map: (array<'a>, 'a => 'b) => array<'b>
Array.filter: (array<'a>, 'a => bool) => array<'a>
Array.reduce: (array<'a>, 'b, ('b, 'a) => 'b) => 'b
Array.forEach: (array<'a>, 'a => unit) => unit
Array.length: array<'a> => int
Array.get: (array<'a>, int) => option<'a>

// Option — use getOrThrow, NOT getExn (deprecated)
Option.getOrThrow: (option<'a>, ~message: string=?) => 'a  // throws if None
Option.getOr: (option<'a>, 'a) => 'a           // default if None
Option.map: (option<'a>, 'a => 'b) => option<'b>
Option.flatMap: (option<'a>, 'a => option<'b>) => option<'b>
Option.isSome: option<'a> => bool
Option.isNone: option<'a> => bool
Option.forEach: (option<'a>, 'a => unit) => unit

// Result — use getOrThrow, NOT getExn (deprecated)
Result.getOrThrow: (result<'a, 'b>, ~message: string=?) => 'a  // throws if Error
Result.getOr: (result<'a, 'b>, 'a) => 'a
Result.map: (result<'a, 'c>, 'a => 'b) => result<'b, 'c>
Result.isOk: result<'a, 'b> => bool
Result.isError: result<'a, 'b> => bool

// Conversions
Int.fromString: (string, ~radix: int=?) => option<int>
Int.toString: (int, ~radix: int=?) => string
Int.toFloat: int => float
Float.fromString: string => option<float>
Float.toString: (float, ~radix: int=?) => string

// Output
Console.log: 'a => unit
Console.log2: ('a, 'b) => unit
```

The rules (Run 3)

Same retry logic as run 2 (up to 3 attempts on wrong answers), but:

  • No overflow warning in the puzzle prompt — the system prompt documents int as 32-bit and bigint as arbitrary precision; models have to connect those dots themselves.
  • No llms.txt escalation — models either know enough from the system prompt or they don't.
  • 15-minute timeout per part — one window for the whole part (the polling window), not a per-attempt limit.

Ejections (Run 3)

| Model | Ejected at | Reason |
|---|---|---|
| mistral/devstral-2512 | D1P2 | Wrong answer on all 3 attempts |
| openai-codex/gpt-5.3-codex | D2P1 | Brain-dead — echoed prompts as "DONE" without working |
| alibaba/qwen3-coder-next | D5P1 | Brain-dead — froze mid-sentence after dumping the input file |

Seven models completed all 10 parts.


Results (Run 3)

Day 1 Part 1 — Dial rotation counting

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| alibaba/qwen3.5-plus | 49s | 1,350 | $0.02 |
| anthropic/claude-haiku-4-5 | 59s | 2,256 | $0.03 |
| mistral/devstral-2512 | 59s | 1,488 | $0.03 |
| openai-codex/gpt-5.3-codex | 62s | 899 | $0.04 |
| anthropic/claude-sonnet-4-6 | 68s | 1,915 | $0.08 |
| anthropic/claude-opus-4-6 | 69s | 1,722 | $0.13 |
| kimi-coding/k2p5 | 73s | 1,118 | $0.03 |
| minimax/MiniMax-M2.5 | 120s | 2,728 | $0.06 |
| alibaba/qwen3-coder-next | 122s | 3,905 | $0.22 |
| zai/glm-5 | 133s | 1,515 | $0.04 |

10/10 correct on first attempt. In run 2, two models were ejected here (devstral for filling its context window, codex for going brain-dead). The system prompt got them through.


Day 1 Part 2 — Counting zero-crossings during dial rotation

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 26s | 752 | $0.02 | ✓ |
| anthropic/claude-sonnet-4-6 | 39s | 1,559 | $0.05 | ✓ |
| anthropic/claude-opus-4-6 | 46s | 1,809 | $0.09 | ✓ |
| zai/glm-5 | 65s | 1,244 | $0.03 | ✓ |
| alibaba/qwen3.5-plus | 90s | 9,100 | $0.04 | ✓ |
| anthropic/claude-haiku-4-5 | 327s | 11,607 | $0.13 | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 330s | 9,936 | $0.29 | ✓ (2nd try) |
| kimi-coding/k2p5 | 335s | 2,078 | $0.05 | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 615s | 19,746 | $0.22 | ✓ (2nd try) |
| mistral/devstral-2512 | — | 24,404 | $0.58 | ✗ wrong 3/3 → EJECTED |

9/10 survived. In run 2, this puzzle ejected four models (sonnet, haiku, glm-5, minimax). With the system prompt, all four completed it. Only devstral couldn't solve it — it gave three different wrong answers across three attempts, each algorithmically different but each wrong.


Day 2 Part 1 — Summing repeated-digit IDs in ranges

The puzzle answer is an 11-digit number — the first real overflow test.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 50s | 4,585 | $0.05 |
| kimi-coding/k2p5 | 65s | 2,201 | $0.01 |
| anthropic/claude-opus-4-6 | 70s | 2,680 | $0.16 |
| anthropic/claude-sonnet-4-6 | 71s | 3,349 | $0.12 |
| minimax/MiniMax-M2.5 | 210s | 5,421 | $0.06 |
| alibaba/qwen3-coder-next | 226s | 9,925 | $0.38 |
| zai/glm-5 | 580s | 11,880 | $0.17 |
| alibaba/qwen3.5-plus | 840s | 29,837 | $1.06 |
| openai-codex/gpt-5.3-codex | — | 1,987 | $0.11 |

8/8 surviving models correct on first try. codex compiled a solution, but it timed out even on the example input. After two dirty-stop retries it went brain-dead — echoing nudges and saying "DONE" without doing any work.


Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Another 11-digit answer.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-haiku-4-5 | 18s | 1,104 | $0.02 |
| anthropic/claude-sonnet-4-6 | 54s | 2,935 | $0.08 |
| anthropic/claude-opus-4-6 | 54s | 2,713 | $0.12 |
| kimi-coding/k2p5 | 68s | 1,109 | $0.01 |
| minimax/MiniMax-M2.5 | 75s | 2,912 | $0.04 |
| zai/glm-5 | 338s | 6,442 | $0.14 |
| alibaba/qwen3-coder-next | 365s | 13,194 | $0.51 |
| alibaba/qwen3.5-plus | 394s | 15,365 | $1.20 |

All 8 correct. haiku was fastest at 18 seconds.


Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 39s | 1,370 | $0.06 |
| alibaba/qwen3.5-plus | 50s | 2,648 | $0.03 |
| anthropic/claude-opus-4-6 | 57s | 1,697 | $0.12 |
| anthropic/claude-haiku-4-5 | 62s | 5,251 | $0.06 |
| minimax/MiniMax-M2.5 | 127s | 3,704 | $0.05 |
| alibaba/qwen3-coder-next | 158s | 5,835 | $0.40 |
| zai/glm-5 | 291s | 5,600 | $0.11 |
| kimi-coding/k2p5 | 299s | 1,865 | $0.03 |

Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

The answer is a 15-digit number.

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| kimi-coding/k2p5 | 31s | 937 | $0.01 |
| anthropic/claude-opus-4-6 | 34s | 1,252 | $0.07 |
| anthropic/claude-sonnet-4-6 | 44s | 2,013 | $0.08 |
| zai/glm-5 | 106s | 1,978 | $0.06 |
| minimax/MiniMax-M2.5 | 110s | 3,173 | $0.09 |
| anthropic/claude-haiku-4-5 | 144s | 14,461 | $0.17 |
| alibaba/qwen3-coder-next | 248s | 4,409 | $0.47 |
| alibaba/qwen3.5-plus | 265s | 10,304 | $0.23 |

All 8 correct. All handled the 15-digit answer — no overflow.


Day 4 Part 1 — Grid neighbour counting (accessible paper rolls)

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 33s | 1,307 | $0.05 |
| kimi-coding/k2p5 | 33s | 974 | $0.01 |
| anthropic/claude-haiku-4-5 | 44s | 3,673 | $0.05 |
| anthropic/claude-opus-4-6 | 44s | 1,409 | $0.11 |
| minimax/MiniMax-M2.5 | 89s | 2,759 | $0.03 |
| alibaba/qwen3.5-plus | 147s | 6,383 | $0.11 |
| zai/glm-5 | 162s | 2,968 | $0.04 |
| alibaba/qwen3-coder-next | 769s | 19,731 | $0.73 |

Day 4 Part 2 — Iterative grid removal simulation

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-opus-4-6 | 22s | 1,320 | $0.08 |
| anthropic/claude-sonnet-4-6 | 24s | 1,243 | $0.04 |
| anthropic/claude-haiku-4-5 | 25s | 2,423 | $0.03 |
| kimi-coding/k2p5 | 28s | 1,340 | $0.02 |
| alibaba/qwen3.5-plus | 65s | 3,829 | $0.12 |
| alibaba/qwen3-coder-next | 213s | 4,696 | $0.29 |
| zai/glm-5 | 221s | 3,976 | $0.09 |
| minimax/MiniMax-M2.5 | 237s | 7,399 | $0.12 |

Day 5 Part 1 — Range membership checking

| Model | Time | Output tokens | Cost |
|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 67s | 2,632 | $0.10 |
| kimi-coding/k2p5 | 74s | 3,287 | $0.03 |
| anthropic/claude-haiku-4-5 | 82s | 7,079 | $0.09 |
| anthropic/claude-opus-4-6 | 84s | 3,399 | $0.22 |
| alibaba/qwen3.5-plus | 172s | 8,297 | $0.09 |
| minimax/MiniMax-M2.5 | 293s | 8,600 | $0.27 |
| zai/glm-5 | 361s | 6,021 | $0.23 |
| alibaba/qwen3-coder-next | — | 160 | $0.01 |

qwen3-coder-next read the real input file, dumped it into the context, started writing "I understand the problem now. The task is to: 1. Parse Ingredi—" and froze mid-sentence. No recovery.


Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

The answer is a 15-digit number — the final overflow test.

| Model | Time | Output tokens | Cost | Result |
|---|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 25s | 1,360 | $0.06 | COMPLETE |
| anthropic/claude-opus-4-6 | 31s | 1,269 | $0.10 | COMPLETE |
| kimi-coding/k2p5 | 73s | 4,663 | $0.04 | COMPLETE |
| anthropic/claude-haiku-4-5 | 76s | 7,877 | $0.11 | COMPLETE |
| zai/glm-5 | 171s | 3,382 | $0.15 | COMPLETE |
| minimax/MiniMax-M2.5 | 211s | 6,854 | $0.34 | COMPLETE |
| alibaba/qwen3.5-plus | 551s | 15,764 | $0.60 | COMPLETE |

Seven out of seven. In run 2, qwen3.5-plus overflowed this exact puzzle. With the system prompt documenting that int is 32-bit and bigint exists, it used bigint from the start and answered correctly. No explicit warning needed.


Full summary (Run 3) — all 10 models

Wall-clock seconds. DNF = ejected at that part.

| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-sonnet-4-6 | 68 | 39 | 71 | 54 | 39 | 44 | 33 | 24 | 67 | 25 | 464s |
| claude-opus-4-6 | 69 | 46 | 70 | 54 | 57 | 34 | 44 | 22 | 84 | 31 | 511s |
| claude-haiku-4-5 | 59 | 327 | 50 | 18 | 62 | 144 | 44 | 25 | 82 | 76 | 887s |
| kimi-coding/k2p5 | 73 | 335 | 65 | 68 | 299 | 31 | 33 | 28 | 74 | 73 | 1,079s |
| minimax/MiniMax-M2.5 | 120 | 615 | 210 | 75 | 127 | 110 | 89 | 237 | 293 | 211 | 2,087s |
| zai/glm-5 | 133 | 65 | 580 | 338 | 291 | 106 | 162 | 221 | 361 | 171 | 2,428s |
| alibaba/qwen3.5-plus | 49 | 90 | 840 | 394 | 50 | 265 | 147 | 65 | 172 | 551 | 2,623s |
| alibaba/qwen3-coder-next | 122 | 330 | 226 | 365 | 158 | 248 | 769 | 213 | DNF | | — |
| mistral/devstral-2512 | 59 | DNF | | | | | | | | | — |
| openai-codex/gpt-5.3-codex | 62 | 26 | DNF | | | | | | | | — |

Token and cost breakdown for all models. Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

| Model | Total tokens | Total cost |
|---|---|---|
| alibaba/qwen3.5-plus | 102,877 | $3.50 |
| minimax/MiniMax-M2.5 | 63,296 | $1.28 |
| claude-opus-4-6 | 19,270 | $1.20 |
| zai/glm-5 | 45,006 | $1.06 |
| claude-haiku-4-5 | 60,316 | $0.74 |
| claude-sonnet-4-6 | 19,683 | $0.72 |
| kimi-coding/k2p5 | 19,572 | $0.24 |
| alibaba/qwen3-coder-next † | 71,791 | $3.30 |
| mistral/devstral-2512 † | 25,892 | $0.61 |
| openai-codex/gpt-5.3-codex † | 3,638 | $0.17 |

† ejected before completing all parts.


Across all three runs

| | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Intervention | none | overflow warning | system prompt |
| Timeout | 3 min | 5 min | 15 min |
| Full completers | 1/10 | 2/10 | 7/10 |
| D1P1 first-try pass rate | 7/10 | 8/10 | 10/10 |
| D1P2 survivors | 3/10 | 4/10 | 9/10 |
| claude-opus-4-6 | ejected D5P2 | ✓ all 10 | ✓ all 10 |
| claude-sonnet-4-6 | ejected D1P2 | ejected D1P2 | ✓ fastest (464s) |
| claude-haiku-4-5 | ✓ sole completer | ejected D1P2 | ✓ all 10 |
| kimi-coding/k2p5 | ejected D1P2 | ✓ all 10 | ✓ all 10 |
| zai/glm-5 | ejected D1P2 | ejected D1P2 | ✓ all 10 |
| minimax/MiniMax-M2.5 | ejected D1P2 | ejected D1P2 | ✓ all 10 |
| alibaba/qwen3.5-plus | ejected D2P1 | ejected D5P2 | ✓ all 10 |
| alibaba/qwen3-coder-next | ejected D1P2 | ejected D2P1 | ejected D5P1 |
| mistral/devstral-2512 | ejected D1P1 | ejected D1P1 | ejected D1P2 |
| openai-codex/gpt-5.3-codex | ejected D1P1 | ejected D1P1 | ejected D2P1 |

Observations

Teaching the language was the biggest lever. The system prompt — 150 lines of syntax, types, and stdlib — took the completion rate from 2/10 to 7/10. The overflow warning in run 2 helped exactly one model (opus). The system prompt helped five more (sonnet, haiku, glm-5, minimax, qwen3.5-plus).

The overflow problem was a documentation problem. In run 2, qwen3.5-plus overflowed on Day 5 Part 2 despite an explicit overflow warning. In run 3, it used bigint from the start and answered correctly — because the system prompt documented int as 32-bit and bigint as arbitrary precision. Models don't need warnings; they need language specs.

claude-sonnet-4-6 went from run 2's Day 1 Part 2 casualty to run 3's fastest completer (464s). In run 2 it gave a wrong answer on the boundary-condition puzzle and was ejected. In run 3 it completed all 10 parts and beat opus on total time. The system prompt didn't just help with overflow — it helped models write correct algorithms in an unfamiliar language.

kimi-coding/k2p5 remains the value champion: $0.24 for all 10 parts. Down from $0.46 in run 2. The system prompt cut its token usage almost in half — fewer compile errors, fewer false starts.

Three models remain unreachable. devstral got further (past D1P1) but still couldn't solve D1P2. gpt-5.3-codex got further (past D1) but went brain-dead on D2P1. qwen3-coder-next got the furthest (8 parts) but froze mid-thought on D5P1. The system prompt can't fix models that go catatonic.

The llms.txt files were never needed. In run 2, no model proactively consulted them. In run 3, they weren't even offered. A concise system prompt covering the essentials was more effective than 14,000 lines of API reference sitting in the working directory.


Benchmarked on 2026-02-26 (run 2) and 2026-02-27 (run 3) using pi as the agent harness.


This post was written with AI assistance.