Developers, developers, developers!

Blog about programming, programming, and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (F#)

Tags = [ F#, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, the Ruby benchmark, the Elixir benchmark, the Java benchmark, and the Elm benchmark, I ran the same AoC 2025 Days 1–5 setup in F#.

F# occupies an interesting middle ground. It's a functional-first language on .NET — strongly typed with type inference, pattern matching, and pipelines, but with full access to the imperative .NET ecosystem when needed. It sees real production use but isn't anywhere near as common as C# or Python in training data. No scaffold was provided; each model had to figure out dotnet fsi scripting or full project setup on its own.
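For context, the scripting path is genuinely low-friction: a single `.fsx` file run directly with `dotnet fsi`, no project file or compile step. A minimal sketch of that workflow (the file name and the toy logic are illustrative, not what any model actually wrote):

```fsharp
// solve.fsx — run with: dotnet fsi solve.fsx
// No .fsproj or build step needed for this style.
open System.IO

let input = File.ReadAllLines "input.txt"

// Placeholder logic: sum the integer on each line.
let answer = input |> Array.sumBy int

printfn "%d" answer
```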

The result: another clean sweep. Every model solved every part.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

None. All 10 models completed all 10 parts. F# joins Python, Ruby, and Elm as the fourth language with a perfect completion rate.

Day 1 Part 2 was the only real trouble spot. Four models needed retries there — gpt-5.3-codex, devstral-2512, and MiniMax-M2.5 each needed two attempts, while qwen3-coder-next took three. Beyond that, glm-5 had a dirty retry on Day 3 Part 1 (it wrote a premature answer while still working). Every other part was a clean first-try solve across the board.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model  Time
mistral/devstral-2512  12s
anthropic/claude-sonnet-4-6  17s
anthropic/claude-haiku-4-5  19s
openai-codex/gpt-5.3-codex  20s
kimi-coding/k2p5  24s
anthropic/claude-opus-4-6  27s
zai/glm-5  40s
alibaba/qwen3-coder-next  42s
minimax/MiniMax-M2.5  83s
alibaba/qwen3.5-plus  86s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model  Time  Result
anthropic/claude-haiku-4-5  10s
anthropic/claude-sonnet-4-6  28s
zai/glm-5  30s
anthropic/claude-opus-4-6  33s
kimi-coding/k2p5  65s
alibaba/qwen3.5-plus  73s
openai-codex/gpt-5.3-codex  315s  ✓ (2nd try)
mistral/devstral-2512  342s  ✓ (2nd try)
minimax/MiniMax-M2.5  547s  ✓ (2nd try)
alibaba/qwen3-coder-next  625s  ✓ (3rd try)



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model  Time
anthropic/claude-haiku-4-5  30s
anthropic/claude-sonnet-4-6  30s
anthropic/claude-opus-4-6  32s
openai-codex/gpt-5.3-codex  37s
zai/glm-5  38s
mistral/devstral-2512  39s
alibaba/qwen3-coder-next  43s
minimax/MiniMax-M2.5  79s
alibaba/qwen3.5-plus  85s
kimi-coding/k2p5  155s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model  Time
anthropic/claude-haiku-4-5  13s
openai-codex/gpt-5.3-codex  16s
alibaba/qwen3.5-plus  17s
alibaba/qwen3-coder-next  21s
mistral/devstral-2512  28s
anthropic/claude-sonnet-4-6  29s
anthropic/claude-opus-4-6  33s
zai/glm-5  36s
minimax/MiniMax-M2.5  48s
kimi-coding/k2p5  76s
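I won't reproduce the full puzzle spec here, but the Part 2 title suggests the core check is deciding whether a digit string consists of some shorter block repeated two or more times. That test can be written concisely in F# (the function name and exact semantics are my assumption, not the actual puzzle definition):

```fsharp
// True if s is some proper prefix repeated >= 2 times,
// e.g. "123123" and "7777" qualify; "1231" does not.
let isRepeatedPattern (s: string) =
    [1 .. s.Length / 2]
    |> List.exists (fun len ->
        s.Length % len = 0 &&
        // Compare s against its first `len` chars tiled to full length.
        Seq.forall2 (=) s (Seq.init s.Length (fun i -> s.[i % len])))

// isRepeatedPattern "123123"  → true
// isRepeatedPattern "1231"    → false
```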



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model  Time  Result
anthropic/claude-sonnet-4-6  18s
kimi-coding/k2p5  18s
anthropic/claude-opus-4-6  25s
openai-codex/gpt-5.3-codex  26s
alibaba/qwen3-coder-next  29s
anthropic/claude-haiku-4-5  30s
alibaba/qwen3.5-plus  45s
minimax/MiniMax-M2.5  51s
mistral/devstral-2512  64s
zai/glm-5  178s  ✓ (dirty retry)



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model  Time
openai-codex/gpt-5.3-codex  14s
kimi-coding/k2p5  18s
anthropic/claude-sonnet-4-6  20s
alibaba/qwen3.5-plus  22s
anthropic/claude-opus-4-6  24s
alibaba/qwen3-coder-next  25s
zai/glm-5  33s
minimax/MiniMax-M2.5  37s
anthropic/claude-haiku-4-5  40s
mistral/devstral-2512  44s



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model  Time
alibaba/qwen3-coder-next  10s
alibaba/qwen3.5-plus  17s
openai-codex/gpt-5.3-codex  18s
anthropic/claude-sonnet-4-6  20s
mistral/devstral-2512  21s
anthropic/claude-haiku-4-5  24s
anthropic/claude-opus-4-6  29s
kimi-coding/k2p5  29s
zai/glm-5  37s
minimax/MiniMax-M2.5  82s



Day 4 Part 2 — Iterative grid removal simulation

Model  Time
alibaba/qwen3-coder-next  8s
anthropic/claude-haiku-4-5  14s
alibaba/qwen3.5-plus  17s
anthropic/claude-sonnet-4-6  18s
openai-codex/gpt-5.3-codex  19s
anthropic/claude-opus-4-6  22s
mistral/devstral-2512  24s
kimi-coding/k2p5  34s
zai/glm-5  37s
minimax/MiniMax-M2.5  69s
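The Part 2 title points at a fixpoint simulation: repeatedly remove cells that satisfy some accessibility condition until the grid stops changing. A generic F# shape of that loop, with the removal predicate left as a parameter since I'm not restating the actual puzzle rule:

```fsharp
// Generic fixpoint removal: repeatedly strip elements of `cells` that
// satisfy `removable` (which may inspect the current set), until stable.
// Returns (total removed, surviving cells).
let removeUntilStable (removable: Set<'a> -> 'a -> bool) (cells: Set<'a>) =
    let rec loop current removedSoFar =
        let toRemove = current |> Set.filter (removable current)
        if Set.isEmpty toRemove then removedSoFar, current
        else loop (Set.difference current toRemove)
                  (removedSoFar + Set.count toRemove)
    loop cells 0
```

For a grid puzzle, `removable` would typically count occupied neighbors of a coordinate in the current set; the point of the sketch is the loop-until-stable structure, which is where naive solutions tend to get slow or subtly wrong.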



Day 5 Part 1 — Range membership checking

Model  Time
openai-codex/gpt-5.3-codex  16s
anthropic/claude-sonnet-4-6  23s
anthropic/claude-opus-4-6  27s
zai/glm-5  29s
kimi-coding/k2p5  30s
mistral/devstral-2512  31s
alibaba/qwen3.5-plus  31s
anthropic/claude-haiku-4-5  44s
minimax/MiniMax-M2.5  45s
alibaba/qwen3-coder-next  58s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model  Time
mistral/devstral-2512  11s
openai-codex/gpt-5.3-codex  12s
anthropic/claude-sonnet-4-6  13s
anthropic/claude-haiku-4-5  15s
anthropic/claude-opus-4-6  18s
kimi-coding/k2p5  21s
alibaba/qwen3.5-plus  27s
alibaba/qwen3-coder-next  27s
zai/glm-5  37s
minimax/MiniMax-M2.5  37s
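"Counting total fresh IDs from overlapping ranges" reads like the classic merge-overlapping-intervals problem: sort ranges by start, sweep, and sum only the uncounted portions. A sketch of that technique in F# (inclusive int64 ranges are my assumption about the input shape):

```fsharp
// Count distinct integers covered by a list of inclusive (lo, hi) ranges,
// tolerating overlaps, by sweeping in order of range start.
let countCovered (ranges: (int64 * int64) list) =
    ranges
    |> List.sortBy fst
    |> List.fold (fun (total, lastHi) (lo, hi) ->
        // Clip each range to the part not already counted.
        let lo' = max lo (lastHi + 1L)
        if hi < lo' then (total, lastHi)
        else (total + (hi - lo' + 1L), hi)) (0L, System.Int64.MinValue)
    |> fst

// countCovered [ (1L, 5L); (3L, 9L); (20L, 20L) ]  → 10L
```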

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-sonnet-4-6 17s 28s 30s 29s 18s 20s 20s 18s 23s 13s 216s
anthropic/claude-haiku-4-5 19s 10s 30s 13s 30s 40s 24s 14s 44s 15s 239s
anthropic/claude-opus-4-6 27s 33s 32s 33s 25s 24s 29s 22s 27s 18s 270s
alibaba/qwen3.5-plus 86s 73s 85s 17s 45s 22s 17s 17s 31s 27s 420s
kimi-coding/k2p5 24s 65s 155s 76s 18s 18s 29s 34s 30s 21s 470s
openai-codex/gpt-5.3-codex 20s 315s 37s 16s 26s 14s 18s 19s 16s 12s 493s
zai/glm-5 40s 30s 38s 36s 178s 33s 37s 37s 29s 37s 495s
mistral/devstral-2512 12s 342s 39s 28s 64s 44s 21s 24s 31s 11s 616s
alibaba/qwen3-coder-next 42s 625s 43s 21s 29s 25s 10s 8s 58s 27s 888s
minimax/MiniMax-M2.5 83s 547s 79s 48s 51s 37s 82s 69s 45s 37s 1078s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 709 1,248 1,205 553 758 533 547 803 543 517 7,416
zai/glm-5 861 609 841 767 3,630 648 777 666 609 823 10,231
anthropic/claude-sonnet-4-6 750 1,279 1,404 1,503 859 932 855 873 1,184 695 10,334
anthropic/claude-opus-4-6 1,054 1,736 1,438 1,429 961 1,015 1,122 899 1,030 806 11,490
kimi-coding/k2p5 639 2,636 4,824 2,640 677 868 903 1,225 850 616 15,878
anthropic/claude-haiku-4-5 1,559 898 2,339 1,037 2,560 3,326 1,962 1,131 3,864 1,139 19,815
mistral/devstral-2512 618 5,672 3,459 2,511 4,324 2,710 1,560 3,129 2,437 790 27,210
alibaba/qwen3.5-plus 4,919 7,012 6,106 1,138 3,165 1,430 1,198 1,160 2,158 2,161 30,447
minimax/MiniMax-M2.5 2,481 18,006 2,230 1,060 1,720 993 2,007 2,514 991 1,482 33,484
alibaba/qwen3-coder-next 3,355 31,718 2,426 2,066 1,391 1,300 819 803 4,027 1,547 49,452

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0091 .0122 .0228 .0165 .0100 .0095 .0117 .0114 .0040 .0037 $0.11
openai-codex/gpt-5.3-codex .0291 .0363 .0408 .0168 .0352 .0158 .0187 .0284 .0220 .0133 $0.26
zai/glm-5 .0279 .0186 .0092 .0086 .0720 .0226 .0304 .0219 .0408 .0283 $0.28
anthropic/claude-haiku-4-5 .0352 .0098 .0402 .0114 .0469 .0337 .0372 .0134 .0638 .0175 $0.31
anthropic/claude-sonnet-4-6 .0355 .0349 .0489 .0458 .0361 .0275 .0362 .0288 .0455 .0214 $0.36
mistral/devstral-2512 .0081 .0528 .0502 .0463 .0828 .0538 .0171 .0327 .0268 .0141 $0.38
alibaba/qwen3.5-plus .0892 .0583 .0928 .0345 .0453 .0255 .0116 .0179 .0261 .0318 $0.43
minimax/MiniMax-M2.5 .0696 .1855 .0209 .0132 .0171 .0192 .0435 .0555 .0166 .0282 $0.47
anthropic/claude-opus-4-6 .1711 .0743 .1466 .0668 .1185 .0873 .1628 .0737 .1520 .0798 $1.13
alibaba/qwen3-coder-next .0991 1.1932 .0505 .0435 .0997 .1027 .0210 .0280 .2615 .1149 $2.01

Observations

10/10 completers — zero ejections. F# joins Python, Ruby, and Elm as the only languages in this series where every model solved every part.

claude-sonnet-4-6 — fastest overall at 216s. No single part over 30s, no retries needed: the most consistent performer in this run. ~$0.36 total.

claude-haiku-4-5 — second fastest at 239s and remarkably cheap at ~$0.31. Hit a 10s solve on D1P2 — the single fastest part solve in the entire benchmark. Never needed a retry.

claude-opus-4-6 — the steadiest clock in the field. Every single part between 18s and 33s, never needed a retry. No part was a blowout but none was slow either. The most expensive Anthropic model at ~$1.13.

kimi-coding/k2p5 — cheapest at ~$0.11. That's roughly 10× cheaper than Opus for comparable results. Slow on D2P1 (155s) and D2P2 (76s) but otherwise quick.

gpt-5.3-codex — fewest tokens: 7,416 total for 10 parts. Incredibly concise. Would have been a top-3 finisher on time if not for the 315s D1P2 retry that dragged its total to 493s.

Day 1 Part 2 was the filter. Six models solved it on the first try; four needed retries. It was the only part in the entire F# benchmark where any model gave a wrong answer. Whatever the conceptual shift between Part 1 and Part 2 was, it tripped up the same models that struggle with Part 2 pivots in other languages.

qwen3-coder-next — the most extreme profile. Produced the fastest D4P1 (10s) and D4P2 (8s) solves, but also the most expensive D1P2 at $1.19 and 31,718 tokens after needing three attempts. Total cost: $2.01, the highest in the field.

MiniMax-M2.5 — slowest overall at 1,078s. D1P2 alone took 547s after a retry. But it got there in the end, and its per-token pricing kept costs moderate at ~$0.47.

Cross-language snapshot

Language  Models completing all 10 parts
Python  10/10
Ruby  10/10
Elm  10/10
F#  10/10
Java  9/10
Elixir  7/10
Haskell  7/11
OCaml  5/9
ReScript (run 2)  2/10

F#'s 10/10 was less surprising than Elm's — it's a .NET language with decent representation in training data thanks to the broader .NET ecosystem. Models could reach for imperative patterns when functional ones didn't work, and dotnet fsi provides a frictionless scripting experience. Still, zero ejections across 10 models and 10 parts is a strong result for a language that isn't Python or JavaScript.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.