Following up on the Haskell, OCaml, Python, ReScript, Ruby, Elixir, Java, and Elm benchmarks, I ran the same AoC 2025 Days 1–5 setup in F#.
F# occupies an interesting middle ground. It's a functional-first language on .NET — strongly
typed with type inference, pattern matching, and pipelines, but with full access to the
imperative .NET ecosystem when needed. It sees real production use but isn't anywhere near as
common as C# or Python in training data. No scaffold was provided; each model had to figure
out dotnet fsi scripting or full project setup on its own.
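For readers unfamiliar with the scripting route: a single .fsx file run with dotnet fsi needs no project setup at all, which is the low-friction path most agents would want to find. A minimal sketch of that shape (the file name and the toy problem are illustrative, not taken from any model's actual solution; a real script would read input.txt with System.IO.File.ReadAllLines):

```fsharp
// solve.fsx — run with: dotnet fsi solve.fsx
// Illustrative skeleton only, not any model's actual solution.
let sampleInput = [| "abc"; "de"; "f" |]

// Typical functional-first shape: parse, pipeline of transforms, print.
let solve (lines: string[]) =
    lines
    |> Array.filter (fun line -> line <> "")
    |> Array.sumBy (fun line -> line.Length)

printfn "%d" (solve sampleInput)   // prints 6
```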
The result: another clean sweep. Every model solved every part.
The contestants
| # | Model |
|---|---|
| 1 | anthropic/claude-haiku-4-5 |
| 2 | anthropic/claude-sonnet-4-6 |
| 3 | anthropic/claude-opus-4-6 |
| 4 | openai-codex/gpt-5.3-codex |
| 5 | zai/glm-5 |
| 6 | minimax/MiniMax-M2.5 |
| 7 | kimi-coding/k2p5 |
| 8 | mistral/devstral-2512 |
| 9 | alibaba/qwen3.5-plus |
| 10 | alibaba/qwen3-coder-next |
Ejections
None. All 10 models completed all 10 parts. F# joins Python, Ruby, and Elm as the fourth language with a perfect completion rate.
Day 1 Part 2 was the only real trouble spot. Four models needed retries there —
gpt-5.3-codex, devstral-2512, and MiniMax-M2.5 each needed two attempts, while
qwen3-coder-next took three. Beyond that, glm-5 had a dirty retry on Day 3 Part 1
(it wrote a premature answer while still working). Every other part was a clean first-try
solve across the board.
Results (Days 1–5)
Per-task leaderboards
Day 1 Part 1 — Dial rotation counting
| Model | Time |
|---|---|
| mistral/devstral-2512 | 12s |
| anthropic/claude-sonnet-4-6 | 17s |
| anthropic/claude-haiku-4-5 | 19s |
| openai-codex/gpt-5.3-codex | 20s |
| kimi-coding/k2p5 | 24s |
| anthropic/claude-opus-4-6 | 27s |
| zai/glm-5 | 40s |
| alibaba/qwen3-coder-next | 42s |
| minimax/MiniMax-M2.5 | 83s |
| alibaba/qwen3.5-plus | 86s |
Day 1 Part 2 — Counting zero-crossings during dial rotation
| Model | Time | Result |
|---|---|---|
| anthropic/claude-haiku-4-5 | 10s | ✓ |
| anthropic/claude-sonnet-4-6 | 28s | ✓ |
| zai/glm-5 | 30s | ✓ |
| anthropic/claude-opus-4-6 | 33s | ✓ |
| kimi-coding/k2p5 | 65s | ✓ |
| alibaba/qwen3.5-plus | 73s | ✓ |
| openai-codex/gpt-5.3-codex | 315s | ✓ (2nd try) |
| mistral/devstral-2512 | 342s | ✓ (2nd try) |
| minimax/MiniMax-M2.5 | 547s | ✓ (2nd try) |
| alibaba/qwen3-coder-next | 625s | ✓ (3rd try) |
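The post doesn't reproduce the puzzle text, but the general shape of a zero-counting dial task can be sketched. Everything below (the dial size, the rotation encoding, and counting landings on zero rather than pass-throughs) is an assumption for illustration; the actual Part 2 twist, possibly counting every crossing during multi-revolution spins, is likely exactly the kind of shift that forced the retries.

```fsharp
// Hypothetical sketch only: count how many times a pointer on a circular
// dial lands on position zero while applying signed rotations in order.
// Dial size and rotation semantics are assumptions, not the real puzzle spec.
let countZeroLandings (dialSize: int) (start: int) (rotations: int list) =
    rotations
    |> List.fold
        (fun (pos, zeros) delta ->
            // Normalize into [0, dialSize), handling negative intermediate values.
            let next = ((pos + delta) % dialSize + dialSize) % dialSize
            next, (if next = 0 then zeros + 1 else zeros))
        (start, 0)
    |> snd

// Dial of 100 positions, starting at 50: 50→0 (hit), 0→0 (hit), 0→30.
countZeroLandings 100 50 [ 50; -100; 30 ]   // returns 2
```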
Day 2 Part 1 — Summing repeated-digit IDs in ranges
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 30s |
| anthropic/claude-sonnet-4-6 | 30s |
| anthropic/claude-opus-4-6 | 32s |
| openai-codex/gpt-5.3-codex | 37s |
| zai/glm-5 | 38s |
| mistral/devstral-2512 | 39s |
| alibaba/qwen3-coder-next | 43s |
| minimax/MiniMax-M2.5 | 79s |
| alibaba/qwen3.5-plus | 85s |
| kimi-coding/k2p5 | 155s |
Day 2 Part 2 — Repeated-pattern IDs (any repeat count)
| Model | Time |
|---|---|
| anthropic/claude-haiku-4-5 | 13s |
| openai-codex/gpt-5.3-codex | 16s |
| alibaba/qwen3.5-plus | 17s |
| alibaba/qwen3-coder-next | 21s |
| mistral/devstral-2512 | 28s |
| anthropic/claude-sonnet-4-6 | 29s |
| anthropic/claude-opus-4-6 | 33s |
| zai/glm-5 | 36s |
| minimax/MiniMax-M2.5 | 48s |
| kimi-coding/k2p5 | 76s |
Day 3 Part 1 — Maximizing 2-digit joltage from battery banks
| Model | Time | Result |
|---|---|---|
| anthropic/claude-sonnet-4-6 | 18s | ✓ |
| kimi-coding/k2p5 | 18s | ✓ |
| anthropic/claude-opus-4-6 | 25s | ✓ |
| openai-codex/gpt-5.3-codex | 26s | ✓ |
| alibaba/qwen3-coder-next | 29s | ✓ |
| anthropic/claude-haiku-4-5 | 30s | ✓ |
| alibaba/qwen3.5-plus | 45s | ✓ |
| minimax/MiniMax-M2.5 | 51s | ✓ |
| mistral/devstral-2512 | 64s | ✓ |
| zai/glm-5 | 178s | ✓ (dirty retry) |
Day 3 Part 2 — Maximizing 12-digit joltage from battery banks
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 14s |
| kimi-coding/k2p5 | 18s |
| anthropic/claude-sonnet-4-6 | 20s |
| alibaba/qwen3.5-plus | 22s |
| anthropic/claude-opus-4-6 | 24s |
| alibaba/qwen3-coder-next | 25s |
| zai/glm-5 | 33s |
| minimax/MiniMax-M2.5 | 37s |
| anthropic/claude-haiku-4-5 | 40s |
| mistral/devstral-2512 | 44s |
Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)
| Model | Time |
|---|---|
| alibaba/qwen3-coder-next | 10s |
| alibaba/qwen3.5-plus | 17s |
| openai-codex/gpt-5.3-codex | 18s |
| anthropic/claude-sonnet-4-6 | 20s |
| mistral/devstral-2512 | 21s |
| anthropic/claude-haiku-4-5 | 24s |
| anthropic/claude-opus-4-6 | 29s |
| kimi-coding/k2p5 | 29s |
| zai/glm-5 | 37s |
| minimax/MiniMax-M2.5 | 82s |
Day 4 Part 2 — Iterative grid removal simulation
| Model | Time |
|---|---|
| alibaba/qwen3-coder-next | 8s |
| anthropic/claude-haiku-4-5 | 14s |
| alibaba/qwen3.5-plus | 17s |
| anthropic/claude-sonnet-4-6 | 18s |
| openai-codex/gpt-5.3-codex | 19s |
| anthropic/claude-opus-4-6 | 22s |
| mistral/devstral-2512 | 24s |
| kimi-coding/k2p5 | 34s |
| zai/glm-5 | 37s |
| minimax/MiniMax-M2.5 | 69s |
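An iterative removal simulation like D4P2 is usually written as a fixed-point loop: remove everything that currently matches the rule, then repeat on the smaller grid until nothing changes. A hedged F# sketch, with a made-up removal rule (fewer than four occupied 8-neighbors) standing in for the real one:

```fsharp
// Hypothetical sketch of the D4P2 shape. The removal rule here (fewer than
// four occupied 8-neighbors) is an assumption for illustration only.
let neighbors (r, c) =
    [ for dr in -1 .. 1 do
        for dc in -1 .. 1 do
          if (dr, dc) <> (0, 0) then yield (r + dr, c + dc) ]

let removeUntilStable (cells: Set<int * int>) =
    let rec loop (cells: Set<int * int>) removed =
        // Find every cell matching the rule against the *current* grid.
        let toRemove =
            cells
            |> Set.filter (fun cell ->
                let occupied =
                    neighbors cell
                    |> List.filter (fun n -> Set.contains n cells)
                    |> List.length
                occupied < 4)
        if Set.isEmpty toRemove then removed
        else loop (Set.difference cells toRemove) (removed + Set.count toRemove)
    loop cells 0
```

Under this rule a 2×2 block is wiped out in one pass (each cell has only three neighbors), so `removeUntilStable` returns 4 for it.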
Day 5 Part 1 — Range membership checking
| Model | Time |
|---|---|
| openai-codex/gpt-5.3-codex | 16s |
| anthropic/claude-sonnet-4-6 | 23s |
| anthropic/claude-opus-4-6 | 27s |
| zai/glm-5 | 29s |
| kimi-coding/k2p5 | 30s |
| mistral/devstral-2512 | 31s |
| alibaba/qwen3.5-plus | 31s |
| anthropic/claude-haiku-4-5 | 44s |
| minimax/MiniMax-M2.5 | 45s |
| alibaba/qwen3-coder-next | 58s |
Day 5 Part 2 — Counting total fresh IDs from overlapping ranges
| Model | Time |
|---|---|
| mistral/devstral-2512 | 11s |
| openai-codex/gpt-5.3-codex | 12s |
| anthropic/claude-sonnet-4-6 | 13s |
| anthropic/claude-haiku-4-5 | 15s |
| anthropic/claude-opus-4-6 | 18s |
| kimi-coding/k2p5 | 21s |
| alibaba/qwen3.5-plus | 27s |
| alibaba/qwen3-coder-next | 27s |
| zai/glm-5 | 37s |
| minimax/MiniMax-M2.5 | 37s |
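Counting distinct IDs across overlapping ranges is classically done with a sort-and-sweep merge, which is presumably what most models reached for here. A sketch under assumed input shape (inclusive int64 ranges, non-negative IDs):

```fsharp
// Sort ranges by start, then sweep left to right, counting only the part of
// each range not already covered by an earlier one. Assumes inclusive
// (lo, hi) ranges over non-negative int64 IDs.
let countCovered (ranges: (int64 * int64) list) =
    ranges
    |> List.sortBy fst
    |> List.fold
        (fun (total, hi) (lo', hi') ->
            let lo = max lo' (hi + 1L)          // skip the already-counted prefix
            if hi' < lo then total, hi          // fully inside an earlier range
            else total + (hi' - lo + 1L), hi')
        (0L, -1L)                               // -1L: nothing covered yet
    |> fst

// Covers 1–7 and 10–12, so 10 distinct IDs in total.
countCovered [ (3L, 7L); (1L, 4L); (10L, 12L) ]   // returns 10
```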
Speed vs accuracy
Summary tables
Wall-clock time (seconds)
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| anthropic/claude-sonnet-4-6 | 17s | 28s | 30s | 29s | 18s | 20s | 20s | 18s | 23s | 13s | 216s |
| anthropic/claude-haiku-4-5 | 19s | 10s | 30s | 13s | 30s | 40s | 24s | 14s | 44s | 15s | 239s |
| anthropic/claude-opus-4-6 | 27s | 33s | 32s | 33s | 25s | 24s | 29s | 22s | 27s | 18s | 270s |
| alibaba/qwen3.5-plus | 86s | 73s | 85s | 17s | 45s | 22s | 17s | 17s | 31s | 27s | 420s |
| kimi-coding/k2p5 | 24s | 65s | 155s | 76s | 18s | 18s | 29s | 34s | 30s | 21s | 470s |
| openai-codex/gpt-5.3-codex | 20s | 315s | 37s | 16s | 26s | 14s | 18s | 19s | 16s | 12s | 493s |
| zai/glm-5 | 40s | 30s | 38s | 36s | 178s | 33s | 37s | 37s | 29s | 37s | 495s |
| mistral/devstral-2512 | 12s | 342s | 39s | 28s | 64s | 44s | 21s | 24s | 31s | 11s | 616s |
| alibaba/qwen3-coder-next | 42s | 625s | 43s | 21s | 29s | 25s | 10s | 8s | 58s | 27s | 888s |
| minimax/MiniMax-M2.5 | 83s | 547s | 79s | 48s | 51s | 37s | 82s | 69s | 45s | 37s | 1078s |
Output tokens per part
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| openai-codex/gpt-5.3-codex | 709 | 1,248 | 1,205 | 553 | 758 | 533 | 547 | 803 | 543 | 517 | 7,416 |
| zai/glm-5 | 861 | 609 | 841 | 767 | 3,630 | 648 | 777 | 666 | 609 | 823 | 10,231 |
| anthropic/claude-sonnet-4-6 | 750 | 1,279 | 1,404 | 1,503 | 859 | 932 | 855 | 873 | 1,184 | 695 | 10,334 |
| anthropic/claude-opus-4-6 | 1,054 | 1,736 | 1,438 | 1,429 | 961 | 1,015 | 1,122 | 899 | 1,030 | 806 | 11,490 |
| kimi-coding/k2p5 | 639 | 2,636 | 4,824 | 2,640 | 677 | 868 | 903 | 1,225 | 850 | 616 | 15,878 |
| anthropic/claude-haiku-4-5 | 1,559 | 898 | 2,339 | 1,037 | 2,560 | 3,326 | 1,962 | 1,131 | 3,864 | 1,139 | 19,815 |
| mistral/devstral-2512 | 618 | 5,672 | 3,459 | 2,511 | 4,324 | 2,710 | 1,560 | 3,129 | 2,437 | 790 | 27,210 |
| alibaba/qwen3.5-plus | 4,919 | 7,012 | 6,106 | 1,138 | 3,165 | 1,430 | 1,198 | 1,160 | 2,158 | 2,161 | 30,447 |
| minimax/MiniMax-M2.5 | 2,481 | 18,006 | 2,230 | 1,060 | 1,720 | 993 | 2,007 | 2,514 | 991 | 1,482 | 33,484 |
| alibaba/qwen3-coder-next | 3,355 | 31,718 | 2,426 | 2,066 | 1,391 | 1,300 | 819 | 803 | 4,027 | 1,547 | 49,452 |
API cost per part (approximate USD)
Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.
| Model | D1P1 | D1P2 | D2P1 | D2P2 | D3P1 | D3P2 | D4P1 | D4P2 | D5P1 | D5P2 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| kimi-coding/k2p5 | .0091 | .0122 | .0228 | .0165 | .0100 | .0095 | .0117 | .0114 | .0040 | .0037 | $0.11 |
| openai-codex/gpt-5.3-codex | .0291 | .0363 | .0408 | .0168 | .0352 | .0158 | .0187 | .0284 | .0220 | .0133 | $0.26 |
| zai/glm-5 | .0279 | .0186 | .0092 | .0086 | .0720 | .0226 | .0304 | .0219 | .0408 | .0283 | $0.28 |
| anthropic/claude-haiku-4-5 | .0352 | .0098 | .0402 | .0114 | .0469 | .0337 | .0372 | .0134 | .0638 | .0175 | $0.31 |
| anthropic/claude-sonnet-4-6 | .0355 | .0349 | .0489 | .0458 | .0361 | .0275 | .0362 | .0288 | .0455 | .0214 | $0.36 |
| mistral/devstral-2512 | .0081 | .0528 | .0502 | .0463 | .0828 | .0538 | .0171 | .0327 | .0268 | .0141 | $0.38 |
| alibaba/qwen3.5-plus | .0892 | .0583 | .0928 | .0345 | .0453 | .0255 | .0116 | .0179 | .0261 | .0318 | $0.43 |
| minimax/MiniMax-M2.5 | .0696 | .1855 | .0209 | .0132 | .0171 | .0192 | .0435 | .0555 | .0166 | .0282 | $0.47 |
| anthropic/claude-opus-4-6 | .1711 | .0743 | .1466 | .0668 | .1185 | .0873 | .1628 | .0737 | .1520 | .0798 | $1.13 |
| alibaba/qwen3-coder-next | .0991 | 1.1932 | .0505 | .0435 | .0997 | .1027 | .0210 | .0280 | .2615 | .1149 | $2.01 |
Observations
10/10 completers — zero ejections. F# joins Python, Ruby, and Elm as the only languages in this series where every model solved every part.
claude-sonnet-4-6 — fastest overall at 216s. No single part over 30s and no retries; the most consistent performer in this run. ~$0.36 total.
claude-haiku-4-5 — second fastest at 239s and remarkably cheap at ~$0.31. Hit a 10s
solve on D1P2 — the single fastest part solve in the entire benchmark. Never needed a
retry.
claude-opus-4-6 — the steadiest clock in the field. Every part landed between 18s and 33s with no retries: no blowouts, but nothing slow either. The most expensive Anthropic model at ~$1.13.
kimi-coding/k2p5 — cheapest at ~$0.11. That's roughly 10× cheaper than Opus for
comparable results. Slow on D2P1 (155s) and D2P2 (76s) but otherwise quick.
gpt-5.3-codex — fewest tokens: 7,416 total for 10 parts. Incredibly concise. Would
have been a top-3 finisher on time if not for the 315s D1P2 retry that dragged its total
to 493s.
Day 1 Part 2 was the filter. Six models solved it on the first try; four needed retries. This was the only part in the entire F# benchmark where any model gave a wrong answer. Whatever the conceptual shift between Part 1 and Part 2 was, it tripped up the same models that struggle with Part 2 pivots in other languages.
qwen3-coder-next — the most extreme profile. Produced the fastest D4P1 (10s) and
D4P2 (8s) solves, but also the most expensive D1P2 at $1.19 and 31,718 tokens after
needing three attempts. Total cost: $2.01, the highest in the field.
MiniMax-M2.5 — slowest overall at 1,078s. D1P2 alone took 547s after a retry. But it
got there in the end, and its per-token pricing kept costs moderate at ~$0.47.
Cross-language snapshot
| Language | Models completing all 10 parts |
|---|---|
| Python | 10/10 |
| Ruby | 10/10 |
| Elm | 10/10 |
| F# | 10/10 |
| Java | 9/10 |
| Elixir | 7/10 |
| Haskell | 7/11 |
| OCaml | 5/9 |
| ReScript (run 2) | 2/10 |
F#'s 10/10 was less surprising than Elm's: as a .NET language, it has decent representation
in training data thanks to the broader .NET ecosystem. Models could reach for imperative
patterns when functional ones didn't work, and dotnet fsi provides a frictionless scripting
experience. Still, zero ejections across 10 models and 10 parts is a strong result for a
language that isn't Python or JavaScript.
Benchmarked on 2026-02-27 using pi as the agent harness.
This post was written with AI assistance.