Developers, developers, developers!

Blog about programming, programming, and, ah, more programming!

Benchmarking LLMs on Advent of Code 2025 (F#)

Tags = [ F#, AI, Advent of Code ]

Following up on the Haskell benchmark, the OCaml benchmark, the Python benchmark, the ReScript benchmark, the Ruby benchmark, the Elixir benchmark, the Java benchmark, and the Elm benchmark, I ran the same AoC 2025 Days 1–5 setup in F#.

F# occupies an interesting middle ground. It's a functional-first language on .NET — strongly typed with type inference, pattern matching, and pipelines, but with full access to the imperative .NET ecosystem when needed. It sees real production use but isn't anywhere near as common as C# or Python in training data. No scaffold was provided; each model had to figure out dotnet fsi scripting or full project setup on its own.
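For context, the scripting path is genuinely low-friction: a single `.fsx` file run directly with `dotnet fsi`, no project file or compile step. A minimal sketch of that workflow (the file name and the toy logic are illustrative, not what any model actually wrote):

```fsharp
// solve.fsx — run with: dotnet fsi solve.fsx
// No .fsproj or build step needed for this style.
open System.IO

let input = File.ReadAllLines "input.txt"

// Placeholder logic: sum the integer on each line.
let answer = input |> Array.sumBy int

printfn "%d" answer
```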

The result: another clean sweep. Every model solved every part.

The contestants

#   Model
1   anthropic/claude-haiku-4-5
2   anthropic/claude-sonnet-4-6
3   anthropic/claude-opus-4-6
4   openai-codex/gpt-5.3-codex
5   zai/glm-5
6   minimax/MiniMax-M2.5
7   kimi-coding/k2p5
8   mistral/devstral-2512
9   alibaba/qwen3.5-plus
10  alibaba/qwen3-coder-next

Ejections

None. All 10 models completed all 10 parts. F# joins Python, Ruby, and Elm as the fourth language with a perfect completion rate.

Day 1 Part 2 was the only real trouble spot. Four models needed retries there — gpt-5.3-codex, devstral-2512, and MiniMax-M2.5 each needed two attempts, while qwen3-coder-next took three. Beyond that, glm-5 had a dirty retry on Day 3 Part 1 (it wrote a premature answer while still working). Every other part was a clean first-try solve across the board.

Results (Days 1–5)

Per-task leaderboards


Day 1 Part 1 — Dial rotation counting

Model  Time
mistral/devstral-2512  12s
anthropic/claude-sonnet-4-6  17s
anthropic/claude-haiku-4-5  19s
openai-codex/gpt-5.3-codex  20s
kimi-coding/k2p5  24s
anthropic/claude-opus-4-6  27s
zai/glm-5  40s
alibaba/qwen3-coder-next  42s
minimax/MiniMax-M2.5  83s
alibaba/qwen3.5-plus  86s



Day 1 Part 2 — Counting zero-crossings during dial rotation

Model  Time  Result
anthropic/claude-haiku-4-5  10s
anthropic/claude-sonnet-4-6  28s
zai/glm-5  30s
anthropic/claude-opus-4-6  33s
kimi-coding/k2p5  65s
alibaba/qwen3.5-plus  73s
openai-codex/gpt-5.3-codex  315s  ✓ (2nd try)
mistral/devstral-2512  342s  ✓ (2nd try)
minimax/MiniMax-M2.5  547s  ✓ (2nd try)
alibaba/qwen3-coder-next  625s  ✓ (3rd try)



Day 2 Part 1 — Summing repeated-digit IDs in ranges

Model  Time
anthropic/claude-haiku-4-5  30s
anthropic/claude-sonnet-4-6  30s
anthropic/claude-opus-4-6  32s
openai-codex/gpt-5.3-codex  37s
zai/glm-5  38s
mistral/devstral-2512  39s
alibaba/qwen3-coder-next  43s
minimax/MiniMax-M2.5  79s
alibaba/qwen3.5-plus  85s
kimi-coding/k2p5  155s



Day 2 Part 2 — Repeated-pattern IDs (any repeat count)

Model  Time
anthropic/claude-haiku-4-5  13s
openai-codex/gpt-5.3-codex  16s
alibaba/qwen3.5-plus  17s
alibaba/qwen3-coder-next  21s
mistral/devstral-2512  28s
anthropic/claude-sonnet-4-6  29s
anthropic/claude-opus-4-6  33s
zai/glm-5  36s
minimax/MiniMax-M2.5  48s
kimi-coding/k2p5  76s
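I won't reproduce the full puzzle spec here, but the Part 2 title suggests the core check is deciding whether a digit string consists of some shorter block repeated two or more times. That test can be written concisely in F# (the function name and exact semantics are my assumption, not the actual puzzle definition):

```fsharp
// True if s is some proper prefix repeated >= 2 times,
// e.g. "123123" and "7777" qualify; "1231" does not.
let isRepeatedPattern (s: string) =
    [1 .. s.Length / 2]
    |> List.exists (fun len ->
        s.Length % len = 0 &&
        // Compare s against its first `len` chars tiled to full length.
        Seq.forall2 (=) s (Seq.init s.Length (fun i -> s.[i % len])))

// isRepeatedPattern "123123"  → true
// isRepeatedPattern "1231"    → false
```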



Day 3 Part 1 — Maximizing 2-digit joltage from battery banks

Model  Time  Result
anthropic/claude-sonnet-4-6  18s
kimi-coding/k2p5  18s
anthropic/claude-opus-4-6  25s
openai-codex/gpt-5.3-codex  26s
alibaba/qwen3-coder-next  29s
anthropic/claude-haiku-4-5  30s
alibaba/qwen3.5-plus  45s
minimax/MiniMax-M2.5  51s
mistral/devstral-2512  64s
zai/glm-5  178s  ✓ (dirty retry)



Day 3 Part 2 — Maximizing 12-digit joltage from battery banks

Model  Time
openai-codex/gpt-5.3-codex  14s
kimi-coding/k2p5  18s
anthropic/claude-sonnet-4-6  20s
alibaba/qwen3.5-plus  22s
anthropic/claude-opus-4-6  24s
alibaba/qwen3-coder-next  25s
zai/glm-5  33s
minimax/MiniMax-M2.5  37s
anthropic/claude-haiku-4-5  40s
mistral/devstral-2512  44s



Day 4 Part 1 — Grid neighbor counting (accessible paper rolls)

Model  Time
alibaba/qwen3-coder-next  10s
alibaba/qwen3.5-plus  17s
openai-codex/gpt-5.3-codex  18s
anthropic/claude-sonnet-4-6  20s
mistral/devstral-2512  21s
anthropic/claude-haiku-4-5  24s
anthropic/claude-opus-4-6  29s
kimi-coding/k2p5  29s
zai/glm-5  37s
minimax/MiniMax-M2.5  82s



Day 4 Part 2 — Iterative grid removal simulation

Model  Time
alibaba/qwen3-coder-next  8s
anthropic/claude-haiku-4-5  14s
alibaba/qwen3.5-plus  17s
anthropic/claude-sonnet-4-6  18s
openai-codex/gpt-5.3-codex  19s
anthropic/claude-opus-4-6  22s
mistral/devstral-2512  24s
kimi-coding/k2p5  34s
zai/glm-5  37s
minimax/MiniMax-M2.5  69s
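The Part 2 title points at a fixpoint simulation: repeatedly remove cells that satisfy some accessibility condition until the grid stops changing. A generic F# shape of that loop, with the removal predicate left as a parameter since I'm not restating the actual puzzle rule:

```fsharp
// Generic fixpoint removal: repeatedly strip elements of `cells` that
// satisfy `removable` (which may inspect the current set), until stable.
// Returns (total removed, surviving cells).
let removeUntilStable (removable: Set<'a> -> 'a -> bool) (cells: Set<'a>) =
    let rec loop current removedSoFar =
        let toRemove = current |> Set.filter (removable current)
        if Set.isEmpty toRemove then removedSoFar, current
        else loop (Set.difference current toRemove)
                  (removedSoFar + Set.count toRemove)
    loop cells 0
```

For a grid puzzle, `removable` would typically count occupied neighbors of a coordinate in the current set; the point of the sketch is the loop-until-stable structure, which is where naive solutions tend to get slow or subtly wrong.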



Day 5 Part 1 — Range membership checking

Model  Time
openai-codex/gpt-5.3-codex  16s
anthropic/claude-sonnet-4-6  23s
anthropic/claude-opus-4-6  27s
zai/glm-5  29s
kimi-coding/k2p5  30s
mistral/devstral-2512  31s
alibaba/qwen3.5-plus  31s
anthropic/claude-haiku-4-5  44s
minimax/MiniMax-M2.5  45s
alibaba/qwen3-coder-next  58s



Day 5 Part 2 — Counting total fresh IDs from overlapping ranges

Model  Time
mistral/devstral-2512  11s
openai-codex/gpt-5.3-codex  12s
anthropic/claude-sonnet-4-6  13s
anthropic/claude-haiku-4-5  15s
anthropic/claude-opus-4-6  18s
kimi-coding/k2p5  21s
alibaba/qwen3.5-plus  27s
alibaba/qwen3-coder-next  27s
zai/glm-5  37s
minimax/MiniMax-M2.5  37s
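"Counting total fresh IDs from overlapping ranges" reads like the classic merge-overlapping-intervals problem: sort ranges by start, sweep, and sum only the uncounted portions. A sketch of that technique in F# (inclusive int64 ranges are my assumption about the input shape):

```fsharp
// Count distinct integers covered by a list of inclusive (lo, hi) ranges,
// tolerating overlaps, by sweeping in order of range start.
let countCovered (ranges: (int64 * int64) list) =
    ranges
    |> List.sortBy fst
    |> List.fold (fun (total, lastHi) (lo, hi) ->
        // Clip each range to the part not already counted.
        let lo' = max lo (lastHi + 1L)
        if hi < lo' then (total, lastHi)
        else (total + (hi - lo' + 1L), hi)) (0L, System.Int64.MinValue)
    |> fst

// countCovered [ (1L, 5L); (3L, 9L); (20L, 20L) ]  → 10L
```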

Speed vs accuracy

Summary tables

Wall-clock time (seconds)

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
anthropic/claude-sonnet-4-6 17s 28s 30s 29s 18s 20s 20s 18s 23s 13s 216s
anthropic/claude-haiku-4-5 19s 10s 30s 13s 30s 40s 24s 14s 44s 15s 239s
anthropic/claude-opus-4-6 27s 33s 32s 33s 25s 24s 29s 22s 27s 18s 270s
alibaba/qwen3.5-plus 86s 73s 85s 17s 45s 22s 17s 17s 31s 27s 420s
kimi-coding/k2p5 24s 65s 155s 76s 18s 18s 29s 34s 30s 21s 470s
openai-codex/gpt-5.3-codex 20s 315s 37s 16s 26s 14s 18s 19s 16s 12s 493s
zai/glm-5 40s 30s 38s 36s 178s 33s 37s 37s 29s 37s 495s
mistral/devstral-2512 12s 342s 39s 28s 64s 44s 21s 24s 31s 11s 616s
alibaba/qwen3-coder-next 42s 625s 43s 21s 29s 25s 10s 8s 58s 27s 888s
minimax/MiniMax-M2.5 83s 547s 79s 48s 51s 37s 82s 69s 45s 37s 1078s

Output tokens per part

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
openai-codex/gpt-5.3-codex 709 1,248 1,205 553 758 533 547 803 543 517 7,416
zai/glm-5 861 609 841 767 3,630 648 777 666 609 823 10,231
anthropic/claude-sonnet-4-6 750 1,279 1,404 1,503 859 932 855 873 1,184 695 10,334
anthropic/claude-opus-4-6 1,054 1,736 1,438 1,429 961 1,015 1,122 899 1,030 806 11,490
kimi-coding/k2p5 639 2,636 4,824 2,640 677 868 903 1,225 850 616 15,878
anthropic/claude-haiku-4-5 1,559 898 2,339 1,037 2,560 3,326 1,962 1,131 3,864 1,139 19,815
mistral/devstral-2512 618 5,672 3,459 2,511 4,324 2,710 1,560 3,129 2,437 790 27,210
alibaba/qwen3.5-plus 4,919 7,012 6,106 1,138 3,165 1,430 1,198 1,160 2,158 2,161 30,447
minimax/MiniMax-M2.5 2,481 18,006 2,230 1,060 1,720 993 2,007 2,514 991 1,482 33,484
alibaba/qwen3-coder-next 3,355 31,718 2,426 2,066 1,391 1,300 819 803 4,027 1,547 49,452

API cost per part (approximate USD)

Costs are rough approximations based on published per-token pricing. I use subscription plans, so my actual spending is capped regardless.

Model D1P1 D1P2 D2P1 D2P2 D3P1 D3P2 D4P1 D4P2 D5P1 D5P2 Total
kimi-coding/k2p5 .0091 .0122 .0228 .0165 .0100 .0095 .0117 .0114 .0040 .0037 $0.11
openai-codex/gpt-5.3-codex .0291 .0363 .0408 .0168 .0352 .0158 .0187 .0284 .0220 .0133 $0.26
zai/glm-5 .0279 .0186 .0092 .0086 .0720 .0226 .0304 .0219 .0408 .0283 $0.28
anthropic/claude-haiku-4-5 .0352 .0098 .0402 .0114 .0469 .0337 .0372 .0134 .0638 .0175 $0.31
anthropic/claude-sonnet-4-6 .0355 .0349 .0489 .0458 .0361 .0275 .0362 .0288 .0455 .0214 $0.36
mistral/devstral-2512 .0081 .0528 .0502 .0463 .0828 .0538 .0171 .0327 .0268 .0141 $0.38
alibaba/qwen3.5-plus .0892 .0583 .0928 .0345 .0453 .0255 .0116 .0179 .0261 .0318 $0.43
minimax/MiniMax-M2.5 .0696 .1855 .0209 .0132 .0171 .0192 .0435 .0555 .0166 .0282 $0.47
anthropic/claude-opus-4-6 .1711 .0743 .1466 .0668 .1185 .0873 .1628 .0737 .1520 .0798 $1.13
alibaba/qwen3-coder-next .0991 1.1932 .0505 .0435 .0997 .1027 .0210 .0280 .2615 .1149 $2.01

Observations

10/10 completers — zero ejections. F# joins Python, Ruby, and Elm as the only languages in this series where every model solved every part.

claude-sonnet-4-6 — fastest overall at 216s. No single part over 30s, no retries needed: the most consistent performer in this run. ~$0.36 total.

claude-haiku-4-5 — second fastest at 239s and remarkably cheap at ~$0.31. Hit a 10s solve on D1P2 — the single fastest part solve in the entire benchmark. Never needed a retry.

claude-opus-4-6 — the steadiest clock in the field. Every single part between 18s and 33s, never needed a retry. No part was a blowout but none was slow either. The most expensive Anthropic model at ~$1.13.

kimi-coding/k2p5 — cheapest at ~$0.11. That's roughly 10× cheaper than Opus for comparable results. Slow on D2P1 (155s) and D2P2 (76s) but otherwise quick.

gpt-5.3-codex — fewest tokens: 7,416 total for 10 parts. Incredibly concise. Would have been a top-3 finisher on time if not for the 315s D1P2 retry that dragged its total to 493s.

Day 1 Part 2 was the filter. Six models solved it on the first try; four needed retries. It was the only part in the entire F# benchmark where any model gave a wrong answer. Whatever the conceptual shift between Part 1 and Part 2 was, it tripped up the same models that struggle with Part 2 pivots in other languages.

qwen3-coder-next — the most extreme profile. Produced the fastest D4P1 (10s) and D4P2 (8s) solves, but also the most expensive D1P2 at $1.19 and 31,718 tokens after needing three attempts. Total cost: $2.01, the highest in the field.

MiniMax-M2.5 — slowest overall at 1,078s. D1P2 alone took 547s after a retry. But it got there in the end, and its per-token pricing kept costs moderate at ~$0.47.

Cross-language snapshot

Language  Models completing all 10 parts
Python  10/10
Ruby  10/10
Elm  10/10
F#  10/10
Java  9/10
Elixir  7/10
Haskell  7/11
OCaml  5/9
ReScript (run 2)  2/10

F#'s 10/10 was less surprising than Elm's — it's a .NET language with decent representation in training data thanks to the broader .NET ecosystem. Models could reach for imperative patterns when functional ones didn't work, and dotnet fsi provides a frictionless scripting experience. Still, zero ejections across 10 models and 10 parts is a strong result for a language that isn't Python or JavaScript.

Benchmarked on 2026-02-27 using pi as the agent harness.


This post was written with AI assistance.