This is a second run of the AoC 2025 LLM benchmark, with stricter rules. The same 10 models solved the same 5 days (10 parts) across the same 12 programming languages, but with two changes:
- No retries. A wrong answer or timeout results in immediate ejection from that language. No nudges, no second chances.
- No language-specific scaffolding. No system prompts teaching syntax (as was done for ReScript run 3 in the previous benchmark). Every model receives the same prompt regardless of language.
The previous benchmark allowed retries during the run and only applied strict scoring retroactively in the recap. This run enforces strict mode at execution time: an ejected model never gets another attempt in that language.
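Concretely, strict mode means the harness treats each (model, language) pair as eliminated on its first failure. A minimal sketch of that ejection logic follows; the function and variable names here are illustrative, not the actual benchmark harness:

```python
# Minimal sketch of strict-mode ejection: one wrong answer or timeout
# on any part permanently removes the model from that language.
# All names are illustrative; this is not the real harness code.

def run_strict(models, languages, parts, attempt):
    """attempt(model, language, part) -> True on a correct answer
    within the time limit, False on a wrong answer or timeout."""
    ejected = set()   # (model, language) pairs knocked out
    solved = {}       # (model, language) -> number of parts solved
    for model in models:
        for language in languages:
            solved[(model, language)] = 0
            for part in parts:
                if not attempt(model, language, part):
                    ejected.add((model, language))
                    break  # no retries, no nudges: ejection is final
                solved[(model, language)] += 1
    return solved, ejected

# Toy usage: a fake solver that fails "hardlang" from part 3 onward.
def fake_attempt(model, language, part):
    return not (language == "hardlang" and part >= 3)

solved, ejected = run_strict(["m1"], ["easylang", "hardlang"],
                             range(1, 11), fake_attempt)
print(solved[("m1", "easylang")])     # 10 parts solved
print(("m1", "hardlang") in ejected)  # True: ejected at part 3
```

The key design point is the `break`: under the previous benchmark's rules the loop would instead retry the failed part, and strictness was only reconstructed afterward in scoring.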