The Test
Models are given a standardized prompt to build a Tetris CLI game for Linux. The prompt emphasizes playability, good game design, and proper testing. Every model gets the exact same initial prompt and the same chance to succeed.
What We Evaluate
- Writing a simple program - Can it produce working code?
- Picking a reasonable language - Python with curses is common, but unconventional choices are fine if they work
- Simple software architecture - Clean structure without over-engineering
- Ignoring unnecessary information - Not getting distracted by irrelevant details
- Dealing with harness limitations - Working around environment constraints
- Basic spatial reasoning - Piece rotation, collision detection
- TUI design - Clean terminal interface
- File operations & tool calling - Using available tools effectively
- Not overcomplicating things - The KISS principle
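The spatial-reasoning item above is the part models most often fumble. A minimal sketch of what "getting it right" means, in Python since that is the common language choice here; the names `rotate_cw` and `collides` are illustrative, not taken from any model's actual submission:

```python
def rotate_cw(shape):
    """Rotate a piece matrix 90 degrees clockwise: reverse rows, then transpose."""
    return [list(row) for row in zip(*shape[::-1])]

def collides(board, shape, row, col):
    """True if the shape placed at (row, col) overlaps walls, floor, or settled blocks."""
    for r, cells in enumerate(shape):
        for c, filled in enumerate(cells):
            if not filled:
                continue
            br, bc = row + r, col + c
            if bc < 0 or bc >= len(board[0]) or br >= len(board):
                return True   # outside the playfield (walls or floor)
            if br >= 0 and board[br][bc]:
                return True   # overlaps an already-settled block
    return False
```

A submission that nails these two functions, and keeps them this small, tends to score well on both the spatial-reasoning and KISS criteria.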
The Ratings
Rankings are subjective ("vibes-based") but grounded in actual performance:
- Code Quality - How clean, correct, and maintainable is the result?
- Prompt Efficiency - How many back-and-forths did it take to get working code?
- Experience - How pleasant was the interaction?
- Overall - Manually assigned based on the above
Philosophy
This is a real-world benchmark. Models are tested as development tools, not as quiz-takers. The question isn't "can it regurgitate trivia?" but "can it build something useful on the first try?"
Tiers
- S - Excellent
- A - Very Good
- B - Good
- C - Average
- D - Below Average
- F - Fail