The Test
Models are given a standardized prompt to build a Tetris CLI game for Linux. The prompt emphasizes playability, good game design, and proper testing. Every model gets the exact same initial prompt and the same chance to succeed.
What We Evaluate
- Writing a simple program - Can it produce working code?
- Picking a reasonable language - Python with curses is common, but unconventional choices are fine if they work
- Simple software architecture - Clean structure without over-engineering
- Ignoring unnecessary information - Not getting distracted by irrelevant details
- Dealing with harness limitations - Working around environment constraints
- Basic spatial reasoning - Piece rotation, collision detection
- TUI design - Clean terminal interface
- File operations & tool calling - Using available tools effectively
- Not overcomplicating things - The KISS principle
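The spatial-reasoning item above is the part models most often fumble. A minimal sketch of what "getting it right" means, in Python since that is the common language choice here; the names `rotate_cw` and `collides` are illustrative, not taken from any model's actual submission:

```python
def rotate_cw(shape):
    """Rotate a piece matrix 90 degrees clockwise: reverse rows, then transpose."""
    return [list(row) for row in zip(*shape[::-1])]

def collides(board, shape, row, col):
    """True if the shape placed at (row, col) overlaps walls, floor, or settled blocks."""
    for r, cells in enumerate(shape):
        for c, filled in enumerate(cells):
            if not filled:
                continue
            br, bc = row + r, col + c
            if bc < 0 or bc >= len(board[0]) or br >= len(board):
                return True   # outside the playfield (walls or floor)
            if br >= 0 and board[br][bc]:
                return True   # overlaps an already-settled block
    return False
```

A submission that nails these two functions, and keeps them this small, tends to score well on both the spatial-reasoning and KISS criteria.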
The Ratings
Rankings are subjective ("vibes-based") but grounded in actual performance:
- Code Quality - How clean, correct, and maintainable is the result?
- Prompt Efficiency - How many back-and-forths did it take to get working code?
- Experience - How pleasant was the interaction?
- Overall - Manually assigned based on the above
Philosophy
This is a real-world benchmark. Models are tested as development tools, not as quiz-takers. The question isn't "can it regurgitate trivia?" but "can it build something useful on the first try?"
Tiers
- S - Excellent
- A - Very Good
- B - Good
- C - Average
- D - Below Average
- F - Fail