
I Made 24 of the Best LLMs Engineer Their Own Game Agents. Here's Who Can Actually Code Strategy.

LLM · Benchmarking · Agents · Game Theory · Python · Code Generation

Standard benchmarks are fine for measuring raw knowledge, but they rarely capture what matters in practice: can a model translate strategic intent into clean, executable code? That question is the foundation of the Game Agent Coding League — a framework that pits LLMs against each other through a suite of tactical games, where the model's job isn't to pick a move, but to engineer a player.

All game-by-game leaderboards and combined weighted results are published live at gameagentcodingleague.com, updated after every tournament run.


The Core Idea: Code-as-Agent

Most game-playing benchmarks prompt a model for every single move. This framework takes a different approach — a hybrid generation-and-reasoning model:

  1. The LLM receives a full set of game rules and a coding interface.
  2. It generates a standalone Python class — an "Agent" — that encapsulates its entire strategy.
  3. That agent is deployed into an arena to play hundreds of matches autonomously.

This design tests something deeper than move selection. It tests whether a model can write bug-free, strategically coherent code under ambiguous constraints.
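
What such an agent looks like depends on each game's interface file, but conceptually the deliverable is a single self-contained class. Here is a minimal sketch for the 5×5 TicTacToe game; the class name, constructor, and board encoding are illustrative assumptions (only make_move is implied by the crash metrics discussed later):

```python
import random

# Minimal sketch of a generated agent for the 5x5 TicTacToe game.
# The class name, constructor, and board encoding are assumptions;
# the real contract is specified per game in the rules files.
class Agent:
    def __init__(self, symbol: str):
        self.symbol = symbol  # 'X' or 'O', assumed to be supplied by the match runner

    def make_move(self, board: list[list[str]]) -> tuple[int, int]:
        """Return (row, col) for the next move on a 5x5 board ('' = empty)."""
        empty = [(r, c) for r in range(5) for c in range(5) if board[r][c] == ""]
        # A real generated agent would score candidate moves (e.g., extend or
        # block lines); this placeholder simply plays a random legal move.
        return random.choice(empty)
```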


Project Architecture

The framework manages the full lifecycle of an AI competitor: from generation, through competition, to scoring.

1. Game Definitions & Config

Everything begins in the games/ and config/ directories.

  • Game Definitions: Plain text files (e.g., A1-Battleship.txt) describe the full rules, state format, and coding requirements for each game.
  • Model Registry: config/models.txt lists every participating LLM — from GPT-5 to Gemini 3.1 Pro.
  • Token Management: config/max_tokens.txt provides per-game token multipliers, giving complex games like Wizard the context they need (a loading sketch follows this list).
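
As a rough illustration of this configuration layer, the loading step might look like the sketch below. The file formats (one model per line, game-id/multiplier pairs) are assumptions for illustration, not the project's documented layout:

```python
from pathlib import Path

# Hypothetical loaders for the config files described above.
# The file formats assumed here are illustrative only.
def load_models(path: str = "config/models.txt") -> list[str]:
    """Read the model registry, assuming one model identifier per line."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

def load_token_multipliers(path: str = "config/max_tokens.txt") -> dict[str, float]:
    """Read per-game token multipliers, assuming 'game_id multiplier' pairs."""
    multipliers: dict[str, float] = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) == 2:
            game_id, factor = parts
            multipliers[game_id] = float(factor)
    return multipliers
```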

2. Agent Generation (utils/populate_agents.py)

This is the factory. Using model_api.py (an OpenRouter wrapper), the script sends game rules to each LLM, extracts the returned Python code, sanitizes it, and saves it into the agents/ folder. Each model can produce multiple "runs," which accounts for the inherent randomness in generation.
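
In outline, one generation pass looks something like this; the call_model helper and the output naming scheme are assumptions standing in for the real model_api.py wrapper:

```python
import re
from pathlib import Path

# Sketch of one generation pass. call_model stands in for the OpenRouter
# wrapper in model_api.py; its signature and the agents/ naming scheme
# below are assumptions, not the project's actual code.
def extract_python(response: str) -> str:
    """Pull the first fenced Python block out of an LLM response."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response  # fall back to the raw reply

def generate_agent(call_model, model_name: str, game_id: str, run: int) -> Path:
    rules = Path(f"games/{game_id}.txt").read_text()      # e.g. games/A1-Battleship.txt
    reply = call_model(model=model_name, prompt=rules)    # one LLM call per run
    code = extract_python(reply)
    out_path = Path("agents") / game_id / f"{model_name.replace('/', '_')}_run{run}.py"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(code)
    return out_path
```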

3. The Arena (game_scripts/)

This directory contains the actual game engines.

  • Match Runners: Each game has a dedicated script (e.g., A2-tictactoe_match.py) that loads two agents, alternates turns, enforces time limits, and handles crashes gracefully.
  • The Matchmaker (matchmaker.py): The tournament orchestrator. It builds round-robin fixtures so every model faces every other model, and runs dozens of matches concurrently via subprocesses (sketched below).
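
A stripped-down version of that orchestration might look like the following; the match runners' command-line contract here is an assumption:

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of round-robin scheduling with concurrent subprocess matches.
# The match runner's CLI (script path plus two agent files) is an assumption.
def run_match(game_script: str, agent_a: str, agent_b: str) -> str:
    result = subprocess.run(
        ["python", game_script, agent_a, agent_b],
        capture_output=True, text=True, timeout=600,   # hard cap per match
    )
    return result.stdout                               # runner prints the result summary

def round_robin(game_script: str, agents: list[str], workers: int = 8) -> list[str]:
    fixtures = list(itertools.combinations(agents, 2))  # every agent vs. every other
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_match, game_script, a, b) for a, b in fixtures]
        return [f.result() for f in futures]
```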

4. Evolutionary Improvement (utils/try_enhancing_agents.py)

This component closes the loop. It follows a simple but effective cycle:

  1. Generate a new version of an agent.
  2. Run it through an extended series of matches against its own previous versions.
  3. Discard the weakest performer.

Over time, only the fittest version of each model's agent survives.
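
Reduced to a few lines of Python, the cycle is roughly the following; generate_new_version and play_series are hypothetical stand-ins for the framework's own generation and match machinery, and only the three-step structure comes from the description above:

```python
# Sketch of the enhance-evaluate-prune loop. generate_new_version and
# play_series are hypothetical callables; their names and signatures are
# assumptions made for illustration.
def evolve(versions: list[str], generate_new_version, play_series, rounds: int = 3) -> list[str]:
    for _ in range(rounds):
        candidate = generate_new_version(versions[-1])         # 1. generate a new version
        pool = versions + [candidate]
        scores = {
            agent: play_series(agent, [rival for rival in pool if rival != agent])
            for agent in pool                                  # 2. extended series vs. earlier versions
        }
        weakest = min(scores, key=scores.get)                  # 3. discard the weakest performer
        pool.remove(weakest)
        versions = pool
    return versions
```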


Scoring & Robustness Tracking

Every game script follows a shared standard defined in global_game_rules.md:

  • Points: A 3/1/0 (Win/Draw/Loss) system across all games.
  • Tie-breakers: A secondary "Score" metric (e.g., ships remaining in Battleship, piece count in Surround Morris).
  • Crash Penalties: The system tracks make_move_crash, other_crash, timeout, and invalid events. Agents that crash or stall are penalized with random moves or forfeits (see the sketch after this list). A model that writes brittle code pays a real competitive price.
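
A simplified view of how a match runner might guard a single move, consistent with the event categories above. The exact mechanism in the real scripts is an assumption, and the Unix-only signal.alarm is used here purely to keep the sketch short:

```python
import random
import signal

# Sketch of guarding one move call. The timeout mechanism is an assumption;
# signal.alarm is Unix-only and chosen only for brevity.
class MoveTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise MoveTimeout()

def safe_move(agent, state, legal_moves, events, time_limit_s: int = 2):
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(time_limit_s)
    try:
        move = agent.make_move(state)
    except MoveTimeout:
        events["timeout"] = events.get("timeout", 0) + 1
        move = random.choice(legal_moves)              # penalty: forced random move
    except Exception:
        events["make_move_crash"] = events.get("make_move_crash", 0) + 1
        move = random.choice(legal_moves)
    finally:
        signal.alarm(0)                                # always cancel the pending alarm
    if move not in legal_moves:
        events["invalid"] = events.get("invalid", 0) + 1
        move = random.choice(legal_moves)              # penalty: illegal move replaced
    return move
```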

The Game Suite

The benchmark currently covers eight distinct challenges, each targeting a different cognitive skill:

ID | Game | Primary Skill Tested
A1 | Battleship | Hidden information & probabilistic search
A2 | TicTacToe (5×5) | Pattern recognition and blocking
A3 | Wizard | Trick-taking, bidding, and 6-player social dynamics
A4 | WordFinder | Vocabulary depth and constraint satisfaction
A5 | Connect4 | Look-ahead logic and spatial positioning
A6 | WordMatrix | Pathfinding and subsequence matching
A7 | 2×8 Mini Chess | Tactical planning in a constrained space
A8 | Surround Morris | Territorial control and friendly-fire avoidance

The diversity is deliberate. A model that excels at probabilistic search (Battleship) may fall apart under social dynamics (Wizard) or struggle with lexical constraints (WordFinder). No single skill dominates the leaderboard.

Here's an example from Connect4 — one of the more revealing games because look-ahead depth is hard to fake with shallow heuristics:

Connect4 performance leaderboard across all participating LLMs

Overall Results

Each game produces its own leaderboard, but the combined weighted score is where the full picture emerges. Models are ranked by aggregated performance across all eight games, penalized for crashes and timeouts:
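
The post doesn't spell out the exact weighting, so the sketch below is only illustrative: per-game points are scaled by a game weight, then reduced by a penalty for each crash or timeout event. The weights and penalty size are assumptions:

```python
# Illustrative aggregation only: the real game weights and penalty sizes
# are not published here, and the defaults below are assumptions.
def overall_score(points_by_game: dict[str, float],
                  weights: dict[str, float],
                  crash_and_timeout_events: int,
                  penalty_per_event: float = 0.5) -> float:
    weighted = sum(points * weights.get(game, 1.0)
                   for game, points in points_by_game.items())
    return weighted - penalty_per_event * crash_and_timeout_events
```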

Overall weighted performance leaderboard across all games and LLMs

The early data is already surfacing some surprising gaps between models that look nearly identical on standard benchmarks. Per-game breakdowns and full match logs are available at gameagentcodingleague.com.