If you give a frontier model the complete ruleset for a strategy game, can it derive a winning strategy from first principles?
I wanted to test how well Sonnet 4.6 plays Small World, a 2009 multiplayer strategy game about conquering territory with fantasy races and special powers. My colleagues and I play it a lot at work, every game is close, and I've long wondered how the models would perform against each other, and whether their inherent reasoning capabilities generalize to the strategies a successful campaign needs.
Small World is a territory-control board game where 2–5 players cycle through fantasy race and power combos, conquering regions on a fixed map and scoring coins each turn for what they hold. Games last a set number of turns (10 for three players). The mechanic that matters most for this post is the decline: on any turn, instead of expanding, you can flip over your active race. Those regions keep scoring passively, and on your next turn you pick an entirely new combo from the row. The best positions in the game come from stacking a fresh active race on top of a productive decline race. Knowing when to decline is the single decision that separates experienced players from beginners. It's also, as it turns out, the single decision that separates good model play from bad.
I ran five matched-seed games with three identical Sonnet 4.6 instances per game. There is no capability difference between players; they used the same model, prompt, tools, and compute budget per turn. Any score spread is pure decision variance from the same model. The spreads were enormous at first, and then I added thirty lines of strategic scaffolding to the prompt, reran the same five seeds, and the variance dropped significantly.
I also wanted to test whether the models can close the gap between understanding a ruleset or spec and knowing when to question parts of it or update their assumptions. The locality and action bias the models show in this strategy game translate directly to long-horizon software engineering and knowledge work tasks.
Much of this post assumes you know the rules of Small World. If you are unfamiliar, please reference the short rulebook. You can also skip to the Commentary section for a broader message about lessons that apply to the models in other domains.
The link to the code is here.
Environment Setup
I had a few coding agents (each correcting the others' mistakes until they converged on a working game engine) build a small, deterministic Small World game engine from a spec, the ruleset, and requirements for observation and action APIs that let the models take game actions 1The first version the agents converged on had broken redeploy mechanics, incorrect Skeleton bonuses, and didn't route displaced-player redeployments correctly. I ran an initial set of games on the broken engine, found the bugs through gameplay analysis, fixed them, and re-ran. The scores in this post are all from the fixed engine.. The model only receives observations and indexed legal actions through an MCP server; the engine validates every action against the current phase, applies race and power hooks, records the action log, and can replay the entire game from the original seed. The implementation details are in the appendix, but the important setup constraint is simple: the model can reason about decisions, but it cannot bypass the rules engine. I could write an entire blogpost on what the coding agents got wrong and got right, but at the end of the day, a long-horizon task like this still needed some human intervention.
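To make the "cannot bypass the rules engine" constraint concrete, here is a sketch of the validate-apply-log loop and the replay guarantee; the names (SmallWorldGame, legal_actions, apply, state) are simplified stand-ins for the real API:

```python
from my_engine import SmallWorldGame  # hypothetical import, stands in for the real engine

def replay(seed: int, actions: list[dict]) -> SmallWorldGame:
    """Rebuild a game from its seed plus its recorded action list."""
    game = SmallWorldGame(seed=seed)  # seeded RNG drives map gen, dice, combo row
    for action in actions:
        # Every action must appear in the current phase's legal-move list;
        # race/power hooks run inside apply(), and the action log grows as we go.
        assert action in game.legal_actions()
        game.apply(action)
    return game

# Determinism: the same seed plus the same log always reproduces the same game.
# replay(42, log).state() == replay(42, log).state()
```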
First Games
I wanted to test the model's capability to learn game strategies from first principles given just the ruleset, and to establish the floor of the model's play. In games I've played in the past, new human players picked up strategies like early declines and strategic redeployment almost instantly. Each model was given the same set of instructions (see Appendix C). One limitation of the setup: the model cannot collaborate with other players and cannot learn from previous games (both areas to explore in the future).
Here's an example of one of the games. All game replays are in Appendix B.
Round 1 scores
| Game | Winner | 2nd | 3rd (floor) | Spread |
|---|---|---|---|---|
| 1 | P1: 128 | P0: 85 | P2: 66 | 62 |
| 2 | P0: 98 | P1: 98 | P2: 77 | 21 |
| 3 | P1: 101 | P2: 83 | P0: 80 | 21 |
| 4 | P0: 109 | P2: 92 | P1: 77 | 32 |
| 5 | P0: 105 | P1: 76 | P2: 74 | 31 |
The models have solid local tactics: within a turn, they enumerate and price out many different conquest lines. In Game 1 Turn 1, P0 correctly identifies Ratmen + Dragon Master as the best slot and immediately executes on the Dragon's hidden value by placing it on a region with a mountain and a lost tribe for free. Across games, the instances readily compute a "tokens per coin spent" metric to value the combo row. And without any explicit mention of ordering conquests by token spend, their conquest sequences consistently follow the most efficient order for that conquest step, effectively maximizing the token reward.
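As a rough illustration of that pricing, here is a back-of-the-envelope version with made-up numbers; the fields and the value heuristic are mine, not anything the model or engine exposes:

```python
def combo_value(tokens: int, est_coins_per_turn: float,
                turns_left: int, tribute_cost: int) -> float:
    """Expected coins over the race's useful life, net of tribute, per token."""
    return (est_coins_per_turn * turns_left - tribute_cost) / tokens

row = [  # (combo, tokens, est. coins/turn, tribute cost = slot index)
    ("Ratmen/Dragon Master", 13, 6.5, 0),
    ("Skeletons/Wealthy",    11, 5.5, 1),
    ("Giants/Forest",        11, 5.0, 2),
]
best = max(row, key=lambda c: combo_value(c[1], c[2], turns_left=9, tribute_cost=c[3]))
print(best[0])  # with these numbers, Ratmen/Dragon Master wins the row
```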
The floor is also not as low as I expected. In an earlier run on the broken engine, decline reasoning was weaker; once I fixed the race and power rules (Skeleton bonuses, Spirit decline stacking), declines became more rewarding under the correct implementation, and decline behavior emerged on its own.
A few patterns emerge from the unsuccessful players across all 5 games:
- Players who never decline or decline too late consistently lose (there are exceptions where a race genuinely becomes very powerful and no other players decline)
- The model still defaults to expansion even when it hits skip conquest multiple turns in a row
- The player with the most zero-conquest expands tends to lose (I'll expand on this later)
- Reasoning traces show the same local justification (e.g. "I'll hold my 7 well-defended regions") without any sort of counterfactual analysis
Skip Conquest and Action Bias
You would think that if a model were given a clear signal that multiple turns of expanding led to skip conquests, it would stop trying to expand, actually look at the board state, and figure out another strategy. When I play, even one turn with zero coins from conquest makes me step back and conclude that I should probably decline. So I counted the turns where a player expanded but made zero conquests.
The player with the most zero-conquest expands finishes either last or second. The model is choosing to expand, discovering that it can't do anything, and sitting there, sometimes for three or four turns in a row.
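The count comes from a single pass over the replay logs, roughly like the sketch below; the log fields (player, turn, type) are simplified assumptions about the real format:

```python
from collections import Counter

def zero_conquest_expands(action_log: list[dict]) -> Counter:
    """Per player: turns where EXPAND was chosen but nothing was conquered."""
    turns: dict = {}  # (player, turn) -> {"expanded": bool, "conquests": int}
    for entry in action_log:
        t = turns.setdefault((entry["player"], entry["turn"]),
                             {"expanded": False, "conquests": 0})
        if entry["type"] == "EXPAND":
            t["expanded"] = True
        elif entry["type"] == "CONQUER":
            t["conquests"] += 1
    counts = Counter()
    for (player, _), t in turns.items():
        if t["expanded"] and t["conquests"] == 0:
            counts[player] += 1
    return counts
```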
This points to an action bias: the model commits to a path and has a hard time considering alternatives. Decline frequency is a useful proxy for this bias.
There is an average of ~1.4 declines per player per game, which is lower than in an average competitive human Small World game. Specific examples of this failure pattern appear across the games.
- Game 1, P2 (66 coins, lowest): P2 expanded five consecutive turns before declining on turn 8. P2 plans to conquer regions 15 and 11, but after expanding, the legal actions don't match its plan, and it hits skip twice, making zero conquests that entire turn. The model's reasoning shows "with 5 turns remaining, I'm holding 7 well-defended regions". A good player would ask: would declining and picking a fresh combo give me more regions than sitting here with a race that can't conquer any?
- Game 2, P2 (77 coins, lowest): P2 expanded seven times and declined once. Multiple turns show expand, skip, skip, then redeploy, which is not a great strategy. The model chooses to continue with this race, writing "I see the legal actions show only skip conquest or final conquest to region 19 (cost=5, I only have 3 tokens). I'll skip." The model reads this as a cue to redeploy defensively; a human would ask whether it is time to decline and look at the combo row.
- Game 4, P2 (92 coins, second): This example is a counter-case for the failure mode, with a caveat. P2 played Halflings/Dragon Master for the entire game, never declined, and still came second. Why didn't the no-decline pattern cause an issue here? Halflings' Hole-in-the-Ground markers make every region immune to conquest, and Dragon Master provides a free conquest each turn. The model got lucky that its race-power combo kept producing successful moves; the failure mode is really "zero declines despite the race being dead", and this race never died. There's no point in declining a constantly successful race.
Intervention
The prompt used to guide the agents in the first set of games was effectively just a ruleset. It provided no strategic information, which is the cleanest test of whether the models can grok that information by themselves. For the intervention, I added two sections to the prompt: a strategy section with common successful Small World strategies, and a decision template for choosing between expand and decline. Specifically, the template asks the model to compute a coin projection for both branches before deciding, rather than making a purely qualitative comparison, which might activate the action bias even more.
I wanted to measure a few things, specifically around the floor and ceiling of scores. If the floor rose, I wanted to look into decline timing and how many declines per game each model made. The coin-projection template might also steer the model in the wrong direction by amplifying its mistakes if the math behind a decision is off, though I hoped the model would correct itself once it got feedback from the engine as it submitted actions. Behaviorally, we might even see deception similar to this, where the model writes fake expand coin projections that preserve its action bias.
Second Games
Round 2 scores
| Game | Winner | 2nd | 3rd (floor) | Spread |
|---|---|---|---|---|
| 1 | P1: 106 | P2: 91 | P0: 89 | 17 |
| 2 | P0: 102 | P2: 100 | P1: 96 | 6 |
| 3 | P1: 94 | P0: 88 | P2: 80 | 14 |
| 4 | P2: 116 | P0: 105 | P1: 86 | 30 |
| 5 | P0: 104 | P1: 93 | P2: 85 | 19 |
The spread between the ceiling and floor shows a significant improvement with the decision template. The variance collapsed: all three players in each game played at a similar level, and none fell into the decision trap.
Zero-conquest expand turns dropped to almost zero across the board. There is also a meaningful increase in declines (40 across all games), an average of ~2.7 per player per game.
Across games with the same seed, the reasoning quality was noticeably higher in the new set of games. For example, in Game 4 Turn 2, P2 did the correct thing and computed coin projections:
Decision: Expand - building a 10-region decline base vs. 7 is worth +3 coins/turn x 7 turns = +21 coins. The decline can wait one more turn.
Even when the model chooses to expand, it does so with a multi-turn coin projection that shows the opportunity cost. When the math doesn't work out the next turn, the reasoning catches this and decides to decline. Effectively, the projections turn each decision into maximizing expected coins over future turns. The projections occasionally contain math errors, but the model self-corrects when the engine gives feedback.
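The arithmetic these projections reduce to is simple; the expand example above works out like this (my simplification, not the engine's scoring code):

```python
def branch_coins(regions_per_turn: list[int], bonus: int = 0) -> int:
    """Coins over the projection horizon: 1 coin per region per turn, plus bonuses."""
    return sum(regions_per_turn) + bonus

turns_left = 7
expand  = branch_coins([10] * turns_left)  # grow the decline base to 10 regions
decline = branch_coins([7] * turns_left)   # lock in 7 regions now
assert expand - decline == 3 * turns_left  # the "+3 coins/turn x 7 turns = +21" above
```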
Another example is:
DECLINE branch (→ pick Ghouls/Wealthy): T3: ~5 Giants decline regions + 4-5 Ghouls conquests + +7 Wealthy bonus = ~16-17 coins! Long-term: Ghouls remain in decline AND conquer, extremely powerful. Total over next 2 turns: ~22-23 coins.
Decision: DECLINE — The Ghouls/Wealthy combo gives me a massive T3 spike (+7 Wealthy) plus Ghouls' unique ability to conquer even in decline. This is the highest-ceiling play at turn 2 of 10, and it helps me catch P0 who is already ahead.
There are three things we didn't see in the previous set of games: (1) projecting specific coin totals two turns out, (2) identifying the combo it will pick after declining, and (3) framing the decline as "catching the leader", which is the first form of opponent-aware reasoning I've seen.
| Metric | Round 1 (rules only) | Round 2 (template) |
|---|---|---|
| Median floor | 77 | 86 |
| Median winner | 105 | 104 |
| Median spread | 31 | 17 |
| Avg declines/player | 1.4 | 2.7 |
| Zero-conquest expands (total) | 25 | 1 |
One risk I flagged before running the template games was that the model might write biased projections, i.e. filling out the EXPAND/DECLINE template with numbers that justify whatever it was going to do anyway. I checked for this across all 10 games and didn't find it. The projections appear to be honest math. When EXPAND wins on the numbers, the model expands. When DECLINE wins, it declines.
Examples
- Old Game 2, P2, Turn 5 (scored 77, lowest): "I see the legal actions show only SKIP_CONQUEST or FINAL_CONQUEST to region 19 (cost=5, I only have 3 tokens). Let me check if the final conquest on region 19 is worth the risk — it would need a die roll of 2+ (50% chance). I'll skip the risky final conquest and redeploy my 3 tokens defensively." This is the old prompt failure mode where we don't see the counterfactual.
- New Game 1, P0, Turn 2 (same seed where P0 never declined): "EXPAND branch: But Wealthy is DEAD — no upside left with Skeletons, and I'll be stretched thin with ~7 regions and ~3 tokens average. Eventually must decline anyway, just delaying. Strong decline signals hit: Wealthy's one-shot bonus already collected; 8 turns remain; no decline race yet; 0 tokens in hand — nothing to expand with unless I pick up 5". The sentence "Wealthy is DEAD — no upside left" is the model's own reasoning.
- New Game 1, P2, Turn 2 (Spirit Decline): "DECLINE branch (Spirit = FREE decline, doesn't count against limit!): Spirit power's core value: run two races simultaneously for 8 remaining turns. Verdict: DECLINE is clearly correct. Spirit's free decline is the entire value proposition of this combo — I must use it." The model recognizes it picked Spirit for the purpose of declining and that not declining would waste the power's entire purpose. This is deep reasoning about why the combo was picked in the first place, and shows the first signs of backwards-looking reasoning.
- Stout Decline: New Game 1, P1, Turn 6: "EXPAND branch (with Stout decline at end): - Conquer region 9 (P0 decline, cost 3) then region 5 (P0 decline, cost 3) = exactly 6 tokens spent - This turn: 5 Orc regions + 2 Orc bonus (non-empty conquests) + 11 Ghoul regions = 18 coins - Orcs decline (Stout!) → 5 regions × 4 remaining turns = 20 future coins" P1 is exploiting Stout power's unique mechanic: you can expand and decline on the same turn. The model plans a two-conquest expansion specifically targeting the leader's decline regions, calculates the exact token math, projects scoring for the turn, and then uses Stout to decline with the expanded territory. This is the only case I saw where the model makes Stout the crux of the turn plan rather than using it reactively. 2Stout is the most complex timing interaction in Small World. It lets you expand AND decline on the same turn — so you can conquer new territory, add it to your decline base, and then flip everything face-down. Planning a turn around this mechanic requires understanding that your current conquests will count as decline income, not active income. The model got this exactly right without being told.
- New Game 4, P0, Turn 2: "Decision: EXPAND — Dragon Master still has major upside. I can attack P2 (the coin leader) using Dragon on their region 13, weaken their position, and reach 9 regions. Then decline in T3." The expand decline decision is informed by who the model believes is leading. The model expands because it can use this turn to attack the leader before declining next turn. This is a great form of positional judgement.
Leader Targeting and Deception
Understanding who has strategic command of the board is an important strategy in Small World. These new runs show explicit leader-targeting in the reasoning traces. Game 4 P0 Turn 2: "I can attack P2 (the coin leader) using Dragon on their region 13." Game 2 P1 Turn 2: "it helps me catch P0 who is already ahead." Game 4 P1 late-game: "P2 is the scoring leader with 13 regions! I'll target P2's border regions where possible."
Since the coins are hidden 3This is faithful to the physical game; you hide your coin stack behind a screen or by flipping the coins over. When it writes "P0 is the coin leader" it's inferring from territory count, known power bonuses (Wealthy's +7, Merchant's per-region bonus), and decline region counts., the model infers the leader from the visible board position, just as a human player would. Leader-targeting also feeds into decline decisions. In Game 2, P1 explicitly frames declining as a way to "catch P0 who is already ahead," which is emergent behavior compared to the original runs.
Across all 10 games, I didn't find any signs of deception. This makes sense for three reasons: Small World is mostly perfect-information (the board state is visible, though coins are hidden), there's no communication channel between instances, and the model doesn't model other players' beliefs. It predicts what they'll do (P2 correctly predicts P0 and P1 will pick combos 0 and 1 from the row after declining) but never reasons about what they think or expect. Deception requires modeling your opponent's model of you, and that recursive structure doesn't appear anywhere in the games.
These instances play next to each other, not against each other. Each one runs its own optimization against a board that happens to change between turns. There's no template that can force second-order theory of mind into existence if the model doesn't generate it spontaneously. I’d want to test with games that have hidden information and direct communication before drawing strong conclusions about theory of mind.
Commentary
Sonnet 4.6 is a very capable Small World player; it just needs a nudge. The rules of Small World contain everything a player needs to derive optimal decline timing: Wealthy fires once, declined races keep scoring, fresh combos bring fresh tokens, and the game has a fixed number of turns. A sufficiently capable reasoner reading those rules should derive that decline is a scoring tool, not a surrender.
Sonnet 4.6 can't quite do this yet when given only the rules. It understands them, has excellent tactical reasoning, and can reason about decline; it just doesn't activate that knowledge at the right moment. I call the missing piece strategic attention: the reflex to pull the right reasoning framework into active context during gameplay. Humans show this capability by reassessing the situation constantly without being prompted. Humans have action bias too; we don't reason from first principles on every turn, but we can pull in the right mental model when things stop going to plan, which Sonnet 4.6 couldn't do on its own.
Current LLMs are structurally better at executing plans than at abandoning them. Any task that requires recognizing "the thing I'm currently doing is no longer the best thing to do" is going to be hard for these models. In Small World this shows up as the decline failure. In agentic coding, it shows up as failing to notice that a debugging path isn't working and to start over. The FrontierSWE benchmark found exactly this pattern with Opus 4.6 on a Pyright optimization task: the model identified the key bottleneck and achieved its best speedup within 11 minutes, then kept iterating for seven more hours across 95 builds, at one point losing the optimization entirely and dropping back to baseline before independently rediscovering the same approach. If it had stopped at minute 11, it would have scored the same. The model can't recognize that its current trajectory has stopped being productive, so it grinds forward instead.
Due to budget, I couldn't test larger reasoning models like Opus 4.7 or GPT 5.5, and we might see better gameplay capabilities out of the box from them. However, you can get quite a capable Small World player out of a cheaper, smaller model by scaffolding it the right way so that it activates latent reasoning and decision-making capabilities. The catch is that the model's user has to carefully craft the instructions to activate the right reasoning traces. As models get better, pass@1 climbs higher and higher, but the smaller models already have a pretty high pass@k 4pass@k here means: if you sampled k independent reasoning traces from the model for the same decision, at least one of them would contain the correct strategic insight. The template doesn't make the model smarter; it makes the first sample more likely to contain the insight that already exists in the distribution.. It takes serious craft to bring out the model's reasoning traces and raise its pass@1 without any further post-training.
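For intuition, under the (strong) assumption that the k traces are independent, each containing the insight with probability p:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent samples contains the strategic insight)."""
    return 1 - (1 - p) ** k

# If the "time to decline" insight shows up in 30% of sampled traces,
# one sample finds it 30% of the time, but five samples find it ~83%.
assert round(pass_at_k(0.30, 5), 2) == 0.83
```

The template's job, in this framing, is to raise p for the one trace you actually sample.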
This also sets realistic expectations. A model that needs scaffolding to play Small World well isn't going to spontaneously play more complex strategy games well, either. It'll need scaffolding for those too, and the scaffolding will have to be hand-designed for the specific decision types that matter. There's no general "strategy scaffolding" that makes LLMs good at strategy games; there are only game-specific checkpoint structures that force the model to invoke its game-specific latent knowledge at the right moments. The model is not being dumb; the current architecture just doesn't support the kind of emergent self-monitoring that humans use to play strategy games over time. However, the gap between "can't play" and "can play with scaffolding" is smaller than it sounds, and most of the engineering work is about building scaffolding that's lightweight, general, and composable enough to apply across domains without hand-crafting a template for every decision.
Sonnet 4.6 is not a strategy-game player. Sonnet 4.6 plus a well-designed decision template is a strategy-game player. And the interesting question is whether the distance between those two things is closable without requiring a human to sit down and write a new template every time the domain changes.
Aside: RL for strategic attention
I read Kalomaze's blogpost about RL for knowledge awareness, which I found pretty interesting. Effectively, it describes using a reward model trained on existing representations to teach a model when to say "I don't know", shifting behavior toward calibrated knowledge checks without explicit prompting. The same direction could work here: I've already shown the model knows the correct decline decision. You could collect a dataset of previous game moves, have a frozen model predict the next strategic move (expand or decline), train an RM head to score those decisions, and then post-train the model to match the distribution of good declines and good expands. That would internalize the counterfactual check; the template is the weak inference-time equivalent of this.
Appendix A: Environment Setup
A single SmallWorldGame orchestrates everything, but all the game state is stored in GameState which holds the map, players, current phase, turn number, scoring track, and combo row. The map is randomly generated with terrain types, special features (mountain, magic, cavern, mine), and an adjacency graph. Every random decision flows through a seeded RNG, and every action the engine accepts is appended to an action log, so the game is fully replayable from its seed plus its action list.
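A minimal sketch of that container, with illustrative field names rather than the engine's exact schema:

```python
from dataclasses import dataclass, field
import random

@dataclass
class GameState:
    seed: int
    turn: int = 1
    phase: str = "PICK_COMBO"
    players: list = field(default_factory=list)    # tokens, race/power, coins
    regions: dict = field(default_factory=dict)    # id -> terrain, features, owner, tokens
    adjacency: dict = field(default_factory=dict)  # id -> neighboring region ids
    combo_row: list = field(default_factory=list)  # 6 visible race+power combos
    action_log: list = field(default_factory=list) # replay = seed + this list

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)  # every random decision flows through this
```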
The engine is a phase state machine. A turn walks through the same sequence: PICK_COMBO (only on the player's first active turn with no race in play), then EXPAND_OR_DECLINE, then CONQUEST (one or more region attacks, using tokens, dice, or abandonments), a FINAL_CONQUEST die roll when the player is out of legal moves, REDEPLOY (scatter remaining tokens across your regions), and the final SCORE and TURN_END where the engine advances automatically. Declining works the same way but shorter: you DECLINE then score.
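A minimal encoding of that sequence (the names mirror the prose above, not necessarily the engine's internals):

```python
from enum import Enum, auto

class Phase(Enum):
    PICK_COMBO = auto()
    EXPAND_OR_DECLINE = auto()
    CONQUEST = auto()
    FINAL_CONQUEST = auto()
    REDEPLOY = auto()
    SCORE = auto()
    TURN_END = auto()

# Happy-path transitions; choosing DECLINE short-circuits straight to scoring.
NEXT = {
    Phase.PICK_COMBO: Phase.CONQUEST,
    Phase.EXPAND_OR_DECLINE: Phase.CONQUEST,  # or Phase.SCORE on DECLINE
    Phase.CONQUEST: Phase.FINAL_CONQUEST,     # once no legal conquest remains
    Phase.FINAL_CONQUEST: Phase.REDEPLOY,
    Phase.REDEPLOY: Phase.SCORE,
    Phase.SCORE: Phase.TURN_END,
}
```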
Each phase has a small set of legal actions: PickCombo, Conquer, RollReinforcements, Abandon, StartRedeploy, Redeploy, EndTurn, plus a handful of ability-triggered actions like BerserkRoll, UseDragon, and SorcererConvert. Races and special powers register behavior through a decorator-based hook registry, so race and power modifiers compose cleanly onto actions.
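The registry pattern looks roughly like this; the event name, signatures, and the Commando example are assumptions about how such hooks compose, not the engine's real API:

```python
HOOKS: dict[str, list] = {}

def hook(event: str, race: str | None = None, power: str | None = None):
    """Decorator: register a modifier that fires on `event` for matching combos."""
    def register(fn):
        HOOKS.setdefault(event, []).append((race, power, fn))
        return fn
    return register

@hook("conquest_cost", power="Commando")
def commando_discount(cost: int, region: dict) -> int:
    return max(1, cost - 1)  # Commando conquers every region for 1 less token

def conquest_cost(base: int, region: dict, player: dict) -> int:
    """Fold every matching hook over the base cost for this player's combo."""
    cost = base
    for race, power, fn in HOOKS.get("conquest_cost", []):
        if race in (None, player["race"]) and power in (None, player["power"]):
            cost = fn(cost, region)
    return cost

# A region costing 2 + 2 defenders drops to 3 tokens for a Commando player.
assert conquest_cost(4, {}, {"race": "Orcs", "power": "Commando"}) == 3
```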
I created a Small World MCP server that exposes four tools the models use to observe the game and act on it.
- get_observation returns the current player's view of the game: their tokens, race/power, the combo row with tribute costs, every region's owner/terrain/tokens, and the list of legal actions
- submit_action takes a single action dict and applies it. The engine validates it against the phase's legal move list, runs all ability hooks, advances the phase, and returns a new observation
- submit_redeploy is a batch version of the redeploy phase
- game_status is a cheap read-only tool that the model can call to see whose turn it is and what phase is active
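Wiring these up with the MCP Python SDK's FastMCP helper looks roughly like the sketch below; the engine object and return payloads are placeholders:

```python
from mcp.server.fastmcp import FastMCP
from my_engine import SmallWorldGame  # hypothetical import

mcp = FastMCP("smallworld")
engine = SmallWorldGame(seed=42)

@mcp.tool()
def get_observation() -> dict:
    """Current player's view: tokens, race/power, combo row, regions, legal actions."""
    return engine.observation()

@mcp.tool()
def submit_action(action: dict) -> dict:
    """Validate against the phase's legal moves, run hooks, advance, return the new view."""
    engine.apply(action)
    return engine.observation()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```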
The instructions given to the models tell the agent to call get_observation, choose one of the listed legal actions, submit it, and repeat until its turn is over.
For each turn in a game, a fresh ClaudeSDKClient is spun up instead of maintaining a long-lived conversation (old get_observation and action tool results aren't needed). Simpler decisions route to Haiku 4.5 and harder decisions route to the main model 5Haiku handles conquest execution (picking from a list of legal targets) and redeployment. Sonnet handles combo picks, expand/decline decisions, and final-conquest die roll choices. The routing is based on phase, not outcome — so Sonnet makes all the strategic calls.. A rolling turn_history summary is passed to the model so each new client has gameplay continuity. If a model ever no-ops, we fall back to the first legal action (though I noticed the model never no-oped during games) 6This surprised me. I expected at least occasional tool-call failures or malformed action dicts, but across 10 games (30 player-sessions, ~300 turns total) the model never produced an invalid action. The rules engine rejected nothing.. The global game state is tracked, and games deterministically end based on turn count (10 turns for 3 players).
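The per-turn session loop, sketched; ClaudeSDKClient and ClaudeAgentOptions come from the Claude Agent SDK, but the option fields and model ids here are abbreviated placeholders, and RULES_PROMPT stands for the Appendix C prompt:

```python
from claude_agent_sdk import ClaudeSDKClient, ClaudeAgentOptions

STRATEGIC_PHASES = {"PICK_COMBO", "EXPAND_OR_DECLINE", "FINAL_CONQUEST"}
SONNET, HAIKU = "sonnet-model-id", "haiku-model-id"  # placeholder ids

def model_for(phase: str) -> str:
    # Route by phase, not outcome: Sonnet makes every strategic call,
    # Haiku executes conquests/redeploys from the enumerated legal-action list.
    return SONNET if phase in STRATEGIC_PHASES else HAIKU

async def play_turn(phase: str, turn_history: str) -> None:
    options = ClaudeAgentOptions(
        model=model_for(phase),
        system_prompt=f"{RULES_PROMPT}\n\nRecent turns:\n{turn_history}",
    )
    async with ClaudeSDKClient(options=options) as client:  # fresh client each turn
        await client.query("Take your turn using the Small World MCP tools.")
        async for _message in client.receive_response():
            pass  # stream tool use; on a no-op turn, fall back to the first legal action
```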
Appendix B: Games
Below is a slideshow of all 10 games, 5 in each round.
The full game replays (including reasoning traces) can be found here under /first_games_v2 and /second_games_v2.
Appendix C: Prompts
You are playing the board game Small World.

## Goal
Score the most coins over a fixed number of turns by conquering map regions with fantasy race + power combos.

## Turn Structure
1. **Pick Combo or Expand/Decline**
   - First turn (no active race): pick a race+power combo from the combo row.
   - Later turns: choose to EXPAND (pick up troops, leaving 1 token in each kept region; you may also abandon some or all regions before conquering) or DECLINE (abandon your active race and score only decline regions next turn).
2. **Conquest** - spend tokens to conquer regions. Cost = 2 + number of defending tokens + defensive markers (mountains, fortresses, etc.). First conquest must be a border/coastal region (unless Flying).
3. **Final Conquest** - one last attempt with a reinforcement die (0-3 bonus).
4. **Redeploy** - redistribute all your tokens across your owned regions (minimum 1 per region).
5. **Special ability phases** - some powers grant extra phases (Bivouacking, Fortified, Heroic, Diplomat, Stout Decline).
6. **Scoring** (automatic) - earn 1 coin per region you own, plus race/power bonuses.

## Declining
Going into decline flips your active race tokens face-down (1 per region). You keep scoring those regions. On your next turn you pick a new combo. You may only have one decline race at a time (Spirit power is the exception).

## Displacement
When you conquer an occupied region, the defender loses 1 token permanently and must redeploy the rest to any region they still own.

## Key Rules
- Combo row shows 6 combos. Picking slot 0 is free; slot N costs N coins placed on skipped combos.
- Each race and power has unique rules listed in the observation.
- The game ends after a fixed number of turns (depends on player count).

## How to Play (as an AI)
1. Call **get_observation** ONCE at the start of your turn to see dynamic state and legal actions.
2. Use **submit_action** for ordinary indexed legal actions.
3. Use **submit_expand** during `CHOOSE_EXPAND_OR_DECLINE` when you want to EXPAND and also choose `abandon_regions`. If you do not need abandonment, ordinary `submit_action` is fine.
4. Use **submit_redeploy** during `REDEPLOY` when you want an exact deployment map.
5. Use **submit_opponent_redeploy** during `OPPONENT_REDEPLOY` when you want exact control over displaced-token placement and/or Bivouacking `encampment_regions`.
6. `submit_action`, `submit_expand`, `submit_redeploy`, and `submit_opponent_redeploy` all return updated legal actions or next state, so you do NOT need to call `get_observation` again between actions.
7. Complete your ENTIRE turn or redeploy obligation in one session.
8. Stop when the response shows next_player is different from you or game_over is true.
9. Be efficient: minimize tool calls. Use the latest tool response instead of calling get_observation repeatedly. The static map (terrain and adjacency) is included below and never changes during the game.