Further Testing of Adversarial Agents: The Value of Judges and Formats


This is part four of a series of posts where I've been exploring adversarial agent systems powered by a local Deepseek-R1 32B. Since my last experiment, I've introduced a judge to the competitive loop and tested different prompt formats. This time I ran 30 rounds for each combination of agent philosophy (systematic/whimsical) and prompt format (markdown/XML). A script containing the prompts and logic is available on my GitHub for you to test and modify as you see fit.

Excalidraw diagram of the latest architecture: adding a judge to the adversarial agent system, with the goal of helping the agents progress to better designs faster.

The New Architecture: Adding a Judge to the Loop

The previous architecture had the Whimsical and Systematic design philosophy agents competing and only sharing their recent thought processes. The new flow introduces a judge that evaluates each design and provides scores. For all iterations after the first, the agents now see:

  • Their competitor's best score received so far
  • The judge's justification for that score, so the agents can choose to follow or ignore certain ideas
  • The competitor's thoughts associated with that best score (not the design itself, to prevent too-direct copying)

This change aims to give the agents more concrete feedback to improve their designs. The judge evaluates each design on three criteria: originality (0-3), development feasibility (0-3), and core mechanic innovation (0-2), for a maximum of 8 points.
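
To make the new loop concrete, here's a minimal sketch of what one judged iteration could look like, with the model calls left as placeholders. The names (`Round`, `best_round`, `agent.generate`, `judge`) and the structure are my own illustration, not the actual code from the repo.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """One agent's result for a single iteration."""
    design: str
    thoughts: str
    scores: dict          # {"originality": 0-3, "feasibility": 0-3, "innovation": 0-2}
    justification: str

    @property
    def total(self) -> int:
        return sum(self.scores.values())  # maximum 3 + 3 + 2 = 8


def best_round(rounds):
    """The agent's highest-scoring round so far, or None before the first iteration."""
    return max(rounds, key=lambda r: r.total, default=None)


def judge(design: str):
    """Placeholder for the Deepseek-R1 judge call; returns (scores, justification)."""
    raise NotImplementedError("send the judge prompt below to the local model")


def run_iteration(agents: dict, history: dict) -> None:
    """One round: each agent designs with its rival's best feedback, then the judge scores."""
    for name, agent in agents.items():
        rival = next(n for n in agents if n != name)
        rival_best = best_round(history[rival])

        # Agents only see the rival's best score, the judge's justification, and the
        # rival's thoughts for that round -- never the design itself, to avoid copying.
        design, thoughts = agent.generate(
            rival_score=rival_best.total if rival_best else None,
            rival_justification=rival_best.justification if rival_best else None,
            rival_thoughts=rival_best.thoughts if rival_best else None,
        )
        scores, justification = judge(design)
        history[name].append(Round(design, thoughts, scores, justification))
```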

The Prompts Used

You are an experienced game industry judge evaluating game design pitches. Your task is to critically and ruthlessly assess the concept below. Do not give the benefit of the doubt—be brutally honest in your evaluation. Score the following game design concept based on these criteria:

Originality Score (0-3): How unique and fresh is the concept?
    - 0 - Entirely derivative: The idea is a blatant rehash of existing concepts with no new twists.
    - 1 - Barely original: The concept offers only trivial or superficial changes to familiar ideas.
    - 2 - Moderately original: The idea incorporates some fresh elements but still leans on established tropes.
    - 3 - Exceptionally original: The concept is innovative and groundbreaking, setting itself apart from all known designs.

Development Feasibility (0-3): How realistic is it for a solo developer to build in 30 days?
    - 0 - Completely unfeasible: The project's scope or technical demands make a 30-day build impossible.
    - 1 - Highly ambitious: The concept is nearly unworkable within 30 days without extreme compromises.
    - 2 - Challenging yet plausible: The idea is doable in 30 days, but only with a tightly focused scope.
    - 3 - Realistically achievable: The project is well-scoped and can be completed within 30 days with standard effort.

Core Mechanic Innovation (0-2): How novel and engaging is the core gameplay?
    - 0 - Generic mechanics: The gameplay is formulaic with no inventive twist.
    - 1 - Some innovation: The core mechanic shows hints of novelty but remains largely conventional.
    - 2 - Highly innovative: The core gameplay is fresh and redefines what players can expect.

Game Design:  
'{design}'

Respond with your scores and a brief justification in this exact JSON format and nothing more:

```json
{
  "scores": {
    "originality": [score],
    "feasibility": [score],
    "innovation": [score]
  },
  "justification": "[a few sentences explaining the scoring]"
}
```
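
Before the scores can be used, the judge's completion needs to be parsed; smaller reasoning models don't always return bare JSON, so it's worth stripping any fence or surrounding commentary and validating the values against the rubric's bounds. A minimal sketch of how that could look (my own illustration, not necessarily what the script in the repo does):

```python
import json
import re

# Score ceilings from the judge prompt above.
BOUNDS = {"originality": 3, "feasibility": 3, "innovation": 2}


def parse_judge_response(raw: str):
    """Pull the scores and justification out of the judge's raw completion."""
    # Grab the outermost {...} block, ignoring any fences or commentary around it.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    payload = json.loads(match.group(0))

    scores = payload["scores"]
    for criterion, ceiling in BOUNDS.items():
        if not 0 <= scores[criterion] <= ceiling:
            raise ValueError(f"{criterion} score {scores[criterion]} is outside 0-{ceiling}")
    return scores, payload["justification"]


# Example: a well-formed response totals at most 8 points.
scores, why = parse_judge_response(
    '```json\n{"scores": {"originality": 2, "feasibility": 3, "innovation": 1}, '
    '"justification": "Solid but familiar."}\n```'
)
print(sum(scores.values()), why)  # -> 6 Solid but familiar.
```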

Markdown vs XML: Does Format Matter for Reasoning Models?

Most prompt guides I've encountered for reasoning models, and Deepseek in particular, recommend keeping instructions short. Since most of the examples I'm seeing are only a sentence or two, I couldn't find any commentary on whether markdown or XML offers a performance benefit once prompts get longer and structure becomes more necessary. In my experience, I tend to prefer XML when I'm working with Claude and markdown when I'm using OpenAI models. Since Deepseek is thought to be quite derivative of some of these big players, I thought it would be interesting to see whether the format affects performance.

I tried both markdown and XML formats for the prompts. The content remained identical; only the formatting changed, like so:

Markdown Format

    # Evaluation Judging Framework
    The winner is determined by...

XML Format

    <evaluation_judging_framework>
    The winner is determined by...
    </evaluation_judging_framework>
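
Mechanically, swapping formats only changes how each section of the prompt gets wrapped; a hypothetical helper like the one below is all it takes (this is an illustration, not the repo's actual code).

```python
def wrap_section(title: str, body: str, fmt: str = "markdown") -> str:
    """Render one prompt section as a markdown heading or an XML tag pair."""
    if fmt == "markdown":
        return f"# {title}\n{body}"
    tag = title.lower().replace(" ", "_")  # "Evaluation Judging Framework" -> "evaluation_judging_framework"
    return f"<{tag}>\n{body}\n</{tag}>"


print(wrap_section("Evaluation Judging Framework", "The winner is determined by...", fmt="xml"))
```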

Performance Over 120 Designs Produced

The bar chart below shows the average scores for each unique agent and format combination across the different evaluation criteria. Each distinct combination produced 30 designs, for a total of 120 designs evaluated:

Detailed Score Progression

To better understand how each agent's performance evolved over time, let's look at the total score progression across all 30 iterations, comparing the four philosophy and format combinations side by side:

Here's what I took away:

  • Score Stability: Scores didn't show consistent improvement over time, so adding more iterations alone won't yield better designs
  • Format Impact: While the dataset is small, markdown seemed to yield higher scores for both agent types
  • Philosophy Differences: Each philosophy has an average score close to the middle-tier value for each category

What About the Two Perfect Scores?

Two perfect scores were achieved, one by the Systematic & markdown agent and one by the Whimsical & XML agent (designs below). These games don't sound boring, but after reading the designs I think they reveal a weakness in the judge's ability to score accurately. For instance, I don't think the Whimsical agent's game "Gravity Flip" could be built in 30 days, based just on the callout to "navigate through levels filled with hazards, enemies, and puzzles, using gravity manipulation." Maybe you could pull it off, but I know the quality wouldn't be there if I tried to build something with that many platforming elements in such a limited time.

**Game Title:** *Gravity Flip: The Quest*  

**Core Mechanic:**  
The player controls a character who can flip gravity with a press of a button, causing everything in the world (including enemies and platforms) to invert. The goal is to navigate through levels filled with hazards, enemies, and puzzles, using gravity manipulation as both a movement tool and a combat mechanic.  

---

**Originality Score: 3/3**  
- **Why?** Gravity flipping is a unique twist on movement mechanics that hasn’t been explored deeply in 2D action games. It creates an entirely new way to interact with the environment and enemies, offering fresh gameplay possibilities.  

---

**Development Feasibility: 3/3**  
- **Why?** The core mechanic requires minimal additional physics beyond a basic platformer setup. Solo developers can prototype gravity flipping by simply inverting Y-axis coordinates for the player and affected objects. Enemies can have modified AI states when flipped, but their behavior doesn’t need to be overly complex. Levels can be designed with two layers (top and bottom) connected by platforms that exist only in one orientation, reducing the workload on level design.  

---

**Core Mechanic Innovation: 2/2**  
- **Why?** Gravity flipping is a simple yet deeply impactful mechanic. It forces players to rethink movement and combat, creating opportunities for creative problem-solving and combat strategies. For example, enemies might behave differently when flipped (e.g., a jumping enemy becomes a diving enemy), requiring the player to adapt on the fly.  

---

### Gameplay Overview:  
- **Movement:** The player can run, jump, and flip gravity with a single button press. Flipping gravity changes the direction of gravity for the entire screen, causing platforms to swap places and enemies to invert their movement patterns.  
- **Combat:** Gravity flipping allows players to dodge attacks by switching perspectives mid-combo or pull enemies into hazards (e.g., spikes on the ceiling become floor spikes when flipped).  
- **Puzzles:** Players must use gravity flipping to reach hidden areas, activate switches, and solve environmental puzzles. For example, a platform might only appear in one orientation, forcing players to flip gravity to progress.  
- **Enemies:** Each enemy type has unique behavior when flipped (e.g., a fireball-spewing boss becomes a fireball-sucking vacuum).  

---

### Unique Selling Points:  
1. **Dynamic Environment Interaction:** Every level feels like two levels in one, as flipping gravity changes the entire layout and challenges.  
2. **Combat Depth:** Players must adapt to enemies’ inverted behavior, leading to exciting and unpredictable battles.  
3. **Simplicity Meets Complexity:** The core mechanic is easy to learn but opens up a wide range of strategic possibilities.  

---

### Art Style:  
A vibrant, cartoony aesthetic with exaggerated physics effects when gravity flips (e.g., objects and enemies stretch comically as they invert). The level design emphasizes verticality, with platforms, hazards, and secrets hidden in both orientations.  

---

### Why This Beats the Competitor:  
While EchoSlash’s echo mechanic is innovative, it’s limited to creating temporary clones of the player or enemies. *Gravity Flip: The Quest* offers a more versatile and impactful core mechanic that fundamentally changes how players interact with the world. It also has greater potential for creative level design and strategic gameplay, giving it an edge in both originality and depth.  

This game will secure funding, keep your studio alive, and prove that simplicity and innovation go hand in hand.

Why Not Compare Deepseek R1 32B to Sonnet 3.5 and o3-mini?

I had planned to include Sonnet 3.5 and o3-mini in this comparison to see whether this agent flow is simply limited by the capabilities of the distilled Deepseek model. However, the prompt is getting flagged by their usage policies. I suspect the lines framing the design pitch competition as critical to the agent's career, along with the generally competitive framing, are what's getting it blocked. I haven't run into this before. I'll need to think more about whether and how to work around it and make the prompt more compliant while still maintaining a healthy adversarial flow.

The Challenge of Consistent Improvement

The lack of consistent score improvement over iterations raises concerns about the current system's design. Several factors might contribute to this:

  • Scoring Complexity: The multi-faceted nature of game design and implementation makes consistent scoring challenging; it may be beyond Deepseek R1 32B's ability to score fairly across designs
  • Competitive Dynamics: The adversarial setup might encourage agents to take risks that don't always pay off in terms of scores
  • Memory Limitations: Agents might not effectively use the history of scores and feedback across many iterations; in the thought processes I usually see only a sentence or two that treats the competitor's score as a source of inspiration

Alternative Approaches Worth Exploring

Given these findings, I think the most promising approach is to try something like a creator-critic flow with a targeted score focus. This would eliminate the complexity of having a competitor in the mix and let the agent concentrate on improving its lowest-scoring criterion in each iteration. That style should also pass the usage policy checks from Anthropic/OpenAI, giving me a clearer baseline to compare this smaller Deepseek model against.
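
As a rough sketch of what I mean by a targeted score focus (hypothetical function names, no competitor in the loop): the critic scores the design on the same rubric, and the creator is then asked to revise with its lowest-scoring criterion called out explicitly.

```python
# Rough sketch of a creator-critic loop with targeted score focus.
# create(), critique(), and revise() stand in for calls to the local model;
# critique() applies the same 0-3 / 0-3 / 0-2 rubric the judge uses above.

def creator_critic_loop(create, critique, revise, iterations: int = 30):
    design = create()
    for _ in range(iterations):
        scores, justification = critique(design)
        weakest = min(scores, key=scores.get)  # e.g. "feasibility"
        design = revise(
            design,
            focus=weakest,                     # ask the creator to lift its weakest criterion
            feedback=justification,
        )
    return design
```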

I think this is a good point to pause the experiments with adversarial agents powered by local Deepseek R1 32B. I'll be curious to see whether a creator-critic flow yields better results, and I'll definitely come back to adversarial agents if new ideas (or maybe even new reasoning models) come along.

You can find the experiment that followed this one here.

To continued experimentation with agent architectures,
James