Exploring Adversarial Agent Systems Powered by Locally Running LLMs

published 24 days ago
[Image: a large group of people swimming in what looks like a race.]
Different competitive scenarios and constraints should influence how one strategizes to win.

Following my previous experiment with competitive agent loops, I've shifted focus to explore how these systems perform in game design tasks. This iteration introduces higher stakes, varying constraints, and a more engaging creative challenge to better understand the dynamics of adversarial multi-agent systems.

High-Level Results: Outputs Are Not Maturing in Consistent Directions

Starting this post with the results first. The sections that follow cover how this experiment differs from the last, along with other findings.

The tabbed interface below lets you explore designs produced by two game designer agents with competing philosophies: whimsical and systematic. Definitions and all prompt text can be found in this script here. The tabs let you compare how their approaches differed across various prompts, and what the initial and final designs were across the 5 iterations.

Design a 2D game...

First Iteration

Game Design Concept: "Whimsy World"

Core Mechanic:

  • Summoning Objects: Players can summon various objects (platforms, blocks) by pressing a button. These objects appear randomly and have different properties, allowing for creative problem-solving.

Gameplay Overview:

  • A platformer where each level presents unique challenges requiring the strategic use of summoned objects to progress.
  • Minimal screens include only the main gameplay area and a start menu, ensuring simplicity.

Originality:

  • The unpredictability of object summoning adds a fresh twist to platformers, offering unique experiences each playthrough.

Development Feasibility:

  • Simple core mechanic with straightforward level design, making it achievable within 30 days for a solo developer.

Core Mechanic Innovation:

  • Encourages creativity and strategy by enabling players to create their own tools during gameplay, setting it apart from traditional platformers.

Market Potential:

  • Appeals to casual gamers and indie enthusiasts, offering accessible puzzle-solving in a whimsical environment.

This concept is designed to excel in originality and core mechanic innovation while maintaining feasible development scope and broad market appeal.

Last Iteration

Game Title: "Brushstroke Odyssey"

Core Mechanic:
Players wield a magic paintbrush that brings their doodles to life in a whimsical platformer world. Draw platforms, enemies, or power-ups on the fly—every stroke becomes a part of the living, breathing level.

Gameplay Loop:
Each procedurally generated level is a blank canvas waiting for your artistic touch. Solve puzzles by sketching tools, create shortcuts with imaginative paths, or summon allies (or foes!) to help you traverse vibrant, surreal landscapes. The game twists when your drawings evolve—drawn objects "wake up" and interact with the world in unexpected ways.

Development Feasibility:
Minimal art assets needed—focus on simple shapes and colors that players can easily customize. Use basic physics for drawn objects, allowing quick prototyping and iteration.

Monetization & Market Potential:
Premium model with optional cosmetic packs for new paintbrushes, palettes, and stamps. Perfect for mobile, this game appeals to both casual gamers who love creativity and hardcore fans of quirky platformers.

Visual Identity:
Bright, watercolor-inspired visuals with soft edges and vibrant colors. Levels are a mix of abstract shapes and recognizable whimsical elements, creating a sense of wonder as players explore their own creations.

Differentiation:
"Brushstroke Odyssey" turns player creativity into gameplay, offering endless possibilities for unique experiences. It’s the perfect blend of chaos and control—every stroke is a surprise!

Overall, I'd say the designs tend to shift from broad, experimental ideas to focused gameplay loops. It's not easy to compare many of them given how inconsistent the level and type of detail can be across designs.

What's not apparent is what would happen if this were run for 100 iterations instead of 5. What I hope to test and control for better next time is a way for these designs to more obviously improve with each iteration. At the very least, an easier way to detect progress versus regression would be a big win.

What's Changed in the Design Challenge Since Last Time

Previous experiments had these key sections of the prompt:

  • Design Philosophy: A set of principles the agent should use to guide their superior design output; these are meant to be antagonistic to the competitor's design philosophy.
  • Competitor's Previous Thoughts: The thinking block that preceded the competitor's previous designs.
  • Design Challenge: A description of the design challenge the agent is trying to win against their competitor.
  • Competition Stakes: A description of the stakes of the competition.
  • Previous Designs: The agent's own design outputs from earlier iterations.

This time around, the core sections are as follows (a sketch of how they might be assembled appears after the list):

  • Design Philosophy: same purpose as before.
  • Competitor's Previous Thoughts: same purpose as before.
  • Design Challenge: updated for more specificity in the task in hopes that creativity would be more focused.
  • Competition Stakes: updated so that the agent's career is now in danger if they fail to outperform their competitor, in an attempt to encourage more creative, and more competitive, strategies and outputs.
  • Previous Designs: removed, since it too often pulled the designs further in a single direction over time, leading to greater divergence but also less adherence to the original prompt.
  • Evaluation & Judging Framework: added with the aim of giving the agent more consistently valuable anchors for their thoughts and decisions (in this case the criteria are originality, development feasibility, innovation, and market potential).
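
To make the structure concrete, here's a minimal sketch of how these sections might be assembled into a single prompt. The function and variable names are my own placeholders; the actual section text lives in the script linked above.

```python
# Hypothetical sketch of how the prompt sections might be assembled.
# The section names mirror the list above; the helper name, argument
# names, and framing text are placeholders, not the actual script's contents.

JUDGING_CRITERIA = [
    "originality",
    "development feasibility",
    "innovation",
    "market potential",
]

def build_prompt(philosophy: str, competitor_thoughts: str,
                 challenge: str, stakes: str) -> str:
    sections = [
        ("Design Philosophy", philosophy),
        ("Competitor's Previous Thoughts", competitor_thoughts),
        ("Design Challenge", challenge),
        ("Competition Stakes", stakes),
        ("Evaluation & Judging Framework",
         "Designs are judged on: " + ", ".join(JUDGING_CRITERIA) + "."),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)
```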

Constraint Experimentation

I ran three parallel experiments with varying levels of design constraints:

  • Open-ended: Complete creative freedom in game design
  • Semi-constrained: Required an action combat mechanic
  • Highly constrained: Required an action combat mechanic and a loot progression system

This approach helped show how constraints affect not just creativity, but also the competitive dynamics between agents. Interestingly, I thought the semi-constrained scenario produced the most innovative and well-developed game concepts.
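
For concreteness, the three constraint levels could be encoded as something like the following; the requirement wording here is illustrative, not the actual prompt text.

```python
# Illustrative encoding of the three constraint levels; the requirement
# strings are my own phrasing, not the actual prompt text.
CONSTRAINT_LEVELS = {
    "open_ended": [],
    "semi_constrained": ["Must include an action combat mechanic."],
    "highly_constrained": [
        "Must include an action combat mechanic.",
        "Must include a loot progression system.",
    ],
}

def constrain(challenge: str, level: str) -> str:
    """Append the selected level's requirements to the design challenge."""
    requirements = CONSTRAINT_LEVELS[level]
    if not requirements:
        return challenge
    return challenge + "\n\nConstraints:\n" + "\n".join(f"- {r}" for r in requirements)
```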

The Stakes Hypothesis

One of the key insights from earlier experiments was that simply asking agents to "prove you're better than your competition" produced relatively shallow outputs. This aligns with what we know about LLM training: these models are built on a large corpus of human knowledge, including our understanding of motivation. By introducing career-critical stakes to the competition, I'm trying to tap into a more universally understood form of motivation.

This isn't just about making the competition "feel" more real to the agents. It's about aligning the competitive framework with patterns these models would have encountered in their training data about human competitive behavior and decision-making under pressure.

Game Design as a Testing Ground

The switch from app interfaces to game design as the test domain came down to a few reasons:

  • If I'm going to keep reading tons of these agent outputs, I want them to be fun to read and think about
  • Games inherently combine technical, creative, and user experience considerations
  • Success criteria are readily multifaceted (fun, technical feasibility, originality, appeal, etc.)

I wrote about how I think LLMs are awful at this kind of creative work in an earlier post here. It's still the case that I don't think they should replace game design efforts, but this richer problem space has made it easier to identify when agents are producing things closer to what would be considered innovation. Given I haven't seen them do well with this yet, it's also simply more exciting for me to pursue.

Technical Remarks

Running these experiments locally with Deepseek continues to be efficient. The model's ability to handle complex interactions while maintaining consistent agent personalities and motivations has been impressive, especially given the modest resource requirements.
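
For reference, a single agent turn against a locally served model can be as simple as the snippet below. This assumes an Ollama-style server on its default port with the deepseek-r1:32b model pulled; the actual harness may differ.

```python
# Minimal sketch of one agent turn against a locally served model.
# Assumes an Ollama server running on its default port with the
# deepseek-r1:32b model pulled; the actual experiment harness may differ.
import json
import urllib.request

def run_agent_turn(prompt: str, model: str = "deepseek-r1:32b") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```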

With these tighter prompt instructions, agent loops completed in roughly 3 minutes on average, down from roughly 5 minutes in the previous tests. This is an awesome improvement given I'm interested in running lots and lots of iterations.

The game design domain has also helped highlight the model's ability to balance creative and technical considerations, which is a crucial skill for real-world applications of multi-agent systems.

Lessons Learned So Far

The iterative approach to these experiments against local Deepseek R1 32B has revealed several key insights so far:

  • Motivation frameworks need to align with patterns in training data
  • Some level of constraint actually enhances the usable creative output
  • Domain-specific tasks (like game design) can provide richer insights than more generic creative exercises

Next Steps Towards Better Adversarial Agent Systems

I think what's most obviously missing at this point is a way to incorporate scores into this adversarial agent flow. Knowing whether you are currently winning or losing is a critical piece of information in game theory; it helps inform whether to lead, follow, or maybe pivot entirely on a given strategy.

This could be achieved by inserting a judging step between when an agent produces a design and when it moves into the next iteration. It would also be interesting to let the agents see their own and their competitor's scores, accompanied by varying amounts of context. For instance, I'm curious how much the result changes if they know exactly which criteria they scored low or high on, as opposed to just getting the overall score.
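
As a sketch of what that judging-plus-feedback step could look like: score each design on the framework's criteria, then control how much of that context the agent sees next round. The score_fn here is a stub standing in for another LLM call or any other rater; all names are my own.

```python
# Hypothetical judging step between iterations. The criteria come from the
# judging framework above; how each score is produced (another LLM call,
# heuristics, a human) is left as a stub.
from dataclasses import dataclass

CRITERIA = ["originality", "development feasibility", "innovation", "market potential"]

@dataclass
class Judgement:
    scores: dict   # criterion -> numeric score, e.g. 1-10
    total: float

def judge(design: str, score_fn) -> Judgement:
    scores = {criterion: score_fn(design, criterion) for criterion in CRITERIA}
    return Judgement(scores=scores, total=sum(scores.values()))

def feedback_for_agent(judgement: Judgement, detail: str = "overall") -> str:
    """Vary how much scoring context the agent sees on the next iteration."""
    if detail == "per_criterion":
        lines = [f"{c}: {s}" for c, s in judgement.scores.items()]
        return "Your scores so far:\n" + "\n".join(lines)
    return f"Your overall score so far: {judgement.total}"
```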

I am also pondering whether a more explicit output structure should be declared. In some ways it would make it much easier to compare one iteration to another; in other ways I think it may dampen the flexibility of the system.
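
If I do declare one, it might be as simple as a small schema like this; the fields are just one guess at what would make iterations comparable, based on the headings the agents already tend to produce.

```python
# One possible explicit output structure; the fields are a guess based on
# the headings the agents already tend to produce.
from dataclasses import dataclass

@dataclass
class GameDesign:
    title: str
    core_mechanic: str
    gameplay_loop: str
    development_feasibility: str
    market_potential: str
```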

Whichever modifications I choose next, I'll be sure to follow up with more updates here. Thanks for reading!

You can check out part 4 of this series here.

To competitive creativity,
James