Testing Multi-Agent Actor-Critic Systems Powered by Deepseek R1 32B


My previous experiments with adversarial agent systems found that competition between agents didn't consistently lead to more creative game designs, and the competition stakes I set up weren't something the OpenAI and Anthropic APIs permitted anyway.

This led me to look at what other agent architectures exist for creative tasks, and I landed on the actor-critic (or creator-critic) flow: one agent proposes an idea, another provides feedback, and then the original agent takes another pass at its idea with that feedback in mind.

To better understand the effectiveness of this approach, I tested numerous iterations across four different language models: GPT-4o, Claude Sonnet 3.5, locally run Mistral NeMo, and locally run Deepseek R1 32B.

[Diagram] My rough sketch of the creator-critic architecture, which enables a simple, collaborative feedback loop between agents.

Why Run This Experiment?

I've written previously about how video games are art at scale: they're multi-modal, interactive, and often have a narrative. LLMs are barely able to generate 2D art, and even then the quality is quite mixed, so designing a game via LLMs seems impossibly hard. I'm curious exactly where the limitations lie, and I want a better understanding of what role language can play in creativity as new tools built around language keep emerging.

What's Changed Since the Previous Test?

In part 4 of the adversarial agent exploration, I added a judge into the loop. That judge proved unreliable and unfair in its scoring, so for this test I updated it in two ways: (1) the judge is now powered by Sonnet 3.5 instead of Deepseek R1 32B, and (2) it only scores two criteria instead of three: innovation and feasibility.

I also expanded the hypothetical build-time constraint from 30 days to 3 months, hoping this would encourage more complex and interesting designs.

The last change was to have the creator/game designer focus on improving only one criterion at a time, either innovation or feasibility. The intention was to give the game designs a more natural, steady evolution so that scores would only go up rather than constantly trading off against each other.
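
To make these changes concrete, here's a minimal sketch of how a two-criteria judge and the single-criterion focus could be wired together. The prompt wording, the judge/next_focus helpers, the exact Sonnet model string, and the weaker-criterion-first policy are assumptions for illustration, not the repo's actual code.

```python
# Sketch of the two-criteria judge (illustrative assumptions, not the exact repo code).
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

JUDGE_PROMPT = """You are judging a game design that must be buildable in 3 months.
Score it on exactly two criteria, each from 0 to 3:
- innovation: how novel the core mechanics and premise are
- feasibility: how realistic the design is for a 3-month build

Design:
{design}

Respond with JSON only: {{"innovation": <int>, "feasibility": <int>, "feedback": "<one paragraph>"}}"""


def judge(design: str) -> dict:
    """Ask Sonnet 3.5 to score a design on innovation and feasibility (0-3 each)."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Sonnet 3.5; exact model string is an assumption
        max_tokens=500,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(design=design)}],
    )
    # Assumes the judge complies with the JSON-only instruction.
    return json.loads(response.content[0].text)


def next_focus(scores: dict) -> str:
    """Pick the single criterion the creator should improve next.

    Targeting the weaker criterion is one plausible policy; the post only says
    the creator focuses on one criterion per iteration.
    """
    return "innovation" if scores["innovation"] <= scores["feasibility"] else "feasibility"
```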

As with the previous tests, you can find all the code in my GitHub repo and mess with it yourself.

The Creator-Critic Architecture

Unlike the competitive approach where agents try to outdo each other, the creator-critic architecture establishes a collaborative relationship between two specialized roles:

  • Creator: Focuses purely on generating game designs without the pressure of competition
  • Critic: Provides detailed feedback and suggestions for improvement, helping guide the creator's iterations

This setup mirrors traditional creative processes where feedback loops help refine and improve ideas over time. The hypothesis was that this more focused, collaborative approach might yield better results than having agents be inspired by their competition.
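
As a rough illustration, the loop under test looks something like the sketch below, where complete(model, prompt) stands in for whatever client wraps each API or local runtime; the function and prompt wording are assumptions rather than the repo's actual code.

```python
# Minimal sketch of the creator-critic loop (illustrative, not the exact repo code).
# `complete(model, prompt)` is a stand-in for whatever wraps each model:
# the OpenAI/Anthropic APIs, or a local runtime for Mistral NeMo and Deepseek R1 32B.

def creator_critic_loop(creator_model: str, critic_model: str,
                        complete, iterations: int = 30) -> list[str]:
    designs = []
    design = complete(creator_model, "Design a video game you could build in 3 months.")

    for _ in range(iterations):
        designs.append(design)
        # Critic: review the current design and suggest concrete improvements.
        feedback = complete(
            critic_model,
            f"Critique this game design and suggest concrete improvements:\n\n{design}",
        )
        # Creator: revise the design with the critic's feedback in mind.
        design = complete(
            creator_model,
            f"Here is your game design:\n\n{design}\n\n"
            f"Here is feedback on it:\n\n{feedback}\n\n"
            "Revise the design to address the feedback.",
        )
    return designs
```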

Comparing Model Performance

This time I was able to conduct a deeper comparison of what the distilled, locally run Deepseek R1 32B can do against some more established LLMs in the space. I did my best to adapt the prompt so that each of 4o, Sonnet 3.5, Mistral NeMo, and Deepseek R1 32B could show its best.

Each model went through 30 iterations of the creator-critic loop, with each design scored on innovation (0-3) and feasibility (0-3).
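
For reference, a hypothetical driver for the comparison could reuse the sketches above along these lines. The model identifiers, and the choice to have each model critique its own designs, are assumptions on my part; scoring is shown after the fact here for brevity, even though the judge's feedback also informed which criterion the creator targeted next.

```python
# Hypothetical driver for the comparison (model identifiers are placeholders).
# Each model is assumed to act as both creator and critic for its own run.

MODELS = ["gpt-4o", "claude-3.5-sonnet", "mistral-nemo", "deepseek-r1:32b"]

def run_comparison(complete):
    results = {}
    for model in MODELS:
        designs = creator_critic_loop(creator_model=model, critic_model=model,
                                      complete=complete, iterations=30)
        # Score every iteration on the two 0-3 criteria with the Sonnet 3.5 judge.
        results[model] = [judge(design) for design in designs]
    return results
```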

Key Observations

  • Judge Is No Longer Generous: No model achieved a score of 3 in either category. The judge is not handing out high scores nearly as easily as before (which I think is a good thing), which could indicate some mixture of the problem being too hard and the judge being too strict.
  • Consistent Challenges: All models struggled to achieve high scores across both categories simultaneously. Improving one aspect often came at the cost of another.
  • No Benefit from Multiple Iterations: No clear trend of improvement emerged within 30 iterations of the creator-critic loop for any of the models.

The Multi-Modal Challenge of Game Design

The results highlight a fundamental challenge: game design might be too complex a problem for current language models, and/or for an architecture this simplistic, to tackle effectively. While we've seen impressive results with LLMs generating text, code, and even helping with visual design, games present a unique challenge:

  • Interactive Nature: Games are highly interactive experiences where player agency and feedback loops are crucial
  • Multi-Modal Integration: Successful games require the seamless integration of visuals, sound, mechanics, and narrative
  • Technical Constraints: Balancing creative vision with technical feasibility adds another layer of complexity
  • Fun As A Concept: What words would you use to describe fun? Dark Souls, Mario, and Tetris are all equally fun to me, in very different ways. That nuance might be too much for language alone to convey effectively.

Patterns and Limitations

One interesting observation across all models was their tendency to gravitate toward certain mechanical themes:

  • Echo Mechanics: A surprising number of designs featured mechanics involving echoes, shadows, or duplicates
  • Time Manipulation: Time control mechanics appeared frequently

This pattern suggests that the models might be over-indexing on certain successful indie games in their training data (maybe Braid? Katana ZERO?), leading to fewer truly novel mechanics. It's a reminder that while these models can combine and iterate on existing ideas in a rudimentary way, generating truly original game mechanics requires something more.

Breaking Out of Feedback Loops and Avoiding Dead Ends

The creator-critic architecture, while easier to test across more models than the competitive approach, still showed a tendency to get stuck in loops and showed little improvement over time. When faced with multi-faceted problems like game design, the models often:

  • Oscillated Between Solutions: Fixing feasibility issues would lead to less innovative designs, and vice versa
  • Repeated Similar Patterns: Even with critical feedback, the creator agents often fell back on familiar mechanics
  • Struggled with Integration: Combining multiple pieces of feedback into a cohesive improvement proved challenging

As for what's next, I think simplifying the prompts could help me better understand where to focus future efforts. Right now there's so much for both the creator and the critic to analyze that it's hard to know what would be most valuable to tweak next. Perhaps directions as simple as "make something fun" for the creator and "provide feedback that would make it more fun" for the critic are sufficient with enough iterations.

You can find the follow-up post to this test here.

To continuing to explore the boundaries of creativity in LLMs,
James