Creative Reasoning Abilities: Deepseek R1 32B vs Dolphin 3 R1 Mistral

Mistral R1 Dolphin Arrives

I'd heard of Dolphin before as a great uncensored model that lets you have a lot of control, but this is the first time I've used it. With the release of Dolphin 3.0 R1 Mistral 24B I figured it's a good opportunity to see what it can do while confirming that Deepseek R1 32B is still worth prioritizing for local reasoning.

My latest experiment with Deepseek R1 32B (and others) was seeing how it performs with an actor-critic agent architecture. For this experiment, I wanted to see how it performs with a dual-critic agent architecture.

To recap, I'm testing these models with a game design prompt to see how they perform at generating interesting and feasible game ideas. This is a tough prompt since it asks for things like game mechanics, art, and development scope, which are all quite subjective. All the code is available on GitHub. You can see the exact prompt(s) used here.

What Changed Since Last Time?

The previous actor-critic experiment revealed that while the architecture could generate interesting ideas, it lacked consistency and focus across iterations. The creator could easily get stuck revising in response to critic feedback instead of providing a fully coherent game design each time. To address these limitations, I made two changes to the system:

Dual Critics: Instead of a single critic or judge, each iteration now receives feedback from two specialized critics. One focuses on originality while the other evaluates feasibility. I chose to have the critics powered by Mistral-Nemo as I found it to have a good balance of creativity and speed.
Semantic Similarity Checks: Between iterations, the system now performs a semantic similarity check using sentence embeddings. If consecutive designs have a cosine similarity below 0.8, the creator gets up to 3 chances to revise their design to maintain more consistency with the previous iteration.
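To make the similarity check concrete, here is a minimal sketch of how such a gate could work. It is not the code from the repo: `embed` and `revise` are hypothetical callables standing in for the sentence-embedding model and the creator model, and only the 0.8 threshold and the 3 retries come from the description above.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as "not similar"
    return dot / (norm_a * norm_b)


def revise_until_consistent(
    previous: str,
    candidate: str,
    embed: Callable[[str], Vector],     # hypothetical: text -> sentence embedding
    revise: Callable[[str, str], str],  # hypothetical: asks the creator for a revision
    threshold: float = 0.8,
    max_retries: int = 3,
) -> Tuple[str, int, bool]:
    """Return (design, retries_used, accepted).

    The candidate is accepted once its similarity to the previous design
    reaches the threshold. Otherwise the creator gets up to `max_retries`
    revisions before the iteration is marked as not accepted.
    """
    prev_vec = embed(previous)
    retries = 0
    while True:
        score = cosine_similarity(prev_vec, embed(candidate))
        if score >= threshold:
            return candidate, retries, True
        if retries == max_retries:
            return candidate, retries, False
        candidate = revise(previous, candidate)
        retries += 1
```

In this sketch, an iteration that comes back with `accepted` set to `False` is one that exhausted its retries.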

These changes were implemented to help balance the tension between creative exploration and maintaining a coherent development path. The dual critics provide more specialized feedback, while the similarity checks ensure that each iteration builds meaningfully on its predecessor rather than diverging too radically.
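As a rough illustration of the dual-critic step, the sketch below gathers feedback from two critics and folds it into a revision prompt for the creator. The critic callables, the `Critique` type, and the prompt wording are all assumptions made for the example; the actual prompts live in the repo.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Critique:
    focus: str     # e.g. "originality" or "feasibility"
    feedback: str  # free-text feedback returned by the critic model


def collect_feedback(
    design: str,
    critics: Dict[str, Callable[[str], str]],  # hypothetical: focus -> critic model call
) -> List[Critique]:
    """Ask every critic for feedback on the same design."""
    return [Critique(focus, critic(design)) for focus, critic in critics.items()]


def build_revision_prompt(design: str, critiques: List[Critique]) -> str:
    """Combine the previous design and all critic notes into one creator prompt."""
    notes = "\n".join(f"[{c.focus}] {c.feedback}" for c in critiques)
    return (
        "Previous design:\n"
        f"{design}\n\n"
        "Critic feedback:\n"
        f"{notes}\n\n"
        "Write a complete, revised game design that addresses this feedback."
    )
```

Asking for a complete design in every revision, rather than a patch, is one way to keep the creator from getting stuck on feedback.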

Model Performance Comparison

Each model went through 150 iterations with the dual-critic system. I tracked how often each one produced a valid new design, meaning one with a cosine similarity at or above 0.8 to the previous design, without exhausting all retry attempts.
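The tracked metric is just a ratio. A small sketch of the arithmetic, assuming each iteration is logged as a boolean (the flag list below is made up for illustration, not taken from the actual runs):

```python
from typing import Sequence


def distinct_design_rate(accepted: Sequence[bool]) -> float:
    """Percentage of iterations that passed the similarity check within the retry budget."""
    if not accepted:
        return 0.0
    return 100 * sum(accepted) / len(accepted)


# Illustrative flags only: 150 iterations, 111 of them accepted.
example_flags = [True] * 111 + [False] * 39
rate = distinct_design_rate(example_flags)
```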

Example Iterations

The interface below lets you explore each model's thinking and how the design evolves over time.

Final Answer:

Core Concept:

"SeedScape: Bloom & Traverse"

Tagline: "Grow your path, unlock new horizons."

Description:
In SeedScape: Bloom & Traverse, players navigate through vibrant, dynamic environments by planting and growing platforms from seeds. Each seed type has unique properties, allowing for strategic planning and creative problem-solving as they traverse increasingly complex landscapes.

Main Mechanic:

Seed-Based Platform Growth

Players collect and plant different types of seeds (e.g., Speed Grow, Strength, Environmental Resistant) to create platforms that mature over time.
Each seed type offers unique abilities, such as rapid growth for quick traversal or durability against environmental hazards.
The mechanic introduces a strategic element, requiring players to plan their routes and manage resources efficiently.

Basic Gameplay Progression:

Introduction (Beginner Levels):
- Players start with basic seeds in simple environments, learning the fundamentals of planting and growing platforms.
- Basic obstacles and enemies introduce the need for strategic seed placement.
Unlocking New Seeds (Intermediate Levels):
- As players progress, they unlock new seed types with varied properties, enhancing their ability to tackle more complex challenges.
- Each level introduces environmental hazards that interact with platform growth, requiring adaptability.
Complex Environments (Advanced Levels):
- Players encounter larger, multi-layered environments with intricate puzzles and enemies.
- The progression involves collecting resources to upgrade seeds or buy new types, offering increased complexity and challenge.
Replayability:
- Each level presents unique challenges, encouraging players to experiment with different seed combinations and strategies for varied outcomes.

Art & Aesthetic:

Visual Style:

Bright, colorful visuals with a focus on plant life and organic shapes.
Environments feature distinct biomes (e.g., lush forests, arid deserts) with evolving aesthetics as platforms grow and interact with the surroundings.

Character Design:

The player character is a botanist with a seed pouch, designed to be approachable and adaptable.
Enemies are designed using organic elements, such as thorny plants or creatures that blend into their environment.

Development Scope:

Feasibility for Small Team:

Central mechanic revolves around planting and growing platforms, allowing expansion without overly complex systems.
Art style focuses on clear visuals and patterns, reducing the need for highly detailed textures.
The game can be developed in stages, starting with core mechanics before introducing new seeds and environments.

Innovation:

Introduces a unique mechanic not commonly seen in 2D games, emphasizing strategic planning and creativity.
Encourages exploration and replayability through diverse seed types and evolving environments.

SeedScape: Bloom & Traverse offers an engaging, innovative experience with depth and visual appeal, suitable for a small development team aiming to create a memorable 2D platformer.

Key Findings

Distinct Design Rate: The chart above shows the percentage of iterations that produced distinct, yet still iterative, designs. Deepseek R1 32B achieved 100% distinct designs across all iterations, while Dolphin Mistral produced 74% distinct designs, with the remainder exhausting all retry attempts to reach the consistency threshold.
Semantic Stability: The similarity threshold of 0.8 proved to be an effective balance, catching significant deviations while allowing for meaningful evolution of ideas.
Model Differences: The distinct design rates reveal that Deepseek R1 32B was more consistently able to produce unique yet coherent iterations, while Dolphin Mistral would often get stuck and dead-end for a given iteration.
Adherence to Instructions: The think blocks were often spot on about how to navigate the prompt. However, the designs weren't always as successful. Dolphin's designs almost always involved some kind of time component, even though one of the first lines of the prompt is "Do **not** use time loops, rewinds, echoes, or recursion-based mechanics."

Overall, for this use case I don't see much reason to pick Dolphin over Deepseek. Not every model needs to be great at everything, and I know Dolphin places a lot of emphasis on tasks like coding or math. I think there could still be some value in the think blocks alone.

The Value of Think Blocks

One amazing capability of both Deepseek R1 and Dolphin R1 is their explicit reasoning through "think" blocks in their outputs. In this test, the think blocks often revealed:

Which aspects of the previous design it chose to maintain or modify
How the model interprets critic feedback
Its reasoning for specific creative decisions

I could imagine them being useful for things like dynamic instructions or something as simple as improving the prompts fed into these models.
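To use the reasoning separately from the design, the think block first has to be split from the final answer. A minimal sketch, assuming the model wraps its reasoning in `<think>...</think>` tags as R1-style models typically do (the function name is my own):

```python
import re
from typing import List, Tuple

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_blocks(output: str) -> Tuple[List[str], str]:
    """Return (list of think-block texts, remaining answer text)."""
    thoughts = [block.strip() for block in THINK_RE.findall(output)]
    answer = THINK_RE.sub("", output).strip()
    return thoughts, answer
```

The extracted reasoning could then be logged next to each iteration or fed back in as extra context, while only the answer text goes to the critics.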

Future Directions

While the dual-critic architecture shows promise for maintaining focused creativity, there are several exciting directions to explore next:

Beyond Game Design: Testing this architecture on other creative tasks could help understand its broader applicability.
Other Creative Models: Exploring how models specifically fine-tuned for creative tasks, like Nous-Hermes, might perform in this architecture.
Think Block Analysis: Developing ways to better utilize the explicit reasoning in think blocks to guide the creative process.
