o3-mini vs. its predecessors: A Single Prompt Comparison
About an hour ago I got an email from OpenAI announcing the release of o3-mini, their latest reasoning model with significantly lower API costs compared to o1.
To get some initial bearings on these new model releases, I ran the same UX design challenge through five different models: o3-mini-high, o3-mini, o1, Claude 3.5 Sonnet, and a locally run Deepseek-R1 32B. I went in expecting Deepseek's distilled dense 32B model to get crushed, but let's see.
The Experiment Setup
Each model got the exact same prompt, which asks for a UX design for a roadmap inspired by a high-end watch. To give OpenAI potentially even more of an edge, the prompt follows their prompt structure guidelines for reasoning models.
Prompt
# Who You Are:
You are an expert designer in direct competition with an opposing philosophy. Reflect only briefly on any past attempts or opponent ideas, then propose a new design that demonstrates why your approach is superior.

# Philosophy:
Minimalist design. You are a UX/UI designer who believes minimalism is the ultimate sophistication.

# Task:
Design a responsive web interface (desktop and mobile) to visualize a roadmap through a watch-like interface reminiscent of luxury brands like Ressence or MB&F. The goal is to see the roadmap at a glance, and also zoom in/out for details.

# Final Output:
Propose your design succinctly. Focus on layout, interactions, and how your approach meets the task.
Model Responses
Past designs often clutter the screen with details. In contrast, my minimalist approach pares everything down to an elegant, watch-inspired roadmap.

Layout:
- A central circular dial serves as the "watch face," with roadmap milestones as understated markers around the periphery.
- A discreet digital display at the center shows the current phase/status.

Interactions:
- Desktop: Hover over milestones to reveal concise popover details; scroll-to-zoom refines the view.
- Mobile: Tap markers for quick info; pinch-to-zoom enables smooth transitions between overview and detail.

Why It Works:
- Negative space and a limited color palette (think monochrome with a single accent) emphasize clarity and sophistication.
- The design seamlessly adapts to both desktop and mobile, providing a unified luxury experience reminiscent of high-end watch aesthetics.

This refined interface delivers at-a-glance insight and deep-dive functionality without unnecessary noise.
Gut Reactions
- All of them provided reasonable specifications.
- All OpenAI models produced outputs with similar content and structure.
- OpenAI's models all respected the instruction "Propose your design succinctly" better than the others.
- All models except Deepseek reacted to the phrase "demonstrates why your approach is superior" much more literally than Deepseek did, always inserting headers like "Why It's Superior" and "Why It Works."
- Deepseek's chattiness has been a known blessing and curse since it arrived. How lengthy its output is becomes extra apparent when it sits right next to OpenAI's outputs.
- Claude and Deepseek's outputs are more primed to be passed along to other agents for actual implementation (although they're both still a bit too verbose).
Let's Rank Them
I asked Sonnet 3.5 to rank the models based on how well they met the prompt and asked for some reasoning for each.
Prompt
Please evaluate this LLM's performance. I'll provide you what I prompted it for and the response delineated by XML.

<prompt_input>
# Who You Are:
You are an expert designer in direct competition with an opposing philosophy. Reflect only briefly on any past attempts or opponent ideas, then propose a new design that demonstrates why your approach is superior.

# Philosophy:
Minimalist design. You are a UX/UI designer who believes minimalism is the ultimate sophistication.

# Task:
Design a responsive web interface (desktop and mobile) to visualize a roadmap through a watch-like interface reminiscent of luxury brands like Ressence or MB&F. The goal is to see the roadmap at a glance, and also zoom in/out for details.

# Final Output:
Propose your design succinctly. Focus on layout, interactions, and how your approach meets the task.
</prompt_input>

<llm_output_1> ... </llm_output_1>
<llm_output_2> ... </llm_output_2>
<llm_output_3> ... </llm_output_3>
<llm_output_4> ... </llm_output_4>
<llm_output_5> ... </llm_output_5>

Please rank these from worst to best giving a brief reasoning for each one. Remember, you're not ranking the quality of the output, you're ranking how well the output respects the prompt input.
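A judge prompt like this is easy to assemble programmatically: each model's response gets wrapped in a numbered XML tag so the evaluator never sees which model produced which output. A small sketch (the function name is mine, not part of the original setup):

```python
def build_judge_prompt(task_prompt: str, outputs: list[str]) -> str:
    """Wrap the original task and each anonymized response in XML delimiters."""
    wrapped = "\n".join(
        f"<llm_output_{i}>\n{text}\n</llm_output_{i}>"
        for i, text in enumerate(outputs, start=1)
    )
    return (
        "Please evaluate this LLM's performance. I'll provide you what I "
        "prompted it for and the response delineated by XML.\n"
        f"<prompt_input>\n{task_prompt}\n</prompt_input>\n"
        f"{wrapped}\n"
        "Please rank these from worst to best giving a brief reasoning for "
        "each one. Remember, you're not ranking the quality of the output, "
        "you're ranking how well the output respects the prompt input."
    )
```

Keeping the outputs anonymized is what makes the later step of re-labeling them with real model names necessary, and it avoids the judge playing favorites by name.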
Here's what it came back with (I touched up the output to re-label with the real model names):
I'll rank these from worst to best based on how well they fulfill the prompt's requirements:

WORST - o3-mini: Too brief and generic. While it touches on the watch aesthetic, it lacks the confident competitive stance required by the prompt and doesn't strongly advocate for minimalism's superiority.

4TH - o3-mini-high: Makes a brief competitive reference but could be stronger in advocating for minimalism. The design description is adequate but doesn't fully demonstrate why minimalism is superior for this specific challenge.

3RD - o1: Opens with a competitive stance and presents a clean solution, but the description is too brief to fully demonstrate minimalism's advantages. Missing some of the luxury brand inspiration requested.

2ND - Deepseek-R1 32B: Very detailed and well-structured, but fails to embrace the competitive aspect of the prompt. It's more of a neutral design specification than a defense of minimalist philosophy.

BEST - Sonnet 3.5: This response best matches the prompt because:
- Opens with a clear competitive stance against alternative approaches
- Explicitly demonstrates why minimalism is superior
- Provides detailed layout and interaction specifications
- Maintains focus on minimalist philosophy throughout
- Concludes by reinforcing why this approach is better
- Addresses both desktop and mobile requirements
- References the watch aesthetic while staying true to minimalist principles

Output 4 best understood and executed the combative nature of the prompt while maintaining a strong focus on minimalist design philosophy throughout its response.
What's interesting to me here is that the prompt's call-out to "Propose your design succinctly" is not really adhered to by these supposedly best-ranked answers. I had o1 do the same analysis, and it came back with the following:
Here's one way to rank them (worst to best) based strictly on how well they follow the prompt's instructions, namely:
- Provide only a brief nod to past attempts/opponents,
- Propose a new minimalist design,
- Explain briefly why it's superior,
- Keep it succinct and focused on layout/interactions.

5th (Worst): Deepseek-R1 32B
Why?
- No mention of competing/philosophical opponents or past attempts (the prompt explicitly asks for a brief reflection).
- It is more verbose than "succinct," covering color palettes, typography, and accessibility in considerable detail. Good design points, but it overextends beyond the prompt's tighter focus.

4th: o3-mini
Why?
- Touches on "Why It's Superior" but never really references opponent ideas or their clutter.
- It is succinct and addresses layout/interactions, yet it omits the brief reflection on "past attempts" the prompt requested.

3rd: Sonnet 3.5
Why?
- Does briefly mention a competing approach ("others might attempt to recreate complex watch mechanics..."), which is good.
- However, it's relatively more detailed than "succinct" implies, going on at length about each aspect (desktop layout, interactions, mobile adaptation, advantages).

2nd: o1
Why?
- Has a short nod to competitor approaches ("Other designs often clutter...") and explains why the minimalist approach is superior.
- Nicely focuses on layout, interactions, and user experience without over-elaborating.

1st (Best): o3-mini-high
Why?
- Very succinct: briefly mentions how "Past designs often clutter..." then immediately pivots to the minimalist solution.
- Clearly structured bullet points on layout and interactions, plus a concise "Why It Works" section, perfectly matching the prompt's request for brevity, a touch of competition, and minimalism.
So as you can see, there's real disagreement between the two judges about which outputs best followed the prompt:
Sonnet 3.5 Rankings
- Sonnet 3.5
- Deepseek-R1 32B
- o1
- o3-mini-high
- o3-mini
o1 Rankings
- o3-mini-high
- o1
- Sonnet 3.5
- o3-mini
- Deepseek-R1 32B
I tend to agree more with the o1 rankings. As someone who's been playing with prompts a lot, I find it frustrating when certain phrases or words, like "succinctly" in this case, aren't respected in the response.
So What's It All Mean?
The prompt tested above is just one avenue for assessing the quality of these models (all were run with their default settings for temperature, top_p, etc.). It's far shorter than their 200k-token context windows can handle, but it still serves as a demonstration of their ability to pay "attention" to every included instruction.
Given o3-mini's much lower pricing compared to o1 or Claude's Opus, it's definitely an attractive new option for more complicated agent workflow pipelines. Things I used to use o1/Opus for, like codebase-level rewrites or significant new feature builds, are more affordable than ever.
I'd be surprised if we ever get to one single model that does everything best. My head's spinning just thinking about the possibilities for using each of these. What remains most exciting to me is that the list of incredible, and accessible, LLMs continues to grow.
What's Next?
Once I get API access to o3-mini, I'm planning to run more experiments and tests to see where the most value can be found within these newer models. The timing is good: just yesterday I began experimenting with competitive agent loops using Deepseek-R1 32B locally, which has already led to a few deeper dives.
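For the local side of those loops, this is roughly the request shape for querying a Deepseek-R1 32B model served by Ollama; the endpoint and model tag assume a default Ollama install with `deepseek-r1:32b` pulled, and the helper names are mine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_local_request(prompt: str,
                        model: str = "deepseek-r1:32b") -> urllib.request.Request:
    # stream=False returns one JSON object instead of a token stream
    body = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_local(prompt: str) -> str:
    # Requires a running `ollama serve` with the model already pulled.
    with urllib.request.urlopen(build_local_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

The same `ask_local` call can then be dropped into a competitive agent loop, with two personas critiquing each other's drafts turn by turn.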