GPT-5.2 vs Claude: The Intent Gap
I wanted GPT-5.2 to be good.
I really did. Competition makes everything better. When OpenAI ships something great, Anthropic has to respond. When Claude improves, OpenAI has to catch up. This is how we get better tools.
So when GPT-5.2 dropped, I cleared my weekend and ran a proper test. Same specs. Same context. Same tasks. Head to head.
The benchmarks say these models are nearly identical. SWE-bench has GPT-5.2 at 80.0% and Claude Opus 4.5 at 80.9%. Close enough to call it a tie.
My commit history tells a different story.
The Setup
I’ve been rebuilding parts of my website. New components, new layouts, an updated design system. Nothing exotic: just the kind of work that makes up 90% of real development.
I decided to run every task through both models in parallel:
- GPT-5.2 via Codex CLI
- Claude Opus 4.5 via Claude Code
Same specs. Same branding guidelines. Same tone of voice documentation. Same component library to pull from.
Seven spec files in total. Each one a discrete task with clear goals, documented intent, and existing patterns to follow.
(Yes, I do this for fun on weekends. Don’t judge.)
The Results
Here’s what happened:
- Branches from Codex that I committed: 0
- Branches from Codex that I considered committing: 1
- Branches from Claude Code that shipped: 7 out of 7
That’s not a typo. Across all 7 tasks, I didn’t ship a single line of GPT-5.2’s code. Claude delivered shippable output every time.
This isn’t about one model being “bad” and one being “good.” GPT-5.2 produced working code. It ran. It didn’t throw errors. Technically, it fulfilled the requirements.
But I wouldn’t have shipped it. Not without significant rework.
The Difference: Intent
Here’s where it gets interesting.
My specs weren’t perfect. They had gaps. Intentionally.
Real specs always have gaps. You can’t document every micro-decision. You write the goals, the constraints, the patterns to follow, the intent behind what you’re building. Then you trust the engineer — human or AI — to fill in the blanks intelligently.
My specs included:
- The feature goals and why they mattered
- Branding guidelines with tone of voice
- Component library with existing patterns
- Examples of similar implementations
What they didn’t include:
- Every CSS value
- Every edge case handling decision
- Every micro-interaction detail
- The “vibe” (for lack of a better word)
Claude filled in the gaps. It understood what I was building, not just what I wrote. When the spec said “match the existing card component style,” Claude looked at the existing cards, understood the pattern, and extended it appropriately.
GPT-5.2 followed instructions. Literally. When something wasn’t specified, it made generic decisions. Safe decisions. Decisions that technically worked but felt like they came from a different codebase.
One followed the letter. One understood the spirit.
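In code terms, the split looks something like this. A toy sketch only: the `tokens` object and every value in it are invented for illustration, not lifted from my actual design system.

```ts
// Toy sketch -- "tokens" and its values are invented for this post,
// not copied from my real design system.
const tokens = {
  space: { sm: "0.75rem", md: "1.5rem", lg: "3rem" },
  radius: "0.75rem",
  transition: "transform 200ms ease-out",
};

// Understanding intent: the new card derives its values from the same tokens
// the existing cards use, so it inherits the site's rhythm by construction.
const newCardStyle = {
  padding: tokens.space.md,
  borderRadius: tokens.radius,
  transition: tokens.transition,
};

// Following instructions: every value here is valid CSS and nothing errors,
// but none of it references the system the rest of the codebase is built on.
const genericCardStyle = {
  padding: "16px",
  borderRadius: "8px",
  transition: "all 0.3s ease",
};
```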
What “Understanding Intent” Actually Means
Let me make this concrete.
One of the specs was for a new section on my homepage. The spec described what the section should do, referenced the design system, pointed to similar components as examples, and explained why this section mattered for the overall page flow.
What Claude produced:
- Matched the spacing patterns from adjacent sections
- Used the same animation approach as other interactive elements
- Chose typography that fit the hierarchy I’d established
- Made the component responsive in the same way as my other components
What GPT-5.2 produced:
- Technically correct implementation
- Generic spacing that didn’t match the rhythm
- Different animation timing than everything else
- Typography choices that were fine but didn’t fit
- Responsive breakpoints that worked but felt inconsistent
It was like hiring a contractor who does exactly what the blueprint says, even when the blueprint has an obvious typo. Technically correct. Practically useless.
Both outputs would “work.” Only one looked like it belonged in my codebase.
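Here’s roughly what those mismatches look like side by side. Again, this is a hypothetical sketch: `theme`, the breakpoints, and the durations are stand-ins I made up, not my actual values.

```ts
// Hypothetical stand-in for the design system's shared constants.
const theme = {
  breakpoints: { md: 768, lg: 1120 },
  duration: { enter: 200 },
  easing: "cubic-bezier(0.2, 0, 0, 1)",
};

// The version that fit: the section animates and reflows on the same timing
// and thresholds as its neighbors, because it reads them from the same place.
const sectionReveal = {
  transition: `opacity ${theme.duration.enter}ms ${theme.easing}`,
  collapseBelow: theme.breakpoints.md,
};

// The version that "worked": plausible values, invented locally, so the
// section fades a beat slower and wraps at a slightly different width than
// everything around it.
const sectionRevealGeneric = {
  transition: "opacity 300ms ease-in-out",
  collapseBelow: 640,
};
```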
The GPT-5.2 output would have taken me 5-10 follow-up prompts to fix: adjusting the spacing, matching the animations, tweaking the typography, aligning the responsive behavior. By the time I’d done all that, I might as well have written clearer specs in the first place, or just used Claude.
Why Benchmarks Miss This
SWE-bench tests whether a model can solve isolated coding problems with clear specifications. It’s a clean environment: here’s the bug, here’s the test, fix it.
Real work isn’t like that.
Real work has ambiguity. Real specs have gaps. Real codebases have unwritten conventions that you’re expected to intuit. Real projects have a “vibe” that new code needs to match.
Benchmarks measure whether a model can follow instructions. They don’t measure whether a model can understand intent.
This is why GPT-5.2 benchmarks well but feels worse to use. It’s optimized for the test, not for the work.
(Benchmarks are like podcast interviews where the guest gets every question beforehand. It feels rehearsed. Robotic. The guest isn’t being themselves — they’re performing the version of themselves they want the audience to see. Real conversations don’t work that way. Neither does real coding.)
I’m not saying benchmarks are useless. They measure something real. But they don’t measure the thing that matters most in my workflow: can I ship what this model produces?
The Rushed Release Problem
Let’s talk about the elephant in the room.
GPT-5.2 was supposed to ship late December. It shipped December 9th. OpenAI employees reportedly asked to delay it. Leadership said no.
Why the rush? Gemini 3 was coming. OpenAI’s internal “Code Red” alarm was going off. They needed to ship something to stay in the news cycle.
This is the same pattern we saw with GPT-5.1. Rushed. Shipped to compete, not because it was ready.
I don’t have inside information. I don’t know what’s happening at OpenAI. But the pattern is clear: speed over readiness, two releases in a row.
Is OpenAI leadership really that out of touch? I doubt it. Smart people work there. They know these models ship rough. But they’re making a calculated bet that shipping fast matters more than shipping polished.
Maybe it does, for market positioning. It doesn’t for the people trying to actually use these tools.
What This Means for Choosing Tools
I’m not here to tell you what tool to use. Different workflows have different needs.
But I can tell you how I think about it:
If you work with clean, detailed specs — where every requirement is explicit and there’s minimal ambiguity — GPT-5.2 will probably work fine. The benchmarks are real. The model is capable.
If you work the way I do — with intent-driven specs, existing codebases with conventions, and an expectation that the AI will fill in gaps intelligently — Claude has a clear edge.
The gap isn’t in raw capability. It’s in how the models handle ambiguity. Claude seems to ask “what is this person actually trying to build?” GPT-5.2 seems to ask “what did this person literally write?”
One approach produces code you might ship. The other produces code you’ll definitely revise.
Competition Is Good. This Isn’t It.
I want to be clear: I’m rooting for OpenAI to get this right.
When they ship something great, everyone benefits. Anthropic has to improve. Prices drop. Features accelerate. The whole ecosystem levels up.
GPT-5.2 isn’t that.
It’s a rushed release that benchmarks well but feels worse to use. It’s the result of “Code Red” culture prioritizing speed over readiness. It’s shipping to ship.
I’ll keep testing. I’ll give GPT-5.3 or whatever comes next a fair shot. I want to be wrong about this.
But for now, for my workflow, the answer is clear:
Claude understands intent. GPT-5.2 follows instructions.
For the work I do, that difference is everything.
The Practical Takeaway
If you’re choosing between these tools, here’s what I’d actually do:
1. Test with YOUR workflow
Don’t trust benchmarks. Don’t trust my results. Run your actual work through both models and see what happens. The best model is the one that produces code you can ship.
2. Pay attention to the gaps
How does each model handle ambiguity? What happens when your spec isn’t perfect? That’s where the real differences show up.
3. Consider the trajectory
OpenAI is shipping rushed releases. Anthropic is shipping slower but more polished ones. That’s not a permanent state; things change. But right now, it tells you something about what each company prioritizes.
4. Don’t get religious about it
The tool you use today probably isn’t the tool you’ll use in 2026. Skills transfer. What matters is understanding how to work with AI coding tools, not which specific one you picked.
(I’m still bullish on Claude Code, obviously. But I’ll switch the moment something better comes along. Loyalty is for humans, not software.)