Claude Opus 4.6 vs GPT-5.3-Codex: The AI Coding War That Crashed the Stock Market
On February 5, 2026, at approximately 6:40 PM ET, Anthropic released Claude Opus 4.6. Twenty minutes later, OpenAI dropped GPT-5.3-Codex. Within 48 hours, $285 billion had been wiped from global software stocks. Three days later, both companies aired competing Super Bowl ads.
This wasn't just a product launch. It was the opening salvo of the AI coding wars, and the fallout is still reshaping the tech industry.
We've spent the past three days testing both models, reading the system cards, verifying benchmarks, and separating the marketing from reality. Here's what you actually need to know.
The 20-Minute War
The timing wasn't coincidental. Anthropic went first with Opus 4.6, and OpenAI responded almost instantly with GPT-5.3-Codex, a move VentureBeat described as a deliberate "counter-punch" amid an already heated week between the two companies.
Both models represent massive leaps in AI coding capability. Both claim to be the best. And both come with unprecedented implications for cybersecurity, software development, and the future of white-collar work.
Let's break down what each model actually brings to the table.
Claude Opus 4.6: The Highlights
Opus 4.6 is Anthropic's most capable model to date, and the improvements over Opus 4.5 are substantial:
- 1 million token context window (beta) — 5x its predecessor's 200K limit
- 128K token max output — doubled from Opus 4.5's 64K
- Agent Teams — multiple Claude Code instances coordinate work in parallel, with a lead session assigning tasks and synthesizing results
- Adaptive Thinking — four effort levels (low, medium, high, max) replace the old binary extended-thinking toggle, letting the model decide how deeply to reason (see the API sketch after this list)
- Context Compaction API — automatic server-side summarization for effectively infinite conversations
- Fast Mode — 2.5x faster output at 6x the price ($30/$150 per million tokens), with a 50% introductory discount through February 16
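How these controls surface in the API can't be confirmed from the announcement alone, so treat the following as a minimal sketch built on Anthropic's Python SDK: the model ID, the `effort` field, and the beta flag are our assumptions, not documented parameters.

```python
# Hedged sketch: one Opus 4.6 call with an explicit effort level.
# ASSUMPTIONS: the "claude-opus-4-6" model ID, the "effort" key in the
# thinking block, and the "context-1m" beta flag are guesses based on
# this article; none of them is confirmed API surface.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-opus-4-6",          # hypothetical model ID
    max_tokens=64_000,                # the new 128K ceiling would also be valid here
    thinking={"type": "adaptive", "effort": "high"},  # hypothetical effort control
    betas=["context-1m"],             # hypothetical flag for the 1M-token window
    messages=[
        {"role": "user", "content": "Review this repo for injection bugs."}
    ],
)
print(response.content[0].text)
```

If the Context Compaction API works as described, long sessions like this would no longer need client-side truncation; the server would summarize older turns automatically.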
But the headline that grabbed everyone's attention was this: Opus 4.6 autonomously discovered 500+ zero-day vulnerabilities in open-source software using only out-of-the-box capabilities. Each was validated by Anthropic researchers or independent security experts. For one Ghostscript flaw, Claude turned to the project's Git commit history after both fuzzing and manual analysis failed, a problem-solving approach that surprised even Anthropic's own team.
GPT-5.3-Codex: The Highlights
OpenAI's response was equally ambitious:
- 400K token context window with a "Perfect Recall" attention mechanism that OpenAI says prevents information loss in the middle of long contexts
- 128K token max output — matching Opus 4.6
- Auto-Router architecture — automatically switches between a fast "Reflex Mode" for simple queries and a "Deep Reasoning Mode" for complex problems (see the API sketch after this list)
- Interactive Steering — users can adjust, redirect, and provide feedback mid-task without losing context
- Deep Diffs — shows why a code change was made, not just what changed
- 25% faster inference than GPT-5.2-Codex, using fewer output tokens to achieve equivalent results
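OpenAI describes the Auto-Router and Interactive Steering at the product level only. As a point of comparison, here is what a single call might look like through the existing Responses API; the model ID comes from the announcement, and the idea that a reasoning-effort hint can nudge the router toward Deep Reasoning Mode is our assumption.

```python
# Hedged sketch: one GPT-5.3-Codex call via the OpenAI Responses API.
# ASSUMPTIONS: the model is exposed under this ID, and a reasoning-effort
# hint influences the Auto-Router; neither is confirmed by OpenAI.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.3-codex",         # hypothetical model ID
    input="Turn this flaky deploy script into idempotent steps.",
    reasoning={"effort": "high"},  # assumed nudge toward Deep Reasoning Mode
)
print(response.output_text)
```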
The controversial headline: GPT-5.3-Codex "helped build itself." OpenAI says early versions of the model debugged its own training runs, managed deployment infrastructure, and diagnosed test results. However — and this is important — OpenAI's own system card explicitly states the model "does not reach High capability on AI self-improvement." The reality sits between the marketing and the safety assessment: it was a useful tool during development, but humans remained in charge.
What the system card does confirm is more concerning: GPT-5.3-Codex is the first OpenAI model to receive a "High" cybersecurity designation under their Preparedness Framework, meaning it could potentially automate end-to-end cyber operations against hardened targets. OpenAI is taking a precautionary approach, gating full API access and committing $10 million in API credits for cyber defense.
Head-to-Head Benchmarks
Here's where it gets interesting. Important caveat: Anthropic and OpenAI report on different SWE-Bench variants (Verified vs. Pro), making direct comparison on that specific benchmark unreliable. We've included all available scores from official sources.
| Benchmark | Opus 4.6 | GPT-5.3-Codex | Winner |
|---|---|---|---|
| Terminal-Bench 2.0 | 65.4% | 77.3% | Codex (+11.9) |
| GPQA Diamond | 91.3% | 73.8% | Opus (+17.5) |
| ARC AGI 2 | 68.8% | — | Opus |
| OSWorld | 72.7% | 64.7% | Opus (+8.0) |
| BrowseComp | 84.0% | — | Opus |
| Cybersecurity CTF | — | 77.6% | Codex |
| SWE-Lancer IC Diamond | — | 81.4% | Codex |
| MMLU Pro | 85.1% | 82.9% | Opus (+2.2) |
| SWE-Bench Verified | 80.8% | Not reported* | — |
| SWE-Bench Pro | Not reported* | 56.8% | — |
*SWE-Bench Verified and SWE-Bench Pro are different benchmark variants with different methodologies and difficulty levels. Direct numerical comparison between them is not valid.
The pattern is clear: Codex dominates terminal-based automation and sustained coding tasks, while Opus leads in reasoning, knowledge, browsing, and abstract problem-solving. On Every.to's independent "LFG" benchmark (a practical test involving React, 3D visualization, and a full e-commerce build), Opus scored 9.25/10 to Codex's 7.5/10, with the gap widening on complex, under-specified requirements.
Opus 4.6's Standout Result
The ARC AGI 2 score jumped from 37.6% (Opus 4.5) to 68.8%, a 31.2 percentage point gain that nearly doubles the previous score. It's among the largest single-benchmark improvements we've seen in a frontier model update, and it suggests genuine advances in abstract reasoning, not just benchmark optimization.
Pricing: Where They Stand
| | Claude Opus 4.6 | GPT-5.3-Codex |
|---|---|---|
| API Input | $5 / MTok | Not yet announced |
| API Output | $25 / MTok | Not yet announced |
| Context Window | 1M tokens (beta) | 400K tokens |
| Max Output | 128K tokens | 128K tokens |
| Batch Discount | 50% ($2.50 / $12.50) | TBD |
| Cache Savings | Up to 90% | TBD |
| Subscription | Claude Pro $20/mo | ChatGPT Plus $20/mo |
As of February 8, OpenAI has not released official API pricing for GPT-5.3-Codex. The predecessor GPT-5.2-Codex was priced at $1.75 input / $14 output per million tokens — significantly cheaper than Opus 4.6. If OpenAI maintains similar pricing, Codex will likely be the more affordable option for pure coding tasks.
Worth noting: Opus 4.6 is priced identically to Opus 4.5 ($5/$25), meaning all the improvements, including the 5x context window expansion, come at no additional cost over the previous version.
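To make those numbers concrete, here is a small cost sketch using only the per-million-token prices quoted in this section; the Codex row uses GPT-5.2-Codex rates as a stand-in, since 5.3 pricing is unannounced.

```python
# Cost sketch from the per-million-token prices quoted in this article.
# NOTE: GPT-5.3-Codex pricing is unannounced; we substitute GPT-5.2-Codex
# rates ($1.75/$14) purely as a placeholder.

PRICES = {  # model: (input $/MTok, output $/MTok)
    "opus-4.6":          (5.00, 25.00),
    "opus-4.6-fast":     (30.00, 150.00),  # Fast Mode: 6x the base price
    "opus-4.6-batch":    (2.50, 12.50),    # 50% batch discount
    "codex (5.2 rates)": (1.75, 14.00),    # placeholder, not official
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the quoted rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 200K-token repo dump that yields a 20K-token patch.
for model in PRICES:
    print(f"{model:>18}: ${cost(model, 200_000, 20_000):.2f}")
```

At those rates the example call costs $1.50 on Opus 4.6, $9.00 in Fast Mode, $0.75 via batch, and roughly $0.63 at the placeholder Codex rates, which is why pricing could decide this race for high-volume coding workloads.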
The $285 Billion "SaaSpocalypse"
The simultaneous release of both models, along with Anthropic's industry-specific Claude Cowork plugins and OpenAI's Frontier agent platform, triggered a massive selloff across global software stocks. Jeffrey Favuzza at Jefferies' equity trading desk coined the term "SaaSpocalypse" to describe what he called "very much 'get me out' style selling."
The fear? AI agents could render the SaaS subscription model obsolete. If an AI can do the work that software used to do, why pay per seat?
| Company | Ticker | Decline (YTD unless noted) |
|---|---|---|
| Figma | FIGM | -40% |
| HubSpot | HUBS | -39% |
| Shopify | SHOP | -38% |
| Atlassian | TEAM | -35% |
| Intuit | INTU | -34% |
| Salesforce | CRM | -26% |
| LegalZoom | LZ | -20% |
| Thomson Reuters | TRI | -18% (single day) |
The iShares Software ETF (IGV) dropped over 20%. The JPMorgan US Software Index fell 7% in a single trading day. Even advertising giants like WPP (-12%), Omnicom (-11%), and Publicis (-9%) were caught in the downdraft.
The Super Bowl Ad War
As if the stock market carnage weren't dramatic enough, both companies aired competing ads during Super Bowl LX on February 8.
Anthropic's campaign was a direct shot at OpenAI, with the tagline: "Ads are coming to AI. But not to Claude." One 60-second pregame spot showed a man asking a chatbot for advice on communicating with his mom, only for the response to morph into an ad for a fictional cougar-dating site called "Golden Encounters." The message was clear: AI that serves ads can't serve you.
OpenAI's 60-second Codex commercial took a different approach, positioning AI coding as part of a long lineage of human creation and building.
Sam Altman didn't take the jab quietly. He called Anthropic's ads "funny" but "clearly dishonest," arguing that OpenAI "would obviously never run ads in the way Anthropic depicts them." He went further, calling Anthropic "authoritarian" and claiming they block Claude Code usage from "companies they don't like" — including OpenAI. His closing shot: "One authoritarian company won't get us there on their own... It is a dark path."
The Honest Assessment: Which Should You Use?
After testing both models and reviewing the data, here's our take:
Choose Claude Opus 4.6 if you need:
- Deep reasoning on complex, ambiguous problems
- Multi-agent workflows (Agent Teams)
- Security auditing and vulnerability research
- Long-context tasks (1M tokens vs 400K)
- Higher ceiling on open-ended creative coding
Choose GPT-5.3-Codex if you need:
- Fast, reliable terminal automation
- Consistent output with fewer errors and lower variance
- Mid-task steering and redirection
- Sustained autonomous coding sessions
- Budget-friendly API pricing (once announced)
As Every.to put it in their "Great Convergence" review: "Opus has a higher ceiling but more variance; Codex is more reliable with fewer errors." Neither model dominates universally — and increasingly, teams are mixing and matching based on the task.
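One way teams operationalize that mixing is a thin routing layer in front of both APIs. The sketch below is purely illustrative: the task categories and the model mapping just encode this article's guidance, not anything either vendor ships.

```python
# Illustrative task router encoding this article's model guidance.
# The categories, mapping, and model IDs are editorial choices, not
# vendor features or confirmed identifiers.

ROUTING = {
    "terminal_automation": "gpt-5.3-codex",    # Codex's strongest benchmark area
    "steered_refactor":    "gpt-5.3-codex",    # Interactive Steering
    "long_context_review": "claude-opus-4-6",  # 1M-token window
    "security_audit":      "claude-opus-4-6",  # vulnerability-research track record
    "open_ended_design":   "claude-opus-4-6",  # higher ceiling, more variance
}

def pick_model(task_type: str) -> str:
    """Route a task to a model; default to Codex for routine coding work."""
    return ROUTING.get(task_type, "gpt-5.3-codex")

print(pick_model("security_audit"))  # -> claude-opus-4-6
```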
The Writing Quality Caveat
One criticism worth noting: multiple users on Reddit have reported that Opus 4.6's writing quality regressed compared to Opus 4.5. Prose is described as "flatter, more generic" with more formulaic constructions. The theory is that reinforcement learning optimizations for reasoning came at the cost of natural prose quality. The community consensus: use 4.6 for coding, stick with 4.5 for writing. Anthropic has not formally addressed this.
What's Coming Next
The AI coding race isn't slowing down. Here's what we're watching:
- DeepSeek V4 (expected ~February 17) — focuses on repo-level reasoning and a novel "Engram" memory system for near-infinite context retrieval. Expected to be open-weight.
- Gemini 3 Pro — Google's model already leads LMArena's text leaderboard for user preference and slightly edges Opus 4.6 on GPQA Diamond (91.9% vs. 91.3%)
- Claude Sonnet 5 — Anthropic's next mid-tier model, expected with gains in coding and reasoning
February 2026 might be remembered as the month the AI coding wars went fully mainstream — complete with stock market crashes, Super Bowl ads, and models that help build themselves. The question isn't which model wins anymore. It's whether your workflow is ready for what's next.
Sources
- Anthropic — Claude Opus 4.6 Official Announcement
- OpenAI — Introducing GPT-5.3-Codex
- OpenAI — GPT-5.3-Codex System Card
- Vellum — Claude Opus 4.6 Benchmarks Explained
- VentureBeat — AI Coding Wars Heat Up
- Every.to — GPT-5.3 Codex vs Opus 4.6: The Great Convergence
- Fortune — Claude Opus 4.6 Triggers Stock Selloff
- CNBC — SaaS Software Stocks Selloff
- TechCrunch — Sam Altman vs Anthropic Super Bowl Ads
- Anthropic — API Pricing Documentation
- Fortune — GPT-5.3-Codex Cybersecurity Risks
- NxCode — SaaSpocalypse 2026