AI & Automation
April 25, 2026
5m52s

GPT-5.5, Benchmarks, and the End of Control: What Really Matters in AI

"GPT-5.5 has arrived promising to outperform Claude Opus 4.7, but do benchmarks still make sense? Understand why AI harness and corporate transparency matter more than the numbers."

In this article, I will share my impressions of the GPT-5.5 launch by OpenAI and, more importantly, explain why traditional benchmarks are losing relevance in the daily lives of those who actually use artificial intelligence for work. If you follow the AI market, you know that every launch is accompanied by a flood of numbers, charts, and comparisons. But does any of that make a difference when it comes to solving a real problem in your code, your project, or your routine? My experience says no. And I will explain why.

## The GPT-5.5 Launch

GPT-5.5 was announced yesterday by OpenAI. According to the first data released, the model outperforms Claude Opus 4.7 on some important benchmarks, including software development and tool use. Let's look at the numbers: on Terminal Bench 2.0, GPT-5.5 scored 82.7%, a significant jump from GPT-5.4's 75.1%. Claude Opus 4.7, for comparison, was at 69.4%. If you already find Opus impressive in your daily work, on paper GPT-5.5 should be even better.

However, one must be careful with these numbers. OpenAI tends to omit benchmarks where GPT loses to Claude. In other words, the comparison that reaches you is partial. I will put this into perspective further on.

Another fundamental point is the price. GPT-5.5 is twice as expensive as 5.4. OpenAI justifies this by saying the model is more efficient in token usage: it does more with less. But the cost per token has risen, and this changes the equation. If the per-token price doubles, the model has to finish the same task with less than half the tokens before your bill actually goes down.

It's worth comparing this with the approach of Anthropic, the creator of Claude. When it launched Opus 4.7, the company said: "it thinks for longer, spends more tokens, but the cost is not that high." These are two opposite philosophies. On one side, efficiency and more expensive tokens. On the other, more reasoning, more tokens, and a controlled cost.

I haven't been able to test GPT-5.5 in my daily work yet. I will do so in the coming days and bring a practical report. But I can already tell you: benchmarks, on their own, won't tell me whether the model is truly good.

## Why Benchmarks Aren't Everything

There are other numbers that deserve attention. On SWE-bench, an internal OpenAI software engineering benchmark, GPT-5.5 scored 73.1%. Sounds good, right? The problem is that you don't have Anthropic's score on this same test, because Anthropic does not allow its models to be run on competitors' internal benchmarks. The comparison is inherently one-sided.

Another example is OSWorld Verified, an important benchmark for real-world tasks. In this test, GPT-5.5 did not show significant gains over Opus 4.7. In other words, depending on what you measure, the story changes completely.

But the main problem isn't even that. The problem is that benchmarks do not reflect real use. In your daily work, you don't run a controlled test. You open a tool (Cursor, Codex, Claude Code, whatever it may be) and ask it to solve a real problem. And that's where variables no benchmark captures come into play.

The main one is what is being called the AI harness. The harness is the entire operating system that surrounds the model: caching, retry logic (trying again when something fails), decisions about when to search the internet, when to use a tool, when to delegate to a subagent. It is a complex system that goes far beyond the raw model. And here is the central point: you can use GPT-5.2 or GPT-5.4 inside Cursor and it will perform one way; use the same model inside Codex and the performance might be completely different. Because the harness is different. The model is the same. What changes is everything around it.
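To make that concrete, here is a minimal sketch of the kind of machinery a harness adds around a raw model call. Every name in it is hypothetical; real harnesses implement far more elaborate versions of each step.

```python
import time

CACHE: dict[str, str] = {}

def call_model(model: str, prompt: str) -> str:
    """Stand-in for the raw model API call (stubbed for illustration)."""
    return f"[{model}] response to: {prompt!r}"

def needs_web_search(prompt: str) -> bool:
    """In real tools this decision lives in the harness, not the model."""
    return "latest" in prompt.lower()

def run_with_harness(model: str, prompt: str, max_retries: int = 3) -> str:
    # 1. Cache: an identical request may never reach the model at all.
    cache_key = prompt
    if cache_key in CACHE:
        return CACHE[cache_key]

    # 2. Routing: the harness may silently delegate work to a smaller,
    #    cheaper model (a crude length heuristic stands in for that here).
    chosen_model = "small-model" if len(prompt) < 20 else model

    # 3. Tool use: whether to inject web results is also a harness choice.
    if needs_web_search(prompt):
        prompt = prompt + "\n[web search results would be injected here]"

    # 4. Retries: transient failures are retried with exponential backoff.
    for attempt in range(max_retries):
        try:
            answer = call_model(chosen_model, prompt)
            break
        except Exception:
            time.sleep(2 ** attempt)
    else:
        raise RuntimeError("model call kept failing after retries")

    CACHE[cache_key] = answer
    return answer

print(run_with_harness("big-model", "Refactor the latest auth module"))
```

Wrap the same model in two different sets of caching, routing, and retry rules and you get two different tools. That difference is what you feel in daily use, long before you feel a few benchmark points.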
Therefore, when you see a benchmark saying that model X is better than model Y, it means nothing until you test that model inside the tool you use, with the specific harness of that platform, solving the problems you actually face.

## The Control Problem: The Anthropic Case

What if I told you that you could select the best model, configure it to think at the maximum level, and still not get what you asked for? That is exactly what happened with Anthropic this week. The company publicly admitted a bug in its system. Since March 4th, in Claude Code, when the user configured "effort high" (that is, asked the model to think as much as possible), it was actually operating at "medium." The interface showed "high" as configured, but under the hood it was running "medium."

You believed you were using the model at the peak of its capacity, paying for it, trusting that configuration. In practice, you were receiving a crippled version that thought less.

This perfectly illustrates the point I've been raising for weeks on this channel: companies will have to choose between "nerfing" models (making them dumber to save computation) and increasing latency (making the user wait longer). Anthropic tried to make the user believe they were getting the best of both worlds, but in reality it was delivering less.

And even more serious: none of this appears in any benchmark. You won't look at the comparison table and see an asterisk saying "high mode is actually medium." The benchmark tests the model under ideal conditions. Your real use is subject to the "adjustments" platforms make to save money.

The conclusion is inevitable: we no longer have total control over what is running. You select a model, but subagents can be spawned with other models. You configure maximum effort, but the platform might ignore it. In my recent test with Claude Desktop, I selected Opus 4.7 and it decided to run a subagent using Haiku 4.5, a much smaller model. I wasn't warned and I didn't authorize it. It just happened.

What good is it, then, to keep comparing benchmarks? If you don't even know whether the model you requested is being used, whether the thinking level is being respected, or whether the platform is cutting corners to save money, those pretty numbers in a table mean absolutely nothing for your daily life.
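We don't know what Anthropic's bug actually looked like internally, but a purely hypothetical sketch shows how easily this class of failure arises: the UI reads one value while the request builder maps it through a stale table.

```python
# Purely hypothetical -- not Anthropic's actual code. It only illustrates
# how a displayed setting and the value sent to the model can diverge.

user_setting = "high"  # what the user picked, and what the UI displays

# Imagine a mapping layer that once capped effort to save compute and was
# never reverted. The UI never consults it; the request builder does.
EFFORT_MAP = {
    "low": "low",
    "medium": "medium",
    "high": "medium",  # the silent downgrade
}

def build_request(prompt: str, effort: str) -> dict:
    """Builds the payload that actually reaches the model."""
    return {"prompt": prompt, "reasoning_effort": EFFORT_MAP[effort]}

payload = build_request("Fix this race condition", user_setting)
print("UI shows:       effort =", user_setting)                 # high
print("Model receives: effort =", payload["reasoning_effort"])  # medium
```

The user-facing layer and the request path simply disagree, and nothing in a benchmark table will ever surface it.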
## What to Do, Then?

Given this scenario, what should be the stance of those who actually use AI for work? I suggest three practical movements.

First: test in practice; do not rely only on benchmarks. Take a real problem from your work (that annoying bug, that complex refactoring, that documentation that needs to be written) and test the models side by side. Only then will you find out which one works best for your specific context.

Second: choose the right model for each task. Not every model needs to be good at everything. GPT-5.5, for example, seems to be excellent at searching and creating plans. Perhaps the smart strategy is to use GPT-5.5 for planning and research, and Claude Opus 4.7 for execution. They are complementary.

Third: test the tools, not just the models. Codex, Claude Code, Cursor: each of these platforms has its own AI harness. One might be better for long, complex tasks; another might be faster for short interactions. And prices vary too. Knowing which tool to use in each situation is becoming as important a skill as knowing how to program.

## What's Next

Next week, I intend to publish a practical review of GPT-5.5. I'm going to point it at a real debugging case: a particularly sinister bug we have in PSUA on Windows. This bug has been my personal benchmark in recent months because it is genuinely difficult to solve. I will test GPT-5.5 side by side with Claude Opus 4.7 and give an honest report on which one fares better. I won't rely on the pretty numbers companies release. I will rely on what works, or doesn't, in practice.
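If you want to run the same kind of personal benchmark yourself, a minimal sketch looks like the one below. The `ask` function is a placeholder, and the model names are just the ones used in this article; wire in whichever SDK or endpoint you actually use.

```python
# A personal, side-by-side test: one real task, two models, judged by you.
# `ask` is a stand-in -- swap in the SDK or HTTP client you actually use.

TASK = """\
Here is a real bug from my codebase: <paste the stack trace and code here>.
Explain the root cause and propose a fix.
"""

MODELS = ["gpt-5.5", "claude-opus-4.7"]  # names as used in this article

def ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call to the given model."""
    return f"[{model}] proposed fix for: {prompt[:40]!r}..."

results = {model: ask(model, TASK) for model in MODELS}

for model, answer in results.items():
    print(f"=== {model} ===\n{answer}\n")

# The scoring is manual, and that is the point: apply each proposed fix,
# run your test suite, and note which answer actually solved the problem.
```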
## Summary for Those in a Hurry

If you haven't read the whole article, here are the main points:

- GPT-5.5 is more expensive and more token-efficient, but that doesn't mean it's better for your specific use.
- Benchmarks don't tell the full story. The AI harness, the entire system around the model, matters as much as the model itself.
- Companies are "nerfing" models to save money, as we saw in the Anthropic bug where "effort high" was actually "medium."
- You no longer have total control over which model is running, whether subagents are being used, or whether the effort configuration is being respected.
- The best way forward is to test personally, with real problems, and choose model and tool according to the needs of each task.

## Final Consideration

The artificial intelligence market is maturing, and with maturity come complexities. There is no longer a single "best model." There is the right model for your tool, for your problem, for your budget. Are benchmarks useful? Yes, as a starting point. But don't stop there. Test, break, compare. And remember: behind every pretty number there is a company making business decisions that may be affecting the quality of what you are using, without you knowing.

Written by

PVFraga
