The AI Treadmill

In this week’s episode of “The AI Treadmill” we bring you GPT-5.2 “The most advanced frontier model for professional work and long-running agents”.

This comes off the back of OpenAI’s Code Red, supposedly declared by Sam Altman in response to Google’s Gemini 3 taking a clear lead in the state of the art. Like every release, it makes an incremental improvement on most benchmarks.

The problem, found time and time again, is that these benchmarks and evaluations no longer have any meaning in practical terms. The only purpose they serve is to act as something shiny to attract the less informed journalist, customer and investor.

Take “SWE-Bench” performance, for example. These are software engineering coding questions designed to test models. There are many different flavours of the test (e.g. SWE-Bench Pro, SWE-Bench Verified). Something dishonest the model providers do is to report only on the variation on which their model performs best (or better than their competitors). At the same time, these models are deleting entire databases and getting caught in endless loops in real-world applications.

The performance of GPT-5.2 on GDPval, which OpenAI claims is the best analogue to knowledge work, looks very impressive at 70.9%. However, OpenAI created this benchmark and did not release results for its competitors’ models on it. This is the definition of marking your own homework.

To get a better understanding of the real changes with new releases, one must look at closed benchmarks like SimpleBench, where GPT-5.2 regressed in score from 61.6% to 57.4%. The current leader of SimpleBench is still Gemini 3 at 76.4%.

Regardless of the percentage scored on benchmark X, Y or Z, unless this translates into meaningful performance for actual work, I fear we will be stuck on this treadmill for a while.

I would prefer these models to perform badly or averagely on all benchmarks, but have the capacity to learn from their mistakes in real time.

15 December 2025
