Published: January 4, 2026
Rule one. "Never use a metaphor, simile, or other figure of speech which you are used to seeing in print." — George Orwell
It has become a common occurrence of everyday life that roughly every three to four months one's LinkedIn, X, and Bluesky feeds explode with prophetic messages from the frontiers of 21st-century AI - a new wave of benchmark results dropped by the next iteration of your favourite hyperscaler model. There are really two main types of signal I observe: somebody (typically the company itself) boasting about a benchmark result of theirs (either on internal data, famously sometimes even without a y-axis, or on an external benchmark, though external by no means implies independent), or someone raving about how "they tried it themselves" and were "blown away", "stunned", "stopped in their tracks" (I venture to estimate that around 80% of LinkedIn posts make use of one of these three limited linguistic metaphors).
I reserve a good amount of doubt for the latter category of endorsement. While I believe that many of these are made earnestly and sincerely (and have been made to me by friends whose judgement I respect and value), I have never myself had that same experience, despite using LLMs for a good amount of coding (and other tasks) and generally being willing to try 0-shotting an entire project before building it piecemeal through more extensive prompting. Undeniably, ChatGPT 5.2 is better than ChatGPT 3. There also remains a good number of tasks it still won't do satisfactorily for me, some of which more scale (read: more parameters, more RL, more inference budget) will fix. That, to me, is the state of first-hand experience with LLMs in the area where I find them most reliably and practically useful: coding.
As someone self-identifying as a Chemist, I'm perhaps unsurprisingly quite partial to big, statistically sound benchmarks. Before we embark on what I estimate to be roughly the best case against them, let me say this: it's quite easy for me to imagine a world in which independent benchmarking (including the very word "benchmarking", itself a metaphor) is not well known to anyone outside a small, specialist community, and in which AI performance gains are mostly reported through companies releasing their own results on their own data and tests. Clearly, it's a pretty great achievement that all of the major AI labs somehow feel obliged to evaluate themselves on a whole host of external benchmarks that are often constructed by generally well-meaning researchers. That already beats the standard of evaluation in a lot of other industries (home appliance manufacturers measuring their own energy efficiencies comes to mind), and is perhaps only exceeded by pharma, where state regulation enforces even more rigorous, external evaluations known as clinical trials.
Obviously, there are bad benchmarks out there. I would, until proven otherwise, cluster all internal company benchmarks in this category; I won't be considering those here. There are well-meaning, large, external benchmarking datasets with statistically rigorous evaluation out there. Again: a relatively high standard that not all industries can claim. Yet, given how much depends on these benchmarks - around 35% of the Dow Jones Industrial Average by one estimate - we must examine them with corresponding exactitude. After all, the stock price of Miele does not tend to make big jumps on the back of its new efficiency reports.
As I see it, there are three main problems besetting even the best of the external benchmarks out there today, plus two somewhat related problems, or at least open research questions, that I'll discuss further below.
This last point actually becomes even more complicated in the age of RL. What prevents a model builder from doing RL on tasks that are at least related to the problems the benchmark poses? Maybe this doesn't work the first time around, when nobody has ever seen your benchmark, but it would be quite easy to do from then on, and the incentives would be there. In a world in which model builders were "overfitting to the benchmark" with RL, you would expect the first submissions to score relatively low and the increments to come later on. This, however, is indistinguishable from the underlying models just getting better. I'm not pretending to know the answer to this either, but it's plausible that something like this "benchmark busting" is contributing to the slope of these leaderboard curves.
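To make that indistinguishability concrete, here is a toy simulation (entirely made-up numbers, not real leaderboard data): one score trajectory comes from a model line that genuinely improves across the board, the other from a flat model line that simply receives more benchmark-adjacent RL with every release after the benchmark becomes public. The resulting leaderboard curves look the same.

```python
import numpy as np

rng = np.random.default_rng(0)
releases = np.arange(8)  # successive submissions after the benchmark is published

# World A: the underlying model genuinely gets better between releases.
genuine = 0.40 + 0.05 * releases

# World B: underlying capability is flat, but each release adds more
# benchmark-adjacent RL, worth roughly the same score gain per release.
flat_capability = np.full(releases.size, 0.40)
targeted_rl_gain = 0.05 * releases
busted = flat_capability + targeted_rl_gain

noise = lambda: rng.normal(0.0, 0.01, size=releases.size)
print("world A (genuine progress): ", np.round(genuine + noise(), 2))
print("world B (benchmark busting):", np.round(busted + noise(), 2))
# Both trajectories start low and climb steadily; the scores alone cannot
# distinguish the two underlying explanations.
```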
Is RL bitter-lesson compliant?
Something that often comes to my mind concerning benchmarks and modern RL scaling laws is the Bitter Lesson by Richard Sutton. In brief, the idea is that a lot of ML work was misguided in trying to impart clever ideas and specialist understanding that made a model a little bit better at one small use-case (e.g. object recognition or text translation). In fact, we would have been much better off using general-purpose architectures that are not at all adapted to a particular problem and training them at a much larger scale. Transformers really do solve both object recognition and text translation without the need for any human-designed expertise, specialist understanding, or clever algorithms. By hand-crafting solutions, you're essentially swimming against Moore's law - and swimming against an exponential is pretty painful.
Now, I'm actually not so sure how bitter-lesson compliant modern RL is. There are other problems with RL too, of course, for one thing the lack of good "games" we can actually run any type of RL on. This motivates the work many are now doing on constructing better world simulators, physics world models, cell emulators, etc., so that self-play can happen in these environments, translating scalable RL into a new problem domain (e.g. bio). I quite like Dwarkesh's take on this. But another, more philosophical concern one could have with modern RL is that it specializes models for a very small class of maybe useful and even economically valuable tasks - but, unlike in the age of parameter scaling (and perhaps even inference-time scaling), we're no longer scaling towards AGI. At best we're scaling towards commercial viability in some domains. (Which is expressly not commercial viability for the hyperscalers, who need something pretty close to AGI - something that replaces a good part of our current workforce - to justify their current valuations and capital expenditure.)
Does any of this generalize?
In the end this relates to something that Epoch AI (I know I just called them potentially compromised, but ...) recently wrote a nice summary about. Essentially, it comes down to the question of whether general intelligence rests on one "deep" capability that unifies all (or at least most) tasks of intelligence, or whether most tasks are "contingent", i.e. not really related to one another, with too little transfer learning and generalizability among them, so that you need to train specialist models for each task separately. In the worst case, these contingent tasks exist on some kind of Pareto frontier, where improving on one ability (e.g. coding) negatively impacts your ability on another (e.g. poetry).
Interestingly, in (young) humans there is evidence for this "deep" model of intelligence: psychologists have observed that children who are good at one task generally have a higher likelihood of being good at other, unrelated tasks as well. This is referred to as the g-factor, which summarizes the positive correlations between cognitive tasks. The g-factor can explain around 40-50% of the variance in IQ-test scores between individuals.
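As a rough illustration of what "summarizing the positive correlations" means, here is a minimal sketch on synthetic data (not real psychometric results; the loadings and noise levels are invented): generate task scores that all draw on one shared latent factor, then check how much of the total variance the leading factor of the correlation matrix accounts for.

```python
import numpy as np

rng = np.random.default_rng(1)
n_children, n_tasks = 1000, 6

# Synthetic scores: one shared latent factor ("g") plus task-specific noise.
g = rng.normal(size=(n_children, 1))
loadings = rng.uniform(0.5, 0.8, size=(1, n_tasks))
scores = g @ loadings + rng.normal(scale=0.8, size=(n_children, n_tasks))

# A crude g-factor estimate: the leading eigenvalue of the correlation matrix.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)            # eigenvalues in ascending order
explained = eigvals[-1] / eigvals.sum()       # variance share of the leading factor
print(f"variance explained by the leading factor: {explained:.0%}")
# With these invented parameters the share lands roughly in the 40-60% range,
# the same ballpark as the g-factor figure quoted above.
```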
I personally mostly subscribe to a contingent world model. One reason is laid out by the folks at Epoch AI: firstly, they observe that most of their benchmarks are highly correlated. Secondly, by doing a PCA, they conclude that the second principal component of benchmark performance (where the first is just "being good at everything", which you could see as a g-factor) is how much "Claudiness" your model has. In other words, a model that is good at coding and agentic tasks, but not so good at vision and also bad at math - essentially, the priorities Claude shows strongly (at the time of writing). (A toy sketch of this kind of analysis follows the quote below.) As Greg Burnham at Epoch AI writes:
But the existence of the Claudiness dimension feels to me like a bit of evidence for the “contingent” world. Anthropic has focused on making models that are state-of-the-art at agentic coding. Without additional focused investment, the models turn out not to be exceptional at advanced math. There is surely some generalization across tasks, but perhaps this is a sign of its limits.
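To make the PCA framing concrete, here is a minimal sketch on entirely made-up scores (the model profiles and numbers are illustrative assumptions, not Epoch AI's data or code): run PCA on a models × benchmarks score matrix and inspect the loadings. The first component comes out as "good at everything", while the second contrasts coding/agentic benchmarks with math/vision ones, i.e. a Claudiness-like axis.

```python
import numpy as np

# Made-up scores (rows: models, columns: benchmarks); purely illustrative.
benchmarks = ["coding", "agentic", "math", "vision"]
scores = np.array([
    [0.82, 0.78, 0.55, 0.50],   # a "Claude-like" profile: strong coding/agentic
    [0.70, 0.65, 0.75, 0.72],   # a balanced profile
    [0.60, 0.55, 0.80, 0.76],   # a math/vision-leaning profile
    [0.50, 0.45, 0.48, 0.44],   # weaker across the board
    [0.88, 0.84, 0.86, 0.83],   # stronger across the board
])

# PCA via SVD of the column-centered score matrix; rows of vt are the components.
centered = scores - scores.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

print("PC1 loadings:", dict(zip(benchmarks, np.round(vt[0], 2))))
print("PC2 loadings:", dict(zip(benchmarks, np.round(vt[1], 2))))
# PC1 loads on every benchmark with the same sign ("good at everything");
# PC2 pits coding/agentic against math/vision, a toy version of the
# "Claudiness" dimension described above.
```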
I think Burnham's reading is mostly right, and I hold a contingent world model to be more congruent with other priors I hold, such as a general no-free-lunch theorem and "the minus first law of thermodynamics", which posits that information is conserved - both of which I find at least notionally at odds with the idea that learning on one task yields large, unaccounted-for gains on other tasks. Lastly, it also rhymes with my own experience with these models, where I haven't found them to be nearly as impressive in chemistry or biology as they are at coding.
The question of whether we live in a "deep" or a "contingent" world is profoundly relevant to benchmarking too: in a "deep" world, where everything depends essentially on an AI's g-factor, any benchmark could plausibly swap in for another. In a contingent world, you'd better have a benchmark for each task, and you would spend a lot of time thinking about which capabilities you're not measuring - not least so you steer the work of LLM builders towards those tasks as well.
Benchmarks are hugely important (so is your personal experience with them, as limited and biased as it might be) and they deserve attention. They arguably deserve some more scrutiny too. On a more optimistic note, maybe rougher benchmark results are all the market needed for a catalyst. And even a recession might have some upsides.