Published: December 27, 2025
I recently had a very strange night-time thought. The kind of idea that comes just before dozing off and makes you internally wake up and (at least in my case) forces you to get out of bed and write it down lest you forget it by morning. For some reason, I imagined myself running a print-on-demand shop. It seemed to me that there is a clear moral hazard problem: how do I show myself to be a trustworthy, high-quality printer rather than a rip-off? Presumably, the easiest way to convince a customer that I'm legit would be to show them a first sample for free. But, and this is how the thought came to me, wouldn't a lot of people who just wanted a single copy of some book exploit this? What would be a strategy to prevent this?
It occurred to me that it would no longer occur to me to put this question into Google. It's not really googlable. On the simplest level, most words in the question "How does a print-on-demand store offering one free sample prevent exploitation?" have nothing to do with the thing I'm actually looking for. Semantically, the question I'm asking and the answer I'm looking for are very far apart. For this reason, traditional Google search would likely not succeed.
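To make the lexical mismatch concrete, here is a toy sketch. The candidate answer is hypothetical, just one plausible fix to the moral hazard problem; the point is only that keyword search has almost nothing to latch onto:

```python
# Toy illustration: keyword search lives or dies by lexical overlap,
# and here there is essentially none.
query = "how does a print-on-demand store offering one free sample prevent exploitation"
# A plausible (hypothetical) answer lives in a completely different vocabulary:
answer = "require a refundable deposit credited against the first bulk order"

q_words, a_words = set(query.split()), set(answer.split())
print(q_words & a_words)  # -> {'a'}: the only shared token is a stopword
```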
Yet it is a question that most LLMs would excel at, given that the "correct" way to approach this problem is, first, to generalize: assign the question to the larger conceptual group of "moral hazard problems". And then, second, to look laterally within this group for methods that could work.
It is in fact the perfect combination: an answer that is not really creative (perhaps you prefer the term "original") beyond using something well-established in one context in another context (which, it should be admitted, is creative by some standards, sometimes!), and yet one that requires too much lateral thinking (i.e. covering too much semantic distance) for Google to be likely to yield anything useful. I think most cases in which LLMs give me really, really good answers in my everyday life are of this type: lateral but uncreative.
Most coding questions are of this type as well. It's an algorithm or a snippet well-established somewhere that now just needs to be laterally applied here, e.g. replacing all mentions of "car" with "bike" because you're now coding for a different but analogous website.
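A toy sketch of what that lateral step looks like in practice (the "Car"/"Bike" types and the pagination snippet are hypothetical stand-ins): the well-established logic carries over unchanged once the domain word is swapped.

```python
from dataclasses import dataclass

@dataclass
class Car:
    model: str

@dataclass
class Bike:
    model: str

def paginate(items, page, per_page=10):
    """Standard pagination snippet, identical for any item type."""
    start = page * per_page
    return items[start:start + per_page]

# The "lateral" application: the same snippet serves both the
# car-rental site it was written for and the analogous bike site.
cars = [Car(f"car-{i}") for i in range(25)]
bikes = [Bike(f"bike-{i}") for i in range(25)]

print(len(paginate(cars, page=2)))   # 5 (last partial page)
print(len(paginate(bikes, page=0)))  # 10
```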
On a related note, do try to figure out what problems LLMs are actually being tested on in the fancy new reasoning benchmarks everybody is focused on these days. While it makes perfect sense not to publish them online (the LLM would train on them and then know the answers, essentially cheating the benchmark), it seems surprisingly hard to even get example problems. The best example I could find online, FrontierMath by Epoch AI (who are still sponsored by OpenAI), actually does give some insightful examples, and they seem legit. Correspondingly, LLMs score somewhere below 6% on it as of the writing of this blog, far below most other math benchmarks, where everybody seems to get >90% these days. The other good example I'm aware of is ARC-3, where you can play the same games as the AI. No models have been benchmarked on it so far, but I'm certainly excited to see the results.
In any case, it seems wise not to overvalue any benchmark that has been around for a while (or, for that matter, any benchmark influenced by the hyperscaler labs themselves). Over time, data contamination of some form or another is presumably inevitable for any benchmark that is out there. You could address this by making problems more random (as ARC-3 attempts), but even then some general solution strategies would probably become popular on the internet, and your model would learn these over time. So it's best to rely on new, independent, tough benchmarks where you can actually have some certainty that the problems are sufficiently hard for a high score to be impressive. Otherwise, you're just hyping something up that did some useful if uncreative lateral thinking.