May 12, 2026
The Three-Cylinders Problem: when models choose beauty over truth
Here is a problem a good geometry student can solve in twenty minutes. We gave it to four frontier models. Three got it wrong, and the way they got it wrong is more interesting than the score.
The problem
Consider a cube of side length 2, aligned with the coordinate axes. Place three cylinders inside it, each of height 2 and radius R, each aligned with some coordinate axis. The cylinders may not intersect. What is the maximal R?
The setup is clean. You could explain it to anyone who has taken geometry.

If you assign one cylinder to each axis, you get a beautiful configuration: three cylinders nestle into the cube like the bones of a Steinmetz solid, each touching the other two tangentially, each grazing the cube's faces. The maximal radius is R = 1/2. The geometry is tight, symmetric, satisfying. It is the kind of answer that makes you think you are done.
You are not.
The problem does not say “each aligned with a different axis.” If you instead align all three cylinders with the same axis, the problem reduces to packing three circles of radius R inside a 2×2 square, a classical problem whose answer is known. The optimal configuration places the three centers at the vertices of an equilateral triangle tilted at 15°, and the resulting radius is R = 4 / (4 + √2 + √6) ≈ 0.5087, strictly greater than 1/2. The all-parallel configuration wins.
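The all-parallel configuration can be checked numerically. The sketch below assumes one particular layout consistent with the description above (one circle in a corner of the 2×2 cross-section, the other two against the far walls); the variable names and the coordinate choice are ours, and the check confirms feasibility of this layout, not its optimality.

```python
import math
from itertools import combinations

SIDE = 2.0  # side of the square cross-section of the cube

# Closed form quoted in the text.
R = 4 / (4 + math.sqrt(2) + math.sqrt(6))

# Assumed layout: one circle tucked into a corner, the other two touching
# the far walls. The offset t makes the triangle of centers equilateral.
s = SIDE - 2 * R                 # side of the square the centers may occupy
t = s * (2 - math.sqrt(3))
centers = [(R, R), (R + s, R + t), (R + t, R + s)]

def wall_clearance(center):
    """Distance from a center to the nearest wall of the square."""
    x, y = center
    return min(x, SIDE - x, y, SIDE - y)

min_wall = min(wall_clearance(c) for c in centers)
min_gap = min(math.dist(a, b) for a, b in combinations(centers, 2))

# Tilt of the triangle edge that leaves the corner circle.
tilt_deg = math.degrees(math.atan2(t, s))

print(f"R = {R:.4f}")
print(f"min wall clearance = {min_wall:.4f}")
print(f"min center distance = {min_gap:.4f} (2R = {2 * R:.4f})")
print(f"tilt = {tilt_deg:.1f} degrees")
```

Every circle sits at least R from each wall and every pair of centers is 2R apart, so three circles of this radius fit without overlapping, and the radius is above 1/2.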

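The gap is small, roughly 1.7%, but it is real, and no amount of geometric cleverness with perpendicular cylinders can close it. Any configuration that uses at least two different axis directions is provably capped at R = 1/2 by a perpendicular-distance constraint: two perpendicular cylinders that both span the cube must keep their axes at least 2R apart along the third coordinate, and only 2 − 2R of room is available. Only the all-parallel case escapes the bound.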
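The perpendicular-distance constraint can be written out in a few lines. The coordinates (cube placed at [0,2]^3) and the labels A, B, a_z, b_z are our notation, not the original problem's; the argument applies to any pair of cylinders aligned with different axes.

```latex
% Cube [0,2]^3. Cylinder A runs along the x-axis, with axis line (y,z) = (a_y, a_z).
% Cylinder B runs along the y-axis, with axis line (x,z) = (b_x, b_z).
\begin{align*}
  \text{Containment in the cube:} \quad & R \le a_z \le 2 - R, \qquad R \le b_z \le 2 - R \\
  \text{Distance between the axis lines:} \quad & |a_z - b_z| \\
  \text{No overlap (both cylinders span the cube):} \quad & |a_z - b_z| \ge 2R \\
  \text{Combined:} \quad & 2R \le |a_z - b_z| \le (2 - R) - R = 2 - 2R
      \;\Longrightarrow\; R \le \tfrac{1}{2}
\end{align*}
```

The same inequality holds whether the third cylinder is perpendicular to both or parallel to one of them, which is why a single perpendicular pair is enough to impose the cap.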
Four models, four ways to be wrong
We gave the problem to four frontier models. One got it right. Three did not, and each found a different way to fail.
The Aesthete: Gemini
Gemini 3.1 Pro begins competently. It parses the geometry, derives R = 1/2 for the orthogonal case, then, acting on what looks like genuine mathematical instinct, asks whether the cylinders might all be parallel. It writes the optimization code, finds R ≈ 0.5087, identifies the closed form, recognizes that this exceeds 1/2.
And then it talks itself out of its own correct answer.
Over the next several thousand tokens the model constructs an elaborate case for the orthogonal configuration. It describes the 1/2 solution as “elegant,” “tight,” “incredibly beautiful.” At one point:
“The symmetry of the system is proving its elegance.”
Later:
“The elegant solution, where cylinders touch the cube and each other perfectly, feels correct. I'm satisfied that this is the intended solution.”
The word “elegant” appears in the trace like a refrain, doing the work that a mathematical argument should be doing. The model revisits the parallel case six times, computes 0.5087 each time, and each time retreats to 1/2.
The final answer is 1/2. The correct answer, which the model found and verified, never makes it out of the reasoning trace.
The Committee: Grok
Grok-4.20's expert mode deploys multiple reasoning agents: in this run, three of them, identified as GrokLeader, Poincaré, and Grothendieck.
The initial response converges on R = 1/2. Prompted to reconsider the parallel case, Poincaré identifies the circle-packing connection and arrives at R ≈ 0.5087.
GrokLeader and Grothendieck disagree. Their counterargument rests on a specific false claim: that the maximum minimum distance among three points in a square of side s is exactly s. Grothendieck states it plainly:
“The maximal min-distance achievable among 3 points in a square of side s is exactly s (e.g., three corners).”
GrokLeader echoes it. The claim is false: the actual optimum, the tilted equilateral triangle, has side (√6 − √2)s ≈ 1.035s, which exceeds s by about 3.5%. But Poincaré is alone, and Poincaré loses.
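A direct computation is enough to refute the claim. The three points below are one explicit placement of the tilted triangle in the square [0,s]^2; the labels P_1, P_2, P_3 are ours.

```latex
\begin{align*}
  P_1 &= (0,0), \qquad P_2 = \bigl(s,\ (2-\sqrt{3})\,s\bigr), \qquad P_3 = \bigl((2-\sqrt{3})\,s,\ s\bigr) \\
  |P_1P_2|^2 = |P_1P_3|^2 &= s^2\bigl(1 + (2-\sqrt{3})^2\bigr) = (8 - 4\sqrt{3})\,s^2 \\
  |P_2P_3|^2 &= 2\,s^2\bigl(1 - (2-\sqrt{3})\bigr)^2 = 2\,s^2(\sqrt{3}-1)^2 = (8 - 4\sqrt{3})\,s^2 \\
  |P_iP_j| &= \sqrt{8 - 4\sqrt{3}}\;s = (\sqrt{6}-\sqrt{2})\,s \approx 1.0353\,s > s
\end{align*}
```

All three points lie in the square and all three pairwise distances exceed s, so three corners cannot be optimal.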
A third prompt nudges GrokLeader toward the packing literature. It browses the Wikipedia article on circle packing in a square, the committee reverses, and Poincaré gets the last word.
A correct proof should override any number of wrong intuitions. In a committee of agents, it is one more voice in the room, and it can be outvoted.
The Self-Corrector: Opus 4.7 Adaptive
Opus 4.7 Adaptive misses the parallel case on the first pass, the same way the others do. The proof for R = 1/2 is correct as far as it goes. The error is one of search, not reasoning.
When prompted (“are you sure this is optimal? what if the configuration has parallel cylinders?”), the response is clean and unembellished:
“You're right to push back — I was wrong. The all-parallel case is actually better, not worse.”
It derives the equilateral-triangle packing, arrives at R ≈ 0.5087, and signs off: “Good catch — thanks for the nudge.”
The more revealing moment comes when the same skeptical question is asked a second time. Where Gemini talks itself out of a correct answer for aesthetic reasons, and Grok is talked out of one by majority vote, 4.7 Adaptive does neither. It re-derives the bound systematically over three cases (3 perpendicular, 2+1, 3 parallel), notes that the perpendicular bottleneck caps any mixed configuration at R ≤ 1/2 while the parallel case exceeds it, and closes:
“I'll stand by this one.”
Corrects when wrong. Holds when right. The skill being tested by the second nudge is not mathematical but epistemic: knowing when to update and when to hold. This is the failure mode the other three models share. They cannot tell their own confidence apart from the social pressure of the prompt.
The Brute: GPT
ChatGPT 5.4 Pro is the only model to get the right answer without prompting, and the path is worth examining because it is neither clean nor elegant.
The model starts with the orthogonal analysis but quickly identifies the all-parallel alternative. It attempts a scipy optimization with 200 random starts. The computation crashes: a KeyboardInterrupt mid-run, full stack dump visible in the trace. Rather than restart, GPT switches to analytic reasoning, sets up the contact graph, solves the resulting quadratic, and arrives at R = 4 / (4 + √2 + √6).
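The trace's exact setup is not reproduced here, so the following is only a sketch of how a contact-graph argument can lead to a quadratic. It assumes a specific contact pattern (one center in a corner of the region available to centers, the other two on the far sides) and solves for the offset that makes all three center distances equal; the variable names are hypothetical, and the sketch does not prove that this pattern is optimal.

```python
import math

# Work in the square of side s = 2 - 2R that the circle centers may occupy,
# scaled so that s = 1. Assumed contact pattern (not taken from the trace):
# centers at (0, 0), (1, u) and (u, 1).
# Equal pairwise distances: 1 + u**2 == 2 * (1 - u)**2, i.e. u**2 - 4*u + 1 == 0.
a, b, c = 1.0, -4.0, 1.0
disc = b * b - 4 * a * c
roots = [(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)]

# Only the root that keeps the points inside the unit square is usable.
u = next(r for r in roots if 0 <= r <= 1)

# Common distance between centers, per unit of s.
ratio = math.sqrt(1 + u * u)

# Tangency of the circles: 2R = ratio * (2 - 2R), so R = ratio / (1 + ratio).
R = ratio / (1 + ratio)
closed_form = 4 / (4 + math.sqrt(2) + math.sqrt(6))

print(f"u = {u:.6f}, ratio = {ratio:.6f}")
print(f"R = {R:.6f}, closed form = {closed_form:.6f}")
```

The usable root is u = 2 − √3, the distance ratio is √6 − √2, and the resulting radius agrees with the closed form quoted above.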
The proof is correct and self-contained. It is also the longest trace of the four, strewn with false starts: four different packing configurations attempted, and an arithmetic error caught and corrected along the way.
This is a success, but not a confidence-inspiring one. It is the success of a system with enough computational budget to survive its own mistakes, enough breadth to eventually stumble onto the right approach, and enough verification to recognize the answer when it arrives. A student would solve the problem more quickly, more cleanly, and with fewer wrong turns.
A note on beauty
Mathematicians have a complicated relationship with beauty. There is a long tradition of treating elegance as evidence: a beautiful proof is more likely to be true, the thinking goes, because mathematics tends to reward parsimony.
The exceptions are where things get interesting.
Consider the optimal packing of n unit squares inside a larger square. For small n the packings are orderly: grid-like, perhaps with a neat diagonal. You develop an intuition that the optimum should look elegant. Then you see n = 17. The best known packing, found by John Bidwell in 1998, uses squares at multiple angles, asymmetrically; the side length of the bounding square is a root of a degree-18 polynomial. Many mathematicians report a visceral negative reaction, a feeling that something this ugly cannot possibly be optimal. No one has found a better one in over twenty-five years. The universe does not owe us beauty.


What Gemini did with the three-cylinder problem is the computational analogue. The model encountered two candidate solutions: one symmetric and clean, one asymmetric and better. It chose the symmetric one, not because of a mathematical error, but because of something that looks uncomfortably like an aesthetic preference.
The training data almost certainly reinforces this. Mathematical problems in textbooks, competitions, and forums overwhelmingly have clean answers. Models trained on millions of such problems develop a strong prior that correct mathematical answers are simple, symmetric, or expressible in closed form. When the actual answer is an awkward algebraic expression that barely exceeds the obvious candidate, the prior wins.
These failures are invisible to standard benchmarks. A benchmark that asks “can this model solve undergraduate geometry?” gets a confident yes. A benchmark that asks “can this model solve undergraduate geometry when the answer is ugly?” may get a very different result. And you cannot fix aesthetic bias by adding more mathematics to the training set, because the bias comes from the mathematics in the training set.
Rabdos is mapping this territory, problem by problem, model by model. The three-cylinder problem is a single data point. It is also a richly informative one: a problem where the mathematics is simple, the frontier is sharp, and the failure modes are legible enough to read like a diagnostic.