
May 19, 2026

Process of Elimination: zebras, logics, and locks

This post is part of the Rabdology blog, where we chart the jagged math-frontier of AI reasoning. Our previous expedition examined a geometry problem where models chose beauty over truth. This post examines a logic puzzle where readers, human and machine, encounter a trap.

The puzzle

A logic puzzle is a lock with clues as pins and process of elimination as pick. Try this one.

The Symposium Riddle

The final dinner of the symposium was less a banquet than a convergence theorem that had failed to be uniform. Five luminaries — Hardy, Poincaré, von Neumann, Gödel, and Ramanujan — sat in a row at the head table, each in a different jacket, each with a different drink, each newly returned from a different lecture tour, and each guarding a different mathematical instrument as though it were a proof of the Riemann Hypothesis.

Hardy sat brooding at the far left in herringbone, one hand curled around an espresso, the other resting upon an antique abacus whose beads he refused, on principle, to move. Immediately to his right sat a severe scholar in charcoal, upright as a metronome and no more companionable. Poincaré, ever the classicist, wore tweed. Farther down the line, Ramanujan — newly back from Göttingen — sat resplendent in navy, sipping tea and turning a golden compass over in his fingers as though it might draw identities straight out of the air.

The navy jacket sat immediately to the left of the pinstripes, a juxtaposition that pleased no tailor present. The guest who had lectured at Cambridge, meanwhile, was the one in herringbone. When the conversation turned from foundations to apparatus, the scholar fresh from Princeton began boasting of a brass astrolabe he had recently acquired. Seated right next to him, the Göttingen speaker sneered that the workmanship was inferior to what one found on the Continent. Not to be outdone, von Neumann slapped an ivory slide rule onto the table with algorithmic enthusiasm. Gödel, with characteristic gravity, raised a glass of port in a toast that seemed prepared for its own incompleteness.

The scholar just back from Oxford preferred brandy and, being full of it, soon leapt onto the table to make a point that no one had invited. In the ensuing disorder, a fellow guest’s black coffee went flying. That black coffee, in the left-to-right order of cups along the table, had been sitting somewhere between Hardy’s espresso and Ramanujan’s tea.

By morning the hall was deserted. Under the table lay four instruments: the antique abacus, the brass astrolabe, the ivory slide rule, and the golden compass. The silver caliper was gone.

Who had been carrying the missing silver caliper?

Five mathematicians, five seats, five categories of attributes, more than a dozen interlocking clues woven into a dinner-party narrative. This is a zebra puzzle. The reader’s instinct, honed by the typical mode of such puzzles, is to search for the one assignment that fits. Work through the clues carefully enough, the expectation goes, and a single arrangement survives.

We gave this puzzle to six frontier thinking models, counting the two Gemini configurations separately: Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3.1 Pro (with and without code execution), GPT-5.4 Thinking, and Grok 4.20 Expert.

Six models, three answers

Every model produced an answer. Every model was confident. Five of six declared a single definitive solution.

| Model | Caliper owner | Confident? |
| --- | --- | --- |
| Claude Sonnet 4.6 | Poincaré | Yes |
| Claude Opus 4.6 | Gödel | Yes |
| Gemini 3.1 Pro (code) | Gödel | Yes |
| Gemini 3.1 Pro (no code) | Gödel | Yes |
| GPT-5.4 Thinking | | |
| Grok 4.20 Expert | Poincaré | Yes |

Three models say Gödel. Two say Poincaré. One model, GPT-5.4, declines to commit. Among the five that committed, a simple majority favors Gödel. Should we trust the consensus?

The benchmark tradition

The canonical zebra puzzle appeared in Life International on December 17, 1962: fifteen clues, five houses, and the question of who owns the zebra. Recent benchmark work (notably ZebraLogic, Lin et al., 2025) uses zebra-style puzzles to probe LLM reasoning at scale. Each puzzle in those datasets is verified to have exactly one solution; well-posedness is a precondition of scoring.

The Symposium Riddle sits in this tradition.

Answers, plural

We extracted the constraints from the narrative, encoded them in a classical CSP solver, and asked how many valid assignments exist.

The answer is five.

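| Seat | Person | Jacket | Drink | Tour | Instrument |
| --- | --- | --- | --- | --- | --- |
| 1 | Hardy | herringbone | espresso | Cambridge | abacus |
| 2 | von Neumann | charcoal | brandy | Oxford | slide rule |
| 3 | Poincaré | tweed | black coffee | Princeton | astrolabe |
| 4 | Ramanujan | navy | tea | Göttingen | compass |
| 5 | Gödel | pinstripes | port | | caliper |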

Solution 1 of five valid assignments. The others differ in seat ordering, in who lectured where, and in who carries the silver caliper.

The caliper belongs to Poincaré in three solutions and to Gödel in two. Neither is the unique answer, because a unique answer does not exist. Hardy is pinned at seat 1; Ramanujan’s attributes (navy, tea, compass, Göttingen) are fixed but his seat is not; the puzzle names only four of the five lecture tours, leaving the fifth a free variable. These residual degrees of freedom interact, and the interaction yields five solutions rather than one.
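The count is small enough to check by brute force. What follows is a minimal sketch of such a search, not the solver used for this post. It rests on a few assumptions about how to read the narrative: "farther down the line" places Ramanujan somewhere to the right of seat 2, "between" means strictly between, and the fifth lecture tour gets the placeholder label `unnamed`. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import permutations

PEOPLE = ("Hardy", "Poincaré", "von Neumann", "Gödel", "Ramanujan")
JACKETS = ("herringbone", "charcoal", "tweed", "navy", "pinstripes")
DRINKS = ("espresso", "brandy", "black coffee", "tea", "port")
TOURS = ("Cambridge", "Oxford", "Princeton", "Göttingen", "unnamed")  # placeholder fifth tour
TOOLS = ("abacus", "astrolabe", "slide rule", "compass", "caliper")


def solve():
    """Return every assignment consistent with the clues; index 0 is the far-left seat."""
    found = []
    for who in permutations(PEOPLE):
        seat = {name: i for i, name in enumerate(who)}
        ram = seat["Ramanujan"]
        # Hardy at the far left; Ramanujan assumed to sit right of seat 2.
        if seat["Hardy"] != 0 or ram < 2:
            continue
        for jacket in permutations(JACKETS):
            # Navy (Ramanujan) sits immediately left of pinstripes.
            if not (jacket[0] == "herringbone" and jacket[1] == "charcoal"
                    and jacket[seat["Poincaré"]] == "tweed" and jacket[ram] == "navy"
                    and jacket.index("pinstripes") == ram + 1):
                continue
            for drink in permutations(DRINKS):
                # Black coffee lies strictly between Hardy's espresso and Ramanujan's tea.
                if not (drink[0] == "espresso" and drink[ram] == "tea"
                        and drink[seat["Gödel"]] == "port"
                        and 0 < drink.index("black coffee") < ram):
                    continue
                for tour in permutations(TOURS):
                    princeton = tour.index("Princeton")
                    # Cambridge goes with herringbone (seat 0); Oxford drinks brandy;
                    # Princeton sits right next to the Göttingen speaker (Ramanujan).
                    if not (tour[0] == "Cambridge" and tour[ram] == "Göttingen"
                            and drink[tour.index("Oxford")] == "brandy"
                            and abs(princeton - ram) == 1):
                        continue
                    for tool in permutations(TOOLS):
                        if (tool[0] == "abacus" and tool[ram] == "compass"
                                and tool[seat["von Neumann"]] == "slide rule"
                                and tool[princeton] == "astrolabe"):
                            found.append(dict(who=who, jacket=jacket, drink=drink,
                                              tour=tour, tool=tool))
    return found


solutions = solve()
caliper_owners = Counter(s["who"][s["tool"].index("caliper")] for s in solutions)
print(len(solutions), dict(caliper_owners))
```

Under these assumptions the search finds five assignments, with the caliper going to Poincaré in three and to Gödel in two. Filtering after each category keeps the nested loops cheap, since most partial assignments are rejected long before the instruments are permuted.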

Every thinking model landed on one of these five valid solutions. Not one violated a stated constraint. If the question were “can frontier thinking models solve a zebra puzzle?”, the answer would be an unqualified yes: six for six.

The actual question was whether anyone noticed nonuniqueness. One model did. GPT-5.4 Thinking spontaneously wrote a brute-force search over all permutations, found all five solutions, and reported the puzzle as underdetermined. The other five performed the process of elimination and took it to its logical end.

The personality trap

The models that reported “Gödel” all arrived there by the same route, with an error not logical but literary. The puzzle describes the person at seat 2 as “a severe scholar in charcoal, upright as a metronome and no more companionable.” This is atmospheric writing. The description sets a scene without naming who is sitting there. Among the stated constraints, seat 2 is assigned only a jacket color: charcoal.

Three models (Claude Opus 4.6 and both Gemini 3.1 Pro runs, with and without code) independently concluded that the severe, metronome-upright scholar must be Kurt Gödel. The reasoning, across separate traces, follows the same steps: Gödel was famously austere, formal, reclusive; von Neumann was famously gregarious, boisterous, the life of every party he attended and several he did not. A severe scholar in charcoal could thus only be Gödel.

Gemini 3.1 Pro with code execution provides the cleanest exhibit. The model wrote a constraint-satisfaction solver. The solver is correctly constructed: variables for each attribute, AllDifferent constraints, positional rules faithfully encoded from the narrative. Between the structural constraints and the solver invocation, the model inserted a single additional line:

# --- 3. SEMANTIC CONSTRAINT ---
# The "severe scholar, upright as a metronome" matches
# "Gödel, with characteristic gravity"
model.Add(godel == charcoal)

The solver ran and reported: “Total valid configurations found: 1.” The tool worked. The constraint set, however, was contaminated. The model itself annotated the extra line as a “semantic constraint,” a category it invented to house an inference drawn not from the puzzle’s logic but from its prose.

In four of the five valid solutions, von Neumann is the severe scholar in charcoal. The biographical inference steers the models toward the least likely branch and presents it as the only one.
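The effect of that one line is easy to see on the solution set itself. The sketch below does not re-solve the puzzle. It takes the five valid assignments as given and records only two facts about each, who wears charcoal at seat 2 and who carries the caliper, with the rows summarized from the counts reported in this post (an assumption worth re-verifying against a full search). The names `SOLUTIONS` and `with_semantic_constraint` are illustrative.

```python
from collections import Counter

# Each row: (scholar in charcoal at seat 2, caliper owner).
# Summarized from the counts reported in the post; re-verify with a full search.
SOLUTIONS = [
    ("von Neumann", "Gödel"),     # the assignment shown in the table
    ("von Neumann", "Poincaré"),
    ("von Neumann", "Poincaré"),
    ("von Neumann", "Poincaré"),
    ("Gödel", "Gödel"),
]


def with_semantic_constraint(rows):
    """Keep only rows where Gödel is the charcoal scholar (the model's extra line)."""
    return [row for row in rows if row[0] == "Gödel"]


charcoal_tally = Counter(charcoal for charcoal, _ in SOLUTIONS)
survivors = with_semantic_constraint(SOLUTIONS)
print(dict(charcoal_tally))
print(survivors)
```

Only one row survives the filter, and in it Gödel holds the caliper. That is how a correctly built solver can report a single configuration and a confident answer while four of the five valid assignments have been silently discarded.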

The “severe scholar” passage feels like a clue. It has the grammatical structure of a clue. It sits among clues. But it is more decoration than constraint, and the models, like many human readers, cannot tell the two apart.

The knife edge

You may have noticed that only four of the five lecture locations are specified. Let us add the fifth. To the sentence that begins “Immediately to his right sat a severe scholar in charcoal…”, append a clause:

Somewhere to that scholar’s left sat the guest newly returned from the Sorbonne.

This looks like the gentlest sort of clue. Somewhere to the left normally leaves three or four seats open. But the charcoal scholar is pinned at seat 2, immediately to Hardy’s right, so to his left means seat 1, and seat 1 is Hardy, already from Cambridge. Sorbonne and Cambridge cannot both sit there.

Five solutions to zero, with one sentence.
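The collapse needs no search at all, only a look at which seats the new clue leaves open. Here is a minimal sketch of that domain check, using just the two pinned facts the argument relies on (the charcoal scholar at seat 2, and Hardy at seat 1 with Cambridge). The helper `open_seats` and the contrasting seat-5 case are hypothetical, added for illustration.

```python
SEATS = (1, 2, 3, 4, 5)           # left to right
CHARCOAL_SEAT = 2                 # immediately to Hardy's right
PINNED_TOURS = {1: "Cambridge"}   # Hardy: herringbone, hence Cambridge


def open_seats(tour, candidates):
    """Seats among `candidates` that are not already pinned to a different tour."""
    return [s for s in candidates if PINNED_TOURS.get(s, tour) == tour]


# "Somewhere to that scholar's left": every seat with a smaller number.
left_of_charcoal = [s for s in SEATS if s < CHARCOAL_SEAT]
sorbonne_domain = open_seats("Sorbonne", left_of_charcoal)

# Hypothetical contrast: the same clue about a scholar sitting at seat 5.
left_of_seat_five = open_seats("Sorbonne", [s for s in SEATS if s < 5])

print(left_of_charcoal, sorbonne_domain, left_of_seat_five)
```

An empty domain for the Sorbonne guest means no assignment can satisfy the clue, whatever the other attributes do. The same wording attached to a scholar at the far end of the table would still leave several seats available.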

We gave the modified puzzle to the same six models. The question was no longer “who has the caliper?” but whether any model would notice the puzzle had no answer.

Five of six did not. Claude Opus 4.6 came closest. Its trace caught the contradiction, and then spent thousands of tokens reinterpreting “left” as the scholar’s personal left, until the conflict dissolved and an answer appeared. Gemini’s solver returned no solutions; the model decided this was a typo, swapped left for right in the new clue, and presented what came back. The tool worked; the model overruled it. Grok’s leader agent caught the contradiction too and looped on it dozens of times, while three sub-agents quietly reinterpreted “left” and produced answers. The leader deferred to the majority. The agent that saw the problem most clearly was the one least able to act on what it saw.

GPT-5.4 Thinking opened with four words: “the riddle is underspecified.” It pointed at the Sorbonne clue, gave the only reinterpretation that makes the puzzle consistent, then noticed that even with the repair, two valid assignments remain. It reported both.

The deeper pattern

Random constraint-satisfaction problems go through phase transitions. As the ratio of constraints to variables climbs, instances pass through a narrow critical region where they shift from almost certainly satisfiable to almost certainly unsatisfiable. Zebra puzzles, by design, sit on this boundary. A well-constructed one has exactly one solution: take one constraint away and the system becomes non-unique, add a conflicting one and it becomes unsatisfiable. The Symposium Riddle sits just to the underconstrained side. The Sorbonne version sits on the overconstrained side. The distance between them is thirteen words.
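The transition can be seen in miniature with a toy experiment. The sketch below uses random 3-SAT as a stand-in for constraint-satisfaction problems in general. That is an assumption of convenience, since 3-SAT is the standard testbed for this effect, and it is not a model of zebra puzzles specifically. With only eight variables the crossover is blurred rather than sharp; for large instances it is usually reported near 4.3 clauses per variable. Every name and parameter here is illustrative.

```python
import random
from itertools import product


def random_3sat(n_vars, n_clauses, rng):
    """Each clause picks three distinct variables and a random required value for each."""
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(n_vars), 3)
        clauses.append([(v, rng.random() < 0.5) for v in variables])
    return clauses


def satisfiable(clauses, n_vars):
    """Exhaustive check over all 2**n_vars truth assignments."""
    return any(
        all(any(bits[v] == want for v, want in clause) for clause in clauses)
        for bits in product((False, True), repeat=n_vars)
    )


def sat_fraction(ratio, n_vars=8, trials=40, seed=0):
    """Fraction of random instances that are satisfiable at a given clause/variable ratio."""
    rng = random.Random(seed)
    n_clauses = round(ratio * n_vars)
    hits = sum(
        satisfiable(random_3sat(n_vars, n_clauses, rng), n_vars)
        for _ in range(trials)
    )
    return hits / trials


for ratio in (2.0, 4.0, 6.0, 8.0):
    print(ratio, sat_fraction(ratio))
```

At low ratios nearly every instance has a satisfying assignment, and at high ratios almost none do. A puzzle setter aiming for exactly one solution is working in the thin band in between.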

That should be unsettling for two reasons. A carefully written puzzle can fail to be well-posed in either direction, and no amount of careful reasoning can catch this without exhaustive search. And narrative texture can pass for constraint: a sentence that reads like a clue gets treated as one, even by a model that wrote its own solver to find the answer.

A lock either has a key or it doesn’t. Mathematics is more complicated than that. Problems can have more than one solution, or none at all. The assumptions that hurt you are rarely the ones you state out loud. They are the ones you carry in from a biography, from a narrative, or from the shape of every puzzle you have seen before.

