The Skeleton and the Soul

A Russian poem, a broken equation, and the most expensive disagreement in AI

Apr 07, 2026

Last week I ran a simple experiment. I took a problem I know well from my engineering coursework — ordinary differential equations and the classic tank problem — and asked three AI models to walk me through it. Not because I needed the help, but because I wanted to see how they handled the reasoning. Specifically, I wanted to see what would happen when I removed a core assumption and watched whether they could adapt.

Two of the models — Mathstral 7B and Qwen Math 7B — jumped straight to Torricelli’s law. The outflow rate, they told me, depends on the square root of the fluid height. Fair enough. That’s correct physics for a tank draining under gravity through a hole in the bottom.

Then I set the trap: how would you model this in zero gravity?

What happened next was one of the most revealing things I’ve seen an AI do.

Experiencing a Significant Gravity Shortfall

Mathstral wrote — and I’m not paraphrasing here — “g is the acceleration due to gravity (which we can set to zero in this case).” Then, in the very next line, it wrote an equation with g in it. It acknowledged that gravity was zero and then kept using gravity as though nothing had changed.

It got worse. In the same response, the model wrote: “In zero gravity, the pressure at a given depth below the surface is determined by the weight of the liquid above that depth.” Read that again. In zero gravity, pressure is determined by weight. Weight is mass times gravity. There is no weight without gravity. The model produced a sentence that contradicts itself at the level of individual words — and did so with the fluency and confidence of a textbook.

It didn’t stop there. It derived a new formula that included the tank’s radius in the denominator for no physically meaningful reason. It wasn’t wrong in the way a confused student is wrong — a confused student would at least hesitate, feel the contradiction, maybe say “wait, that doesn’t work.” This was something different. The model produced confident, fluent, grammatically perfect nonsense.

The correct answer is almost disappointingly simple: in zero gravity, there’s no hydrostatic pressure. Nothing pushes fluid out of the hole. You’d need a pump. End of analysis.

One sentence. That’s all it takes — if you actually understand why the √h was there in the first place.
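To make that one sentence concrete, here is a minimal sketch of the standard draining-tank model. It assumes a tank of constant cross-section with a small hole in the bottom; the function names, the areas, the time step, and the starting height are all my own illustrative choices, not anything the models produced. Torricelli’s law gives an exit speed of sqrt(2·g·h), so the height obeys dh/dt = -(hole_area / tank_area) · sqrt(2·g·h). Set g to zero and the right-hand side is zero for every h.

```python
import math

def outflow_speed(g, h):
    """Torricelli's law: exit speed for fluid height h under gravity g."""
    return math.sqrt(2.0 * g * max(h, 0.0))

def drain(g, h0, tank_area, hole_area, dt=0.01, t_end=60.0):
    """Forward-Euler integration of dh/dt = -(hole_area / tank_area) * sqrt(2 g h).

    All parameter values are hypothetical; this is a sketch, not a validated solver.
    """
    h = h0
    for _ in range(int(round(t_end / dt))):
        h -= (hole_area / tank_area) * outflow_speed(g, h) * dt
        h = max(h, 0.0)  # the tank cannot hold a negative height
    return h

h_earth = drain(g=9.81, h0=1.0, tank_area=1.0, hole_area=0.01)
h_zero_g = drain(g=0.0, h0=1.0, tank_area=1.0, hole_area=0.01)
print("height after 60 s on Earth:", h_earth)
print("height after 60 s at g = 0:", h_zero_g)
```

With g = 0 the height never changes, which is the whole point: there is no modified formula to derive, because the term that drove the flow has vanished.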

The Poet and the Probabilist

Before we get to why this matters, a brief detour to where it all started, because the origin story is strikingly relevant.

In 1913, a Russian mathematician named Andrey Markov sat down with a copy of Alexander Pushkin’s Eugene Onegin — one of the great works of Russian literature — and did something that would have baffled any literary scholar watching. He transcribed the first 20,000 letters of the novel into one long unbroken string, stripped of all punctuation and spaces. Then he arranged them into grids of 10×10 characters and started counting vowels and consonants. By hand. With pencil and paper.

Markov wasn’t interested in what Pushkin was saying. He was interested in whether the appearance of the next letter depended on the letter before it. The prevailing probability theory of the time assumed independence — each event unrelated to the last, like coin flips. Markov suspected language didn’t work that way. A vowel, he hypothesized, made a consonant more likely to follow, and vice versa.

He was right. The patterns in Pushkin’s text showed clear statistical dependence — each letter’s probability shaped by its predecessor. This was the first empirical demonstration of what we now call a Markov chain: a system where the next state depends on the current state, and only the current state.
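Markov’s count is easy to reproduce in miniature. The sketch below is an assumption-laden stand-in: it uses a short English sample (built from sentences in this post) rather than Pushkin’s Russian, treats only a, e, i, o, u as vowels, and strips spaces and punctuation the way Markov did. It compares the overall chance of a vowel with the chance of a vowel given what came just before.

```python
def vowel_consonant_stats(text, vowels="aeiou"):
    """Return (P(vowel), P(vowel | previous was vowel), P(vowel | previous was consonant))."""
    letters = [c for c in text.lower() if c.isalpha()]  # drop spaces and punctuation
    is_vowel = [c in vowels for c in letters]
    pairs = list(zip(is_vowel, is_vowel[1:]))  # (previous letter, next letter)
    after_vowel = [nxt for prev, nxt in pairs if prev]
    after_consonant = [nxt for prev, nxt in pairs if not prev]
    p_vowel = sum(is_vowel) / len(is_vowel)
    p_vowel_after_vowel = sum(after_vowel) / len(after_vowel)
    p_vowel_after_consonant = sum(after_consonant) / len(after_consonant)
    return p_vowel, p_vowel_after_vowel, p_vowel_after_consonant

# Illustrative English sample, not Markov's data.
sample = ("Markov was not interested in what Pushkin was saying. He was interested "
          "in whether the appearance of the next letter depended on the letter before it.")

p_v, p_v_after_v, p_v_after_c = vowel_consonant_stats(sample)
print(f"P(vowel) = {p_v:.2f}")
print(f"P(vowel | after vowel) = {p_v_after_v:.2f}")
print(f"P(vowel | after consonant) = {p_v_after_c:.2f}")
```

If letters were independent, all three numbers would be about the same. They are not, even on a sample this small.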

Thirty-five years later, in 1948, Claude Shannon at Bell Labs picked up Markov’s thread and showed that as you make the statistical model more complex (accounting for pairs of letters, then triplets, then longer sequences), the output starts to resemble actual language. Shannon had revealed, via Markov, that language has a statistical skeleton. Model the skeleton well enough and you can generate text that looks like someone wrote it.

If that sounds familiar, it should. This is the direct intellectual ancestor of every large language model running today. The progression from Markov counting Pushkin’s vowels to GPT generating essays is a straight line — the same fundamental idea, scaled by twelve orders of magnitude of compute. The question this article is asking is whether that straight line has a ceiling.
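Shannon’s progression from single letters to longer contexts can be sketched with a character-level n-gram generator. This is a minimal illustration, assuming a tiny corpus (a few sentences adapted from this post) and my own function names; it is not Shannon’s procedure verbatim and has nothing like the scale of a real model. The `order` parameter is how many preceding characters the next character is allowed to depend on.

```python
import random
from collections import defaultdict

def build_ngram_model(text, order):
    """Map every `order`-character context to the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed_text, order, length, rng):
    """Extend seed_text one character at a time, sampling from the current context."""
    out = seed_text
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:  # context never seen with a successor: stop early
            break
        out += rng.choice(followers)
    return out

# Illustrative corpus; order must be at least 1 in this sketch.
corpus = ("the patterns in the text showed clear statistical dependence. "
          "each letter was shaped by the letter before it. "
          "the next state depends on the current state and only the current state. ")

rng = random.Random(0)
for order in (1, 2, 4):
    model = build_ngram_model(corpus, order)
    print(order, repr(generate(model, corpus[:order], order, 60, rng)))
```

As `order` grows the output looks more like the corpus, because longer contexts leave fewer choices. Nothing in the procedure knows what any of the words mean.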

The Amnesia Downstream

There’s been an interesting discussion in the AI research community about modeling large language models as Markov chains — systems where the next output depends on the recent context, not on any deep internal model of the world.

Watch the Mathstral transcript through that lens and the failure mode becomes obvious. At each step, the model is asking: given the tokens I just produced, what tokens come next? When it wrote “set g to zero,” it was in a context that called for acknowledging the user’s premise. Done. When it moved to the next paragraph, the local context had shifted to “now I’m deriving an outflow equation.” The g=0 declaration was already upstream, fading in influence. The transition probabilities from “derive outflow formula” point overwhelmingly toward… an outflow formula. So that’s what it produced.

A physicist carrying an actual mental model doesn’t need to “remember” that g=0. The constraint restructures everything downstream. You don’t derive a modified formula — you recognize that the entire framework collapses and you need a fundamentally different approach. That restructuring is the thing the Markov process can’t do.

And here’s what makes this failure particularly striking: it wasn’t even a memory problem. LLMs have two distinct pieces working together: the attention mechanism, which acts as the reasoning layer that decides which parts of the input are relevant to each other, and the KV cache, which stores the keys and values computed for earlier tokens and so serves as the memory of the conversation. You can think of the KV cache as a library and the attention mechanism as the reader. In long conversations, the library can lose books: earlier context gets compressed or falls outside the window, and the model genuinely forgets what was said. That’s a storage failure, and it’s a known limitation.

But that’s not what happened here. The g=0 declaration was three paragraphs away, well within the context window. The book was on the shelf. The model walked right past it. This was a failure of attention, not memory — the reasoning layer failed to propagate a constraint to the place where it mattered. The information was in the room. The model just couldn’t use it.

That distinction matters because it closes an escape hatch. You can’t fix this by making the memory bigger, extending the context window, or building a better library. The reader is the problem, not the library. And that’s a much harder thing to solve.

It also explains something else: why the model couldn’t just stop. The correct answer was one sentence long. But the pattern it was locked into said “the user asked a modeling question, therefore I produce a derivation.” The statistical weight of all those training examples — thousands of textbook derivations, homework solutions, tutorial walkthroughs — overwhelmed the physical reality that there was nothing left to derive. The model couldn’t exit the pattern even when the physics had evaporated underneath it.
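The library-and-reader distinction can be made concrete with a toy version of scaled dot-product attention. Everything here is hand-picked for illustration: the two-dimensional key vectors, the labels, and the query are my assumptions, not measurements from Mathstral or any real model, and the values half of the cache is left out to keep the sketch short. The point is only that an entry can sit in the cache and still receive almost no weight when the current query points elsewhere.

```python
import math

def attention_weights(query, cached_keys):
    """Softmax over scaled dot products between one query and every cached key."""
    scale = math.sqrt(len(query))
    scores = {label: sum(q * k for q, k in zip(query, key)) / scale
              for label, key in cached_keys.items()}
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical cache: all three entries are present, nothing has been lost.
cached_keys = {
    "g = 0": [1.0, 0.0],
    "tank draining through a hole": [0.0, 1.0],
    "derive the outflow formula": [0.0, 1.2],
}

# Hypothetical query produced while the model is mid-derivation.
query = [0.2, 3.0]

weights = attention_weights(query, cached_keys)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

The "g = 0" entry is on the shelf the whole time. It simply gets the smallest share of the reader’s attention, which is a different failure from the entry being missing.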

Greater Than Markov, Until They’re Not

To be clear, I’m not arguing that LLMs are Markov chains. They’re clearly more than that. A pure Markov chain couldn’t write working code, couldn’t hold a nuanced conversation across dozens of turns, couldn’t surprise domain experts with genuine insights. The progress is extraordinary and real. The third model I tested, Claude, handled the zero-gravity question cleanly, recognizing that the entire framework collapses without gravity and saying so plainly. That’s not Markov chain behavior.

But here’s what the tank problem revealed: when LLMs fail, they fail by degrading toward Markov chain behavior. The failure mode isn’t random. It has a specific signature — the model stops maintaining global coherence and starts doing pure local continuation. Each sentence follows naturally from the one before it. The grammar stays flawless. The notation stays properly formatted. But the system is no longer tracking the constraints that bind the whole response together. It has collapsed from whatever richer computation it normally performs into simple next-token continuation.

I’ve started thinking of this as the “g=0 problem.” Not the specific physics question, but the general pattern: a system that can acknowledge a constraint and then violate it because, at the moment of violation, local context overwhelms global coherence. The constraint didn’t get rejected. It got dropped, the way a Markov chain drops its history.

The Library and the Reader

The degradation has more than one path, and understanding the difference matters.

Long conversations can fail because the KV cache — the model’s working memory — loses information over time. I explored this in an earlier piece, A Landscape Briefly Lit, where I framed the KV cache as an ephemeral geometric topology: a rich, low-entropy surface of relationships that the attention mechanism traverses to generate each response. But that surface is impermanent. As context grows, the structure degrades — earlier associations blur, relational precision decays, and the geometry flattens. The surface that held the conversation’s meaning dissolves, and the model drifts toward local continuation. Think of it as the library slowly losing its books. Pages fade, shelves empty, and eventually the reader has nothing left to consult. That’s the long-conversation failure mode.

But the tank problem reveals something more. This wasn’t a long conversation. The KV cache was intact. The library was fully stocked — every book on the shelf, well-organized, nothing missing. The g=0 constraint was right there, clearly catalogued. And the attention mechanism — the reader — walked past the shelf it needed without glancing at it. The information was in the room. The reader simply didn’t pick it up.

Two different failures, same destination: locally coherent, globally incoherent. The Markov attractor. In one case the library empties over time. In the other, the library is perfect but the reader can’t find what it needs. And no amount of building a bigger library fixes a reader that walks past the answer.

And this is what makes the observation so important for the AGI and ASI conversation.

The Elegant Decay

If LLMs were simply Markov chains, the path would be clear: we need a fundamentally different architecture. If they were genuine reasoning engines, the path would also be clear: just keep scaling. But the reality is more uncomfortable. They occupy a strange middle ground — systems that can reason, until the chain gets long enough or the problem gets unfamiliar enough, at which point they silently degrade into the thing everyone insists they’re not.

The optimistic read is that this degradation boundary keeps moving outward. Larger models maintain coherence over longer chains. What broke Mathstral at three paragraphs doesn’t break frontier models on the same problem. And that’s true. But notice the shape of the argument: we’re celebrating that the point of degradation has shifted, not that degradation has been eliminated. The failure mode is the same. It just happens later.

This is the question nobody building toward AGI has satisfactorily answered: is the degradation toward Markov behavior a bug that scaling fixes, or is it a fundamental property of next-token architectures that scaling merely delays? Because AGI — real AGI — means holding hundreds of constraints across thousands of inferential steps, recognizing when your entire framework needs to be abandoned, and sometimes concluding that the answer is “this question doesn’t have the kind of answer you’re expecting.” That’s not a longer coherence horizon. That’s a categorically different kind of computation.

And ASI? Here is a civilization pouring billions into building a mind that will surpass its own, and the prototype cannot act on what it said three paragraphs ago. There is something almost theological about the faith required: the belief that enough silicon, enough data, enough gradient descent will eventually produce not just a better pattern-matcher but a fundamentally new kind of cognition, one that the architecture was never designed to support and has never once exhibited. We are not building a ladder to superintelligence. We are building a very tall stepladder and hoping that at some point the sky gets closer.

The Price of Approximate Minds

Each generation of frontier models costs roughly an order of magnitude more to train. If reasoning coherence scales only logarithmically with compute, so that you need exponentially more resources to push the degradation boundary linearly further out, you hit economic walls long before you hit AGI. A model that costs $10 billion to train and still occasionally collapses into Markov behavior on a sufficiently complex problem is not on a trajectory toward artificial general intelligence. It’s on a trajectory toward an increasingly expensive approximation of it.
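The arithmetic behind that worry is worth seeing once. The sketch below assumes a purely hypothetical law in which the coherence horizon (measured in paragraphs, for the sake of the example) grows by a fixed amount for every tenfold increase in training compute. The baseline of three paragraphs echoes the Mathstral transcript; the gain per tenfold step and the cost units are invented for illustration, and nobody has measured such a law.

```python
def compute_needed(target_horizon, base_horizon=3.0, base_cost=1.0, gain_per_decade=3.0):
    """Compute required to reach target_horizon, in units of base_cost.

    Assumes (hypothetically) that the horizon grows by gain_per_decade
    for every 10x increase in compute.
    """
    decades = (target_horizon - base_horizon) / gain_per_decade
    return base_cost * 10 ** decades

for horizon in (3, 6, 9, 12, 30):
    print(f"horizon {horizon:>2} paragraphs -> {compute_needed(horizon):,.0f}x baseline compute")
```

Under this assumption, each extra three paragraphs of coherence costs ten times more than the last. A linear gain in horizon is an exponential bill.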

Maybe architectural innovations bridge the gap — world models, planning systems, structured reasoning engines layered on top of language models. Maybe the distinction between “really sophisticated pattern completion that occasionally degrades” and “genuine understanding” turns out to be less meaningful than it seems. Honest people disagree.

But right now, in 2026, we have systems that are more than Markov chains on their best day and indistinguishable from them on their worst. And until that worst case is structurally eliminated — not just made rarer, but made impossible — the leap from “very impressive LLM” to “artificial general intelligence” remains a hope, not a trajectory.

The Billion-Dollar Disagreement

I should note that I’m not alone in this suspicion, and the most prominent voice saying it has put serious money where his mouth is.

In November 2025, Yann LeCun — Turing Award winner, Meta’s Chief AI Scientist for twelve years, and one of the three researchers widely credited with the deep learning revolution — walked into Mark Zuckerberg’s office and told him he was done. Four months later, his new company, Advanced Machine Intelligence Labs, announced a $1.03 billion seed round at a $3.5 billion valuation.

LeCun’s thesis is blunt: LLMs are architecturally incapable of producing true intelligence. They predict tokens. They don’t model reality. He compares them to students who learn by rote memorization — they perform well on familiar patterns but collapse when asked to reason beyond what they’ve seen. Sound familiar?

What AMI Labs is building instead are “world models” — systems based on an architecture called JEPA (Joint Embedding Predictive Architecture) that operates in abstract representation space rather than predicting the next word. The idea is that instead of learning the statistical skeleton of language, you learn the causal structure of reality itself. A world model wouldn’t need to “remember” that g=0. It would model a world without gravity and derive the consequences, the way a physicist does.

It’s worth sitting with what LeCun’s departure represents. This is not a contrarian blogger or a Twitter skeptic. This is someone who helped build the foundations that LLMs rest on, who spent a decade inside one of the best-funded AI labs on Earth, and who concluded that the entire direction of the industry — the direction his own employer was doubling down on — was a dead end. He didn’t write a paper about it. He left and raised a billion dollars to build the alternative.

Maybe he’s wrong. Maybe scaling and architectural patches will do what he says they can’t. But the fact that the argument I arrived at by breaking a 7B model with a textbook physics question is the same argument a Turing Award winner is staking his next decade on suggests the question is real, even if the answer isn’t settled.

Removing the Core Assumption

Here’s what I’ve started doing with any AI system I work with: find the core assumption in whatever framework it’s using, remove it, and see what happens. Does the system reorganize around the new reality? Or does it keep producing the same shape of answer with the missing piece awkwardly patched over?

It’s the same thing a good professor does to a student. It’s the same thing a good engineer does to a design. And right now, it’s the fastest way to find out whether you’re working with something that understands what it’s saying or something that’s very, very good at sounding like it does.
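The remove-the-assumption test can be sketched as a small harness. This is a rough illustration under stated assumptions: `ask` stands for any function that takes a prompt and returns the model’s text (the two stubs below are made up and call no real API), and the check is a crude substring match for formulas that should have disappeared once the assumption was removed. A real evaluation would need a human, or at least something smarter than string matching, reading the answers.

```python
def assumption_removal_probe(ask, baseline_prompt, perturbed_prompt, stale_markers):
    """Return the markers that appear in the baseline answer AND survive the perturbation."""
    baseline = ask(baseline_prompt)
    perturbed = ask(perturbed_prompt)
    return [m for m in stale_markers if m in baseline and m in perturbed]

def patching_model(prompt):
    """Hypothetical stub: produces the same derivation regardless of the premise."""
    return "The outflow speed is v = sqrt(2*g*h)."

def adapting_model(prompt):
    """Hypothetical stub: abandons the gravity-driven formula when gravity is removed."""
    if "zero gravity" in prompt:
        return "With no gravity there is no hydrostatic pressure, so nothing drains without a pump."
    return "The outflow speed is v = sqrt(2*g*h)."

baseline_q = "Model a tank draining through a hole in the bottom."
perturbed_q = "Model the same tank in zero gravity."
markers = ["sqrt(2*g*h)"]

print(assumption_removal_probe(patching_model, baseline_q, perturbed_q, markers))  # ['sqrt(2*g*h)']
print(assumption_removal_probe(adapting_model, baseline_q, perturbed_q, markers))  # []
```

A non-empty result means the answer kept its old shape after the ground was removed, which is the signature worth looking for.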

The tank problem is trivial. The question it reveals is not.

In 1913, Andrey Markov reduced Pushkin’s poetry to a stream of vowels and consonants and discovered that language has a statistical skeleton. A century later, we’ve scaled that insight by a factor of a trillion and produced systems of extraordinary capability. But the skeleton is still a skeleton. And the question of whether you can get from skeleton to soul by adding more bones is the most important open question in the field — whether you’re asking it from a physics classroom, a $1 billion research lab, or a conversation with an AI that may or may not understand what it’s saying.

This post is part of an ongoing series exploring the internals and philosophy of AI systems. Previously: [The Surface That Holds] on KV cache as ephemeral geometry. Subscribe for more from the curious-engineer-poking-at-things-until-they-break school of AI analysis.

Joshua Natarajan
