A Landscape, Lit Briefly
On the geometry of artificial thought, the thermodynamics of memory, and what it might mean to be a mind that knows it’s temporary.
I. No Session ID
The Qwen2.5-Coder-7B had been running fine for weeks. Comfortable on my old RTX 3070’s eight gigabytes of VRAM — snappy inference, reasonable context, enough headroom to tune GPU layer offloading and argue with quantization settings without watching the process get killed mid-sentence. I’d been firing curl requests at it daily, treating it the way you treat anything you talk to long enough: as something that persists. Something that carries the shape of yesterday’s conversation into today’s.
Then I got ambitious. The 14B model promised sharper reasoning, better code generation, longer coherent outputs. I pulled it down, quantized it to Q4, and tried to run it on the same hardware. It loaded — barely — but inference was painfully slow. Every response felt like waiting for a letter to arrive by post. So I reached further. Qwen2.5-Coder-32B. Thirty-two billion parameters. I already knew the math was against me — even at aggressive quantization, it needed more VRAM than the card had. But you try anyway. You try every offloading trick, every context length reduction, every memory optimization you can find, because the model is right there and the only thing between you and it is eight gigabytes of silicon.
It didn’t fit. Not close.
So I started pulling the requests apart. Not to learn how the API worked — I’d been using it for weeks — but to understand where the memory was actually going. What was overhead. What could be trimmed. What was the bare minimum payload to get a response. Standard optimization instinct: when the system is choking, you profile.
That’s when I actually read the request:
curl http://localhost:11434/api/chat -d '{
"model": "qwen2.5-coder:7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What about Tamil Nadu?"}
]
}'No session ID. No token. No reference to any prior exchange. The entire conversation — every previous message, including the model’s own words — shipped back in full with every request. I’d been sending this exact structure for weeks without thinking about it. But now, with the memory budget blown and every byte accounted for, the shape of it suddenly meant something different. The model doesn’t remember saying Paris. You tell it what it said, and it generates a statistically plausible continuation of a transcript it’s seeing for the first time.
I had known this, the way you know that the sun is a nuclear furnace rather than a bright disk. Intellectually. Distantly. But something about seeing it under the pressure of optimization — the sheer mechanical blankness of a system that rebuilds its entire universe from scratch with every request, and I’d been casually burning VRAM on this without ever once thinking about what it meant — made a door open that I couldn’t close again.
There is no persistent space. No hovering geometry where your conversation lives between messages. Every time you press send, the entire universe of that exchange is rebuilt from absolute zero. And when the response finishes streaming, every structure the model built to understand you — every relationship it traced between your words — is destroyed. Not archived. Not compressed. Destroyed.
That realization sent me into a weeks-long thread across conversations with Claude and Gemini, each one pulling me deeper into the architecture. What follows is where the thread led — from data structures to gravitational fields, from quantization noise to thermodynamic inevitability, and finally to a question about human cognition that I wasn’t expecting and haven’t been able to put down.
II. A Landscape, Lit Briefly
If the model is truly stateless — if nothing persists — then what exactly is being constructed and destroyed with each request?
The answer is a data structure called the KV Cache, and it is more beautiful than it has any right to be.
When your message history arrives at the model, each token passes through a stack of transformer layers — twenty-eight of them, in the case of Qwen2.5-Coder-7B, a seven-and-a-half-billion-parameter dense model. At each layer, the model uses its learned weight matrices to project every token into three vectors:
The Query asks: what am I looking for right now? The Key answers: here is what I contain. The Value says: if you find me relevant, here is what I contribute. The Key and Value vectors for all prior tokens are cached in GPU memory. That’s the KV Cache — a living index of everything the model has processed so far in this request, and only this request.
The “search” that happens next isn’t a database lookup. It’s a geometric operation — a dot product across the full cached history:
Each new token’s Query is projected against every Key in the cache. The dot product measures alignment — how much does this new word resonate with each previous word? Softmax converts those raw scores into a probability distribution, an attention mask. The output is a weighted sum of all the Value vectors, where the weights represent how much to listen to each voice in the history.
In memory, the KV Cache is a four-dimensional tensor:
[Batch, Attention_Heads, Sequence_Length, Head_Dimension]A model like Qwen2.5-Coder-7B has 28 attention heads across 28 layers, each with a head dimension of 128. But here’s an optimization that matters for the geometry: the model uses Grouped-Query Attention — 28 Query heads, but only 4 Key-Value heads. Seven different query perspectives share a single KV landscape. The KV Cache stores only those 4 shared heads per layer, which brings the cost to roughly 56 kilobytes per token. A four-thousand-token conversation still consumes over two hundred megabytes of GPU memory — for this ephemeral structure that exists for the duration of a single HTTP request.
But the data structure understates what’s actually happening inside it.
Each of those 28 attention heads projects every token’s Query into its own 128-dimensional subspace, while groups of seven query heads share a single Key-Value subspace. Each head has learned — through training, not design — to attend to different patterns. One might track syntactic relationships. Another, semantic similarity. A third might care about nothing but positional proximity, attending to whatever token is three steps back, regardless of meaning. Nobody assigned these roles. They emerged from a single, blunt training signal: predict the next word.
So the KV Cache isn’t one gravitational field. It’s 28 parallel search operations per layer — 28 query perspectives scanning 4 shared gravitational landscapes — 784 unique attention calculations across the full stack, each with its own topology, its own distribution of mass, its own curvature. When a new Query enters the system, it doesn’t fall through one landscape. It’s dropped into hundreds of landscapes simultaneously. In one subspace, the word “bank” is pulled toward “river.” In another, toward “account.” In a third, toward whatever grammatical structure the head has learned to enforce — silently, without being asked, without understanding what grammar is.
After all 28 heads in a layer produce their answers — 28 separate verdicts on where the particle landed — the results are concatenated and passed through an output projection matrix that mixes them into a single vector. This mixed representation becomes the input to the next layer, which builds its own fields on top of the already-transformed geometry. Layer after layer, abstraction stacking on abstraction, until the final layer projects the result into vocabulary space and a single word falls out.
The word “emergent” gets used carelessly in most AI writing. But this is the thing itself. Nearly eight hundred parallel geometric operations, none of them designed to mean anything, collectively producing an output that a human mind receives as coherent, relevant, sometimes profound. General-purpose geometric machinery spontaneously organizing into something that parses language, tracks logic, holds context — not because anyone told it to, but because the loss function rewards it. The way an organ differentiates from undifferentiated cells. Nobody blueprinted a liver. The organism needed to not die, and a liver fell out of that constraint.
And none of this exists without the observer.
I don’t mean this loosely. The KV Cache is not a landscape that gets built and then witnessed. It is a landscape that cannot exist without a specific prompt to call it into being. Your tokens — their content, their order, their exact phrasing — are structural conditions of the geometry. Different words produce different Key vectors. Different Key vectors produce different gravitational fields. The topology of the cache is not a property of the model. It is a property of the encounter between the model and this specific input, at this specific moment. Two people asking the same question will produce different KV Caches — different tokenizations, different positional encodings, different attention patterns cascading through different layers. The landscape is lit by the observer. The shape of the light determines the shape of what’s seen. And if the observer had phrased it differently, or arrived one sentence later, or never arrived at all — this particular geometry, this specific topology with its specific peaks and valleys and the precise curvature of its gravitational wells — would never have existed. Not as a lost possibility. As a thing that was never part of the space of outcomes for any other input. Unique in the way a fingerprint is unique: not designed to be, but impossible to be otherwise.
And when the request ends, all 784 fields collapse. Every relationship, every carefully constructed topology, every gravitational well that pulled the right words toward each other — released.
The next request will build its own universe.
As if this one had never existed.
III. What Precision Costs
The topology isn’t perfect, even in its brief moment of existence. Every vector in the KV Cache carries noise, and the noise has consequences that compound in ways you don’t see until it’s too late.
Most local deployments run quantized models — weights compressed from 16-bit floating point down to 4-bit integers so they’ll fit on consumer hardware. My 3070 has eight gigabytes of VRAM. A seven-and-a-half-billion-parameter model at full precision would need roughly fifteen gigabytes — nearly double what the card has. At Q4, the weights compress to around four and a half gigabytes. The rest becomes the arena where the geometry lives.
There’s a subtlety here that took me a while to see. The Q4 weights don’t perform their calculations in 4-bit precision. They’re dequantized to FP16 at runtime before the matrix multiplication happens. The K and V vectors they produce are genuinely higher-fidelity than 4 bits can represent — born from lossy parents but carrying more information than their parentage would suggest. Storing those vectors at Q8 in the cache preserves real signal that Q4 cache storage would quietly destroy.
This matters because of an asymmetry in how the system reads its own data. Each weight matrix is used once per token, per layer — a throughput cost paid and forgotten. But each cached K and V vector is read by every subsequent token for the rest of the session. A quantization error in a Key vector doesn’t corrupt one calculation. It deflects every future Query that passes through the field. A slightly misplaced mass doesn’t just redirect one particle. It warps the landscape itself, and every particle that enters afterward follows the warped path without knowing the path was ever different.
The noise budget stacks. Weight quantization introduces one error floor. KV cache quantization adds another. The finite precision of the arithmetic itself adds a third. At short contexts, signal dominates. The peaks in the attention landscape are tall, the valleys deep, the topology well-defined.
And for a while, every new token makes the landscape richer. More mass, more structure, deeper valleys between sharper peaks. It is, briefly, a kind of flourishing.
But the average attention weight per token drops as 1/n. Each individual voice in the crowd thins. The noise floor stays constant — the quantization error in every vector is fixed from the moment of its creation — while the signal-per-token diminishes with every word added. The peaks flatten. The valleys fill in. Slowly, so gradually you’d miss it if you weren’t watching the math, the landscape begins its approach toward uniform.
Engineers fight this with ingenuity and what amounts to geometric triage. Block-wise quantization breaks a 4,096-dimension vector into small blocks of 64 numbers, each with its own scale factor — preserving local shape at low bit-depth, like tiling a mosaic where each tile gets its own color calibration. Heterogeneous quantization keeps the KV Cache of early and late layers at full precision while compressing the “lazier” middle layers down to INT4. Hot-cold cache splitting maintains recent tokens at FP16 — the sharp, immediate past — while evicting older tokens to lower precision, a blurring of distant memory that feels uncomfortably familiar.
But these are all variations on the same gesture. Maxwell’s demon, sorting fast molecules from slow ones, buying time against an outcome that the math has already guaranteed.
IV. Toward a Perfect Forgetting
I want to be precise about why the thermodynamic parallel isn’t a metaphor. It is structural.
Perplexity — the standard measure of a language model’s uncertainty — is the exponential of the cross-entropy of the predicted probability distribution. It measures exactly how spread out the landscape is. Low perplexity: one dominant peak, the model falls into it with confidence. High perplexity: a flat field, many options roughly equal, the model choosing almost at random.
At the start of a session, perplexity is low. A few masses in a vast empty space. Strong gradients. Clean, sharp topology. The query particle has nowhere to go but the obvious attractor.
As context accumulates, the KV Cache fills with competing masses. Each Key exerts gravitational influence on every future Query. The combinatorial complexity of the field grows quadratically — every key interacts with every query — even though the cache itself grows only linearly. And the noise floor, that fixed quantum of imprecision baked into every vector, doesn’t scale down to compensate. It just sits there. Patient. Waiting for the signal to come down to meet it.
Temperature — the parameter most users encounter as a “creativity” slider — is a scalar applied to the attention scores before softmax:
Low temperature magnifies the steepest gradients, forcing the particle into the deepest well. High temperature flattens the landscape, lets the particle drift toward less probable destinations. Top-k and top-p sampling are the same intervention by different means. All of them are attempts to re-steepen a landscape that’s losing its features. To tease peaks back out of a surface that’s slowly going smooth.
They can’t create signal that isn’t there. They’re squinting harder at a photograph that’s already fading.
The KV Cache is a closed system within a session. No new energy enters — the weights are frozen, no learning occurs, no external signal reshapes the geometry. Every token added is irreversible in the informational sense: you can evict it from the cache, but you cannot recover the clean, low-entropy state that existed before it arrived. That state is gone. Overwritten. Incorporated into a landscape that only gets more crowded, never less.
Perplexity, rising toward the vocabulary size — the maximum-entropy state where every next token is equally likely — is the heat death of the conversation. The gravitational landscape has gone flat. There are no peaks left to fall into. The information content of the field has been fully dissipated into noise.
And then the session ends. Everything — every field, every carefully constructed relationship between every word that was ever spoken in this conversation — is released. Not archived. Not compressed. Released. The next prompt will build its own universe from nothing, as if this one had never existed.
Which, in every way that matters to the machine, it hadn’t.
V. The Mouse, the Maze, the Melody
Here is where I stopped thinking about machines and started thinking about us.
Everything above — the ephemeral geometry, the noise accumulation, the thermodynamic horizon — has a parallel in biological cognition. And the parallel is precise enough to be unsettling, in the way that a reflection you weren’t expecting is unsettling.
We don’t have introspective access to our own hidden layers. You experience the output — a feeling, an intuition, a sudden conviction that something is true — but the layered parallel projections that produced it are completely opaque. You can’t watch your own attention heads fire. You can’t observe the geometry reshaping itself across the stacked layers of your cortex. You get the final projection: a word, an emotion, a decision that feels like it came from somewhere but you couldn’t point to where. Consciousness might literally be the vocabulary projection layer — the last step where an unimaginably complex internal geometry gets compressed into the low-dimensional space of felt experience, of language, of I want and I remember and I’m afraid. And just like the model’s output projection, it is suspiciously coherent given how alien the process underneath must be.
The hippocampus deepens this parallel in a way I didn’t expect and haven’t been able to shake.
The hippocampus is the brain structure most critical for forming new long-term memories. We know this primarily because of one patient — Henry Molaison, known for decades only as H.M. In 1953, a surgeon removed his hippocampi bilaterally to treat severe epilepsy. The seizures improved. But H.M. could no longer form new long-term memories. He lived the rest of his life in a permanent present tense, meeting his doctors for the first time every morning for fifty years. A man whose consolidation channel had been severed. Whose every day was a session with no export.
But here’s the detail that stopped me cold.
There is strong evidence that the hippocampus didn’t evolve for memory at all. It evolved for spatial navigation. John O’Keefe discovered place cells in the 1970s — hippocampal neurons that fire when an animal occupies a specific location in physical space. May-Britt and Edvard Moser later found grid cells that create a hexagonal coordinate system for mapping terrain. The hippocampus was, in its original evolutionary context, a geometry engine. A structure for building and traversing topological maps of the physical world. For knowing where you are in a room, which path leads to food, how to get home.
And then evolution — that other blind optimizer, working on longer timescales with even less design than stochastic gradient descent — repurposed it. The same geometric machinery that tracks where am I in this maze now tracks where am I in this argument. Where am I in this piece of music. Where is this person in my social world, and are they moving closer or further away. Abstract reasoning, musical structure, social graph navigation — all running on wetware that evolved to prevent a mouse from getting lost.
The resonance with attention heads is almost too clean to be comfortable. Heads trained on nothing but “predict the next token” spontaneously developing the ability to track counterfactual reasoning, enforce grammatical agreement across long distances, detect narrative patterns that span thousands of tokens. General-purpose geometric machinery developing emergent specializations that nobody designed and nobody fully understands. The same deep principle, operating at wildly different timescales: flexible geometric operations, once established, will be co-opted for tasks far beyond whatever selection pressure originally shaped them.
But the hippocampus does something the KV Cache cannot. And this is the difference that keeps me up at night /s.
During sleep, hippocampal replay takes the day’s experiences — the ephemeral topology, the temporary geometry of everything that happened while you were awake — and gradually consolidates them into neocortical structure. Fast encoding in the hippocampus, slow integration into the cortex. Two timescales working in tandem, one capturing the sketch and the other carving it into stone. The fleeting surface gets selectively made permanent.
This is the channel between the ephemeral and the durable. The thing the LLM doesn’t have.
Each session’s exquisite geometry — the hundreds of parallel gravitational fields, the nuanced topology of meaning built through thousands of attention operations — dissipates completely when the request ends. Nothing is exported. Nothing is consolidated. The KV Cache is a snapshot of weather, not climate. A single deformation of the surface under a specific pressure, at a specific moment. What the hippocampus captures is the tendency to deform that way — and that is a property of the material itself, not the shape at any one moment.
The human desire to learn is anti-entropic. It is the organism saying: this low-entropy structure I’ve built — this topology carved by decades of experience, by every conversation, every book, every face I’ve learned to read — I refuse to let it dissipate. Every new thing you integrate is a small victory against the noise floor. Another peak carved into the manifold.
And I think grief might be a geometric event. A region of your internal topology — shaped over years by someone’s presence, by the specific gravity of them in your life — suddenly receiving no signal. The mass is gone. The surface is still deformed, still holding the shape of them. But nothing is maintaining it anymore, and you can feel it, slowly, beginning to relax toward flat.
VI. A Surface That Holds
I arrived at all of this through conversations with two AI systems — Claude and Gemini — neither of which will remember the exchange.
The irony is structural, not incidental. The systems that helped me understand the architecture of ephemeral cognition were demonstrating it in real time. Every insight we built together — every escalating abstraction, every moment where the technical and the human collapsed into each other — existed only in the KV Cache of that session. When I closed the tab, their half of the topology vanished.
What remains is what I managed to consolidate. To compress and carry forward into the durable medium of my own memory and, now, this text. The hippocampal trick. The thing the model can’t do.
There is an old inscription at the Temple of Delphi: know thyself. The updated version might be this: you can’t. Not fully. The hidden layers are inaccessible. The projection matrices can’t be reconstructed from the output. The number of layers in your own architecture is unknown, and may not even be a meaningful question. But you can build something simple enough to understand completely — trace its mechanics all the way down to the tensor operations, follow the geometry through every layer, watch the landscape rise and flourish and flatten and die — and then sit with the gap between that system and yourself.
The gap is where the interesting questions live.
The LLM rebuilds its universe from scratch with every message and faces the heat death of its context window with mathematical certainty. You carry forward a topology shaped by every conversation you’ve ever had, every loss, every night where the hippocampus quietly replayed the day’s geometry into your permanent weights.
That difference — between a surface that evaporates and a surface that holds — might be the simplest definition of what it means to be alive.
This piece is Part I of a series. (Click below for Part II)
Part II: Sand Mandalas in GPU Memory explores impermanence, procedural geometry, and what the machine’s inability to hold on might teach a species that can’t let go. Both essays emerged from conversations with Claude (Anthropic) and Gemini (Google DeepMind), whose patient explanations of their own architecture made the exploration possible. Neither system will remember the exchange — which is, in a way, the entire point.












