Ternary Search

Wednesday, March 11, 2026

A Tale of Two Production Bugs (Part 2)

Last time, I wrote about my experience debugging an issue where engineers commuting to and from the office caused production network blips. Ten years later, I encountered a bug at Amplitude that reminded me of that one and the fact that systems are often more intertwined than they seem at first glance.

At Amplitude, we used consul as our service discovery mechanism. We ran consul agents alongside our containers on EC2 instances, and those agents were responsible for reporting service status to the centralized consul servers. In particular, for our in-house columnar query system, we used service discovery with some statefulness (node ID) in order to distribute work in a cache-aware way. So each node, by nature of its ID, would consistently receive work that processed the same data files, and it would cache those files on its local SSDs to get good performance.

The bug occurred seemingly out of the blue, and the symptom was strange. A consul agent running on one of these query nodes would crash itself and then stop reporting that service's health. It would happen with varying frequency -- sometimes a few at once, sometimes days without issues. The system is designed for redundancy, so we could handle a handful of these, but losing a node would immediately start shifting load away and cause other nodes to begin processing and caching the data files. This would degrade performance in the system, and if left unchecked, sufficient node failures would cause queries to stop being processed completely.

Looking into the consul agent logs, the cause was that the consul network was reporting the same hostname with a different IP address than the actual node. For example, if query-12 was running at 10.1.2.3, the consul network was broadcasting that query-12 was actually at 10.1.2.4. When the agent running on query-12 received this message, essentially stating that it is not the consensus agent for that host, it would simply crash itself.

Naturally, we tried to look for the instance with IP address 10.1.2.4 to see why it was broadcasting the same hostname as the other instance. But it didn't exist. It turns out that 10.1.2.4 was the IP address of the previous instance with the same hostname before it had been rotated out (usually as part of regular security updates). That seemed promising, but that old instance had usually been gone for a long time already -- hours, if not days -- so it didn't make sense for it to start showing up again. Surely, we didn't have a rogue consul agent somehow still running on a terminated EC2 instance?

At this point, we simultaneously tried to investigate that possibility and dug through the consul source code to try to come up with hypotheses for how we could observe these symptoms. We happened to have upgraded consul in our infrastructure a few weeks before these errors started occurring, but the timing wasn't quite right. There was no evidence that the old instances were coming back according to any other metrics or logging that we had. Meanwhile, the frequency of the crashes was enough to be annoying, but not to cause real outages or customer impact. A few months passed without any plausible ideas for the root cause -- we kept restarting consul when it happened (painful!).

Finally, I got frustrated with this unexplained behavior and sat down one weekend to try and get to the bottom of it. After staring at hundreds of Datadog charts and all of the consul metrics for long enough, I found an unexplainable correlation. At the same time that these outdated IP broadcasts were happening, there were big spikes in the consul communication queues from the totally unrelated "inbound" cluster. The inbound cluster was an auto scaling group that grew and shrank with our incoming traffic. There was no obvious relationship between these communication queues and the agent crashes, but the correlation was so strong that I felt like I must be onto something.

As I dug into the inbound cluster, I stumbled across something curious. Some instances would be active, "disappear" from the metrics for a few hours or days, and then return again. That's not the typical behavior of an auto scaling group (terminated instances go away forever), and it was also suspiciously similar to the intervals that we were seeing the outdated IP broadcasts. Upon further examination, I found that we had enabled warm pools for the inbound auto scaling group, which is a feature that lets you keep a set of instances "warm" for faster auto scaling. In this particular case, warm meant that the instance was hibernated, i.e. its RAM contents saved for future use. When the instance came back, existing processes continued running as if they had never stopped. The warm pools were enabled right before we started observing the consul agent crashes (!).

At last, there was a root cause. The consul agents running in the inbound cluster were broadcasting outdated state after they came back from hibernation. The communication queues were growing because the consul agent briefly lost network connectivity and started buffering, only to empty those buffered messages (which could be very old) onto the network after the instance resumed. It's still not obvious to me why the consul network would allow these outdated messages to essentially overwrite known current state, but at that point I was ready to move past this issue. We initially disabled the warm pool, and eventually added a hook to shut down consul before hibernating, and the issue completely went away.

These two bugs really highlight the nature of debugging complex systems to me. If you have a bug that you can reproduce in a local environment, it's almost always straightforward to figure out. But in real production systems, the interactions are sometimes so wild that you could have never imagined the root cause at the beginning. In the decade between the last bug and this one, I became much more aware of this fact, and that open-mindedness helped me follow the seemingly nonsensical breadcrumbs that ultimately led to the answer. It's often said that debugging is significantly harder than writing code, and I'd say that my experiences corroborate that!

Sunday, March 8, 2026

A Tale of Two Production Bugs (Part 1)

This is a recollection of a particularly tricky bug I encountered in a production system during my time at Sumo Logic in 2013, when I was just a few years out of school and still new to building software. Sumo Logic ran a very large AWS infrastructure in those days, spanning thousands of EC2 instances, as part of building the first log management system in the cloud. Being quite early in AWS' history (Sumo Logic was founded in 2010), there were a number of homegrown tools, including configuration management (a la Salt or Ansible) and a "VPN." Although I had used AWS previously, it was my first exposure to data processing and operations at scale.

I was about six months into the job when this bug started popping up. It was the worst type of symptom: seemingly random network connectivity drops across many different production services. Some services saw it more than others, and sometimes the drops would occur in bursts, but it happened pretty much every day. It was just innocuous enough (retries usually hid the problem) that we let it persist for about a month, with a few engineers taking a stab at debugging it here and there to no avail. The issue slowly became worse over time, and finally the team decided that we needed to track it down once and for all. I don't recall why (maybe nobody else wanted it?), but somehow the task got assigned to me.

Given how little I knew about networking and AWS at the time, it was quite a stretch for me. I spent a whole week playing with two EC2 instances and testing what could cause network disruptions between them. Eventually, I came across a plausible candidate: security group changes occasionally caused my test connection to drop in a non-deterministic way that resembled the bug we were seeing. But there was no obvious reason why the security groups would be changing in production so frequently -- after all, security groups are typically assigned when an instance starts up.

With this hypothesis in mind, I started to look more carefully at the production logs of the network errors. The distribution across services was mostly random, with the only correlation being the services with the most traffic showing up the most often (it still took me a while to rule service-specific reasons out). The distribution across time, however, was unusual. The errors occurred most frequently in two bursts during the day: one in the morning (8-9am), and one in the early evening (5-6pm). This didn't correlate with our actual traffic. And there were many fewer errors over the weekend, although traffic was noticeably lower in that case. The obvious correlation that these time windows brought to mind was... people going to and leaving work?

At this point, I felt like I was so close to figuring it out, but I was also at a loss for how to even find a next hypothesis to test. So I sent a message out to our internal chat (Campfire) asking if anyone could hazard a guess as to why engineers going to and leaving work would cause security group changes. Frankly, I felt a little silly asking such a strange question. To my surprise, our chief architect immediately responded, saying that he now knew the root cause of the bug.

It turned out that the homegrown "VPN" that I mentioned above wasn't a traditional VPN at all. Instead, the client running on each of our laptops would observe changes to our IP address and notify the server. The server would then modify a security group attached to all instances, which contained a list of developers' IP addresses, allowing production access. In fact, we had recently exceeded 50 engineers, and there was a limit of 50 rules per security group, meaning a second security group had been added, worsening the problem. So when each of us commuted to and from the office every day (as everyone did back then), we would unwittingly cause a small blip in the production network!

Everything fell into place after that, and we quickly switched to a real VPN. I still look back on this bug as a great learning experience as well as a satisfying "project," despite its nature being very different from writing code. It serves as a reminder to me to be open-minded when debugging, and to follow the patterns in the data, wherever they may lead. And little did I know that this episode would be the precursor to another one a decade down the road -- this time, with me as the chief architect seeing it all the way through.

Sunday, March 1, 2026

Structured Outputs for LLMs

Back in the early days of LLMs, it was a struggle to get structured output out of models, which made it hard to use the results in a programmatic way -- there was no well-defined interface. That problem was solved a while back, and now most inference systems support JSON schema outputs. As part of building my educational inference runtime, I implemented support for this as well. My code leverages the outlines-core library which does a lot of the heavy lifting. I'll be going over what the library does and how that integrates into the inference runtime.

Let's start with an example JSON schema:

{
"type": "object",
"properties": {
"foo": {
"type": "string"
},
"bar": {
"type": "integer"
},
"baz": {
"enum": ["a", "b", "c"]
}
},
"required": ["foo"]
}

There are two key observations that allow us to make JSON schemas usable in the LLM sampling process.

Most JSON schemas can be turned into regular expressions (see supported features here).
A regular expression can be turned into a deterministic finite automaton (DFA), i.e. a state machine that transitions based on bytes.

Using outlines-core, we can see that the above JSON schema turns into the following regex:

>>> from outlines_core.json_schema import build_regex_from_schema

>>> schema = '{"type":"object","properties":{"foo":{"type":

"string"},"bar":{"type":"integer"},"baz":{"enum":

["a","b","c"]}},"required":["foo"]}'

>>> print(build_regex_from_schema(schema))

\{[ ]?"foo"[ ]?:[ ]?"([^"\\\x00-\x1F\x7F-\x9F]|\\["\\/bfnrt])*"

([ ]?,[ ]?"bar"[ ]?:[ ]?(-)?(0|[1-9][0-9]*))?([ ]?,[ ]?"baz"[ ]?:

[ ]?("a"|"b"|"c"))?[ ]?\}

Admittedly, not the prettiest thing in the world, but a regex nonetheless. A lot of the ugliness comes from handling optional whitespace, so if you strip that away it's fairly straightforward and what you would expect, i.e.

\{"foo":"<string>"(,"bar":<int>)?(,"baz":("a"|"b"|"c"))?\}

It's worth noting that, by design, this regex fixes the order of the fields in the JSON object, so it's a strict subset of valid outputs, which is okay.

Next, outlines-core uses the regex_automata crate to convert the regex to a DFA. This DFA encodes the valid per-byte transitions from any state as well as the final states that match the regex. But recall that LLMs do not output single bytes at a time, they output tokens. So the outlines-core library goes through a process to convert this DFA into an alternate DFA with the same states that has per-token transitions instead of per-byte transitions (saved in the Index object).

Converting a regex byte-level DFA to a token-level DFA (simplified).

To do this, it performs a breadth-first search over the DFA states using the vocabulary (set of all tokens) as the possible transitions. For each token, it runs the bytes through the DFA and checks whether the resulting state is not a dead state. If so, it records the token as a valid transition and continues the search from the resulting state (if it hasn't been seen before). It builds this entire DFA upfront, which can be expensive, but once it's available, checking the transitions at a given state is constant time. That lets us easily answer the question: at any point in the LLM's output, what is the set of valid tokens that can be produced next?

LLM logits are masked based on valid token transitions provided by the DFA.

Now that we have the valid token transitions, we're ready to integrate into the inference runtime. The integration happens at the logits layer, right before we apply softmax and sample the token output. We keep track of the current state within the DFA and advance it for each generated token. And then, based on that state, it's a simple masking process where, for any token that isn't in the valid set of transitions, we set the logits to negative infinity so they won't be sampled. That's it!

What I've described is just one way to handle structured outputs for an LLM. There are various approaches that support different types of structures with different performance characteristics. One particularly interesting one is llguidance, which supports context-free grammars (in addition to JSON schema and regex) and is also super fast thanks to using token tries and sparse masks. Perhaps an interesting topic for a future post.

Thursday, February 26, 2026

Distributed Game Servers in KFChess

In the original implementation of kfchess.com, game state lived only in the memory of the single python process backend that ran the whole site. That keeps things very simple, but has the obvious drawback of having no good answers when the load on the site exceeds the capacity of that process (which it did on a small number of occasions). It also meant that any live games were lost when I deployed the site. In my recent rewrite, I wanted to solve both of these problems together by (a) enabling multiple game servers, (b) persisting game state across restarts, and (c) allowing game servers to act as redundancy for each other. This is a pretty standard distributed systems problem, but it's fun to see exactly how it plays out. I worked with Claude to design and implement it all, which you can read more about in this post.

Game servers, game states, and corresponding Redis keys.

To start off, we'll use Redis as the backend for the persistent state (AOF + RDB enabled). We store three types of keys: server heartbeats, game routing, and game snapshots. The server heartbeats are used to determine whether a server is alive, the game routing keys are used to know where to route requests to, and the game snapshots are used to restore the game state when it's lost. Note that in the design, games "belong" to a server at a particular point in time, and that server is responsible for all logic related to the game: advancing game ticks, handling requests, and running the AI.

That brings us to the problem: how do clients get routed to the correct game server? KFChess does in-game client-server communication via WebSockets, so once we establish a connection to the correct server, we're good. We use a caddy web server in front of all of the game servers, and normally it round-robins requests to the game servers. If a client tries to connect to a game and hits the correct server (has the game state in memory), then we're done. When the server doesn't have the game, it checks Redis for which server currently owns that game. If it finds the entry, then it sends a redirect back, which caddy handles for the client so that it connects to the correct server.

Game server redirect by looking up in Redis and routing to the right place.

This is omitted from the diagram to keep the complexity reasonable, but the initial server that handles the request also does a sanity check on the redirected server's heartbeat to make sure it's alive. If not, the initial server will try to claim the game as its own using atomic compare-and-set on the game routing entry in Redis, which is how games can move from one server to another. After successfully claiming a game, the server will load the snapshot state and continue the game from there -- in practice, we snapshot games every second, so it's possible to lose up to a second of game progress when this happens (can look like pieces "snapping" backwards). If the server fails to claim the game, it simply checks for redirect again.

One last point worth mentioning is that deploys are handled more cleanly than a true server-loss case. When a server goes through a clean shutdown, it snapshots all of its games into Redis, so the snapshots represent the latest game state. And in practice, the heartbeat threshold is high enough (5 seconds) so that the server isn't marked as dead on deploy. When the server comes back, it simply considers itself the owner of the same set of games, reloads the latest snapshots from Redis, and continues along. So no game state is lost at all, but there is a brief pause in the progression of the game as the server restarts.

That's the entire architecture for handling distributed game servers. It's relatively simple and incurs minimal overhead on request latency and CPU usage. I was pleasantly surprised by how nicely it ended up working, and it makes me a lot less hesitant to deploy code now!

Sunday, February 22, 2026

Speculative Decoding in LLM Inference

Running frontier LLMs is slow, but that's the tradeoff we make to get more intelligent output. But if you think about the text (tokens) that an LLM produces (or that a human produces, for that matter), you might have an intuition that a lot of it does not actually require high intelligence. There's a lot of "filler" in language to make things work syntactically that is easy to predict compared to, for example, the crux of an argument. This is evidenced by Shannon's old experiments in which he had subjects predict the next letter in a text and they got it right on the first try 75% of the time. The actual information content of most text is concentrated in a few key words. This property is doubly true of programming languages, where syntax makes up a huge portion of the content.

So a natural question one might ask is -- can we make LLMs faster at producing "easy" tokens vs "hard" tokens? This thinking is the motivation behind the idea of speculative decoding, which is a technique for speeding up LLM inference. It's a remarkably simple idea that requires just a bit of math to make it work.

Suppose we have a model $M_p$ which produces distribution $p(x)$ that we sample from. Assuming some input context $X_c$, then the processes of generating tokens from $M_p$ looks like:

$p_i(x) = M_p(X_c + [x_1, ..., x_{i-1}])$

$x_i \sim p_i(x)$

This is the standard autoregressive model sampling process where each token depends on the previous tokens. It's also one of the fundamental reasons why LLMs are hard to speed up -- there is an inherent sequentiality to how they produce tokens. But what if we can quickly "guess" the tokens that the LLM is going to generate with high probability, the same way that people did in Shannon's experiment?

Speculative decoding example. Green = speculated, red = rejected, blue = corrected. Source: Leviathan et al.

Speculative decoding involves using a faster, independent process $M_q$ to "speculate" potential future tokens of $M_p$. Traditionally, $M_q$ is another, smaller LLM, but there are multiple options (more on this below) -- the important thing is that $M_q$ produces a distribution $q(x)$ from which we can sample tokens. Incorporating speculation into the decoding process is straightforward: have $M_q$ generate $N$ (small, e.g. 4-8) tokens at each step, run $M_p$ on $N+1$ tokens simultaneously (which is possible because we now have actual tokens to condition on), and then run a "correction process" to ensure that the output distribution matches. Assuming $M_q$ is much faster than $M_p$, we get an overall speedup because $M_p$ can decode all $N+1$ of those tokens in approximately the same amount of time that it takes to decode a single one. This is a consequence of the observation that LLM decoding is usually limited by memory bandwidth, i.e. loading all of the weights from HBM to perform the forward pass.

Let's first prove correctness, specifically that we can do this without changing the token output distribution $p(x)$. Otherwise, you end up sacrificing performance for intelligence, which is specifically the tradeoff we want to avoid. The key lies in the correction process mentioned above.

For each token $x$ that we sample from $M_q$, we look at the probability distributions of both models $p(x)$ and $q(x)$ (note that we have $p(x)$ available because we still ran $M_p$). If $q(x) \le p(x)$, then we keep the token. Otherwise, we reject the token with probability $1 - p(x) / q(x)$, which corresponds to how much more likely $M_q$ was to choose the token than $M_p$, i.e. the blue excess over the red in the diagram. When we reject a token, we then re-sample from a modified distribution that spreads the excess probability to the other tokens that have deflated probabilities in $q(x)$ vs $p(x)$.

Token probability distributions and the correction process. The excess blue probability of token 1 gets spread out to the other tokens that have excess red probability.

More precisely, the resulting distribution is $p'(x) = \text{norm}(\text{max}(0, p(x) - q(x)))$. Once we reject a token, all remaining tokens that we sampled from $M_q$ are invalid, as they would have depended on the rejected token which was changed. This process guarantees that, no matter what distributions $M_q$ produces, the resulting output distribution matches exactly what $M_p$ would have produced (in the degenerate case, we end up rejecting everything and only ever have $M_p$ generate tokens directly).

The remaining question then, is how to pick a good $M_q$. Based on the correction process, we see that the quality of $M_q$'s approximation of $M_p$ is the defining factor in how well it speculates. Leviathan et al analyze this formally, and it turns out that the above process produces an expected number of valid tokens $(1 - \alpha^{N+1}) / (1 - \alpha)$ where $\alpha = E(\text{min}(p, q))$ is the expected overlap between the distributions.

In practice, there are two common choices for $M_q$. The simplest one is choosing a smaller version in the same model family, e.g. Llama-3.1-8B to speculate for Llama-3.1-405B. This is effective because the same model family tends to be trained on the same data and have the same biases introduced by architecture, leading to more overlap in the output distributions. Alternatively, the approach of the Medusa paper is to fine-tune special "decoding heads" on top of an LLM that are specifically trained to do speculation and can handle multiple branching paths. This is more efficient but requires training and is therefore not universal. In practice, the improvement from speculative decoding seems to end up in the 2-3x range on real data.

Medusa heads attached to a frozen base LLM and fine-tuned for speculation. Source: Cai et al.

The obvious downside to speculative decoding is increased computation -- we generate additional tokens from $M_q$ without changing the amount of total work $M_p$ does (it's faster because the model does the work in parallel now). In the case where we reject everything, we would be wastefully generating $N$ tokens from $M_q$ for each one token of $M_p$. Given this, speculative decoding is not a good fit for an inference environment that is compute-bound, e.g. running on big batches, which amortizes the model loading cost across multiple generations. Instead, it's well-suited for environments that have extra resources but want to provide lower latency, or as a way to leverage excess compute during periods of low traffic. As LLMs become more prevalent in real-time tasks like live translation, interactive coding (e.g. Cursor's speculative edits), and conversational agents, the demand for faster inference will continue to increase.

Wednesday, February 18, 2026

Triton Language

The world of GPU programming for AI has come a long way since I worked on writing a CUDA-based matrix library back in 2009. Both NVIDIA hardware and the CUDA ecosystem have evolved dramatically and are now the basis for the majority of AI compute in the world today (hence the $4T+ market cap). Nevertheless, writing CUDA is still really hard, primarily because you need a good understanding of low-level mechanisms within the GPU (e.g. memory hierarchy, warp scheduling, memory coalescing) to produce performant code. I recently came across a project called Triton, which is a Python-based DSL that makes it easier to build high-performance GPU kernels. I ended up writing a handful of LLM-related kernels to understand Triton better and found it quite interesting, so I want to share a little bit about this technology.

Improving performance of CUDA matrix multiplication kernel. Source: siboehm.com.

To illustrate how Triton works, it's helpful to start out with one of the simplest GPU kernels and the backbone of modern deep learning: matrix multiplication (C = A * B). There is a really nice blog post written by a performance engineer at Anthropic that walks through what it takes to get matrix multiplication performance on par with a mature library like cuBLAS. It validates my point above -- you not only have to understand these low-level GPU concepts, you need to reason carefully about how they interact with your specific computation. Even for something as basic as matrix multiplication, this is quite complex. But at a high level, you can boil it down to: how do we move data from High Bandwidth Memory (HBM) through the cache hierarchy (e.g. Shared Memory) efficiently and then make sure we have high enough arithmetic intensity to avoid being memory-bound.

By contrast, Triton has a block-centric programming model where you're abstracted away from the details of threads, warps, shared memory, etc. Instead, you schedule the execution of programs across a grid (similar to the CUDA grid), and each program instance operates on "blocks" of data, i.e. sub-regions of tensors. In the official Triton tutorial for matrix multiplication, you still need to be aware of memory hierarchy and choose the correct ordering to compute the output C. But the coalescing, shared memory caching, blocktiling, vectorization, and warptiling get done behind-the-scenes as optimizations by the Triton compiler.

Here is the Triton program code from the tutorial:

Let's break it down. First, assume we're launching a grid of programs computing [BLOCK_SIZE_M, BLOCK_SIZE_N] blocks of the output C (we'll show the grid construction later). The above function computes one such block, indexed by the program ID: pid = tl.program_id(axis=0). The first chunk of code decides which block we are going to compute:

   # Map program ids `pid` to the block of C it should compute.
   # This is done in a grouped ordering to promote shared memory reuse.
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)

num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)

num_pid_in_group = GROUP_SIZE_M * num_pid_n

group_id = pid // num_pid_in_group

first_pid_m = group_id * GROUP_SIZE_M

group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)

pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)

pid_n = (pid % num_pid_in_group) // group_size_m

For reasons explained in the tutorial, there is a somewhat sophisticated decision for which block to compute in order to improve L2 cache reuse in the GPU. The key insight is that we can reuse the same blocks that we loaded from A and B across consecutive program instances so that they are warm in the L2 cache and don't require HBM reads. This brings to light the importance of understanding program scheduling when writing Triton -- you can't hide all of the complexity! This next section of code is a very Triton-esque pattern:

# Create pointers for the first blocks of A and B.

# We will advance this pointer as we move in the K direction

# and accumulate

# `a_ptrs` is a block of [BLOCK_SIZE_M, BLOCK_SIZE_K] pointers

# `b_ptrs` is a block of [BLOCK_SIZE_K, BLOCK_SIZE_N] pointers

offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M

offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N

offs_k = tl.arange(0, BLOCK_SIZE_K)

a_ptrs = a_ptr + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)

b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)

Based on which block we decided (pid_m, pid_n), we calculate the blocks of A and B that we will perform the computation over. offs_am is an array of row offsets in A (tl.arange returns a range of indices), offs_bn is an array of column offsets in B, and offs_k is an array of column offsets in A and row offsets in B (the dimensions that we accumulate over in matrix multiplication). From these 1-D arrays of row and column offsets, we broadcast them to produce the 2-D arrays of row + column offsets in A and B that represent the entries we are going to use in the computation.

The final two lines show how, in Triton, we think of tensors as pointers (to the beginning of the data) and need to address entries within the tensor by their absolute position relative to the beginning. That's why this function needs to know the strides, or how much to advance the pointer to represent moving forward one row or column. Here's a visual representation of each of these variables:

Visual representation of offsets and blocks in Triton matrix multiplication kernel.

Once we've figured out the two blocks that we're performing the computation over, it's just a matter of actually computing the result:

# Iterate to compute a block of the C matrix.

# We accumulate into a `[BLOCK_SIZE_M, BLOCK_SIZE_N]` block

# of fp32 values for higher accuracy.

# `accumulator` will be converted back to fp16 after the loop.

accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)

for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):

# Load the next block of A and B, generate a mask by checking the K dimension.

# If it is out of bounds, set it to 0.

a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)

b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)

# We accumulate along the K dimension.

accumulator = tl.dot(a, b, accumulator)

# Advance the ptrs to the next K block.

a_ptrs += BLOCK_SIZE_K * stride_ak

b_ptrs += BLOCK_SIZE_K * stride_bk

c = accumulator.to(tl.float16)

This is your typical matrix multiplication inner loop, reframed using Triton block loads and pointer advancement. Note that masking is performed to handle cases where the dimensions are not multiples of our block sizes (another common pattern you'll see in both Triton and CUDA code). Finally, we store the result into the actual output location, as accumulator is a temporarily allocated block.

# Write back the block of the output matrix C with masks.

offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)

offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)

c_ptrs = c_ptr + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]

c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)

tl.store(c_ptrs, c, mask=c_mask)

We do similar offset computations to identify the correct block of C to write to, and we again mask to handle non-divisible dimensions. Finally, we note that the function we examined is the code for a single program instance, and we still need to know the grid that we're executing these programs on. Here's how that is defined:

def matmul(a, b):
# Check constraints.
assert a.shape[1] == b.shape[0], "Incompatible dimensions"
assert a.is_contiguous(), "Matrix A must be contiguous"
M, K = a.shape
K, N = b.shape
# Allocates output.
c = torch.empty((M, N), device=a.device, dtype=torch.float16)
# 1D launch kernel where each block gets its own program.
grid = lambda META: (triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']), )
matmul_kernel[grid](
a, b, c,
M, N, K,
a.stride(0), a.stride(1),
b.stride(0), b.stride(1),
c.stride(0), c.stride(1),
)
return c

The main thing worth noting here is that the grid is not strictly a constant size. It takes advantage of Triton's autotune capability that searches over a list of options (specifically for BLOCK_SIZE_M, BLOCK_SIZE_N, and GROUP_SIZE_M in this case) to find the best configuration for a particular input size. This helps better align the program scheduling with the underlying hardware.

The Triton code is far from simple, but it does eliminate having to think about many of the harder aspects of GPU programming. I won't go into the details here, but Triton does this by compiling the DSL code into an LLVM intermediate representation where it can perform optimizations. Because the Triton compiler sees block-level access patterns rather than thread per thread-accesses, it can perform dataflow analysis on blocks and reason at a much higher level about what the code does. In doing so, it can decide when to prefetch memory, do hierarchical tiling (e.g. to fit the thread/warp model), schedule to coalesce memory accesses, manage and synchronize shared memory, and more. See the original paper (not up-to-date for Triton 2.0) for more details.

Triton vs cuBLAS matrix multiplication performance. Source: triton-lang.org.

As a user of Triton, writing code in Python that is mostly describing the logic of your computation and getting close to optimized CUDA kernel performance is a huge win. To get the most out of the GPU, you still need to understand memory hierarchy, program scheduling, and data layouts, but Triton gets you off the ground very quickly. Besides, we can't have programming be too easy, right?

For further examples, my triton-practice repository has a few well-documented implementations of core LLM building blocks like Rotary Positional Embeddings (RoPE) and Flash Attention.

Saturday, February 14, 2026

Linear Representations and Superposition

As LLMs become larger, more capable, and more ubiquitous, the field of mechanistic interpretability -- that is, understanding the inner workings of these models -- becomes increasingly interesting and important. Similar to how software engineers benefit from having good mental models of file systems and networking, AI researchers and engineers should strive to have some theoretical basis for understanding the "intelligence" that emerges from LLMs. A strong mental model would improve our ability to harness the technology. In this post, I want to cover two fundamental and related concepts in the field (each with their own paper) that I find fascinating from a mathematical perspective: the linear representation hypothesis (Park et al.) and superposition (Anthropic).

The linear representation hypothesis (LRH) has existed for quite some time, ever since people noticed that the word embeddings produced by Word2Vec satisfied some interesting properties. If we let $E(x)$ be the embedding vector of a word, then you observe the approximate equivalence

$E(``\text{king"}) - E(``\text{man"}) + E(``\text{woman"}) \approx E(``\text{queen"})$.

Observations of this form suggest that concepts (i.e. gender in the example) are represented linearly in the geometry of the embedding space, which is a simple but non-obvious claim.

Simplified model of an LLM in terms of embeddings and unembeddings.

Fast forward to modern LLMs, and the LRH remains a popular way to interpret what is going on inside these models. The Park et al. paper presents a mathematical framing of the hypothesis to try and formalize the idea. It uses a simplified model of an LLM where most of the inner workings (multilayer perceptron, attention, etc) are treated as a black box, and the interpretation of the LRH happens in two separate representation spaces with the same dimensionality as the model:

The "embedding space" where the final hidden states of the network live ($E(x)$ for an input context $x$). This is similar to the word embedding formulation and is where you would perform interventions that affect the model's behavior.
The "unembedding space" where the rows of the unembedding matrix live ($U(y)$ for each output token $y$). The concept direction measured by a linear probe over the hidden state (to evaluate the presence of the concept) corresponds to a vector in this space.

There are analogous statements of the LRH in the two respective spaces. Suppose $C$ represents the directional concept of gender, i.e. male => female. Then any pairs of input contexts that differ only in that concept should satisfy, e.g.

$E(``\text{Long live the queen"}) - E(``\text{Long live the king"}) = \alpha \cdot E_C$

where $\alpha > 0$ and $E_C$ is a constant vector in the embedding space referred to as the embedding representation. Similarly, any pairs of output tokens that differ only in that concept should satisfy, e.g.

$U(``\text{queen"}) - U(``\text{king"}) = \beta \cdot U_C$

where $\beta > 0$ and $U_C$ is a constant vector in the unembedding space referred to as the unembedding representation. Basically, applying the concept has a (directional) linear effect in both spaces.

The paper goes into much more detail that I'll skip over here, but they show that the embedding and unembedding representations are isomorphic, which unifies the intervention and linear probe ideas. They then empirically verify on Llama 2 that they can find the representations for a variety of concepts (e.g. present => past tense, noun => plural, English => French) that approximately fit into their theoretical framework -- cool!

Approximate orthogonality of concept representations in Llama 2. Source: Park et al.

Okay, so let's assume concepts do in fact have linear representations. Then it would stand to reason that unrelated concepts have orthogonal directions. Otherwise, applying the male => female concept could influence the presence of the English => French concept, which doesn't make sense. One of the key results from Park et al. is that this orthogonality doesn't occur under the standard Euclidean inner product but instead under a "causal inner product" that is derived from the unembedding matrix. Only by looking at concept representations through that lens do we get the orthogonality we expect.

But in these models, the representation space is relatively small (most ranging from 2K to 16K dimensions). So how do these spaces fit such a large number of language features that far exceeds their dimensionality? It's impossible for all such features to be orthogonal, no matter the geometry.

The interference effect of non-orthogonal features. Source: Anthropic.

This is where superposition comes into play. In low-dimensional spaces, the intuition is that, when you have $N$ vectors in a $d$-dimensional space with $N > d$, they start to interfere substantially (inner product has a large magnitude). This is one of those examples where low-dimensional intuition does not extend to higher dimensions, however, as evidenced by the Johnson-Lindenstrauss lemma. An implication of the lemma is that you can choose exponentially (in the number of dimensions) many vectors that are almost-orthogonal -- that is, the inner products between any pair are bounded by a small constant. You can think of this as the flip side of the curse of dimensionality.

The Anthropic paper demonstrates the superposition phenomenon in toy models on small, synthetic datasets. One particularly interesting observation is that superposition does not occur with no activation function (purely linear computation), but it does occur with a nonlinear one (ReLU in their case). The idea is that the nonlinearity allows the model to manage the interference in a productive way. But this still only works well because of the natural sparsity of these features in the data -- models learn to superimpose features that are unlikely to be simultaneously present, minimizing interference.

Visualization of a square antiprism, the energy-minimizing arrangement of 8 points on a 3-D unit sphere.

In experimental setups on synthetic data with independent features of equal importance and sparsity, they observe that the embedding vectors learned by the model form regular structures in the embedding space, e.g. a tetrahedron, pentagon, or square antiprism. Coincidentally, these are the same types of structures that I found in some old research I did on spherical codes. These structures emerged from using gradient descent-like algorithms to minimize the energy (analogous to that described by the Thomson problem) of arrangements of points on unit hyperspheres. Fun to see the overlap of multiple fields!

To conclude, features as linear representations, even if not the complete story, is a valuable framework to help us interpret and intervene in LLMs. It has a solid theoretical basis that is backed up empirically. Sparsity, superimposition, and the non-intuitive nature of higher-dimensional spaces give us a window into understanding how the complexity of language (and intelligence?) gets captured by these models. Mechanistic interpretability has a long way to go, but it's reassuring to see us slowly uncovering the nature of LLMs and why they're able to do such incredible things.

Analytics