Analytics

Saturday, February 7, 2026

Quantization-Aware Distillation

Following up on the last post, I want to write about this new paper from NVIDIA on quantization-aware distillation. We learned about quantization last time, so let's talk about distillation now. Model distillation is the process of transferring knowledge from one model (the teacher) to another, usually smaller, one (the student). Similar to quantization, the goal is to produce a smaller model that uses less memory and compute while still retaining intelligence.

Benchmark performance of DeepSeek-R1 distillations. Source: DeepSeek.

A fairly well-known example is the DeepSeek-R1 release from a year ago, which was the first major open-source reasoning model. As part of their release, they distilled DeepSeek-R1 into various smaller Llama and Qwen models to demonstrate the fact that the reasoning capability could, in part, be transferred to other models. The idea is that you can take these small models which lack strong reasoning capabilities, show them DeepSeek-R1's output (more specifically, the output probability distributions), and ask them to mimic it. This substantially improved the small models' performance on tasks like math and coding which benefit from careful reasoning.

With respect to NVFP4, the problem to solve is how to find the set of NVFP4 weights that yields the strongest model. There are three approaches discussed in the paper.

  1. Post-training quantization (PTQ): Start from the full-precision weights and use calibration to choose scaling factors that map them to NVFP4 weights. This works well for large models, but observed performance is poor for smaller models.
  2. Quantization-aware training (QAT): Simulate quantization during the training process to allow the model to adjust for the bias introduced by quantization.
  3. Quantization-aware distillation (QAD): Distill knowledge directly from a high-precision, post-trained model (of the same size) to the quantized one.

In the context of the paper, the teacher model uses bfloat16 while the student model of course uses NVFP4. To perform QAD, they train the quantized model using the Kullback-Leibler divergence between the teacher and student probability distributions as the loss function. This is in contrast to traditional pre-training and QAT where the model tries to replicate the training data itself. While distillation typically uses larger teacher models, the authors observe that keeping the teacher model the same size as the student works better for QAD, likely because it's easier for the student to recover its own distribution rather than learning a new one.
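
To make this concrete, here is a minimal sketch of what a QAD training step could look like in PyTorch. This is not the paper's implementation: the teacher and student are assumed to be same-sized causal LMs, and the student is assumed to already have fake NVFP4 quantization wired into its forward pass (so gradients update high-precision master weights).

    import torch
    import torch.nn.functional as F

    def qad_step(student, teacher, input_ids, optimizer):
        # Teacher (bfloat16, frozen) provides the reference output distribution.
        with torch.no_grad():
            teacher_logits = teacher(input_ids)

        # Student forward pass runs with simulated NVFP4 quantization of its weights.
        student_logits = student(input_ids)

        # Train on the KL divergence between teacher and student distributions,
        # rather than cross-entropy against the training tokens themselves.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )

        optimizer.zero_grad()
        loss.backward()  # gradients flow to the high-precision master weights
        optimizer.step()
        return loss.item()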

KL divergence vs cross-entropy loss. Source: NVIDIA.

One particularly fascinating result is that, although both QAT and QAD models achieve similar cross-entropy loss on the dataset (fairly close to that of the bfloat16 model), the KL divergence of the QAD model is substantially better on held-out samples. The takeaway is that, although QAT adjusts for quantization well during training, the resulting model behaves very differently from its high-precision reference.

On top of that, QAD is much simpler to perform on extensively post-trained models. Models these days undergo significant post-training via supervised fine-tuning (SFT) and/or reinforcement learning (RL), and it can be quite challenging to keep these processes stable under quantization. In fact, the paper finds that QAT can actually degrade performance relative to PTQ, sometimes losing the capabilities gained during RL training. QAD bypasses the need for this, with the tradeoff of requiring a high-precision model where the heavy lifting of post-training has already been done. The paper shows even larger wins for QAD over QAT on these kinds of post-trained models.

Recovering performance via QAD with limited training data. Source: NVIDIA.

A final, interesting observation made in the paper is that QAD as a process is robust to incomplete training data. That is, even when presented with only math training data or only code training data, the model recovers performance on both domains. The paper suggests that the output probability distributions of the teacher model contain information for all domains even on limited input tokens. So as long as you present the model with some amount of high-quality training data, it can perform well generically.

Distillation as an LLM training mechanism is a powerful tool, which intuitively suggests that mimicking intelligence is computationally simpler than deriving it, whether from scratch (as with DeepSeek-R1) or as part of recovering performance in a quantized model. The fact that smaller (or quantized) models can mimic successfully also suggests that model strength is gated less by size or precision than by the process of synthesizing behaviors into the parameters. This is already happening to some extent, but my guess is that, long-term, we will have lots of small, quantized models running at the edge (e.g. on phones, computers, browsers) that are distilled from centralized, intelligent teachers.

Tuesday, February 3, 2026

LLM Quantization and NVFP4

With the rise of large language models and the desire to run them more cheaply and efficiently, the concept of quantization has gained a lot of traction. By representing the weights of the LLM with data types that use fewer bits, you reduce the GPU memory needed to load the model and the memory bandwidth consumed by operations like matrix multiplication.

As a quick refresher, floating-point numbers consist of a single sign bit, a number of exponent bits (E), and a number of mantissa bits (M). If $e$ is the value of the exponent bits (after subtracting the exponent bias) and $m$ is the integer value of the mantissa bits, then the floating-point number represented (for normal values) is

$f = sign \cdot 2^e \cdot (1 + \frac{m}{2^M})$

Intuitively, the exponent determines the rough scale of the number, and the mantissa determines the precise value within that scale.
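
As a concrete illustration of the formula (my own sketch, handling only normal values and ignoring subnormals, infinities, and NaNs), you can pull the bits of a float32 apart and reconstruct the value:

    import struct

    def decode_float32(x: float) -> float:
        # Reinterpret the float's bytes as a 32-bit unsigned integer.
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = -1.0 if (bits >> 31) & 1 else 1.0
        e = ((bits >> 23) & 0xFF) - 127   # E = 8 exponent bits, bias of 127
        m = bits & 0x7FFFFF               # M = 23 mantissa bits
        # f = sign * 2^e * (1 + m / 2^M)
        return sign * 2.0 ** e * (1 + m / 2 ** 23)

    print(decode_float32(3.14159))  # prints 3.14159 (up to float32 precision)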

float32 representation. Source: Wikipedia.

The full-precision reference format commonly used for LLMs is float32, which is your standard float data type in C, and has E = 8, M = 23 for 32 bits total. For quantizing down to 16 bits, it's become popular to use bfloat16, which is a newer format developed for machine learning specifically. The bfloat16 format uses E = 8, M = 7 versus traditional float16 that uses E = 5, M = 10. This is because capturing a wider range of scale (e.g. for large gradients or activations) is more important than high precision in ML.
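
Because bfloat16 keeps the same 8 exponent bits as float32, converting between them is essentially just dropping (or zero-padding) the low 16 bits. A quick sketch, using truncation instead of proper rounding:

    import struct

    def float32_to_bfloat16_bits(x: float) -> int:
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        return bits >> 16  # keep the sign, all 8 exponent bits, and the top 7 mantissa bits

    def bfloat16_bits_to_float32(b: int) -> float:
        return struct.unpack(">f", struct.pack(">I", b << 16))[0]

    x = 1.2345678
    print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(x)))  # roughly 1.234, precision is lost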

bfloat16 representation. Source: Wikipedia.

Quantization can go further: 8-bit, 4-bit, and even 2-bit formats are common now. By the time you get down to 4 bits, it's no longer obvious that anything will work -- after all, 4 bits can only represent 16 values! To understand how this could work, let's look in particular at NVFP4, which is NVIDIA's own data type that is targeted specifically at maintaining model accuracy.

It's a bit of a misclassification to consider NVFP4 a data type at all, as it's not a standalone representation like traditional floating-point numbers. Rather, it is a format for an entire tensor, which is the building block for neural networks. It consists of a standard 4-bit floating-point value (E = 2, M = 1) used in conjunction with per-block scaling factors and a per-tensor scaling factor. You can think of it like a generalization of the floating-point concept across multiple values instead of within a single one.

NVFP4 representation. Source: NVIDIA.
 
There is a single 32-bit tensor scaling factor that determines the "global scale" of the tensor, and then an 8-bit scaling factor for every 16 values in the tensor. These scaling factors mitigate the 4-bit format's limited range, and they've chosen E and M values for the scaling factors to most accurately reconstruct true values in practice (as measured by mean squared error). The additional scaling factors add an overhead of about 8 bits per 16 values, or 0.5 bits per 4-bit value (12.5% overhead), which is the tradeoff against a standard float4. It might seem complex to manipulate tensors in this format, but NVIDIA has implemented hardware support for NVFP4 into their Blackwell architecture, so their GPUs natively understand how to do it, abstracting the complexity away from developers.
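
Here's a rough numpy sketch of the two-level scaling idea (my own simplification, not NVIDIA's exact recipe -- in particular, the per-block scales would themselves be stored in FP8, which I skip here). Each block of 16 values is mapped onto the handful of magnitudes that E2M1 can represent:

    import numpy as np

    # The magnitudes representable by a 4-bit E2M1 float.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def quantize_nvfp4_like(x, block=16):
        x = x.reshape(-1, block)
        # Per-tensor scale (FP32 in the real format) sets the global range.
        tensor_scale = np.abs(x).max() / (6.0 * 448.0) + 1e-12
        # Per-block scale (FP8 in the real format) maps each block's max onto the FP4 max of 6.
        block_scale = np.abs(x).max(axis=1, keepdims=True) / 6.0 / tensor_scale + 1e-12
        scaled = x / (block_scale * tensor_scale)
        # Round each scaled value to the nearest representable E2M1 magnitude, keeping the sign.
        idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
        q = np.sign(scaled) * FP4_GRID[idx]
        # Dequantize to inspect the reconstruction error.
        return (q * block_scale * tensor_scale).reshape(-1)

    x = np.random.randn(64).astype(np.float32)
    print(np.abs(x - quantize_nvfp4_like(x)).mean())  # small reconstruction error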

One thing I've glossed over is how the quantization actually happens, which isn't trivial. Another reason that bfloat16 is preferred over float16 is the fact that quantizing from float32 to bfloat16 is easy as they have the same range (number of exponent bits). But if you quantize to 8 or 4 bits, that's never going to be true, so you have to apply scaling as described in this blog post. The post covers techniques for post-training quantization in order to choose quantized weights that maintain model accuracy, but I won't be covering them here. Instead, this post is a setup for my next post on quantization-aware distillation as an alternative approach, which is a paper that NVIDIA recently published -- stay tuned!

Saturday, January 31, 2026

Sparse File LRU Cache

An interesting file system feature that I came across a few years ago is sparse files. In short, many file systems allow you to create a logical file with "empty" (fully zeroed) blocks that are not physically backed until they get written to. Paraphrasing the example from ArchWiki, the behavior is as follows:

The file starts at 0 physical bytes on disk despite being logically 512MB. Then, after writing some non-zero bytes at a 16MB offset, it physically allocates a single block (4KB). The file system is maintaining metadata on which blocks of the file are physically represented on disk and which ones are not. To normal readers of the file, it's transparent -- the sparsity is managed completely by the file system.
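
A rough Python equivalent of that demonstration (exact physical sizes depend on the file system, but on something like ext4 you'd see the following):

    import os

    path = "sparse.bin"

    # Create a 512MB logical file without writing any data.
    with open(path, "wb") as f:
        f.truncate(512 * 1024 * 1024)

    st = os.stat(path)
    print(st.st_size)          # 536870912 logical bytes
    print(st.st_blocks * 512)  # 0 physical bytes allocated

    # Write a few non-zero bytes at a 16MB offset.
    with open(path, "r+b") as f:
        f.seek(16 * 1024 * 1024)
        f.write(b"hello")

    st = os.stat(path)
    print(st.st_blocks * 512)  # 4096 -- a single 4KB block is now backed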

At Amplitude, we found a cool use case for sparse files. All of the data is stored durably in cold storage (Amazon S3) in a columnar data format used for analytics queries. But it's inefficient and costly to fetch it from S3 every time, so the data gets cached on local NVMe SSDs (e.g. from the r7gd instance class). These local SSDs are more than ten times as expensive as cold storage, though, so you need a good strategy for deciding how and what to cache. To understand why sparse files are a good option for this, let's revisit columnar data formats briefly.

One of the observations about analytics queries that makes columnar data formats so effective is the fact that usually only a very small subset of columns (5-10 out of potentially thousands) are used on any particular query (and, to some extent, even across many queries). The data being stored in a columnar fashion means that each of these columns is a contiguous range inside the file, making it much faster to read. The contiguous ranges are also important in the context of our local caching -- we're not trying to pick out small pieces scattered across the file.

Setting sparse files aside for a moment, we originally had two different approaches for doing this caching:

  • The most naive strategy is caching entire files from S3. This is simple and requires minimal metadata to manage, but has the obvious downside of wasting a lot of disk space on the SSDs by storing the columns that are rarely or never used.
  • Another option is caching different columns as individual files on disk. This somewhat solves the wasted disk space issue, but now explodes the number of files, which requires a substantial amount of file system metadata. It also struggles with small columns, which are rounded up to the file system block size. With hundreds of thousands of customers of varying sizes, it's inevitable that the vast majority of files/columns are small.

At this point, it's pretty clear how sparse files give you an option in between these two. We imagine that we're caching entire files, except that the files are sparse, where only the columns that are used (more specifically, logical blocks that contain those columns) are physically present. This is simpler for a consumer to read and utilizes the disk better -- less file system metadata, and small columns are consolidated into file system blocks (as a bonus, this reduces S3 GETs as well). Similar to the latter approach above, consumers must declare the columns they need to the cache before reading them.
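
To make the read path concrete, here's a hypothetical sketch of what the cache does when a consumer declares the columns it needs. All of the names here are invented for illustration: the real system stores block metadata in RocksDB (a plain dict stands in for it below), reads from S3 with range GETs, and uses variable-sized blocks rather than a fixed size.

    import time

    BLOCK_SIZE = 4 * 1024 * 1024  # fixed size for illustration only

    def read_columns(cache_file, block_meta, s3_fetch, column_ranges):
        # Make sure every block overlapping the requested columns is physically backed.
        for start, end in column_ranges:
            for block in range(start // BLOCK_SIZE, end // BLOCK_SIZE + 1):
                if block not in block_meta:
                    # Block is a hole in the sparse file: fill it from cold storage.
                    data = s3_fetch(block * BLOCK_SIZE, BLOCK_SIZE)
                    cache_file.seek(block * BLOCK_SIZE)
                    cache_file.write(data)
                block_meta[block] = time.time()  # record last-read time for the LRU sweep

        # All requested ranges are now backed; read them like any normal file.
        results = []
        for start, end in column_ranges:
            cache_file.seek(start)
            results.append(cache_file.read(end - start))
        return results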

Managing sparse file block metadata in RocksDB.

This system requires us to manage metadata on which columns are cached, which we used a local RocksDB instance for. More specifically, we track metadata on logical blocks of these sparse files: which ones are present locally, and when they were last read. Using this, we approximate an LRU policy for invalidating data when the disk fills up (via periodic sweeps across all blocks). The invalidation process uses the fallocate system call with the FALLOC_FL_PUNCH_HOLE flag to signal to the file system that we want to reclaim that space.
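
The hole punching itself is a Linux-specific fallocate call. Python doesn't expose these flags directly, so a sketch via ctypes looks roughly like this (assuming 64-bit Linux, where off_t fits in a C long):

    import ctypes
    import ctypes.util
    import os

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    FALLOC_FL_KEEP_SIZE = 0x01   # keep the logical file size unchanged
    FALLOC_FL_PUNCH_HOLE = 0x02  # deallocate the byte range, making it sparse again

    def punch_hole(fd: int, offset: int, length: int) -> None:
        # Ask the file system to reclaim the physical space backing [offset, offset + length).
        ret = libc.fallocate(
            fd,
            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            ctypes.c_long(offset),
            ctypes.c_long(length),
        )
        if ret != 0:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))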

As an important implementation detail, the logical blocks are variable-sized with a few smaller blocks at the head (still larger than file system blocks) and then larger blocks throughout the rest, which takes advantage of the fact that the file format has a metadata header (similar to Parquet) that always needs to be read to know how the columns are laid out. The variable-sized blocks are particularly suitable for the mix of very small and very large files that are present in the data.

This sparse file LRU cache improved many aspects of the query system simultaneously: fewer S3 GETs, less file system metadata, less file system block overhead, and fewer IOPS to manage the cache. In turn, that leads to significantly improved query performance at a lower cost. It's rare that a feature that lives as low-level as the file system has such a prominent impact on system design, so when it happens, it's pretty neat.

Wednesday, January 28, 2026

Observations from using Claude Code

Firstly, I'm celebrating the 12-year revival of this blog! I spent the last 12 years building Amplitude, taking it from nothing to a public company that helps some of the biggest digital products in the world understand their users. There are many interesting stories (both technical and otherwise) to share from my time working on that, but we'll save those for another day. Instead, I want to write down some observations from my recent usage of Claude Code with Opus 4.5, which has been a hot topic lately.

During the last year at Amplitude, I had started using both Cursor and Claude Code (CC) regularly, but always in a limited scope. The codebases at Amplitude were large and often not particularly well-documented or well-tested (as you'd find at any fast-growing startup), so these AI coding tools would struggle to handle tasks where the complexity escaped clear boundaries. With guidance, they could write and modify a small React component or an isolated microservice, which was already quite useful, but they didn't radically change my approach to programming.

Last week, I decided I wanted to explicitly try something different. Instead of me driving the code and using CC for help, I inverted the process to have CC drive -- my goal was to write zero code directly. I decided to take one of my personal projects, kfchess, and have CC rebuild the entire site with the new features and enhancements I had on the backlog. The scope of kfchess is much smaller than anything at Amplitude, so I was optimistic that Claude could dramatically change how I interact with this codebase. The new repository with the in-progress code is here.

I started off by having CC read the old repository and write down everything that it felt was relevant in order to rebuild the site. It handily documented all of this for its future self. Then I dumped the whole list of things I wanted to change, ranging from functionality to maintainability, and asked it to plan out the project. You can see the result in the original architecture doc (which it has since updated as the project progressed), noting that I essentially let it make all of the decisions: web framework, database migrations, frontend state management, linting, etc.

My first observation is simple: it is really helpful to have CC set up the project and get it into a working "Hello, World" state. It can easily handle the local environment setup, package management, wiring up frontend and backend, and so on. There are a surprising number of decisions and gotchas in any such setup process -- being able to not think about those and get into a reasonable state in five minutes is huge for greenfield projects. On top of that, as someone who doesn't spin up new projects all the time, it feels great to leverage more modern technologies like uv, ruff, and zustand.

If you browse the repository, you'll notice that the docs/ folder has quite a few files. It's well-known that you should plan extensively while using CC in order to have continuity of approach across large changes. CC will create plans itself and go through them, but I find it valuable to tell it to write the plans into the repository itself. I can then review the plan, give feedback, answer open questions, and iterate until it feels roughly correct. Crucially, the decisions made in the plan can be referenced again in the future.

When planning large changes, CC helpfully breaks them down into multiple phases. Then I typically have it implement a single phase at a time (e.g. here), commit the changes, clear context, and move on. This planning pattern is akin to having a junior engineer write a design doc that you review, but the difference is that the process takes minutes versus days (or weeks!). That makes it the highest ROI use of time, so my second observation is that good planning is the core skill for getting value out of CC.

During the implementation process, CC is very confident in the code that it writes the first time around. In contrast, when I write code, I typically spend a fair amount of time thinking about the logic as I go, constantly trying to verify correctness and identify issues. The first pass that CC takes during implementation feels like it doesn't quite have that part built-in (even though "thinking" is enabled), and so when it says something like "I finished implementing the 4-player game mode!" my immediate reaction is: are you sure it works?

Anytime I find myself skeptical, I use a subagent to review the change. 90% of the time, the subagent identifies a serious bug (often multiple) that CC then fixes. You can redo this 2-3 times until the review comes up clean. My third observation is that CC does not know the right amount of thinking it needs to correctly write certain pieces of code. In the same vein as padding LLM responses with extra tokens to allow for more computation, forcing CC to review its work substantially improves quality.

Finally, my last observation is related to the best practice of providing good verification criteria. My original implementation of kfchess is entirely untested (shame on my past self), which contributed to my hesitation to continue building on top of it. In starting from scratch, I had CC focus on good test hygiene as part of project maintainability, and it's written hundreds of tests during our sessions. While tests have always been valuable in software engineering, CC changes the calculus in a positive way.

Tests are inherently repetitive and can be painful to both write and maintain. You're often forced into this tradeoff between coverage and cost. But when it's trivial to write and update tests, the tradeoff goes away -- CC will make a logic change, run the tests, and then fix whatever is broken without me needing to think about the fact that it happened at all. That said, the tests that CC writes aren't always great the first time around, so part of the subagent review process is identifying edge cases and missing tests.

All things considered, a mere 10 days into rebuilding kfchess using Claude Code, I have been impressed at how much it has sped up my development. It's the perfect project scope for the model quality we have today. It's also the first time that I can confidently say that, thanks to AI, I'm building 3-5x faster and the output is high quality. There are surely plenty more observations to come, but for now, I'm excited for what the future of programming looks like.

TLDR

  1. Greenfield project setup is better and faster.
  2. Good planning is a core skill to use Claude Code effectively.
  3. Use subagent reviews to get the right amount of thinking.
  4. Be more aggressive with tests since maintenance is cheap.
  5. 3-5x faster development with the right project scope is real!

Sunday, March 16, 2014

Distributed Systems and the CAP Theorem

In the field of distributed systems, the CAP theorem is an important result that often guides the design of such systems. The theorem states that a distributed system cannot satisfy consistency, availability, and partition tolerance simultaneously (see the Wikipedia article for definitions of these three properties). In practice, this typically means that, in the case of a network partition, there is some trade-off between consistency and availability. For example, HBase chooses consistency in that case, while Cassandra chooses availability (with eventual consistency). These days, many services have extremely high availability requirements, leading to the popularity of systems that sacrifice strong consistency for availability; having worked with Cassandra myself, however, I've seen that eventual consistency introduces a layer of complexity on both the client and server side that can be difficult to overcome. I was recently introduced to a blog post on how to beat the CAP theorem (it's lengthy, but well worth the read) written by Nathan Marz, the creator of Storm. He outlines an approach to building distributed systems that is intended to simplify the nature of eventual consistency.

First, let me preface the discussion. The article was met with some criticism, but from what I can tell it is mostly people not understanding his ideas combined with the provocative and ambiguous title. There's no such thing as "beating" the CAP theorem in the sense of violating it through some ingenious design; it's a theorem because someone proved it is always true and cannot be violated. The point being made is that we can address the CAP theorem in a way that doesn't lead us down a road of unmanageable complexity and, consequently, systems that are not robust and reliable.

The basic principle that the design leverages is that "data is inherently immutable." This is because data is always associated with a point in time; you can think of data as a fact that was true at that time. So while the balance of your bank account might change from $10$ to $20$ from time $T$ to time $T+1$, the two pieces of data, namely that your balance was $10$ at time $T$ and $20$ at time $T+1$, are forever true. In my experience, starting with this kind of definition sets you up for success: immutable data is far simpler to reason about and build on. From here, a distributed system is merely an accumulation of data that exposes methods of querying the data, where a query can be any arbitrary computation over all the data. The flexibility you have in querying the data is determined by what the system has chosen to expose, ranging from limited queries (e.g. a plain key-value store) to very expressive queries (e.g. SQL).

Now that we've gotten the mental model of distributed systems out of the way, let's take a look at the key piece of the design that lets us "beat" the CAP theorem. Instead of treating all data homogeneously as most systems do, separate the data into two layers: the batch layer, say everything up until the last hour, and the real-time layer, i.e. everything from the last hour. Queries are then sent to both layers of the data and subsequently merged to produce the final result. Since queries are typically too slow to run across the entire batch layer all the time, we precompute "views" of the batch layer that allow queries to be quickly answered. For example, in a key-value store, the data is the history of all inserts, updates, and deletes, but we can precompute a view which is the current map of keys to values, which lets us answer any query by key quickly. Do this every hour, flushing the real-time layer as the view becomes available to query, and we have a system that only transiently depends on the real-time layer.
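
As a toy illustration of the split (my own sketch, not from the post): the batch view is precomputed from the immutable event log, the real-time layer holds only recent events, and a query merges the two.

    # Immutable facts: (timestamp, key, value), only ever appended.
    batch_events = [(1, "a", 10), (2, "b", 5), (3, "a", 20)]  # everything up to the last batch run
    realtime_events = [(4, "b", 7), (5, "c", 1)]              # everything since then

    def build_batch_view(events):
        # Precompute key -> latest value over the (large, slow-to-scan) batch layer.
        view = {}
        for _, key, value in sorted(events):
            view[key] = value
        return view

    batch_view = build_batch_view(batch_events)

    def query(key):
        # Merge: the (small) real-time layer takes precedence over the precomputed view.
        latest = None
        for _, k, value in sorted(realtime_events):
            if k == key:
                latest = value
        return latest if latest is not None else batch_view.get(key)

    print(query("a"))  # 20, served from the batch view
    print(query("b"))  # 7, overridden by the real-time layer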

So what does this buy us? First of all, the batch layer is very simple. You have all of the data, you compute something from it, and you never have to deal with concurrency or anything of that sort. But we still need the real-time layer, which is going to be complex, so have we really saved anything? Let's think about failure (and if you're working with distributed systems, you should always be thinking about failure), both in terms of systems and humans. Failures will occur, and given the complexity of these real-time systems and the application code interacting with them, it's not unusual to corrupt or lose data in unexpected ways. The batch layer is essentially a reliable fallback mechanism in case of a catastrophe (which, as Marz recounts from one incident, can be as simple as running out of disk space somewhere). By isolating the complex real-time layer from the batch layer that is the ultimate source of truth, you protect yourself against these failures.

We can summarize this entire design exercise by the simple principle that we started out with: data is immutable (and immutability is good). Whether you're programming in a multithreaded environment or building a distributed data processing system, leverage immutability as much as you can. It simplifies things and protects you against your own mistakes. And again, I highly recommend checking out the full post to understand more of his thinking.

Sunday, March 2, 2014

Matrix Sketching

Last time, I wrote about a clever algorithm for approximating the histogram for a stream using bounded memory. The post was motivated by this paper, which extends that algorithm to a problem that seems unrelated at first glance: matrix sketching. The matrix sketching problem is as follows: given the rows of a large matrix $A \in \mathbb{R}^{n \times m}$ as a stream, produce a matrix $B \in \mathbb{R}^{l \times m}$ where $l \ll n$ which is a "good" approximation for $A$ when multiplied with vectors. Specifically, the algorithm in the paper achieves the following result: if $l = 1/\epsilon$, then it produces a matrix $B$ such that, for any unit vector $x$,

$||Ax||^2 \ge ||Bx||^2 \ge ||Ax||^2 - \epsilon ||A||_f^2$

where $||A||_f$ is the Frobenius norm of $A$. Here you can see the parallels with the frequency approximation algorithm; the error is a function of how "big" the stream is, which in this case is the Frobenius norm of the input matrix.

The algorithm works as follows: start with $B$ as a $l \times m$ matrix of all zeroes, and for each input row $A_i$ do the following update:
  1. Set $B_l = A_i$ (the last row of $B$).
  2. Compute the singular value decomposition (SVD) of $B$, so we obtain $B = U \Sigma V^T$ with the standard assumption that the diagonal values of $\Sigma$ are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_l$.
  3. "Remove" the smallest singular value from $\Sigma$ by letting

    $\bar{\Sigma} = \sqrt{\max(\Sigma^2 - I_l \sigma_l^2, 0)}$

    where $I_l$ is the $l \times l$ identity matrix.
  4. Then set $B = \bar{\Sigma}V^T$ (note that the last row of $B$ is all zeroes after this step because the last row of $\bar{\Sigma}$ is all zeroes by construction).
At the end, just output the current value of $B$. I won't go into any of the proof details (they can be found in the paper), but it's interesting to try to understand what the algorithm is doing intuitively. The SVD can be thought of (very loosely) as breaking down a matrix into three transformations applied to a multidimensional space: a rotation ($V$), followed by a scaling along the axes ($\Sigma$), and lastly another rotation ($U$). So the singular values are the scaling factors of the matrix in orthogonal directions, and we are removing the smallest one from each of these directions equally. As a result, we only lose a fraction of the accuracy in any particular direction (i.e. $Bx$ versus $Ax$ for a specific $x$) compared to the Frobenius norm of $A$ as a whole.
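
For concreteness, here's a short numpy version of that update loop (a straightforward sketch that recomputes a full SVD per row; the paper discusses cheaper variants):

    import numpy as np

    def sketch(A, l):
        n, m = A.shape
        B = np.zeros((l, m))
        for row in A:
            B[-1] = row  # 1. place the new row in the last (all-zero) row of B
            U, s, Vt = np.linalg.svd(B, full_matrices=False)       # 2. B = U diag(s) V^T
            s_bar = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))  # 3. shrink the squared singular values
            B = np.diag(s_bar) @ Vt                                # 4. last row is zero again by construction
        return B

    rng = np.random.default_rng(0)
    A = rng.normal(size=(1000, 50))
    B = sketch(A, l=10)

    # Spot-check the guarantee on a random unit vector x:
    # ||Ax||^2 >= ||Bx||^2 >= ||Ax||^2 - eps * ||A||_f^2 with eps = 1/l.
    x = rng.normal(size=50)
    x /= np.linalg.norm(x)
    print(np.linalg.norm(A @ x) ** 2, np.linalg.norm(B @ x) ** 2)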

This would be pretty cool even if it were a purely theoretical result since it ties two problems together in a very elegant way, but it gets better. The paper goes on to explore the algorithm's performance with some experiments and observes that the accuracy is quite an improvement over existing techniques for matrix sketching. Moreover, computing the SVD is somewhat expensive, so the author describes how the algorithm can be parallelized as well as a way to reduce the computational cost of the SVD step while only slightly relaxing the guarantees. It's a very nice paper that spans both the theoretical and practical domains for the matrix sketching problem.

Saturday, February 15, 2014

Streaming Frequency Approximation

It's been a while, but I finally motivated myself to write another blog post! Today's topic is approximating the frequency of items in a stream. With all the data that is generated today, streaming algorithms have become very popular as one of the efficient ways to process datasets that are far too large to fit in memory. One common data mining problem is determining the frequency of items in data, i.e. building a histogram. Consider a stream $X = x_1, x_2, \ldots, x_n$ of $n$ items, each of which belongs to one of $m$ classes $c_1, c_2, \ldots, c_m$. If $m$ is small, the problem is simple, as we simply maintain the entire histogram in $O(m)$ memory and output the exact result. The interesting case is when there are an extremely large number of distinct classes, the number of which is usually unknown. In this case, we will present an algorithm originally from this paper that approximates each value in the histogram within $\epsilon n$ using $O(1/\epsilon)$ memory.

At any point while processing the stream, the in-memory state is a partial histogram with at most $k = 1/\epsilon$ elements (class/count pairs). When processing an element $x_i$, we use the following procedure with three cases:
  1. $x_i$ is already in the histogram: increment its count;
  2. we have fewer than $k$ elements in the histogram: add $x_i$ with count $1$;
  3. the histogram already has $k$ other elements: while the histogram has $k$ elements, decrease the count of each element by $1$ and evict any elements with zero count, and then add $x_i$ with count $1$.
That's the entire algorithm, and it's easy to prove that it achieves the desired approximation as well. The only case in which we "lose" information is case 3, where the count of each of $k$ elements is decreased by $1$. But each such decrement round removes $k$ from the total count in the histogram, and the total count only ever increases by at most $n$ over the whole stream, so there can be at most $n/k = \epsilon n$ decrement rounds. Thus, using the partial histogram and outputting the frequency of any class not in the histogram as zero, we are guaranteed that each count is within $\epsilon n$ of the actual count. In particular, this means that the algorithm is guaranteed to output a positive count for any class that occurs with frequency $\epsilon n$ or more in the stream.
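
Here's a short Python version of the procedure exactly as described above (the while loop in case 3 keeps decrementing until at least one slot frees up):

    def approx_histogram(stream, epsilon):
        k = int(1 / epsilon)  # at most k class/count pairs kept in memory
        counts = {}
        for x in stream:
            if x in counts:              # case 1: already in the histogram
                counts[x] += 1
            elif len(counts) < k:        # case 2: fewer than k elements tracked
                counts[x] = 1
            else:                        # case 3: decrement everything until a slot frees up
                while len(counts) >= k:
                    for key in list(counts):
                        counts[key] -= 1
                        if counts[key] == 0:
                            del counts[key]
                counts[x] = 1
        return counts  # any class not present is reported as 0

    stream = ["a", "a", "b"] * 20 + ["c", "d", "e", "f"] * 5
    # Any class with true frequency >= epsilon * n is guaranteed to appear in the output.
    print(approx_histogram(stream, epsilon=0.25))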

I've always found this problem to be particularly beautiful because of the simplicity of the result, the algorithm, as well as the proof. Moreover, it's a very practical problem with many applications. To make things even more interesting, a recent paper shows how this concept can be generalized to do matrix sketching (i.e. approximating a matrix with a smaller one). I hope to get around to writing about that at some point as well.