Analytics

Saturday, January 31, 2026

Sparse File LRU Cache

An interesting file system feature that I came across a few years ago is sparse files. In short, many file systems allow you to create a logical file with "empty" (fully zeroed) blocks that are not physically backed until they get written to. Paraphrasing the example from ArchWiki, the behavior looks like this:

The file starts at 0 physical bytes on disk despite being logically 512MB. Then, after writing some non-zero bytes at a 16MB offset, it physically allocates a single block (4KB). The file system is maintaining metadata on which blocks of the file are physically represented on disk and which ones are not. To normal readers of the file, it's transparent -- the sparsity is managed completely by the file system.
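To make that concrete, here's a minimal Python sketch of the same behavior (assuming Linux and a file system with 4KB blocks; st_blocks is reported in 512-byte units):

```python
import os

path = "/tmp/sparse_demo.img"

# Create a 512MB logical file without writing any data -- no blocks are allocated yet.
with open(path, "wb") as f:
    f.truncate(512 * 1024 * 1024)

st = os.stat(path)
print(st.st_size)          # 536870912 logical bytes
print(st.st_blocks * 512)  # 0 physical bytes on disk (typical on ext4/xfs)

# Write a few non-zero bytes at a 16MB offset.
with open(path, "r+b") as f:
    f.seek(16 * 1024 * 1024)
    f.write(b"hello")

st = os.stat(path)
print(st.st_blocks * 512)  # 4096 -- a single 4KB block is now physically backed
```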

At Amplitude, we found a cool use case for sparse files. All of the data is stored durably in cold storage (Amazon S3) in a columnar data format used for analytics queries. But it's inefficient and costly to fetch it from S3 every time, so the data gets cached on local NVMe SSDs (e.g. from the r7gd instance class). These local SSDs are more than ten times as expensive as cold storage, though, so you need a good strategy for deciding how and what to cache. To understand why sparse files are a good option for this, let's revisit columnar data formats briefly.

One of the observations about analytics queries that makes columnar data formats so effective is that usually only a very small subset of columns (5-10 out of potentially thousands) is used in any particular query (and, to some extent, even across many queries). Because the data is stored in a columnar fashion, each of these columns is a contiguous range inside the file, making it much faster to read. The contiguous ranges are also important in the context of our local caching -- we're not trying to pick out small pieces scattered across the file.

Setting sparse files aside for a moment, we originally had two different approaches for doing this caching:

  • The most naive strategy is caching entire files from S3. This is simple and requires minimal metadata to manage, but has the obvious downside of wasting a lot of disk space on the SSDs by storing the columns that are rarely or never used.
  • Another option is caching different columns as individual files on disk. This somewhat solves the wasted disk space issue, but now explodes the number of files, which requires a substantial amount of file system metadata. It also struggles with small columns, which are rounded up to the file system block size. With hundreds of thousands of customers of varying sizes, it's inevitable that the vast majority of files/columns are small.

At this point, it's pretty clear how sparse files give you an option in between these two. We imagine that we're caching entire files, except that the files are sparse: only the columns that are used (more specifically, the logical blocks that contain those columns) are physically present. This is simpler for a consumer to read and utilizes the disk better -- less file system metadata, and small columns are consolidated into file system blocks (as a bonus, this reduces S3 GETs as well). Similar to the latter approach above, consumers must declare the columns they need to the cache before reading them.
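As a sketch of the write path (the function and parameter names here are hypothetical, not our actual interface), populating the sparse cache file looks roughly like this:

```python
import os

def ensure_columns_cached(cache_path: str,
                          column_ranges: list[tuple[int, int]],
                          fetch_range) -> None:
    """column_ranges: (offset, length) pairs within the original S3 object.

    Writing each range at the same offset in the sparse local file physically
    allocates only the blocks that back those columns; everything else stays a hole.
    """
    fd = os.open(cache_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for offset, length in column_ranges:
            data = fetch_range(offset, length)  # e.g. an S3 ranged GET
            os.pwrite(fd, data, offset)         # backs only the touched blocks
    finally:
        os.close(fd)
```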

Managing sparse file block metadata in RocksDB.

This system requires us to manage metadata on which columns are cached, for which we used a local RocksDB instance. More specifically, we track metadata on the logical blocks of these sparse files: which ones are present locally, and when they were last read. Using this, we approximate an LRU policy for invalidating data when the disk fills up (via periodic sweeps across all blocks). The invalidation process uses the fallocate system call with the FALLOC_FL_PUNCH_HOLE flag to signal to the file system that we want to reclaim that space.
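For illustration, here's a minimal sketch of the hole-punching step in Python (Linux-specific, assuming a 64-bit platform; the block offsets would come from the metadata described above):

```python
import ctypes
import ctypes.util
import os

# Linux fallocate(2) flags; punching a hole requires KEEP_SIZE so the logical size is preserved.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(fd: int, offset: int, length: int) -> None:
    """Deallocate [offset, offset + length) in the cache file, returning the space to the file system."""
    # Assumes 64-bit Linux, where off_t matches ctypes.c_long.
    ret = _libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          ctypes.c_long(offset), ctypes.c_long(length))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```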

As an important implementation detail, the logical blocks are variable-sized: a few smaller blocks at the head (still larger than file system blocks) and then larger blocks throughout the rest. This takes advantage of the fact that the file format has a metadata header (similar to Parquet) that always needs to be read to know how the columns are laid out. The variable-sized blocks are particularly suitable for the mix of very small and very large files that are present in the data.
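As a rough sketch of that layout (the sizes here are made-up placeholders, not the real configuration), mapping a byte offset to its variable-sized logical block might look like:

```python
# Hypothetical block-size schedule: small blocks cover the metadata header at the
# head of the file, larger blocks cover the rest.
HEAD_BLOCK_SIZE = 64 * 1024        # 64KB head blocks (assumed)
HEAD_BLOCK_COUNT = 16
TAIL_BLOCK_SIZE = 4 * 1024 * 1024  # 4MB tail blocks (assumed)

def block_index(offset: int) -> int:
    """Map a byte offset in the logical file to its variable-sized block index."""
    head_bytes = HEAD_BLOCK_SIZE * HEAD_BLOCK_COUNT
    if offset < head_bytes:
        return offset // HEAD_BLOCK_SIZE
    return HEAD_BLOCK_COUNT + (offset - head_bytes) // TAIL_BLOCK_SIZE
```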

This sparse file LRU cache improved many aspects of the query system simultaneously: fewer S3 GETs, less file system metadata, less file system block overhead, and fewer IOPS to manage the cache. In turn, that leads to significantly improved query performance at a lower cost. It's rare that a feature that lives as low-level as the file system has such a prominent impact on system design, so when it happens, it's pretty neat.

Wednesday, January 28, 2026

Observations from using Claude Code

Firstly, I'm celebrating the revival of this blog after 12 years! I spent the last 12 years building Amplitude, taking it from nothing to a public company that helps some of the biggest digital products in the world understand their users. There are many interesting stories (both technical and otherwise) to share from my time working on that, but we'll save those for another day. Instead, I want to write down some observations from my recent usage of Claude Code with Opus 4.5, which has been a hot topic lately.

During the last year at Amplitude, I had started using both Cursor and Claude Code (CC) regularly, but always in a limited scope. The codebases at Amplitude were large and often not particularly well-documented or well-tested (as you'd find at any fast-growing startup), so these AI coding tools would struggle to handle tasks where the complexity escaped clear boundaries. With guidance, they could write and modify a small React component or an isolated microservice, which was already quite useful, but they didn't radically change my approach to programming.

Last week, I decided I wanted to explicitly try something different. Instead of me driving the code and using CC for help, I inverted the process to have CC drive -- my goal was to write zero code directly. I decided to take one of my personal projects, kfchess, and have CC rebuild the entire site with the new features and enhancements I had on the backlog. The scope of kfchess is much smaller than anything at Amplitude, so I was optimistic that Claude could dramatically change how I interact with this codebase. The new repository with the in-progress code is here.

I started off by having CC read the old repository and write down everything that it felt was relevant in order to rebuild the site. It handily documented all of this for its future self. Then I dumped the whole list of things I wanted to change, ranging from functionality to maintainability, and asked it to plan out the project. You can see the result in the original architecture doc (which it has since updated as the project progressed), noting that I essentially let it make all of the decisions: web framework, database migrations, frontend state management, linting, etc.

My first observation is simple: it is really helpful to have CC set up the project and get it into a working "Hello, World" state. It can easily handle the local environment setup, package management, wiring up frontend and backend, and so on. There are a surprising number of decisions and gotchas in any such setup process -- being able to not think about those and get into a reasonable state in five minutes is huge for greenfield projects. On top of that, as someone who doesn't spin up new projects all the time, it feels great to leverage more modern technologies like uv, ruff, and zustand.

If you browse the repository, you'll notice that the docs/ folder has quite a few files. It's well-known that you should plan extensively while using CC in order to have continuity of approach across large changes. CC will create plans itself and go through them, but I find it valuable to tell it to write the plans into the repository itself. I can then review the plan, give feedback, answer open questions, and iterate until it feels roughly correct. Crucially, the decisions made in the plan can be referenced again in the future.

When planning large changes, CC helpfully breaks them down into multiple phases. Then I typically have it implement a single phase at a time (e.g. here), commit the changes, clear context, and move on. This planning pattern is akin to having a junior engineer write a design doc that you review, but the difference is that the process takes minutes versus days (or weeks!). That makes it the highest ROI use of time, so my second observation is that good planning is the core skill for getting value out of CC.

During the implementation process, CC is very confident in the code that it writes the first time around. In contrast, when I write code, I typically spend a fair amount of time thinking about the logic as I go, constantly trying to verify correctness and identify issues. The first pass that CC takes during implementation feels like it doesn't quite have that part built-in (even though "thinking" is enabled), and so when it says something like "I finished implementing the 4-player game mode!" my immediate reaction is: are you sure it works?

Anytime I find myself skeptical, I use a subagent to review the change. 90% of the time, the subagent identifies a serious bug (often multiple) that CC then fixes. You can redo this 2-3 times until the review comes up clean. My third observation is that CC does not know the right amount of thinking it needs to correctly write certain pieces of code. In the same vein as padding LLM responses with extra tokens to allow for more computation, forcing CC to review its work substantially improves quality.

Finally, my last observation is related to the best practice of providing good verification criteria. My original implementation of kfchess is entirely untested (shame on my past self), which contributed to my hesitation to continue building on top of it. In starting from scratch, I had CC focus on good test hygiene as part of project maintainability, and it's written hundreds of tests during our sessions. While tests have always been valuable in software engineering, CC changes the calculus in a positive way.

Tests are inherently repetitive and can be painful to both write and maintain. You're often forced into this tradeoff between coverage and cost. But when it's trivial to write and update tests, the tradeoff goes away -- CC will make a logic change, run the tests, and then fix whatever is broken without me needing to think about the fact that it happened at all. That said, the tests that CC writes aren't always great the first time around, so part of the subagent review process is identifying edge cases and missing tests.

All things considered, a mere 10 days into rebuilding kfchess using Claude Code, I have been impressed at how much it has sped up my development. It's the perfect project scope for the model quality we have today. It's also the first time that I can confidently say that, thanks to AI, I'm building 3-5x faster and the output is high quality. There are surely plenty more observations to come, but for now, I'm excited for what the future of programming looks like.

TLDR

  1. Greenfield project setup is better and faster.
  2. Good planning is a core skill to use Claude Code effectively.
  3. Use subagent reviews to get the right amount of thinking.
  4. Be more aggressive with tests since maintenance is cheap.
  5. 3-5x faster development with the right project scope is real!