Analytics

Thursday, February 26, 2026

Distributed Game Servers in KFChess

In the original implementation of kfchess.com, game state lived only in the memory of the single Python process that ran the whole site. That kept things very simple, but had the obvious drawback of leaving no good answer when load on the site exceeded the capacity of that process (which it did on a small number of occasions). It also meant that any live games were lost whenever I deployed the site. In my recent rewrite, I wanted to solve both of these problems together by (a) enabling multiple game servers, (b) persisting game state across restarts, and (c) allowing game servers to act as redundancy for each other. This is a pretty standard distributed systems problem, but it's fun to see exactly how it plays out. I worked with Claude to design and implement it all, which you can read more about in this post.

Game servers, game states, and corresponding Redis keys.

To start off, we use Redis as the backend for the persistent state (AOF + RDB enabled). We store three types of keys: server heartbeats, game routing, and game snapshots. The server heartbeats determine whether a server is alive, the game routing keys tell us where to route requests, and the game snapshots let us restore game state when it's lost. Note that in this design, a game "belongs" to one server at any point in time, and that server is responsible for all logic related to the game: advancing game ticks, handling requests, and running the AI.
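To make the three key types concrete, here's a minimal sketch of what the writes might look like. The key names are hypothetical (the post doesn't show the actual schema), and a plain dict stands in for Redis so the example is self-contained; real code would use a Redis client and TTLs on the heartbeat keys.

```python
import json
import time

# In-memory stand-in for Redis; real code would use a Redis client
# and set a TTL on heartbeat keys so they expire on their own.
store = {}

def heartbeat(server_id, now=None):
    # Hypothetical key: refreshed periodically so others can check liveness.
    store[f"server:{server_id}:heartbeat"] = now if now is not None else time.time()

def set_owner(game_id, server_id):
    # Hypothetical key: routing entry mapping a game to its owning server.
    store[f"game:{game_id}:owner"] = server_id

def snapshot(game_id, state):
    # Hypothetical key: periodic snapshot of the serialized game state.
    store[f"game:{game_id}:snapshot"] = json.dumps(state)

heartbeat("srv-1")
set_owner("g42", "srv-1")
snapshot("g42", {"tick": 120, "pieces": []})
```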

That brings us to the problem: how do clients get routed to the correct game server? KFChess does in-game client-server communication via WebSockets, so once we establish a connection to the correct server, we're good. We run a Caddy web server in front of all of the game servers, which normally round-robins requests across them. If a client tries to connect to a game and hits the correct server (the one with the game state in memory), then we're done. When the server doesn't have the game, it checks Redis to see which server currently owns that game. If it finds the entry, it sends back a redirect, which Caddy follows on the client's behalf so that the client connects to the correct server.
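The routing decision each server makes can be sketched roughly like this. The names and the dict-based Redis stand-in are illustrative, not the actual implementation:

```python
# Routing state: a dict stands in for the Redis routing keys, and a set
# stands in for the games this server currently holds in memory.
routing = {"g42": "srv-2"}   # game_id -> owning server
local_games = {"g7"}         # games owned by this process

def route(game_id, my_server_id):
    """Decide how to handle an incoming connection for game_id."""
    if game_id in local_games:
        return "serve"                  # correct server: handle it here
    owner = routing.get(game_id)
    if owner is not None and owner != my_server_id:
        return ("redirect", owner)      # Caddy follows this for the client
    return None                         # no routing entry: caller may try to claim
```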

Game server redirect by looking up in Redis and routing to the right place.

This is omitted from the diagram to keep the complexity reasonable, but the initial server that handles the request also sanity-checks the redirected server's heartbeat to make sure it's alive. If not, the initial server tries to claim the game as its own using an atomic compare-and-set on the game routing entry in Redis, which is how games move from one server to another. After successfully claiming a game, the server loads the snapshot state and continues the game from there -- in practice, we snapshot games every second, so it's possible to lose up to a second of game progress when this happens (which can look like pieces "snapping" backwards). If the server fails to claim the game, it simply checks the routing entry again and redirects.
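A sketch of the liveness check plus the compare-and-set claim, again with dicts standing in for Redis. With real Redis, the CAS step would be a Lua script or a WATCH/MULTI transaction rather than a plain dict update; the 5-second threshold matches the heartbeat threshold the post mentions.

```python
HEARTBEAT_THRESHOLD = 5.0  # seconds; a stale heartbeat means the server is dead

heartbeats = {}  # server_id -> last heartbeat timestamp (Redis stand-in)
routing = {}     # game_id -> owning server (Redis stand-in)

def is_alive(server_id, now):
    """A server is alive if it heartbeated within the threshold."""
    beat = heartbeats.get(server_id)
    return beat is not None and now - beat < HEARTBEAT_THRESHOLD

def try_claim(game_id, expected_owner, new_owner):
    # Atomic compare-and-set on the routing entry: only succeeds if the
    # entry still names the (dead) owner we observed. In real Redis this
    # must be atomic (Lua script or WATCH/MULTI), not two separate calls.
    if routing.get(game_id) == expected_owner:
        routing[game_id] = new_owner
        return True
    return False
```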

One last point worth mentioning is that deploys are handled more cleanly than a true server-loss case. When a server goes through a clean shutdown, it snapshots all of its games into Redis, so the snapshots represent the latest game state. And in practice, the heartbeat threshold is high enough (5 seconds) that the server isn't marked as dead during a deploy. When the server comes back, it simply considers itself the owner of the same set of games, reloads the latest snapshots from Redis, and continues along. So no game state is lost at all, though there is a brief pause in the game's progression while the server restarts.
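The clean shutdown/restart cycle might look roughly like this, assuming the same hypothetical key names as before and a dict standing in for Redis:

```python
import json

store = {}  # Redis stand-in

class GameServer:
    def __init__(self, server_id):
        self.server_id = server_id
        self.games = {}  # game_id -> in-memory game state

    def shutdown(self):
        # Clean shutdown: snapshot every owned game so Redis holds the
        # latest state, and record ownership so we can reclaim on restart.
        for game_id, state in self.games.items():
            store[f"game:{game_id}:snapshot"] = json.dumps(state)
            store[f"game:{game_id}:owner"] = self.server_id
        self.games = {}

    def startup(self):
        # On restart, reclaim every game this server still owns by
        # reloading its latest snapshot, then resume ticking.
        for key, owner in list(store.items()):
            if key.endswith(":owner") and owner == self.server_id:
                game_id = key.split(":")[1]
                self.games[game_id] = json.loads(store[f"game:{game_id}:snapshot"])
```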

That's the entire architecture for handling distributed game servers. It's relatively simple and incurs minimal overhead on request latency and CPU usage. I was pleasantly surprised by how nicely it ended up working, and it makes me a lot less hesitant to deploy code now!
