Tautik Agrahari

geohash vs quadtree vs r-tree - three ways to find what's near you

2026-05-14T00:00:00Z

every time your phone shows "5 ubers within a mile" or yelp pulls up restaurants near your dot on the map, somewhere in the backend a system is solving the same problem: given a point on earth, find all the other points close to it, fast.

the naive answer is to scan every row in your database, compute the haversine distance to the query point, and return everything within some radius. this works when you have 100 points. it absolutely melts when you have 100 million.
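for reference, here's roughly what that naive scan looks like. a minimal sketch: the shop tuples are made-up sample data, and `haversine_miles` / `brute_force_nearby` are illustrative names, not from any library.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def brute_force_nearby(points, lat, lon, radius_miles):
    """O(n) scan: one distance computation per stored point."""
    return [p for p in points if haversine_miles(lat, lon, p[1], p[2]) <= radius_miles]


# hypothetical sample rows: (name, lat, lon)
shops = [
    ("blue-bottle", 40.7128, -74.0060),
    ("joe-coffee", 40.7130, -74.0050),
    ("far-away", 34.0522, -118.2437),  # los angeles
]
print(brute_force_nearby(shops, 40.7128, -74.0060, 1.0))
```

the distance function is the part that survives into every indexed version too; only the "for every row" loop goes away.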

so we use spatial indexes. three come up over and over: geohash, quadtree, and r-tree. they all partition space, but they do it in fundamentally different ways — geohash encodes points into sortable strings, quadtree carves space into rigid quadrants, r-tree wraps data in flexible overlapping rectangles. let's walk through each, end with a concrete yelp example, and figure out which one to reach for when.

what you'll take away

quick pointers so you know what to look for as you read:

geohash, quadtree, and r-tree are three bets on the same narrowing problem. encode points into sortable strings, carve space into a rigid grid, or wrap data in flexible overlapping rectangles.
the index never gives you the answer — it gives you a candidate set. narrow with the structure, refine with exact haversine. skip the second step and you ship wrong results.
always query the cell AND its 8 neighbors. two points 10 meters apart can land in cells that don't even share a prefix.
pick by deployment, not elegance. geohash for redis-fast point-only lookups, quadtree for wildly uneven density, r-tree inside real databases mixing points and polygons.
r-tree is what production databases actually run. postgis, mysql spatial, oracle — overlap is the tax, splitting heuristics are the tuning knob.

the problem they solve

okay. concrete. you're building yelp. user opens the app at lat 40.7128, lon -74.0060 (manhattan). they want "coffee shops within 1 mile". you have 10 million coffee shops in your db.

a brute-force SELECT * FROM shops WHERE distance(lat, lon, 40.7128, -74.0060) < 1 does 10 million distance computations per query. dead. you'd burn through a server cluster before lunch.

what you actually want is: narrow the search to ~50 candidate shops in O(log n), then run the precise distance check on just those. that "narrowing" step is what spatial indexes do.

geohash, quadtree, and r-tree are three different bets on how to do the narrowing.

geohash — encode space into a string

geohash converts a (lat, lon) pair into a short base32 string like dr5ru7p. the magic is the prefix property: two points that are close on earth will, most of the time, share a long common prefix.

so dr5ru7p and dr5ru3z are nearby (both share dr5ru). dr5ru7p and 9q8yzn are very far (no shared prefix — different sides of the country).

how it actually encodes

geohash interleaves the binary representations of longitude and latitude. you take lat (range -90..90), repeatedly split it in half, and write down a 0 or 1 for whether your target is in the lower or upper half. same for lon (range -180..180). interleave the bits, starting with longitude, and encode each group of 5 bits as one base32 character. so each character adds exactly 5 bits of precision.
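the halving-and-interleaving is short enough to write out. a minimal sketch of a standard geohash encoder (the function name is mine; real projects usually pull in a library instead):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet: no a, i, l, o


def geohash_encode(lat, lon, length):
    """encode (lat, lon) as a geohash string of `length` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, idx, nbits = [], 0, 0
    use_lon = True  # the bit stream starts with a longitude bit
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            if bit:
                lon_lo = mid  # target is in the upper half
            else:
                lon_hi = mid  # target is in the lower half
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        idx = (idx << 1) | int(bit)
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:  # 5 bits collected: emit one base32 character
            chars.append(BASE32[idx])
            idx, nbits = 0, 0
    return "".join(chars)


print(geohash_encode(40.7128, -74.0060, 7))
```

truncating the output is the same as encoding at a shorter length, which is exactly the prefix property.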

precision by length:

length	cell size (approx)
4 chars	~39 km × 19 km
5 chars	~5 km × 5 km
6 chars	~1.2 km × 0.6 km
7 chars	~150 m × 150 m
8 chars	~38 m × 19 m

so for "coffee shops within ~1 mile", a 5-character geohash bucket is about right. you store every shop pre-computed with its 5-char geohash, then the query becomes "give me all shops where geohash = 'dr5ru'."

why redis loves this

geohashes are sorted strings. they fit perfectly into a redis sorted set, or any b-tree index. redis actually has built-in commands for this — under the hood it's just a sorted set of geohash-encoded scores.

GEOADD shops -74.0060 40.7128 "blue-bottle"
GEOADD shops -74.0050 40.7130 "joe-coffee"

GEOSEARCH shops FROMLONLAT -74.0060 40.7128 BYRADIUS 1 mi

dead simple. no postgis, no kd-tree, no custom service. just a sorted set with magic strings.

the edge case nobody mentions on day 1

geohash has one big gotcha: points right next to each other can fall in different cells.

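imagine the cell boundary runs right through times square. a hotdog stand 5 meters east of the boundary has geohash dr5rup. a coffee shop 5 meters west has geohash dr5rgz. they're 10 meters apart, but their geohashes diverge at the fifth character, so a lookup on the 5-char cell dr5ru never sees the coffee shop. at the biggest boundaries (the equator, the prime meridian) two neighbors can share no prefix at all.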

the fix is mandatory: always query the user's cell AND its 8 neighboring cells. there are well-known formulas to compute the 8 neighbors of any geohash, so this is cheap. you end up querying 9 cells instead of 1, then filtering the result with exact haversine distance.

so the real query loop is:

compute the user's geohash at the right precision
compute its 8 neighbors → 9 cells total
fetch all points whose geohash is in that set
compute exact haversine distance for each, filter to radius
return

step 4 is the unavoidable "refinement" step. the index narrows you from 10 million to ~50–200 candidates. the haversine check filters those to the final answer.

✽ RECALL what property of geohash lets you build a "nearby" feature on a plain sorted set or b-tree index, with no spatial extension at all?

the prefix property — interleaving lat/lon bits and base32-encoding them means nearby points (usually) share a long common prefix, and the resulting strings sort. proximity search becomes a prefix/range lookup on any ordered structure, which is exactly why redis ships GEOADD/GEOSEARCH as a thin layer over a sorted set.

✽ RECALL two coffee shops sit 10 meters apart but their geohashes diverge at the fifth character, so a 5-char cell lookup returns only one of them. what happened, and what's the mandatory fix?

a cell boundary runs between them — the prefix property breaks at boundaries, so closeness on earth doesn't guarantee closeness in the encoding. the fix is to always query the user's cell plus its 8 neighbors (cheap, well-known formulas exist), then filter the candidates with exact haversine distance. query one cell and you'll silently drop results.

quadtree — recursive 2D split

quadtree comes at it differently. instead of encoding each point, it organizes space itself into a tree.

start with the whole 2D map as a single box. that's the root. when too many points are in one box (say, more than 4), split the box into 4 equal quadrants — NW, NE, SW, SE. each quadrant becomes a child node. if any of those children gets too crowded, split that one into 4 more. recurse.

the beauty is density-adaptive: downtown manhattan ends up with tiny leaf nodes (lots of recursive splits) while the middle of nebraska might be a single huge leaf. you're not wasting tree depth on empty space, and you're not letting busy areas get overcrowded.

the search

to find points near (lat, lon):

start at the root
descend into whichever quadrant contains your query point
recurse until you're in a leaf
that leaf contains the candidate points (and you check neighboring leaves too — same boundary problem as geohash, different mechanics)
exact haversine check, return

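each step picks 1 of 4 children, so on reasonably spread-out data the search is O(log₄ n), the same shape as O(log n) with a smaller constant. heavily clustered points make the tree deeper in that region, which is why real implementations cap the depth.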
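here's a minimal sketch of the split-when-crowded rule and the search. two assumptions worth flagging: the "check neighboring leaves" step is done as a bounding-box query (descend into every quadrant the box touches), which is one common way to do it, and `MAX_DEPTH` is an arbitrary guard i added so identical points can't split forever.

```python
MAX_DEPTH = 24  # stops endless splitting when many points share one location


class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0):
        self.box = (x0, y0, x1, y1)  # (min lon, min lat, max lon, max lat)
        self.capacity = capacity
        self.depth = depth
        self.points = []      # (lon, lat, item); only populated while this is a leaf
        self.children = None  # four QuadTrees once this node has split

    def insert(self, x, y, item):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is not None:
            return any(c.insert(x, y, item) for c in self.children)
        self.points.append((x, y, item))
        if len(self.points) > self.capacity and self.depth < MAX_DEPTH:
            self._split()
        return True

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1)
        self.children = [
            QuadTree(x0, my, mx, y1, *args),  # NW
            QuadTree(mx, my, x1, y1, *args),  # NE
            QuadTree(x0, y0, mx, my, *args),  # SW
            QuadTree(mx, y0, x1, my, *args),  # SE
        ]
        pending, self.points = self.points, []
        for x, y, item in pending:  # push existing points down into the children
            self.insert(x, y, item)

    def query(self, qx0, qy0, qx1, qy1):
        """all points inside the query box; prunes quadrants the box doesn't touch."""
        x0, y0, x1, y1 = self.box
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []
        if self.children is None:
            return [p for p in self.points if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]
        return [p for c in self.children for p in c.query(qx0, qy0, qx1, qy1)]

    def leaf_depth(self, x, y):
        """how many splits deep the leaf containing (x, y) sits."""
        node = self
        while node.children is not None:
            node = next(c for c in node.children
                        if c.box[0] <= x <= c.box[2] and c.box[1] <= y <= c.box[3])
        return node.depth


# hypothetical data: a tight cluster in manhattan, one lonely point in omaha
tree = QuadTree(-180.0, -90.0, 180.0, 90.0)
for i in range(6):
    tree.insert(-74.000 + i * 0.001, 40.710 + i * 0.001, f"shop-{i}")
tree.insert(-95.93, 41.26, "omaha")

hits = tree.query(-74.01, 40.70, -73.99, 40.72)
print(sorted(item for _, _, item in hits))
print(tree.leaf_depth(-74.0, 40.71), tree.leaf_depth(-95.93, 41.26))
```

the two depths printed at the end are the density-adaptive part: the manhattan leaf sits much deeper than the omaha one. the box query still returns a rectangle's worth of candidates, so the haversine pass afterwards stays.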

tradeoffs vs geohash

density adapts automatically. if nyc has 100k coffee shops and nebraska has 50, the quadtree just handles it. with geohash, you have to manually pick a precision that works everywhere.
memory overhead. every internal node, every pointer. for uniform-density data, geohash is way cheaper.
harder to distribute. geohash strings shard naturally across machines (just hash the prefix). quadtrees need careful partitioning.
rebalancing is real. as drivers move (uber!) or points get added/removed, the tree shape changes. geohash, since it's just a string per point, sidesteps this entirely.

✽ RECALL why does a quadtree handle manhattan-vs-nebraska density gracefully where geohash forces an awkward global compromise — and what does that grace cost you?

a quadtree splits a box only when it gets crowded, so dense areas grow deep tiny leaves while empty space stays one big block; geohash makes you pick a single precision that's wrong somewhere. the cost: node and pointer overhead, harder distribution across machines, and real rebalancing as points move — geohash sidesteps all of that because it's just a string per point.

r-tree — flexible, overlapping rectangles

both geohash and quadtree split space on fixed grid lines — every cell or quadrant has a predetermined position and size. r-trees throw that constraint out. instead, they wrap the data in flexible, overlapping bounding rectangles that adapt to wherever the data actually clusters.

think of organizing a pile of photos on a table. quadtree would force you to draw a fixed 2×2 grid and keep subdividing. r-tree just lets you draw natural rectangles around clumps of nearby photos, even if those rectangles overlap a bit.

each rectangle is a minimum bounding rectangle (MBR) — the smallest axis-aligned box that contains all its children. the root has a few wide MBRs, each child has tighter ones, all the way down to leaves that hold actual data items.

why this matters

r-tree is the default in production databases. postgis builds its spatial indexes as r-trees on top of postgresql's gist framework. mysql's spatial index and oracle spatial are r-trees too, as are most commercial spatial databases. there's a reason for that.

mixed data types. r-tree indexes points, polygons, road networks, delivery zones — anything with a bounding box — in the same index. quadtree struggles here because non-point shapes mess up the rigid grid; r-tree just expands the rectangle to fit.
adapts to data clusters. rectangles hug the data rather than slicing space arbitrarily. fewer empty cells, fewer wasted traversals.
disk-friendly. modern r-tree variants (r*-tree, hilbert r-tree) tune their splits for how databases actually read pages off disk. when you call ST_DWithin in postgis, it's hitting an r-tree.

the trade-off

rectangles can overlap. when a query point sits inside two overlapping rectangles, you have to descend into both branches and union the results. lots of overlap means the index does extra work.
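to make the overlap cost concrete, here's a toy search over a hand-built two-level tree. this is only the lookup half (no insertion, no splitting heuristic), and the node layout, names and coordinates are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                    # (min_x, min_y, max_x, max_y)
    children: list = field(default_factory=list)  # child Nodes (internal node)
    items: list = field(default_factory=list)     # (name, bbox) pairs (leaf node)


def intersects(a, b):
    """true when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr_of(boxes):
    """smallest axis-aligned box containing every box in `boxes`."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def leaf(items):
    return Node(mbr=mbr_of([box for _, box in items]), items=items)


def search(node, query, opened):
    """names of items whose bbox intersects `query`; `opened` records leaves we had to read."""
    hits = []
    if node.items:
        opened.append(node)
        hits = [name for name, box in node.items if intersects(box, query)]
    for child in node.children:
        if intersects(child.mbr, query):  # may be true for several siblings
            hits += search(child, query, opened)
    return hits


# points are zero-area boxes; "zone-a" is a polygon's bounding box
leaf_a = leaf([("shop-1", (1, 1, 1, 1)), ("zone-a", (2, 2, 6, 6))])
leaf_b = leaf([("shop-2", (5, 5, 5, 5)), ("shop-3", (9, 9, 9, 9))])
root = Node(mbr=mbr_of([leaf_a.mbr, leaf_b.mbr]), children=[leaf_a, leaf_b])

opened = []
print(search(root, (5.5, 5.5, 5.5, 5.5), opened), len(opened))
```

the query point sits where the two leaf rectangles overlap, so both leaves get opened even though only one holds a match. a point in the non-overlapping corner opens just one. that extra read is the tax.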

production r-tree implementations spend a lot of cleverness on the splitting heuristic: when a node fills up and has to split, how do you divide its children into two new MBRs that minimize overlap? this is where r-tree, r+-tree, r*-tree, and hilbert r-tree differ. all the same shape, different splitting strategies.

in interviews and at the whiteboard: just say "r-tree" and you're fine. the variants matter for production tuning but not for the architectural conversation.

✽ RECALL why do postgis, mysql spatial, and oracle all default to r-trees instead of geohash or quadtrees — and what's the tax you pay for it?

minimum bounding rectangles hug the data instead of slicing space on fixed grid lines, and they can wrap anything with a bounding box — points, polygons, road networks — in one index. the tax is overlap: a query point inside two rectangles forces you down both branches, which is why production variants pour their cleverness into splitting heuristics that minimize overlap.

a concrete example — yelp "coffee near me"

let's walk through the same query with the first two approaches, geohash and quadtree.

user at (40.7128, -74.0060) asks for coffee shops within 1 mile.

geohash version

# precompute (done once per shop, stored in db / redis)
shop "Blue Bottle"    → geohash dr5ru7p
shop "Joe Coffee"     → geohash dr5ru3z
shop "Stumptown SoHo" → geohash dr5rsbk
shop "...10M others"

# query time
user_geohash = encode(40.7128, -74.0060, length=5) = "dr5re"
neighbors    = ["dr5re",  # the user's cell
                "dr5rf", "dr5rg", "dr5ru",
                "dr5rd",          "dr5rs",
                "dr5r6", "dr5r7", "dr5rk"]  # 8 neighbors

# fetch all shops whose geohash starts with one of these 9 prefixes
candidates = SELECT * FROM shops WHERE LEFT(geohash, 5) IN (...9 values...);
# ~50–200 results

# exact distance filter
results = [shop for shop in candidates
           if haversine(user, shop) <= 1.0]

simple. redis-friendly. one round trip. ship it.

quadtree version

# precompute (built in memory or stored in postgis-style index)
quadtree = build_quadtree(all_10M_shops)

# query time
user_point = (40.7128, -74.0060)
leaf       = quadtree.find_leaf(user_point)
candidates = leaf.points + leaf.neighbors_within(1_mile)

# exact distance filter
results = [c for c in candidates if haversine(user, c) <= 1.0]

same overall shape, different data structure. the quadtree gracefully handles the fact that manhattan has 100x the shop density of staten island, while geohash needed you to pick a single global precision.

when to pick which

use case	better fit	why
uber nearby drivers	geohash	drivers move every few seconds; just update the geohash string in redis. classic pattern.
tinder location matching	geohash	users update location periodically, sorted set, done
anything you want in redis quickly	geohash	it's literally a built-in
mapping with wildly varying density	quadtree	density-adaptive splits give better worst-case performance
physics sim, collision detection	quadtree	spatial reasoning over points matters more than encoding
yelp restaurant search at production scale	r-tree (via postgis)	mixed data — points + polygons (delivery zones, neighborhoods) — and you want a real db query, not just a redis lookup
mapping app indexing roads + POIs together	r-tree	only structure that gracefully handles points + polygons + linestrings in one index
postgis, mysql spatial, oracle spatial	r-tree	they literally use r-tree — that's the default

rule of thumb:

geohash when you need redis-fast, point-only, simple-as-possible.

quadtree when density is wildly uneven and points are the only thing you index.

r-tree when you're in a real database (postgis, mysql) and need to mix data types.

✽ RECALL you're building "drivers near me" for an uber clone and a yelp-style search with delivery zones. which index for each, and why?

drivers → geohash in redis: moving points are just string updates in a sorted set, no tree to rebalance every few seconds. yelp → r-tree via postgis: you're mixing points with polygons (delivery zones, neighborhoods) and want a real database query, not just a key lookup. quadtree only enters when density is wildly uneven and points are the only thing you index.

the universal pattern

all three approaches share the same two-step structure:

index → narrow. use the spatial structure to go from millions of candidates to a few hundred. cheap and fast.
exact distance → refine. for each candidate, compute the real haversine distance. more expensive per-row, but only running on the narrowed set.

the index doesn't give you the right answer. it gives you a candidate set small enough that the exact answer is cheap to compute. forgetting step 2 is how you ship a "nearby" feature that includes shops 10 miles away just because they happened to share a geohash cell.

✽ RECALL what does a spatial index actually give you — and how do you ship a bug if you treat its output as the answer?

a candidate set small enough that computing the exact answer is cheap: narrow from millions to a few hundred in O(log n), then haversine-filter to the radius. skip the refine step and you return shops 10 miles away just because they shared a cell — the index narrows, the distance check decides.

what i'd remember

three different bets on the same problem. geohash = encoding, quadtree = rigid grid, r-tree = flexible overlapping rectangles.
geohash is the right default for most "nearby" features built on redis or a sorted index.
quadtree wins when point density varies a lot across your map.
r-tree wins inside real databases (postgis, mysql) — and it's the only structure that gracefully indexes mixed shapes.
always query neighbors, never just one cell. boundary edge cases will burn you otherwise.
always do the exact distance check after the index. the index narrows; the distance check gives you the right answer.
the "narrow then refine" two-step is universal. whichever index you pick, the second step is the same haversine filter.

nothing here is novel. redis has GEOADD, postgis has ST_DWithin, mongodb has 2dsphere indexes. the design is in picking the right tool, remembering the boundary fix, and not over-engineering. you're paid to solve a problem, not to ship the fanciest architecture.

computer use agent story

2026-05-12T00:00:00Z

a cua (computer use agent) is just an llm that controls a browser. you give it a task like "log into some saas dashboard and screenshot the encryption settings", it takes a screenshot, sends the screenshot to claude with a list of tools (click, type, scroll), claude returns a tool call like click(x=420, y=180), you apply that to the browser, take another screenshot, and loop. that's the whole thing. it's a while not done: screenshot → llm → tool call loop with a real browser on the other end.

the part that's not obvious is where the browser actually runs. it can't run on the user's laptop (you'd need them online and giving you access). it can't run on your api server (you'd be sharing one chrome between every request). so it has to run on a remote linux box somewhere, with chrome inside it, and your backend talks to that chrome over the network. that "remote linux box with chrome" is the whole infra problem. everything else — the prompts, the planner, the auth, the recording — is built around that one primitive.

what you'll take away

quick pointers so you know what to look for as you read:

a cua is just a screenshot loop. screenshot → llm → tool → screenshot. everything else is plumbing.
the hard part is "where chrome runs." rented browser apis are easy; raw vms are powerful but you build the stack yourself.
you don't need to invent a remote desktop. xvfb + x11vnc + novnc has been doing this since the 90s.
auth = storage_state, not oauth. capture cookies once, replay forever.
bot detection isn't really solvable. proxies + atomic typing + bail on captchas, that's it.
a Computer protocol with ~10 methods is the seam that lets you swap providers without rewriting the agent.
traditional unit tests are theatre for agents. trace everything, benchmark periodically, manually verify on real flows.
scaling = retuning gunicorn, not magic. long timeouts, worker recycling, locked + shielded cleanup, pre-warmed templates.

the screenshot loop, slightly less hand-wavy

claude (and openai's models, and gemini) ship "computer use" tools — basically a function spec that says "i can call click(x, y), type(text), scroll(direction) etc." you wire that up like any other tool call. the whole loop is maybe 15 lines of pseudocode:

python

async def run_task(task: str, browser: Computer):
    history = []
    for turn in range(MAX_TURNS):
        screenshot = await browser.screenshot()
        response = await claude.complete(
            task=task,
            screenshot=screenshot,
            tools=COMPUTER_TOOLS,   # click, type, scroll, ...
            history=history,
        )
        if "<<TASK_COMPLETE>>" in response.text:
            return response
        tool_call = response.tool_use         # e.g. click(x=420, y=180)
        await browser.execute(tool_call)      # applied to chrome via cdp
        history.append((screenshot, tool_call))
    raise RuntimeError(f"no <<TASK_COMPLETE>> after {MAX_TURNS} turns")

quick aside on <<TASK_COMPLETE>> — that's not a magic constant from anthropic. it's just a sentinel string you put in the system prompt: "when you're done with the task, output the literal text <<TASK_COMPLETE>> followed by a one-line summary." the model emits it, your loop greps for it, you break. could be [DONE], doesn't matter — pick something the model is unlikely to output by accident. same trick for the captcha bail-out later: prompt says "if you see a captcha, output <<TASK_COMPLETE>> with reason CAPTCHA."
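the "grep for it" part, spelled out. a small sketch: `parse_completion` and `is_captcha_bail` are hypothetical helpers, and the exact wording after the sentinel ("reason: CAPTCHA" here) is an assumption, since it depends on whatever your system prompt asks for.

```python
SENTINEL = "<<TASK_COMPLETE>>"


def parse_completion(text):
    """return (done, summary); summary is the first line of text after the sentinel."""
    _, found, after = text.partition(SENTINEL)
    if not found:
        return False, ""
    after = after.strip()
    summary = after.splitlines()[0] if after else ""
    return True, summary


def is_captcha_bail(summary):
    """assumes the prompt told the model to mention CAPTCHA in its reason."""
    return "CAPTCHA" in summary.upper()


done, summary = parse_completion("<<TASK_COMPLETE>> reason: CAPTCHA")
print(done, summary, is_captcha_bail(summary))
```

returning the summary alongside the flag lets the caller tell "finished" from "gave up" without a second sentinel.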

that's it. the model is doing all the perception (where's the button? what does the screen say?) and all the planning (what do i click next?). you're just the hands.

you might ask, tautik — what drives the browser? playwright. python library that speaks cdp (chrome devtools protocol — same thing your devtools panel uses when you press f12). when chrome launches with --remote-debugging-port=9222, anyone with the websocket url can connect and drive it — click, type, screenshot, run js, intercept network. so your backend does await playwright.chromium.connect_over_cdp(url) and from then on it's as if the chrome is on your laptop, except the actual browser is on a vm somewhere.

✽ RECALL in the cua loop, what is your code responsible for vs the model — and how does the loop know when to stop?

the model does all the perception (what's on screen) and all the planning (what to click next); your code is just the hands — apply the tool call to chrome over cdp, screenshot, send it back, loop. termination is a sentinel string you define in the system prompt ("output <<TASK_COMPLETE>> when done") that the loop greps for — nothing magic, and the same trick handles the captcha bail-out.

where chrome runs — the actual hard problem

aight so two paths.

option 1: rented browser-as-a-service. vendors like browserbase give you back a cdp_url and a stream_url and call it a day. their problem to figure out where chrome runs and how to keep it alive. easy day one. cost: they decide which features get exposed (recordings? proxies? auth storage?) and how their pricing scales. you get whatever they ship.

option 2: raw linux vm + install chrome yourself. services like e2b give you firecracker-as-a-service — AsyncSandbox.create() returns a root vm with commands.run(), files.read/write, port exposure. no browser, no display, no vnc, no recording. you trade up on control, you trade down on convenience.

most teams should start with option 1. you'll know when it's time to move down a layer — usually when you keep hitting the vendor's ceiling on something (recording, proxies, custom flags, snapshot timing). don't migrate prematurely; you're signing up for a stack you now own end to end.

✽ RECALL rented browser-as-a-service vs raw vm with chrome — when is it time to move down a layer, and what do you sign up for?

start rented; migrate only when you keep hitting the vendor's ceiling on something you need — recording, proxies, custom chrome flags, snapshot timing. the raw vm hands you a root box and nothing else: no browser, no display, no vnc, no recording. you trade up on control and sign up to own the entire stack end to end, so don't do it prematurely.

the remote display stack, demystified

if you go option 2, here's what you actually need on the vm. linux servers don't have monitors, but chrome (when not running headless) wants to draw to a screen. so:

xvfb is a fake screen — gives chrome an x11 display in ram
xfce4 is a lightweight desktop environment that puts window borders and a taskbar around chrome's window
x11vnc watches the fake screen and rebroadcasts it as vnc (a 90s screen-sharing protocol)
novnc translates the vnc stream into websockets so a normal browser tab can render it without any plugin (i wrote a deeper novnc walkthrough here if you want to actually understand what's happening over the wire)

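so the chain looks like: chrome → xvfb (fake display) → x11vnc (vnc server) → novnc (websocket bridge) → a browser tab on the viewer's side.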

none of this is novel — vnc has been around forever and people have been remote-debugging headless linux for two decades. the only thing you have to do is orchestrate the startup order (xvfb → xfce → x11vnc → novnc → chrome) and wait for cdp to respond on port 9222 before the agent connects.

apt install xvfb xfce4 x11vnc novnc chromium-browser and you're ~80% there. there are even pre-built ubuntu templates floating around with all of this preinstalled, which saves a lot of fiddling.
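the "wait for cdp to respond on port 9222" step can be a plain polling loop. a sketch, assuming chrome's devtools http endpoint (`/json/version`) is reachable from wherever this runs; `cdp_ready`, `wait_until` and the timeout values are hypothetical.

```python
import http.client
import json
import time
import urllib.request


def cdp_ready(host="127.0.0.1", port=9222, timeout_s=1.0):
    """true once chrome's devtools endpoint answers with a websocket url."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return "webSocketDebuggerUrl" in json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        return False  # not listening yet, or not speaking json yet


def wait_until(probe, timeout_s=30.0, interval_s=0.5, clock=time.monotonic, sleep=time.sleep):
    """poll `probe` until it returns true or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# usage on the vm, after launching chrome with --remote-debugging-port=9222:
#   if not wait_until(cdp_ready):
#       raise RuntimeError("chrome never came up")
```

keeping the probe separate from the loop means the same `wait_until` can gate the earlier stages of the startup order as well.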

✽ RECALL chrome on a headless linux vm needs a screen to draw on, and you need to watch it from a browser tab. what's the chain, and why is none of it novel?

xvfb fakes a display in ram, xfce4 puts a window manager around chrome, x11vnc rebroadcasts the fake screen as vnc, and novnc translates vnc into websockets a browser tab can render. all of it is decades-old remote-desktop tech — your only real work is the startup ordering (xvfb → xfce → x11vnc → novnc → chrome) and waiting for cdp to answer before the agent connects.

the `Computer` protocol — the small trick that buys you everything

this is the smartest 30 lines of code you'll write. before introducing it, your agent loop will have provider-specific calls baked in — vendor.start_browser(), vendor.click() — and swapping providers later means a rewrite.

so define a typing.Protocol called Computer with the methods the loop actually uses — screenshot, click, type, scroll, keypress, drag, wait, move, double_click, get_environment, get_dimensions. the loop calls those. anything that implements those methods is a valid Computer.

python

from typing import Protocol

class Computer(Protocol):
    async def screenshot(self) -> bytes: ...
    async def click(self, x: int, y: int) -> None: ...
    async def type(self, text: str) -> None: ...
    # ... ~8 more

VendorAClient is a Computer. RawVMChromeWrapper is a Computer. tomorrow if you want a third provider, that's another Computer. the loop doesn't know or care. and you can mock a Computer with 30 lines for a unit test.

this is just the strategy pattern with python's structural typing. nothing fancy. but it's the difference between "ship a new provider, swap a factory" and "rewrite the agent."
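to show the structural-typing part, here's a trimmed protocol with a test double that never inherits from it. `FakeComputer` and `demo_flow` are made-up names, and the protocol is cut down to three methods to keep the sketch short.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Computer(Protocol):
    async def screenshot(self) -> bytes: ...
    async def click(self, x: int, y: int) -> None: ...
    async def type(self, text: str) -> None: ...


class FakeComputer:
    """no base class: it counts as a Computer because the methods line up."""

    def __init__(self):
        self.calls = []

    async def screenshot(self) -> bytes:
        self.calls.append(("screenshot",))
        return b"fake-png-bytes"

    async def click(self, x: int, y: int) -> None:
        self.calls.append(("click", x, y))

    async def type(self, text: str) -> None:
        self.calls.append(("type", text))


async def demo_flow(computer: Computer) -> None:
    """stand-in for the agent loop: it only ever touches the protocol methods."""
    await computer.screenshot()
    await computer.click(420, 180)
    await computer.type("hello")


fake = FakeComputer()
asyncio.run(demo_flow(fake))
print(isinstance(fake, Computer), fake.calls)
```

`runtime_checkable` only verifies that the method names exist, not their signatures, so a static type checker is still the real enforcement.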

✽ RECALL what does the Computer protocol buy you, and why structural typing instead of an inheritance hierarchy?

the loop only ever calls ~10 methods — screenshot, click, type, scroll, and friends — so anything implementing them is a valid Computer: a vendor client, a raw-vm wrapper, a 30-line mock for tests. swapping providers becomes a factory change instead of an agent rewrite. structural typing means providers never need to know about your base class; the protocol is just the seam where they plug in.

the stuff you actually have to write

ok so the vm is up, chrome is running, the loop is talking to it. what's still left?

auth

big one. when a user says "log into my snowflake instance and pull the audit logs" — how does the agent log in?

you don't do oauth. trying to script through "sign in with google → enter password → 2fa" with an llm is a nightmare and most enterprise saas blocks it as suspicious activity anyway.

what you do instead is capture once, replay forever. once, during auth setup, a real user logs in interactively while watching a vnc stream of your vm. when they confirm "yes i'm logged in", you call playwright's context.storage_state() which returns a json blob of cookies + localstorage (sessionstorage is not part of it; if an app keeps its login there, you have to save and restore that yourself). you store that json (encrypted, scoped to the org). every future task that needs that platform pulls the json and passes it as storage_state when creating a fresh playwright context. the user appears already logged in until the session expires, at which point they redo the capture. no oauth, no passwords stored, no refresh token handling.

if the customer's site uses oauth (sign in with google), the user does the dance once during capture and you keep the resulting session cookies. you never see refresh tokens.

why not use a vendor's "auth as a service"? because vendor auth blobs are opaque, in their proprietary format, and lock you in. storage_state is playwright's native primitive — works on any vm, anywhere playwright runs. when you swap providers, the auth blobs come with you untouched. when a vendor offers a fancy feature wrapping a primitive you already know, take the primitive.
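the capture and replay halves are each a couple of lines. a sketch, assuming `context` and `browser` are playwright's async BrowserContext and Browser objects; the function names and the file path are hypothetical, and the encryption the post calls for is left out on purpose.

```python
import json
from pathlib import Path


async def capture_session(context, path):
    """call once the human confirms they're logged in."""
    state = await context.storage_state()     # dict with "cookies" and "origins"
    Path(path).write_text(json.dumps(state))  # NOTE: encrypt + scope to the org in real use
    return state


async def replay_session(browser, path):
    """start a fresh context that already carries the captured cookies."""
    state = json.loads(Path(path).read_text())
    return await browser.new_context(storage_state=state)
```

since the blob is plain json, it can live in whatever encrypted store you already use; nothing about it is tied to the machine that captured it.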

✽ RECALL how does the agent log into a customer's saas without doing oauth, and why prefer playwright's primitive over a vendor's auth-as-a-service?

capture once, replay forever: a real user logs in interactively over vnc, you call context.storage_state() to snapshot cookies + localstorage as json (encrypted, org-scoped; sessionstorage is not included), and every future task injects it into a fresh context so the agent appears already logged in. no scripted oauth (fragile, flagged as suspicious), no stored passwords, no refresh tokens. and storage_state is playwright-native, so the blobs move with you across providers, while vendor auth blobs are opaque lock-in. when a vendor wraps a primitive you know, take the primitive.

proxies

datacenter ips get flagged by anti-bot waf instantly. cloudflare, akamai, datadome — they all maintain ip reputation lists and your aws/gcp/e2b egress ip is on every one of them. options:

residential proxies (bright data etc) — expensive and noisy but actually work
isp proxies — middle ground, cheaper, decent reputation
curated static rotating list — self-managed pool of clean ips, cheapest, but you maintain it

most teams end up with a tri-state mode: off (default datacenter ip), on (rotate through your pool), forced (always use a specific clean ip for sensitive flows). pre-install tinyproxy or similar in your template and pass --proxy-server=... when chrome launches.
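the tri-state mode maps cleanly onto an enum plus one function that builds chrome's launch flags. a sketch: `ProxyMode`, `ProxyPicker` and the proxy addresses are invented for the example (8888 is tinyproxy's default port), and a real pool would need health checks.

```python
from enum import Enum
from itertools import cycle


class ProxyMode(Enum):
    OFF = "off"        # default datacenter ip
    ON = "on"          # rotate through the pool
    FORCED = "forced"  # always the same clean ip


class ProxyPicker:
    def __init__(self, pool, forced):
        self._rotation = cycle(pool)  # assumes a non-empty pool
        self._forced = forced

    def chrome_args(self, mode):
        """extra command-line flags to pass when launching chrome."""
        if mode is ProxyMode.OFF:
            return []
        proxy = self._forced if mode is ProxyMode.FORCED else next(self._rotation)
        return [f"--proxy-server={proxy}"]


picker = ProxyPicker(
    pool=["http://10.0.0.1:8888", "http://10.0.0.2:8888"],  # hypothetical addresses
    forced="http://10.0.0.9:8888",
)
print(picker.chrome_args(ProxyMode.ON))
```

returning a list of flags rather than mutating a launch config keeps the picker independent of whichever provider ends up starting chrome.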

recording

your customers will ask "show me what the agent did" and you'll need video + network logs. neither comes free.

screen capture: ffmpeg reading the x11 framebuffer → mp4 written to a path on the vm like /tmp/recording.mp4. one ffmpeg process per task, kicked off when the loop starts.

network capture: tcpdump with a bpf filter that excludes port 9222 (so you don't record your own cdp traffic) → pcap on the vm → har conversion → zstd compression.

now the obvious question — those files live on the vm, your backend lives somewhere else, how do they physically end up in s3? you ask the sandbox sdk to read them. e2b (and most vm-as-a-service sdks) expose a files.read(path) that streams bytes from the vm filesystem back to your backend over their control channel. so cleanup looks like: stop ffmpeg → bytes = await sandbox.files.read("/tmp/recording.mp4") → await s3.put_object(Body=bytes) → repeat for the pcap → then kill the sandbox. order matters — once the sandbox is killed, the filesystem is gone.

both uploads must be non-fatal — if ffmpeg crashes mid-task or the file read times out, log a warning, move on, kill the sandbox anyway. observability shouldn't be on the critical path, and one stuck recorder cannot be allowed to leak vms.
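in code, "non-fatal" plus "kill the sandbox anyway" is a broad except around each upload and a finally around the whole thing. a sketch with assumptions: `sandbox` is any object exposing an async `files.read(path)` and an async `kill()`, `upload` is your own async s3 wrapper, and `/tmp/capture.pcap` plus the key layout are made up.

```python
import asyncio
import logging

log = logging.getLogger("cua.cleanup")


async def upload_artifact(sandbox, upload, vm_path, key, timeout_s=60):
    """copy one file off the vm; never raises, returns whether it worked."""
    try:
        data = await asyncio.wait_for(sandbox.files.read(vm_path), timeout_s)
        await upload(key, data)
        return True
    except Exception as exc:  # deliberately broad: recording must not fail the task
        log.warning("skipping %s: %r", vm_path, exc)
        return False


async def cleanup(sandbox, upload, task_id):
    """uploads first, kill last; the kill runs even if an upload blows up."""
    try:
        await upload_artifact(sandbox, upload, "/tmp/recording.mp4", f"{task_id}/recording.mp4")
        await upload_artifact(sandbox, upload, "/tmp/capture.pcap", f"{task_id}/capture.pcap")
    finally:
        await sandbox.kill()  # once this runs, the vm filesystem is gone
```

the finally block is what stops one stuck recorder from leaking a vm: even a cancellation mid-upload still reaches the kill.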

custom vm template

stock templates are fine but cold start sucks (~30s to boot xvfb + xfce + x11vnc + novnc + chrome). the trick is to pre-start the entire stack at build time and snapshot the running state. when a sandbox boots from the snapshot, everything is already running. cold start drops to ~7s.

bonus optimization: pre-warm chrome's profile during build. run a headless chrome at build time loading google.com, sleep 8s so all the sqlite databases get created (cookies, history, web data, preferences), then kill it. saves 15–20s per cold start because chrome doesn't need to bootstrap profile dbs at runtime.

bot detection — the day-1 problem you won't solve

honestly. five layers of "avoid getting flagged" is the best most teams do:

atomic typing into the url bar (ctrl+l, type, enter — three separate tool calls, no combining). real humans don't paste-and-press at machine speed.
residential or isp proxies, never raw datacenter ips
real headful chrome, not puppeteer-style headless (the user-agent string and window props are obvious tells)
browser-realistic http headers on any side fetches your backend makes
captcha bail-out — when the model sees an image-grid challenge, return <<TASK_COMPLETE>> with reason CAPTCHA. don't pretend to solve them.

you're not going to beat cloudflare's bot detection. nobody does, including the bot-detection vendors who claim to. plan around it — surface the captcha bail to the customer, let them decide whether to retry from a residential ip or hand off to a human.

✽ RECALL you're not going to beat cloudflare's bot detection. so what do you actually do?

stack mitigations and plan for failure: residential or isp proxies (datacenter ips are pre-flagged everywhere), real headful chrome (headless tells are obvious), atomic human-paced typing instead of machine-speed pastes, browser-realistic headers on side fetches — and when a captcha appears, bail via the sentinel with reason CAPTCHA and surface it to the customer, who decides whether to retry from a cleaner ip or hand off to a human. don't pretend to solve captchas; nobody does.

how do you even test this

honest answer: not as well as a textbook would tell you to. unit tests for an agent are theatre. mock the screenshot, mock claude's response, mock the click — congrats, you've tested that python can call mock.assert_called. you haven't tested that the agent can actually log in anywhere.

the failure modes that matter are at the screenshot + llm + browser interaction layer, and you can't unit-test those — you need a real browser, a real model, a real website. so skip the theatre.

what actually catches regressions:

smoke-test the streaming endpoint, not the loop. one pytest test that posts a real task to /execute_task, opens the sse stream, and asserts you get a stream_url event followed by a result or error event. that's it. no mocks. it catches "did anything wire up correctly" — which is the only question a unit test could meaningfully answer here anyway.
@traced on every method that's not a fat blob. decorate your tool implementations (click, type, scroll, llm calls, prompt builders) with braintrust's @traced so spans show up automatically. don't trace methods that return base64 screenshots: the span payload balloons and your dashboard becomes useless. skipping a trace is the exception, not the rule.
make tracing non-fatal. if the tracing init fails or the export crashes, prod shouldn't go down. wrap it in graceful degradation — observability is never on the critical path.
trace everything in prod that's left. every screenshot reference (not the bytes), every tool call, every llm response, every cost in dollars, every latency number. pull up a failed run and watch it frame by frame: "ok at turn 14 the model clicked the wrong button because the dropdown hadn't loaded yet." that's the core feedback loop.
fail loud, fail typed. structured error subclasses with context so failures group cleanly in your dashboard.
benchmark periodically. webvoyager is a public dataset of 643 web tasks across 15 sites. it's not perfect, but it's a real number you can quote, and every cua vendor benchmarks against it.
manually verify on real flows. ngl, this is most of how regressions get caught. actually run the agent against the real customer portal.
swap models based on signal, not vibes. when traces say sonnet is winning over opus on your shape of task, ship sonnet. don't run formal traffic-split a/b tests for this — offline batch eval is enough.
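a minimal sketch of what the smoke test from the first bullet would assert, assuming the stream uses standard sse framing (event: and data: lines, blank line between events). parse_sse and assert_stream_shape are hypothetical helper names, and the sample lines stand in for what you'd read off the live /execute_task response.

```python
def parse_sse(lines):
    """Group raw SSE lines into (event, data) tuples; a blank line ends an event."""
    events, name, data = [], "message", []
    for line in lines:
        if line == "":
            if data:
                events.append((name, "\n".join(data)))
            name, data = "message", []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    return events


def assert_stream_shape(events):
    """Fail unless a stream_url event arrives and a result/error event follows it."""
    names = [name for name, _ in events]
    assert "stream_url" in names, f"no stream_url event in {names}"
    after = names[names.index("stream_url") + 1:]
    assert any(n in ("result", "error") for n in after), f"no terminal event in {names}"


# stand-in for lines read off the real response (hypothetical payloads)
sample = [
    "event: stream_url", 'data: {"url": "https://example.invalid/vnc"}', "",
    "event: result", 'data: {"status": "ok"}', "",
]
assert_stream_shape(parse_sse(sample))
```

in the real test the only change is where the lines come from: post the task, iterate the response line by line, feed it to the same two helpers.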

is this rigorous? not by traditional standards. but agents fail in ways traditional tests can't catch — flaky vendor pages, css that changed yesterday, a captcha the model hadn't seen before, a button label that became an icon. the only honest test is running the thing. so focus your infra on running the thing well and watching it carefully.

once you have customers, build a small internal dataset of tasks shaped like real customer flows (anonymized) and run them as a webvoyager-style nightly job with traced scoring. that's the obvious next move past the prototype phase.

✽ RECALL why are unit tests theatre for a cua, and what actually catches regressions?

the failure modes live at the screenshot + llm + real-website layer — mock all three and you've only proven python can call a mock. instead: one real smoke test against the streaming endpoint (post a task, assert stream_url then result/error events), @traced on every tool and llm call (never the screenshot bytes — span bloat), typed structured errors, periodic benchmarks like webvoyager, and — honestly — manually running real customer flows. the only honest test is running the thing and watching failed runs frame by frame.

scaling — what actually breaks at volume

a cua task isn't a 50ms api call. it's a 60-second-to-30-minute browser session holding an sse stream open and a vm warm on the other end. the python web-server defaults are tuned for the opposite shape — short requests, lots of them — so you have to retune.

the things that bite you:

gunicorn timeouts kill the worker mid-task. default timeout is 30s. an enterprise task can run 30 minutes. set timeout = 3600 (one hour) and keepalive = 3600 so the sse connection doesn't get aggressively closed. graceful_timeout = 45 so deploys can drain in flight without taking forever. yes you're abusing the worker model — long-running tasks really belong in a background queue, but honestly for v1 the abused-worker pattern works.
uvicorn workers leak memory. chrome client objects, screenshot bytes, llm response cache, playwright contexts that didn't fully gc — it adds up. set max_requests = 500 with max_requests_jitter = 100 so workers cycle automatically and the leak resets every ~500 tasks. cheap fix, very effective.
set preload_app = True. with multiple workers you don't want each one re-initing your singletons (db pools, llm clients, braintrust logger, otel exporter). preload the app in the master process and let workers fork from it. shared code, separate state — exactly what you want.
worker count = min(2 × cpu_cores, ram_ceiling). cua tasks are io-bound (waiting on llms, waiting on chrome), so 2 workers per core is fine. but each worker holds a chrome connection + a tracing buffer + sse state, so cap at whatever your ram budget allows.
the vm-side concurrency is not your bottleneck. vendor sdks like e2b's pool ~2000 concurrent commands per sandbox. one sandbox running a task can fire screen-recording, network-recording, and chrome-launch commands in parallel via asyncio.gather without breaking a sweat. the bottleneck is your egress proxy pool and your llm rate limits.
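the settings above fit in one gunicorn.conf.py. a sketch, assuming uvicorn's gunicorn worker class; the ram numbers are made up for illustration, so plug in your own.

```python
# gunicorn.conf.py - a sketch; values come from the list above, ram figures are hypothetical
import multiprocessing

RAM_BUDGET_MB = 8192      # assumption: memory you're willing to give the workers
PER_WORKER_MB = 512       # assumption: chrome connection + tracing buffer + sse state

cpu_workers = 2 * multiprocessing.cpu_count()   # io-bound work, so 2 per core
ram_workers = RAM_BUDGET_MB // PER_WORKER_MB    # the ram ceiling
workers = max(1, min(cpu_workers, ram_workers))

worker_class = "uvicorn.workers.UvicornWorker"  # asgi app served through gunicorn
timeout = 3600             # default is 30s; a task can run 30 minutes
keepalive = 3600           # don't let the sse connection get closed early
graceful_timeout = 45      # how long a deploy waits for workers to drain
max_requests = 500         # recycle a worker after ~500 requests to reset leaks
max_requests_jitter = 100  # stagger recycling so workers don't all restart together
preload_app = True         # load the app once in the master, then fork
```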

cleanup is where things get spicy at scale. when a task ends — successfully or via cancellation — you have an ffmpeg process, a tcpdump process, a chrome instance, a playwright context, and a vm sandbox to tear down. in that order. if you kill() the sandbox first, the filesystem is gone and your recordings never upload.

a few rules that will save you:

wrap cleanup in an asyncio.Lock. if a task gets cancelled while cleanup is already running, you don't want two coroutines trying to kill the same sandbox at the same time. one lock, atomic teardown.
asyncio.shield the sandbox kill. the most critical step is "tell e2b to stop billing me." if a parent cancellation interrupts that call, the sandbox keeps running and you're paying for it. shield it.
flush recordings to s3 before killing the sandbox. after kill, the filesystem is destroyed. there is no second chance.
make every cleanup step non-fatal. if ffmpeg refuses to die, log it, move on, kill the sandbox anyway. you cannot afford one stuck recorder to leak vms.
track a _cleanup_completed flag, so a re-entered cleanup short-circuits instead of running twice and crashing on a closed handle.
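the rules above compose into one small class. this is a sketch under assumptions: TaskCleanup is a hypothetical name, and the steps are stand-in coroutines rather than real ffmpeg / s3 / e2b calls.

```python
import asyncio
import logging

log = logging.getLogger("cleanup")


class TaskCleanup:
    """Teardown sketch: ordered non-fatal steps, one lock, shielded sandbox kill."""

    def __init__(self, steps, kill_sandbox):
        self._steps = steps                # [(name, async callable)], run in order
        self._kill_sandbox = kill_sandbox  # async callable that stops the vm (and the bill)
        self._lock = asyncio.Lock()
        self._cleanup_completed = False
        self._kill_task = None             # keep a reference so the task can't be gc'd

    async def run(self):
        async with self._lock:             # a second caller waits here...
            if self._cleanup_completed:    # ...then short-circuits
                return
            try:
                for name, step in self._steps:
                    try:
                        await step()
                    except Exception:
                        log.exception("cleanup step %s failed, moving on", name)
            finally:
                # runs even if run() was cancelled mid-step; shield lets the kill finish
                self._kill_task = asyncio.ensure_future(self._kill_sandbox())
                await asyncio.shield(self._kill_task)
                self._cleanup_completed = True


async def demo():
    order = []

    async def step(name, fail=False):
        order.append(name)
        if fail:
            raise RuntimeError(f"{name} is stuck")

    cleanup = TaskCleanup(
        steps=[
            ("ffmpeg", lambda: step("ffmpeg", fail=True)),   # refuses to die: logged, skipped
            ("tcpdump", lambda: step("tcpdump")),
            ("chrome", lambda: step("chrome")),
            ("playwright", lambda: step("playwright")),
            ("flush_to_s3", lambda: step("flush_to_s3")),    # must land before the kill
        ],
        kill_sandbox=lambda: step("kill_sandbox"),
    )
    await asyncio.gather(cleanup.run(), cleanup.run())       # the second call is a no-op
    return order


order = asyncio.run(demo())
print(order)
```

the kill sits in a finally block on purpose: a stuck or cancelled step still ends with the sandbox dead, which is the one outcome you can't skip.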

if cold start matters — and at any real volume, it does — pre-warm the entire desktop stack at template build time. xvfb, xfce, x11vnc, novnc, chrome all running before the snapshot is taken. boot from snapshot drops cold start from ~30s to ~7s. you're paying once at build time so every task pays nothing at runtime. one of the highest-leverage optimizations in the whole stack.

✽ RECALL what breaks when 30-minute browser tasks hit a web server tuned for 50ms requests, and what's the teardown rule that protects both your bill and your recordings?

the defaults kill you: a 30s gunicorn timeout murders workers mid-task (set it to ~an hour with matching keepalive for the sse stream), leaked chrome clients and screenshot bytes accumulate (recycle workers via max_requests), and per-worker re-init wastes ram (preload the app, fork from master). teardown order: flush recordings to s3 before killing the sandbox — the filesystem dies with it — under an asyncio.Lock so cleanup never runs twice, with the sandbox kill shielded from cancellation so you stop paying, and every step non-fatal so one stuck recorder can't leak vms.

the three things to take away

first, the Computer protocol. ~30 lines of structural typing. the seam that lets you swap providers and run multiple orchestrators on top of the same vm. structural typing > inheritance for plug-in surfaces.

second, storage_state over oauth. you don't do oauth; you capture sessions. portable across providers, opaque-vendor-blob-free, plays nicely with playwright. when a vendor offers a fancy feature wrapping a primitive you already know, take the primitive.

third, fork-vs-depend. for a layer this load-bearing — the layer between your agent and the world — owning it pays off. cost is the bugs you inherit (templates carry their own quirks). benefit is total control over what gets exposed and when. on a primitive like this, control wins.

aight that's the cua story. nothing here is magic. it's a screenshot loop, a remote desktop stack from 1998, a protocol seam, and a lot of patience for bot detection. you're paid to solve a problem, not to ship the fanciest architecture.

noVNC and websockify

2026-05-11T00:00:00Z

aight, let's talk about remote desktops and why the web makes everything complicated (but also kinda cool).

what you'll take away

quick pointers so you know what to look for as you read:

vnc is just a protocol over tcp. server captures the screen, client sends input — and only the changed pixels ship.
browsers can't open raw tcp connections. sandboxed to http and websockets — that one constraint shapes the whole architecture.
websockify is a dumb translator, and that's the point. unwrap websocket frames, forward raw tcp, wrap responses back.
neither end knows the other's world exists. the vnc server sees plain tcp, novnc speaks pure websocket.
build bridges, don't rebuild systems. respecting each layer's constraints beats changing everything.

what even is vnc?

your first question should hit you: what is vnc? well, it's a remote desktop protocol - think of it like http or websocket but for controlling computers remotely. vnc (virtual network computing) lets you control one computer from another over a network. dead simple concept.

so basically you get this magic remote control for any computer. wanna fix your mom's laptop from across the country? vnc. need to check on that long-running script on your server while you're at starbucks? vnc. it's been around since like the 90s and just works.

how does vnc work?

the setup is straightforward. you've got a vnc server running on the remote computer - the one you want to control. this server captures everything: screen updates, what's happening on the desktop, all that visual stuff. then on your local machine, you run a vnc client that displays those screen updates and sends your keyboard/mouse actions back to the server.

here's the important bit: all this communication happens over a tcp connection. remember that - yep fr remember that. the protocol itself is pretty efficient too. instead of sending the entire screen every time, vnc just sends what changed. move a window? only that rectangle gets updated. type some text? just those pixels change.
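to make "only send what changed" concrete, here's a toy diff that finds the bounding rectangle of changed pixels between two frames. it's a sketch of the idea only: real vnc servers track damage more cleverly and compress each rectangle, and dirty_rect is a hypothetical name.

```python
def dirty_rect(prev, curr):
    """Return (x, y, width, height) bounding the pixels that differ, or None."""
    changed = [
        (x, y)
        for y, (old_row, new_row) in enumerate(zip(prev, curr))
        for x, (old, new) in enumerate(zip(old_row, new_row))
        if old != new
    ]
    if not changed:
        return None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


# a 6x4 "screen" of pixel values; only a 2x2 block changes between frames
before = [[0] * 6 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = after[1][3] = after[2][2] = after[2][3] = 255

print(dirty_rect(before, after))    # (2, 1, 2, 2)
print(dirty_rect(before, before))   # None
```

the server ships that one rectangle (position, size, pixels) instead of all 24 pixels. scale the screen up to 1920x1080 and the savings are the whole reason vnc is usable.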

✽ RECALL how does vnc keep remote-desktop traffic small enough to be usable?

it never resends the whole screen — only the regions that changed. move a window and just that rectangle ships; type and only those pixels update. all of it flows over one persistent tcp connection: the server captures and diffs the screen, the client renders updates and sends keyboard/mouse input back.

then what is novnc?

so here's where things get interesting. what if you want to access that remote desktop from a web browser? seems reasonable right? well, browsers have this annoying (but important) limitation: they can't make direct tcp connections. security reasons. browsers can only do http requests and websocket connections.

this is where novnc enters the chat. it's a javascript library that implements a full vnc client entirely within your web browser. uses html5 canvas for display, handles all the vnc protocol stuff in javascript. pretty impressive tbh. but wait - if browsers can't do tcp and vnc servers only speak tcp, how does novnc connect to anything?

spoiler: it can't. not directly anyway.

✽ RECALL novnc implements a complete vnc client in javascript — so why can't it talk to a vnc server on its own?

browsers can't open raw tcp connections — the sandbox limits them to http requests and websockets, for good security reasons. vnc servers only speak tcp. so you've got a fully capable client and a fully capable server with no shared language between them; something in the middle has to translate.

enter websockify: the protocol translator

the solution is websockify - a proxy server that acts as a translator between the web world and traditional networking. think of it as a universal adapter but for protocols.

websockify sits in the middle doing this translation dance:

receives websocket connections from browsers running novnc
extracts the vnc protocol data from those websocket frames
forwards it as raw bytes over a tcp connection to the actual vnc server
when the vnc server responds with screen updates, websockify wraps that data back into websocket frames
sends them back to the browser

the beautiful thing is neither end needs to know about the other's world. the vnc server has no idea websockets even exist - it just sees regular tcp connections. novnc doesn't need to worry about tcp sockets - it just speaks websocket to websockify.
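the unwrap / wrap steps are less magic than they sound. below is a sketch of just that part, following the websocket frame layout from rfc 6455. it is not websockify's actual code: it skips the http upgrade handshake, fragmentation, and control frames, and all three function names are hypothetical.

```python
import os
import struct


def unwrap_frame(frame: bytes) -> bytes:
    """Pull the payload out of one complete websocket frame (browser -> proxy)."""
    length = frame[1] & 0x7F
    masked = bool(frame[1] & 0x80)
    offset = 2
    if length == 126:                       # 16-bit extended length
        (length,) = struct.unpack(">H", frame[2:4])
        offset = 4
    elif length == 127:                     # 64-bit extended length
        (length,) = struct.unpack(">Q", frame[2:10])
        offset = 10
    if not masked:
        return frame[offset:offset + length]
    key = frame[offset:offset + 4]          # browsers always mask what they send
    payload = frame[offset + 4:offset + 4 + length]
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


def wrap_frame(payload: bytes) -> bytes:
    """Wrap bytes from the vnc server in one unmasked binary frame (proxy -> browser)."""
    n = len(payload)
    if n < 126:
        header = bytes([0x82, n])           # 0x82 = FIN bit + binary opcode
    elif n < 65536:
        header = bytes([0x82, 126]) + struct.pack(">H", n)
    else:
        header = bytes([0x82, 127]) + struct.pack(">Q", n)
    return header + payload


def browser_frame(payload: bytes) -> bytes:
    """Demo helper: the masked frame a browser would send (payloads under 126 bytes)."""
    key = os.urandom(4)
    masked = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return bytes([0x82, 0x80 | len(payload)]) + key + masked


hello = b"RFB 003.008\n"                    # the version string that opens a vnc session
assert unwrap_frame(browser_frame(hello)) == hello   # what gets forwarded over tcp
assert unwrap_frame(wrap_frame(hello)) == hello      # what the browser reads back
```

everything else in the proxy is plumbing: read a frame, unwrap, write to the tcp socket; read from the tcp socket, wrap, write to the websocket.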

✽ RECALL what does websockify actually do with each message, and why does neither end need modification?

it unwraps vnc data from incoming websocket frames and forwards it as raw tcp to the vnc server; responses get wrapped back into websocket frames for the browser. the vnc server just sees ordinary tcp connections, novnc just speaks websocket — each side stays in its native world while the proxy quietly translates between them.

putting it all together

so when you set up web-based remote desktop, here's what you're actually running:

vnc server on the target machine (the one you want to control)
websockify as a proxy - often runs on the same server, listening on port 6080
novnc html/javascript files served to users' browsers

when a user connects, their browser loads novnc, which opens a websocket to websockify. websockify then connects to the vnc server via tcp. boom - you're controlling a remote desktop through nothing but a web browser. no plugins, no java applets (remember those? lol), no special software to install.

the whole setup enables some pretty powerful use cases. cloud providers use this for console access to vms. support teams can help users without installing anything. you can access your home lab from any computer with a browser. hell, you could probably run it on your phone if you hate yourself enough.

what i find neat about this architecture is how it respects the constraints of each layer. browsers stay in their sandbox, vnc servers don't need modifications, and websockify just quietly translates between them. it's a reminder that sometimes the best solution isn't changing everything - it's building a good bridge.

nothing revolutionary here, just good engineering solving a real problem. and now you can remote desktop from literally anywhere with a browser. pretty cool for technology that's essentially duct-taping protocols together.

✽ RECALL you need browser access to a system that only speaks a raw tcp protocol — what's the general pattern?

put a protocol-translating proxy in the middle, the websockify way: the browser speaks websocket to the proxy, the proxy speaks tcp to the legacy system. it respects each layer's constraints — the browser stays sandboxed, the server stays unmodified — which usually beats rewriting either end. build a good bridge instead of changing everything.

how dropbox handles uploads, downloads and sync

2026-05-07T00:00:00Z

dropbox is one of those apps that feels totally simple until you start designing it. you upload a file, it shows up on your other devices. share it, someone else sees it. but every step here has a real engineering question buried in it. 50GB uploads can't just be a single POST. sharing 100k files across users can't just be a list scan. and getting bytes from a virginia data center to a tokyo client without making them wait 30 minutes? that's a whole separate problem.

this post walks through how i'd actually build it.

what you'll take away

quick pointers so you know what to look for as you read:

never upload large files through your own servers. presigned URLs let the client upload directly to S3.
chunk + fingerprint = resumable uploads. a hash of each chunk gives you a unique id that works across sessions.
trust but verify. clients report progress, but the server confirms via S3's ListParts before marking done.
sharing is a graph problem, not a list problem. put (userId, fileId) in its own indexed table.
sync is a hybrid. WebSocket push for near-real-time, polling as the safety net.
CDN for downloads, not uploads. downloads benefit from edge caching; uploads don't.
content-defined chunking is the secret of delta sync. fixed boundaries break the moment you insert a byte at the start.

what we're building

a cloud file storage service. user uploads a file, downloads it from any device, shares with others, and sees automatic sync across all their connected devices.

scope:

upload, download, share, automatic sync across devices
support files up to 50GB
low latency, secure, available

out of scope:

file editing, in-app preview, virus scanning, versioning, per-user storage limits
rolling our own blob storage (we'll use S3 / equivalent and call it done)

the cap call

availability > consistency. user uploads a file in germany; user in america seeing the old version for a few seconds is fine. dropbox is not a stock exchange. eventual consistency, no problem.

core entities

just two real entities and a user wrapper:

File — the raw bytes
FileMetadata — id, name, size, mime type, owner, status, s3 link, chunks
User — auxiliary, identified via session token / JWT in headers

the api

four endpoints, one per feature:

POST /files/presigned-url      → returns a presigned URL to upload to
GET  /files/{fileId}            → returns metadata + presigned download URL
POST /files/{fileId}/share      → body: { users: [...] }
GET  /files/changes?since=...   → returns ChangeEvent[] for delta sync

users come from headers, never from request bodies — keep auth out of payloads.

upload — the presigned URL trick

the obvious way to handle uploads is wrong, and it's worth saying why.

if a user POSTs a 50GB file to your server, two bad things happen. first, you've burned bandwidth uploading the file twice — once from client to your server, once from your server to S3. second, your API gateway probably has a 10MB request body limit (looking at you, AWS API Gateway). you can't even get the bytes through the gate.

the fix is presigned URLs. instead of the file flowing through your server, the client requests a permission slip from your server, then uploads directly to S3 using that slip.

three steps:

client POSTs to /files/presigned-url with just the metadata (name, size, mimeType). server creates a row in FileMetadata with status: uploading, generates a presigned URL via the S3 SDK, returns the URL. no file bytes touch your server.
client PUTs the file directly to the presigned URL. S3 stores it.
S3 fires an event notification to your backend on completion. backend flips status: uploaded.

now your server never holds the bytes. the file goes client → S3 directly. cheap, fast, no payload limits.

download — the same trick, in reverse

same idea. client requests metadata from your server, server returns a CDN-signed URL pointing to the file. client fetches from the CDN; CDN serves from cache or pulls from S3 on first miss.

GET /files/{fileId} → { metadata, downloadUrl }

the downloadUrl is a signed CDN URL with a short expiration (5 minutes is typical). without that signature, anyone with the link could download. with it, the CDN verifies the signature and serves only authorized requests.
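here's the shape of the signing idea in plain python. a sketch, not any CDN's real scheme: cloudfront and friends define their own signature formats, and SECRET, sign_url, and verify_url are hypothetical names for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"   # hypothetical; a real key lives in a secret manager


def sign_url(base_url: str, ttl_seconds: int = 300, now=None) -> str:
    """Append an expiry and an HMAC over (path, expiry) to the URL."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    path = urlparse(base_url).path
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(url: str, now=None) -> bool:
    """What the edge would check: the signature matches and the link hasn't expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    message = f"{parsed.path}:{expires}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    return hmac.compare_digest(sig, expected) and current < expires


url = sign_url("https://cdn.example.com/files/uuid-123", ttl_seconds=300, now=1_000)
print(verify_url(url, now=1_200))   # True: inside the 5-minute window
print(verify_url(url, now=1_400))   # False: expired
print(verify_url(url.replace("uuid-123", "uuid-999"), now=1_200))   # False: path tampered
```

the expiry is inside the signed message, so a client can't extend the link by editing the query string.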

✽ RECALL why should neither uploads nor downloads of file bytes flow through your api servers — and why does the download path add a CDN while the upload path doesn't?

proxying a 50GB upload burns the bandwidth twice (client → your server, your server → S3) and your api gateway's request body limit blocks the bytes anyway. so the server issues a presigned URL, the client PUTs directly to S3, and an S3 event notification flips the metadata to uploaded. downloads reverse the trick: the server returns a short-TTL signed CDN url, because hot files benefit from edge caching. uploads skip the CDN because fresh unique bytes have nothing to cache.

large files — chunk, fingerprint, resume

50GB on a 100Mbps connection takes ~1 hour. asking the user to start over if their wifi drops 45 minutes in is cruel. and honestly, most browsers and servers won't even accept a single request that large.

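the answer is chunking on the client. break the file into 5–10MB pieces. upload each chunk separately. track progress by counting completed chunks. resume by skipping chunks already uploaded. (one caveat if the chunks map to S3 multipart parts: S3 caps an upload at 10,000 parts, so a 50GB file leaves no slack at 5MB. lean toward the 10MB end.)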

but how do you know which chunks have been uploaded? you can't go by file name — two users can upload files with identical names. you need a content-derived id. that's a fingerprint — a SHA-256 (or similar) hash of the bytes.

so the metadata grows:

{
  "id": "uuid-123",
  "fingerprint": "sha256:ab12...",
  "name": "movie.mp4",
  "status": "uploading",
  "chunks": [
    { "id": "chunk-1-fp", "status": "uploaded", "etag": "..." },
    { "id": "chunk-2-fp", "status": "uploading" },
    { "id": "chunk-3-fp", "status": "not-started" }
  ]
}

when the client resumes, it computes the file fingerprint, asks the server "have we seen this?", gets back the chunks array, and uploads only the missing ones.
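the resume logic is small enough to sketch. assumptions: chunk ids are the sha-256 of the chunk bytes, the server returns the chunks array in the shape shown above, and the function names are hypothetical. a real client would stream from disk instead of holding the file in memory.

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024   # 5MB, the low end of the range above


def fingerprint(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield (fingerprint, bytes) for each fixed-size piece of the file."""
    for start in range(0, len(data), size):
        piece = data[start:start + size]
        yield fingerprint(piece), piece


def chunks_to_upload(data: bytes, server_chunks: list, size: int = CHUNK_SIZE):
    """Compare local chunks to the server's chunk list; return only what's missing."""
    done = {c["id"] for c in server_chunks if c["status"] == "uploaded"}
    return [(fp, piece) for fp, piece in split_chunks(data, size) if fp not in done]


# tiny chunk size so the demo is readable
file_bytes = b"a" * 10 + b"b" * 10 + b"c" * 4
local = list(split_chunks(file_bytes, size=10))
server_view = [
    {"id": local[0][0], "status": "uploaded"},
    {"id": local[1][0], "status": "uploading"},   # died mid-flight: send it again
]
todo = chunks_to_upload(file_bytes, server_view, size=10)
print([piece for _, piece in todo])   # [b'bbbbbbbbbb', b'cccc']
```

anything not confirmed as uploaded gets resent, including chunks stuck in "uploading". resending a chunk is cheap; trusting a half-written one is not.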

✽ RECALL your 50GB upload dies 45 minutes in. what makes resume possible, and why can't file names do the job?

the client chunks the file into 5–10MB pieces and ids each one by a fingerprint — a hash of the chunk's bytes. on resume it recomputes the file fingerprint, asks the server "have we seen this?", gets back the chunk statuses, and uploads only the missing ones. names can't work — two users can upload identically-named files. content-derived ids are unique and survive across sessions, which is also what makes chunking solve request-size limits, resumability, and parallelism in one shot.

trust but verify

how does the server know a chunk was actually uploaded? naive answer: client sends a PATCH after each chunk completes. problem: a malicious client can lie. they could mark all chunks uploaded without uploading any, leaving you with metadata that says "complete" pointing at empty S3 objects.

the fix is trust but verify. when the client claims chunk N is uploaded with ETag X, your backend calls S3's ListParts API to confirm. only after S3 vouches do you flip the chunk to uploaded. once all chunks check out, call CompleteMultipartUpload to assemble them into a single S3 object.

S3's multipart upload API basically packages this whole flow. you can use it directly, but knowing what's underneath makes the trade-offs visible.
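the verify step, sketched. list_parts is injected so the example runs without aws: in production it would be boto3's s3_client.list_parts, which takes Bucket, Key, UploadId and returns a Parts list with PartNumber and ETag. verify_part is a hypothetical name, and pagination for uploads past 1,000 parts is left out.

```python
def verify_part(list_parts, bucket, key, upload_id, part_number, claimed_etag):
    """Return True only if S3 itself reports this part with the ETag the client claimed."""
    response = list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
    for part in response.get("Parts", []):
        if part["PartNumber"] == part_number:
            return part["ETag"] == claimed_etag
    return False   # S3 has never seen this part, whatever the client says


def fake_list_parts(**kwargs):
    # stand-in for boto3's s3_client.list_parts: same response shape, canned data
    return {"Parts": [{"PartNumber": 1, "ETag": '"abc"', "Size": 5242880}]}


print(verify_part(fake_list_parts, "bucket", "movie.mp4", "upload-1", 1, '"abc"'))   # True
print(verify_part(fake_list_parts, "bucket", "movie.mp4", "upload-1", 2, '"def"'))   # False
```

only a True here flips the chunk's status to uploaded in your metadata.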

✽ RECALL the client PATCHes "chunk 3 uploaded, here's the ETag". why don't you just believe it, and what do you do instead?

a malicious client can mark every chunk uploaded without sending a byte, leaving metadata that says "complete" pointing at empty S3 objects. trust but verify: on each claim the backend calls S3's ListParts to confirm the chunk actually landed, and only then flips its status. once every chunk checks out, CompleteMultipartUpload assembles them into one object. it's the clean general pattern for client-reported state on a hostile network.

next, sharing. the obvious approach is to add a sharelist: [user1, user2] field to file metadata. it works for "is this user allowed?" but it falls apart on the inverse query: "show me everything shared with me." you'd have to scan every file's sharelist looking for the user. terrible.

put it in its own table:

SharedFiles
| userId (PK) | fileId (SK) |
| user1       | fileId1     |
| user1       | fileId2     |
| user2       | fileId3     |

now both directions hit an index instead of scanning. "show me everything shared with me" is a single index lookup on userId. "is user X allowed?" is a single point read on (userId, fileId). cheap.
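the same table in sqlite, as a sketch. the composite primary key does the work; the snake_case names are just this example's convention.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE shared_files (
        user_id TEXT NOT NULL,
        file_id TEXT NOT NULL,
        PRIMARY KEY (user_id, file_id)
    )
    """
)
db.executemany(
    "INSERT INTO shared_files VALUES (?, ?)",
    [("user1", "fileId1"), ("user1", "fileId2"), ("user2", "fileId3")],
)


def shared_with(user_id):
    """Everything shared with this user: a lookup on the key's leading column."""
    rows = db.execute(
        "SELECT file_id FROM shared_files WHERE user_id = ? ORDER BY file_id", (user_id,)
    )
    return [file_id for (file_id,) in rows]


def is_allowed(user_id, file_id):
    """Point read on the full (user_id, file_id) key."""
    row = db.execute(
        "SELECT 1 FROM shared_files WHERE user_id = ? AND file_id = ?", (user_id, file_id)
    ).fetchone()
    return row is not None


print(shared_with("user1"))             # ['fileId1', 'fileId2']
print(is_allowed("user2", "fileId3"))   # True
print(is_allowed("user2", "fileId1"))   # False
```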

✽ RECALL why does a sharelist: [user1, user2] field on file metadata fall apart, and what replaces it?

it answers "is this user allowed?" just fine, but the inverse query — "show me everything shared with me" — forces a scan of every file's sharelist. put the relationship in its own SharedFiles table keyed (userId, fileId): now both directions are a single index lookup. sharing is a graph problem, not a list problem — model the edge, not an attribute.

sync — push + poll hybrid

every connected device needs to know when files change. two options:

polling — client asks "anything new?" every N seconds. simple, can lag, wastes calls when nothing changed.
WebSocket / SSE — server pushes change events to the client in real-time. fast, but connections drop and you can miss messages.

dropbox uses both. WebSocket as the primary path, polling as the safety net.

the client opens a single WebSocket per device (not per file). the server pushes change notifications for any file the user has access to. sockets drop and messages get lost, so the client also calls GET /files/changes?since={timestamp} every few minutes. anything missed by the push gets caught by the poll.

on the local side, each OS gives you a file watcher (FSEvents on macOS, FileSystemWatcher on windows). when something changes locally, the client agent uploads it to remote. last-write-wins for conflicts.
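a sketch of the client side of the hybrid. the event fields (file_id, updated_at, name) and the SyncClient class are assumptions for illustration. the detail worth noticing: only the poll advances the since cursor. if a push could move it, a lost push followed by a later one would skip the gap forever.

```python
class SyncClient:
    """Applies change events from either path; duplicates and stale events are dropped."""

    def __init__(self):
        self.files = {}        # file_id -> newest event seen
        self.poll_cursor = 0   # feeds ?since= on the next poll; only polls move it

    def apply(self, event):
        current = self.files.get(event["file_id"])
        if current is None or event["updated_at"] > current["updated_at"]:
            self.files[event["file_id"]] = event   # last write wins

    def on_push(self, event):
        self.apply(event)      # low latency, but events can go missing

    def on_poll(self, events):
        for event in events:
            self.apply(event)
            self.poll_cursor = max(self.poll_cursor, event["updated_at"])


client = SyncClient()
client.on_push({"file_id": "a", "updated_at": 10, "name": "a-v1"})
# the push for file b at t=20 is lost while the socket is down
client.on_push({"file_id": "a", "updated_at": 30, "name": "a-v2"})
# the poll replays everything since the cursor (0): duplicates are harmless, b is recovered
client.on_poll([
    {"file_id": "a", "updated_at": 10, "name": "a-v1"},
    {"file_id": "b", "updated_at": 20, "name": "b-v1"},
    {"file_id": "a", "updated_at": 30, "name": "a-v2"},
])
print(sorted(client.files))         # ['a', 'b']
print(client.files["a"]["name"])    # a-v2
print(client.poll_cursor)           # 30
```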

✽ RECALL why does dropbox run polling alongside websockets instead of trusting the push path alone?

sockets drop and messages get lost. the client keeps one websocket per device (not per file) for near-real-time pushes of any file it can access, and also calls GET /files/changes?since=... every few minutes as a safety net — anything the push missed, the poll catches. push for latency, poll for correctness; conflicts resolve last-write-wins.

delta sync — only ship the chunks that changed

once chunking is in place, sync gets a free win: when a file changes, we only need to upload (or download) the chunks that actually changed, not the whole file.

but there's a subtlety. if you chunk by fixed sizes (every 5MB), inserting a single byte at the start of the file shifts all chunk boundaries. now every chunk's fingerprint is different. delta sync becomes useless.

the fix is content-defined chunking (CDC) — chunk boundaries are determined by the file's content using a rolling hash (Rabin fingerprinting). a byte inserted near the start only affects the chunks immediately around it; the rest stay identical. this is how real systems achieve actual delta sync efficiency.
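a toy version shows the effect. one caveat up front: this uses a simple polynomial rolling hash over a 16-byte window as a stand-in, not true rabin fingerprinting, and it has no min/max chunk size like a production chunker would. function names and parameters are hypothetical.

```python
import hashlib
import random


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Cut wherever the hash of the last `window` bytes has its low bits all zero."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)       # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + byte) % mod                  # add the newest byte
        if i >= window - 1 and (h & mask) == 0:      # boundary depends on content only
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64):
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(7)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = b"!" + original                   # one byte inserted at the start

new_fixed = fingerprints(fixed_chunks(edited)) - fingerprints(fixed_chunks(original))
new_cdc = fingerprints(cdc_chunks(edited)) - fingerprints(cdc_chunks(original))
print(len(new_fixed), "fixed-size chunks to re-upload")
print(len(new_cdc), "content-defined chunks to re-upload")
```

with fixed sizes every chunk after the insert is new. with content-defined boundaries only the chunk or two touching the edit changes, because every later boundary is decided by bytes that didn't move relative to each other.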

✽ RECALL you insert one byte at the start of a synced file. why does fixed-size chunking force a full re-upload, and how does content-defined chunking avoid it?

with fixed-size chunks every boundary shifts by one byte, so every chunk's fingerprint changes and delta sync degenerates into shipping the whole file. CDC picks boundaries from the content itself via a rolling hash (rabin fingerprinting), so an edit only changes the chunks immediately around it — everything else keeps its fingerprint and never moves. that's what makes dropbox feel fast on edits.

the final wiring

what each component does:

uploader — client agent. watches local folder via OS file events, chunks files, computes fingerprints, calls the upload API.
downloader — client. polls or receives push notifications, fetches presigned URLs, downloads from CDN.
API gateway — auth, rate limit, route.
file service — generates presigned URLs (purely local, signs with AWS credentials), reads/writes metadata. never touches file bytes.
file metadata DB — DynamoDB or Postgres, doesn't matter much. holds FileMetadata and the SharedFiles table.
S3 — holds the actual file bytes.
CDN — caches files at edge locations for fast downloads, serves via signed URLs.

security in two minutes

TLS everywhere — HTTPS for all traffic, end of story.
encryption at rest — S3 encrypts files transparently with managed keys.
signed URLs with short TTLs — even if a download URL leaks, it expires in 5 minutes. for higher security, bind URLs to specific IPs or require auth cookies.
compress before encrypting — encryption introduces randomness, killing compression ratios. always compress first.

what i'd remember

presigned URLs are the entire point of file upload design at scale. never proxy bytes through your server.
chunking solves three problems at once — request size limits, resumability, parallelism.
fingerprints are content-derived ids. use them everywhere — for files, for chunks, for dedup.
trust but verify is the clean way to handle client-reported state on a hostile network.
sync is a hybrid problem. push primary, poll fallback.
CDC > fixed chunking for delta sync. the rolling hash trick is what makes dropbox feel fast on edits.

nothing here is novel. S3 multipart upload, presigned URLs, signed CDN URLs, file watchers — it's all off-the-shelf. the design is in the wiring. you're paid to solve a problem, not to ship the fanciest architecture.

id generators - from `Date.now()` to snowflake

2026-05-03T00:00:00Z

aight so today we're talking about id generators. seems boring on paper, but stay with me — this is one of those topics where every line of complexity comes from a real production problem someone hit at scale. and the cool part is the entire problem statement collapses to: write a function that spits out something unique every time it is invoked. that's it. no service, no microservice, no fancy box.

actually that's the first thing you should internalize. you don't need a microservice for this. everyone wants to draw a box and call it id-svc but please don't do that until something forces you to. think of yourself as the only engineer who's gonna build, deploy, and on-call this thing. suddenly your appetite for new boxes drops to zero. good.

what you'll take away

quick pointers so you know what to look for as you read:

function first, service later. don't reach for a microservice until something forces you to.
whatever causes a collision becomes a differentiator. two machines collide → machine id. two threads collide → counter. mechanical pattern.
timestamp on the left = rough sortability for free. snowflake, mongo objectid, instagram — all do this.
distribution + strict monotonicity + high throughput → pick two. spanner buys all three, but only with atomic-clock hardware.
uuids are an index-bloat tax — except when they aren't. bad for mysql reads, perfect for cockroach writes. depends on the db.
logical shards on physical hosts = cheap hot-shard rebalancing. instagram's trick. same idea elasticsearch uses.
relax the right constraint for your usecase. twitter dropped strict monotonicity, cockroach dropped sortability, flickr dropped "no central service". each got something back.

but why do we even need our own ids?

your first question should hit you - if databases give us auto-increment ids, why are we even having this conversation?

so imagine a sharded database. you've got 4 mysql shards each holding a slice of your posts table. if every shard runs its own auto-increment, you're gonna get id 1, 2, 3 in shard A and id 1, 2, 3 in shard B. when you read a post by id 2, which one do you mean? lol. collision.

so the moment you shard, you cannot rely on the database to give you unique ids. you have to provide the id yourself at insert time. that's the whole motivation. there are other places too (event sourcing, distributed logs, whatever) but sharded dbs is the canonical one.

ok so what do we want? a function get_id() that returns something unique every single time. let's build it.

time as id - dead simple

what's unique through time? well, time itself. it always moves forward. so the world's simplest id generator:

func get_id() {
    return get_epoch_ms()
}

return current epoch milliseconds. done. ship it.

this works for so many use cases that it's actually criminal how often people skip past it.

your solution should be functional with respect to your constraints. nobody gets to call your design stupid because they don't know your constraints.

aight back to get_epoch_ms(). what's the catch?

multi-machine collision

what if two machines invoke this function in the same millisecond? same id. collision. damn.

fix: prepend the machine id.

func get_id() {
    return concat(machine_id, get_epoch_ms())
}

so machine m1 at time 1729 returns m1-1729. machine m2 at the same instant returns m2-1729. no collision. each machine knows its own id at boot.

threads happen

ok now what if my program has threads, and two threads on the same machine invoke get_id() in the same millisecond? collision again. you've got two ways out:

add thread_id as another differentiator
add a static counter that gets atomically incremented every call

the counter approach is cleaner imo. you've got one int, every call does an atomic counter++, and you append it to the id:

int counter = 0
func get_id() {
    return concat(machine_id, get_epoch_ms(), counter++)
}

now even within the same ms on the same machine, two calls get different ids because the counter moved.

but wait. think tautik think. if i already have a counter that always moves forward, what is the timestamp even doing here?

nothing. it's redundant. so just rip it out:

int counter = 0
func get_id() {
    return concat(machine_id, counter++)
}

this generates m1-0, m1-1, m1-2, ... each id is leaner (no timestamp bytes), still unique, still atomically safe. and on a 64-bit counter you'd have to invoke this at insane rates to ever wrap around.
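a minimal python sketch of the counter version, assuming a plain lock as the stand-in for the atomic increment. CounterIdGenerator and hammer are made-up names for illustration, not from any library.

```python
import itertools
import threading


class CounterIdGenerator:
    """Issues ids shaped like '<machine_id>-<counter>'; safe to call from many threads."""

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self._counter = itertools.count()  # yields 0, 1, 2, ...
        self._lock = threading.Lock()      # stands in for an atomic counter++

    def get_id(self):
        with self._lock:
            n = next(self._counter)
        return f"{self.machine_id}-{n}"


def hammer(gen, thread_count, calls_per_thread):
    """Call get_id() from several threads at once and return every id issued."""
    results = [[] for _ in range(thread_count)]

    def worker(slot):
        for _ in range(calls_per_thread):
            results[slot].append(gen.get_id())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for chunk in results for i in chunk]
```

calling hammer(CounterIdGenerator("m1"), 8, 1000) should hand back 8000 ids with no repeats, because every increment happens under the lock.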

✽ RECALL two machines collide in the same millisecond, then two threads on the same machine collide. what's the fix in each case, and why does the timestamp become dead weight once you add a counter?

whatever causes the collision becomes a differentiator — machines collide → prepend machine id, threads collide → append an atomically-incremented counter. and once you have a counter that only ever moves forward, the timestamp contributes nothing to uniqueness anymore, so concat(machine_id, counter) is leaner and just as safe. a 64-bit counter won't wrap at any sane rate.

but counters are volatile

so the problemmmm. the counter is in memory. if the process crashes or restarts, it goes back to 0. and now you start regenerating ids you've already issued. catastrophic.

fix: persist it. easy.

now most people's brain immediately goes "let me put it in a database." why?? bro you just need durability. that's the whole property you need. a local file gives you durability. why are you inducing a network call, a sql parser, a query planner, an execution plan, a commit protocol... just fwrite() to a file. open file, write counter, flush, close. or keep it open and keep flushing. that's all you need.

int counter = load_from_disk()
func get_id() {
    counter++
    save_to_disk(counter)        // disk i/o on every call
    return concat(machine_id, counter)
}

this is the biggest disease in tech tbh. people see "durability" and reach for postgres. you are paid to solve a problem and not necessarily write code to solve it. if a 4-byte file does the job, write a 4-byte file.

but disk i/o on every call?

yeah that's slow. millions of get_id() calls = millions of fsyncs. not gonna fly at any reasonable throughput. so we batch.

instead of flushing every increment, flush every 1000. and here's where it gets a little subtle:

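int counter = load_from_disk() + 1000     // flush_frequency
save_to_disk(counter)                     // persist the new floor before issuing anything,
                                          // otherwise a second crash before the next flush restarts here again
func get_id() {
    counter++
    if (counter % 1000 == 0) {
        save_to_disk(counter)
    }
    return concat(machine_id, counter)
}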

look at line 1. on startup, you don't load and start from whatever_was_on_disk. you load and add flush_frequency. why?

because between the last successful flush and the crash, the counter moved forward by some amount you don't know. could be 1, could be 999. so to be safe you assume it moved by the full flush_frequency. add that, and your starting value is guaranteed to be a value never issued before.

people sometimes try to do this time-based ("flush every 5 seconds") and then on recovery do "historical analysis" of how many ids were typically generated in 5 seconds. don't. that estimate will be wrong eventually and you'll regenerate an id. go frequency-based. opposite of time-based. with frequency, the math is exact.
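here's a rough python sketch of the frequency-based version, assuming a plain text file as the store. DurableCounter, simulate_crashes, and the file layout are illustrative choices, not a reference implementation. one detail worth calling out: it writes the bumped starting value back to disk at startup, so a second crash before the first flush still restarts past everything already issued.

```python
import os
import tempfile

FLUSH_FREQUENCY = 1000


class DurableCounter:
    def __init__(self, path, machine_id):
        self.path = path
        self.machine_id = machine_id
        # skip a full batch past whatever the previous run managed to flush
        self.counter = self._load() + FLUSH_FREQUENCY
        # persist that floor right away, before any id is handed out
        self._save(self.counter)

    def _load(self):
        try:
            with open(self.path) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def _save(self, value):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # rename over the old file in one step

    def get_id(self):
        self.counter += 1
        if self.counter % FLUSH_FREQUENCY == 0:
            self._save(self.counter)
        return f"{self.machine_id}-{self.counter}"


def simulate_crashes():
    """Three process lifetimes sharing one file; each 'crash' just drops the object."""
    path = os.path.join(tempfile.mkdtemp(), "counter")
    issued = []
    for calls in (1500, 10, 10):
        gen = DurableCounter(path, "m1")
        issued.extend(gen.get_id() for _ in range(calls))
    return issued
```

the price is a gap of up to one flush_frequency worth of ids per restart, which is the same "let them leak" tradeoff that shows up in any batched scheme.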

so now we've got: lean ids, durable counter, batched i/o, no service. for a single machine this is genuinely solid. but the moment we want monotonically increasing ids, the rules of the game change.

✽ RECALL your counter flushes to disk every 1000 increments and the process crashes between flushes. why do you restart at last_flushed + flush_frequency instead of last_flushed, and why is time-based flushing a trap?

between the last flush and the crash the counter moved forward by some unknown amount up to 999, so you assume the worst — add the full flush frequency and your restart value is guaranteed never issued before. time-based flushing ("every 5 seconds") forces you to estimate how many ids fit in the window on recovery, and that estimate will eventually be wrong and you'll reissue an id. frequency-based math is exact. and note: a local file gave you all of this — durability never required a database.

monotonic ids - why bother?

monotonically increasing means id_{i+1} > id_i for every i. always. no dips. no out-of-order.

why do we want this? conflict resolution. who came first.

imagine elasticsearch with multi-version document updates. transaction T1 sets a=20, transaction T2 sets a=30. they arrive at your search index out of order because you're consuming in parallel. how do you know T1 came before T2? well... if their ids are monotonically increasing, you just compare. id_T1 < id_T2, so T1 was earlier, and when applying out-of-order updates you keep the latest version and discard the older one. last-write-wins.

without monotonicity? you have no way to order them. chicken and egg.

strict vs rough monotonicity

strict monotonicity in a distributed system is brutally hard (we'll see why in a sec). but rough monotonicity is easy. just put the timestamp on the left-hand side of your id.

id = concat(get_epoch_ms(), machine_id, counter)

LHS = most significant bits. since the timestamp moves forward every millisecond, the most significant bits dominate the ordering. ids generated later are almost always numerically larger than ids generated earlier. within a single ms there might be reshuffling between machines (machine_id breaks ties on the right) but at the macro level, monotonic.

this is the pattern. once you internalize it, you'll see it everywhere. snowflake? timestamp on LHS. mongodb objectid? timestamp on LHS. instagram? timestamp on LHS. universal.

✽ RECALL why does putting the timestamp in the most significant bits of an id buy you rough — but only rough — monotonicity, and what does that sortability unlock?

the high bits dominate numeric ordering and the timestamp only moves forward, so ids generated later are almost always numerically larger — snowflake, mongo objectid, and instagram all lean on exactly this. it's only rough because within a single ms the machine-id bits reshuffle the tail, and clock skew across machines causes dips. the payoff is cursor pagination: WHERE id > :last_seen LIMIT n costs the same on page 2 and page 200,000, while OFFSET walks and discards every skipped row.

the clock skew bossfight

so why doesn't rough monotonicity become strict at scale? clock skew.

picture 4 machines. machine ids 2, 4, 7, 9. they're all running NTP but their clocks have drifted slightly:

m2 thinks it's time=23
m4 thinks it's time=24 (running fast)
m7 thinks it's time=23
m9 thinks it's time=23

at "the same instant" we invoke get_id() on all four. results:

m2 → 232
m4 → 244
m7 → 237
m9 → 239

not monotonically increasing! 244 came out before 237 in wall-clock order. dip.
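the same four calls as a tiny python sketch, assuming the toy layout used above (clock reading on the left, a one-digit machine id on the right). make_id is a made-up helper.

```python
def make_id(local_time, machine_id):
    # timestamp in the high digits, machine id in the lowest digit
    return local_time * 10 + machine_id


# (machine_id, what that machine's clock reads), listed in true wall-clock call order
calls = [(2, 23), (4, 24), (7, 23), (9, 23)]

ids = [make_id(clock, machine) for machine, clock in calls]

# every adjacent pair where a later call produced a smaller id
dips = [(earlier, later) for earlier, later in zip(ids, ids[1:]) if later < earlier]

print(ids)   # [232, 244, 237, 239]
print(dips)  # [(244, 237)]
```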

and there's nothing you can really do about clock skew with software alone. NTP doesn't make clocks identical, just close-ish. the only company i know of that genuinely fixed this is google with truetime in spanner — and they did it by literally putting atomic clocks and gps receivers in their datacenters. specialized hardware. who has that luxury? nobody else really.

the central service trap

so you go: "ok let me have ONE machine generate ids. no clock skew, no problem."

cool. now you have a single point of failure. machine dies → no ids → entire service down. so you add a second id server behind a load balancer for redundancy.

now your two id servers need to gossip to agree on ranges. server A picks 102, gotta tell server B "don't pick 102". server B picks 105, gotta tell A. on every single id. and reservations have to be consistent. this is essentially a distributed consensus problem and your throughput just collapsed.

so we end up with this fundamental result, which is honestly kinda beautiful in how clean it is:

there is no way to distribute id generation, guarantee strict monotonicity, AND have high throughput — without specialized hardware.

pick any two. you cannot have all three. real-world systems almost universally drop monotonicity (settle for rough/sortable) so they can keep distribution and throughput. that's what twitter, instagram, discord, sony etc all do.

✽ RECALL you want distributed generation, strict monotonicity, and high throughput all at once. why can't software alone get you all three, and which one do real systems drop?

clock skew kills distributed strict ordering — NTP keeps clocks close-ish, never identical, so a machine running fast stamps "later" ids that come out earlier in wall-clock order. centralizing on one machine fixes skew but creates a SPOF, and adding a second id server means gossiping a consistent reservation for every single id — a consensus round that collapses throughput. only specialized hardware escapes the triangle (spanner's truetime, with atomic clocks and gps in the datacenter). everyone else — twitter, instagram, discord — drops strict monotonicity and settles for rough.

case study 1: amazon's batched ticket service

amazon does the central-service thing but they make it cheap. one mysql table:

service     counter
-------     -------
orders        500
payments        0

when an order-service pod boots up, it calls get_id_batch(500). the id service does an atomic update on the row, bumps counter from 500 to 1000, and returns the range [501, 1000] to the caller.

now that pod owns 500 ids locally. it doesn't talk to the id service again until it's used 80% of them. then it asks for the next batch.
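a rough python sketch of that flow, assuming an in-memory dict in place of the mysql row and a lock in place of the atomic UPDATE. TicketService, BatchedIdClient, and the refill_at parameter are hypothetical names, and the client itself is not thread-safe.

```python
import threading


class TicketService:
    """In-memory stand-in for the one-row-per-service counter table."""

    def __init__(self, initial=None):
        self._counters = dict(initial or {})
        self._lock = threading.Lock()  # plays the role of the atomic row update

    def get_id_batch(self, service, size):
        with self._lock:
            start = self._counters.get(service, 0)
            self._counters[service] = start + size
        return start + 1, start + size  # inclusive range owned by the caller


class BatchedIdClient:
    def __init__(self, tickets, service, batch_size=500, refill_at=0.8):
        self.tickets = tickets
        self.service = service
        self.batch_size = batch_size
        self.refill_after = int(batch_size * refill_at)  # ids used before prefetching
        self.first, self.last = tickets.get_id_batch(service, batch_size)
        self.next_id = self.first
        self.spare = None  # the prefetched next range, if any

    def get_id(self):
        if self.next_id > self.last:  # current range exhausted, switch to the spare
            self.first, self.last = self.spare
            self.next_id, self.spare = self.first, None
        issued = self.next_id
        self.next_id += 1
        used = self.next_id - self.first
        if self.spare is None and used >= self.refill_after:
            self.spare = self.tickets.get_id_batch(self.service, self.batch_size)
        return issued
```

with the table seeded at orders=500, the first get_id_batch("orders", 500) returns (501, 1000), matching the example above. prefetching at 80% means the pod never blocks on the central service in the hot path.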

what if the pod crashes after using 10 of its 500 ids? you lose the remaining 490. who cares. counter is a 64-bit unsigned int. let's do napkin math: 100 servers each burning 500 ids/sec = 50,000 ids/sec ≈ 4.32 billion ids/day. 2^64 / 4.32 billion = ~4.3 billion days. and even if 50% of every batch is wasted from restarts, you've still got billions of days. the calculator overflows before your id space does.
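the napkin math, spelled out in python so you can plug in your own numbers. the inputs are the ones from the paragraph above.

```python
servers = 100
ids_per_server_per_sec = 500
seconds_per_day = 86_400

ids_per_day = servers * ids_per_server_per_sec * seconds_per_day
days_until_wrap = 2**64 / ids_per_day

# pessimistic case: half of every batch is lost to restarts
days_if_half_wasted = days_until_wrap / 2

print(f"{ids_per_day:,} ids/day")                      # 4,320,000,000 ids/day
print(f"{days_until_wrap:.3g} days until wrap")        # about 4.27e+09
print(f"{days_if_half_wasted:.3g} days at 50% waste")  # about 2.14e+09
```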

don't be miserly about wasting ids. trying to deterministically reuse "leaked" ranges adds enormous complexity for zero practical benefit. let them die. ship faster.

✽ RECALL in amazon's batched ticket scheme, a pod grabs a batch of ids and crashes after using a handful. why is "let them die" the right call instead of reclaiming the leaked range?

the counter is a 64-bit int — even with half of every batch wasted to restarts, napkin math says the id space outlives you by billions of days. batching already bought the real win: a pod touches the central service once per batch instead of once per id, so generation stays local and fast. deterministic reuse of leaked ranges adds genuine complexity for zero practical benefit.

case study 2: why not just UUIDs?

uuids are 128-bit values. the common v4 flavor is random. they're easy. why aren't we using them?

index bloat.

think about how an index is stored in mysql (innodb). for a secondary index on column name, the index leaf is essentially (name, primary_key) → row. so if the indexed value takes 4 bytes and your primary key is a 4-byte int, an index entry is 4 + 4 = 8 bytes. if your primary key is a 16-byte uuid, an index entry is 4 + 16 = 20 bytes. 2.5x larger.

multiply that across every index on every table. now your indexes are 2.5x bigger. and here's the killer: a database is fast only when its indexes fit in RAM. if your indexes spill to disk, every lookup becomes a disk seek before you even get to fetching the row. you've turned a memory operation into a disk operation. your query latency just exploded.

so at scale, uuids are a tax you pay on every read. not worth it. (also they're not sortable, which kills cursor-based pagination — more on that in a sec.)

the cockroachdb counter-example

hold on though — cockroachdb's docs literally recommend using uuids as primary keys. why?

because cockroachdb is range-partitioned. data is sorted by primary key and split into ranges across nodes. if your pk is monotonically increasing, every insert hits the node that owns the latest range. that one node becomes a hotshard, the other N-1 nodes do nothing. you've defeated the entire point of a distributed database.

with random uuids, inserts are spread evenly across all nodes. throughput scales horizontally.
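a toy python simulation of the effect, under loud simplifications: the key space is cut into 4 fixed equal ranges (a real range-partitioned db splits dynamically by size), and seeded 128-bit random integers stand in for v4 uuids so the run is repeatable. node_for and write_distribution are made-up helpers.

```python
import random
from bisect import bisect_right
from collections import Counter

KEY_BITS = 128
NODES = 4
RANGE_SIZE = 2**KEY_BITS // NODES
# upper boundaries between the contiguous key ranges, one range per node
BOUNDARIES = [RANGE_SIZE * i for i in range(1, NODES)]


def node_for(key):
    """Index of the node whose key range contains this key."""
    return bisect_right(BOUNDARIES, key)


def write_distribution(keys):
    """How many inserts each node receives."""
    return Counter(node_for(k) for k in keys)


rng = random.Random(42)
sequential_keys = range(1_000_000, 1_010_000)                   # monotonically increasing pk
random_keys = [rng.getrandbits(KEY_BITS) for _ in range(10_000)]  # uuid-like pk

sequential_load = write_distribution(sequential_keys)
random_load = write_distribution(random_keys)
```

sequential_load ends up with a single entry (one node takes all 10,000 inserts), while random_load has all four nodes at roughly 2,500 each.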

so the same property (randomness) that made uuids bad for sharded mysql makes them good for cockroachdb. nothing is best. everything depends on your usecase.

✽ RECALL why are random uuids a tax on every read in sharded mysql, yet the recommended primary key in cockroachdb?

in mysql every secondary index entry carries the primary key, so a fat uuid pk bloats every index on every table — and the moment indexes spill out of RAM, each lookup eats a disk seek before you even fetch the row. cockroach is range-partitioned by pk: a monotonically increasing key funnels every insert to the node owning the latest range — one hotshard, N-1 idle nodes — while random uuids spread writes across the cluster. same property, randomness, opposite verdict. nothing is best; everything depends on the usecase.

case study 3: flickr's database ticket server

flickr did something a lot of people roll their eyes at, but it's actually elegant. they spun up a dedicated mysql server whose only job was to run auto-increment.

sql

CREATE TABLE tickets (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id)    -- mysql requires the auto-increment column to be a key
);

every time someone wants an id, they go to that mysql, do an insert, get back the auto-incremented id, and use it as the pk for their actual sharded data store. it's effectively the central-id-service we discussed, just implemented with mysql's existing primitives instead of writing one from scratch.

was this the most elegant thing ever? no. did it work for flickr at their scale? yes. that's the bar.

case study 4: twitter snowflake

ok now my favorite. twitter went the other direction — no service at all.

snowflake is a 64-bit integer: one unused sign bit, then 63 bits split into 3 chunks:

41 bits: epoch milliseconds (since some custom epoch like 2010)
10 bits: machine id (so up to 1024 machines)
12 bits: per-machine counter

generation is literally 3 lines of bit manipulation. there is no service. every api server that wants to write a tweet just runs get_id() locally as a function call. machine_id is assigned at boot via a tiny coordination service (think: a table with 1024 rows, claim one, release on shutdown). that's the only network hop, and it happens once per pod lifetime.
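a hedged python sketch of that layout. this is not twitter's code: the epoch constant (2010-01-01 UTC) is an arbitrary pick, and the "wait for the next millisecond when the counter wraps" and "refuse to run if the clock goes backwards" behaviors are assumptions about how you'd want it to fail.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_262_304_000_000  # 2010-01-01 00:00:00 UTC, illustrative only
MACHINE_BITS = 10
COUNTER_BITS = 12
MAX_COUNTER = (1 << COUNTER_BITS) - 1


def now_ms():
    return time.time_ns() // 1_000_000


class Snowflake:
    def __init__(self, machine_id):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.last_ms = -1
        self.counter = 0
        self.lock = threading.Lock()

    def get_id(self):
        with self.lock:
            now = now_ms()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.counter = (self.counter + 1) & MAX_COUNTER
                if self.counter == 0:  # 4096 ids used this ms, spin until the next one
                    while now <= self.last_ms:
                        now = now_ms()
            else:
                self.counter = 0
            self.last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (MACHINE_BITS + COUNTER_BITS))
                | (self.machine_id << COUNTER_BITS)
                | self.counter
            )
```

ids from a single generator come out strictly increasing. across machines they're only roughly ordered, for the clock skew reasons covered earlier.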

and because the timestamp is on the LHS, snowflake ids are roughly sortable. which gives you something kinda magical for pagination.

the pagination payoff

most apps paginate with LIMIT n OFFSET k. OFFSET k is brutal — the database walks the first k rows just to skip them. as you go deeper into the result set, the query gets linearly slower. classic graph: deeper → slower.

but if your ids are roughly sortable, you can paginate by id instead:

sql

SELECT * FROM tweets WHERE id > :last_seen_id ORDER BY id LIMIT 5

every page is O(log n + k): one index seek to the cursor, then k rows read, where k is the page size. constant with depth. doesn't matter if you're on page 2 or page 200,000, the query takes the same time.
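a small runnable python sketch of cursor pagination using sqlite's in-memory db. the tweets table, its columns, and the helper names are hypothetical. the query carries an explicit ORDER BY id so each page comes back in cursor order.

```python
import sqlite3


def fetch_page(conn, last_seen_id, page_size):
    """One page: everything after the cursor, in id order."""
    return conn.execute(
        "SELECT id, body FROM tweets "
        "WHERE id > :last_seen_id ORDER BY id LIMIT :page_size",
        {"last_seen_id": last_seen_id, "page_size": page_size},
    ).fetchall()


def all_pages(conn, page_size):
    """Walk the whole table page by page, carrying the last id as the cursor."""
    last_seen, pages = 0, []
    while True:
        page = fetch_page(conn, last_seen, page_size)
        if not page:
            break
        pages.append(page)
        last_seen = page[-1][0]  # id of the last row becomes the next cursor
    return pages


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(i, f"tweet {i}") for i in range(1, 24)],
)
pages = all_pages(conn, 5)
```

23 rows with a page size of 5 come back as pages of 5, 5, 5, 5, 3, and no page ever re-reads rows from an earlier one.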

this is why twitter's api uses since_id instead of offset. it's why dynamo uses LastEvaluatedKey. it's why mongodb pagination patterns push you toward _id > :cursor. cursor-based pagination is the natural friend of sortable ids and they unlock each other.

case study 5: instagram's snowflake variant

instagram took twitter's idea and made one really clever change. their layout:

41 bits: epoch milliseconds (since jan 1 2011)
13 bits: db shard id
10 bits: per-shard sequence number

notice — they replaced "machine id" with "db shard id". the id encodes where the row lives.

and the genius is what they do with logical shards on physical servers. instagram has thousands of logical shards (each one a postgres schema, basically CREATE SCHEMA insta_0, CREATE SCHEMA insta_1, ...) but only ~10-15 physical postgres servers. each physical server holds many logical shards.

why? because handling a hot shard becomes trivial. when one physical server starts getting hot, you spin up a new physical box and just pg_dump one of the logical shards from the hot box and pg_restore it on the new box. dump-and-load moves the whole logical shard as one self-contained unit, which is way faster than iterating row-by-row to figure out who-goes-where like you'd have to do if your physical and logical shards were 1:1.

this is the same trick elasticsearch uses with shards. the underlying pattern: have many small logical units inside fewer large physical containers, so you can move logical units around when load shifts. extremely clean.

and the id generation lives inside the database — it's a stored function set as the DEFAULT for the id column. when you insert a post and don't specify an id, postgres calls the function, which packs (timestamp | shard_id | counter) and that becomes the row's pk. zero service, zero network calls, all happening inside the insert transaction.
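the real thing is a postgres function, but the bit packing is easy to see in a python sketch. this follows the 41/13/10 layout described above. the epoch constant is 2011-01-01 UTC in milliseconds, and make_id / decode_id are made-up names.

```python
EPOCH_MS = 1_293_840_000_000  # 2011-01-01 00:00:00 UTC
SHARD_BITS = 13
SEQ_BITS = 10


def make_id(now_ms, shard_id, seq):
    """Pack (timestamp | shard_id | sequence) into one 64-bit integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id must fit in 13 bits")
    return (
        ((now_ms - EPOCH_MS) << (SHARD_BITS + SEQ_BITS))
        | (shard_id << SEQ_BITS)
        | (seq % (1 << SEQ_BITS))  # sequence wraps every 1024 values
    )


def decode_id(id_):
    """Recover (timestamp in ms, shard id, sequence) from an id."""
    seq = id_ & ((1 << SEQ_BITS) - 1)
    shard_id = (id_ >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    now_ms = (id_ >> (SHARD_BITS + SEQ_BITS)) + EPOCH_MS
    return now_ms, shard_id, seq
```

decode_id is the practical payoff: given any post id, you can read the logical shard straight out of the middle bits and route the query without a lookup table.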

✽ RECALL instagram runs thousands of logical postgres shards on only ~10-15 physical servers. what does that indirection buy them when one box runs hot?

rebalancing becomes trivial: pg_dump one logical shard off the hot box and pg_restore it on a fresh one. dump-and-load moves the whole logical shard as one self-contained unit, no row-by-row resharding to figure out who-goes-where. if logical and physical shards were 1:1 you'd have none of that freedom. it's the same trick elasticsearch uses: many small logical units inside fewer large physical containers, so you can move the units around when load shifts.

closing thoughts

the thing i keep coming back to is — id generation seems trivial until you actually try to ship it at scale. then every constraint you add (durability, monotonicity, distribution, throughput) tightens the screws and forces tradeoffs. the entire history of this space is people relaxing the right constraint for their specific use case:

twitter relaxed strict monotonicity → got rough sortability + zero-service generation
instagram relaxed strict monotonicity AND embedded shard info → got hot-shard rebalancing for free
cockroachdb relaxed sortability entirely (uuids) → got distributed write throughput
flickr relaxed "no central service" → got auto-increment for free without writing code
amazon relaxed "no wasted ids" → got cheap, simple, robust batching

the courses portal i mentioned earlier? we relaxed everything except uniqueness. 6 random characters. ships every release. nobody complains.

so the framework is: figure out your non-optional constraints, your optional constraints, the order you'd relax them, and then design the id generator. don't pattern-match to "snowflake!" because somebody used it at twitter. you might not need any of what twitter needed.

and please. for the love of distributed systems. don't make a microservice unless you have to. a function in your app code, a stored procedure in your db, a 4-byte file on local disk — these are all valid id generators. the box-on-an-architecture-diagram is not the goal. the function returning something unique every time it's invoked is the goal.

nothing is best. everything depends on the usecase.

✽ RECALL before you pattern-match to "snowflake!", what's the actual framework for designing an id generator?

list your non-optional constraints, your optional ones, and the order you'd relax them — then design. twitter relaxed strict monotonicity and got zero-service sortable ids; instagram also embedded shard info and got hot-shard rebalancing; cockroach relaxed sortability and got distributed write throughput; flickr relaxed "no central service" and got auto-increment for free; amazon relaxed "no wasted ids" and got cheap robust batching. and don't make a microservice unless something forces you to — a function in app code, a stored procedure, or a 4-byte file on disk are all valid id generators.

designing instagram's hashtag page

2026-04-20T00:00:00Z

a system that looks deceptively simple on the surface but hides a bunch of interesting engineering decisions underneath. you know the page — when you tap #sunset on instagram and see the name, total posts, and a grid of top photos. let's walk through how to actually build it.

what you'll take away

quick pointers so you know what to look for as you read:

a guiding principle saves you from the cap theorem. "best user experience" wins ties so you stop fighting consistency vs availability vs latency.
store just enough, not everything. the gray area between "ID only" and "full metadata" is where the system gets cheap and simple.
pagination isn't always an optimization. sometimes it's just disk I/O dressed in a useful-looking flag.
kafka isn't infra you bolt on, it's the natural answer when many services care about one event. post service shouldn't know who's listening.
two-buffer swap beats stop-the-world. the critical section becomes a pointer flip — three CPU instructions.

the requirement

imagine you're an early engineer at instagram and someone walks up to you and says — "hey, we need to build this page." for every hashtag, all you have to show is:

the name of the hashtag
total number of posts (approximate is fine)
top 100 photos tagged with it

the top 100 photos? that's computed by a data science team. you don't care about their logic — exponential decay, reaction counts in the last hour, whatever. they hand you a list of 100 post IDs. your job is to render the page. sounds simple right? well, the moment you dig in, you realize there's so much more.

design principle: best user experience

before we start, here's a discipline that most engineers skip — define your guiding principle. for this system, it's best user experience. every single design decision should optimize for UX. whenever you're stuck between option A and option B, you pick the one that makes the user's experience better.

this is honestly one of the best habits you can build. without a guiding principle, you end up in the classic trap where someone says "i want strong consistency, high availability, AND partition tolerance" and you're basically asking for what the CAP theorem says you can't get once the network partitions. you can't have it all. but what you're okay giving up on should depend on what you're optimizing for.

✽ RECALL you're stuck between two designs and someone wants strong consistency, high availability, AND low latency all at once. how does a guiding principle get you unstuck?

you can't have it all — that's the cap theorem trap. declaring a principle up front ("best user experience" here) gives you a tie-breaker: every contested decision gets resolved by asking which option serves the principle, so you know in advance what you're willing to give up instead of re-fighting the same tradeoff on every choice.

the read path

so the user taps #sunset. a GET /tags/{tag_name} request hits your hashtag API servers. and because we're optimizing for UX, this one API call should return everything — the name, the count, and the top 100 photos. no fan-out from the frontend making three separate calls. one request, one response, done.

the hashtag db is a partitioned key-value store — could be mongodb, dynamodb, whatever. but why partitioned? because our access pattern is dead simple — given a hashtag name, fetch its document. that's a key-value lookup. the hashtag name (like sunset) is the key, the document with count and top photos is the value. no range queries, no joins, no fancy relational stuff. just get(key) → value.

and because there are millions of hashtags, one machine can't hold all of them. you need to spread the data across multiple nodes. the hashtag name is a natural partition key — it distributes well (hashtags are diverse enough), and all the data you need for one hashtag lives in a single document on a single node. no cross-partition queries needed. so the three requirements from storage are: partitioned, key-value access, and durable. mongodb, dynamodb, couchbase — any of these work. pick based on team expertise and operational comfort.

the document looks something like this:

json

{
  "tag": "sunset",
  "total_posts": 1200000,
  "top_100": [ ... ]
}

now the interesting question — what goes inside top_100?

storing top 100: the storage tradeoff

you have three options here and this is where things get spicy. let's do the math for each so we make an informed decision instead of going with gut feel.

option 1: store only post IDs. so top_100 is just [p1, p2, p3, ..., p100]. let's break down the document size:

tag name:       ~12 bytes (avg hashtag length)
total_posts:    ~32 bytes (number serialized as string + key overhead)
top_100 array:  100 × 8 bytes (snowflake IDs are 64-bit = 8 bytes each)
                = 800 bytes
─────────────────────────────────────
total:          ~850 bytes ≈ 1 KB

cheap on storage. but now when a request comes in, you read this document, get the 100 IDs, and then you have to do a batch read from the posts database to fetch the actual post details — image URLs, captions, whatever you need to render the grid. that's a second lookup. more latency. bad for UX.

option 2: store entire post metadata. so you denormalize everything — caption, image URL, user details, tags, the works. per post:

post ID:        8 bytes   (snowflake)
caption:        ~160 bytes (avg caption with hashtags baked in,
                           think twitter-length text)
image URL:      ~160 bytes (CDN URLs can be long)
user metadata:  ~160 bytes (username, profile pic URL, etc.)
─────────────────────────────────────
per post:       ~488 bytes, but captions contain hashtags
                which eat space, so realistically ~560-660 bytes
                let's call it ~1 KB conservatively

top_100 array:  100 × ~1 KB = ~100 KB per document

that's ~100 KB read from disk on every lookup. and here's the nastier problem — if you store likes count here, every time a like happens, you have to update it in the posts db AND here. you just introduced a consistency nightmare and extra plumbing to keep things in sync.

option 3: the middle ground. this is where you put on the product manager hat. you ask yourself — what's the bare minimum we need to render the grid? just the photo. that's it. the user sees a grid of images. they tap one, they go to the full post page. no caption, no likes, no username on the grid view. so all you store is the post ID and the image URL:

post ID:        8 bytes   (snowflake)
image URL:      ~160 bytes
─────────────────────────────────────
per post:       ~168 bytes

top_100 array:  100 × 168 = ~16,800 bytes ≈ 16 KB per document

from 100 KB down to ~16 KB. peanuts. no extra consistency headaches. no syncing likes. no additional lookups. dead simple.
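the three options side by side as python arithmetic, using the per-field estimates from above. those are rough guesses from this post, not measurements.

```python
POSTS = 100
ID_BYTES = 8           # 64-bit snowflake id
IMAGE_URL_BYTES = 160  # rough cdn url length
FULL_POST_BYTES = 1024 # "call it ~1 KB" per fully denormalized post

option_1_ids_only = POSTS * ID_BYTES
option_2_full_metadata = POSTS * FULL_POST_BYTES
option_3_id_plus_url = POSTS * (ID_BYTES + IMAGE_URL_BYTES)

# how much smaller the middle ground is than full denormalization
shrink_factor = option_2_full_metadata / option_3_id_plus_url

print(option_1_ids_only)       # 800
print(option_2_full_metadata)  # 102400
print(option_3_id_plus_url)    # 16800
```

option 3 costs about 16 KB more than ids-only per document, and in exchange the read path needs exactly one lookup.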

and here's the key insight — you as an engineering leader can propose product changes that simplify the system. most people think this is not possible. it absolutely is. if your idea has merit and optimizes the system, your PMs will get it. it's not product versus engineering — it's product AND engineering towards the same goal. the answer usually lies in the gray area.

✽ RECALL for the top-100 grid, why is storing only post IDs bad, why is storing full post metadata also bad, and what's the middle ground?

IDs alone force a second batch read against the posts db on every page view — extra latency, worse ux. full metadata balloons the document to ~100 KB per read and creates a consistency nightmare (likes now live in two places that must stay in sync). the middle: store exactly what the grid renders — post id + image url, ~16 KB, one lookup, zero sync plumbing. and noticing that the grid only needs the photo is a product change an engineer can and should propose.

why pagination is overkill here

now someone will inevitably say "shouldn't we paginate the top 100?" and at first glance it sounds reasonable. but think about how the data is stored. you have one JSON document with an array of 100 items. if you paginate with page=1&size=10, what happens? the database still reads the entire document off disk, then discards 90 entries and returns 10. that's a waste of disk I/O. you're not saving anything.

pagination makes sense when you have separate rows — like SELECT * FROM posts LIMIT 10 OFFSET 20. here, you have one document containing an array. pagination is just unnecessary overhead.

so we send all 100 post metadata (ID + image URL) to the frontend in one shot. but wait — what if the user never scrolls past the first 18 images (roughly 3 folds on a phone)? then we just wasted bandwidth loading 82 images for nothing.

lazy loading to the rescue

the trick is to offload this responsibility to the frontend. send all 100 metadata entries, but let the frontend do lazy loading of images. only fetch the actual image file when it enters the viewport. if the user never scrolls, those 82 images are never downloaded.

this is a good example of how it's not backend vs frontend — it's backend AND frontend together solving the problem. rather than over-engineering pagination on the backend, you push the intelligence to where it belongs. efficient bandwidth usage, no database overhead, great UX.

✽ RECALL why is paginating the top-100 array a fake optimization, and how do you avoid wasting bandwidth without it?

the top 100 live in one json document — the database reads the whole thing off disk regardless, then throws away 90 entries; pagination only saves work when items are separate rows. so send all 100 (id + image url) in one response and let the frontend lazy-load: an image downloads only when it enters the viewport, so a user who never scrolls past the first fold never fetches the rest.

the write path: counting hashtag posts

now we've sorted out reads. let's talk writes. every time a post is published, we need to increment the total_posts count for each hashtag in that post's caption. but before we jump to the architecture, let's build up to it.

here's what already exists in your infra. you have a post service (svc is just shorthand for service — you'll see it everywhere in system design diagrams) that handles post creation. users upload a photo, write a caption, hit publish, and the post gets stored in the posts db. that's your existing system. now you're bolting the hashtag service on top of it.

so the question is — when a post gets published, how does the hashtag service know about it? the naive thought is "just call the hashtag service directly from the post service." but think about who else cares when a post is published:

the feed service needs to add it to followers' feeds
the notification service needs to notify followers
the search service needs to index it in elasticsearch
the hashtag service needs to update counts

if the post service directly calls each of these, you've tightly coupled everything. post service now needs to know about every downstream consumer. add a new consumer? modify the post service. one downstream is slow? post service slows down. one downstream is down? post service either fails or needs retry logic for each. that's a mess.

this is the classic case for event-driven architecture with kafka as the glue. the post service doesn't care who's listening. it just publishes a POST_PUBLISH event to a kafka topic and moves on. whoever is interested — feed, notifications, search, hashtag — independently consumes from that same topic. decoupled. extensible. if tomorrow you want to add a "trending" service, just add another consumer. zero changes to the post service.

so kafka is not something we added for fun — it naturally falls out of the requirement that multiple services need to react to a post being published. the hashtag worker (wrkr = worker, another shorthand) is just one of many consumers on that topic.

now the hashtag worker receives the event, gets the post ID, fetches the caption from the post service (or the caption is included in the kafka event itself), and extracts hashtags from it using a simple regex. the naive implementation looks like this:

python

def process(m):
    tags = extract_hashtags(m.caption)
    for tag in tags:
        db.incr(tag, 1)

simple. one db call per tag. but let's estimate the scale. if instagram sees 100K posts per minute, and each post has ~8 hashtags on average, that's 800K database updates per minute just for this one use case. one db call per tag is not gonna cut it.
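
the extract_hashtags helper above is never spelled out beyond "a simple regex". a minimal sketch of what it could look like, assuming a hashtag is a # followed by word characters and that a tag repeated inside one caption should count once (both are assumptions, not something the original pins down):

```python
import re

# assumption: a hashtag is '#' followed by letters, digits or underscores
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(caption: str) -> set[str]:
    """Return the distinct, lowercased hashtags found in a caption."""
    return {tag.lower() for tag in HASHTAG_RE.findall(caption)}

print(sorted(extract_hashtags("golden hour #Sunset #beach #sunset")))
# ['beach', 'sunset']
```

returning a set means #sunset typed twice in one caption still bumps total_posts by one, which is probably what a "posts per hashtag" counter wants.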

✽ RECALL when a post is published, why shouldn't the post service just call the hashtag service directly?

because hashtag isn't the only listener — feed, notifications, and search all care about the same event. direct calls couple the post service to every downstream: adding a consumer means modifying it, a slow consumer slows publishing, a dead one forces per-consumer retry logic. instead it publishes one POST_PUBLISH event to a kafka topic and moves on; consumers subscribe independently, and adding a "trending" service tomorrow touches nothing upstream. kafka falls out of the requirement — it isn't bolted on for fun.

approach 2: in-memory batching

so we buffer. instead of writing every increment to the db immediately, we accumulate counts in an in-memory map and flush periodically.

python

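from collections import Counter

counts = Counter()   # in-memory buffer: tag -> pending increment

def process(m):
    tags = extract_hashtags(m.caption)
    for tag in tags:
        counts[tag] += 1

    if has_been_long():   # > 5 minutes or count > 1000
        for tag, count in counts.items():
            db.incr(tag, count)   # one call per tag, carrying the whole batch
        counts.clear()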

now instead of db.incr(tag, 1) eight hundred thousand times, you do db.incr(tag, count) where count could be hundreds or thousands. you've slashed the db calls by a massive factor.

but should you trigger the flush on time or on count? time-based (every 5 minutes) has a risk: what if a surge of posts comes in and your map fills up so much that you run out of memory and the process crashes? count-based (every 1000 messages) is safer in that regard, because the buffer can never hold more than 1000 messages' worth of tags. pick based on your appetite for risk.
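
you don't have to pick just one trigger. a rough sketch that flushes on whichever limit is hit first, assuming a flush_fn callback that receives a {tag: count} dict (BatchingCounter, flush_fn and the default limits are illustrative names and numbers, not from the original design):

```python
import time
from collections import Counter

class BatchingCounter:
    """Buffers increments and flushes on a message count or an age limit."""

    def __init__(self, flush_fn, max_messages=1000, max_age_s=300.0):
        self.flush_fn = flush_fn          # hypothetical sink, called with {tag: count}
        self.max_messages = max_messages
        self.max_age_s = max_age_s
        self.counts = Counter()
        self.messages = 0
        self.last_flush = time.monotonic()

    def add(self, tags):
        self.counts.update(tags)
        self.messages += 1
        too_many = self.messages >= self.max_messages
        too_old = time.monotonic() - self.last_flush >= self.max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.counts:
            self.flush_fn(dict(self.counts))
        self.counts.clear()
        self.messages = 0
        self.last_flush = time.monotonic()

flushed = []
batcher = BatchingCounter(flushed.append, max_messages=3)
for tags in (["sunset", "beach"], ["sunset"], ["food"], ["sunset"]):
    batcher.add(tags)
print(flushed)               # one batch, written after the third message
print(dict(batcher.counts))  # the fourth message is still buffered
```

the count limit caps memory during a surge, the age limit keeps counts from going stale when traffic is quiet.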

one small but crucial thing — write the buffer to disk instead of purely in-memory. if the worker crashes, in-memory data is gone. an embedded db like rocksdb or leveldb that supports incr operations gives you durability without much complexity.

✽ RECALL your hashtag counter is doing 800k db increments a minute. how do you slash that, and what are the two risks in the fix?

buffer counts in a map and flush periodically — one db.incr(tag, count) instead of count separate calls. risk one: a purely time-based flush can let the map grow unbounded during a surge and oom the process, so a count-based trigger is safer. risk two: in-memory data dies with the worker, so back the buffer with an embedded store like rocksdb that supports incr — durability without much complexity.

the stop-the-world problem

here's where it gets really interesting. when you flush the buffer, you're iterating through the map and making sequential db calls. during this time, you've stopped consuming from kafka. if the flush takes 5 seconds (say 1000 tags × 5ms per call), that's 5 seconds of zero consumption. bad.

so you think — let me do this in a separate thread. but now the map is shared between two threads: the consumer thread (writing to the map) and the flusher thread (reading from it). if you try to iterate and modify the map simultaneously, you get concurrent modification exceptions. if you take a lock, you're back to stopping the world.

deep copy approach: take a lock, deep copy the map, clear the original, release the lock, then flush the copy in a thread. the critical section is now just the deep copy + clear, not the entire flush. much better. but deep copy takes time and doubles your memory.

the two-buffer swap — minimal stop the world:

this is the elegant solution. think of it like a mcdonald's coke fountain competition — you have two glasses. while you're drinking from one, you fill the other. apply this to buffers.

you always write to the active buffer. when it's time to flush, you swap references — ma, mp = mp, ma. that's it. three CPU instructions. your critical section is literally a variable swap. no deep copy, no doubled memory, no long lock times.

if count == 1000:
    mu.lock()
        ma, mp = mp, ma
    mu.unlock()
    go writeToDB(mp)

the flusher thread works on the passive buffer at its own pace while the consumer keeps writing to the now-active buffer. when the flush is done, swap again. this is the same pattern used in production systems handling petabytes of data — google's dataproc remote shuffle service used exactly this to avoid stopping consumption while writing to remote storage.
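
here's the same idea as a small python sketch, assuming a single consumer thread and a write_to_db callback that takes a {tag: count} dict (DoubleBuffer and write_to_db are made-up names for illustration):

```python
import threading
from collections import Counter

class DoubleBuffer:
    def __init__(self, write_to_db, flush_every=1000):
        self.write_to_db = write_to_db   # hypothetical sink, called with {tag: count}
        self.flush_every = flush_every
        self.active = Counter()          # the consumer writes here
        self.passive = Counter()         # the flusher drains this one
        self.messages = 0
        self.lock = threading.Lock()
        self.flusher = None

    def add(self, tags):
        with self.lock:
            self.active.update(tags)
            self.messages += 1
            due = self.messages >= self.flush_every
        if due:
            self.swap_and_flush()

    def swap_and_flush(self):
        # the previous flush must be done so the passive buffer is empty again
        if self.flusher is not None:
            self.flusher.join()
        with self.lock:
            # the whole critical section: swap two references
            self.active, self.passive = self.passive, self.active
            self.messages = 0
        self.flusher = threading.Thread(target=self._drain, args=(self.passive,))
        self.flusher.start()

    def _drain(self, buf):
        self.write_to_db(dict(buf))
        buf.clear()

written = []
buf = DoubleBuffer(written.append, flush_every=2)
for tags in (["sunset"], ["sunset", "beach"], ["food"]):
    buf.add(tags)
buf.flusher.join()
print(written)           # the first two messages, flushed in the background
print(dict(buf.active))  # the third message sits in the new active buffer
```

one honest caveat: if the previous flush is still running when the next swap is due, the consumer waits on join(). that's the "when the flush is done, swap again" rule, and it only bites when the db is slower than the incoming stream.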

✽ RECALL flushing the buffer inline stops kafka consumption, a flusher thread causes concurrent modification, and locking the map stops the world again. how does the two-buffer swap escape all three?

keep an active and a passive buffer. the consumer always writes to active; at flush time you take the lock only to swap the two references — three cpu instructions — then a background thread drains the passive buffer at its own pace while consumption continues uninterrupted. no deep copy, no doubled memory, a near-zero critical section. same pattern google's dataproc shuffle service used at petabyte scale.

approach 3: repartitioning by hashtag

one more thing. the kafka topic POST_PUBLISH is partitioned by post ID. so two posts containing #sunset could end up on two different worker machines. each does incr(sunset, 1) independently instead of one doing incr(sunset, 2). not the absolute minimum db calls.

if you want to minimize further, you write an adapter that reads from POST_PUBLISH, extracts hashtags, and writes to a new kafka topic POST_HASHTAG partitioned by hashtag. now all #sunset events land on the same consumer. your batching is maximally efficient.
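
a toy version of that adapter, just to show the key idea: the partition is a pure function of the hashtag. the crc32 hash and the event shape (post_id, hashtags) are assumptions for the sketch; a real kafka producer would pick the partition from the record key for you.

```python
import zlib

def partition_for(tag: str, num_partitions: int) -> int:
    """Stable partition for a hashtag (crc32 is an illustrative choice)."""
    return zlib.crc32(tag.lower().encode("utf-8")) % num_partitions

def repartition(post_events, num_partitions):
    """Fan post events out into (tag, post_id) records grouped by partition."""
    partitions = {p: [] for p in range(num_partitions)}
    for event in post_events:
        for tag in event["hashtags"]:
            partitions[partition_for(tag, num_partitions)].append((tag, event["post_id"]))
    return partitions

events = [
    {"post_id": 1, "hashtags": ["sunset", "beach"]},
    {"post_id": 2, "hashtags": ["sunset"]},
]
for partition, records in repartition(events, 4).items():
    print(partition, records)
```

both #sunset records end up in the same list, so whichever worker owns that partition can fold them into a single incr(sunset, 2).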

but — this adds another kafka topic, more infra to manage, more cost. the benefit should outweigh the operational complexity. for most cases, the two-buffer batching approach is good enough. don't over-engineer.

read path optimization: CDN

one last thing on the read side. for a given hashtag, the count doesn't change every second. the top 100 photos don't rotate every minute. this data is relatively stable. perfect candidate for a CDN. stick a CDN in front of your hashtag API, set a reasonable TTL, and most requests never even hit your backend.

since the hashtag page has no personalization — #sunset looks the same for everyone — there's no reason not to cache it on CDN. anything on CDN, assume it's public.

the full picture

zooming out — you can see how every piece slots together. post service publishes events, an adapter repartitions them by hashtag, counting workers batch and update the partitioned db, the hashtag API reads behind a CDN. read path and write path optimized independently.

key takeaways

kafka as a glue binding services together. post service publishes, hashtag workers consume, feed/notification/search services also consume.
adapter pattern for repartitioning — if you're unhappy with the partition key, write a relay agent that reads and repartitions.
read path vs write path — separate them, optimize them independently. reads get replicas, caches, CDNs. writes get batching, buffering, partitioning.
wear three hats — architect, product manager, engineer. a senior engineer proposes product changes that simplify the system.

and the most important takeaway? don't skip steps. don't jump to the complex solution because it sounds impressive. start simple. measure. find the actual bottleneck. then optimize. most people love being at the peak of the complexity curve. but just one extra push of thinking can simplify your solution dramatically.

designing the unread message count

2026-04-20T00:00:00Z

the tiny number on your messaging icon: designing the unread sender count

a system that looks deceptively simple on the surface but hides a bunch of interesting engineering decisions underneath. you know that little number on your messaging icon — the one that tells you how many new people messaged you. let's walk through how to actually build it.

what you'll take away

quick pointers so you know what to look for as you read:

composite indexes win or lose by column order. (to, ts, from) is one query plan, (ts, to, from) is a full scan in disguise.
it's not an anti-pattern to optimize for your most critical query. hyper-specific indexes are fine if they serve the product.
the punching bag pattern protects critical components from redundant load. drop ops you know won't change state before they hit the core.
start simple. measure. then optimize. don't jump to redis on day zero if a single SQL query handles it.

the requirement

you know that little number on the messaging icon in linkedin, twitter, instagram? that's not the total number of unread messages. it's the number of unique people from whom you received new messages since you last tapped on the icon.

so if your friend sends you 100 messages, the indicator shows 1, not 100. if three different people each send you messages, it shows 3. you tap the icon, the counter resets. next time someone messages you, it starts counting again.

sounds simple. let's see.

approach 1: count on the fly

your schema in mysql (or any relational db) has three tables — users, messages, and user_activity. user_activity.last_read_at resets to current time whenever the user taps the message icon. the query to get the unread count is literally:

sql

SELECT COUNT(DISTINCT `from`)   -- from and to are reserved words, so quote them
FROM messages
WHERE `to` = ?
AND timestamp > last_read_at    -- this user's value from user_activity

that's it. what looks like a super complicated system is literally this one query. at small to medium scale, this works just fine.

but will this scale? that depends entirely on your indexing strategy. and this is where most people mess up.

the indexing deep dive

let's say you blindly create individual indexes on to and timestamp. seems reasonable right? here's how the query evaluates.

take user C who was last online at timestamp 11:11. the query fires WHERE to = C AND timestamp > 11:11.

the to index (B+ tree, ordered by to then id) gives you all messages ever sent to C. could be millions of rows across 10 years. the timestamp index gives you all messages after 11:11 — potentially the entire table if the timestamp is old enough. then the database does a set intersection of these two result sets to find the matching rows.

imagine millions of entries from the to index intersected with millions from the timestamp index. catastrophically slow.

✽ RECALL both to and timestamp had their own indexes. why was the query still catastrophically slow?

each index alone produces a huge candidate set — every message ever sent to C, and every message after the watermark. neither narrows both predicates together, so the database must set-intersect millions of entries from each side before it can count anything.

composite index to the rescue

the fix is a composite index on (to, timestamp, from). now when you fire WHERE to = C AND timestamp > 11:11:

O(log n) — binary search to find where to = C starts
skip — within C's entries, jump to where timestamp > 11:11
k sequential reads — scan only the matching entries

total: O(log n + k) where k is the number of matching rows. no set intersection. no pointed lookups on the main table for the from column (because it's baked into the index). done.

now compare:

without from in index: O(log n + k) + k × O(log n). for each of the k matched rows, you do a pointed lookup on the main table to get the from column for the COUNT(DISTINCT from).
with from in index: O(log n + k) — everything you need is right there in the index.

the overhead? just 8 extra bytes per index entry for storing from. even with 10 million messages, that's 80 MB. peanuts compared to the compute savings.

and yes, this index is hyper-optimized for this one query. that's perfectly fine. it's not an anti-pattern to optimize for your most critical query. as long as it works for your product and business, do it.

also — the order of columns in the composite index matters. if you did (timestamp, to, from) instead, the first column would match a huge range of timestamps, then you'd have to linearly scan through all of them to filter by to. completely defeats the purpose. the index should be ordered to match your query's selectivity — to first (narrows to one user), then timestamp (narrows to recent messages), then from (avoids main table lookup).
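
if you want to poke at this yourself, here's a small sqlite sketch (sqlite instead of mysql purely so it runs anywhere; the columns are renamed to from_user / to_user / ts to dodge reserved words, and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, from_user INTEGER, to_user INTEGER, ts INTEGER)"
)
# equality column first, range column second, payload column last
conn.execute("CREATE INDEX idx_to_ts_from ON messages (to_user, ts, from_user)")

rows = [  # (from_user, to_user, ts); user 3 plays the role of C
    (1, 3, 100), (4, 3, 900),                 # before the 1111 watermark
    (1, 3, 1200), (2, 3, 1250), (1, 3, 1300), # after it
    (3, 1, 1500),                             # a message to someone else
]
conn.executemany("INSERT INTO messages (from_user, to_user, ts) VALUES (?, ?, ?)", rows)

QUERY = "SELECT COUNT(DISTINCT from_user) FROM messages WHERE to_user = ? AND ts > ?"

def unread_senders(user_id, last_read_at):
    return conn.execute(QUERY, (user_id, last_read_at)).fetchone()[0]

print(unread_senders(3, 1111))  # 2 (users 1 and 2)

# the plan text differs across sqlite versions, so just print it
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (3, 1111)):
    print(row[-1])
```

try recreating the index as (ts, to_user, from_user) and compare the printed plan to see how much the column order changes what the engine can do.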

✽ RECALL why does (to, timestamp, from) make the query O(log n + k) while (timestamp, to, from) degenerates into a scan — and what does carrying from in the index buy you?

the equality column must lead: to = C binary-searches to one user's slice, then the range on timestamp skips within it. timestamp-first matches a huge range that still has to be linearly filtered by to. carrying from means the count never touches the main table — without it you'd pay k extra pointed lookups, k × O(log n), for 8 bytes per entry saved.

✽ RECALL the (to, timestamp, from) index exists to serve exactly one query. is that an anti-pattern?

no — hyper-specific indexes are fine when they serve your most critical query. judge the index by what it does for the product, not by how general it is. the column order just follows the query's selectivity (equality first, range second, payload last), and as long as that works for your product and business, optimizing for one query is a feature, not a smell.

approach 2: pre-computed with redis

the on-the-fly approach works at decent scale. but if you want to pre-compute, here's how it shapes up.

you need a way to know whether a message was actually delivered or not. websockets help — if the user is connected, the message is delivered live and doesn't count. if the user is offline, the message is undelivered and that's when the counter should bump. so you need an online/offline service that knows the websocket state of every user.

whenever a message can't be delivered in real-time (user is offline or inactive), an ON_MSG_UNSENT event hits kafka partitioned by receiver ID. workers consume it and fire SADD on redis — adding the sender to the receiver's set.

so if D sends A 10 messages, the worker fires SADD A D ten times. but since it's a set, D only appears once. the count? just SCARD A (redis's set-cardinality command). returns 3 if A has messages from B, C, and D.

when A taps the message icon: DEL A. counter reset. done.

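the read path is a single SCARD call, which is O(1). the clear path is a single DEL, which costs time proportional to the set's size, and that stays tiny here because a set only holds distinct senders. the write path is SADD calls from workers.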
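
the set semantics are easy to see without a redis server. a sketch using plain python sets as a stand-in, with each method annotated with the redis command it mimics (UnreadSenders is an illustrative name, not part of any library):

```python
from collections import defaultdict

class UnreadSenders:
    """In-memory stand-in for the redis set operations used above."""

    def __init__(self):
        self.sets = defaultdict(set)

    def on_undelivered(self, receiver, sender):   # SADD receiver sender
        self.sets[receiver].add(sender)

    def count(self, receiver):                    # SCARD receiver
        return len(self.sets.get(receiver, ()))

    def mark_read(self, receiver):                # DEL receiver
        self.sets.pop(receiver, None)

store = UnreadSenders()
for _ in range(10):
    store.on_undelivered("A", "D")   # ten messages, one sender
store.on_undelivered("A", "B")
store.on_undelivered("A", "C")
print(store.count("A"))  # 3
store.mark_read("A")
print(store.count("A"))  # 0
```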

✽ RECALL the single SQL query already works. what extra machinery does the redis pre-compute drag in, and when is moving to it actually justified?

pre-computing forces you to know whether a message was delivered — so you now need websocket online/offline state, kafka events for undelivered messages, and workers firing SADD. that's a lot of moving parts versus one well-indexed query. it's justified only when the on-the-fly query measurably hits its ceiling: start simple, measure, then optimize.

the punching bag pattern

now here's the micro-optimization. if D sends A 100 messages, the worker fires SADD A D a hundred times. but after the first one, D is already in the set. the remaining 99 are redundant operations — they don't change the data but they still consume redis resources.

at peak load, imagine 90% of your commands are redundant. your redis cluster is both read-heavy (polling for status) and write-heavy (ingesting events). every unnecessary command adds up.

the punching bag pattern is about protecting your critical component. you add an auxiliary redis replica in front. before firing SADD on the primary, you check the replica — is this member already in the set? if yes, skip. if no, write to primary.

this isn't your day-zero solution. it's a day-n optimization when you observe that a huge percentage of writes are redundant. the pattern shows up everywhere — rate limiters are essentially punching bags that absorb load before it hits your core system.

two flavors of the punching bag:

streaming buffer — batch and buffer writes before they hit the db
check-and-set — discard redundant operations before they reach the critical component
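
a sketch of the check-and-set flavor, assuming a primary_sadd callback for the real write and a local set standing in for the auxiliary replica's membership check (both names are hypothetical):

```python
class PunchingBag:
    def __init__(self, primary_sadd):
        self.primary_sadd = primary_sadd  # hypothetical: performs the real SADD
        self.seen = set()                 # stand-in for the replica's membership check
        self.forwarded = 0
        self.dropped = 0

    def sadd(self, receiver, sender):
        key = (receiver, sender)
        if key in self.seen:              # already a member: the write changes nothing
            self.dropped += 1
            return False
        self.seen.add(key)
        self.primary_sadd(receiver, sender)
        self.forwarded += 1
        return True

    def clear(self, receiver):
        # the receiver tapped the icon, so their members must count as new again
        self.seen = {k for k in self.seen if k[0] != receiver}

primary_writes = []
bag = PunchingBag(lambda receiver, sender: primary_writes.append((receiver, sender)))
for _ in range(100):
    bag.sadd("A", "D")
print(bag.forwarded, bag.dropped)  # 1 99
```

with a real replica the check can be slightly stale, so a few redundant writes may still slip through. that's harmless: SADD is idempotent, the bag only exists to shed load.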

✽ RECALL D sends A 100 messages while A is offline. how many of the 100 SADDs actually change state — and how does the punching bag absorb the rest?

exactly one — set membership is idempotent, so the other 99 are redundant load. the punching bag puts an auxiliary replica in front of the primary: check membership there first, and only write through when the member is new (the check-and-set flavor). a day-n optimization you add when you measure that most writes are redundant.

key takeaways

start with a single SQL query. at small to medium scale, SELECT COUNT(DISTINCT from) FROM messages WHERE to = ? AND timestamp > last_read_at literally is the system. don't overcomplicate.
composite indexes are surgical tools. column order is everything. (to, ts, from) makes the query O(log n + k). reorder the columns and you've built a beautifully indexed full table scan.
including columns in the index avoids main table lookups. 8 extra bytes per row to skip k pointed lookups is one of the cheapest trades in databases.
redis pre-computation is a day-n move. only when the on-the-fly query hits its ceiling.
the punching bag pattern saves your critical component from doing work it doesn't need to do. drop redundant ops before they hit the core.

and the bigger picture — identify the read path and write path, optimize them independently. for reads: indexes, replicas, caches, pre-computed counts. for writes: batching, buffering, punching bags to absorb redundancy.

nothing is best. everything depends on the use case. and the answer almost always lies in the gray area: not purely in engineering, not purely in product, but in that sweet spot where both work together.

bit.ly system design — building a url shortener

2026-04-05T00:00:00Z

every system design conversation eventually circles back to bit.ly. it's the canonical "looks dead simple but isn't" service — take a long url, hand back a short one, redirect on click. the surface is straightforward. then you start asking how it scales to a billion urls and 100M daily active users, and the design gets way more interesting.

this is a full walkthrough — requirements, api, the dumb-first-design, then the deep dives where things actually get fun. i'm gonna walk through it as if i'm building it myself, talking through each decision as it comes up.

what you'll take away

quick pointers so you know what to look for as you read:

the read/write asymmetry is wild. 1000:1 reads-to-writes. design around reads first.
counter + base62 is the cleanest short-code generator. no collision checks, no extra reads.
6 base62 chars get you 56 billion urls. plenty of room for the next decade.
redis as a counter is the secret weapon. single-threaded, atomic incr — perfect for this.
counter batching kills the cross-network latency. grab 1000 ids at a time, use them locally.
302 redirect, not 301. keeps control on our side, allows analytics, doesn't get cached forever.
boring is fine. postgres, redis, a load balancer. nothing fancy is needed for this whole system.

what we're building

a url shortener. user sends in https://www.example.com/some/very/long/url, we hand back something like short.ly/abc123. anyone who hits that short url gets bounced to the original. plus a couple of optional flavors:

custom alias — user picks the short code (short.ly/evan)
expiration time — short url stops working after a date

scope-wise we're skipping user accounts, click analytics, and spam detection. they add complexity without changing the core architecture.

what the system has to do

requirements come in two flavors. functional — the features. non-functional — the qualities.

functional:

create a short url from a long url, optionally with custom alias and expiration
redirect from short url to the original

non-functional:

short codes are unique. one short code maps to one long url, no collisions ever.
redirects feel instant — under ~200ms end to end.
99.99% available. we lean availability over consistency.
handles 1B short urls total and 100M daily active users.

now the most important fact about this whole system, the one that drives every architectural decision later: read/write traffic is wildly asymmetric. one user creates a short url, then potentially millions click it. typical ratio is 1000 reads per 1 write. caching strategy, service topology, database choice — all of it falls out of that single number.

✽ RECALL what single fact about a url shortener's traffic drives every other design decision, and what falls out of it?

the read/write asymmetry — roughly 1000 reads for every write, because one person creates a link and millions click it. once you internalize that, the architecture falls out: design the read path first (index, cache, maybe a cdn at the edge), split read and write services so the hot side autoscales alone, and let the roughly-one-write-per-second side stay tiny and boring.

core entities

before drawing boxes, name the things our system moves around.

original url — the long thing the user gave us
short url (or short code) — what we generated
user — who created it

really two entities, since short and long urls live in the same row of the same table. user is auxiliary.

the api

two endpoints. one to shorten, one to redirect.

POST /urls
{
  "long_url": "https://www.example.com/some/very/long/url",
  "custom_alias": "optional",
  "expiration_date": "optional"
}
→ { "short_url": "http://short.ly/abc123" }

GET /{short_code}
→ HTTP 302 redirect to the original long url

dead simple. no fancy verbs, no nested resources. one post, one get.

the dumb-first design

start as small as possible — client, server, database. that's it.

on a write:

client POSTs to the primary server with the long url
server validates the url (something like an is-url check)
server generates a short code (we'll get to how in a sec — this is the fun part)
if the user gave us a custom alias, we use that — but we first check the db to make sure it's not already taken. nightmare scenario is a custom alias colliding with a generated code in the future. easy fix: prefix all generated codes with a character that aliases can't use, or keep them in different namespaces.
server writes (short_code, long_url, created_at, expires_at, created_by) to the db
server returns the short url to the client

on a read:

user's browser hits short.ly/abc123
server looks up abc123 in the db
if it's there and not expired, server returns a 302 Found with the long url in the Location header
browser follows the redirect, user lands on the original site
if expired, return 410 Gone. if missing entirely, 404.
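
the read steps boil down to a three-way decision. a minimal sketch, assuming the db is a dict keyed by short code and each row carries long_url and an optional expires_at (the row shape and the resolve name are illustrative):

```python
from datetime import datetime, timezone

def resolve(short_code, urls, now=None):
    """Return (status, location) for a redirect lookup."""
    now = now or datetime.now(timezone.utc)
    row = urls.get(short_code)
    if row is None:
        return 404, None                  # never existed
    expires_at = row.get("expires_at")
    if expires_at is not None and expires_at <= now:
        return 410, None                  # existed, but expired
    return 302, row["long_url"]           # goes into the Location header

urls = {
    "abc123": {"long_url": "https://www.example.com/some/very/long/url", "expires_at": None},
    "old": {"long_url": "https://www.example.com/old",
            "expires_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
}
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
for code in ("abc123", "old", "nope"):
    print(code, resolve(code, urls, now))
```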

quick aside on 301 vs 302. 301 is "permanent" — browsers and intermediate caches will cache the redirect, future requests might never hit our server. 302 is temporary — every request comes through us.

we want 302. why?

it lets us update or expire short urls without fighting browser caches
if we ever want analytics (clicks, geo, referrer), every request needs to come through us
the cost of a server hit per redirect is way smaller than losing observability

right, working system. the requirements aren't fully met yet though — we hand-waved short code generation, redirects aren't fast, we can't scale. let's actually get to the interesting parts.

✽ RECALL why return a 302 redirect instead of the "more correct" permanent 301?

a 301 gets cached by browsers and intermediaries, so future clicks may never reach your server — you lose the ability to update or expire short urls and you lose every click for analytics. a 302 keeps each request flowing through you: control, expiration, observability. the cost of one server hit per redirect is tiny next to going blind.

deep dive 1 — generating unique short codes

three properties we want:

unique — never collide
short — 5–7 characters
fast to generate

let's walk a few options.

option 1: random + check

generate a random number, base62-encode it, slice the first 6 characters.

base62 encoding is a numbering system that uses 0-9, a-z, A-Z for 62 total symbols. 6 base62 characters gives 62^6 ≈ 56 billion combinations. lots of room.

problem: random isn't unique. how often will two random short codes collide? more often than you'd think. this is the birthday paradox in action: in a room of just 23 people there's already a 50% chance two share a birthday despite 365 possible birthdays. apply the same math to 1B short codes randomly chosen from ~56.8B options (expected collisions ≈ n² / 2N) and you get roughly 8.8 million collisions, a bit under 1% of all codes. not catastrophic but far from zero.

so we'd need a db check before saving — generate a candidate, look it up, if it exists try again, otherwise save. that adds a read on every write. not great.
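
the collision estimate is easy to sanity-check. a short sketch with two helpers (the function names are mine): one computes the exact expected number of draws that land on an already-used code, the other is the usual n² / 2N birthday approximation.

```python
import math

def expected_collisions(n: int, space: int) -> float:
    """Expected number of draws that hit an already-used code."""
    # expected distinct codes after n draws: space * (1 - (1 - 1/space) ** n)
    distinct = space * -math.expm1(n * math.log1p(-1.0 / space))
    return n - distinct

def approx_collisions(n: int, space: int) -> float:
    """Birthday-style approximation: n^2 / (2 * space)."""
    return n * n / (2 * space)

n, space = 1_000_000_000, 62 ** 6
print(f"{expected_collisions(n, space):,.0f}")
print(f"{approx_collisions(n, space):,.0f}")
```

both come out just under 9 million at 1B urls, and the quadratic term is the part to remember: doubling the number of urls roughly quadruples the collisions.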

option 2: hash the long url

hash(canonicalize(long_url)), take the first 6 base62 chars. md5, murmur, sha-256, whatever.

a good hash function has the avalanche property: change one bit on the input and the output looks completely different. so the collision behavior is the same as random: ~56B possible outputs, same ~8.8 million collisions at 1B urls. same db check needed.

what's nice about hashing — same long url always maps to the same short code, so we get deduplication for free. what's not nice — most url shorteners actually want multiple short codes per long url (different users want different aliases, different expirations, separate analytics). dedup is often a feature you don't want.

option 3: counter + base62 — the winner

just keep a counter. first url gets short code 1, second gets 2, third gets 3. base62-encode the counter to keep things compact.

the counter guarantees uniqueness by construction. no collision checks, no extra reads. the encoding keeps it short — at 1B urls, our short code is a 6-character string. quick math: 1,000,000,000 in base62 is 15ftgG. and 62^6 ≈ 56 billion, so we don't need to bump up to 7 characters until we cross that threshold (which would take a lifetime at any reasonable url shortener's growth rate).
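
a minimal base62 encoder to make the math concrete. the digit order here (0-9, then A-Z, then a-z) is an assumption that happens to reproduce the 15ftgG above; any fixed ordering of the 62 symbols works as long as you never change it.

```python
# assumption: digit order 0-9, A-Z, a-z
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(n: int) -> str:
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))   # most significant digit first

def base62_decode(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(base62_encode(1_000_000_000))  # 15ftgG
print(62 ** 6)                        # 56800235584
```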

there's one fair concern with the counter: it's predictable. a competitor or scraper can iterate 1, 2, 3, ... and discover every short url we've generated. two ways to deal with this:

accept it. short urls are usually meant to be shared publicly anyway. rate-limit and move on.
scramble the counter before encoding. xor with a secret key, or use a "bijective" function (the sqids library is one) that maps 1 → "Xa3kL9" reversibly but unpredictably. you keep the uniqueness, you lose the predictability.
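
to show what "bijective" buys you, here's a deliberately simple sketch: multiply by a secret number modulo the size of the code space. this is not how sqids works and it is not cryptographically strong; MULTIPLIER is a made-up value, chosen only because it shares no factor with 62^6.

```python
SPACE = 62 ** 6                        # number of 6-character base62 codes
MULTIPLIER = 1_000_000_007             # hypothetical secret, coprime with SPACE
INVERSE = pow(MULTIPLIER, -1, SPACE)   # modular inverse (python 3.8+)

def scramble(counter: int) -> int:
    """Map a counter in [0, SPACE) to a different-looking value in the same range."""
    return (counter * MULTIPLIER) % SPACE

def unscramble(value: int) -> int:
    """Exact inverse of scramble."""
    return (value * INVERSE) % SPACE

for counter in (1, 2, 3):
    print(counter, scramble(counter), unscramble(scramble(counter)))
```

because the multiplier is coprime with the space size, every counter maps to exactly one scrambled value and back, so uniqueness survives and neighbours no longer look sequential. anyone who sees a few consecutive outputs can still work out the step, so treat it as obfuscation only.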

we're going with the counter. cleanest, fastest, no read amplification on writes.

✽ RECALL why does counter + base62 beat random codes and hashing the long url for short-code generation?

random and hash both collide (birthday paradox: at a billion urls drawn from ~56 billion codes you're looking at millions of collisions), which forces a db existence check before every save: read amplification on the write path. a counter is unique by construction, no checks at all, and base62-encoding keeps it short, with six characters covering tens of billions of codes. the one cost is predictability; either accept it (short urls are shared publicly anyway, rate-limit and move on) or scramble the counter with a bijective function before encoding.

deep dive 2 — making redirects fast

the read path is short_code → long_url. without optimization the server walks every row of the urls table on each lookup. at 1B rows that's a non-starter — full table scan for every redirect. dead.

step 1: index the short code

stick a primary key (or unique index) on short_code. now the database keeps a b-tree of short codes pointing to row locations on disk. lookups become O(log n) — typically a handful of memory reads followed by one disk seek to fetch the row.

postgres does this automatically for primary keys. for a WHERE short_code = ? lookup this is plenty fast — well under 10ms even at 1B rows.

step 2: cache the hot path

even with an index, every lookup eventually touches disk. and a huge fraction of our traffic is going to be a small number of viral short codes. that's a caching layup.

stick a redis (or memcached) instance in front of the db. on every read:

check redis for short_code → long_url
hit? return immediately — sub-millisecond
miss? read from db, populate redis, return

eviction policy is least recently used — if the cache fills, kick whatever's been quiet longest. natural fit for url shorteners because old links go cold fast.

key:   "abc123"
value: "https://www.example.com/long/url"

dead simple key-value lookup. redis hits sub-ms, db hits maybe 10ms. for hot urls we basically never touch the database. tie the cache TTL to the url's expiration time so expired entries fall out automatically.
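
the read flow plus lru eviction fits in a few lines. a sketch with an in-process LRUCache standing in for redis and a dict standing in for the db (both stand-ins are assumptions, and ttl handling is left out to keep it short):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

def lookup(short_code, cache, db):
    """Cache-aside read: try the cache, fall back to the db, then populate."""
    long_url = cache.get(short_code)
    if long_url is not None:
        return long_url, "hit"
    long_url = db.get(short_code)
    if long_url is not None:
        cache.put(short_code, long_url)
    return long_url, "miss"

db = {"abc123": "https://www.example.com/long/url", "x1": "https://a.example", "y2": "https://b.example"}
cache = LRUCache(capacity=2)
for code in ("abc123", "abc123", "x1", "y2", "abc123"):
    print(code, lookup(code, cache, db)[1])
```

the last abc123 lookup is a miss again: two newer codes pushed it out of the two-slot cache, which is exactly the "quiet links go cold" behaviour you want.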

step 3 (optional): cache at the edge

for global users, even a redis hit means a round-trip to whatever region our service runs in. a user in tokyo hitting a virginia data center is eating 200ms just on the network.

a CDN (cloudflare, fastly, akamai) caches responses at edge servers worldwide. for popular short codes the redirect can be served from the tokyo edge in 10–20ms without ever reaching our origin.

trade-offs:

cache invalidation across many edge nodes is harder
you lose visibility — the request never hits your server, so analytics get weird
it costs money

worth it for the most-clicked links and globally distributed audiences. for everything else, a single redis layer in your primary region is plenty.

deep dive 3 — scaling to 1B urls and 100M DAU

let's do the back-of-envelope math.

writes:

1B urls total over the lifetime of the service
assume linear-ish growth → roughly 100k new urls per day (at that pace 1B urls takes about 27 years)
that's ~1 url per second average. peaks maybe 10x. easy.

reads:

100M DAU × 1 redirect/user/day ≈ 100M redirects/day
100M ÷ 86,400s ≈ 1,200 reads/sec average
peaks 10x or 100x → 12k – 120k reads/sec at the upper bound

storage:

per row: short_code (~8 bytes) + long_url (~100 bytes) + timestamps (~16 bytes) + custom_alias (~100 bytes) + metadata (~80 bytes) ≈ 300 bytes. round up to 500.
500 bytes × 1B rows = 500 GB. fits on a single modern instance with room to spare.
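
the same arithmetic as a few lines of python, handy when you want to tweak an assumption and see what moves. every input is one of the estimates above, nothing here is measured:

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 100_000
writes_per_sec = writes_per_day / SECONDS_PER_DAY      # about 1.16

redirects_per_day = 100_000_000 * 1                    # DAU x redirects per user per day
reads_per_sec = redirects_per_day / SECONDS_PER_DAY    # about 1,157

row_bytes = 500
total_rows = 1_000_000_000
storage_gb = row_bytes * total_rows / 1e9              # 500.0

years_to_1b = total_rows / writes_per_day / 365        # about 27.4

print(f"writes/sec  {writes_per_sec:.2f}")
print(f"reads/sec   {reads_per_sec:,.0f} avg, {reads_per_sec * 100:,.0f} at a 100x peak")
print(f"storage     {storage_gb:.0f} GB")
print(f"years to 1B {years_to_1b:.1f}")
```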

so the dataset isn't the problem. the read throughput is. and we already solved most of that with caching. let's wire up the rest.

split read and write services

reads and writes have completely different traffic profiles:

writes: ~1/sec
reads: 10k+/sec, possibly bursting to 100k/sec

scaling them together is wasteful. we'll split them into two services behind an api gateway. the gateway routes POST /urls to the write service and GET /{short_code} to the read service.

each service horizontally scales independently. read service runs hot — many instances, all behind a load balancer, hitting redis cache and falling back to db. write service stays small — one or two instances handle the entire write load.

is splitting really worth it? honestly, for a service this small, sometimes no. running two services means two deployments, two dashboards, two on-call rotations. but the read/write asymmetry here is so extreme that the split is genuinely useful — you can autoscale the read service on cpu/memory thresholds without ever touching the write service.

the global counter problem

now that we've split into multiple write service instances, the counter has to live somewhere shared. you can't have each instance keeping its own count — they'd all hand out short_code 1 simultaneously.

the answer is a central redis instance holding the counter. redis is single-threaded, so its INCR command is atomic — two simultaneous calls always get different values. one gets 1000, the next gets 1001, never the same one twice.

every time the write service needs a new short code:

INCR counter on redis → returns next value
base62-encode it
write (short_code, long_url, ...) to the db
return short url to client
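
the encode step in that flow can be sketched like this. the alphabet order (digits, lowercase, uppercase) is my assumption, any fixed ordering of the 62 characters works. `incr` is a stand-in for the redis INCR call, not a real client method.

```python
import string

# assumed alphabet order: digits, then lowercase, then uppercase
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative counter value as a base62 string."""
    if n < 0:
        raise ValueError("counter values are non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def next_short_code(incr) -> str:
    """incr stands in for a call like redis INCR on the counter key."""
    return base62_encode(incr())
```

with this alphabet, 6 characters cover 62^6 ≈ 56.8 billion counter values.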

✽ RECALL multiple write instances each need globally unique ids. why is a single redis counter the answer, and what makes it safe?

redis is single-threaded, so INCR is atomic — two simultaneous calls can never receive the same value, which is exactly the uniqueness guarantee short codes need with zero coordination protocol. and a single redis box handles 100k+ ops/sec, while this system writes about one url per second — the counter is never the bottleneck. if redis loses a few values during failover, fine: the database's unique constraint is the safety net.

counter batching — kill the per-write network hop

doing a redis call on every single write is fine but wasteful. we can do better.

batch counter ranges to each write service instance. when a write service starts, it asks redis for a chunk of 1000 counter values:

INCRBY counter 1000  → returns N (the counter value after the increment, i.e. the end of the batch)

the instance now owns counters N-999 through N locally. it serves urls from this range with zero cross-network calls. when it runs out, it asks for the next 1000.

if a write service crashes mid-batch, those unused counters are lost forever. who cares — 56 billion total slots, losing a few hundred is invisible. we just need uniqueness, not continuity.
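
a minimal sketch of the local allocator, assuming the INCRBY-style call returns the counter value after the increment (which is how redis documents INCRBY), so the batch is the 1000 values ending at the returned number. `CounterBatch` and `FakeCounter` are hypothetical names; the fake is an in-memory stand-in for the shared redis counter.

```python
import threading

class CounterBatch:
    """Hands out ids from a locally owned range, refilling with one INCRBY-style call."""

    def __init__(self, incrby, batch_size=1000):
        self._incrby = incrby          # callable(amount) -> counter value after the increment
        self._batch_size = batch_size
        self._next = 0
        self._end = 0                  # exclusive upper bound of the current range
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._end:
                top = self._incrby(self._batch_size)   # one network hop per batch
                self._next = top - self._batch_size + 1
                self._end = top + 1
            value = self._next
            self._next += 1
            return value


class FakeCounter:
    """In-memory stand-in for the shared redis counter (illustration only)."""

    def __init__(self):
        self.value = 0
        self.calls = 0

    def incrby(self, amount):
        self.calls += 1
        self.value += amount
        return self.value
```

two instances sharing one counter never hand out the same id, and each one only talks to the counter once per 1000 urls.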

✽ RECALL what does counter batching fix, and why is losing a batch when an instance crashes a non-issue?

it kills the per-write network hop — instead of hitting redis for every url, an instance grabs a range of ids with one INCRBY and burns through them locally, refilling when empty. a crash loses the unused remainder of the batch, and nobody cares: you need uniqueness, not continuity, and the base62 keyspace has tens of billions of slots to spare.

multi-region — split the counter space

if the service runs in multiple regions (US, EU, APAC), having every region hit a single global counter is a latency disaster. instead, partition the counter space so each region owns a slice:

US gets [0, 1B)
EU gets [1B, 2B)
APAC gets [2B, 3B)

each region runs its own redis with its own slice of the namespace. no cross-region coordination on the hot path. a UNIQUE constraint on short_code in the database is the ultimate safety net if anything ever drifts.
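
one way to picture the slices: each region counts from zero locally and adds its own offset. a rough sketch using the three ranges above; `REGION_SLICES` and `global_id` are names i made up for illustration.

```python
REGION_SLICES = {
    "US": (0, 1_000_000_000),
    "EU": (1_000_000_000, 2_000_000_000),
    "APAC": (2_000_000_000, 3_000_000_000),
}

def global_id(region: str, local_counter: int) -> int:
    """Map a region-local counter value (0, 1, 2, ...) into that region's slice."""
    start, end = REGION_SLICES[region]
    value = start + local_counter
    if not start <= value < end:
        raise OverflowError(f"{region} has used up its slice")
    return value
```

the bounds check matters: a region that silently runs past its slice would start colliding with its neighbor.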

the database

we said 500 GB on a single instance. that's fine for postgres or mysql today — vertically scale to instances with multi-TB SSDs and hundreds of GB of ram easily. so we don't actually need to shard.

if we ever do need to shard (say, data grows past 5 TB), we'd shard by short_code — hash(short_code) % N to pick a shard. but for the next decade of growth, one well-tuned postgres handles this whole system.
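
if it ever comes to that, the shard pick could look something like this. i'm using sha256 here as an assumption, because python's built-in hash() for strings is randomized per process and would send the same short_code to different shards after a restart.

```python
import hashlib

def pick_shard(short_code: str, num_shards: int) -> int:
    """Stable shard index for a short code (same input, same shard, every process)."""
    digest = hashlib.sha256(short_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

keep in mind that plain modulo sharding moves most keys when N changes, so resharding is its own project.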

high availability matters though. one box dying takes down the entire product. so:

read replicas for redundancy and to absorb db reads if redis ever goes cold
regular snapshots to s3 (or equivalent) for point-in-time recovery
automatic failover — if the primary dies, promote a replica

for redis (counter and cache), redis sentinel or cluster mode handles failover. if redis loses the latest counter values during a failover, we lose a few short codes — fine, the database's unique constraint catches anything that would actually collide.

the final design

zooming out, here's everything wired together.

client hits the api gateway
gateway routes writes to the write service (small fleet)
- write service grabs a counter value (from its local batch, refilling from the global counter redis when empty)
- base62-encodes it, writes the row to the postgres database
gateway routes reads to the read service (large fleet)
- read service checks the redis cache first
- on miss, reads the database, populates cache, returns
read service responds with 302 and the long url
browser follows the redirect

every component scales independently. read service autoscales on traffic spikes without touching writes. cache absorbs the hot path. database serves only cache misses. counter is durable, atomic, and never a bottleneck thanks to batching.

✽ RECALL what's actually "fancy" in the final bit.ly design — and what should that teach you?

nothing — postgres, redis, an api gateway, a load balancer. the one structural flourish, the read/write service split, is earned by the extreme traffic asymmetry; everything else stays deliberately boring. no sharding for a dataset that fits one box, no microservices for a one-write-per-second workload. boring components, well arranged, carry a billion urls — you're paid to solve a problem, not to ship the fanciest architecture.

what i'd take away from all this

a few thoughts after walking through it.

start with the asymmetry. reads vs writes is the lens through which every other decision is made. once you internalize 1000:1, the cache, the service split, the counter — they all just fall out.
counter + base62 is the answer for short-code generation in 99% of these systems. it skips the collision-check tax that random and hash approaches need.
single redis is enough. for the counter, for the cache, for everything in this system. redis at 100k+ ops/sec on a single box is more than this whole service needs.
microservices for a tiny system are theater. the read/write split here is justified by the traffic asymmetry, but a lot of designs split for the sake of splitting. don't.
boring is fine. postgres, redis, an api gateway, a load balancer. nothing here is novel — and that's the whole point.

nothing is best. everything depends on the actual traffic, the actual constraints, the actual money. but for a url shortener at this scale, this design holds up. you're paid to solve a problem, not to ship the fanciest architecture.

remote locks and distributed locks

2026-02-11T00:00:00Z

where do remote locks even fit in?

before we jump into remote locks and distributed locks, let's take a step back and understand where they fit in. there's a beautiful logical evolution here that most people miss.

when you have multiple threads that need to synchronize, you use a mutex or a semaphore. why? because threads share the same memory space, so an in-memory primitive is the closest possible synchronization mechanism. dead simple.

now bump it up a level. when you have multiple processes on the same machine that need to synchronize, you use the disk. why? because disk is the shared common storage between processes. a beautiful real-world example of this is apt-get upgrade. try opening two terminals and running apt-get upgrade on both simultaneously. the second one will throw an error saying dpkg.lock file exists. the process creates a lock file on disk when it starts, and if another instance sees that file, it kills itself. two processes synchronizing through disk.
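
the "file exists, so back off" idea can be sketched with an exclusive create. this is only the general pattern from the paragraph above, not how apt or dpkg implement their locking internally, and the lock path is a throwaway temp file i picked for the demo.

```python
import os
import tempfile

def try_lock(path: str) -> bool:
    """Create the lock file atomically; fail if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))      # record the owner for debugging
    return True

def unlock(path: str) -> None:
    os.remove(path)

lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")   # hypothetical path
first = try_lock(lock_path)     # the first process gets the lock
second = try_lock(lock_path)    # the second sees the file and backs off
unlock(lock_path)
third = try_lock(lock_path)     # free again after release
unlock(lock_path)
```

O_CREAT together with O_EXCL is what makes "check if it exists, then create" a single step on the filesystem, which is the same trick SET NX plays later on.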

now bump it up one more level. when you have multiple machines that need to synchronize, what's the closest shared resource? the network. and that's exactly where remote locks come in.

you try to synchronize with the closest possible shared storage available. that's the pattern.

✽ RECALL threads sync through memory, processes through disk, machines through the network. what's the underlying rule, and why does apt-get use a lock file instead of a mutex?

you synchronize through the closest shared resource available. threads share memory → mutex/semaphore. processes don't share memory but share the disk → lock file (dpkg.lock: create it on start, kill yourself if it already exists). machines share neither memory nor disk, only the network → remote locks through a central lock manager.

what you'll take away

quick pointers so you know what to look for as you read:

closest shared resource wins. threads → memory, processes → disk, machines → network. pick your sync primitive accordingly.
two non-negotiables for any lock manager. atomic ops + automatic expiration. without TTL, one dead consumer halts the whole system.
SET NX is the whole game on the acquire side. atomic check-and-set means two consumers can't sneak past simultaneously.
release is sneakier than acquire. blind delete corrupts the system after a TTL expires. always verify ownership before releasing — atomically.
redlock buys availability with throughput. 5 nodes, majority quorum, no replication. survives 2 failures but every acquire is now a consensus round.
quorum count is static, not dynamic. even when nodes go down, the magic number stays at majority of original — shrinking it breaks correctness.
fanciest architecture rarely wins. single-node redis is what most teams should reach for. redlock exists for when correctness during failure is non-negotiable.

what are remote locks?

remote locks are essentially locks managed by a central machine — we call it the lock manager. it's a central component that multiple machines coordinate through.

the 3 machines coordinate through a central lock manager. simple enough right?

let's build some intuition with a stupid queue

to understand remote locks better, let's set up a problem. imagine you have a message broker — a remote queue. but this queue is stupid. it gives you no guarantees whatsoever. it's not SQS, it's not kafka, it's nothing fancy. it's a mythical, stupid, unprotected queue.

what we want is that when one consumer reads from the queue, the other two should wait. once the first one is done, the next one gets a turn. that's it. we want multiple machines to coordinate so that only one accesses the queue at a time.

the consumer's pseudocode

at a high level, every consumer runs this loop:

ACQ_LOCK()        ←  acquire the lock
  READ_MSG()      ←  read, process, and delete the message
REL_LOCK()        ←  release the lock

all consumers wait on ACQ_LOCK() while one of them does READ_MSG(). once the active consumer releases the lock, the next one gets in. then the third one. and so on.

what do we need from the lock manager?

two core properties:

atomic operations — so that two machines don't acquire the lock simultaneously. when one consumer is setting the lock, no other consumer should be able to sneak in. no race conditions.

automatic expiration (TTL) — imagine consumer 1 acquires the lock, starts processing, and then dies mid-way. if there's no expiration, that lock is held forever. nobody can make progress. so we need a timeout that auto-releases the lock after some time in case there's no graceful deletion.

so which database gives us both atomicity and TTL? redis. it's the popular choice because it's in-memory, which means it's fast. dynamodb works too, but redis is what most people reach for.

✽ RECALL your lock manager has atomic operations but no TTL. what single event halts the whole system, and why?

a consumer acquires the lock and dies mid-processing. with no expiration that lock is held forever, every other consumer waits on acquire indefinitely, and nobody makes progress — one dead machine stops the world. TTL auto-releases the lock when there's no graceful deletion, so the system heals itself. atomicity alone only protects acquisition; it does nothing for liveness.

implementation with redis

the idea is straightforward. you set a key in redis that says which consumer holds the lock. the key is the queue id, and the value is the consumer id.

eg:    q7 : consumer2    [ex: 300]
        ↑                    ↑
   lock held by         expiration: 5 min
   consumer 2

the magic here is SET NX — set if not exists. if the key already exists, it means some other consumer holds the lock, and the command returns 0. if it doesn't exist, the key is set and it returns 1. each command in redis is atomic, so no two consumers can race past this.

acquire lock

python

def acquire_lock(q):
    consumer_id = get_my_id()

    while True:
        # SET q consumer_id NX EX 300: claim the key and attach the
        # 5 minute TTL in one atomic command. a separate SETNX + EXPIRE
        # would leave a lock with no TTL if we crashed in between.
        if redis.set(q, consumer_id, nx=True, ex=300):
            return
        # otherwise busy wait and try again

yep, it's busy waiting. not the most elegant, but it works. each consumer keeps trying SET NX in a loop. the moment it succeeds, the key and its TTL are already in place and the function returns. otherwise, it keeps spinning.

release lock

your first instinct might be to just delete the key:

python

def release_lock(q):
    redis.delete(q)

but think about this. consumer 1 acquires the lock, starts processing, and takes longer than expected. the TTL expires. the lock auto-releases. consumer 2 now acquires the lock and starts processing. consumer 1 finishes its work and calls release_lock — and blindly deletes the key. but that key now belongs to consumer 2. consumer 1 just deleted someone else's lock. now consumer 3 can waltz in and you've got two consumers processing simultaneously. chaos.

that's why we verify ownership before deleting:

python

def release_lock(q):
    consumer_id = get_my_id()
    v = redis.get(q)
    if v == consumer_id:
        redis.delete(q)

if the value in redis doesn't match my consumer id, i don't touch it. it's not my lock to release.

but there's one more subtlety here. the get and delete are two separate commands. what if between the get (which returns my consumer id) and the delete, the TTL expires, another consumer acquires the lock, and then my delete fires? same problem again. these two operations need to be atomic. in redis, you achieve this using EVAL — executing a lua script atomically on the server.

redis.eval("if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end", 1, q, consumer_id)

the lua script runs atomically on the redis server. no interleaving between the check and the delete. no race condition. clean.
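
to see the stale-release scenario play out without a redis server, here's a toy in-memory stand-in with a manual clock. `FakeLockStore` and its method names are hypothetical. `delete_if_owner` mimics what the lua script does, and single-threaded python is standing in for redis's atomicity.

```python
class FakeLockStore:
    """Tiny in-memory stand-in for redis, with a manual clock so TTLs can be simulated."""

    def __init__(self):
        self.now = 0.0
        self._data = {}                              # key -> (value, expires_at)

    def _get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self.now:    # missing or expired
            self._data.pop(key, None)
            return None
        return entry[0]

    def set_nx_ex(self, key, value, ttl):
        """SET key value NX EX ttl: succeeds only if the key is absent."""
        if self._get(key) is not None:
            return False
        self._data[key] = (value, self.now + ttl)
        return True

    def delete(self, key):
        self._data.pop(key, None)

    def delete_if_owner(self, key, value):
        """What the lua script does: compare and delete as one step."""
        if self._get(key) == value:
            del self._data[key]
            return True
        return False


store = FakeLockStore()
store.set_nx_ex("q7", "consumer1", ttl=300)
store.now = 301                                          # consumer1 overruns its TTL
store.set_nx_ex("q7", "consumer2", ttl=300)              # consumer2 takes over
released = store.delete_if_owner("q7", "consumer1")      # stale release is refused
still_locked = not store.set_nx_ex("q7", "consumer3", ttl=300)
```

swap `delete_if_owner` for the blind `delete` and consumer3 gets in while consumer2 is still working.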

✽ RECALL why does acquire need SET NX specifically — what breaks if a consumer does a GET to check the key and then a SET to claim it?

between the check and the set, a second consumer can run the same check, also see no key, and both end up "holding" the lock. SET NX collapses check-and-claim into one command, and since every redis command is atomic, exactly one consumer gets back 1 and everyone else gets 0. the entire acquire side rests on that single atomic check-and-set.

✽ RECALL consumer 1's TTL expired mid-processing and consumer 2 now holds the lock. what goes wrong if consumer 1's release just deletes the key — and why isn't plain get-then-delete enough either?

a blind delete removes consumer 2's lock, letting a third consumer in — two consumers now process simultaneously. get-then-delete still races: the TTL can expire between the get and the delete. the check and the delete must run as one atomic operation — a lua script via EVAL on the redis server.

where else do we see remote locks?

this pattern is everywhere in distributed systems. mongodb transactions, for example, use remote locks on the involved documents. when you run a multi-document transaction in mongodb, it acquires locks on the documents involved so that no other transaction can modify them concurrently. the mongos router coordinates these locks across shards — essentially the same remote locking pattern we just built. one central coordinator making sure multiple machines don't step on each other's toes. if you want to dig deeper into how mongodb handles this internally, their official docs on transactions and the wiredtiger storage engine's locking model are solid reads.

but there's a problem. what happens if your single redis node goes down? nobody can acquire the lock. you've got a single point of failure. and that's exactly why distributed locks exist.

distributed locks — redlock

the idea behind distributed locks is simple: what we did with a remote lock on one node, just distribute it across multiple nodes.

you have 5 independent redis master nodes. no replication between them. all standalone. the concept is that instead of acquiring a lock on one node, you acquire a lock on the majority of these nodes.

acquire lock (distributed)

python

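import random
from math import ceil

REDIS_SERVERS = [., ., ., ., .]                  # 5 standalone redis clients
QUORUM_COUNT = ceil(len(REDIS_SERVERS) / 2)     # 3 out of 5

# delete the key only if we still own it (same check as the single-node release)
RELEASE = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end"

def acq_lock(q, c):
    while True:
        count_acq = 0

        # visit the nodes in random order
        for node in random.sample(REDIS_SERVERS, len(REDIS_SERVERS)):
            if node.set(q, c, nx=True, ex=300):  # atomic SET NX with TTL
                count_acq += 1

        if count_acq >= QUORUM_COUNT:
            return                               # got the lock!

        # no majority: release whatever we did acquire, then try again
        for node in REDIS_SERVERS:
            node.eval(RELEASE, 1, q, c)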

the client goes through all 5 nodes, trying to SET NX on each. if it acquires the lock on the majority (3 out of 5), it's done — lock acquired. if not, it releases whatever locks it did acquire and tries again.

that last part is crucial. if you don't release the partial locks, you get deadlocks. imagine consumer 1 gets a lock on nodes A and B, consumer 2 gets C and D, consumer 3 gets E. nobody has the majority. so everybody releases and retries. on the next round, maybe consumer 2 gets A, B, and C — majority acquired.

release lock (distributed)

same lua script as before, but fired at all 5 nodes:

for node in REDIS_SERVERS:
    node.eval("if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end", 1, q, c)

fault tolerance — the whole point

so why did we go through all this trouble? because with distributed locking, we no longer have a single point of failure.

let's say consumer 2 holds the lock on nodes A, B, and C. if node D goes down, no problem — consumer 2 still holds the majority. if node E also goes down, still no problem — consumer 2 has 3 out of 3 remaining, but more importantly, it already holds 3 out of the original 5.

with 5 nodes, you can survive 2 node failures.

the quorum count trap

here's where people get confused. let's say nodes C and D go down. you think "well, 3 nodes remain, so the quorum should be 2 out of 3 now, right?"

hell naw.

look at the code again:

REDIS_SERVERS = [., ., ., ., .]
QUORUM_COUNT = ceil(len(REDIS_SERVERS) / 2)    # 3

REDIS_SERVERS is a static configuration. it's a hardcoded list that every consumer has when it starts up. it doesn't dynamically shrink when a node becomes unreachable. a node going down doesn't mean it gracefully removes itself from everyone's config — it just stops responding. the list still has 5 entries. len(REDIS_SERVERS) is still 5. QUORUM_COUNT is still 3.

and honestly, you wouldn't want it to change. think about why. the whole point of quorum is that any two majorities must overlap on at least one node. if you have 5 nodes and need 3, any two groups of 3 will share at least 1 node — which prevents two consumers from both thinking they own the lock. the moment you start shrinking the quorum dynamically ("oh only 3 nodes are alive, so 2 is enough"), you break that overlap guarantee, because consumers don't always agree on which nodes are alive. say a network hiccup means consumer 1 can only reach A, B, and C, while consumer 2 can only reach C, D, and E. each decides 2 is enough. consumer 1 grabs A and B, consumer 2 grabs D and E, and both think they have a majority of their "alive" set. that defeats the entire purpose.

so the quorum is fixed at 3 out of 5. if 2 nodes are down, you need all 3 remaining nodes to agree. that's harder, sure. but that's the price of correctness. and if 3 nodes go down? well, majority is mathematically impossible with only 2 nodes, so the system halts. no lock can be acquired. that's a feature, not a bug — better to stop than to hand out locks incorrectly.
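
the overlap argument is small enough to brute-force. a quick sketch that checks every pair of node groups of a given size; the node names and the `always_overlap` helper are just for illustration.

```python
from itertools import combinations

NODES = ["A", "B", "C", "D", "E"]

def always_overlap(nodes, quorum):
    """True if every pair of quorum-sized node sets shares at least one node."""
    groups = [set(g) for g in combinations(nodes, quorum)]
    return all(a & b for a, b in combinations(groups, 2))

fixed_quorum_safe = always_overlap(NODES, 3)    # majority of the original 5
shrunk_quorum_safe = always_overlap(NODES, 2)   # "2 is enough" after failures
```

with a quorum of 3 every pair of groups collides somewhere. with 2, groups like {A, B} and {D, E} never touch, and that gap is exactly where two lock holders slip through.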

✽ RECALL nodes C and D are down. how many of the 3 remaining nodes do you now need to acquire the lock — and why doesn't the quorum shrink to 2-of-3?

all 3. the quorum is ceil(5/2) = 3, computed from the static server list — a dead node doesn't remove itself from anyone's config. and shrinking it would break the overlap guarantee: any two majorities of the original 5 must share at least one node, which is exactly what prevents two consumers from both believing they hold the lock.

the tradeoff spectrum

here's where it gets interesting. there are three architectures, three different trade-off profiles. think of it like convincing friends to go to a restaurant.

remote lock (single node) — very high throughput because you only need to convince one friend to go to the restaurant. very high correctness because there's one source of truth. but if that node dies, game over. availability is gone.

remote lock with replica — you get throughput (just write to master) and you get availability (if master dies, replica takes over). but correctness takes a hit. imagine you acquire a lock on the master, and before it propagates to the replica via async replication, the master dies. the replica never got the lock entry. now two consumers think they own the lock. bad.

distributed lock (redlock) — you get high correctness because majority consensus means even if a node goes down, the lock state is safe. you get high availability for the same reason. but throughput suffers because now you're convincing 3 out of 5 friends to agree on a restaurant. consensus always slows things down.

as the number of redis nodes increases, your lock acquisition time goes up. and as the number of clients (consumers) contending for the lock increases, it gets even slower, because more attempts end without a majority and turn into release-and-retry rounds.

when do you use what?

there's no one right answer. it depends on context.

remote lock (single node) — most people use this. high throughput, simple setup, and they're okay with the availability risk. classic use case: consumer synchronization where lock contention is frequent and you need fast acquisition.

distributed lock (redlock) — this exists for situations where correctness during failures is absolutely critical. think database leader election. it doesn't happen frequently, but when it does, you cannot tolerate failures. if you give up on correctness during leader election, you get data inconsistency, data corruption — huge problems. so you're willing to pay the latency cost for the guarantee.

remote lock with replica — right in the middle. a little VC money, a little correctness, a little availability. sometimes good enough.

nothing is best. everything depends upon the use case and the constraints you're operating under. you pick one over another based on what trade-offs you're willing to make. you are paid to solve a problem and not necessarily use the fanciest architecture to solve it.

✽ RECALL one team needs locks for database leader election, another for high-contention queue consumers. which architecture do you hand each team, and what does each give up?

leader election → redlock: it's rare but correctness + availability during failure are non-negotiable, so you pay the consensus latency. queue consumers → single-node redis: contention is frequent, acquisition must be fast, and the SPOF risk is acceptable. the replica setup sits in the middle — and gives up correctness during async-replication failover.

Message queue

2026-01-03T00:00:00Z

queues are one of those things that sound dead simple from the outside. you push stuff in one end, you read it from the other end, fifo, done. but the moment you try to build anything real on top of a queue, you realize there are like five layers of decisions hiding under the hood. and one of the biggest myths people walk around with is that queues are fifo. they are not. not in any system you'd actually run in prod.

so let's actually master queues. async patterns, push vs pull mechanisms, why fifo breaks the moment you have more than one worker, and where kafka stops being a "nice to have" and becomes the only sane answer.

what you'll take away

quick pointers so you know what to look for as you read:

synchronous = whatever's slowest is your floor. one rate-limited dep can take down the whole flow. async is the escape hatch.
enqueue is cheap, processing is async. the api responds in 200ms, the email lands 6 hours later, and for non-critical work that's totally fine.
push vs pull is the real fork. push (rabbitmq) hides complexity in the broker. pull (sqs) gives you control and makes you write the rest.
dedup, retries, backoff — only "free" with push. in pull-land you write all three yourself. that's the price of control.
fifo is a lie when you scale out. multiple workers + variable processing time = out-of-order delivery, no matter what queue you're using.
kafka exists because ordering needs partitioning. key-based partitions guarantee order within a key, parallelism across keys.
mix queues by use case. notifications can be push-based, video pipelines pull-based, event log on kafka — nobody's stopping you.

the synchronous trap

let's start with something concrete. you're building a signup flow. user hits POST /signup, you've got a server, you've got postgres, and you want to send a welcome email. easy:

python

def signup(req):
    insert_user_in_db()
    send_welcome_email()
    token = create_token()
    return token

four steps, all in sequence. all synchronous. and at small scale this works perfectly fine — one user signs up, four steps fire, response goes back. ship it.

now bump the load to 10,000 signups in a minute. suddenly that send_welcome_email line is the problem. two things happen.

email service is an external dep. it's somebody else's box — aws ses, sendgrid, whatever. if it's slow, your signup is slow. if it's down, your signup is down. your signup now depends on an email server. that's a wild coupling for something as fundamental as user creation.

rate limits hit. email providers cap you. let's say ses gives you 25 emails per minute, which is honestly a healthy quota. but if 10,000 users hit signup in the same minute, you're firing 10,000 email calls into a 25-call budget. the rate limiter says no. your send_welcome_email starts erroring. and because the whole flow is synchronous, that error kills the entire signup. one external rate limit takes down user creation. unacceptable.

so what's actually critical here? insert_user_in_db — yes, without that there's no user. create_token — yes, the user needs a session. return response — obviously. but send_welcome_email? the user does not need that email in the same 200ms as their signup. if the email shows up 30 seconds later, who cares. if it shows up 5 minutes later, still fine. it's a non-critical task that's been hard-coupled into a critical path.

this is exactly where async processing exists.

the async escape hatch

instead of sending the email inline, you enqueue the work and let something else process it later.

python

def signup(req):
    insert_user_in_db()
    message_queue.enqueue(send_welcome_email)
    token = create_token()
    return token

three sync steps plus one cheap enqueue. signup is fast again. the queue holds the email job. some other process — a worker — is going to pick it up later and actually call the email service. that worker can respect the 25/min rate limit, sleep when it hits the cap, retry when slots open up. the queue absorbs all the spikiness.

so if you got 10,000 signups in a minute, your queue now has 10,000 pending email jobs. the worker picks 25/min and the queue drains slowly. user #1's email arrives in seconds. user #9,999's email might arrive more than 6 hours later (10,000 jobs ÷ 25 per minute = 400 minutes). and that's fine. welcome emails are not synchronous-critical. the same way youtube doesn't show your uploaded video instantly — it takes 10-15 minutes because some worker is processing it asynchronously. literally everywhere you look, the non-critical stuff is on a queue.
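
the drain time is simple division. a tiny sketch, assuming the worker sends at a steady 25 per minute with no retries or failures; the function name is made up.

```python
signups = 10_000
emails_per_minute = 25          # the rate limit assumed in this example

def minutes_until_sent(position: int) -> float:
    """Minutes a job waits if it sits at `position` (1-based) in the queue."""
    return position / emails_per_minute

first_wait = minutes_until_sent(1)                     # a couple of seconds
last_wait_hours = minutes_until_sent(signups) / 60     # the unlucky last user
```

the last job waits 400 minutes, so roughly six and a half hours. if that's too slow for your use case, the fix is a bigger quota or more providers, not going back to sync.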

this is async processing. and queues are how you implement it.

✽ RECALL your signup endpoint dies whenever the email provider rate-limits. what's the actual design mistake, and what's the fix?

a non-critical task (the welcome email) got hard-coupled into the critical path — synchronous means the slowest, flakiest dependency sets your floor, so one external rate limit takes down user creation itself. fix: enqueue the email job (enqueue is cheap) and return; a worker drains the queue at the provider's pace, the queue absorbs the spike, and the email lands minutes or hours later — fine, because it was never synchronous-critical.

the two ends of every queue

every queue has a producer side and a consumer side. producer is the easy part. doesn't matter what queue you're using — rabbitmq, kafka, sqs, bullmq — putting things in is always a one-liner. you call some sdk method, you give it a payload, it goes in. no surprise.

the consumer side is where it gets spicy. how does that worker actually get the next message? this is where queues split into two completely different worlds.

push vs pull — the fundamental fork

every queue is either push-based or pull-based. this single decision changes how you write the consumer, how you handle failures, how dedup works, and where the bottlenecks land.

push-based — the broker drives

push-based means the queue itself decides who gets which message and when. rabbitmq is the canonical example. here's the dance.

a worker spins up and the first thing it does is register itself with rabbitmq. it's basically saying "hey, i'm alive, i can process messages — when you have one, send it my way." rabbitmq writes that down in some internal map: worker-7 is available. the worker then sits there and waits.

every 30 seconds (or whatever the heartbeat interval is), the worker pings rabbitmq with a heartbeat. "still alive. still alive. still alive." if rabbitmq stops hearing those for more than its tolerance, it marks the worker dead and stops sending it messages. without heartbeats, a crashed worker would still be receiving work that nobody's processing — that's how you'd lose messages.

now when a producer pushes a message, rabbitmq looks at its registered live workers, picks one, and pushes the message at it. the worker didn't ask. it didn't poll. the message just shows up on its socket.

what's good about this:

consumer code is dead simple. you literally write a callback onMessage(msg) and rabbitmq invokes it. no polling loop. no "is there anything yet?" logic.
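deduplication is built-in. since one broker is choosing who gets what, it won't hand the same in-flight message to two workers at once. one source of truth. the one exception is redelivery: a message that was never acked gets sent again, so handlers should still be safe to run twice.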
retries are built-in. if a worker takes the message but never acks it within some window, rabbitmq notices and re-queues it for someone else. you didn't have to write that.

what's not good:

the broker is doing a lot of work. one box is tracking every worker, every heartbeat, every assignment. that's a real bottleneck and it slows down at scale.
you give up control. you can't say "i want to process at exactly 50 msg/sec" — the broker decides cadence.
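
to make the push model concrete, here's a toy broker that tracks heartbeats and pushes each message to a live worker. this is not rabbitmq's api or its internals, just the shape of the idea. `PushBroker`, the 60 second tolerance, and the round-robin choice are all my assumptions, and acks and retries are left out.

```python
class PushBroker:
    """Toy push-based broker: tracks heartbeats and dispatches to live workers."""

    def __init__(self, heartbeat_timeout=60):
        self.now = 0.0                     # manual clock for the demo
        self.timeout = heartbeat_timeout
        self.workers = {}                  # worker_id -> (callback, last_heartbeat)
        self._turn = 0

    def register(self, worker_id, on_message):
        self.workers[worker_id] = (on_message, self.now)

    def heartbeat(self, worker_id):
        callback, _ = self.workers[worker_id]
        self.workers[worker_id] = (callback, self.now)

    def publish(self, msg):
        live = sorted(w for w, (_, seen) in self.workers.items()
                      if self.now - seen <= self.timeout)
        if not live:
            raise RuntimeError("no live workers")
        chosen = live[self._turn % len(live)]      # simple round-robin
        self._turn += 1
        self.workers[chosen][0](msg)               # push: the worker never polled
        return chosen


received = {"w1": [], "w2": []}
broker = PushBroker()
broker.register("w1", received["w1"].append)
broker.register("w2", received["w2"].append)
broker.publish("a")          # both alive, w1 is first in line
broker.now = 100
broker.heartbeat("w2")       # only w2 keeps pinging
broker.publish("b")          # w1 missed its heartbeats, so w2 gets it
```

notice how much state lives in the broker (every worker, every timestamp, whose turn it is). that bookkeeping is the bottleneck the list above is talking about.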

✽ RECALL in a push-based queue like rabbitmq, what is the broker actually doing for you — and what's the price?

it tracks registered workers via heartbeats, decides who gets each message (so dedup is free by construction — one source of truth can't hand the same message to two workers), and re-queues anything that isn't acked in time (retries free). the price: one box doing all that bookkeeping becomes a real bottleneck at scale, and you surrender control — the broker decides cadence, not you.

pull-based — you drive

pull-based flips it. the queue does nothing on its own. it just sits there. workers have to actively go fetch.

sqs (amazon's simple queue service) is the textbook pull-based system. think of it as two apis: /push to put a message in, /pull to ask for one (SendMessage and ReceiveMessage in the real api). that's the whole interface. no registration. no heartbeats. no broker choosing assignments. workers just call /pull whenever they want.

so your worker code looks like:

while True:
    msg = sqs.pull()
    if msg:
        process(msg)
        sqs.delete(msg)
    else:
        sleep(backoff)

a setInterval in js land. a while True in python. polling, basically.

now everything that was free in rabbitmq becomes your problem.

dedup is your problem. if three workers all call /pull at almost the same instant, sqs might hand the same message to multiple of them. you have to handle that yourself. typical fix is a redis lock — before processing message id 7, acquire a lock on lock:msg:7. if you can't get it, skip, somebody else has it. it works, but you wrote it.

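retries are your problem. if you pulled a message and crashed mid-process, no broker is watching your worker to hand the job to someone else. sqs will make the message visible again once its visibility timeout lapses, but how long that timeout is, how many attempts a message gets, and whether failures land in a dead-letter queue (DLQ) for inspection are all yours to configure. sqs gives you primitives, not policies, so you wire it.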

backoff is your problem. if your queue is empty and you're polling every minute, you're burning api calls (and sqs literally charges you per call). after 20 empty polls, slow down — go to every 5 minutes. after another 20 empties, every 15. the moment you find a message again, snap back to fast polling. you have to write that backoff strategy. rabbitmq workers don't have this problem because the broker pushes when there's something to push.
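the tiers described above can be sketched as a tiny state machine. this is one possible shape, not a standard sqs helper: `PollBackoff` and its method names are made up for illustration, and the intervals (1, 5, 15 minutes) and the 20-empty-poll thresholds are just the numbers from the paragraph.

```python
class PollBackoff:
    """Tracks consecutive empty polls and picks the next sleep interval."""

    # (empty polls needed to reach this tier, seconds to sleep between polls)
    TIERS = [(0, 60), (20, 300), (40, 900)]

    def __init__(self):
        self.empty_streak = 0

    def record(self, got_message):
        # Any message resets the streak, snapping back to fast polling.
        self.empty_streak = 0 if got_message else self.empty_streak + 1

    def interval(self):
        seconds = self.TIERS[0][1]
        for threshold, tier_seconds in self.TIERS:
            if self.empty_streak >= threshold:
                seconds = tier_seconds
        return seconds


backoff = PollBackoff()
for _ in range(20):
    backoff.record(got_message=False)
slow = backoff.interval()       # 300: twenty empties in a row, slow down
backoff.record(got_message=True)
fast = backoff.interval()       # 60: found work, back to fast polling
```

in the worker loop you'd call `record(...)` after every poll and sleep for `interval()` seconds on an empty one.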

so why would anyone use pull?

control. all the control is yours. you decide cadence, concurrency, retry policy, backoff, dedup strategy. for some workloads — high-throughput batch processing, video pipelines, anything where you want to tune carefully — that control is gold. and there's no single broker bottleneck because there's no broker doing dispatch.

tbh i lean pull-based for most production stuff. bullmq (which sits on redis) and sqs are pull-based and they're what i reach for. bullmq gives you nice abstractions on top so you barely notice the polling. sqs is full raw — you implement everything — but it's a fun engineering problem to solve and the operational overhead is basically zero. on a recent project of mine all the queues are sqs with a while True loop, backoff strategies, retry handling, the works. it's a vibe.

✽ RECALL you move from rabbitmq to sqs. which three things just became your problem, and why?

dedup — concurrent /pull calls can hand the same message to multiple workers, so you write a redis lock per message id. retries — crash mid-process and you re-queue it yourself or wire a dead-letter queue. backoff — empty polls burn api calls (sqs charges per call), so you write the slow-down-then-snap-back polling strategy. sqs gives you primitives, not policies; that's the price of control, and the reward is you tune cadence, concurrency, and retry exactly how you want, with no broker bottleneck.

now the spicy part — queues are not fifo

ask anyone what a queue is and they'll say "first in, first out." and yeah, that's the textbook data structure definition. but in a real distributed system, queues do not give you fifo. let me show why.

say you're building an sms onboarding flow that sends a sequence: welcome → product tour → 3-day-later check-in. order matters. the user can't get the check-in before the welcome. so you push them in order: msg 1, 2, 3, 4, 5, 6 into the queue.

if you only had one worker, fine. but you don't — you have multiple. that's literally the whole point of horizontal scaling. so worker A grabs msg 1, worker B grabs 2, worker C grabs 3, A grabs 4, B grabs 5, C grabs 6.

now what actually finishes first?

worker A's process for msg 1 might be slow (db hiccup, network jitter, gc pause, whatever). worker B finishes msg 2 first, so msg 2 reaches the user first. then worker C finishes msg 3. then worker A finally finishes msg 1. and so on.

end result: the user got messages in the order 2, 3, 1, 5, 4, 6.

the queue is fifo on the way in. it's not fifo on the way out. the moment you have more than one worker — and you always do, because that's literally what queues are for — sequential delivery is gone.
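you can reproduce that exact scramble with a few lines of simulation. this is a toy model, not a real queue: messages are handed out round-robin, each worker processes its own messages one after another, and the processing times are invented to match the story above.

```python
def completion_order(durations, num_workers):
    """Return message ids in the order they finish processing.

    durations: list of (msg_id, seconds_to_process), in push order.
    Messages are assigned round-robin; each worker handles its own serially.
    """
    worker_free_at = [0.0] * num_workers
    finished = []
    for i, (msg_id, seconds) in enumerate(durations):
        worker = i % num_workers
        done_at = worker_free_at[worker] + seconds
        worker_free_at[worker] = done_at
        finished.append((done_at, msg_id))
    return [msg_id for _, msg_id in sorted(finished)]


# msg 1 is the slow one (think db hiccup on worker A)
durations = [(1, 5), (2, 1), (3, 2), (4, 3), (5, 5), (6, 7)]
order = completion_order(durations, num_workers=3)    # -> [2, 3, 1, 5, 4, 6]
single = completion_order(durations, num_workers=1)   # -> [1, 2, 3, 4, 5, 6]
```

same input order both times. the only thing that changed is the number of workers.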

for the welcome → tour → checkin flow, this is genuinely broken. user gets the product tour first, then the check-in, and only then the welcome. they're confused. you shipped a bug.

so how do you actually preserve order when you need it? hell naw to "just use one worker" — you need parallelism and ordering. that combo is the hard problem.

✽ RECALL you pushed messages 1–6 into a queue in order. why does the user still receive them out of order?

multiple workers plus variable processing time. the queue is fifo on the way in, but worker A can stall on msg 1 (db hiccup, gc pause, network jitter) while worker B finishes msg 2 — delivery order is gone. and you always have multiple workers, because parallelism is the entire point of the queue, so single-queue fifo is a myth at any real scale regardless of which queue tech you use.

kafka — ordering inside a partition

this is exactly where kafka shines, and where the other queue systems quietly fall short. kafka introduces partitions and keys.

instead of one queue with everything mixed together, kafka splits the topic into N partitions. when you push a message, you attach a key — usually something like the user id. kafka hashes that key and sends the message to one specific partition. all messages with the same key always land in the same partition.

then on the consumer side, kafka has consumer groups. each partition is consumed by exactly one consumer in the group. so partition 0 → consumer A, partition 1 → consumer B, partition 2 → consumer C.

what you get out of this:

parallelism — different users' messages go to different partitions, processed concurrently by different consumers
order within a key — all of user 1's messages live in the same partition, processed by the same consumer, in the order they were pushed

so user 1 gets welcome → tour → checkin in the right order. user 2's messages can interleave with user 1's because they're on a different partition, but user 2's own sequence is preserved too. ordering where it matters, parallelism where it doesn't.
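here's the key-to-partition idea as a sketch. it is not kafka's actual partitioner (kafka's default uses a murmur2 hash of the key); md5 is swapped in only because it's in the standard library and stable across runs. `partition_for` and `route` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key onto a partition index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def route(messages, num_partitions):
    """Append each (key, payload) to its partition, preserving push order."""
    partitions = defaultdict(list)
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


pushed = [
    ("user-1", "welcome"), ("user-2", "welcome"),
    ("user-1", "tour"), ("user-2", "tour"),
    ("user-1", "check-in"), ("user-2", "check-in"),
]
partitions = route(pushed, num_partitions=3)

# every message for user-1 sits in one partition, still in push order
user1_partition = partitions[partition_for("user-1", 3)]
user1_sequence = [payload for key, payload in user1_partition if key == "user-1"]
```

since one consumer reads a partition front to back, `user1_sequence` is the order user 1 actually experiences, no matter how many other partitions are being consumed in parallel.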

this is why people reach for kafka in event-sourcing setups, in pipelines that care about per-entity ordering, in anything where "process this user's stuff in order" is a real requirement. rabbitmq and sqs don't have this concept first-class. kafka does, and it's beautifully designed.

if you want to actually get good at this, kafka's consumer groups concept is worth a separate deep-dive — partition rebalancing, offsets, what happens when a consumer dies, the works. there's a lot under the hood.

✽ RECALL you need parallelism and per-user ordering. why can't a plain queue give you both, and what's the shape of the system that can?

with one shared queue any worker grabs any message, so ordering and parallelism fight each other — fixing order means one worker, fixing throughput means many. the fix is partitioning by key: all of one user's messages land in the same partition, consumed by one consumer in pushed order, while other users' partitions process in parallel. order within a key, parallelism across keys — the model kafka makes first-class and rabbitmq/sqs don't.

the actual lesson here

queues are not one thing. they're a whole spectrum of design decisions:

sync vs async architecture is the first call. async wins for any non-critical work.
push vs pull is the second call. push for simple consumers and built-in dedup. pull for control and tunability.
single-queue fifo is a myth at any real scale.
when ordering matters, you're either reaching for kafka or rolling your own partitioning logic on top.

and here's the philosophy. there is no best queue. rabbitmq is great when you want simple. sqs is great when you want zero ops and full control. kafka is great when ordering and replay matter. bullmq is great when you're already on redis. you can mix and match — your notification system can be push-based, your video pipeline can be pull-based, your event log can be on kafka. nobody's stopping you. you are paid to solve a problem and not necessarily use the fanciest queue to solve it.

✽ RECALL someone asks which queue is "best" for the new system. what's the right answer?

there isn't one — match the mechanism to the workload. push when you want dead-simple consumers with free dedup and retries, pull when you want control and near-zero ops, kafka when per-key ordering or replay actually matters. and mix them freely in one architecture: push-based notifications, pull-based video pipeline, event log on kafka. you're paid to solve a problem, not to use the fanciest queue.

scaling reads

2025-12-05T00:00:00Z

scaling reads sounds boring on the tin and then you realize it's basically every system design problem in disguise. the moment your app gets even a little popular, reads outpace writes by 10×, then 100×, then 1000× — and your single database starts crying. this post walks through the actual ladder you climb to handle it: from "just add an index" all the way up to global CDNs, with the trade-offs that bite you at each rung.

think about your instagram feed. you open the app, and bam — dozens of photos, each one pulling its own metadata, user info, like counts, comment previews. one feed load might fire 100+ db queries. meanwhile what'd you actually do? you posted one photo. that's a single write. one tweet → thousands of reads. one product upload → hundreds of browses. youtube serves billions of video views per day on top of millions of uploads. textbook ratio starts at 10:1 reads-to-writes but content-heavy apps are prolly looking at 100:1 or 1000:1.

and here's the problemmmm — when reads pile up, your db isn't slowing down because of bad code. it's physics. CPU cores execute a finite number of instructions per second, RAM holds a finite amount, disk I/O is bounded by what the SSD can physically push. when you hit those walls, no amount of clever code is gonna help. you've gotta change the architecture.

what you'll take away

quick pointers so you know what to look for as you read:

reads always outgrow writes. plan the system around reads first.
start cheap, climb only as far as you need. indexes → denorm → replicas → cache → CDN.
composite + covering indexes beat denormalization most of the time. cheaper, simpler, no consistency drama.
replicas buy throughput, caches buy latency. different problems, different tools.
cache invalidation is where the bodies are buried. TTL is a safety net, not a strategy.
physics wins eventually. at extreme scale you need versioned keys, request coalescing, probabilistic refresh.
the right answer is sometimes "we don't need to scale yet." modern hardware is wild.

the ladder

read scaling is a ladder. each rung adds operational pain in exchange for more throughput:

optimize within your db — indexes, composite/covering indexes, denormalization
scale horizontally — read replicas, then sharding when you must
add caches — Redis/Memcached, then CDNs at the edge

let's walk it.

optimize within your database

before you reach for new infrastructure, your existing db almost always has more headroom than you think. modern postgres or mysql is not a toy — with the right schema and indexes you can squeeze tens of thousands of QPS out of a single box.

indexes — always step one

an index is just a sorted lookup table that points back into your real data. think of the index in the back of a textbook — instead of flipping every page hunting for "B-tree", you check the index, see "page 184", jump straight there.

without one, your db does a full table scan — reads every row to find what you're after. with one, it jumps directly. that's O(n) becoming O(log n) — the difference between scanning a million rows and checking maybe 20 entries in a tree. most general-purpose indexes are B-trees. there are specialized ones — hash for exact matches, GIN/GiST for full-text or geo — but B-tree is your default.
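to make the "maybe 20 entries" number concrete, here's a binary search over a sorted list that counts how many entries it looks at. a real B-tree has wide nodes and even fewer levels, so treat this as intuition for O(log n), not a model of how postgres or mysql actually walk an index.

```python
def binary_search_steps(sorted_values, target):
    """Return (index or None, number of entries probed)."""
    lo, hi = 0, len(sorted_values) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_values[mid] == target:
            return mid, probes
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


rows = list(range(1_000_000))   # pretend these are a million indexed keys
index, probes = binary_search_steps(rows, 999_999)
# a full scan for the same key would touch all 1,000,000 entries;
# probes here never exceeds 20, because 2**20 > 1,000,000
```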

so the first move for any read scaling problem? add indexes on the columns you query, sort by, or join on. social app filters posts by hashtag → index hashtag. sort products by price → index price. dead simple.

old textbooks freak out about "too many indexes slowing down writes." this fear is way overblown for modern hardware. yes there's some write overhead per index, but under-indexing kills more apps than over-indexing ever will. add the indexes you need. don't be cute about it.

composite + covering indexes — the cheaper alternative to denormalization

before you reach for denormalization (which is messy, more in a sec), see whether a composite index can solve it first. composite means an index across multiple columns. great for queries that filter or sort on more than one thing.

 
SELECT post_id, post_title FROM posts
WHERE user_id = ? AND created_at > ?
ORDER BY created_at DESC;

a composite index on (user_id, created_at) lets the db satisfy the entire WHERE and the sort using just the index. push it further with a covering index, (user_id, created_at, post_title), and the index contains every column the query needs (post_id rides along for free in engines like mysql's innodb that store the primary key in every secondary index; in postgres you'd add it to the index yourself). the db doesn't even touch the table. people call this an "index-only scan" and it's stupidly fast.

a few rules. column order matters — put the most selective filter first. (user_id, created_at) helps queries on user_id alone, but not on created_at alone. sort order is free if you align it — if your ORDER BY matches the column order in the index, the db skips the sort step entirely. and don't over-cover — every column inflates write cost and storage.
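you can watch a covering index kick in without standing up a server. this sketch uses python's built-in sqlite3 as a stand-in, so the plan text is sqlite's wording, not what postgres or mysql print; the table layout and index name are made up for the example. sqlite keeps the integer primary key inside every index, which is why post_id doesn't need to be listed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    " post_id INTEGER PRIMARY KEY,"
    " user_id INTEGER, created_at TEXT, post_title TEXT, body TEXT)"
)
conn.execute(
    "CREATE INDEX idx_posts_user_created_title"
    " ON posts (user_id, created_at, post_title)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT post_id, post_title FROM posts"
    " WHERE user_id = ? AND created_at > ?"
    " ORDER BY created_at DESC",
    (42, "2025-01-01"),
).fetchall()

# each plan row is (id, parent, notused, detail); detail is the readable part
plan_text = " | ".join(row[3] for row in plan)
print(plan_text)
```

the plan should mention a covering index, meaning the query was answered from the index alone. add `body` to the SELECT list and that word disappears, because the db now has to visit the table for it.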

why does this matter? because composite/covering indexes often kill the need for denormalization. denormalization brings storage bloat AND consistency headaches. a composite index brings just write cost. that's a way better trade. try this first. denormalize only when no index can satisfy the query.

hardware — boring but it works

sometimes the answer is just bigger hardware. swap spinning disks for SSDs, get 10–100× faster random I/O. add RAM, more of your dataset stays in memory. add cores, handle more concurrent queries.

won't solve every problem but it buys breathing room, fast. real limits though — you'll hit the ceiling on the biggest box your cloud provider offers, and a single failure takes the whole thing down. it's a stopgap that buys time to do the real architectural work.

denormalization — when no index can save you

normalization splits data across tables to avoid duplication. nice for storage, ugly for reads — joins everywhere. for read-heavy systems where joins start eating CPU, denormalization flips the script: you intentionally duplicate data to make reads single-table.

classic e-commerce. normalized version joins users, orders, order_items, products:

 
SELECT u.name, o.order_date, p.product_name, p.price
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.id = 12345;

four tables, three joins. fine at small scale, painful at thousands of order pages per second. denormalized: just have an order_summary table with everything inline.

 
SELECT user_name, order_date, product_name, price
FROM order_summary
WHERE order_id = 12345;

one table, one row, done.

yes you're storing the user name redundantly. yes that's storage cost. for a read-heavy system that's often worth it. the catch — when a user changes their name, you've gotta update it everywhere. that's the consistency tax, and every place a denormalized field lives is a place that can drift out of sync if your write path has a bug.

rule of thumb: only denormalize when reads vastly outnumber writes. if writes are frequent the consistency complexity prolly isn't worth it.

materialized views are denormalization for aggregations — instead of recomputing the average rating on every product page load, you precompute once and store the result. typically refreshed by a background job.

 
CREATE MATERIALIZED VIEW product_ratings AS
SELECT p.id AS product_id, AVG(r.rating) AS avg_rating
FROM products p
JOIN reviews r ON p.id = r.product_id
GROUP BY p.id;

✽ RECALL your joins are eating cpu on a read-heavy table. why try a covering index before reaching for denormalization?

a covering index puts every column the query needs into the index itself — the db does an index-only scan and never touches the table. that buys most of denormalization's read win at the cost of just some write overhead, while denormalization adds storage bloat plus a consistency tax: every duplicated field is a place that can drift out of sync. denormalize only when no index can satisfy the query.

scale your database horizontally

at some point one machine isn't enough. rough rule of thumb: above about 50–100k indexed reads per second, you've gotta either add a cache or distribute across more boxes. exact numbers vary based on hardware and query patterns, but that's the ballpark.

read replicas — leader/followers

simplest horizontal play. you keep your primary db (the leader) handling all writes, spin up extra copies (followers) that get every write replicated to them. reads can go to any follower. read throughput multiplies.

bonus: redundancy. if the leader dies, you promote a follower. minimal downtime.

the trade-off is replication lag. when you write to the leader, it takes some time before followers see that write. read-your-own-write becomes weird — a user updates their profile, the request hits a follower that's a beat behind, and they see their old name. classic gotcha.

so you've got a choice. synchronous replication — leader waits for followers to confirm before acking. fully consistent, but writes are bottlenecked by the slowest follower. asynchronous — leader acks the write, replicates in the background. fast writes, real lag window. most production systems are async by default and just accept the staleness. the ones that need fresh reads after writes route critical reads back to the leader.
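one common way to do that selective routing is to pin a user's reads to the leader for a short window after they write. a minimal sketch, assuming a 5-second window is comfortably longer than your replication lag; `ReadRouter` and its method names are hypothetical, and the returned strings stand in for real connection pools.

```python
import time


class ReadRouter:
    """Sends a user's reads to the leader for a short window after they write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self._clock = clock
        self._last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self.pin_seconds:
            return "leader"    # followers may not have this user's write yet
        return "follower"      # everyone else tolerates a little staleness


router = ReadRouter()
router.record_write("alice")
alice_target = router.target_for_read("alice")  # "leader"
bob_target = router.target_for_read("bob")      # "follower"
```

only the user who just wrote pays the leader round-trip. everybody else keeps reading from followers.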

sharding — when one box can't even hold the data

read replicas help when throughput is the issue. they don't help when the dataset is the issue. if you have 50TB of data and a single instance can't even store it, you need sharding — splitting the data itself across multiple databases. two common ways:

functional sharding — split by domain. user data in one db, product data in another, likes in a third.

geographic sharding — split by region. US users in US dbs, EU users in EU dbs. lower latency, less load on any single instance.

real talk though: sharding is operationally a beast — cross-shard queries, rebalancing pain, distributed transactions, hot shards. it's primarily a write scaling technique. for read scaling specifically, caching is almost always the better play.

✽ RECALL a user updates their profile and immediately sees their old name. what happened, and what are your options?

replication lag — the read hit an async follower that hadn't applied the write yet. you can go synchronous (fully consistent, but every write waits on the slowest follower), or stay async and route read-your-own-write critical reads back to the leader. most production systems pick async plus selective leader reads, accepting staleness everywhere else in exchange for fast writes.

add caches

you've optimized the db, you've added replicas. you still need more. now you reach for cache.

most apps have heavily skewed access patterns. millions read the same viral tweet. thousands hit the same popular product. the same trending video gets pulled millions of times. you're literally executing the same query over and over to return the same result. that's a caching layup.

caches store hot data in memory. databases read from disk and run query planners. caches just hand you the bytes back. the gap is sub-millisecond cache reads vs tens of milliseconds for even a well-tuned query. orders of magnitude.

application-level caching — Redis or Memcached

stick a Redis or Memcached instance between your app and your db. on every read, check the cache first. hit? return immediately. miss? query the db, populate the cache, return.

popular data naturally stays hot. celebrity profiles get hit constantly so they live in cache forever. inactive profiles get cached only when accessed and expire after their TTL. the system self-tunes.
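that read path (often called cache-aside) fits in a dozen lines. this sketch uses a plain dict where redis or memcached would sit, and `load_from_db` is whatever function runs your real query; both are stand-ins, and the 300-second ttl is arbitrary.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, load from the db and remember it."""

    def __init__(self, load_from_db, ttl_s=300, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                       # hit: no db work at all
        value = self._load(key)                   # miss or expired: query the db
        self._entries[key] = (value, self._clock() + self._ttl)
        return value


db_calls = []

def load_profile(user_id):
    db_calls.append(user_id)                      # record every trip to the "db"
    return {"id": user_id, "name": f"user-{user_id}"}

cache = CacheAside(load_profile)
cache.get(1)
cache.get(1)
cache.get(1)
# three reads, but db_calls == [1]: only the first one reached the db
```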

now the hard part. cache invalidation is genuinely one of the trickiest things in software. when underlying data changes, you've gotta make sure the cache doesn't keep serving the old version. main strategies:

TTL — fixed lifetime per entry. simple, but you accept staleness up to the window. great for data with predictable update cadence.
write-through — delete or update the cache the moment you write to db. consistent, but adds latency to every write and you've gotta handle the case where db write succeeds but cache invalidation fails.
write-behind — queue invalidation events, process async. faster writes, brief stale window.
tagged — group entries by tag (user:123:posts), invalidate by tag when related data changes. powerful but you've gotta track relationships.
versioned keys — encode a version into the key. bump on writes, old entries become unreachable. clean, no race conditions — more below.

most production systems combine approaches. short TTLs (5–15 min) as a safety net plus active invalidation for critical data like profiles or inventory. low-stakes data (recommendation scores, view counts) can lean entirely on TTL.

drive your TTL from a product requirement. if the spec says "search results can be at most 30 seconds stale", your TTL is 30 seconds. let the requirement set the consistency budget.

CDN + edge caching

CDNs extend caching to globally-distributed edge servers. originally just for static assets — images, CSS, JS — but modern CDNs cache dynamic content too: API responses, query results.

the latency win is dramatic. a user in tokyo hitting your origin in virginia is doing a 200ms round-trip. hitting a tokyo CDN edge? 10ms. that's a different category of fast.

for read-heavy apps, CDN caching can wipe out 90%+ of origin load. product pages, public profiles, search results — anything multiple people request — is a candidate. trade-off is invalidation across many edge locations, gnarly when you need it but worth the engineering for the win.

CDNs only make sense for content shared across users. don't cache user-specific stuff like personal settings or private messages — there's no hit-rate benefit when only one user ever requests it.

✽ RECALL you already have read replicas. what does a cache buy you that another replica doesn't?

latency, not just throughput — a cache hands back bytes from memory in sub-millisecond time, while even a well-tuned replica query pays disk reads and a query planner for tens of milliseconds. and access patterns are heavily skewed: millions of reads hammer the same viral content, so serving the identical result from memory instead of re-executing the same query is the whole win. replicas multiply query capacity; caches change the category of fast.

✽ RECALL why is "just set a TTL" not an invalidation strategy, and what do production systems actually do?

TTL alone means accepting staleness up to the full window for everything, including data where stale is unacceptable. production systems combine: short TTLs as a safety net plus active invalidation (write-through, tagged, or versioned keys) for critical data like profiles and inventory, while low-stakes data leans on TTL alone. and the TTL itself should come from a product requirement — "at most 30 seconds stale" means a 30-second TTL.

applying this in real systems

most production systems eventually need read scaling somewhere. the discipline is figuring out where. walk through your API endpoint by endpoint and identify the high-volume ones, because that's where the work goes. then climb the ladder in order: query optimization, then replicas, then caching.

what makes a system robust is identifying read bottlenecks proactively, before the pager goes off. when sketching a new feature's API, pause at endpoints that'll get hammered. how often will this be called? does it need to be fresh? is the data shared across users? the answers tell you exactly which tools to reach for.

a few patterns that show up over and over:

URL shorteners — extreme read/write skew. one URL shortened once, accessed millions of times. caching dream. cache the short→long mapping in Redis with no expiration, CDN globally.
event ticketing — event pages get crushed when tickets drop. cache event details, venue info, seating charts aggressively. but never cache actual seat availability — you'll oversell.
social feeds — pre-compute feeds for active users, cache recent posts from followed users. users mostly only read the first page so aggressive caching pays off.
video platforms — cache metadata aggressively (titles, descriptions, thumbnails change rarely). view counts can be eventually consistent. CDN every thumbnail.

and where this whole playbook doesn't apply: write-heavy systems like Uber's location tracking or IoT sensors (focus on writes first), tiny scale where a single indexed db handles everything (don't over-engineer), strongly consistent systems like financial transactions (you can still cache, but with aggressive invalidation), and real-time collab like Google Docs (caching actively hurts).

✽ RECALL before adding any infrastructure, how do you decide where read-scaling work actually goes — and when is the answer "nowhere yet"?

walk the api endpoint by endpoint and find the high-volume ones, asking three questions: how often is it called, does it need to be fresh, is the data shared across users. then climb the ladder in order (query optimization, then replicas, then caching) without skipping rungs: check whether a composite index solves it for free before reaching for redis. and skip the playbook entirely for write-heavy systems, tiny scale, or real-time collab where caching actively hurts.

the gnarly edge cases

a few specific failure modes show up over and over once a system gets real traffic. worth knowing how to spot them.

"queries got slower as the data grew"

your app launched with 10k users and queries were instant. now you've got 10 million users and a simple lookup takes 30 seconds. CPU pinned at 100%. simple queries, nothing fancy.

the answer is almost always a missing index. without one, every query does a full table scan. 10 million rows scanned to find one user by email. multiply by hundreds of concurrent logins and the db spends all its time reading disk.

 
-- before: full table scan
EXPLAIN SELECT * FROM users WHERE email = 'user@example.com';
-- Seq Scan on users (cost=0.00..412000.00 rows=1)

CREATE INDEX idx_users_email ON users(email);

-- after: index scan
-- Index Scan using idx_users_email (cost=0.43..8.45 rows=1)

for compound queries, column order in the index matters. for WHERE status = ? AND created_at > ?, an index on (status, created_at) helps queries on status alone and queries on both — but won't help queries filtering only by created_at.
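the leftmost-prefix rule is easy to verify yourself. this sketch again leans on python's built-in sqlite3 as a stand-in, so the plan wording is sqlite's and the `orders` table and index name are invented; the same prefix behavior applies to b-tree indexes in postgres and mysql, though their planners print it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY, status TEXT, created_at TEXT, note TEXT)"
)
conn.execute("CREATE INDEX idx_status_created ON orders (status, created_at)")


def plan_for(where_clause, params):
    """Return sqlite's query plan for SELECT * FROM orders WHERE <where_clause>."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE " + where_clause, params
    ).fetchall()
    return " | ".join(row[3] for row in rows)  # row[3] is the readable detail


status_only = plan_for("status = ?", ("paid",))
both = plan_for("status = ? AND created_at > ?", ("paid", "2025-01-01"))
created_only = plan_for("created_at > ?", ("2025-01-01",))

for label, plan in [("status", status_only), ("both", both), ("created_at", created_only)]:
    print(f"{label:>10}: {plan}")
```

the first two plans search via the index; the third falls back to scanning the table, because created_at is not a leading column of the index.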

"millions of concurrent reads to the same key"

celebrity drops a post. millions try to read the same cached entry simultaneously. your cache server, which normally handles 50k qps, is suddenly looking at 500k qps for ONE key. it starts timing out. site goes down — purely from read traffic.

the problem is that traditional caching assumes load distributes across many keys. when everyone wants ONE key, the assumption breaks. even though the data is in memory, serializing it and sending it over the network 500k times per second melts the cache server.

fix 1: request coalescing. when multiple requests hit the same key on the same server, combine them into a single backend request. one fetch, broadcast the result.

 
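import asyncio

class CoalescingCache:
    def __init__(self):
        # key -> Future shared by every caller currently waiting on that key
        self.inflight = {}

    async def get(self, key):
        # someone is already fetching this key: wait for their result
        if key in self.inflight:
            return await self.inflight[key]
        future = asyncio.get_running_loop().create_future()
        self.inflight[key] = future
        try:
            # fetch_from_backend is your own loader (db or upstream cache)
            value = await fetch_from_backend(key)
        except Exception as exc:
            # wake the waiters with the error instead of leaving them hanging
            future.set_exception(exc)
            future.exception()  # mark it retrieved so asyncio doesn't warn if nobody waited
            raise
        else:
            future.set_result(value)
            return value
        finally:
            del self.inflight[key]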

caps backend load at N — the number of app servers. even if a billion users want the same key, the backend only sees one request per server.

fix 2: cache key fanout. spread one hot key across multiple entries. instead of feed:taylor-swift, store the same data under feed:taylor-swift:1 through :N. clients pick a random suffix. now those 500k qps distribute across N keys. trade-offs: more memory, more invalidation work. for hot-key scenarios that threaten availability, that redundancy is dirt cheap insurance.
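fanout is mostly a naming convention, so the sketch is short. a dict stands in for the cache cluster here; in a real deployment each suffixed key would typically hash to a different cache node, which is where the load actually spreads. `write_fanout` and `read_fanout` are hypothetical helper names.

```python
import random


def write_fanout(cache, key, value, copies):
    """Store the same value under key:1 .. key:copies."""
    for i in range(1, copies + 1):
        cache[f"{key}:{i}"] = value


def read_fanout(cache, key, copies, rng=random):
    """Read one randomly chosen copy so load spreads across all of them."""
    return cache.get(f"{key}:{rng.randint(1, copies)}")


cache = {}
write_fanout(cache, "feed:taylor-swift", ["post-1", "post-2"], copies=4)
feed = read_fanout(cache, "feed:taylor-swift", copies=4)
```

the invalidation cost is visible right in `write_fanout`: every update has to touch all N copies, which is the "more invalidation work" trade-off.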

✽ RECALL a celebrity post turns one cache key into 10× your cache server's capacity and the site dies from pure read traffic. why does the cache melt even though the data is in memory, and what are the two fixes?

caching assumes load spreads across many keys — when everyone wants ONE key, serializing and shipping the same value hundreds of thousands of times per second saturates the server regardless of where the data lives. fix one: request coalescing — duplicate in-flight requests on a server collapse into a single backend fetch, capping load at one request per app server. fix two: key fanout — store the value under N suffixed copies and let clients pick randomly, spreading the heat.

"cache stampede when hot entries expire"

your homepage data has a 1-hour TTL. serving 100k qps from cache like a champ. the hour ticks over. all 100k requests in that instant see a miss simultaneously. every single one tries to rebuild from the db. your db, sized for maybe 1k qps of misses on a normal day, is now staring at 100k identical queries. self-DDoS. cache stampede.

three approaches, in increasing sophistication:

distributed lock — only the first miss rebuilds; everyone else waits. protects the backend, but if the rebuild fails or stalls, thousands of waiters time out. fragile.
probabilistic early refresh — as the entry ages, each request has a small but growing chance of triggering a background refresh. 1% at minute 50, 5% at 55, 20% at 59. instead of stampeding at minute 60, refreshes spread across the last 10 minutes. clean, no locks.
scheduled background refresh — a worker continuously refreshes the hottest entries before expiry. they never go stale. costs ops complexity and you waste work on entries that might not be requested, but for hot content it's worth it.
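for the probabilistic option, the whole trick is a function from entry age to refresh probability. the curve below is one assumed shape (a cubic ramp capped at 25%), chosen so the odds stay tiny early in the window and climb near expiry; it lands near the example percentages above without matching them exactly, and the parameter names are made up.

```python
import random


def refresh_probability(age_s, ttl_s, window_s, max_p=0.25, steepness=3.0):
    """Chance that a request at this entry age should trigger a refresh."""
    if age_s >= ttl_s:
        return 1.0                      # already expired: must rebuild
    window_start = ttl_s - window_s
    if age_s <= window_start:
        return 0.0                      # still fresh: never refresh early
    progress = (age_s - window_start) / window_s   # 0.0 -> 1.0 across the window
    return max_p * progress ** steepness


def should_refresh(age_s, ttl_s, window_s, rng=random.random):
    """Roll the dice for one request."""
    return rng() < refresh_probability(age_s, ttl_s, window_s)


HOUR, TEN_MIN = 3600, 600
p_55 = refresh_probability(55 * 60, HOUR, TEN_MIN)   # about 0.03
p_59 = refresh_probability(59 * 60, HOUR, TEN_MIN)   # about 0.18
```

at 100k qps even a 1% chance means plenty of requests will win the roll, so in practice you'd still make sure only one background refresh runs at a time.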

✽ RECALL a hot entry's TTL expires and your db gets self-DDoSed by identical rebuild queries. why does probabilistic early refresh beat a distributed lock here?

the stampede happens because every request sees the miss at the same instant. a lock lets only the first miss rebuild, but if that rebuild stalls or fails, thousands of waiters time out — fragile. probabilistic refresh gives each request a small, growing chance of triggering a background rebuild as the entry ages, so refreshes spread across the final minutes instead of detonating at expiry. no locks, no waiters, clean.

"users need updates immediately — eventual consistency isn't enough"

eventual consistency is fine for most stuff. user changes their bio? if it shows up in 30 seconds across all caches, who cares. but for some data, stale is unacceptable. event venue updates 30 minutes before kickoff? attendees can't be looking at the old address.

naive approach is to delete the cache on write. real problems: which caches do you delete from (Redis, CDN, browser)? what if delete fails? race condition — another request reads from a follower replica that hasn't replicated yet, gets the old value, writes it back to the cache after you deleted it. cache has stale data again.

better approach: cache versioning. instead of deleting old entries, you make them unreachable by changing the cache key whenever data changes. each record has a version column. every update increments it in the same transaction. on read, fetch the current version, build the cache key as entity:id:vN, fetch using that key. on write, bump the version (v42 → v43), write new data under the new key.

 
event:123:v42     # before update
event:123:v43     # after update — readers automatically move here

old entries never get explicitly deleted — they just become unreachable. let TTLs clean them up.
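here's the read and write path as a toy, with dicts standing in for both the database table and the cache. `VersionedCache` is a hypothetical name, and a real version bump would happen inside the same db transaction as the data update rather than in python.

```python
class VersionedCache:
    def __init__(self):
        self.db = {}      # entity_id -> (version, data); stands in for the real table
        self.cache = {}   # "event:<id>:v<version>" -> data

    @staticmethod
    def key(entity_id, version):
        return f"event:{entity_id}:v{version}"

    def write(self, entity_id, data):
        version = self.db.get(entity_id, (0, None))[0] + 1
        self.db[entity_id] = (version, data)       # bump version and data together
        self.cache[self.key(entity_id, version)] = data

    def read(self, entity_id):
        version = self.db[entity_id][0]             # lookup 1: current version
        cache_key = self.key(entity_id, version)
        if cache_key not in self.cache:             # lookup 2: the versioned entry
            self.cache[cache_key] = self.db[entity_id][1]
        return self.cache[cache_key]


events = VersionedCache()
events.write(123, "old venue")                      # stored under event:123:v1
events.write(123, "new venue")                      # stored under event:123:v2

# a late writer re-populates the cache with stale data it read earlier
events.cache[VersionedCache.key(123, 1)] = "old venue"

current = events.read(123)                          # still "new venue"
```

the stale write landed on event:123:v1, and no reader builds that key anymore.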

why this kills the race condition: a "late writer" can't overwrite new data because the db forced a new version number. their write lands at v42, nobody's reading v42 anymore. no partial-invalidation worries — you're not deleting anything, you're routing past it. atomic by definition because version increments are atomic in the db.

trade-offs are real: two cache lookups per request, small extra latency. old versioned keys accumulate so you've gotta TTL them. and this works best for single-entity caches like profiles or product details — doesn't help much with computed data like search results or feeds where invalidation is inherently complex.

for computed data, a deleted items cache is your friend. small working set of recently deleted/hidden/changed IDs. when serving feeds, filter cached results against this set. lets you serve mostly-correct cached data immediately while doing proper invalidation in the background.
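
a tiny sketch of that filter step, assuming feed items are dicts with an id field and the deleted items cache is a plain set. serve_feed and the sample data are made up for illustration.

```python
def serve_feed(cached_feed, recently_deleted):
    """Drop items whose ids are in the recently-deleted set, keeping order."""
    return [item for item in cached_feed if item["id"] not in recently_deleted]

cached_feed = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}, {"id": 3, "title": "c"}]
recently_deleted = {2}   # small working set, refreshed far more often than the feed
visible = serve_feed(cached_feed, recently_deleted)
```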

✽ RECALL why does cache versioning kill the stale-write race that delete-on-write can't?

with delete-on-write, a reader can grab the old value from a lagging replica and write it back into the cache after your delete — stale data resurrected. with versioned keys, every update atomically bumps a version in the same db transaction, readers build keys from the current version, and the late writer lands on vN while everyone reads vN+1. nothing is deleted — you route past old entries and let TTLs sweep them up.

wrapping up

read scaling is the most common scaling challenge in production because read traffic usually grows much faster than write traffic, and at scale physics wins. no amount of clever code can outrun hardware limits when you're serving millions of concurrent users.

the path is the same every time: optimize within the db first (indexes, composite/covering, denormalize if you must), scale horizontally if you've gotta (read replicas first, sharding only when the dataset is the bottleneck), cache the rest (Redis/Memcached at the app layer, CDN at the edge).

the most common mistake is skipping rungs. teams jump straight to "let's add Redis" without ever checking if a composite index would've solved it for free. don't be that team. start cheap, climb only as far as you need.

nothing is best. everything depends on your usecase, your read patterns, your consistency budget. you're paid to solve a problem, not to ship the fanciest architecture.

why databases use b+ trees to hold data

2025-10-01T00:00:00Z

so we all know most databases store data in b+ trees, but why? not just sql databases either - even non-relational databases like mongodb leverage b+ trees to store their data. mongodb's storage engine wiredtiger serializes collection data as b+ trees on disk. but let me tell you why there was even a need to introduce a data structure like b+ trees in the first place, and how this actually works in practice.

what you'll take away

quick pointers so you know what to look for as you read:

disk io constraints drive everything. you can't insert a line in the middle of a file — a write at an offset overwrites what's there.
sequential storage makes every operation o(n). insert, update, delete — each one means rewriting the whole file.
b+ tree nodes are sized to the disk block. one disk read = one node ≈ a hundred rows. align the data structure with the hardware.
data lives only in the leaves, and leaves link sideways. that's what makes range queries a linear walk instead of repeated tree climbs.
every o(n) operation becomes o(log n) or better — which is why sql and nosql engines alike sit on b+ trees.

starting simple: the naive approach that doesn't work

let's start simple and see why the obvious approach fails spectacularly.

say our table records are stored in one file sequentially - literally one row after another in a single file. dead simple, right? when i say "row" here, i'm not just talking about sql tables. this applies to documents in mongodb, any entity you're storing - sql databases call them rows, nosql databases call them documents, but the idea holds true across the database spectrum.

the brutal reality of sequential storage

insert operations: `o(n)` complexity

here's the problem with inserting into a sequential file: you can easily append at the end, but what about inserting in the middle? databases typically store data ordered by primary key. so when you insert rows 1, 2, 5, then try to insert row 4...

you're screwed. why? you cannot just insert a line in the middle of a file. this isn't some in-memory buffer in your code editor where you hit enter and everything shifts down. on disk, when you write at a particular offset, you overwrite what's there. period.

so what do you do? you have to:

find the insertion position
copy all rows before that position to a new file
add the new row
copy all remaining rows
replace the old file

every. single. insert. creates a new file. that's o(n) complexity right there.
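
the steps above can be sketched like this, assuming a toy file format of one "id,payload" line per row sorted by id. insert_row and the format are my own illustration, not how any real database lays out its files.

```python
import os
import tempfile

def insert_row(path, new_id, new_row):
    """Insert a row into a file sorted by id by rewriting the whole file."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=folder)
    inserted = False
    with os.fdopen(fd, "w") as out, open(path) as src:
        for line in src:
            row_id = int(line.split(",", 1)[0])
            if not inserted and row_id > new_id:
                out.write(f"{new_id},{new_row}\n")   # add the new row
                inserted = True
            out.write(line)                          # copy every existing row
        if not inserted:
            out.write(f"{new_id},{new_row}\n")       # new row belongs at the end
    os.replace(tmp_path, path)                       # replace the old file
```

note that the loop copies every existing row no matter where the new one lands, which is exactly the o(n) cost.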

update operations: the width problem

updates are equally problematic. say you want to update row 3:

linear scan to find it (worst case o(n))
start writing at that location

but wait - you can only write exactly the same number of bytes as the existing row. if the original row was 100 bytes and your update needs 120 bytes, those extra 20 bytes will overwrite the beginning of row 4. you can't just push things forward - this is disk io, not ram.

so if you need more space? create a new file. copy everything. again.
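
you can see the clobbering with a few bytes. this sketch uses io.BytesIO as a stand-in for a file opened in "r+b" mode (same seek and write behaviour); the row layout is made up.

```python
import io

# two fixed-width 10-byte "rows" back to back
f = io.BytesIO(b"row3:aaaa|row4:bbbb|")

f.seek(0)                      # jump to the start of row 3
f.write(b"row3:aaaaXXXX|")     # updated row 3 is 4 bytes longer than before

result = f.getvalue()          # b"row3:aaaaXXXX|:bbbb|" - row 4 lost its first 4 bytes
```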

find operations: pure linear scan

finding a single row? linear scan through the entire file. o(n). there's literally no other way with this structure.

range queries: slightly less terrible (but still bad)

range queries seem efficient once you find the first row - you can read sequentially from there. but finding that first row? still o(n).

delete operations: another new file

deleting row 3? you guessed it:

find the row (o(n))
create new file
copy everything except that row
replace old file

the fundamental insight: o(n) complexity for every operation is far too much. we need something better.

✽ RECALL why can't a sequential file just "insert a row in the middle" the way your text editor inserts a line?

on disk, writing at an offset overwrites what's there — nothing shifts down. so a mid-file insert means copying everything to a new file, and an update can't even grow a row without clobbering its neighbor. every operation — insert, update, delete, even find — degenerates to o(n) full-file work, which is fatal for a transactional database.

enter b+ trees: the game changer

given that o(n) operations won't work for transactional databases, we need a better solution. this is where b+ trees come into the picture.

how b+ trees actually store your data

think about this: rows or documents in a table are clubbed together into b+ tree nodes.

let's get concrete with numbers:

b+ tree node size : 4kb (matches disk block size)
average row size : 40 bytes
rows per node : 4kb ÷ 40 bytes ≈ 100 rows

why 4kb? this is crucial : disk io happens in blocks, typically 4kb. even if you want to read 1 byte, you read the entire 4kb block. the os does this for you - reads the block, extracts your byte, discards the rest. so we align our b+ tree node size with the disk block size. in one disk read, we read 1 node ≈ 100 rows.
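
the arithmetic, plus a rough feel for how shallow a tree over these nodes stays. big assumption in this sketch: routing nodes fan out to about as many children as a leaf holds rows. real routing entries are smaller than rows, so real trees are usually even shallower.

```python
import math

BLOCK_SIZE = 4096                        # bytes per disk block, from the text
ROW_SIZE = 40                            # average bytes per row, from the text
ROWS_PER_NODE = BLOCK_SIZE // ROW_SIZE   # 102, which the text rounds to ~100

def tree_levels(n_rows, per_node=ROWS_PER_NODE):
    """Levels needed if every node holds per_node rows or child pointers."""
    nodes = math.ceil(n_rows / per_node)   # number of leaf nodes
    levels = 1
    while nodes > 1:
        nodes = math.ceil(nodes / per_node)
        levels += 1
    return levels

levels_for_a_million = tree_levels(1_000_000)   # 3 under these assumptions
```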

the beautiful tree structure

your table becomes a collection of b+ tree nodes on disk. these nodes can be anywhere on the disk - they're not necessarily sequential. the database maintains the structure through pointers (disk offsets).

consider a table with 400 rows:

the table is always arranged by its primary key, and the b+ tree nodes (leaf nodes) are connected accordingly. but here's where it gets interesting...

the multi-level magic

a b+ tree isn't just leaf nodes. it has multiple levels:

ps: sorry for my bad drawing :(

note: (leaf nodes linked linearly)

every b+ tree node is serialized and stored on disk. they're not in memory (though they can be cached there for performance).

non-leaf nodes hold routing information - they tell you which child node holds which range of data. the root might store [1, 201, 401], meaning:

values < 201 are in the left subtree
values 201-400 are in the middle
values > 400 are in the right

leaf nodes hold the actual row data and are linked linearly to enable efficient range traversal.

✽ RECALL why is the b+ tree node size matched to the disk block size, and what does that buy per read?

disk io happens in whole blocks — ask for one byte and the os reads the entire block anyway. sizing the node to the block means a single disk read delivers exactly one full node packed with rows, zero waste. the data structure is shaped by the hardware, not the other way around.

operations in b+ trees: where the magic happens

find one by id: from `o(n)` to `o(log n)`

let's find row with id 3:

read root node from disk - check routing info
follow pointer to appropriate child - 3 is between 1 and 201
read that node - 3 is between 1 and 101
read leaf node - contains rows 1-100
extract row 3 from the 100 rows in memory

total disk reads: 3 (for a 3-level tree). no matter which row you want, it's exactly 3 disk reads. not 1 extra.
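
a toy in-memory version of that lookup, assuming each routing key is the smallest key under the matching child (the [1, 201, 401] convention from above). Node, find and the reads counter are illustrative; a real engine would deserialize one disk block per hop.

```python
from bisect import bisect_left, bisect_right

class Node:
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys            # routing keys (internal) or row ids (leaf)
        self.children = children    # child nodes; None for a leaf
        self.rows = rows            # row payloads; None for internal nodes

def find(root, key):
    node, reads = root, 1           # reading the root is the first "disk read"
    while node.children is not None:
        idx = max(bisect_right(node.keys, key) - 1, 0)   # last routing key <= key
        node = node.children[idx]
        reads += 1
    i = bisect_left(node.keys, key)  # search inside the leaf, already in memory
    if i < len(node.keys) and node.keys[i] == key:
        return node.rows[i], reads
    return None, reads

def leaf(start):
    ids = list(range(start, start + 100))
    return Node(ids, rows=[f"row-{i}" for i in ids])

internals = [Node([s, s + 100], children=[leaf(s), leaf(s + 100)]) for s in (1, 201, 401)]
root = Node([1, 201, 401], children=internals)
```

with this 600-row tree, find(root, 3) and find(root, 450) both come back after 3 reads.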

insert: no more file rewriting

want to insert row 4?

traverse to find the right leaf (3 disk reads)
load that node into memory
insert row 4 in the right position (array manipulation in ram)
flush the node back to disk

that's it. we only touched the blocks we needed. no rewriting the entire file. the tree might need rebalancing (standard b+ tree stuff), but we're not touching unrelated nodes.
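
the in-memory part of that insert, sketched on a bare list of keys. insert_into_leaf and the split-in-half rule are a simplified illustration of the "standard b+ tree stuff", assuming a leaf capacity of 100; a real engine would also update the parent's routing keys.

```python
from bisect import insort

def insert_into_leaf(leaf_keys, key, capacity=100):
    """Insert key into a sorted leaf; split in two if it overflows."""
    insort(leaf_keys, key)                   # array manipulation in ram
    if len(leaf_keys) <= capacity:
        return leaf_keys, None               # one node to flush back
    mid = len(leaf_keys) // 2
    return leaf_keys[:mid], leaf_keys[mid:]  # two nodes to flush, parent gets a new key

same_leaf, no_split = insert_into_leaf([1, 2, 5], 4)
left, right = insert_into_leaf([1, 2, 3, 5], 4, capacity=4)
```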

update: surgical precision

update row 202:

navigate to the leaf (3 reads)
load node into memory
update the row
flush back to disk

if the update changes the row size, we handle it in memory before flushing. no overwriting neighboring rows.

delete: clean and simple

delete row 401:

find the leaf (3 reads)
load into memory
remove from array
flush back

the tree might rebalance, but again, we're only touching what we need to. you can also do some kinda soft delete and do a batch delete later on by running a cron.

✽ RECALL what makes a b+ tree insert "surgical" where the sequential file had to be rewritten end to end?

you traverse routing nodes to the one leaf that owns the key — a handful of disk reads — mutate it in memory, and flush just that node back. only the blocks you need are touched: row growth and shrinkage are handled in memory before the flush, and any rebalancing stays local to the affected nodes instead of touching the whole file.

range queries: the secret weapon

this is where b+ trees truly shine. find all rows with id between 100 and 600:

find the leaf containing 100 (3 reads)
use the leaf-to-leaf links to traverse linearly
read subsequent leaves until you reach 600

this linear traversal between leaves is exactly why b+ trees force you to store data at the leaf level. b-trees allow data in non-leaf nodes, but b+ trees don't - precisely because keeping rows only in linked leaves makes range queries super efficient. once you reach the first leaf, you just traverse sideways instead of going up and down the tree repeatedly.
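
a sketch of the sideways walk, assuming each leaf keeps a next pointer to its right neighbour. Leaf and range_scan are illustrative names, and the tree descent that finds the first leaf is left out.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys    # sorted row ids stored in this leaf
        self.next = None    # link to the next leaf on the right

def range_scan(first_leaf, lo, hi):
    """Collect keys in [lo, hi], starting from the leaf that contains lo."""
    out = []
    leaf = first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out          # passed the upper bound: stop walking
            if k >= lo:
                out.append(k)
        leaf = leaf.next            # one more leaf read, no climbing back up
    return out

leaves = [Leaf(list(range(s, s + 100))) for s in range(1, 701, 100)]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
hits = range_scan(leaves[0], 100, 600)
```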

✽ RECALL why do b+ trees (unlike plain b-trees) refuse to store row data in non-leaf nodes?

keeping data only in the leaves lets the leaves link linearly — a range query finds its first leaf in o(log n) and then just walks sideways until the upper bound, never climbing the tree again. non-leaf nodes stay pure routing information, which also keeps them small enough that the tree stays shallow.

the bottom line

this is why databases use b+ trees. the evolution from naive sequential storage to b+ trees solves fundamental problems:

predictable performance : o(log n) for single-row operations
efficient range queries : linear traversal at leaf level
optimal disk io : node size matches disk block size
localized updates : only touch the blocks you need

and this isn't just sql databases - mongodb's wiredtiger storage engine does exactly this. the beauty of b+ trees transcends the sql/nosql divide because the underlying problem - efficient disk-based storage with fast retrieval - is universal.

think about it: every operation that was o(n) in our naive approach is now o(log n) or better. that's the power of choosing the right data structure for your storage layer. and that's precisely why, when you dig into any production database, you'll find b+ trees at its heart.

✽ RECALL mongodb isn't sql — so why does wiredtiger end up on the same b+ trees as mysql?

because the underlying problem is identical: storing records on block-based disks with fast point lookups and range scans. rows or documents, the physics of disk io doesn't change — which is why the structure transcends the sql/nosql divide and shows up at the heart of basically every production database.

key takeaways

disk io constraints drive everything - you can't just insert in the middle of a file
align data structures with hardware - b+ tree nodes match disk block size
optimize for the common case - most database operations are finds and range queries
trade complexity for performance - b+ tree maintenance is worth the o(log n) guarantees

the next time someone asks why databases use b+ trees, you know it's not just tradition - it's physics, hardware constraints, and decades of engineering wisdom rolled into one elegant solution.

kv store on relational db: a storage-compute separation story

2025-09-23T00:00:00Z

why another database?

your first question should hit you: "why on earth should we do this?" like dynamodb exists, redis exists, valkey exists, this exists, that exists. why do i have to do this?

but instead of looking at it from a lens of "hey why the world needs it," let's look at it from a mental model perspective. the core essence of this is storage-compute separation. that's our biggest takeaway from this. second takeaway would be elegant queries. so storage and compute separation - this is how you can create a new db every time you want.

consider sparksql - it accepts sql queries but fetches data from files, apis, databases etc and computes the result internally. so it's not a sql database; it's a sql interface over flexible storage. one more example - amazon documentdb exposes a mongodb-compatible api to the user, but underneath it sits on a separate distributed storage layer, the same kind of design amazon aurora (a relational db that speaks sql) uses. so now you see the compute and storage separation? tweak either one and you've created a new db.

what you'll take away

quick pointers so you know what to look for as you read:

storage-compute separation is how new databases get made. swap the interface or swap the storage and you've "created" a new db - documentdb's mongodb-compatible api over aurora-style storage, sparksql over files.
store absolute expiry, never relative ttl. now() + ttl at write time — a bare 300 is meaningless when you read it later.
upsert beats select-then-decide, and INSERT ON DUPLICATE KEY UPDATE beats REPLACE INTO because an in-place update beats delete+insert.
batch delete in primary-key order and index your where-clause columns. closed key ranges mean minimal rebalancing; the index turns a full scan into o(log n + k).
scale bottom-up, and let the user pick consistency. ?consistent=true goes to master; everything else eats replica staleness.
partitioning is a metadata-vs-flexibility dial. static explodes metadata, hashing repartitions everything on change, range sits in between.

what we're building

we're going to build a key value store (we'll call it kv from here on) which speaks http/rest and stores in mysql. basically we're exposing a db as a service with two parts - the computation layer where we expose the http apis to the user, and a mysql db as our storage layer. everything will be stored in that mysql. damn this will be exciting and simple to start with.

so what are the requirements? we need http based apis - get, post, put, delete. all operations are synchronous (to keep stuff simple). some basic schema to start with. and after that we can discuss scaling, ttl, optimising queries etc.

we keep stuff simple, a simple key value db so we expose 4 rest apis and we want to hold this key value store in the mysql db. here's our schema:

 
CREATE TABLE store (
    `key` VARCHAR(255) PRIMARY KEY,  -- backticks: KEY is a reserved word in mysql
    value TEXT,
    expired_at BIGINT  -- absolute expiry as unix epoch seconds
);

✽ RECALL documentdb puts a mongodb-compatible api on aurora-style storage and sparksql queries flat files. what's the mental model, and how does this kv store fit it?

storage-compute separation — the query interface and the storage engine are independent choices, and swapping either one "creates" a new database. here the compute layer is a set of http/rest apis and the storage layer is plain mysql: a kv store that's really just a db exposed as a service on top of a relational engine.

implementing the computation layer

the put operation

aight this is gonna be fun. what will our put operation look like? put(k1, v1, 300)? - we shouldn't store the 300 seconds directly as ttl, since a bare duration says nothing about when it started counting (i have seen lots of people making that mistake). we store absolute time, that is created_at + ttl, or created_at + 300, so the put should look like put(k1, v1, now()+300). and mind it, we are doing an insert command here.

but can you now think what problem might be in here? well well well its time to triage that.

it's simple - if a user logs in for the first time, the insert command works perfectly: INSERT INTO store VALUES ('user_123_session', '{"logged_in": true, "cart": []}', UNIX_TIMESTAMP() + 300);. (UNIX_TIMESTAMP() is mysql's "now" in epoch seconds, which is what our BIGINT column wants; NOW() + 300 would do arithmetic on a YYYYMMDDHHMMSS number, not on seconds.) but if the user refreshes or logs in again with the same session key, we try: INSERT INTO store VALUES ('user_123_session', '{"logged_in": true, "cart": ["item1"]}', UNIX_TIMESTAMP() + 300);. now we get a duplicate-key error since the key already exists - the insert command will fail.

so now we know we need to create if the key doesnt exist and update if it already exists. we could wrap this in a transaction to ensure atomicity:

first select to check if it exists
then either insert or update
all wrapped in transaction boundaries

but my friends, we are doing 3 operations internally just to do it. one transaction, one select, one operation of insert or update. lets try to optimise it.

mysql supports upserts which is exactly what we need. we have two options here. option 1 is using REPLACE INTO:

 
REPLACE INTO store VALUES ('k1', 'v2', UNIX_TIMESTAMP() + 300);  -- epoch seconds, matching the BIGINT column

this has reduced our previous query to 1 operation. but the problemmmm - when conflict occurs (that means a key already exists) it internally deletes the row and inserts another row which is slower.

option 2 is using INSERT ON DUPLICATE KEY UPDATE:

 
INSERT INTO store VALUES ('k1', 'v2', UNIX_TIMESTAMP() + 300)  -- epoch seconds
ON DUPLICATE KEY UPDATE
    value = 'v2',
    expired_at = UNIX_TIMESTAMP() + 300;

this is reportedly much faster (i read a 32x figure on the internet, so treat the exact number as unverified), since instead of delete+insert it does a proper update when a conflict occurs. in my work i'm currently using prisma, and there's a simple query for this in my dear prisma orm: prisma.<table_name>.upsert.

✽ RECALL why store now() + 300 instead of 300 in the expiry column, and why does a plain INSERT break the moment a user logs in twice?

a relative ttl is meaningless at read time — you can't tell when it started counting. absolute expiry makes the get a simple expired_at > now() comparison. and a repeat put on an existing key fails on the primary key, so you need an upsert: INSERT ON DUPLICATE KEY UPDATE does a real in-place update on conflict, collapsing the select-then-insert-or-update transaction into one operation and beating REPLACE INTO, which deletes and reinserts.

the get operation

lets try to get a key only if it hasn't expired. for that the query would look like:

 
SELECT * FROM store
WHERE `key` = 'k1' AND expired_at > UNIX_TIMESTAMP();  -- compare epoch seconds to epoch seconds

dead simple. we check the key and make sure its not expired. nothing fancy here.
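
here's a runnable sketch of put and get together. assumptions: python's built-in sqlite3 stands in for mysql so it runs without a server, and sqlite's ON CONFLICT ... DO UPDATE (sqlite 3.24+) plays the role of mysql's ON DUPLICATE KEY UPDATE. the now parameter exists only so the expiry logic can be exercised without waiting.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (key TEXT PRIMARY KEY, value TEXT, expired_at INTEGER)")

def put(key, value, ttl_s, now=None):
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT INTO store (key, value, expired_at) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET "
        "value = excluded.value, expired_at = excluded.expired_at",
        (key, value, now + ttl_s),          # absolute expiry, never the bare ttl
    )

def get(key, now=None):
    now = int(time.time()) if now is None else now
    row = conn.execute(
        "SELECT value FROM store WHERE key = ? AND expired_at > ?", (key, now)
    ).fetchone()
    return row[0] if row else None
```

calling put twice with the same key updates in place instead of failing, and get returns None once now reaches the stored expiry.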

the delete operation

again we have 2 options - hard delete or soft delete. i have already written about why soft delete is ideal (Link). in soft delete you just update a column to indicate it has been soft deleted and later on using batch delete you clear the data. here is the example:

 
UPDATE store SET expired_at = -1
WHERE `key` = 'k1' AND expired_at > UNIX_TIMESTAMP();  -- only touch rows that are still live

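so why expired_at = -1? this will help us later determine and provide stats to the user about which keys expired naturally vs which were deleted by the user. and the expired_at > now() condition means a delete on a key that has already expired touches no rows at all. you might ask, tautik - where does a delete hit an already-expired key in practice? umm lots of cases: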

when a user session has been timed out
application tries to delete a cache entry that already expired
cleanup processing that we run on already-cleaned data

its micro optimisation because it only helps in edge cases, but when you have millions of operations, these edge cases add up.

batch deletes and optimisation

so how should we ideally batch delete? well make sure to always use ORDER BY key, not ORDER BY expired_at. deleting in primary-key order needs minimal rebalancing because the keys in each batch come from a contiguous range, so multiple keys on the same page get deleted in one iteration:

 
DELETE FROM store
WHERE expired_at < UNIX_TIMESTAMP()  -- also sweeps soft-deleted rows (expired_at = -1)
ORDER BY `key`
LIMIT 1000;

now what further optimisation can we do? think about it. if we have 1mil rows, and we apply above query - we need to look for each row which satisfies this condition. since it will do a full table scan without the indexes. after that it will get those rows, sort it, and delete the first 1k rows from above query.

what happens with an index on expired_at? first create the index: CREATE INDEX idx_expired_at ON store(expired_at);. now whenever we run the delete query, our db will do lookup on index to quickly find the rows where expired_at < now(), and only reads 50k expired rows (not all 1 mil, and yep assuming 50k rows are expired, all are assumptions). and then we delete the first 1k rows. dead simple. i will write more about indexes later on lol. its so fun. everything is fun. aight back to the topic.

ps: if you dont know, we have reduced the time complexity from o(n) to o(log n + k) after doing quick index lookup + reading only matching rows. so what we can learn - you can always create indexes on columns you use in where clauses , especially for cleanup operations that run frequently.
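
a sketch of the cleanup loop a cron job might run. assumptions: sqlite3 again stands in for mysql, and since stock sqlite builds don't accept DELETE ... ORDER BY ... LIMIT, the batch is picked with a subquery instead. purge_expired and the sample rows are hypothetical.

```python
import sqlite3

def purge_expired(conn, now, batch_size=1000):
    """Delete expired rows in key order, one bounded batch at a time."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM store WHERE key IN ("
            "  SELECT key FROM store WHERE expired_at < ? ORDER BY key LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()                 # keep each batch its own short transaction
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (key TEXT PRIMARY KEY, value TEXT, expired_at INTEGER)")
conn.execute("CREATE INDEX idx_expired_at ON store(expired_at)")
conn.executemany(
    "INSERT INTO store VALUES (?, ?, ?)",
    [(f"k{i:03d}", "v", 100 if i < 25 else 9999) for i in range(30)],
)
deleted = purge_expired(conn, now=1000, batch_size=10)   # 25 expired rows, 3 batches
```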

✽ RECALL in the cleanup job, why ORDER BY key instead of ORDER BY expired_at — and what does the index on expired_at change?

deleting in primary-key order removes rows from a closed range, so they cluster on the same pages and the tree needs minimal rebalancing; ordering by expiry scatters the deletes everywhere. the index is what finds the victims: without it every run is a full table scan, with it the lookup reads only the matching rows — o(n) down to o(log n + k). index the columns in your where clauses, especially for cleanup that runs constantly.

scaling this thing

yeppie, this is the fun part. so one way i can think of doing this is by just scaling the kv computation layer (for our easy use, lets call it kv api server). but again wrong - what if we are unnecessarily scaling stuff but our db is not able to handle that load? thats why we always do bottom up.

lets start bottom up by trying to scale our storage layer. so what if we have 90:10 read requests? we simply add READ REPLICAS to handle those reads and lets add database proxy in between as well to route the requests from computation layer to storage layer.

but then here comes the problem - if we have read replicas, we know we are signing up for staleness. but then how will we ensure that every client around the world receives accurate data? like which client should we route to the replicas, which client to master? i know right this computation internally can become such an overhead ahh. so lets do this like how dynamodb and other services do. the right solution is: let the user decide!

 
GET /key?consistent=true  → goes to master
GET /key                 → goes to replica (default)

i know this is funny, but why complicate stuff. see the philosophy is: "you are paid to solve a problem and not necessarily write code to solve it". i should prolly tweet about this. so yep dont over optimize stuff.
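
the whole routing rule fits in a few lines. a sketch, assuming the proxy only needs to look at the query string; pick_backend and the "master"/"replica" labels are placeholders for whatever your proxy actually routes to.

```python
from urllib.parse import urlparse, parse_qs

def pick_backend(url):
    """Route to master only when the caller explicitly asks for consistency."""
    query = parse_qs(urlparse(url).query)
    return "master" if query.get("consistent") == ["true"] else "replica"
```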

✽ RECALL read replicas mean staleness. how does the kv api dodge the "which client needs fresh data" problem entirely?

punt it to the caller — GET /key?consistent=true routes to master, the default goes to a replica. the server can't know which reads are critical; the user does, so building routing heuristics is wasted work. and the scaling itself goes bottom-up: no point scaling the api layer into a storage layer that can't take the load.

handling write scaling

so now that we are done with scaling and handling reads, lets think about writes. for writes we know the request goes to the master. but when the master cant handle the write load, we basically shard the database.

for this we have partition strategies. there are 3 basically - hashing, static, and range based. so lets have a config db at the start - our computation-layer (or the proxy layer if you have) now refers to the config db which contains the rules for where to route the user.

static mapping : one metadata entry for each key in config db. but imagine if this is key value - we will be storing for each of the entry in the db. if you have million keys, will we store million entries in our config db? hell naw, its too much metadata explosion
hash based partitioning : the other extreme, zero metadata. but the problem is if the hash function (or the number of shards) changes, we need to re-partition everything - huge data movement: shard = hash(key) % number_of_shards
range based partitioning : lies in between of static and hash based. now we have minimal metadata, and its just simple:

A-K → Shard 1
L-P → Shard 2
Q-Z → Shard 3

so depending upon metadata size, key control and flexibility we choose one. nothing is best. everything depends upon our usecase.
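
a sketch of the hash and range routers side by side. assumptions: sha256 is just one stable hash choice, keys start with a letter A-Z, and hash_shard, range_shard and RANGE_UPPER are made-up names. the last lines count how many keys change shards when you go from 3 to 4 shards with plain modulo hashing.

```python
import hashlib
from bisect import bisect_left

def hash_shard(key, num_shards):
    """shard = hash(key) % number_of_shards, with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

RANGE_UPPER = ["K", "P", "Z"]   # A-K -> shard 1, L-P -> shard 2, Q-Z -> shard 3

def range_shard(key):
    """Pick the first range whose upper bound covers the key's first letter."""
    return bisect_left(RANGE_UPPER, key[0].upper()) + 1

keys = [f"user_{i}" for i in range(1000)]
moved = sum(hash_shard(k, 3) != hash_shard(k, 4) for k in keys)   # most keys move
```

the range router's entire metadata is that three-element list, which is the "minimal metadata" middle ground.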

✽ RECALL static mapping, hash, and range partitioning — what's the actual dial you're turning when you pick one?

metadata versus flexibility. static keeps one config entry per key — total control, but a million keys means metadata explosion. hash needs zero metadata, but change the function (or shard count) and you're re-partitioning everything. range sits in between: minimal metadata, simple rules like A-K → shard 1. choose by metadata budget and how much key control you need — nothing is best.

Caching : Thundering Herd and Request Hedging

2025-09-07T00:00:00Z

Caching: It's Not Just About Memory

Myth-busting time : Caching doesn't mean in-memory. I see this confusion everywhere.

We accept data staleness in exchange for avoiding expensive operations. Every time you cache something, you're saying 'I'd rather serve data that might be 5 minutes old than wait 2 seconds for a database query.'

What Caching Really Means

Cache = saving expensive operations. That's it.

Expensive operations include:

Database queries with 14 table joins
Network calls to external services
Complex computations
File system reads

You can cache:

In memory (Redis)
On disk (API server's unused SSD)
In browser (localStorage)
At CDN edge

The Under-utilized Cache Location

Here's something nobody talks about - your API server's disk is sitting there doing nothing.

You spin up an EC2 instance:

4GB RAM (fully utilized)
20GB SSD (5% utilized)

Why not cache on that SSD? It's faster than network calls to Redis, costs nothing extra, and the space is already paid for.

But wait - multiple API servers means cache inconsistency! Which is why we usually centralize with Redis. But for read-heavy, rarely-changing data? Disk cache works beautifully.
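
A minimal sketch of such a disk cache, assuming one JSON file per key and an absolute expiry stored inside the file. DiskCache and its method names are hypothetical, and the `now` parameter is only there so the expiry check can be exercised without waiting.

```python
import hashlib
import json
import os
import tempfile
import time

class DiskCache:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        name = hashlib.sha256(key.encode()).hexdigest() + ".json"
        return os.path.join(self.directory, name)

    def put(self, key, value, ttl_s, now=None):
        now = time.time() if now is None else now
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"expires_at": now + ttl_s, "value": value}, f)
        os.replace(tmp, self._path(key))   # readers never see a half-written file

    def get(self, key, now=None):
        now = time.time() if now is None else now
        try:
            with open(self._path(key)) as f:
                entry = json.load(f)
        except FileNotFoundError:
            return None                     # cache miss
        return entry["value"] if entry["expires_at"] > now else None
```

Each API server would own its own directory, which is exactly where the per-server inconsistency comes from.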

✽ RECALL why is your API server's idle SSD a legitimate cache tier, and what's the catch once you run multiple servers?

caching means saving expensive operations, not "put it in RAM" — and a local disk read beats a network round-trip to redis while using space you've already paid for. the catch: each server's disk cache is its own little world, so multiple API servers drift inconsistent. that's why mutable data centralizes in redis, while read-heavy, rarely-changing data is the sweet spot for disk cache.

what you'll take away

quick pointers so you know what to look for as you read:

caching ≠ in-memory. a cache is anywhere you save the result of an expensive operation — RAM, disk, browser, CDN edge.
your API server's idle SSD is a free cache tier. great for read-heavy, rarely-changing data; the catch is per-server inconsistency.
cache expiry + concurrent traffic = stampede. N identical misses become N identical expensive queries, and the database melts.
request hedging collapses N misses into 1 query. the first request does the work, everyone else waits on a semaphore.
waiters read from a temporary result map. re-hitting the cache after the signal would just be a second stampede.

Cache Stampede & Request Hedging

The fundamental question we should always ask is: "What happens when your cache expires and 1000 requests hit simultaneously?"

Answer: Your database dies, your site goes down, and you get paged at 3 AM. Let me show you how to prevent this nightmare scenario.

The Cache Stampede Problem

Picture this: You have a popular blog post cached in Redis. The cache expires. Suddenly, 1000 concurrent requests hit your API at the exact same moment.

What happens?

All 1000 requests check Redis → cache miss
All 1000 requests query the database
Database connection pool gets overwhelmed
Database melts under load
Site goes down
You're now debugging at 3 AM while your users are angry

This is called a cache stampede or thundering herd problem , and it's one of the most common ways high-traffic applications fail.

Why is this so dangerous? Even if you have database connection pooling (which you should), making N identical expensive queries to your database for the same data doesn't make any sense. It's pure waste that can bring down your entire system.

✽ RECALL a hot key expires and 1000 concurrent requests arrive at once. walk the failure chain — and why doesn't connection pooling save you?

all 1000 miss the cache, all 1000 fire the same query at the database, the connection pool saturates, the db melts, the site goes down. pooling only caps concurrency — it doesn't change the fact that N identical expensive queries for the same data is pure waste. the fix isn't more database capacity, it's collapsing the N requests into 1.

The Real-World Impact

This isn't some theoretical problem I'm throwing at you. This is literally what CDNs solve every single day.

Think about it: CloudFlare, AWS CloudFront, and every other CDN faces this exact problem. When a cached resource expires and thousands of requests come in simultaneously, they can't all hit the origin server. The origin would die instantly.

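CDNs use sophisticated request hedging to ensure that only ONE request goes to the origin while everyone else waits for that response. (Elsewhere you'll see this same pattern called request coalescing, request collapsing, or single-flight; "hedged requests" more often means firing duplicate requests to cut tail latency.) This is production-tested at massive scale.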

The Solution: Request Hedging (Smart Debouncing)

Here's the elegant solution - and this is literally the pseudo-code you'd write:

 
# Python sketch; `cache` and `db` are assumed to be your own clients
import threading

lock = threading.Lock()  # Makes check-and-register on sem_map atomic
sem_map = {}  # key -> threading.Event for an in-flight fetch
res_map = {}  # Temporary result storage

def get_blog(k):
    # First, try cache
    v = cache.get(k)
    if v is not None:
        return v

    # Check if someone else is already fetching this
    with lock:
        s = sem_map.get(k)
        first = s is None
        if first:
            s = sem_map[k] = threading.Event()  # Others will wait on this

    if not first:
        s.wait()  # Wait for someone else to do the work
        return res_map.get(k)  # Get the result they fetched

    # I'm the first one - I'll do the work
    try:
        v = cache.get(k)  # Re-check: an earlier leader may have just finished
        if v is None:
            v = db.get(k)  # Do the expensive work
            cache.put(k, v)
        res_map[k] = v  # Store temporarily for waiting requests
        return v
    finally:
        # Signal that I'm done, even if the fetch raised
        with lock:
            sem_map.pop(k, None)
        s.set()

✽ RECALL in request hedging, what does the first cache-missing request do differently from all the ones behind it?

the first request finds no semaphore for the key, so it creates one, blocks everyone else, does the expensive db fetch, writes the result to the cache and a temporary result map, then signals and removes the semaphore. every later request finds the semaphore, waits, and grabs the value from the result map. one db query, no matter how many concurrent misses pile up.

Implementation Details That Matter

Why the Temporary Result Map?

You might wonder: "Why not just make waiting requests hit the cache again after the signal?"

Because that creates unnecessary load! If everyone waits and then immediately hits the cache again, you've just created another stampede on your cache layer.

The res_map is a temporary local storage (5-minute TTL) that holds the result just long enough for waiting requests to grab it directly. This eliminates the extra cache round-trip.
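One way to get that short-lived storage is a small dict with per-entry expiry. The sketch below is an assumption about how such a map could look, not the original implementation; `TTLMap` and its injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time

class TTLMap:
    """Dict-like store whose entries expire `ttl_seconds` after being written."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._data = {}            # key -> (expires_at, value)

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Lazily evict on read; a background sweep is another option.
            self._data.pop(key, None)
            return default
        return value

# Usage with a fake clock so the 5-minute expiry is visible immediately.
now = [0.0]
res_map = TTLMap(ttl_seconds=300.0, clock=lambda: now[0])
res_map.put("post-42", "hello")
fresh = res_map.get("post-42")     # still within the TTL
now[0] = 301.0
stale = res_map.get("post-42")     # past the TTL, so the entry is gone
```

This version is not thread-safe on its own; in a multi-threaded server you would guard `put` and `get` with a lock.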

✽ RECALL after the leader signals, why do waiters read from res_map instead of just hitting the cache again?

because hundreds of requests simultaneously re-hitting the cache is just a second stampede, aimed at the cache layer this time. the temporary local result map holds the value just long enough for the waiters to grab it directly — zero extra round-trips, no new herd.

When You Actually Need This

"I've been using Redis for years and never needed this!"

Fair point. This isn't some academic exercise. You need request hedging when you have:

High traffic with shared expensive resources
Cache expiration happening under concurrent load
Database queries that take >100ms
Flash sale scenarios or viral content

CDN Use Case (Real-World Example)

CDNs face this constantly:

Origin: Your S3 bucket or API server
Cache: CDN edge servers worldwide
Problem: Popular resource expires, 10,000 requests hit one edge server
Solution: Only ONE request goes to origin, others wait

This pattern has prevented countless outages for companies you use every day.

✽ RECALL you've run redis for years without request hedging and nothing broke. what combination of conditions changes that?

high traffic on a shared expensive resource, with cache expiry landing under concurrent load — slow db queries (>100ms), flash sales, viral content. CDNs live this every day: a popular resource expires at an edge server, thousands of requests pile up, and exactly one is let through to the origin while the rest wait. if your traffic never concentrates on one expiring key like that, you genuinely don't need it.

How DNS Really works and How it scales infinitely

2025-09-07T00:00:00Z

Why should we care about DNS? Because it's one of the most beautiful pieces of software ever written - it made the internet what it is today by giving a human-readable name to every single thing out there, so we don't have to remember weird IP addresses of machines. But here's the thing: most people think DNS is just "domain name to IP address lookup." That's like saying the internet is just "computers talking to each other." The real story is way more interesting.

what you'll take away

quick pointers so you know what to look for as you read:

dns is not a lookup table, it's a hierarchy. no single machine knows everything — each step takes you closer to the one that does.
a centralized dns database is impossible. volume of data, volume of queries, and every update funneling through one system that can never go down.
authoritative name servers are the source of truth. the entire resolution dance exists just to find the one that owns your zone.
records are typed mappings. A records map names to ips, CNAMEs map names to names — and the resolver chases the chain.
caching at every layer is the secret sauce. most lookups never leave your local network, and root servers barely get touched.

The Fundamental Problem DNS Solves

To connect to any machine on the internet, you need its IP address. When you type www.google.com in your browser, what your browser actually needs is the IP address to establish that TCP connection. But how do we discover that google.com → 17.53.21.253?

So this mapping has to be stored and configured somewhere: www.google.com means 17.53.21.253. In the DNS configuration, that is typically an A record or a CNAME record.

Why do we need DNS records at all? Because we need to store different types of mappings, not just domain-to-IP.

A Record - Maps a domain name directly to an IPv4 address:

 
www.google.com → 17.53.21.253

CNAME Record - Maps a domain name to another domain name (alias):

 
blog.google.com → www.google.com

Think of CNAME as a redirect. When someone looks up blog.google.com, DNS says "actually, go look up www.google.com instead."
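A minimal sketch of that redirect-chasing, assuming the records live in a plain in-memory dict (real resolvers get them over the network). The `resolve` function and the `max_hops` limit are illustrative choices, not part of any DNS library.

```python
RECORDS = {
    # (name, type) -> value, using the example mappings from this section
    ("www.google.com", "A"): "17.53.21.253",
    ("blog.google.com", "CNAME"): "www.google.com",
}

def resolve(name, records=RECORDS, max_hops=8):
    """Follow CNAME aliases until an A record is found; return its IPv4 address."""
    for _ in range(max_hops):          # cap the hops so an alias loop cannot spin forever
        address = records.get((name, "A"))
        if address is not None:
            return address
        alias = records.get((name, "CNAME"))
        if alias is None:
            raise KeyError(f"no A or CNAME record for {name}")
        name = alias                   # "go look up this other name instead"
    raise RuntimeError("too many CNAME hops")

direct = resolve("www.google.com")     # answered by the A record directly
aliased = resolve("blog.google.com")   # one CNAME hop, then the A record
```

Both lookups end at the same address, which is the point of the alias: change the A record for www.google.com once and blog.google.com follows.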

✽ RECALL when would you configure a CNAME instead of an A record, and what does the resolver do when it hits one?

an A record pins a name to an ipv4 address; a CNAME aliases a name to another name, so when the target's ip changes you update one record instead of many. the resolver treats a CNAME as a redirect — "go look up this other name instead" — and keeps resolving until it lands on an actual address.

DNS Zones: The Logical Foundation

A DNS Zone is like a database table that contains all the DNS records for a particular domain and its subdomains.

Example zone file for google.com:

 
google.com.        A      17.53.21.253
www.google.com.    A      17.53.21.253  
mail.google.com.   A      74.125.224.83
blog.google.com.   CNAME  www.google.com.

Why zones? Organization and delegation. Google manages everything under google.com in their zone, while university.edu manages their own zone separately.
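To make the "database table" analogy concrete, here is a rough sketch that loads the example records into a lookup table. It assumes a simplified three-column format (name, type, value); real zone files also carry TTLs, classes and directives, which this hypothetical `parse_zone` helper ignores.

```python
ZONE_TEXT = """
google.com.        A      17.53.21.253
www.google.com.    A      17.53.21.253
mail.google.com.   A      74.125.224.83
blog.google.com.   CNAME  www.google.com.
"""

def parse_zone(text):
    """Turn 'name type value' lines into a {(name, type): value} lookup table."""
    zone = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue                   # skip blank or malformed lines
        name, rtype, value = parts
        # Drop the trailing dot that marks a fully qualified name.
        zone[(name.rstrip("."), rtype)] = value.rstrip(".")
    return zone

zone = parse_zone(ZONE_TEXT)
mail_ip = zone[("mail.google.com", "A")]
blog_target = zone[("blog.google.com", "CNAME")]
```

Keying on `(name, type)` mirrors how a zone is queried: the question is always "what is the record of this type for this name?".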

Cloud Providers:

AWS Route 53 - Amazon's DNS service
Google Cloud DNS - Google's DNS service
Azure DNS - Microsoft's DNS service
Cloudflare DNS - Also provides CDN services

Traditional DNS Providers:

GoDaddy DNS - Comes free with domain registration
Namecheap DNS - Free with domains
DNSimple - Paid DNS hosting specialist
NS1 - Enterprise DNS provider

Free Options:

Cloudflare - Free tier available
Hurricane Electric - Free DNS hosting

If you're using AWS, you'd configure this in Route 53 as a hosted zone. This zone contains everything about google.com - all the mappings, all the subdomains, all the different record types.

Authoritative Name Servers: The Source of Truth

Authoritative Name Server = The server that actually stores and serves your DNS zone data.

Zones are served via Authoritative Name Servers. An authoritative name server can host multiple zones, and it answers DNS questions for the zones it owns.

These servers typically look like ns1.gns.com, ns2.gns.com, ns3.gns.com - usually offered by DNS providers and registrars like GoDaddy, Namecheap, or AWS. When you buy a domain, you configure which name servers should handle it.

Here's the key insight: if you somehow know the authoritative name server's address, you can get the record you're looking for. But how does your browser know which name server to ask?

✽ RECALL what makes a name server "authoritative", and why is the whole resolution process really just a hunt for one?

an authoritative name server actually hosts the zone — it stores the records and answers for the zones it owns, no guessing, no cache. if you already knew its address you could ask it directly and be done; the entire hierarchy of root and tld servers exists only because you don't, and each step just points you closer to it.

The Two Approaches: Centralized vs Decentralized

Let's think about this problem. How would we reach these authoritative name servers?

Option 1: The Centralized Database

Let there be a single massive "database" system that everybody reaches out to when they want DNS information.

This centralized approach is not scalable, fault tolerant, or even manageable. Think about the concerns: the volume of data, the number of requests and updates, and the fact that every change anywhere has to be communicated to this single system.

If this goes down, the entire internet is down. The sheer volume of requests it would need to handle and the amount of data it would need to store makes this approach impossible.

Option 2: Decentralized Approach - No One Machine Knows It All

This is where the beauty of DNS architecture comes in. The world went with decentralization where no single machine knows everything.

✽ RECALL why can't dns be one giant centralized database that everyone queries?

it fails on every axis — the volume of data, the volume of reads and updates, every change in the world funneling through one system, and a single point of failure that would take the whole internet down with it. so dns went decentralized: no one machine knows it all, but each machine knows enough to send you one step closer.

DNS Resolvers: Your Gateway to the System

A DNS resolver is a server that carries out the resolution of a domain name to an IP address. Where does this DNS resolver run?

It typically runs at your ISP, but you can run your own locally. Most home routers also act as one, usually by forwarding your queries to the ISP's resolver. You can check yours by running:

 
ipconfig /all    # Windows
# Look for DNS Servers entry

On my machine, I get 192.168.0.1 - that's my router doing DNS resolution for me. You can change this to popular DNS resolvers like:

Google: 8.8.8.8
Cloudflare: 1.1.1.1

Root Name Servers: The Foundation of the Internet

Think about this chicken-and-egg problem: To find any website, you need to ask a DNS server. But how do you find the DNS server's address? You'd need DNS to find DNS!

Root Name Servers solve this bootstrap problem. They're the "starting point" that everyone agrees on.

Here's where it gets really interesting. Say we are looking for www.google.com - we need to reach its Authoritative Name Server, but we do not know where it is!

13 Root Name Servers

Calls to the root name servers are infrequent: the IP addresses of the TLD (Top-Level Domain) servers rarely change, so they are heavily cached.

✽ RECALL why do resolvers almost never actually hit the root name servers?

the root only tells you where the tld servers are, and tld server addresses barely ever change — so that answer gets cached heavily and reused for days. the root is the bootstrap starting point everyone agrees on, not a server in the hot path.

The Complete DNS Resolution Process

Now, how does the DNS resolution process actually work? Let's walk through it step by step, assuming nothing is cached:

Putting It All Together

You buy mysite.com from GoDaddy (registrar)
GoDaddy registers your domain with .com TLD servers
By default, GoDaddy's name servers become authoritative for your zone
You can change to use AWS Route 53 instead
Your DNS zone contains A records, CNAME records, etc.

Step by Step Resolution

1. Query to Root Name Server: when queried for a domain name, the root server responds with the IP address of the server handling its TLD (e.g., .com, .in, .edu).

2. Query to TLD Server: the .com TLD server responds with the authoritative name server that owns the zone google.com.

3. Query to Authoritative Name Server: because it owns the corresponding zone, it can respond with what's configured against www.google.com, which is the IP address 17.53.21.253.

4. Browser Connection: your browser gets the IP address, establishes a TCP connection to the load balancer, sends an HTTP request, and gets a response - and that's how you see Google's homepage.
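The first three steps can be sketched as a walk over three in-memory tables, one per level of the hierarchy. This is a toy model under assumptions: the server labels (`tld-com`, `ns1.google.com`) and the `resolve_cold` function are invented for illustration, and a real resolver would send DNS queries over the network at each hop.

```python
# Each level only knows enough to point one step further down.
ROOT = {"com": "tld-com"}                                      # TLD -> server handling it
TLD_SERVERS = {"tld-com": {"google.com": "ns1.google.com"}}    # zone -> authoritative NS
AUTHORITATIVE = {"ns1.google.com": {"www.google.com": "17.53.21.253"}}

def resolve_cold(name):
    """Resolve `name` with nothing cached, recording each hop along the way."""
    hops = []
    labels = name.split(".")
    tld = labels[-1]                   # "com"
    zone = ".".join(labels[-2:])       # "google.com"

    tld_server = ROOT[tld]                         # 1. root: "ask the .com servers"
    hops.append(("root", tld_server))

    auth_server = TLD_SERVERS[tld_server][zone]    # 2. TLD: "ask this zone's name server"
    hops.append(("tld", auth_server))

    address = AUTHORITATIVE[auth_server][name]     # 3. authoritative: the actual record
    hops.append(("authoritative", address))
    return address, hops

address, hops = resolve_cold("www.google.com")
for level, answer in hops:
    print(f"{level:>13} -> {answer}")
```

Only the last table holds the address itself; the first two hold nothing but referrals, which is why no single level needs to know everything.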

✽ RECALL walk through resolving www.google.com from a totally cold cache — who do you ask, and what does each server actually tell you?

the resolver asks a root server, which points to the .com tld servers; the tld server points to the authoritative name server that owns the google.com zone; the authoritative server returns the record configured against www — the actual ip. nobody answers the question directly except the last hop; everyone else just delegates you one level down.

The Caching Strategy: How DNS Scales Infinitely

Here's the secret sauce: each machine takes us closer to the machine that knows the answer.

Every answer along the way is cached for a few hours across multiple layers: browser, OS, and resolver.

The recursion continues until it reaches the authoritative name server for the zone google.com. That server checks the record against www, which may point to a load balancer, and the name then resolves to the IP of the load balancer.

Key insight: The beauty of DNS is that it's a step-by-step resolution process. You go to .com TLD, they give you google.com authority, you go to that and say "Give me www.google.com" - it's hierarchical resolution that scales infinitely.
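A rough sketch of why caching takes the load off the hierarchy, assuming a single fixed TTL for every answer (in real DNS each record carries its own TTL). `CachingResolver`, `full_walk` and the fake clock are hypothetical names used only for this illustration.

```python
import time

class CachingResolver:
    """Wrap an upstream lookup and reuse answers until their TTL runs out."""

    def __init__(self, upstream, ttl_seconds=3600.0, clock=time.monotonic):
        self._upstream = upstream      # function: name -> ip (the full resolution walk)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be shown without waiting
        self._cache = {}               # name -> (expires_at, ip)
        self.upstream_calls = 0

    def resolve(self, name):
        hit = self._cache.get(name)
        if hit is not None and self._clock() < hit[0]:
            return hit[1]              # answered locally, nothing leaves this layer
        self.upstream_calls += 1
        ip = self._upstream(name)
        self._cache[name] = (self._clock() + self._ttl, ip)
        return ip

def full_walk(name):
    """Stand-in for the root -> TLD -> authoritative walk."""
    return {"www.google.com": "17.53.21.253"}[name]

now = [0.0]
resolver = CachingResolver(full_walk, ttl_seconds=3600.0, clock=lambda: now[0])
answers = [resolver.resolve("www.google.com") for _ in range(1000)]
calls_while_fresh = resolver.upstream_calls    # lookups that went upstream so far
now[0] = 3601.0                                # the cached answer has expired
resolver.resolve("www.google.com")
calls_after_expiry = resolver.upstream_calls
```

A thousand lookups cost one upstream walk while the entry is fresh, and one more after it expires. Stack the same wrapper at the browser, OS and resolver layers and very little traffic is left for the servers higher up.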

✽ RECALL what's the actual mechanism that lets dns scale "infinitely"?

hierarchy plus caching. the hierarchy splits the namespace so each server only owns its slice, and every answer along the chain gets cached — browser, os, resolver — for hours. so most lookups never leave the local network, and the higher you go up the hierarchy the less traffic ever arrives.

Why This Architecture Is Brilliant

The DNS architecture solves several critical problems:

Infinite Scalability - No single point of failure, distributed globally with heavy caching at every layer.

Fault Tolerance - Multiple servers at each level, anycast distribution, and redundant authoritative name servers.

Human Usability - We remember google.com instead of 17.53.21.253.

Decentralized Management - Organizations control their own zones without depending on a central authority.

Performance - Heavy caching means most DNS lookups never leave your local network.

This is one of those systems where the more you understand it, the more you appreciate how elegantly it solves an impossibly complex global coordination problem. The fact that typing any domain name anywhere in the world just works - that's the magic of DNS.

Note: In upcoming content, I'll be building my own DNS server to show you exactly how this protocol works under the hood.
