The quest for sub-100ms latency in distributed systems

Writing

100ms is the threshold where users start to notice a delay. Below it, interactions feel instant. Above it, there's friction — small, almost imperceptible, but compounding across every click and keystroke until it erodes trust in your product.

Getting an API consistently below that number in a distributed system isn't a single optimization. It's a stack of decisions that each shave off a few milliseconds — and an understanding of where time actually goes.

Measure first, optimize second

The most common mistake is optimizing by intuition. You add a cache because caches are fast, or rewrite a query because it feels slow, without knowing whether either is the actual bottleneck. Start with distributed tracing. Tools like OpenTelemetry with a Jaeger or Tempo backend will show you exactly where each request spends its time.

In most web APIs, the breakdown looks something like this: 2–5ms network RTT within a datacenter, 5–20ms database query time, 1–3ms serialization, and then a long tail of business logic that's hard to categorize. The database is almost always where the time goes. Start there.

Database query optimization

The single highest-leverage optimization in most applications is adding a missing index. An unindexed `WHERE` clause on a table with a million rows will do a full sequential scan — O(n) instead of O(log n). Run `EXPLAIN ANALYZE` on your slow queries and look for "Seq Scan" on large tables.

N+1 queries are the second most common culprit. They happen when you fetch a list of records and then make a separate query for each one. ORMs hide this from you — which is why it's worth logging query counts in development and alarming when a single request issues more than, say, 10 queries.

Connection pooling matters more than most developers expect. Opening a new database connection takes 20–100ms on its own. A connection pool keeps connections warm and reuses them across requests. Tools like PgBouncer (for PostgreSQL) can be the difference between a p99 of 80ms and 400ms under load.

Caching with intention

Caching is powerful and dangerous in equal measure. The danger isn't cache misses — it's stale data that silently corrupts your application's correctness. Before reaching for Redis, ask: what's the cost of serving stale data here? For user profile data, probably acceptable. For an account balance, definitely not.

When caching is appropriate, think in layers. Application-level caches (in-process, keyed by request parameters) are the fastest but consume memory and don't share across instances. Distributed caches like Redis add a network hop but are shared. CDN caches work for public, read-heavy responses.

Cache key design is underrated. A poorly scoped key leads to either too many misses (cache is useless) or too many false hits (stale data). Include every dimension of variation in the key: user ID if the data is user-specific, locale if it's translated, and a version hash if the schema can change.

Network path optimization

Once your application logic is fast, the network itself becomes the constraint. An API server in Singapore serving users in São Paulo has a minimum latency of ~280ms from physics alone — the speed of light through fiber. Edge computing (Cloudflare Workers, Vercel Edge Functions) moves computation closer to users.

Keep-alive connections and HTTP/2 multiplexing reduce the overhead of connection setup on repeated requests. If you're still serving an API over HTTP/1.1 without keep-alive, you're leaving easy performance on the table.

The goal isn't always to hit sub-100ms on every request globally. It's to understand your latency budget, know where it goes, and make deliberate trade-offs. An API that's 120ms with correct data beats one that's 50ms with a cache invalidation bug.

NextDatabaseWhen PostgreSQL isn't enough: Moving to specialized datastores