SaaS·Dec 2024·9 min read

Building Multi-tenant SaaS: Security, Isolation, and the Data Trap

Lessons learned from implementing multi-tenancy: from row-level security to separate infrastructure silos.

Multi-tenancy is one of those architectural decisions that seems straightforward until you're three years in and a customer asks why their data showed up in another customer's dashboard. That bug doesn't just cost you a customer — it costs you the trust of everyone who hears about it.

The stakes are high enough that the isolation strategy deserves serious thought before you write a line of code. Here are the approaches, the trade-offs, and the lessons that come from living with each one.

Three isolation models

At the database level, multi-tenancy comes in three flavors. The first is separate databases: each tenant gets their own database instance. Maximum isolation, maximum cost. This is the right model for enterprise customers who contractually require it, or for workloads where a runaway tenant's queries could affect others.

The second is shared database, separate schemas: all tenants share a database server, but each gets their own schema (namespace). PostgreSQL's schema feature supports this well. It's a reasonable middle ground — cheaper than separate databases, with good logical isolation. The operational overhead of schema migrations becomes significant at scale, though.
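To make the migration overhead concrete, here is a minimal sketch of how one logical change fans out into one statement per tenant schema. The schema names, the `invoices` table, and the helper names are hypothetical, and the statements are only generated here, not executed.

```python
import re

# Conservative whitelist for identifiers we are willing to interpolate into DDL.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def quote_ident(name: str) -> str:
    """Validate a schema name and wrap it in double quotes."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f'"{name}"'


def per_schema_statements(tenant_schemas: list[str], ddl_template: str) -> list[str]:
    """Expand one logical migration into one DDL statement per tenant schema."""
    return [ddl_template.format(schema=quote_ident(s)) for s in tenant_schemas]


# Hypothetical tenants and table; a real system would read the schema list
# from its own tenant registry.
statements = per_schema_statements(
    ["tenant_acme", "tenant_globex"],
    "ALTER TABLE {schema}.invoices ADD COLUMN notes text",
)
for stmt in statements:
    print(stmt)
```

With two tenants this is trivial. With thousands, every migration becomes a batch job that can fail partway through, which is where the operational cost comes from.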

The third, and most common, is shared database, shared schema with a `tenant_id` column on every table. Cheapest to operate, hardest to get right. Every query must filter by `tenant_id`. Miss one, and you have a data leak.
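The failure mode is easy to reproduce. The sketch below uses an in-memory SQLite database purely as a stand-in for a shared-schema setup; the `invoices` table, the tenant names, and both functions are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoices (tenant_id, amount) VALUES (?, ?)",
    [("acme", 100), ("acme", 250), ("globex", 999)],
)


def invoices_for(conn: sqlite3.Connection, tenant_id: str) -> list[int]:
    """Correct query: scoped to a single tenant."""
    rows = conn.execute(
        "SELECT amount FROM invoices WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [row[0] for row in rows]


def invoices_unscoped(conn: sqlite3.Connection) -> list[int]:
    """Buggy query: the tenant filter was forgotten."""
    rows = conn.execute("SELECT amount FROM invoices ORDER BY id")
    return [row[0] for row in rows]


scoped = invoices_for(conn, "acme")   # only acme's rows
leaked = invoices_unscoped(conn)      # includes globex's 999 as well
```

Nothing in the buggy function fails or warns. It returns plausible-looking data, which is why this class of bug tends to be found by customers rather than by tests.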

Row-level security: the right way to enforce isolation

If you go with the shared-schema approach, don't rely on application-level `WHERE tenant_id = ?` clauses alone. They work until they don't — a new developer writes a query, forgets the filter, and the bug makes it to production.

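PostgreSQL's Row-Level Security (RLS) lets you enforce tenant isolation at the database level. You define a policy: "only return rows where `tenant_id` matches the current session's tenant." The application sets a session variable identifying the tenant before it runs any queries; with pooled connections, set it per transaction (for example with `SET LOCAL`) so a value never carries over to the next tenant's request. The database then applies the filter automatically, on every query, without trusting the application layer. One caveat: superusers, roles with `BYPASSRLS`, and (unless you use `FORCE ROW LEVEL SECURITY`) the table owner are exempt from policies, so the application should connect as an ordinary role.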
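As a rough sketch of what that setup could look like, the helper below generates the statements for one table. The policy name `tenant_isolation`, the setting name `app.current_tenant`, and the assumption that `tenant_id` is a `uuid` column are all illustrative choices, not something this article prescribes. The `%s` placeholders assume a psycopg-style driver. The policy shown applies to all commands rather than only SELECT.

```python
import re

_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def rls_statements(table: str, setting: str = "app.current_tenant") -> list[str]:
    """Build the DDL that turns on tenant isolation for one table.

    Assumes the table has a uuid tenant_id column (an assumption of this sketch).
    """
    if not _IDENT.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # FORCE makes the policy apply to the table owner too.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY",
        f"CREATE POLICY tenant_isolation ON {table} "
        f"USING (tenant_id = current_setting('{setting}')::uuid)",
    ]


def set_tenant_statement(tenant_id: str, setting: str = "app.current_tenant"):
    """Return (sql, params) that scope the current transaction to one tenant.

    The third argument to set_config (is_local = true) limits the value to the
    current transaction, which is what you want with pooled connections.
    """
    return "SELECT set_config(%s, %s, true)", (setting, tenant_id)


for stmt in rls_statements("invoices"):
    print(stmt + ";")
```

The application would run the `set_config` statement at the start of each transaction, before any tenant data is touched.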

The setup requires discipline: every table needs `tenant_id`, every query context needs the session variable, and you need to verify RLS is enabled on every table (easy to miss when you add a new one). But the guarantee you get in return — that a missing WHERE clause can't leak data — is worth it.
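One way to catch a table that was added without RLS is a check that runs in CI or on a schedule. The sketch below assumes tenant tables live in the `public` schema and that the rows come from the catalog query shown; the sample rows are invented so the filtering logic can be read on its own.

```python
# Catalog query listing ordinary tables and their RLS flags.
# Partitioned parents have relkind 'p' and would need to be added if you use them.
RLS_STATUS_SQL = """
SELECT c.relname, c.relrowsecurity, c.relforcerowsecurity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
"""


def tables_missing_rls(rows) -> list[str]:
    """Return table names where RLS is not both enabled and forced.

    rows: iterable of (table_name, rls_enabled, rls_forced) tuples,
    in the column order produced by RLS_STATUS_SQL.
    """
    return sorted(name for name, enabled, forced in rows if not (enabled and forced))


# Invented sample of what the query might return.
sample_rows = [
    ("invoices", True, True),
    ("audit_log", True, False),   # enabled, but the owner still bypasses it
    ("new_feature", False, False),  # someone forgot entirely
]
missing = tables_missing_rls(sample_rows)
```

Failing the build when `missing` is non-empty turns "easy to miss" into "hard to merge".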

The data trap: migrations at scale

Here's the problem nobody warns you about. In a single-tenant application, a schema migration runs once, and if it's slow, you schedule it for a maintenance window. In a shared-schema multi-tenant application with 10,000 tenants and 500 million rows, that same migration might run for hours while holding a lock that blocks every tenant's queries against the table.

Online schema change tools (like `pg_repack` or `pglogical`-based approaches) help with some operations, but they have limits. Adding a column with a default value required rewriting the entire table in PostgreSQL 10 and earlier. PostgreSQL 11 made this a metadata-only operation for constant (non-volatile) defaults, one of many reasons to stay on a recent version.

The discipline that saves you here is: test every migration against production-scale data volumes before you run it in production. A migration that takes 2ms on a development dataset can take 20 minutes on 500 million rows. The difference matters enormously when you have SLAs.
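A back-of-the-envelope extrapolation shows how the gap arises. The function below assumes runtime scales linearly with row count, which is an optimistic simplification (index maintenance, WAL volume, and lock contention usually make it worse), and the 1,000-row development table is a made-up figure.

```python
def linear_estimate_seconds(
    sample_rows: int, sample_seconds: float, target_rows: int
) -> float:
    """Scale a measured runtime to a larger table, assuming linear cost per row."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_seconds * (target_rows / sample_rows)


# Hypothetical dev table of 1,000 rows where the migration took 2 ms.
estimate = linear_estimate_seconds(
    sample_rows=1_000, sample_seconds=0.002, target_rows=500_000_000
)
print(f"{estimate / 60:.1f} minutes")  # roughly 17 minutes under the linear assumption
```

Treat a number like this as a floor, not a forecast. It tells you the migration needs a real rehearsal against a production-sized copy, not how long that rehearsal will take.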

Tenant-aware rate limiting and resource isolation

Even with perfect data isolation, one tenant can affect others through resource consumption. A tenant running a heavy export job, or making API calls in a tight loop, consumes database connections, CPU, and I/O that all tenants share. Without controls, your largest or most aggressive tenant will degrade the experience for everyone else.

Implement per-tenant rate limiting at the API gateway level. Track resource usage — query time, rows scanned, API calls — per tenant. When a tenant approaches a limit, throttle them before they impact others. This is table stakes for a SaaS that serves more than a handful of customers.
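A token bucket per tenant is one common way to implement this kind of throttling. The sketch below is in-memory and single-process, so it illustrates the logic rather than a gateway deployment, which would typically need shared state. The class name, parameters, and limits are illustrative.

```python
import time


class TenantRateLimiter:
    """In-memory token bucket keyed by tenant ID (single-process sketch)."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so tests can control time
        self._buckets: dict[str, tuple[float, float]] = {}  # tenant -> (tokens, last_seen)

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the tenant is under its limit."""
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last_seen = self._buckets.get(tenant_id, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last_seen) * self.refill_per_second)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed


# Usage: a burst of 20 requests, refilling at 5 requests per second, per tenant.
limiter = TenantRateLimiter(capacity=20, refill_per_second=5)
if not limiter.allow("acme"):
    print("429 Too Many Requests")
```

The `cost` parameter lets heavier operations, such as an export, spend more tokens than a simple read, so one budget can cover both API calls and expensive queries.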

The architecture of multi-tenant SaaS is ultimately about one thing: making guarantees you can keep. Isolation guarantees. Performance guarantees. Data guarantees. The complexity is in building systems that keep those guarantees even as the tenant count grows, the data volumes compound, and the team churns. Start with the strictest isolation model you can afford and relax it deliberately — never the other way around.
