Scaling Back-End Applications
Most applications don't need to scale until they succeed. Premature scaling is wasted effort, but building in ways that prevent scaling is also a mistake. The goal: build normally, then scale when necessary.
Scaling means handling more requests and more data. Two approaches: vertical (bigger servers) and horizontal (more servers).
Vertical Scaling: Bigger Servers
Add more RAM, CPU, or disk to a single server. A 4GB server becomes 16GB. Simple but has hard limits.
Advantages: simple, no code changes, no coordination between servers. Disadvantages: finite (you can only scale so high), expensive, single point of failure.
Vertical scaling gets you far. A single modern server can handle thousands of concurrent connections. Consider horizontal scaling only after hitting vertical limits.
Horizontal Scaling: More Servers
Run the same application on multiple servers. A load balancer distributes requests. If one server goes down, others handle traffic.
Advantages: near-unlimited scaling (keep adding servers), redundancy, load distribution. Disadvantages: complexity (coordination, shared state), more operational overhead.
Stateless Architecture: The Requirement
Horizontal scaling requires stateless architecture: servers can't store state (such as user sessions) in memory. If the next request is routed to a different server, that server won't find the session.
Solution: store state externally (Redis, database). Any server can handle any request.
This is mandatory for horizontal scaling. If your application stores state in memory (user sessions, file uploads in progress), you can't scale horizontally.
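The idea can be sketched with a shared store that every server reads and writes. Here `SharedSessionStore` is a hypothetical in-memory stand-in for Redis; in production you'd use a Redis client (e.g. redis-py's `get`/`set`) so the state lives outside any one server:

```python
import json

class SharedSessionStore:
    """Stand-in for Redis: in production this data lives in an external
    store that every application server can reach."""
    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = json.dumps(session)

    def load(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw else None

store = SharedSessionStore()

def handle_request(server_name, session_id):
    # Any server can handle any request: state lives in the shared
    # store, never in the server's own memory.
    session = store.load(session_id) or {"visits": 0}
    session["visits"] += 1
    store.save(session_id, session)
    return f"{server_name} served visit {session['visits']}"

# Two different servers handle the same session seamlessly.
print(handle_request("server-a", "sess-123"))
print(handle_request("server-b", "sess-123"))
```

Because neither server holds the session, the load balancer is free to route each request anywhere.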
Load Balancing
A load balancer sits in front of servers and distributes requests. Round-robin (each server gets a request in turn), least connections (send to least busy), or hash-based (same client always goes to same server).
Load balancers: nginx, AWS Application Load Balancer, HAProxy, Cloudflare.
Health checks: the load balancer checks if servers are alive and removes dead ones from rotation. This enables graceful degradation.
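A minimal sketch of round-robin distribution combined with health checks (server names are hypothetical; a real load balancer like nginx or HAProxy does this at the network level):

```python
import itertools

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.healthy = set(servers)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        # A failed health check removes the server from rotation.
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Round-robin, skipping servers that failed their health check.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")          # health check failed
print([lb.next_server() for _ in range(4)])  # app-2 is skipped
```

This is the graceful degradation the text describes: traffic keeps flowing to the remaining servers while a dead one is out of rotation.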
Database Scaling: Often the Bottleneck
Most scaling problems aren't in the application server; they're in the database. A single database can handle only so many queries.
Strategies:
- Optimize queries: Add indexes, rewrite slow queries. Often the biggest improvement.
- Connection pooling: A pooler such as PgBouncer limits connections and reuses them, dramatically increasing capacity.
- Read replicas: Read-only copies of the database. Distribute reads across replicas.
- Caching: Redis in front of the database reduces the number of queries that reach the database itself.
- Database sharding: Split data across multiple databases. Complex but necessary at huge scale.
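Connection pooling from the list above can be sketched in a few lines. This is the idea behind a pooler like PgBouncer, not its implementation; `make_conn` is a hypothetical connection factory standing in for a real database connection:

```python
import queue

class ConnectionPool:
    """Minimal pool: N real connections shared by many callers,
    reused instead of reopened for every query."""
    def __init__(self, make_conn, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_conn())

    def acquire(self):
        return self._idle.get()   # blocks if all connections are busy

    def release(self, conn):
        self._idle.put(conn)      # return to the pool for reuse

# Hypothetical factory that counts how many connections were opened.
counter = {"opened": 0}
def make_conn():
    counter["opened"] += 1
    return f"conn-{counter['opened']}"

pool = ConnectionPool(make_conn, size=2)
for _ in range(100):              # 100 queries...
    c = pool.acquire()
    pool.release(c)
print(counter["opened"])          # ...but only 2 connections ever opened
```

Opening a database connection is expensive; reusing a small fixed set is what lets a single database serve far more application traffic.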
Caching and read replicas get you very far. Sharding is complexity only worth taking on at massive scale.
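Read replicas need routing logic in front of them: writes must go to the primary, reads can fan out across replicas. A sketch, with connection names as stand-ins for real database hosts:

```python
import itertools

class RoutingDB:
    """Writes go to the primary; reads round-robin across replicas.
    The string names stand in for real connections to each host."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # Naive SELECT check; real routers must also handle replication
        # lag, e.g. read-your-own-writes going to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

db = RoutingDB("primary", ["replica-1", "replica-2"])
print(db.route("SELECT * FROM users"))   # a replica
print(db.route("UPDATE users SET name = 'x'"))  # the primary
```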
Caching to Reduce Database Load
A well-configured cache in front of the database can extend its capacity 10x. An expensive query that took 100ms is cached for an hour, collapsing thousands of identical queries into one.
This is the first scaling lever to pull: cache expensive queries and API responses.
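The usual pattern is cache-aside: check the cache first, fall back to the database on a miss, then store the result with a TTL. A minimal sketch, where the dict stands in for Redis and `slow_query` for the expensive query from the text:

```python
import time

cache = {}            # stand-in for a Redis instance
TTL = 3600            # cache results for 1 hour
calls = {"db": 0}

def slow_query(user_id):
    calls["db"] += 1  # stands in for the expensive 100ms database query
    return {"user_id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and time.time() - entry["at"] < TTL:
        return entry["value"]        # cache hit: the database is untouched
    value = slow_query(user_id)      # cache miss: query once...
    cache[key] = {"value": value, "at": time.time()}  # ...store for an hour
    return value

get_user(1)
get_user(1)
print(calls["db"])  # 1 -- the second call never reached the database
```

With Redis, the dict lookup and TTL check become a single `GET` (Redis expires keys itself via `SET ... EX 3600`).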
Auto-Scaling
Cloud services (ECS, Kubernetes, Railway) can automatically add/remove servers based on load. CPU above 70%? Add a server. CPU below 30%? Remove one.
This is powerful: you automatically scale to handle spikes and scale down to save costs. But requires monitoring and alarms.
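The decision rule auto-scalers apply can be sketched as a pure function. Thresholds and bounds here are illustrative; real platforms (Kubernetes HPA, ECS) also add cooldown periods so the fleet doesn't flap:

```python
def desired_servers(current, cpu_percent,
                    scale_up_at=70, scale_down_at=30,
                    min_servers=1, max_servers=10):
    # Mirrors the rule in the text: above 70% CPU add a server,
    # below 30% remove one, within bounds.
    if cpu_percent > scale_up_at:
        return min(current + 1, max_servers)
    if cpu_percent < scale_down_at and current > min_servers:
        return current - 1
    return current

print(desired_servers(3, 85))  # 4: spike, scale up
print(desired_servers(3, 20))  # 2: idle, scale down to save cost
print(desired_servers(3, 50))  # 3: steady state, no change
```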
Scaling Strategies Comparison
| Strategy | When Appropriate | Complexity | Cost |
|---|---|---|---|
| Single server, optimized | 10-1,000 requests/sec | Low | Low |
| Vertical scaling | 1,000-10,000 requests/sec on a single server | Low | Medium |
| Horizontal scaling + load balancer | 10,000+ requests/sec across servers | Medium | Medium to high |
| Read replicas + caching | Database is the bottleneck | Medium | Medium |
| Database sharding | Millions of requests/sec | High | High |
When To Scale
Don't scale before you need to. Monitor:
- Response time: If p95 response time exceeds target, optimize or scale
- Error rate: Timeouts and 5xx errors suggest overload
- CPU/Memory: High utilization under normal load suggests the server can't handle peak
- Database query time: Slow queries indicate database is the bottleneck
Measure. You might be fine with one server when you think you need ten.
Scaling is Not Free
Each scaling lever adds complexity:
- Multiple servers require load balancing and state management
- Caching requires invalidation logic and cache eviction
- Read replicas introduce replication lag: a read from a replica can return data that is slightly stale relative to the primary
- Sharding requires routing logic and cross-shard queries are complex
The goal is to scale as simply as possible. Optimize code and database first. Use caching. Add servers with a load balancer. Only add sharding if you truly need it (millions of requests per second, petabytes of data).
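The routing logic sharding requires can be sketched with a stable hash of the shard key. Shard names here are hypothetical; the essential property is that the same key always maps to the same shard:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]  # hypothetical shard hosts

def shard_for(user_id):
    # Stable hash of the shard key, so the same user always lands on the
    # same shard. (Python's built-in hash() is randomized per process,
    # so use hashlib for a deterministic result.)
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

assert shard_for(42) == shard_for(42)  # deterministic routing
# A query that can't name a shard key must fan out to every shard --
# this is why cross-shard queries are complex and sharding is a last resort.
```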
The Principle
Scaling is about removing bottlenecks. Measure to find the actual bottleneck (usually the database). Apply the appropriate scaling strategy. Repeat as you grow. Don't anticipate problems you haven't hit—you might solve the wrong one.