Scaling Back-End Applications
Most applications don't need to scale until they succeed. Premature scaling is wasted effort, but building in ways that prevent scaling is also a mistake. The goal: build normally, then scale when necessary.
Scaling means handling more requests and more data. Two approaches: vertical (bigger servers) and horizontal (more servers).
Vertical Scaling: Bigger Servers
Add more RAM, CPU, or disk to a single server. A 4GB server becomes 16GB. Simple but has hard limits.
Advantages: simple, no code changes, no coordination between servers. Disadvantages: finite (you can only scale so high), expensive, single point of failure.
Vertical scaling gets you far. A single modern server can handle thousands of concurrent connections. Consider horizontal scaling only after hitting vertical limits.
Horizontal Scaling: More Servers
Run the same application on multiple servers. A load balancer distributes requests. If one server goes down, others handle traffic.
Advantages: near-unlimited scaling (keep adding servers), redundancy, load distribution. Disadvantages: complexity (coordination, shared state), more operational overhead.
Stateless Architecture: The Requirement
Horizontal scaling requires stateless architecture: servers can't store state (such as user sessions) in memory. If the next request is routed to a different server, that server won't find the session.
Solution: store state externally (Redis, database). Any server can handle any request.
This is mandatory for horizontal scaling. If your application stores state in memory (user sessions, file uploads in progress), you can't scale horizontally.
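The idea can be sketched with a shared store that every server reads and writes. Here `SharedSessionStore` is a hypothetical in-memory stand-in for Redis; in production you'd use a Redis client (e.g. redis-py's `get`/`set`) so the state lives outside any one server:

```python
import json

class SharedSessionStore:
    """Stand-in for Redis: in production this data lives in an external
    store that every application server can reach."""
    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = json.dumps(session)

    def load(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw else None

store = SharedSessionStore()

def handle_request(server_name, session_id):
    # Any server can handle any request: state lives in the shared
    # store, never in the server's own memory.
    session = store.load(session_id) or {"visits": 0}
    session["visits"] += 1
    store.save(session_id, session)
    return f"{server_name} served visit {session['visits']}"

# Two different servers handle the same session seamlessly.
print(handle_request("server-a", "sess-123"))
print(handle_request("server-b", "sess-123"))
```

Because neither server holds the session, the load balancer is free to route each request anywhere.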
Load Balancing
A load balancer sits in front of servers and distributes requests. Round-robin (each server gets a request in turn), least connections (send to least busy), or hash-based (same client always goes to same server).
Load balancers: nginx, AWS Application Load Balancer, HAProxy, Cloudflare.
Health checks: the load balancer checks if servers are alive and removes dead ones from rotation. This enables graceful degradation.
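A minimal sketch of round-robin distribution combined with health checks (server names are hypothetical; a real load balancer like nginx or HAProxy does this at the network level):

```python
import itertools

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.healthy = set(servers)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        # A failed health check removes the server from rotation.
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Round-robin, skipping servers that failed their health check.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")          # health check failed
print([lb.next_server() for _ in range(4)])  # app-2 is skipped
```

This is the graceful degradation the text describes: traffic keeps flowing to the remaining servers while a dead one is out of rotation.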
Database Scaling: Often the Bottleneck
Most scaling problems aren't in the application server; they're in the database. A single database can handle only so many queries.
Strategies:
- Optimize queries: Add indexes, rewrite slow queries. Often the biggest improvement.
- Connection pooling: A pooler such as PgBouncer limits connections and reuses them, dramatically increasing capacity.
- Read replicas: Read-only copies of the database. Distribute reads across replicas.
- Caching: Redis in front of the database reduces the number of queries that reach the database itself.
- Database sharding: Split data across multiple databases. Complex but necessary at huge scale.
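Connection pooling from the list above can be sketched in a few lines. This is the idea behind a pooler like PgBouncer, not its implementation; `make_conn` is a hypothetical connection factory standing in for a real database connection:

```python
import queue

class ConnectionPool:
    """Minimal pool: N real connections shared by many callers,
    reused instead of reopened for every query."""
    def __init__(self, make_conn, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_conn())

    def acquire(self):
        return self._idle.get()   # blocks if all connections are busy

    def release(self, conn):
        self._idle.put(conn)      # return to the pool for reuse

# Hypothetical factory that counts how many connections were opened.
counter = {"opened": 0}
def make_conn():
    counter["opened"] += 1
    return f"conn-{counter['opened']}"

pool = ConnectionPool(make_conn, size=2)
for _ in range(100):              # 100 queries...
    c = pool.acquire()
    pool.release(c)
print(counter["opened"])          # ...but only 2 connections ever opened
```

Opening a database connection is expensive; reusing a small fixed set is what lets a single database serve far more application traffic.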
Caching and read replicas get you very far. Sharding is complexity only worth taking on at massive scale.
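Read replicas need routing logic in front of them: writes must go to the primary, reads can fan out across replicas. A sketch, with connection names as stand-ins for real database hosts:

```python
import itertools

class RoutingDB:
    """Writes go to the primary; reads round-robin across replicas.
    The string names stand in for real connections to each host."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # Naive SELECT check; real routers must also handle replication
        # lag, e.g. read-your-own-writes going to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

db = RoutingDB("primary", ["replica-1", "replica-2"])
print(db.route("SELECT * FROM users"))   # a replica
print(db.route("UPDATE users SET name = 'x'"))  # the primary
```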
Caching to Reduce Database Load
A well-configured cache in front of the database can extend its capacity 10x. An expensive query that took 100ms is cached for an hour, collapsing thousands of identical queries into one.
This is the first scaling lever to pull: cache expensive queries and API responses.
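The usual pattern is cache-aside: check the cache first, fall back to the database on a miss, then store the result with a TTL. A minimal sketch, where the dict stands in for Redis and `slow_query` for the expensive query from the text:

```python
import time

cache = {}            # stand-in for a Redis instance
TTL = 3600            # cache results for 1 hour
calls = {"db": 0}

def slow_query(user_id):
    calls["db"] += 1  # stands in for the expensive 100ms database query
    return {"user_id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and time.time() - entry["at"] < TTL:
        return entry["value"]        # cache hit: the database is untouched
    value = slow_query(user_id)      # cache miss: query once...
    cache[key] = {"value": value, "at": time.time()}  # ...store for an hour
    return value

get_user(1)
get_user(1)
print(calls["db"])  # 1 -- the second call never reached the database
```

With Redis, the dict lookup and TTL check become a single `GET` (Redis expires keys itself via `SET ... EX 3600`).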
Auto-Scaling
Cloud services (ECS, Kubernetes, Railway) can automatically add/remove servers based on load. CPU above 70%? Add a server. CPU below 30%? Remove one.
This is powerful: you automatically scale to handle spikes and scale down to save costs. But requires monitoring and alarms.
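The decision rule auto-scalers apply can be sketched as a pure function. Thresholds and bounds here are illustrative; real platforms (Kubernetes HPA, ECS) also add cooldown periods so the fleet doesn't flap:

```python
def desired_servers(current, cpu_percent,
                    scale_up_at=70, scale_down_at=30,
                    min_servers=1, max_servers=10):
    # Mirrors the rule in the text: above 70% CPU add a server,
    # below 30% remove one, within bounds.
    if cpu_percent > scale_up_at:
        return min(current + 1, max_servers)
    if cpu_percent < scale_down_at and current > min_servers:
        return current - 1
    return current

print(desired_servers(3, 85))  # 4: spike, scale up
print(desired_servers(3, 20))  # 2: idle, scale down to save cost
print(desired_servers(3, 50))  # 3: steady state, no change
```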
Scaling Strategies Comparison
| Strategy | When Appropriate | Complexity | Cost |
|---|---|---|---|
| Single server, optimized | 10-1,000 requests/sec | Low | Low |
| Vertical scaling | 1,000-10,000 requests/sec on a single server | Low | Medium |
| Horizontal scaling + load balancer | 10,000+ requests/sec across servers | Medium | Medium to high |
| Read replicas + caching | Database is the bottleneck | Medium | Medium |
| Database sharding | Millions of requests/sec | High | High |
When To Scale
Don't scale before you need to. Monitor:
- Response time: If p95 response time exceeds target, optimize or scale
- Error rate: Timeouts and 5xx errors suggest overload
- CPU/Memory: High utilization under normal load suggests the server can't handle peak
- Database query time: Slow queries indicate database is the bottleneck
Measure. You might be fine with one server when you think you need ten.
Scaling is Not Free
Each scaling lever adds complexity:
- Multiple servers require load balancing and state management
- Caching requires invalidation logic and cache eviction
- Read replicas introduce replication lag: a read from a replica can return data that is slightly stale relative to the primary
- Sharding requires routing logic and cross-shard queries are complex
The goal is to scale as simply as possible. Optimize code and database first. Use caching. Add servers with a load balancer. Only add sharding if you truly need it (millions of requests per second, petabytes of data).
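The routing logic sharding requires can be sketched with a stable hash of the shard key. Shard names here are hypothetical; the essential property is that the same key always maps to the same shard:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]  # hypothetical shard hosts

def shard_for(user_id):
    # Stable hash of the shard key, so the same user always lands on the
    # same shard. (Python's built-in hash() is randomized per process,
    # so use hashlib for a deterministic result.)
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

assert shard_for(42) == shard_for(42)  # deterministic routing
# A query that can't name a shard key must fan out to every shard --
# this is why cross-shard queries are complex and sharding is a last resort.
```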
The Principle
Scaling is about removing bottlenecks. Measure to find the actual bottleneck (usually the database). Apply the appropriate scaling strategy. Repeat as you grow. Don't anticipate problems you haven't hit—you might solve the wrong one.