

Disaster Recovery

Last reviewed: March 2026

Your data center catches fire. Your database is corrupted. Someone deletes production data by accident. Disaster recovery is about having a plan for when everything goes wrong.

RTO and RPO

RTO (Recovery Time Objective): How long can you afford to be down? If your service is down for 1 hour, how much damage happens? For Netflix, maybe millions in lost viewing. For internal tools, probably zero.

RPO (Recovery Point Objective): How much data can you afford to lose? If your database is corrupted and you restore from a backup from 24 hours ago, you lose 24 hours of data. Is that acceptable?

Different applications have different requirements:

  • Critical systems (payment processing): RTO 1 hour, RPO 0 (no data loss)
  • Important systems (user accounts): RTO 4 hours, RPO 1 hour
  • Internal tools: RTO 24 hours, RPO 24 hours

RTO and RPO guide your disaster recovery investment. Higher RTO/RPO = cheaper recovery. Lower RTO/RPO = more expensive. Budget for both the technology and the operational overhead of meeting your targets.
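One concrete consequence of an RPO target: it caps how far apart your backups can be. A minimal sketch (the tier numbers are the ones from the list above):

```python
from datetime import timedelta

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Periodic backups can lose up to one full interval of data,
    so the interval must not exceed the RPO. An RPO of zero cannot
    be met by backups alone -- it needs synchronous replication."""
    if rpo == timedelta(0):
        return False
    return backup_interval <= rpo

daily = timedelta(hours=24)
hourly = timedelta(hours=1)

assert meets_rpo(daily, timedelta(hours=24))     # internal tools: daily is fine
assert not meets_rpo(daily, timedelta(hours=1))  # user accounts: need hourly
assert meets_rpo(hourly, timedelta(hours=1))
assert not meets_rpo(hourly, timedelta(0))       # payments: backups alone won't do
```

Note the zero-RPO case: no backup schedule, however aggressive, gets you there. That tier forces you into synchronous replication.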

Common Disaster Scenarios

Server failure: One server crashes. Could be hardware failure, overheating, network card dies. With replicas, this is transparent. Without replicas, you're down until the server is fixed or replaced.

Data corruption: A bug causes bad data to be written to the database. You don't notice for hours. Now your backups contain bad data too. Recovery requires restoring from older backups and replaying good transactions.

Ransomware: An attacker encrypts your data. They demand money. Recovery: restore from backups (assuming backups are encrypted separately and the attacker didn't access them).

Accidental deletion: Developer runs `DELETE FROM users WHERE id > 1000` by mistake in production. Thousands of users deleted. Recovery: restore from backup, hope you have recent backups.

Regional outage: AWS region goes down (this has happened). All infrastructure in that region is gone. Recovery: failover to another region (assuming you have multi-region setup).

Data breach and extortion: An attacker accesses your systems, exfiltrates data, and demands payment not to publish it. Recovery: investigate, remediate, restore if needed.

Backup Strategies

Frequency: How often do you back up? Hourly? Daily? Weekly? More frequent = lower RPO but more cost and complexity.

Retention: How long do you keep backups? 30 days? 90 days? 1 year? Compliance requirements often dictate retention.
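Retention policies are usually tiered rather than a flat cutoff. Here's a sketch of one; the specific numbers (30/90/365 days, Sunday weeklies, first-of-month monthlies) are illustrative assumptions, not a standard:

```python
from datetime import date

def keep_backup(backup_date: date, today: date) -> bool:
    """Tiered retention: every backup for 30 days, Sunday backups
    for 90 days, first-of-month backups for a year. Everything
    else gets pruned to control storage cost."""
    age = (today - backup_date).days
    if age <= 30:
        return True
    if age <= 90 and backup_date.isoweekday() == 7:  # Sunday weeklies
        return True
    if age <= 365 and backup_date.day == 1:          # monthlies
        return True
    return False

today = date(2026, 3, 1)
assert keep_backup(date(2026, 2, 20), today)      # 9 days old: kept
assert not keep_backup(date(2026, 1, 15), today)  # 45 days, a Thursday: pruned
assert keep_backup(date(2026, 1, 4), today)       # 56 days, a Sunday: kept
```

If compliance dictates a longer horizon, add a yearly tier; the structure stays the same.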

Location: Store backups offsite. If your data center burns down, you don't want backups there too. Most teams back up to a different region in the same cloud provider.

Encryption: Backups contain sensitive data. Encrypt them. If an attacker gains access to backups, they can read everything.

Automated testing: Many organizations never test restores. They assume backups work. Then they finally test one, or need one, and discover the backups are corrupted or incomplete. Test restores regularly: at least monthly, perform a full restore and verify data integrity.

Warning
Test your backups regularly. A backup that you've never restored from should be assumed broken. Include restore testing in your operational routine.
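A restore test can itself be automated. Here's a hedged, self-contained sketch using SQLite as a stand-in for a real database: back up, restore into a fresh file, and verify the data round-trips intact.

```python
import gzip
import os
import sqlite3
import tempfile

def backup_and_verify() -> bool:
    """Take a backup, restore it, and check the data survived."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "prod.db")
        con = sqlite3.connect(src)
        con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        con.executemany("INSERT INTO users VALUES (?, ?)",
                        [(1, "ada"), (2, "grace")])
        con.commit()
        con.close()  # flush everything to disk before backing up

        # "Backup": compress the database file (in production this
        # would also be encrypted and shipped to another region).
        backup = os.path.join(d, "prod.db.gz")
        with open(src, "rb") as f, gzip.open(backup, "wb") as g:
            g.write(f.read())

        # "Restore": decompress into a new file and query it.
        restored = os.path.join(d, "restored.db")
        with gzip.open(backup, "rb") as g, open(restored, "wb") as f:
            f.write(g.read())
        rcon = sqlite3.connect(restored)
        rows = rcon.execute(
            "SELECT id, name FROM users ORDER BY id").fetchall()
        rcon.close()
        return rows == [(1, "ada"), (2, "grace")]

assert backup_and_verify()
```

The point is the shape, not the tooling: a real version would swap in pg_dump or your managed provider's snapshot API, but the restore-then-query step is the part most teams skip.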

Multi-Region Deployments

For critical applications, run in multiple regions. If one region goes down, users are served from another region.

Challenges:

  • Database replication: Data in region 1 must be replicated to region 2. Synchronous replication is slow (every write waits for replication). Asynchronous replication is fast but risks losing data if region 1 fails.
  • Complexity: Multi-region deployments double complexity. More things to manage, more failure modes, harder to debug.
  • Cost: You're essentially running two data centers. Cost doubles.
  • DNS failover: When region 1 fails, users must be routed to region 2. DNS failover takes seconds to minutes. During that time, users can't access the service.
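The asynchronous-replication tradeoff above can be made concrete: if the primary region fails while the replica lags behind, every write acknowledged inside the lag window is lost. A small sketch:

```python
def writes_lost(write_times: list[float], failure_time: float,
                replication_lag: float) -> list[float]:
    """Writes acknowledged after (failure_time - replication_lag)
    never reached the other region. Synchronous replication shrinks
    the lag to zero, at the price of slower writes."""
    cutoff = failure_time - replication_lag
    return [t for t in write_times if cutoff < t <= failure_time]

writes = [10.0, 11.5, 12.2, 12.8, 13.0]

# Region fails at t=13.0 with 1 second of replication lag:
assert writes_lost(writes, failure_time=13.0,
                   replication_lag=1.0) == [12.2, 12.8, 13.0]
```

This is exactly the RPO conversation again: asynchronous replication buys you write latency in exchange for a non-zero data-loss window.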

Multi-region deployment is high-end infrastructure for critical applications. Most startups don't need it.

Failover Strategies

Manual failover: When a failure is detected, a human decides what to do. Call the team, diagnose the problem, perform recovery. Takes hours. Only acceptable for non-critical systems.

Automated failover: System detects failure (database is unreachable for 30 seconds) and automatically switches to backup (read replica becomes primary). Happens in seconds. Acceptable for critical systems but risky—you might failover due to false alarms.
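The usual guard against false alarms is a debounce: only promote the replica after several consecutive failed health checks. A minimal sketch (the endpoint names are illustrative; three checks at a 10-second interval matches the 30-second window above):

```python
class FailoverMonitor:
    """Promote the replica only after `threshold` consecutive
    failed health checks, so one dropped check doesn't trigger
    an unnecessary failover."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.primary = "db-primary"

    def record_health_check(self, healthy: bool) -> str:
        if healthy:
            self.failures = 0           # any success resets the count
        else:
            self.failures += 1
            if (self.failures >= self.threshold
                    and self.primary == "db-primary"):
                self.primary = "db-replica"  # promote read replica
        return self.primary

m = FailoverMonitor(threshold=3)
assert m.record_health_check(False) == "db-primary"  # 1 failure: hold
assert m.record_health_check(True) == "db-primary"   # recovered: reset
assert m.record_health_check(False) == "db-primary"
assert m.record_health_check(False) == "db-primary"
assert m.record_health_check(False) == "db-replica"  # 3 in a row: failover
```

Tuning the threshold is the whole game: too low and you fail over on noise, too high and you eat into your RTO.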

Blue/Green deployments: For deployments, maintain two identical environments (blue and green). Run on blue, deploy to green, test green, switch traffic to green. If green has problems, instantly switch back to blue. Zero downtime deployments.
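The blue/green mechanics reduce to a pointer swap. A toy sketch (environment names and the deploy step are illustrative):

```python
class Router:
    """Two identical environments; traffic points at one.
    Cutover and rollback are both just a swap."""

    def __init__(self):
        self.live, self.idle = "blue", "green"

    def deploy(self, version: str) -> str:
        # Deploy to the idle environment; live traffic is untouched.
        return f"deployed {version} to {self.idle}"

    def switch(self) -> str:
        self.live, self.idle = self.idle, self.live  # atomic cutover
        return self.live

    def rollback(self) -> str:
        return self.switch()  # rolling back is just switching again

r = Router()
r.deploy("v2")                  # v2 goes to green while blue serves traffic
assert r.switch() == "green"    # traffic now hits green
assert r.rollback() == "blue"   # problem found: instant revert to blue
```

The reason rollback is "instant" is that the old environment is still running and warm; nothing needs to be rebuilt.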

Runbooks

When disaster strikes, you don't want to figure things out. You want to follow a checklist. Write runbooks for common failure scenarios:

# Database Failure Runbook

1. Verify database is down (ping database endpoint, check monitoring)
2. Check if read replica is up (if yes, promote read replica to primary)
3. Update DNS to point to read replica (or new primary)
4. Monitor for errors (logs, error tracking)
5. Once stable, investigate root cause
6. If data loss, restore from backup (follow backup recovery runbook)

Runbooks should be detailed, tested, and easy to follow. They should be accessible (not on someone's laptop) and updated as infrastructure changes.

Chaos Engineering

Chaos engineering is deliberately breaking things in production to test resilience. You simulate a server failure and see if the system handles it gracefully.

Tools like Gremlin, Chaos Monkey (Netflix's original chaos tool), and Litmus help inject failures: they kill servers, drop network packets, and delay requests to see if the system survives.

This is advanced and only used by high-reliability organizations. Most teams don't do chaos engineering, which means they're often surprised by failures that chaos engineering would have caught.
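The core idea doesn't require a dedicated tool. Here's a toy chaos experiment (all names invented for illustration): randomly fail a dependency and check that the caller degrades gracefully instead of crashing.

```python
import random

def flaky(call, failure_rate: float, rng: random.Random):
    """Wrap a call so it randomly raises, simulating a failing
    dependency -- a tiny, in-process Chaos Monkey."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return call(*args, **kwargs)
    return wrapped

def get_recommendations(fetch) -> list[str]:
    try:
        return fetch()
    except ConnectionError:
        return ["fallback: most popular items"]  # graceful degradation

rng = random.Random(42)  # seeded so the experiment is repeatable
chaotic_fetch = flaky(lambda: ["personalized items"], 0.5, rng)
results = [get_recommendations(chaotic_fetch) for _ in range(100)]

assert len(results) == 100
assert all(results)  # never crashes, always returns something
```

Real chaos tools do this at the infrastructure level (killing instances, partitioning networks), but the hypothesis being tested is the same: failure of a dependency should not mean failure of the service.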

Business Continuity Planning

Disaster recovery is about infrastructure. Business continuity is about the business. If your service is down for 24 hours, how do you handle it?

BCP includes:

  • Communication plan: how to notify affected users, partners, stakeholders
  • Customer support plan: how to handle support requests during outage
  • Financial plan: potential loss from downtime, insurance
  • Recovery plan: steps to restore service

Large organizations have detailed BCPs. Startups often don't. At minimum, have a communication plan (email template, who to contact).

How Managed Services Help

Managed services (AWS RDS, Google Cloud SQL) handle backups, replication, and failover for you, so you don't have to build it yourself. They offer uptime SLAs of multiple nines (99.9%, 99.99%, and so on).

This is usually worth the premium: you reduce operational burden and get higher reliability.

Cost of Downtime

How much does downtime cost your business? For SaaS applications:

Hourly revenue = $100,000
Cost of 1 hour downtime = $100,000
Cost of 24 hour downtime = $2,400,000
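The arithmetic above as a one-line helper, useful for comparing scenarios against a DR budget:

```python
def downtime_cost(hourly_revenue: float, hours_down: float) -> float:
    """Direct revenue lost to an outage (churn and reputation
    damage come on top and are harder to quantify)."""
    return hourly_revenue * hours_down

assert downtime_cost(100_000, 1) == 100_000
assert downtime_cost(100_000, 24) == 2_400_000
```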

Spending money on disaster recovery infrastructure makes sense when it costs less than the downtime it prevents.

Another cost: customer churn. Users lose trust after outages. Some cancel. Reputational damage. These are hard to quantify but significant.

Disaster Recovery for Startups

Most startups are resource-constrained. Here's a reasonable disaster recovery plan:

  • Use managed database service (auto-backups, replication)
  • Automated daily backups to separate region
  • Test restore monthly (half-hour operation)
  • Write runbook for common failures
  • Monitor database for replication lag, disk space
  • Have on-call rotation for critical infrastructure

This covers most scenarios and costs maybe $500/month. Good tradeoff for early-stage companies.

The Reality

Disaster recovery planning feels boring and depressing. You're preparing for failure. But it's essential. Data loss incidents are devastating. Downtime costs money and damages reputation.

Start with basics: automated backups, regular restore testing, monitoring. As your service becomes critical, invest in more sophisticated recovery (read replicas, multi-region, automated failover).

And please test your backups. Too many companies discover backup problems during actual disasters.