Infrastructure
Monitoring & Alerting
You can't manage what you don't measure. Monitoring tells you what's happening in production. Alerting tells you when something is wrong. Together, they're the difference between sleeping through the night and panic at 3am.
What to Monitor
Uptime/Availability: Is your application running and responding to requests? If it's down for 10 minutes and nobody tells you, that's worse than the downtime itself. You need to know.
Response Time (Latency): How long does it take to respond to a request? If your API takes 100ms usually but takes 5 seconds, something is wrong.
Error Rate: What percentage of requests fail? If 1% of requests fail, that's bad. 0.1% might be acceptable.
Resource Usage: CPU, memory, disk. If your server is sitting at 95% CPU, it's one traffic spike away from falling over. Watch resource usage before it becomes a problem.
Queue Depth: How many jobs are waiting to be processed? If the queue is growing, your system can't keep up.
Business Metrics: Signups, revenue, active users. Technical metrics matter, but business metrics matter more.
The Four Golden Signals
Google defines four golden signals for monitoring:
Latency: How long it takes to serve a request. Track both successful responses and error responses separately (errors are usually fast).
Traffic: How many requests you're handling. Requests per second is standard.
Errors: What fraction of requests are failing. Both explicit errors (HTTP 500) and implicit errors (request too slow).
Saturation: How full your system is. CPU, memory, disk, database connections. When saturation hits 100%, your system stops working.
Monitor these four and you understand your system's health.
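To make the signals concrete, here's a minimal sketch that computes three of the four (latency, traffic, errors) from a window of request logs; saturation would come from host-level metrics instead. The `Request` record and its fields are illustrative, not a real library's API:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def golden_signals(requests: list[Request], window_seconds: float) -> dict:
    """Summarize latency, traffic, and error rate for one time window."""
    ok_latencies = [r.latency_ms for r in requests if r.status < 500]
    errors = sum(1 for r in requests if r.status >= 500)
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99
    cuts = quantiles(ok_latencies, n=100)
    return {
        "latency_p50_ms": cuts[49],
        "latency_p99_ms": cuts[98],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }

# 97 fast requests, 2 slow ones, 1 server error over a 10-second window
reqs = [Request(20, 200)] * 97 + [Request(900, 200)] * 2 + [Request(15, 500)]
signals = golden_signals(reqs, window_seconds=10)
```

Note that error latencies are excluded from the percentile calculation, per the advice above: errors are usually fast and would drag the latency numbers down misleadingly.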
Monitoring Tools
Datadog: Premium monitoring platform. Comprehensive dashboards, integrations with everything, APM (application performance monitoring). Expensive but powerful; common at large companies.
New Relic: Similar to Datadog. Slightly different interface, similar capabilities. Also expensive.
Grafana + Prometheus: Open-source stack. Prometheus scrapes metrics from applications. Grafana visualizes them. Free but requires self-hosting and operational knowledge.
AWS CloudWatch: Built into AWS. If you're on AWS, CloudWatch is the natural choice.
Better Uptime: Focused on uptime monitoring. Checks if your site is up, alerts you if it's down. Cheap and simple.
Choose based on budget and existing ecosystem. For startups, a simple uptime monitor + CloudWatch is often sufficient.
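To make the Prometheus half of that stack concrete: applications expose current metric values as plain text at an HTTP endpoint (conventionally `/metrics`), and Prometheus scrapes it on a schedule. A rough sketch of the exposition format follows; in practice you'd use the official client libraries rather than hand-formatting, and the metric and label names here are illustrative:

```python
def prometheus_exposition(metrics: dict[str, float], labels: dict[str, str]) -> str:
    """Render metrics in Prometheus's plain-text exposition format:
    metric_name{label="value",...} value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = [f"{name}{{{label_str}}} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"

# What a scrape of a hypothetical API service might return
body = prometheus_exposition(
    {"http_requests_total": 1024, "http_request_duration_seconds_sum": 12.5},
    {"service": "api", "env": "prod"},
)
```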
Setting Up Alerts
An alert should fire only when something is wrong and someone needs to act. Too many alerts cause alert fatigue: people stop paying attention.
Alert on important problems: If your API response time doubles and requests fail, alert. If your homepage loads 10ms slower than usual, don't alert.
Avoid alert fatigue: Each alert should be actionable. "CPU is at 75%" is not actionable (75% might be normal). "CPU is at 95%" is actionable—scale up or find the problem.
Give context in alerts: Instead of "Error rate is high," say "Error rate is 5% (threshold: 1%), 150 requests failing per minute. Check logs for pattern."
Common alerts:
- Application is down (no requests for 2 minutes)
- Error rate exceeds threshold (>1%)
- Response time exceeds threshold (p99 > 1 second)
- CPU usage exceeds threshold (>85%)
- Memory usage exceeds threshold (>80%)
- Disk is almost full (>90%)
- Database connection pool is exhausted
On-Call Rotations
Production problems happen at 3am on Saturday. Someone needs to be on-call to handle them. Many teams have rotating on-call schedules.
The on-call person gets alerted when something breaks; they diagnose and fix it, or escalate. Tools like PagerDuty or OpsGenie manage on-call rotations, escalations, and incident tracking.
On-call is stressful but valuable. It forces you to design systems that are operationally simple. If you hate being on-call because your system is fragile, fix the system.
SLIs, SLOs, and SLAs
SLI (Service Level Indicator): Measurable aspect of service performance. "Requests with latency < 100ms" or "Successful requests / total requests."
SLO (Service Level Objective): Target for the SLI. "99% of requests should have latency < 100ms" or "99.9% of requests should succeed."
SLA (Service Level Agreement): Contract with users. If you miss the SLO, what's the consequence? Refund? Credit? Legal damages?
Most tech companies use these terms:
- "We commit to 99.9% uptime" = SLA
- "We measure uptime using... " = SLI
- "Our target is 99.9%" = SLO
SLOs help you prioritize. If your SLO is 99% availability and you're at 99.5%, you have slack to take risks (deploy fast, try new infrastructure). If you're at 99.05%, you're close to breaching and should be conservative.
Error budgets follow from SLOs. If your SLO is 99% uptime, you have a 1% "error budget": about 7.2 hours of allowed downtime per month (assuming a 720-hour month), or roughly 87.6 hours per year. Spend it wisely.
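The arithmetic is worth automating so nobody does it wrong at 3am. A tiny helper (hypothetical name), assuming a 720-hour (30-day) month:

```python
def error_budget(slo: float, period_hours: float = 720.0) -> float:
    """Allowed downtime in hours per period for a given availability SLO.
    Default period is a 30-day month (720 hours)."""
    return (1.0 - slo) * period_hours

monthly_99  = error_budget(0.99)                       # ~7.2 hours/month
monthly_999 = error_budget(0.999)                      # ~43 minutes/month
yearly_99   = error_budget(0.99, period_hours=8760.0)  # ~87.6 hours/year
```

Tightening the SLO by one nine cuts the budget by a factor of ten, which is why each extra nine costs so much more to deliver.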
Synthetic Monitoring
Synthetic monitoring simulates user actions to test your system. A bot logs in, checks the dashboard, logs out. If any step fails, alert.
This catches problems that don't show up in traffic metrics. Your API might respond but return wrong data. Synthetic tests would catch this.
Run synthetic tests from multiple geographic locations. If a user in Sydney can't access your site but a user in New York can, you've got a regional problem.
Real User Monitoring (RUM)
Synthetic monitoring tests from controlled environments. Real user monitoring collects data from actual users in the wild. Your application sends metrics to your monitoring system: page load time, error messages, button clicks.
RUM shows what real users experience. Synthetic monitoring might show everything working; RUM might show that pages take 5 seconds to load for 10% of users on slow networks.
Use both: synthetic for always-on testing, RUM for understanding real user experience.
Dashboards
Create dashboards that show the health of your system at a glance. What should be on the dashboard?
- System status (green/yellow/red)
- Current error rate with trend
- Current latency (p50, p99)
- Current traffic (requests per second)
- Resource usage (CPU, memory, disk)
- Business metrics (signups, revenue)
Keep dashboards simple. If it takes 5 minutes to understand your system health, the dashboard is too complex. Aim for "glance for 10 seconds and know if everything is okay."
Tracing and Distributed Tracing
When a request takes 5 seconds, where did the time go? Database? Slow API call? Network? Tracing shows you.
Each service records when a request enters and exits it. Tools like Jaeger or Datadog visualize the request across all services. You see: request → API (50ms) → Database (4000ms) → Cache (20ms) → Response. The database is the bottleneck.
Tracing is more complex than basic monitoring but invaluable for debugging performance issues.
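Production systems use tools like OpenTelemetry to propagate trace IDs across service boundaries, but the core idea (timing named spans and spotting the widest one) fits in a few lines. A single-process sketch, with simulated work standing in for real calls:

```python
import time
from contextlib import contextmanager

spans: list[tuple[str, float]] = []  # (span name, duration in ms)

@contextmanager
def span(name: str):
    """Record how long a named step of the request took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

# Simulated request: the slow step stands out in the recorded spans.
with span("api_call"):
    time.sleep(0.005)   # pretend an upstream API took ~5ms
with span("database"):
    time.sleep(0.05)    # pretend the query took ~50ms
with span("cache"):
    time.sleep(0.001)

slowest = max(spans, key=lambda s: s[1])[0]
```

The distributed version adds a trace ID that travels with the request (usually in an HTTP header) so spans from different services can be stitched back into one timeline.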
Common Monitoring Mistakes
Monitoring everything. You end up with 1000 metrics and no idea which matter. Focus on the golden signals and business metrics.
No alerting. Metrics sit in dashboards. Nobody looks at them. Alerts force visibility.
Alerting on things you can't fix. "CPU is at 75%" is not actionable. Either you can fix it (scale, optimize) or you can't (host limits). Don't alert on things you can't fix.
No runbooks. An alert fires at 3am. What do you do? Without a runbook, you panic. Document the steps to diagnose and fix common issues.
Monitoring by Team Size
| Stage | Monitoring | Tools | Effort |
|---|---|---|---|
| Startup (1-10 people) | Simple uptime + error alerts | Better Uptime + CloudWatch | Minimal, maybe 1 hour/month |
| Growth (10-100 people) | Uptime, latency, errors, resource usage | Grafana + Prometheus, or mid-tier Datadog | 1-2 hours/week, part of one person's time |
| Scale (100+ people) | Comprehensive: traces, RUM, business metrics | Premium platform (Datadog, New Relic) with custom integrations | Dedicated full-time engineer(s) |
The Reality
Monitoring is insurance. You don't want to use it, but when you need it, you're glad it's there. The cost of being blind to production issues is much higher than the cost of monitoring.
Start simple. As your system grows, monitoring becomes more important. Invest appropriately as you scale.
The best monitoring system is one you actually use. If alerts are noisy or dashboards are confusing, you won't use them. Keep it simple and actionable.