Infrastructure Cost Management
Cloud costs spiral silently. You get the first bill and it's $5,000. The second month, $12,000. The third month, $25,000. What happened? Unknown instances, unused resources, data transfer charges. Managing cloud costs is hard and critical.
Why Cloud Costs Spiral
Traditional infrastructure forces you to think about cost. You provision a server, pay $50/month, and see the cost every month. Cloud makes it easy to spin up resources. You need a database for a project, so you create an RDS instance. Then you forget about it. It runs for 6 months. Cost: $500, hidden in a bill with dozens of other line items.
Common cost traps:
- Unused resources: RDS instances, S3 buckets, Elastic IPs sitting idle. Nobody remembers they're running.
- Over-provisioned instances: Picking instance sizes conservatively. Running db.r5.2xlarge when db.t3.small would work.
- Data egress fees: AWS charges for data transferred out of their network. At roughly $0.09/GB, sending 10TB out each month costs about $900, and the charge grows linearly with traffic.
- Test/dev environments: Running production-sized infrastructure for testing. Same resources as production, far more than testing needs.
- Inefficient services: Using expensive managed services when cheaper alternatives exist.
- Abandoned projects: Project ends, infrastructure keeps running. Nobody tears it down.
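Several of these traps come down to resources nobody has touched in a while. A minimal sketch of how an inventory could be scanned for them, assuming you already have a list of resources with a monthly cost and a last-activity date (the `Resource` type, the resource names, and the 30-day cutoff are all hypothetical, not part of any cloud API):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Resource:
    name: str            # hypothetical identifier, e.g. an RDS instance name
    monthly_cost: float  # USD per month
    last_used: date      # last day any activity was observed


def find_idle(resources, today, idle_days=30):
    """Return resources with no activity for at least idle_days, most expensive first."""
    cutoff = today - timedelta(days=idle_days)
    idle = [r for r in resources if r.last_used <= cutoff]
    return sorted(idle, key=lambda r: r.monthly_cost, reverse=True)


# Made-up inventory for illustration.
inventory = [
    Resource("reports-db", 85.0, date(2024, 1, 10)),
    Resource("api-server", 120.0, date(2024, 6, 30)),
    Resource("old-elastic-ip", 3.6, date(2023, 11, 2)),
]

idle = find_idle(inventory, today=date(2024, 7, 1))
for r in idle:
    print(f"{r.name}: ${r.monthly_cost:.2f}/month, idle since {r.last_used}")
print(f"Potential savings: ${sum(r.monthly_cost for r in idle):.2f}/month")
```

In practice the last-activity date would have to come from metrics or access logs, and a flagged resource is a candidate for review, not automatic deletion.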
Right-Sizing Instances
The most common cost problem: running instances that are too large. Most developers pick conservatively. "It might need 8GB RAM, let me get 16GB." This doubles the cost.
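Monitor actual usage. CloudWatch reports CPU, disk, and network out of the box; memory is reported for RDS, but EC2 instances need the CloudWatch agent for it. If you're running a db.r5.2xlarge (64 GiB RAM) and using 4GB, downsize to something like a db.t3.large (8 GiB), which still leaves headroom. Cost drops from roughly $1,500/month to roughly $100/month; exact figures depend on engine, region, and Multi-AZ settings.
Most infrastructure can be right-sized to 30-50% of initial provisioning. Revisit sizing at least once a year.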
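One way to turn usage monitoring into a sizing decision is to pick the smallest size that covers peak-ish usage plus a safety margin. The sketch below assumes a hypothetical size catalog and a 1.5x headroom factor; neither comes from any provider, and real decisions should also look at CPU, IOPS, and network.

```python
import math

# Hypothetical size catalog: (name, memory in GiB), smallest first.
SIZES = [
    ("small", 2), ("medium", 4), ("large", 8),
    ("xlarge", 16), ("2xlarge", 32), ("4xlarge", 64),
]


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_size(memory_samples_gib, headroom=1.5):
    """Smallest size whose memory covers p95 usage times the headroom factor."""
    needed = percentile(memory_samples_gib, 95) * headroom
    for name, memory_gib in SIZES:
        if memory_gib >= needed:
            return name
    return SIZES[-1][0]  # nothing is big enough: fall back to the largest size


# Made-up memory usage samples (GiB) from a monitoring export.
samples = [3.1, 3.4, 3.8, 4.0, 3.6, 3.9, 3.2, 3.7]
print(recommend_size(samples))
```

Using a high percentile instead of the average keeps short bursts from being sized out, while ignoring a single outlier sample in larger datasets.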
Reserved Instances vs On-Demand vs Spot
On-demand: Pay per hour. Expensive but flexible. Need a server for 3 hours? Pay for 3 hours.
Reserved instances: Commit to 1 or 3 years, with payment all upfront, partially upfront, or monthly. Discounts: 1-year reserved is ~30% cheaper than on-demand, 3-year is ~60% cheaper. Worth it for baseline infrastructure that always runs.
Spot instances: Use AWS's spare capacity at the current spot price (the old bidding model is gone; you can optionally set a maximum price). AWS wants to fill unused servers, so it offers discounts of up to about 90% off on-demand. Catch: instances can be reclaimed with 2 minutes' notice if AWS needs the capacity.
Usage pattern:
- Baseline load (always needed): Use reserved instances. Get the discount, know the cost.
- Spiky load (peak traffic): Use spot instances for auto-scaling. Save money, tolerate occasional termination.
- Experimental/temporary: Use on-demand. Flexibility over cost.
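The choice between pricing models is mostly arithmetic. Here is a rough sketch using the approximate discounts quoted above and a hypothetical $0.10/hour on-demand rate; real prices vary by instance type, region, and payment option.

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Discounts relative to on-demand, using the approximate figures from the text.
DISCOUNTS = {
    "on_demand": 0.0,
    "reserved_1yr": 0.30,
    "reserved_3yr": 0.60,
    "spot": 0.70,  # low end of the typical spot discount
}


def monthly_cost(on_demand_hourly, model, hours=HOURS_PER_MONTH):
    """Monthly cost of one instance under the given pricing model."""
    return on_demand_hourly * (1 - DISCOUNTS[model]) * hours


def reserved_break_even(model):
    """Fraction of the month an instance must run for a reservation to pay off.

    A reservation is billed for every hour whether used or not, so it beats
    on-demand only when utilization exceeds (1 - discount).
    """
    return 1 - DISCOUNTS[model]


rate = 0.10  # hypothetical on-demand price in USD/hour
for model in DISCOUNTS:
    print(f"{model:>12}: ${monthly_cost(rate, model):.2f}/month")
print(f"1-year reservation pays off above {reserved_break_even('reserved_1yr'):.0%} utilization")
```

The break-even figure is why reservations suit baseline load: an instance that runs only a few hours a day stays cheaper on-demand.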
Cost Alerting and Budgets
Set up cloud cost alerts with AWS Budgets, GCP Budget Alerts, or Azure Budgets. Define a monthly budget (say, $5,000). Get alerted when spend reaches 50% of the budget, and again at 100%.
This catches cost overruns before they become disasters. A runaway data transfer process could cost $10,000 unnoticed. Cost alerts catch it within days.
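The alert logic itself is simple. The following is an illustrative sketch of threshold checks plus a naive linear month-end projection; it is not how any provider's budgeting service is implemented, and the function names are made up for this example.

```python
def crossed_thresholds(spend_to_date, budget, thresholds=(0.5, 1.0)):
    """Return the budget fractions that current spend has reached or passed."""
    return [t for t in thresholds if spend_to_date >= t * budget]


def projected_month_end(spend_to_date, day_of_month, days_in_month):
    """Naive projection: assume the average daily spend so far continues."""
    return spend_to_date / day_of_month * days_in_month


budget = 5000.0
spend = 2600.0  # hypothetical spend after 10 days of a 30-day month

print(crossed_thresholds(spend, budget))
print(projected_month_end(spend, day_of_month=10, days_in_month=30))
```

In this made-up case only the 50% threshold has fired, yet the projection already exceeds the budget. That is the argument for alerting on forecast spend as well as actual spend.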
FinOps Practices
FinOps is Financial Operations for infrastructure. Organizations give engineering teams responsibility for the costs they generate instead of hiding those costs in a central IT budget.
FinOps practices:
- Show-back: Show each team their infrastructure costs. "Your service costs $50,000/month." Incentivizes optimization.
- Chargeback: Bill teams for their infrastructure. If your service's data transfer is high, your team pays.
- Cost tagging: Tag all resources with project, team, environment. Enables cost allocation.
- Regular reviews: Monthly cost review meetings. Team members discuss cost, celebrate optimizations, plan reductions.
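Cost tagging is what makes show-back possible. As a sketch, assume billing line items have been exported as dictionaries with a `cost` and a `tags` mapping (a hypothetical shape, not a specific provider's export format). A show-back report is then a group-by on the tag:

```python
from collections import defaultdict


def showback(line_items, tag_key="team"):
    """Sum cost per tag value, largest first. Untagged items get their own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up line items for illustration.
line_items = [
    {"resource": "orders-db", "cost": 1200.0, "tags": {"team": "payments", "environment": "prod"}},
    {"resource": "search-cluster", "cost": 450.0, "tags": {"team": "search", "environment": "prod"}},
    {"resource": "orders-db-staging", "cost": 300.0, "tags": {"team": "payments", "environment": "staging"}},
    {"resource": "mystery-bucket", "cost": 75.0, "tags": {}},
]

for team, cost in showback(line_items).items():
    print(f"{team}: ${cost:,.2f}")
```

The `untagged` bucket is worth watching on its own: spend that nobody owns is spend that nobody optimizes.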
Cost per Customer
The most important metric: cost per customer. If you have 10,000 customers and $5,000/month infrastructure cost, that's $0.50/customer/month.
As you scale, cost per customer should decrease. You add customers with minimal infrastructure increase. If cost per customer increases as you grow, something is wrong.
Track this metric. Use it to guide optimization. If cost per customer is $5 and you charge $10/month, your margin is 50%. If cost rises to $8, margin drops to 20%, and the pressure to optimize becomes obvious.
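The arithmetic behind these figures, as a small sketch (the function names are illustrative, and the margin here counts infrastructure cost only):

```python
def cost_per_customer(monthly_infra_cost, customers):
    """Infrastructure cost attributed to each customer per month."""
    return monthly_infra_cost / customers


def gross_margin(price, cost):
    """Share of the price left over after infrastructure cost."""
    return (price - cost) / price


print(cost_per_customer(5000, 10000))  # $5,000 spread over 10,000 customers
print(f"{gross_margin(10, 5):.0%}")    # $5 cost on a $10 price
print(f"{gross_margin(10, 8):.0%}")    # $8 cost on a $10 price
```

Plotting `cost_per_customer` month over month is usually more informative than the raw bill, because it separates growth-driven spend from waste.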
S3 Storage Tiers and Lifecycle Policies
S3 storage tiers:
- Standard: Expensive, immediately available. $0.023/GB/month.
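- Intelligent-Tiering: Automatically moves objects between access tiers based on access patterns, for a small per-object monitoring fee. Useful when access patterns are unpredictable.
- Standard-IA: Infrequent access. Cheaper ($0.0125/GB/month) but charges per retrieval.
- Glacier: Archive storage. Very cheap (~$0.004/GB/month), but retrieval takes minutes to hours depending on the retrieval option.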
Lifecycle policies automatically move data between tiers. Upload to Standard, move to Standard-IA after 30 days, move to Glacier after 90 days. Automatic cost optimization.
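The 30-day and 90-day policy described above can be written as a lifecycle configuration. The dictionary below follows the shape accepted by boto3's `put_bucket_lifecycle_configuration` (as its `LifecycleConfiguration` argument); the rule ID is a made-up name, and the cost helper is a rough estimate that uses the per-GB prices listed above and ignores retrieval, request, and minimum-duration charges.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-then-archive",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},         # empty prefix: applies to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Approximate storage prices in USD per GB-month, from the tier list above.
PRICE_PER_GB = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER": 0.004}


def storage_class_for_age(age_days, config=lifecycle_configuration):
    """Storage class an object of the given age ends up in under the first rule."""
    current = "STANDARD"
    transitions = sorted(config["Rules"][0]["Transitions"], key=lambda t: t["Days"])
    for transition in transitions:
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


def monthly_storage_cost(gb_by_age):
    """gb_by_age maps object age in days to the GB stored at that age."""
    return sum(gb * PRICE_PER_GB[storage_class_for_age(age)] for age, gb in gb_by_age.items())


data = {10: 100, 45: 300, 200: 1000}  # made-up distribution: age in days -> GB
print(json.dumps(lifecycle_configuration, indent=2))
print(f"With lifecycle: ${monthly_storage_cost(data):.2f}/month")
print(f"All in Standard: ${sum(data.values()) * PRICE_PER_GB['STANDARD']:.2f}/month")
```

The savings depend heavily on how much data is old. A bucket of mostly fresh, frequently read objects gains little, and retrieval fees can erase the benefit if archived data is read often.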
Database Cost Optimization
Databases are often the largest cost. Optimization strategies:
- Connection pooling: Reuse database connections. Reduces overhead.
- Query optimization: Add indexes, eliminate slow queries. Faster queries use fewer resources.
- Right-size the instance: Most databases are over-provisioned. Start small, scale up only if needed.
- Read replicas: Scale reads without scaling writes. Cheaper than bigger instances.
- Managed services: RDS costs more per unit than self-hosted but saves operations costs.
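To make the connection pooling item concrete, here is a minimal sketch of a fixed-size pool. It uses the standard library's `sqlite3` purely so the example runs anywhere; the `ConnectionPool` class is illustrative, and a production service would normally use the pooling built into its database driver, ORM, or a proxy.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and handed out repeatedly."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue(maxsize=size)
        self.created = 0
        for _ in range(size):
            # check_same_thread=False lets a pooled connection move between threads.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))
            self.created += 1

    @contextmanager
    def connection(self, timeout=5.0):
        """Borrow a connection; block up to `timeout` seconds if none is idle."""
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


pool = ConnectionPool(size=2)
for _ in range(10):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()

print(f"10 queries served by {pool.created} connections")
```

The cost angle is that each open connection consumes memory on the database server, so capping and reusing connections can let a smaller instance serve the same traffic.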
CDN and Data Transfer Costs
CDN costs scale with traffic. At low traffic, CDN cost is negligible. At high traffic, it can become a significant line item.
Data transfer OUT of AWS/GCP/Azure is expensive (about $0.09/GB from AWS to the internet). Data transfer between regions also costs money. Data transfer within a region is cheap, and often free within a single availability zone.
Optimize: use regional services, cache aggressively, use CDN to minimize origin traffic.
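The effect of caching on origin traffic is easy to estimate. This sketch uses the $0.09/GB figure quoted above and a hypothetical cache hit ratio; it models only the bytes the origin has to serve and deliberately leaves out the CDN's own delivery fees, so it is not a full savings calculation.

```python
EGRESS_RATE_PER_GB = 0.09  # approximate AWS-to-internet rate from the text, USD


def origin_egress_gb(total_gb, cache_hit_ratio):
    """GB the origin still has to serve when the CDN answers a share of requests."""
    return total_gb * (1 - cache_hit_ratio)


def egress_cost(gb, rate=EGRESS_RATE_PER_GB):
    """Cost of sending `gb` out at a flat per-GB rate (tiered pricing ignored)."""
    return gb * rate


monthly_traffic_gb = 10_000  # roughly 10TB per month
hit_ratio = 0.9              # hypothetical: 90% of bytes served from cache

print(f"No CDN: ${egress_cost(monthly_traffic_gb):.2f}")
print(f"Origin traffic behind CDN: {origin_egress_gb(monthly_traffic_gb, hit_ratio):.0f} GB")
print(f"Origin egress at the same rate: ${egress_cost(origin_egress_gb(monthly_traffic_gb, hit_ratio)):.2f}")
```

Raising the hit ratio from 90% to 95% halves origin traffic again, which is why cache headers and cache key design tend to pay off more than negotiating the per-GB rate.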
Cost Management Tools
AWS Cost Explorer: Shows what you're spending and trends. Built into AWS.
Infracost: Estimates infrastructure cost changes. "This Terraform change will increase costs by $200/month." Helps catch cost surprises before deployment.
CloudHealth: Multi-cloud cost management. If you use AWS and GCP, CloudHealth aggregates costs.
Kubecost: Kubernetes cost monitoring. Shows cost per pod, per namespace, per service.
Use tools to gain visibility. What you can't measure, you can't optimize.
Cost Optimization for Early-Stage Startups
Budget is tight. Infrastructure cost feels expensive. Here's a realistic early-stage setup:
- Application: managed platform (Vercel, Railway) or single VPS (~$50/month)
- Database: managed service (PlanetScale, Neon) (~$20/month for startup tier)
- Storage: S3 with Glacier lifecycle (~$10/month for reasonable usage)
- Monitoring: CloudWatch free tier or Datadog starter (~$0-100/month)
- Total: roughly $80-180/month
This gets you far. As you scale and costs grow, optimize aggressively.
Enterprise Cost Management
Large companies have dedicated FinOps teams. Infrastructure costs are millions. A 10% optimization saves hundreds of thousands.
Enterprise practices:
- Detailed cost allocation by service and team
- Regular cost reviews with engineering leadership
- Automated cost optimization (right-sizing, commitment purchasing)
- Vendor negotiations (volume discounts)
- Multi-cloud strategy (avoid lock-in, optimize provider choice per workload)
Common Cost Mistakes
Ignoring costs until the bill is huge. Monthly reviews catch problems early.
Choosing expensive services for flexibility you don't need. Use the cheapest service that meets your needs.
Running identical infrastructure for dev/staging/prod. Dev and staging can be much smaller than prod.
Not automating cost optimization. As a rough rule of thumb, manual reviews catch only a minority of the available optimizations (think 20%), while automation can catch most of them (think 80%).
The Trade-Off: Cost vs Reliability/Speed
| Approach | Cost | Reliability | Speed |
|---|---|---|---|
| Aggressive Cost Optimization | Minimum cost, bare bones | Lower (risks downtime to save money) | Slower (fewer resources, longer response times) |
| Balanced Approach | Reasonable, optimized but not minimal | Good (handles normal failures) | Good (responsive, not bloated) |
| Cost No Object | Expensive, over-provisioned | Excellent (redundancy everywhere) | Fast (abundant resources) |
Most teams choose balanced. Cost matters but not at the expense of reliability.
The Reality
Cloud costs don't manage themselves. Set up monitoring, review monthly, optimize continuously. A few hours/month of optimization saves thousands.
When cost is a constraint, cloud can be very cost-effective: you pay only for what you use, so you can build on a shoestring budget. When cost is unconstrained, cloud enables growth without capacity planning.
Either way, understanding and managing costs is essential to sustainable infrastructure.