Infrastructure
Application Logging
When something breaks in production, logs are your time machine. They show you exactly what was happening when the problem occurred. Good logging practices save hours of debugging. Bad logging practices waste days.
Why Logging Matters
Production breaks at 2am. A user reports an error. You look at the logs and see exactly what happened. Without logs, you're blind: you can only guess.
Logs are also an audit trail. Who accessed what data? When? This is critical for compliance (SOC 2, HIPAA, GDPR). And for debugging, logs are invaluable—you can see the exact sequence of events that led to a bug.
Beyond debugging, logs provide observability. They answer questions: "Which users are affected by this bug?" "Did the migration work?" "Are requests going to the new server or old one?"
Structured Logging
Bad logging:
console.log('User login');
This is unstructured. It tells you that something happened, but not who, when, or in what context. Good logging:
logger.info('user_login', {
  userId: '12345',
  email: 'user@example.com',
  timestamp: '2024-03-17T14:30:00Z',
  ipAddress: '192.168.1.1'
});
This is structured logging (JSON). Each log is an object with fields. You can search, filter, and analyze logs programmatically.
Structured logging benefits:
- Easy to search: "Show all login failures for user@example.com"
- Easy to filter: "Show errors in the payment service"
- Easy to correlate: "Show all logs for request ID X"
- Easy to analyze: "What's the average response time?"
Most logging libraries support structured logging. Use them.
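To make this concrete, here's a minimal hand-rolled structured logger in Node.js. It's a sketch, not a library recommendation: in practice you'd reach for something like pino or winston, which follow the same shape (one JSON object per line, consistent base fields).

```javascript
// Minimal structured-logger sketch. Every entry is one JSON object
// on one line, with a consistent set of base fields.
function format(level, service, event, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service,
    event,
    ...fields,
  });
}

function createLogger(service) {
  const emit = (level) => (event, fields) =>
    console.log(format(level, service, event, fields));
  return {
    debug: emit('debug'),
    info: emit('info'),
    warn: emit('warn'),
    error: emit('error'),
  };
}

const logger = createLogger('auth');
logger.info('user_login', { userId: '12345', ipAddress: '192.168.1.1' });
```

Because every line is JSON, any log platform can index the fields without custom parsing.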
Log Levels
Logs have levels indicating severity:
DEBUG: Detailed information for developers. Variable values, control flow. Development only, usually not in production.
INFO: General informational messages. "Server started," "User logged in," "Payment processed." Production should have INFO level.
WARN: Something unexpected happened but the application continued. "Retry failed but succeeded on next attempt," "Slow request (5 seconds)," "Deprecated API called."
ERROR: Something failed. The request couldn't be processed. "Database connection failed," "Payment processing failed," "Invalid input received."
FATAL/CRITICAL: The application can't recover. It's crashing or shutting down.
Use the right level:
// Development
logger.debug('Fetching user from database', { userId });
const user = await database.getUser(userId);
// Success
logger.info('User fetched', { userId, email: user.email });
// Retriable failure
logger.warn('Database connection slow', { duration: 2000 });
// Failure: the request can't be processed
logger.error('Database connection failed', { error: e.message });
What to Log
Requests: Log when requests come in and when they complete. Include method, URL, response status, duration.
Errors: Always log errors with full stack trace and context. Include the operation being performed and why it failed.
External calls: When calling external APIs, log the call and response. If the API fails, log what happened.
User actions: User signed up, user bought something, user canceled subscription. Important business events should be logged.
Slow operations: Database queries taking too long, API calls taking too long. These logs help identify performance problems.
Configuration: When the application starts, log the configuration (not secrets, but important values like API endpoints, feature flags).
What NOT to Log
Passwords: Never log passwords. Ever. Not in production, not in development.
API keys and tokens: Secrets in logs are exposed to anyone with log access. Never log them.
Credit card numbers: PCI DSS forbids logging full credit card numbers. Log last 4 digits at most.
Personally identifiable information (PII): Be careful logging email addresses, phone numbers, and physical addresses. If your logs leak, you've exposed PII. Log minimally or not at all.
Health check requests: Logging every healthcheck pollutes your logs. Filter them out.
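Redaction is cheap insurance against most of the list above. Here's a sketch that strips known-sensitive fields before anything reaches the logger; the field names are illustrative, and libraries like pino ship a built-in `redact` option that does this by path.

```javascript
// Fields that must never appear in logs (illustrative list).
const SENSITIVE = new Set(['password', 'apiKey', 'token', 'cardNumber']);

// Replace sensitive values before the entry is written.
function redact(fields) {
  const out = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SENSITIVE.has(key) ? '[REDACTED]' : value;
  }
  return out;
}

console.log(JSON.stringify(redact({ userId: '12345', password: 'hunter2' })));
```

Running redaction centrally, inside the logger, beats hoping every call site remembers to omit secrets.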
Centralized Log Management
Each application process writes logs to its own stdout or local files. In production with multiple servers, logs are scattered across machines. You need centralized log management.
Datadog Logs: Part of the Datadog platform. Send logs to Datadog, search them with a web interface. Expensive but comprehensive.
Logtail: Logs-as-a-service. Send logs to Logtail, search and analyze. Cheaper than Datadog for logs-only.
AWS CloudWatch Logs: Part of AWS. Send logs to CloudWatch, query them. Good if you're on AWS already.
Elastic Stack (ELK): Elasticsearch, Logstash, Kibana. Open-source log management. Powerful but requires running and maintaining infrastructure.
Splunk: Enterprise log management. Expensive but feature-rich.
For most startups, a logs-as-a-service solution (Logtail, Datadog) is best. For large organizations, ELK or Splunk might make sense.
Request IDs for Tracing
A user reports a problem. You search logs for the user ID and find 1000 logs. Which ones are from the problem incident?
Solution: request IDs. Assign each request a unique ID. Include it in all logs for that request. Track the request through your entire system.
const requestId = generateUUID();
logger.info('request_start', { requestId, method, url });
// Call database
logger.info('database_query', { requestId, query });
// Call external API
logger.info('api_call', { requestId, endpoint });
logger.info('request_end', { requestId, status, duration });
Now you can search logs for `requestId: abc-123` and see the entire journey of the request through your system.
Log Retention Policies
Logs accumulate. A million requests per day, each with 10 logs. That's 10 million log entries per day. Storage costs grow.
Define retention policies:
- DEBUG logs: keep for 3 days (development debugging)
- INFO logs: keep for 30 days (operational understanding)
- ERROR logs: keep for 90 days (debugging issues)
- Audit logs: keep for 1 year (compliance)
Balance cost against need. 30 days of INFO logs is usually sufficient. Going back 1 year is rarely necessary and very expensive.
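The policy above is easy to express as code, though in practice you'd configure retention in your log platform's settings rather than prune entries yourself. The cutoffs here are the example values from the list:

```javascript
// Retention policy as code (example values from the list above).
const RETENTION_DAYS = { debug: 3, info: 30, error: 90, audit: 365 };
const DAY_MS = 24 * 60 * 60 * 1000;

function shouldRetain(entry, nowMs) {
  const days = RETENTION_DAYS[entry.level] ?? 30; // default: 30 days
  return nowMs - entry.timestampMs < days * DAY_MS;
}
```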
Development vs Production Logging
In development, log everything: DEBUG level, verbose output. This helps you understand what your code is doing.
In production, log less: INFO level and above. Verbose debugging logs waste storage and slow the system down.
Achieve this with configuration:
const logLevel = process.env.NODE_ENV === 'production' ? 'info' : 'debug';
const logger = createLogger({ level: logLevel });
Searching and Querying Logs
Your log aggregation system should have a query language. Find logs for a specific user:
userId:"12345" AND level:error
Find all errors in the past hour:
level:error AND timestamp:[now-1h TO now]
Good query support is essential. If you can't find the logs you need, they're useless.
Common Logging Mistakes
Logging too much. INFO logs for every line of code. Your logs are terabytes per day and the cost is astronomical. Log important events, not every action.
Logging too little. When an error occurs, you have no context. Log enough to understand what was happening.
No request IDs. Tracing a specific request is impossible. Add request IDs to everything.
Unstructured logs. Logs are free-form text. You can't parse them. Always use structured logging.
No retention policy. Logs grow forever. Eventually you run out of storage or budget. Define retention.
Logging Best Practices
Log business context. Don't just log technical details. "Database query took 500ms" is less useful than "Database query for user payment took 500ms, delayed request by 250ms."
Include timestamps and request IDs. Essential for correlating logs.
Use consistent field names. If one log says `userId` and another says `user_id`, searching is hard. Pick naming conventions and stick to them.
Log stack traces for errors. The full stack trace tells you exactly where the error happened.
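A small helper makes this habit automatic. The helper name is illustrative; most logging libraries serialize an `Error` similarly when you pass one in:

```javascript
// Serialize an Error into log fields, stack trace included.
// (errorFields is an illustrative name, not a library API.)
function errorFields(err) {
  return {
    errorName: err.name,
    message: err.message,
    stack: err.stack, // where it was thrown, plus the call path
  };
}

// Usage:
try {
  JSON.parse('not json');
} catch (err) {
  console.log(JSON.stringify({ level: 'error', event: 'parse_failed', ...errorFields(err) }));
}
```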
The Reality
Logging is not glamorous. It's not fun to think about. But when you need logs, you need them desperately. A 3am incident? Logs are your lifeline.
Invest in good logging practices early. It pays back immediately when debugging issues. Structured logging + centralized storage + request IDs = fast debugging. Lack of logging = hours spent guessing.