Error Handling and Logging

Errors happen. Databases go down, external APIs time out, users send malformed data. How you handle errors determines whether your application degrades gracefully or crashes.

Unhandled errors crash the server. Handled errors return clear responses to the client and log enough information to debug. The difference is professionalism.

Error Classification

Two types of errors:

Operational Errors

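Expected errors that can be handled: user not found, invalid email, payment declined, external service timeout. These are normal. Return a clear message with a fitting status code: 4xx when the request itself is at fault, 5xx (for example 503 or 504) when a dependency such as an external service failed.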

Programmer Errors

Bugs: null reference, undefined method, assertion failure. These shouldn't happen in production. Crash hard, log everything, alert the team. You must fix the bug.

Don't confuse the two. Operational errors are handled (return error response). Programmer errors are not (crash and log).
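One way to keep the two categories apart in code is to give operational errors their own exception type and catch only that type at the request boundary. The sketch below is a minimal illustration, not a framework API: OperationalError, run_handler, and find_user are hypothetical names, and the in-memory user table stands in for a real data store.

```python
class OperationalError(Exception):
    """An expected failure the application knows how to report to the client."""

    def __init__(self, code: str, message: str, status: int = 400) -> None:
        super().__init__(message)
        self.code = code
        self.message = message
        self.status = status


def run_handler(handler):
    """Call a request handler and return (status, body)."""
    try:
        return 200, handler()
    except OperationalError as err:
        # Expected failure: answer the client and keep serving.
        return err.status, {"code": err.code, "message": err.message}
    # Anything else (AttributeError, AssertionError, ...) is a programmer
    # error. It is deliberately not caught here, so it propagates to the
    # process-level crash handler and the error tracker.


def find_user(user_id: int) -> dict:
    users = {1: {"name": "Ada"}}  # hypothetical stand-in for a database
    if user_id not in users:
        raise OperationalError("USER_NOT_FOUND", "No user with that ID", status=404)
    return users[user_id]
```

Calling run_handler(lambda: find_user(2)) yields a 404 with a structured body, while a bug inside the handler raises straight through instead of being turned into a response.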

Graceful Error Handling

Catch operational errors, log them server-side, return clear responses to the client. Never expose internal details.

Bad response: a bare 500 Internal Server Error with no body. The user doesn't know what happened or what to do next.

Good response:

{
"code": "INVALID_EMAIL",
"message": "Please provide a valid email address"
}

The client knows exactly what went wrong and how to fix it.
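A response like the one above can be produced by keeping two messages per error: a public one for the client and an internal one for the log. This is a rough sketch under assumptions: register, InvalidEmail, and the "@" check are hypothetical and far simpler than real email validation.

```python
import json
import logging
from typing import Tuple

logger = logging.getLogger("api")


class InvalidEmail(Exception):
    """Operational error with a fixed, client-safe description."""

    code = "INVALID_EMAIL"
    public_message = "Please provide a valid email address"
    status = 400


def register(payload: dict) -> Tuple[int, str]:
    """Hypothetical signup handler returning (status, JSON body)."""
    try:
        email = payload.get("email")
        # Deliberately naive check, for illustration only.
        if not isinstance(email, str) or "@" not in email:
            raise InvalidEmail(f"rejected email field of type {type(email).__name__}")
        return 201, json.dumps({"email": email})
    except InvalidEmail as err:
        # The internal detail goes to the server log only.
        logger.warning("registration rejected: %s", err)
        # The client sees the stable code and the public message.
        body = {"code": err.code, "message": err.public_message}
        return err.status, json.dumps(body)
```

The code field gives client software something stable to branch on, while the message stays readable for a person.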

Structured Logging

Logs should be structured (JSON), not free text. Structured logs can be searched and filtered by field.

Each log should include:

Timestamp: When it happened
Level: debug, info, warn, error
Message: What happened
Context: Request ID, user ID, relevant data

Example:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "error",
  "message": "Payment processing failed",
  "request_id": "abc123",
  "user_id": 456,
  "order_id": 789,
  "error": "timeout"
}
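An entry of that shape can be emitted with Python's standard logging module and a small custom formatter. This is a sketch, not a recommended library: JsonFormatter, build_logger, and the convention of passing fields under a "context" key in extra are assumptions made for this example.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Map Python's level names onto the names used in this guide.
LEVEL_NAMES = {"WARNING": "warn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = build_logger("payments")
logger.error(
    "Payment processing failed",
    extra={"context": {"request_id": "abc123", "user_id": 456,
                       "order_id": 789, "error": "timeout"}},
)
```

Because every line is a single JSON object written to stdout, a log collector can parse it without custom rules.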

Log Levels

Debug: Detailed information for debugging (variable values, function calls)
Info: Important events (user logged in, order created)
Warn: Potentially problematic (retry attempt, high response time)
Error: Failures that should be investigated (API timeout, unhandled exception)

In production, log at info level or higher. Debug logs create too much noise. In development, log at debug level.
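The level is usually chosen per environment rather than hard-coded. The sketch below assumes two environment variables, APP_ENV and LOG_LEVEL; both names are made up for this example and are not a standard.

```python
import logging
import os


def configure_level(logger: logging.Logger, env=None) -> int:
    """Set the logger's level from the environment and return it."""
    env = os.environ if env is None else env
    # Assumed convention: APP_ENV=production means info, anything else debug.
    in_production = env.get("APP_ENV", "development") == "production"
    default = "INFO" if in_production else "DEBUG"
    name = env.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)  # returns an int for known names
    if not isinstance(level, int):
        level = logging.INFO  # unknown name: fall back to a safe default
    logger.setLevel(level)
    return level
```

An explicit LOG_LEVEL override makes it possible to turn on debug output temporarily in production without a code change.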

Centralized Logging

Don't log to files on individual servers: they are hard to search across machines and disappear when an instance is replaced. Use a centralized logging service: Datadog, Logtail, Better Stack, AWS CloudWatch.

Centralized logging aggregates logs from all servers, makes them searchable, and enables alerting. Log to stdout and let the service capture it.

Error Tracking

Centralized logging shows what happened. Error tracking (Sentry, Rollbar, Honeybadger) surfaces the errors that shouldn't happen: uncaught exceptions and bugs.

Error tracking services:

Capture uncaught exceptions
Group similar errors together
Track error frequency and trends
Link to source code
Notify the team of new errors

Use error tracking to catch bugs before users report them.
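To make the grouping and "new error" ideas concrete, here is a toy in-process tracker. It is only an illustration of the concept: hosted services use their own, more elaborate grouping rules, and fingerprint and ErrorTracker are hypothetical names.

```python
import hashlib
import traceback
from collections import Counter


def fingerprint(exc: BaseException) -> str:
    """Group key: exception type plus the file and function that raised it."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = f"{frames[-1].filename}:{frames[-1].name}" if frames else "unknown"
    raw = f"{type(exc).__name__}|{where}"
    return hashlib.sha256(raw.encode()).hexdigest()[:12]


class ErrorTracker:
    """Counts occurrences per group and flags groups seen for the first time."""

    def __init__(self) -> None:
        self.counts = Counter()

    def capture(self, exc: BaseException) -> bool:
        """Record the exception; return True if its group is new."""
        key = fingerprint(exc)
        self.counts[key] += 1
        return self.counts[key] == 1
```

A True return value is the moment a real service would notify the team. Repeats only increase the count, which is what makes frequency and trend tracking possible.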

Correlation IDs

A request might pass through multiple services, and each one writes its own logs. Correlation IDs connect those logs:

Request arrives with ID abc123
Service A logs with ID abc123
Service A calls Service B, passes ID abc123
Service B logs with ID abc123
You can search all logs with ID abc123 and trace the request through all services

This is critical in microservices. Without correlation IDs, debugging is nearly impossible.
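Within one Python service, the flow above can be sketched with contextvars and a logging filter. The header name X-Correlation-ID is a common convention but an assumption here, and begin_request and outgoing_headers are hypothetical helpers rather than part of any framework.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; use whatever your services agree on

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def begin_request(headers: dict) -> str:
    """Reuse the caller's ID if one arrived, otherwise create a new one."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to add when this service calls the next one."""
    return {HEADER: correlation_id.get()}


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
```

Service A would call begin_request on arrival and send outgoing_headers() along with its call to Service B, which runs the same begin_request logic and therefore logs under the same ID.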

Alerting

Logs are useless if no one sees them. Set up alerts for important events:

Error rate spikes
Specific errors occur (payment processor down)
Database query time exceeds threshold
External API fails repeatedly

When alerts trigger, notify the team (Slack, PagerDuty). The goal is to know about problems immediately, not when users complain.
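As an illustration of the first alert in the list, an error rate spike can be detected with a sliding time window. This is a simplified sketch with made-up defaults. In practice the check usually runs inside the logging or monitoring service rather than in application code, and ErrorRateAlert is a hypothetical name.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the share of failed requests in a time window exceeds a threshold."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 0.05,
                 min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests  # avoids alerting on tiny samples
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, is_error: bool) -> bool:
        """Record one request; return True if the alert should fire."""
        self.events.append((timestamp, is_error))
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()
        total = len(self.events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, failed in self.events if failed)
        return errors / total > self.threshold
```

When record returns True, the caller would hand off to whatever notification channel the team uses. The minimum request count keeps a single failure during a quiet period from paging anyone.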

Never Log Sensitive Data

Don't log passwords, credit card numbers, API keys, or personal information. If logs are breached, that data is exposed along with them.

Mask sensitive data: log "payment: ***" instead of the actual card number.
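Masking is easiest to enforce in one place, just before a log entry is written. The sketch below is deliberately simple and is no guarantee of catching everything: the key names in SENSITIVE_KEYS and the card number pattern are assumptions to adapt to your own data.

```python
import re

# Assumed field names; extend this set to match your own payloads.
SENSITIVE_KEYS = {"password", "card_number", "api_key", "authorization"}

# 13 to 19 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(value):
    """Return a copy of value with sensitive fields and card-like numbers masked."""
    if isinstance(value, dict):
        return {
            key: "***" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return CARD_PATTERN.sub("***", value)
    return value
```

The pattern will also mask other long digit strings, such as numeric IDs stored as text. Over-masking is usually the safer trade-off in logs.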

The Principle

Errors happen. How you handle them separates professionals from amateurs. Catch operational errors gracefully, log structured data, alert on issues, and fix bugs. This is the foundation of reliable systems.

Warning

If you can't see what's happening in production, you can't fix it. Invest in logging and monitoring. It pays for itself in reduced debugging time.