
API Error Handling and Resilience


Any network call can fail. Timeouts, connection refused, 5xx errors, malformed responses. Your application will encounter these failures. The question is whether your application gracefully handles them or crashes.

Distributed Systems Fail

A network call crosses multiple systems: your application, your internet connection, the internet, the remote service's infrastructure. Each layer is a potential failure point. Even reliable services fail occasionally.

Design your application assuming failures. Don't let a single integration failure take down the whole application.

Graceful Degradation

Your application should continue functioning when non-critical integrations fail. If the analytics service is unavailable, track events locally and send them later. If the recommendations service is down, show popular items instead.

Identify which integrations are critical (payments) and which are nice-to-have (analytics). Handle failures differently. Critical failures might warrant an error page. Non-critical failures warrant graceful fallbacks.
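This fallback pattern can be sketched in a few lines of Python. The `client.recommend` call and the `fallback` callable are hypothetical stand-ins for your real recommendations client and your popular-items query:

```python
import logging

log = logging.getLogger(__name__)

def fetch_recommendations(user_id, client, fallback):
    """Call the recommendations service; degrade to popular items on failure."""
    try:
        return client.recommend(user_id)
    except Exception as exc:  # broad on purpose: any failure degrades, never crashes
        log.warning("recommendations unavailable, falling back: %s", exc)
        return fallback()
```

The key decision is in the lead-in paragraph above: this wrapper is only appropriate for nice-to-have integrations. A payment call should not silently swallow exceptions like this.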

Circuit Breaker Pattern

If a service is consistently failing, stop calling it temporarily. Return a fallback instead. After a period, try calling it again. If it's recovered, resume normal operation.

States: Closed (service is healthy, call it normally). Open (service is failing, return the fallback without calling). Half-Open (testing recovery: let a trial request through; if it succeeds, close the circuit; if it fails, reopen it).

This prevents cascading failures. If one service is down and you keep hammering it, you waste resources and delay recovery. Circuit breaker stops hammering and lets the service recover.
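The three states can be sketched as a minimal (decidedly not production-grade) circuit breaker in Python; the threshold and cooldown values are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after N consecutive failures,
    goes half-open after a cooldown, and closes again on one success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        if self.state == "open":
            return fallback()  # fail fast: don't hammer a known-bad service
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0      # success: close the circuit
        self.opened_at = None
        return result
```

A real implementation also needs thread safety and per-endpoint breakers; libraries like Resilience4j handle those details.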

Retry with Exponential Backoff

For transient failures (temporary network issues), retry. But not immediately—wait and retry with increasing delays. 100ms, 200ms, 400ms, 800ms. This gives the service time to recover.

Don't retry deterministic errors. HTTP 400 Bad Request won't succeed by retrying—the request is malformed. HTTP 401 Unauthorized won't succeed by retrying—you're not authenticated. Only retry transient errors like timeouts and 5xx errors.
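A sketch of that retry loop in Python, assuming a hypothetical `HTTPError` that carries the status code; in practice a library like Tenacity provides this:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # transient; 4xx (except 429) is deterministic

class HTTPError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(fn, max_attempts=4, base_delay=0.1):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except HTTPError as exc:
            if exc.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # don't retry 400/401, or we're out of attempts
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)  # roughly 0.1s, 0.2s, 0.4s ... plus jitter
```

The jitter term matters in practice: without it, many clients that failed together retry together, producing synchronized load spikes.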

Timeouts

Every network call should have an explicit timeout. If the remote service is slow or hung, don't wait forever. Timeout after 5, 10, 30 seconds—depending on the operation.

A hung request that never returns is as bad as the service being down. It blocks resources indefinitely. Set timeouts aggressively.

Idempotency Keys

For operations that must not be duplicated—payments, email sending—use idempotency keys. Generate a unique key per operation. Send it with the request. If the request is retried, send the same key. The service detects the duplicate and returns the same response without re-processing.

This prevents the classic failure mode: the client times out, retries, and the operation runs twice. With idempotency keys, retries are safe; the operation is effectively executed exactly once.
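A toy sketch of the server-side deduplication in Python. The `PaymentService` class is hypothetical, but the shape matches real APIs (Stripe, for example, accepts the key via an `Idempotency-Key` header):

```python
import uuid

class PaymentService:
    """Toy server: remembers the response for each idempotency key."""

    def __init__(self):
        self._seen = {}
        self.charges_processed = 0

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # duplicate: replay stored response
        self.charges_processed += 1
        response = {"status": "charged", "amount": amount}
        self._seen[idempotency_key] = response
        return response

# Client side: generate one key per logical operation, reuse it across retries.
key = str(uuid.uuid4())
```

The client-side rule is the one people get wrong: the key must be generated once per operation and reused on every retry, not regenerated per request.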

Monitoring External Service Health

Know when external services are down. Monitor their status pages. Set up synthetic tests that regularly call their APIs. Alert when tests fail.

This provides early warning. You know a service is down before your customers complain.
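A synthetic check can be as small as this Python sketch; `probe` and `alert` are placeholders for a real API call and your alerting hook, and the scheduling (cron, or whatever scheduler you already run) is left out:

```python
def check_health(probe, alert):
    """Run a synthetic probe against an external API; alert on any failure."""
    try:
        ok = probe()
    except Exception as exc:
        alert(f"probe raised: {exc}")
        return False
    if not ok:
        alert("probe returned unhealthy")
    return bool(ok)
```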

Error Logging

Log all external API errors with context. Which endpoint? Which parameters? What was the response? This is invaluable for debugging. When a customer reports issues, you can review logs and understand what happened.

Include stack traces and timing information. How long did the request take? Was it a timeout?
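A sketch of a logging wrapper in Python that captures endpoint, parameters, duration, and the stack trace for every failed outbound call; the service and endpoint names in any usage are illustrative:

```python
import logging
import time

log = logging.getLogger("integrations")

def logged_call(service, endpoint, fn, **params):
    """Wrap an outbound call so every failure is logged with full context."""
    start = time.monotonic()
    try:
        return fn(**params)
    except Exception:
        elapsed_ms = (time.monotonic() - start) * 1000
        # log.exception records the stack trace alongside the message
        log.exception("%s %s failed after %.0fms params=%r",
                      service, endpoint, elapsed_ms, params)
        raise
```

Be careful not to log secrets or full card numbers through `params`; redact sensitive fields before they reach the logger.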

User-Facing Error Messages

Never expose internal integration errors to users. "Stripe returned a 500" is meaningless. "OpenAI API timeout" is not actionable.

Translate errors to user-friendly messages. "Payment processing is temporarily unavailable. Please try again later." "We're having trouble generating recommendations. We'll show popular items instead."

Log the technical details for debugging. Show the user a helpful message.
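One simple way to implement the split is a lookup table from integration name to friendly message; this Python sketch uses hypothetical integration names:

```python
FRIENDLY = {
    "payments": "Payment processing is temporarily unavailable. "
                "Please try again later.",
    "recommendations": "We're having trouble generating recommendations. "
                       "We'll show popular items instead.",
}
DEFAULT = "Something went wrong on our end. Please try again."

def user_message(integration):
    """Return only a user-safe message; technical details belong in the logs."""
    return FRIENDLY.get(integration, DEFAULT)
```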

Bulkhead Pattern

Isolate failures. If you have multiple external integrations, a failure in one shouldn't affect others. Use separate thread pools, connections, or timeout settings for different services.

If Stripe is slow, it shouldn't slow down requests to OpenAI. Use bulkheads to isolate resource usage.

Fallback Strategies

Determine fallbacks for each integration. If the geocoding service is down, use cached data. If the recommendation service is down, show nothing or popular items. If payment processing is down... that's critical and probably requires an error page.

Warning
A single unhandled integration failure can crash your entire application. Design for failures. Retry with backoff, implement circuit breakers, use graceful degradation. Test failure scenarios.
Tip
Use libraries that handle these patterns for you. Libraries like Resilience4j (Java), Tenacity (Python), or opossum (Node.js) provide circuit breakers, retries, and fallback logic.