On This Page

  1. The Problem It Solves
  2. Pattern Structure
  3. When to Use
  4. When Not to Use
  5. Trade-offs
  6. Implementation Approach
  7. Anti-Patterns to Avoid
  8. Cloud-Specific Implementations
  9. References

The Problem It Solves

Without a circuit breaker, a slow or unavailable downstream service causes the calling service to exhaust its thread pool waiting for responses. Each waiting thread holds a connection and memory. New requests queue up behind the waiting threads. The calling service eventually runs out of resources and fails too — a cascading failure that takes down healthy services along with the unhealthy one.

Pattern Structure

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Service Makes Remote Call])
    START --> STATE{Circuit State}
    STATE -->|Closed - normal operation| CALL[Attempt remote call]
    CALL --> OUTCOME{Call Outcome}
    OUTCOME -->|Success| RESET[Reset failure counter\nReturn response]
    OUTCOME -->|Failure or timeout| COUNT[Increment failure counter]
    COUNT --> THRESHOLD{Failure threshold\nexceeded?}
    THRESHOLD -->|No| CALL
    THRESHOLD -->|Yes| OPEN[Open circuit\nStart recovery timer]
    STATE -->|Open - failing fast| FAST_FAIL[Fail immediately\nNo remote call made\nReturn fallback or error]
    FAST_FAIL --> TIMER{Recovery timer\nexpired?}
    TIMER -->|No| FAST_FAIL
    TIMER -->|Yes| HALF[Half-open state\nAllow probe request through]
    HALF --> PROBE[Attempt probe call]
    PROBE --> PROBE_RESULT{Probe\nSucceeded?}
    PROBE_RESULT -->|Yes| CLOSED([Close circuit\nResume normal operation])
    PROBE_RESULT -->|No| OPEN
    style START fill:#4f8ef7,color:#fff
    style CLOSED fill:#10b981,color:#fff
    style OPEN fill:#fef3c7
    style FAST_FAIL fill:#fef3c7
    style HALF fill:#e0f2fe
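
The state transitions in the diagram can be sketched in a few lines of Python. This is a minimal, single-threaded illustration rather than a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `recovery_timeout` are assumptions made for this example, and a real service would add locking so that only one probe passes in the half-open state.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker using a consecutive-failure threshold (not thread-safe)."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                # Fail fast: no remote call is made while the timer runs.
                raise CircuitOpenError("circuit open, failing fast")
            # Recovery timer expired: let this call through as the probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the counter; a successful probe closes the circuit.
        self._failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller wraps each remote call as `breaker.call(fetch_orders, customer_id)` (hypothetical function and argument) and handles `CircuitOpenError` by returning a fallback.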

When to Use

  • Any service that makes synchronous remote calls to downstream dependencies
  • Systems where a downstream dependency failing should not cause the caller to fail
  • High-traffic services where thread pool exhaustion from slow downstream calls is a realistic risk
  • Microservices architectures where cascading failures across services are a known operational concern

When Not to Use

  • Asynchronous messaging patterns where the caller does not wait for a response
  • Internal in-process calls that do not cross a network boundary
  • Simple two-tier applications where there is only one dependency and failure is acceptable

Trade-offs

| Benefit | Cost |
|---|---|
| Prevents cascading failures: failing fast protects the caller | Fallback behaviour must be designed and tested |
| Gives the downstream service time to recover | Adds latency measurement overhead per call |
| Enables graceful degradation: serve partial results | State management for the circuit requires storage or in-process counters |
| Provides operational visibility into dependency health | Half-open probe logic must be tuned per dependency |

Implementation Approach

Define thresholds appropriate to the dependency. A payment service tolerates fewer failures before opening than a recommendation service. Common starting points are to open after five consecutive failures, or after a 50% failure rate over a ten-second window.
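
A failure-rate threshold over a time window can be sketched as follows. The class name `FailureRateWindow` and its parameters are hypothetical. The `min_calls` guard is an assumption added for this example so that a single failure on a quiet service does not count as a 100% failure rate.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over a sliding time window (illustrative sketch)."""

    def __init__(self, window_seconds=10.0, rate_threshold=0.5, min_calls=10,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls  # assumption: ignore the rate until enough calls are seen
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded), oldest first

    def record(self, succeeded):
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def should_open(self):
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.rate_threshold
```

A stricter dependency such as payments might use a lower `rate_threshold` than a recommendation service; the right values depend on observed traffic and are not prescribed here.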

Implement meaningful fallbacks. When the circuit is open, return a cached result, a default value, or a clear error that the upstream caller can handle. A cached product catalogue from five minutes ago is better than an exception that propagates to the user.
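
One way to structure a cached fallback is sketched below. All names (`CachedFallback`, `call_with_fallback`, `CircuitOpenError`) are hypothetical, and the five-minute maximum age simply mirrors the catalogue example above. The function also returns a label for the source of the value so that callers can tell when a response is degraded.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


class CachedFallback:
    """Remembers the last good value per key for a limited time."""

    def __init__(self, max_age_seconds=300.0, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def remember(self, key, value):
        self._entries[key] = (self._clock(), value)

    def lookup(self, key):
        """Return (found, value); entries older than max_age count as missing."""
        entry = self._entries.get(key)
        if entry is None or self._clock() - entry[0] > self._max_age:
            return False, None
        return True, entry[1]


def call_with_fallback(key, guarded_call, cache, default):
    """Try the guarded call; if the circuit is open, fall back to cache, then default."""
    try:
        value = guarded_call()
    except CircuitOpenError:
        found, cached = cache.lookup(key)
        return (cached, "cached") if found else (default, "default")
    cache.remember(key, value)  # refresh the fallback on every success
    return value, "live"
```

This sketch only falls back when the circuit rejects the call. Whether ordinary call failures should also use the fallback is a design decision for each dependency.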

Expose circuit state as a metric. The circuit state — closed, open, half-open — and the failure rate per dependency are essential operational metrics. Alert when any circuit opens in production. A circuit opening is a signal that a dependency is failing.
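
A minimal sketch of recording these metrics follows. It keeps values in memory and logs a warning when a circuit opens. The `CircuitMetrics` class and the numeric state encoding are assumptions for this example; a real service would forward the same values to its metrics and alerting system.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")

# Assumed numeric encoding so the state can be exported as a gauge.
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}


class CircuitMetrics:
    """In-memory metrics for circuit state and failure rate per dependency."""

    def __init__(self):
        self.state_gauge = {}          # dependency -> numeric state
        self.transitions = Counter()   # (dependency, old, new) -> count
        self.calls = Counter()         # (dependency, outcome) -> count

    def on_transition(self, dependency, old_state, new_state):
        self.state_gauge[dependency] = STATE_VALUES[new_state]
        self.transitions[(dependency, old_state, new_state)] += 1
        if new_state == "open":
            # An opened circuit signals a failing dependency, so make it visible.
            logger.warning("circuit opened for dependency %s", dependency)

    def on_call(self, dependency, succeeded):
        outcome = "success" if succeeded else "failure"
        self.calls[(dependency, outcome)] += 1

    def failure_rate(self, dependency):
        ok = self.calls[(dependency, "success")]
        bad = self.calls[(dependency, "failure")]
        total = ok + bad
        return bad / total if total else 0.0
```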

Set appropriate timeouts on the calls the circuit wraps. A circuit breaker without a timeout is incomplete: if a call never times out, a hung dependency never registers as a failure and the circuit never opens. Set the timeout shorter than the caller's own timeout so that failures are detected before the caller itself times out.
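
The relationship between the two timeouts can be sketched as below. The helper names and the 20% safety margin are assumptions for illustration. The thread-pool wrapper only stops waiting for the call; it does not interrupt the call itself, so where the client library offers its own connect and read timeouts, those are usually the better tool.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def downstream_timeout(caller_timeout, safety_margin=0.2):
    """Pick a timeout strictly shorter than the caller's own budget."""
    if caller_timeout <= 0:
        raise ValueError("caller_timeout must be positive")
    return caller_timeout * (1.0 - safety_margin)


def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread and stop waiting after timeout_seconds."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only prevents a task that has not started; a running one continues.
        future.cancel()
        raise TimeoutError(f"no response within {timeout_seconds:.2f}s")
```

The `TimeoutError` raised here would be counted as a failure by the circuit breaker wrapping the call.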

Anti-Patterns to Avoid

⚠ 1. Circuit Breaker Without a Fallback

The circuit opens, but the resulting exception is unhandled and propagates to the user as a 500 error. The cascade is stopped at the service boundary, but the user experience is no better than it would be without a circuit breaker.

Correct Approach

Design a fallback response for every circuit that can open. The fallback may be degraded — an empty list, a cached result, a user-visible message — but it is a deliberate choice, not an unhandled exception.

⚠ 2. Inconsistent Circuit State Across Instances

Each instance of a horizontally scaled service maintains its own in-process circuit state. Instance A opens its circuit while Instance B sees different traffic and stays closed. The circuit state is inconsistent across the fleet.

Correct Approach

For stateless, horizontally scaled services, either use a distributed circuit breaker backed by a shared cache (such as Redis), or accept that each instance manages its own state independently and use percentage-based thresholds rather than absolute counts.
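
The shared-state option can be sketched with a small store interface. Here `InMemorySharedStore` is a stand-in for a shared cache such as Redis, and all class and method names are hypothetical. The sketch uses wall-clock time because the deadline is read by several instances, which assumes their clocks are reasonably synchronised.

```python
import time


class InMemorySharedStore:
    """Stand-in for a shared cache; a real deployment might back this with Redis."""

    def __init__(self):
        self._open_until = {}  # dependency -> wall-clock deadline

    def get_open_until(self, dependency):
        return self._open_until.get(dependency)

    def set_open_until(self, dependency, deadline):
        self._open_until[dependency] = deadline


class SharedCircuitFlag:
    """Lets every instance see the same open/closed decision for a dependency."""

    def __init__(self, store, dependency, recovery_timeout=30.0, clock=time.time):
        self._store = store
        self._dependency = dependency
        self._recovery_timeout = recovery_timeout
        self._clock = clock

    def trip(self):
        """Mark the circuit open for every instance that reads the same store."""
        deadline = self._clock() + self._recovery_timeout
        self._store.set_open_until(self._dependency, deadline)

    def is_open(self):
        deadline = self._store.get_open_until(self._dependency)
        return deadline is not None and self._clock() < deadline
```

With this arrangement, an instance that detects the failure calls `trip()`, and every other instance reading the same store starts failing fast until the deadline passes. The cost is an extra lookup per call and a dependency on the shared store itself.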

Cloud-Specific Implementations

  • AWS: Lambda and API Gateway have built-in timeout and retry configuration. For circuit breaker state shared across instances, use ElastiCache Redis. Resilience4j is a Java library that provides circuit breakers and can be used in Java-based Lambda functions.

Flowchart

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    START([Remote Call Attempted])
    START --> CB_STATE{Circuit State}
    CB_STATE -->|Closed| ATTEMPT[Make remote call\nStart timeout timer]
    CB_STATE -->|Open| FAST[Fail immediately\nReturn fallback response\nNo network call made]
    ATTEMPT --> CB_RESULT{Result within\ntimeout?}
    CB_RESULT -->|Success| SUCCESS_CB[Return response\nReset failure counter]
    CB_RESULT -->|Failure or timeout| FAIL_CB[Record failure\nCheck threshold]
    FAIL_CB --> THRESH_CB{Threshold\nexceeded?}
    THRESH_CB -->|No| ATTEMPT
    THRESH_CB -->|Yes| OPEN_CB[Open circuit\nLog alert to observability\nStart recovery timer]
    FAST --> RECOVER{Recovery\ntimer expired?}
    RECOVER -->|No| FAST
    RECOVER -->|Yes| HALF_CB[Half-open\nAllow one probe request]
    HALF_CB --> PROBE_CB{Probe\nsucceeded?}
    PROBE_CB -->|Yes| CLOSE_CB([Close circuit\nResume normal operation])
    PROBE_CB -->|No| OPEN_CB
    style START fill:#4f8ef7,color:#fff
    style CLOSE_CB fill:#10b981,color:#fff
    style OPEN_CB fill:#fef3c7
    style FAST fill:#fef3c7
    style HALF_CB fill:#e0f2fe

References

  1. Nygard, Michael T. — Release It! Design and Deploy Production-Ready Software. Pragmatic Bookshelf, 2018.
  2. Fowler, Martin — Circuit Breaker. martinfowler.com/bliki/CircuitBreaker
  3. Resilience4j — Circuit breaker for Java. resilience4j.readme.io
  4. Netflix — Hystrix: Latency and Fault Tolerance. github.com/Netflix/Hystrix
Ascendion Engineering Knowledge Base