
Integration Patterns

Proven blueprints for connecting services without creating hidden dependencies

TOGAF ADM NIST CSF ISO 27001 AWS Well-Arch Google SRE AI-Native
💡
In Plain English

Imagine a post office: you drop a letter in the box (producer) and the recipient picks it up when ready (consumer). Neither needs to know when the other is available. Integration patterns apply this same idea to software — services communicate reliably without requiring simultaneous availability.

📈
Business Value

Poor integration is the #1 cause of microservice project failures. Integration patterns reduce system-wide failure blast radius by 60–80%, enable independent team deployment, and are the foundation of every high-availability architecture at companies like Uber, Airbnb, and all major Philippine banks.

📖 Detailed Explanation

Integration is where most distributed system failures are born. A synchronous call chain (Service A calls B, which calls C, which calls D) means that 100ms of latency at each hop stacks into 400ms at A, and a failure in D cascades to take down C, B, and A simultaneously. This is the core problem that integration patterns exist to solve.

Synchronous vs. Asynchronous is the fundamental integration choice. Synchronous (REST, gRPC) is appropriate when the caller needs an immediate answer — a payment authorization, a balance inquiry. Asynchronous (Kafka, SQS, RabbitMQ) is appropriate when the caller does not need an immediate answer — sending a confirmation email, updating a search index, notifying a downstream analytics system. The mistake most teams make is defaulting to synchronous everywhere because it feels simpler. It is only simpler until you need to trace why your p99 latency is 8 seconds.
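The distinction can be sketched with a minimal in-memory queue standing in for Kafka or SQS; the service names, payloads, and the always-true payment stub are illustrative, not a real API:

```python
from collections import deque

queue = deque()  # stands in for Kafka / SQS

def authorize_payment(order_id: str) -> bool:
    # Synchronous: the caller needs the answer before it can proceed.
    return True  # stand-in for a blocking REST/gRPC call

def send_confirmation_email(order_id: str) -> None:
    # Asynchronous: enqueue and return; the consumer need not be running now.
    queue.append({"type": "OrderConfirmed", "order_id": order_id})

# Producer side: only the payment call blocks; the email is fire-and-forget.
ok = authorize_payment("ord-1")
send_confirmation_email("ord-1")

# Consumer side: drains whatever is queued, whenever it happens to run.
while queue:
    event = queue.popleft()
```

The producer never waits on the email worker, so a slow or down email service cannot inflate checkout latency.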

The Saga Pattern solves the hardest integration problem: distributed transactions. In a monolith, you use a database transaction — either everything commits or nothing does. In a microservices architecture, you have multiple independent databases. If Order Service creates an order and Payment Service charges the card and Inventory Service reserves stock — what happens when Inventory fails after Payment succeeds? Sagas address this via compensating transactions. Each step publishes an event; if a later step fails, compensating events reverse the earlier steps. Uber uses saga orchestration for its trip lifecycle across 20+ services.
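The compensating-transaction mechanic can be sketched in a few lines; step names are illustrative, and a production saga would persist this state rather than hold it in memory:

```python
log = []  # records which actions and compensations actually ran

def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, undo
    completed steps in reverse (LIFO) order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for compensate in reversed(completed):
                compensate()  # compensating transaction
            return False
    return True

def fail(name):
    def _f():
        raise RuntimeError(name)
    return _f

# Inventory fails after Payment succeeds: refund, then cancel the order.
result = run_saga([
    (lambda: log.append("create_order"), lambda: log.append("cancel_order")),
    (lambda: log.append("charge_card"),  lambda: log.append("refund_card")),
    (fail("reserve_stock"),              lambda: log.append("release_stock")),
])
```

Note the undo order: the card is refunded before the order is cancelled, mirroring the forward sequence in reverse.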

The Outbox Pattern solves a subtle but critical problem: how do you guarantee that when you save a record to your database AND publish an event to Kafka, both happen atomically? If the Kafka publish fails after the database commit, your event is lost. The Outbox Pattern writes the event to an outbox table in the same database transaction as the business record, then a separate process (Debezium, a polling publisher) reads the outbox and publishes to Kafka. The result is guaranteed at-least-once delivery without distributed transactions; paired with idempotent consumers, that yields effectively-once processing.
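A runnable sketch of the pattern, with SQLite standing in for the service's database and a plain function standing in for the Debezium relay; table and column names are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)""")

# One ACID transaction: the business record and its event commit (or roll
# back) together, so the event cannot be lost once the order is saved.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?)", ("ord-1", 99.5))
    conn.execute(
        "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
        ("OrderCreated", json.dumps({"order_id": "ord-1", "total": 99.5})),
    )

def relay(publish):
    """Separate relay (Debezium or a polling publisher in production):
    read unpublished outbox rows, publish them, mark them done."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

bus = []  # stands in for Kafka
relay(lambda t, p: bus.append((t, p)))
```

If the relay crashes between publishing and marking the row, the row is re-read and re-published on restart: this is exactly why consumers must be idempotent.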

Event-Carried State Transfer is a pattern where events carry the full data payload needed by consumers — not just an ID. When an Order event carries customer name, shipping address, and line items, consumers can act without calling back to the Order service. This eliminates synchronous dependency chains between services and enables consumers to operate independently even when the Order service is down.
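The difference between a thin event and a state-carrying one, sketched with illustrative field names:

```python
# Thin event: forces every consumer to call back to the Order service.
thin_event = {"type": "OrderCreated", "order_id": "ord-1"}

# Fat event (event-carried state transfer): consumers act on the payload alone.
fat_event = {
    "type": "OrderCreated",
    "order_id": "ord-1",
    "customer": {"id": "cust-7", "name": "Maria Santos"},
    "shipping_address": "123 Rizal Ave, Manila",
    "line_items": [{"sku": "SKU-42", "qty": 2}],
}

def build_shipping_label(event: dict) -> str:
    # No callback to the Order service: everything needed is in the event,
    # so this consumer keeps working even while the Order service is down.
    return f"{event['customer']['name']} / {event['shipping_address']}"

label = build_shipping_label(fat_event)
```

The trade-off is larger messages and some data duplication, in exchange for removing a synchronous dependency.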

The Claim Check Pattern solves message bus performance: if you're streaming large payloads (PDFs, images, large JSON blobs) through a message bus, you'll saturate it quickly. Instead, store the payload in object storage (S3, Blob Storage) and put a reference token (the "claim check") in the message. Consumers redeem the claim check to fetch the actual payload. The message bus stays lean and fast.
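A sketch of the check-in/redeem round trip, with a dict standing in for object storage; the token format and message fields are assumptions:

```python
import uuid

object_store = {}  # stands in for S3 / Blob Storage

def check_in(payload: bytes) -> dict:
    """Store the heavy payload out-of-band; return a lean message that
    carries only the claim-check token."""
    token = str(uuid.uuid4())
    object_store[token] = payload
    return {"type": "DocumentUploaded", "claim_check": token}

def redeem(message: dict) -> bytes:
    """Consumer side: exchange the token for the actual payload."""
    return object_store[message["claim_check"]]

# A ~12 KB stand-in for a large PDF never touches the message bus.
message = check_in(b"%PDF-1.7 ..." * 1000)
payload = redeem(message)
```

The message itself stays under a couple hundred bytes regardless of payload size, which is the whole point.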

Idempotency is the invisible glue that makes all of these patterns safe. At-least-once delivery means consumers will occasionally receive the same message twice. Every consumer must be designed to process duplicate messages without side effects — by checking a processed-event-ID table, using database upserts with deterministic keys, or by making the operation naturally idempotent (setting a status to 'confirmed' is idempotent; incrementing a counter is not).
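A minimal idempotent-consumer sketch, with an in-memory set standing in for the processed-event-ID table; it also shows why the status update is safe to repeat while the counter is not:

```python
processed_ids = set()  # stands in for a processed-event-ID table
order = {"status": "pending", "email_count": 0}

def handle(event: dict) -> None:
    if event["event_id"] in processed_ids:
        return  # duplicate delivery: drop it, no side effects
    order["status"] = "confirmed"  # naturally idempotent on its own
    order["email_count"] += 1      # NOT idempotent: needs the guard above
    processed_ids.add(event["event_id"])

event = {"event_id": "evt-1", "type": "PaymentSucceeded"}
handle(event)
handle(event)  # at-least-once delivery: the same message arrives twice
```

Without the guard, the duplicate delivery would send a second confirmation email; with it, the second call is a no-op.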

📈 Architecture Diagram

sequenceDiagram
    participant O as Order Service
    participant DB as Order DB + Outbox
    participant D as Debezium CDC
    participant K as Kafka
    participant P as Payment Service
    participant I as Inventory Service

    O->>DB: BEGIN TX: INSERT order + INSERT outbox_event
    DB-->>O: COMMIT
    D->>DB: Poll outbox (CDC)
    D->>K: Publish OrderCreated event
    K-->>P: OrderCreated
    P->>P: Charge card
    P->>K: PaymentSucceeded
    K-->>I: PaymentSucceeded
    I->>I: Reserve stock
    I->>K: StockReserved
    K-->>O: StockReserved
    O->>DB: UPDATE order status = CONFIRMED

Outbox Pattern + Saga choreography: order creation flows through distributed services with guaranteed event delivery and compensating transaction support.

🌎 Real-World Examples

Netflix — Bulkhead and Circuit Breaker
Los Gatos, USA · Video Streaming · 260M subscribers

Netflix's Hystrix library (now in maintenance mode, with Resilience4j as its community successor) pioneered the Bulkhead and Circuit Breaker patterns in microservices. Each service dependency gets its own thread pool (bulkhead) — a slow recommendation service cannot exhaust threads needed for video playback. Circuit breakers open after 50% error rate in a 10-second window, returning cached or degraded responses instantly. Netflix runs 'Game Days' quarterly to verify these patterns hold under real failure scenarios.

✓ Result: Streaming availability maintained at 99.97%+ even during AWS regional outages; cascading failures eliminated across 1,000+ microservices

Uber — Saga Orchestration
San Francisco, USA · Ride-hailing · 25M trips/day

Uber's trip lifecycle (request → match → dispatch → ride → payment → rating) is a 7-step orchestrated Saga on their Cadence workflow engine (the open-source predecessor of Temporal). Each step has a compensating transaction: driver re-dispatch on match failure, automatic refund if payment succeeds but driver confirmation fails. The Outbox Pattern with Cassandra + Kafka guarantees zero trip events are lost even during datacenter failover.

✓ Result: Zero trip event loss across 25M+ daily trips; full saga state observable per trip for customer dispute resolution

Shopify — Event-Driven Resilience
Ottawa, Canada · E-commerce · 15M+ merchants

Shopify's Black Friday 2023 ($9.3B sales) ran on choreography-based sagas. Bulkheads isolate each merchant's order pipeline — a viral product launch for one merchant cannot degrade others. Dead Letter Queues per merchant tier with real-time alerting ensure no order is silently dropped. Circuit breakers per payment processor handle processor outages without checkout disruption.

✓ Result: $9.3B on a single day with 99.999% checkout availability; two payment processor outages handled transparently

LinkedIn — Idempotent Feed Architecture
Sunnyvale, USA · Professional Network · 900M members

LinkedIn's activity feed uses pure Choreography: every event (PostPublished, ConnectionMade, ProfileUpdated) fans out via Kafka to 12+ consumers. All consumers deduplicate on composite idempotency keys, so Kafka's at-least-once delivery never surfaces duplicate feed items. Schema Registry with Avro enforces backward-compatible schema evolution across all consumers.

✓ Result: Feed P99 latency 85ms for 900M members; zero duplicate posts; schema registry blocks breaking changes from reaching consumers

🌟 Core Principles

1
Loose Coupling via Messaging

Services communicate via messages wherever the response is not time-critical. This eliminates runtime dependency — the consumer does not need to be running when the producer sends.

2
Idempotency at Every Consumer

Message consumers must produce the same result if the same message is received twice. Use idempotency keys, processed-event tables, or naturally idempotent operations.

3
Contract-First Integration

Define the message schema or API contract before building either side. Use schema registries and API design tools. This prevents incompatible changes from reaching production.

4
Failure Isolation

A failure in one integration path must not propagate to other services. Circuit breakers stop cascading synchronous failures. Dead letter queues contain asynchronous failure storms.

5
Observability of Integration Points

Every message published, consumed, delayed, or failed must be observable. Integration points are where most distributed system mysteries live — instrument them exhaustively.

⚙️ Implementation Steps

1

Map Your Integration Topology

Draw every service-to-service communication in your system: direction, protocol, synchronous or async, schema format. Color-code synchronous calls red. Review: are any critical paths all-red? Those are your availability risks.

2

Classify by Communication Need

For each integration: does the caller need an immediate response? If yes — synchronous REST or gRPC. If no — async messaging (Kafka, SQS). If bulk — batch file. If real-time stream — Kafka or Kinesis.

3

Design Idempotency into Every Consumer

Before writing consumer code, answer: 'What happens if this message is delivered twice?' If the answer is 'bad things,' implement idempotency: generate a deterministic ID from the event payload and use it as an upsert key.
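One way to derive that deterministic ID is hashing a canonical serialization of the event payload; a sketch, assuming SHA-256 and JSON-serializable events:

```python
import hashlib
import json

def idempotency_key(event: dict) -> str:
    # Canonical serialization (sorted keys, no whitespace) so the same
    # logical event always hashes to the same key, regardless of field order.
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = idempotency_key({"order_id": "ord-1", "amount": 99.5})
b = idempotency_key({"amount": 99.5, "order_id": "ord-1"})  # same event, reordered
c = idempotency_key({"order_id": "ord-2", "amount": 99.5})  # different event
```

Use the result as the upsert key: a redelivered message computes the same key and collides with the already-written row instead of creating a duplicate.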

4

Implement Dead Letter Queues Everywhere

Every queue or Kafka topic must have a DLQ. Configure alert thresholds: any message in the DLQ within 5 minutes triggers a PagerDuty alert. DLQs without alerts are debugging black holes.

5

Register All Schemas

Every event schema goes into a schema registry (Confluent, AWS Glue, Azure Schema Registry). Set compatibility rules: BACKWARD for consumer-initiated changes, FORWARD for producer-initiated changes. Block deployments that break schema compatibility.

✅ Governance Checkpoints

| Checkpoint | Owner | Gate Criteria | Status |
|---|---|---|---|
| Integration Topology Documented | Solution Architect | Synchronous/async classification for all integrations complete | Required |
| Schema Registry Populated | Integration Architect | All event schemas registered with compatibility rules set | Required |
| DLQ + Alerting Configured | Platform / DevOps | DLQ alert firing within 5 minutes of first failed message | Required |
| Idempotency Tests in CI | QA Lead | Duplicate-message delivery tests passing in CI pipeline | Required |
| Contract Tests for All Consumers | QA Lead | Pact or Spring Cloud Contract tests registered for all consumers | Required |

◈ Recommended Patterns

✦ Saga Pattern (Choreography)

Each service listens for events and reacts independently. No central coordinator. Order Service publishes OrderCreated → Payment Service listens and publishes PaymentProcessed → Inventory Service listens. Failure triggers compensating events. Highly decoupled but complex to trace.

✦ Saga Pattern (Orchestration)

A dedicated Saga Orchestrator (Step Functions, Temporal, Camunda) drives the workflow step by step, calling each service and handling failures with compensating calls. More visible and controllable than choreography.

✦ Outbox Pattern

Write the business record AND the event to your own database in a single ACID transaction. A CDC connector (Debezium, AWS DMS) reads the outbox table and publishes to the message bus. Guarantees at-least-once delivery without distributed transactions.

✦ Event-Carried State Transfer

Events carry the full data payload consumers need. Order events carry the full order: customer ID, name, items, address. Consumers never need to call back to the source. Eliminates synchronous lookup dependencies.

✦ Claim Check

Large payloads (PDFs, images, large JSON) are stored in object storage. The message carries a reference token. Consumers redeem the token to fetch the payload. Keeps the message bus fast and lean.

✦ Consumer-Driven Contract Testing

Consumers define the contract they need from a provider. Tools like Pact verify that the provider satisfies every registered consumer contract before deployment. Prevents integration regressions in CI/CD.

⛔ Anti-Patterns to Avoid

⛔ Synchronous Chain of Death

Service A → B → C → D all synchronous. Each call adds latency. Any failure propagates up the entire chain. The system's availability is the product of each service's availability: 99.9%^4 = 99.6%. Resolution: identify the longest synchronous chains and introduce async messaging at appropriate boundaries.
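The availability arithmetic, spelled out (chain availability is the product of each hop's availability):

```python
per_service = 0.999        # 99.9% availability per service
chain_4 = per_service ** 4   # A -> B -> C -> D: ~99.60%
chain_10 = per_service ** 10  # a deeper chain: ~99.00%
```

Each additional synchronous hop multiplies in another "nine-point-nine", so deep chains erode availability quickly even when every individual service meets its SLO.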

⛔ Integration via Shared Database

Multiple services reading and writing to the same database tables as the integration mechanism. Schema changes break all consumers simultaneously. Deployment of one service can corrupt another's data. This is the most dangerous integration anti-pattern and the hardest to remediate.

⛔ Phantom Events

Events that trigger notifications without meaningful payload — just an entity ID. Every consumer must then call back to the source service to get the data they need. Creates synchronous coupling inside async workflows. Use Event-Carried State Transfer instead.

🤖 AI Augmentation Extensions

🤖 AI Schema Evolution Impact Analysis

LLM agents analyze all registered consumer contracts when a new event schema is proposed. They generate a full impact report: which consumers are affected, what migration steps are required, and whether backward compatibility is maintained.

⚡ Always validate AI-generated impact analysis against actual consumer codebases. LLMs may miss dynamic consumers or unconventional schema usage patterns.
🤖 Intelligent DLQ Triage Agent

An AI agent monitors DLQ message patterns, classifies failures by root cause (schema mismatch, network timeout, business rule violation, poison pill), and generates remediation playbooks for each category.

⚡ Require human approval before AI agents replay DLQ messages in production. Automated replay of incorrect messages can propagate data corruption.
