
High Availability & DR

Designing systems that survive failures, disasters, and the unexpected

Frameworks: TOGAF ADM · NIST CSF · ISO 27001 · AWS Well-Architected · Google SRE · AI-Native
💡 In Plain English

High Availability means your service keeps running even when parts of it break. Disaster Recovery means you can restore service quickly after a catastrophic failure. Together, they answer: 'How do we make sure our system is always there when users need it?'

📈 Business Value

For Philippine banks under BSP Circular 982, HA/DR is a regulatory requirement: RTO ≤ 4 hours, RPO ≤ 2 hours for core systems. For e-commerce, every minute of downtime costs thousands in lost revenue. Netflix's architecture absorbs the failure of entire AWS availability zones without customer impact.

📖 Detailed Explanation

High Availability (HA) and Disaster Recovery (DR) are often conflated but solve different problems. HA is about *eliminating* single points of failure within normal operations — redundant components, automatic failover, load-balanced clusters. DR is about *recovering from* catastrophic failures — data center loss, region-wide outages, ransomware attacks. Both require explicit design; neither emerges from good intentions.

The Four Nines Problem: A system with 99% uptime is down ~88 hours per year. 99.9% is ~8.7 hours. 99.99% is ~52 minutes. 99.999% ("five nines") is ~5 minutes. Each additional nine requires fundamentally different architecture — not just "more servers." Moving from 99.9% to 99.99% requires eliminating all single points of failure, automated health checks with sub-second failover, and stateless application tiers with sticky-session elimination.
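The downtime figures above fall directly out of the arithmetic. A minimal sketch (the helper name is illustrative, using a 365-day year):

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    mins = downtime_minutes_per_year(nines)
    print(f"{nines}% -> {mins / 60:.1f} h/yr ({mins:.1f} min)")
```

Running this reproduces the figures in the text: ~88 hours at 99%, ~8.7 hours at 99.9%, ~52 minutes at 99.99%, ~5 minutes at five nines.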

RTO and RPO are the two metrics that drive DR architecture decisions. Recovery Time Objective (RTO): how long can the business tolerate downtime? Recovery Point Objective (RPO): how much data loss is acceptable? RTO = 4 hours and RPO = 2 hours (BSP requirement for core banking) implies different architecture than RTO = 1 minute and RPO = 0 (financial trading systems). The more aggressive the targets, the more expensive and complex the architecture. A common mistake is setting aggressive targets without funding the architecture to achieve them.
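Because RTO and RPO are measurable, a DR drill can be scored against them mechanically. A minimal sketch, assuming the BSP core-banking targets from the text (the class and function names are illustrative, not a real framework):

```python
from dataclasses import dataclass

@dataclass
class RecoveryTargets:
    rto_minutes: int  # max tolerable downtime
    rpo_minutes: int  # max tolerable data-loss window

# BSP-style core banking targets: RTO <= 4 hours, RPO <= 2 hours.
CORE_BANKING = RecoveryTargets(rto_minutes=240, rpo_minutes=120)

def within_targets(targets, measured_rto_min, measured_rpo_min):
    """True only if a drill met BOTH objectives; one miss fails the drill."""
    return (measured_rto_min <= targets.rto_minutes
            and measured_rpo_min <= targets.rpo_minutes)

print(within_targets(CORE_BANKING, 180, 90))   # True: drill passed
print(within_targets(CORE_BANKING, 300, 90))   # False: RTO missed
```

The point of making targets machine-checkable is that a drill result is a pass/fail fact, not a judgment call made after the outage.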

Multi-AZ vs. Multi-Region: Multi-AZ (deploying across multiple Availability Zones within a region) protects against data center failures — a power outage, cooling failure, or network partition within one AZ. This achieves 99.99%+ availability for most systems. Multi-region protects against AWS region failures (extremely rare) and geographic compliance requirements (Philippines data residency). Multi-region adds significant complexity in data replication consistency, DNS failover, and runbook complexity. Only pursue multi-region when business requirements genuinely demand it.

Active-Active vs. Active-Passive: Active-Active means both regions (or AZs) serve live traffic. Failover is automatic and instant — load balancers stop routing to the failed region. Active-Passive means one region serves traffic while the other is on standby, ready to receive traffic after a failover operation. Active-Active achieves near-zero RTO but requires solving write conflict resolution for shared data. Active-Passive achieves RTO in minutes to hours depending on automation level.

Data Replication Strategy is where most HA/DR designs get complicated. For RDS/Aurora in AWS: use Multi-AZ with synchronous replication for the primary database (zero data loss, ~60-second automated failover). Use Aurora Global Database for cross-region replication with ~1-second RPO. For Kafka: use MirrorMaker 2 for cross-region topic replication. For Redis: use Redis replication with sentinel or Redis Cluster. The key principle: replication must be synchronous for zero RPO; asynchronous replication always implies potential data loss.
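The sync/async trade-off can be made concrete with a toy model: an asynchronous replica trails the primary by its replication lag, so any writes past the last shipped position are lost when the replica is promoted. The function and log names below are illustrative:

```python
def failover_data_loss(primary_log, replicated_upto):
    """Writes lost if the primary dies now and the replica is promoted."""
    return primary_log[replicated_upto:]

primary_log = ["w1", "w2", "w3", "w4", "w5"]

# Synchronous: each write is acknowledged only after the replica has it.
print(failover_data_loss(primary_log, replicated_upto=5))  # [] -> RPO 0

# Asynchronous: replica lags by two writes at the moment of failure.
print(failover_data_loss(primary_log, replicated_upto=3))  # ['w4', 'w5']
```

The lost slice is exactly the replication lag expressed in writes, which is why asynchronous replication can approach but never guarantee zero RPO.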

Chaos Engineering is the practice of deliberately injecting failures to validate that HA/DR designs actually work under production conditions. Netflix's Simian Army, AWS Fault Injection Simulator, and LitmusChaos (for Kubernetes) allow teams to terminate instances, inject network latency, and simulate AZ failures in staging environments. The finding in almost every chaos experiment: "It doesn't fail the way we thought it would." Run chaos experiments before a real disaster does it for you.
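The shape of a chaos experiment is: state a hypothesis, inject a fault, verify the hypothesis held. A minimal in-memory sketch of that loop, standing in for a real tool like AWS Fault Injection Simulator (class and instance names are illustrative):

```python
import random

class Pool:
    """Toy model of a load-balanced instance pool."""
    def __init__(self, instances):
        self.healthy = set(instances)

    def terminate(self, instance):
        """The fault injection: kill one instance."""
        self.healthy.discard(instance)

    def serve(self):
        if not self.healthy:
            raise RuntimeError("outage: no healthy instances")
        return f"served by {sorted(self.healthy)[0]}"

random.seed(7)
pool = Pool(["app-1", "app-2", "app-3"])
victim = random.choice(sorted(pool.healthy))
pool.terminate(victim)

# Hypothesis: the service survives the loss of any single instance.
assert pool.serve()
print(f"killed {victim}; service still answers")
```

In a real experiment the "pool" is staging infrastructure and the assertion is a synthetic probe, but the discipline is the same: the experiment fails loudly if the hypothesis is wrong.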

📈 Architecture Diagram

graph TB
    subgraph REGION_A["🌏 AWS ap-southeast-1 (Primary)"]
        ALB_A[Application Load Balancer]
        subgraph AZ_1["Availability Zone 1"]
            APP_A1[App Server 1]
            DB_A1[(RDS Primary)]
        end
        subgraph AZ_2["Availability Zone 2"]
            APP_A2[App Server 2]
            DB_A2[("RDS Standby<br/>Synchronous Replica")]
        end
        ALB_A --> APP_A1
        ALB_A --> APP_A2
        DB_A1 -.->|Sync Replication| DB_A2
    end

    subgraph REGION_B["🌏 AWS ap-east-1 (DR)"]
        ALB_B[Application Load Balancer]
        APP_B[App Servers]
        DB_B[("Aurora Global<br/>Async Replica<br/>RPO ~1s")]
        ALB_B --> APP_B
        APP_B --> DB_B
    end

    R53["Route 53<br/>Health Checks + Failover DNS"]
    R53 --> ALB_A
    R53 -.->|Failover if Primary unhealthy| ALB_B
    DB_A1 -.->|Global Replication| DB_B

    style REGION_A fill:#0f172a,color:#f8fafc
    style REGION_B fill:#1e1b4b,color:#f8fafc
    style R53 fill:#1e3a5f,color:#f8fafc

Multi-AZ + Multi-Region HA/DR architecture on AWS: synchronous Multi-AZ replication within the primary region for zero data loss, Aurora Global Database for cross-region RPO of ~1 second, and Route 53 health-check-based DNS failover.

🌎 Real-World Examples

Netflix — Multi-Region Active-Active
Los Gatos, USA · Video Streaming · 260M subscribers

Netflix runs active-active across 3 AWS regions. Route 53 health checks detect region failure in 10 seconds and shift traffic. 'Chaos Kong' exercises (simulating full region loss) run monthly to validate failover under real production load — not just in staging. Netflix pioneered the practice of treating DR testing as continuous engineering, not annual drills.

✓ Result: 99.97%+ availability during multiple AWS regional outages; failover validated monthly under production traffic

AWS — Multi-AZ Reference Architecture
Seattle, USA · Cloud Infrastructure · Global Standard

AWS Multi-AZ deployment with Aurora Global Database is the industry reference for RTO/RPO in cloud-native systems. Aurora Global provides < 1 second RPO and < 1 minute RTO for cross-region failover. AWS publishes their architecture in the Well-Architected Framework Reliability Pillar, used as the design reference by 10,000+ enterprise architects.

✓ Result: Aurora Global Database: < 1s RPO, < 1 min RTO for 15 global regions; reference architecture cited in 10,000+ Well-Architected Reviews

Cloudflare — Global Anycast Resilience
San Francisco, USA · Internet Infrastructure · 285+ PoPs

Cloudflare's anycast network routes every request to the nearest of 285+ data centers globally. Single datacenter failure: traffic reroutes in < 1 second with no DNS TTL delay. Their architecture means no single datacenter is ever critical — all are equally disposable. This is the architectural ideal of Design for Failure applied at internet infrastructure scale.

✓ Result: 13 consecutive years of 99.99%+ global availability; zero customer impact from any single datacenter failure

Monzo — Banking HA on Kubernetes
London, UK · Neobank · 7M customers

Monzo's core banking runs on Kubernetes (EKS) across multiple AWS AZs. Rolling deployments with readiness probes ensure zero-downtime updates. Their on-call model: every engineer owns their service's availability — creating direct incentive to build resilient systems. FCA examination rated their availability higher than 3 major traditional UK banks.

✓ Result: 99.99% banking availability on microservices; FCA 2022: availability rated higher than traditional core banking peers

🌟 Core Principles

1. Eliminate Single Points of Failure

Every component has a backup. If removing one instance causes an outage, it is a SPOF. Load balancers, databases, message brokers, API gateways — all must run in at least two AZs with automatic failover.

2. RTO and RPO are Architectural Inputs

Define RTO and RPO before designing the system, not after. They are business requirements that directly determine database replication mode, failover automation, and infrastructure cost. Document them in the NFR catalog.

3. Test Failover Paths Continuously

An untested failover path is not a failover path — it is a theoretical one. Automate failover testing in staging (monthly) and run annual production failover drills. Chaos Engineering is not optional for high-availability systems.

4. Health Checks Drive Automation

Every component must expose liveness and readiness health endpoints. Orchestrators (Kubernetes, ECS, ALB) use these to automatically remove unhealthy instances from rotation without human intervention.
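What "automatically remove unhealthy instances" means mechanically: route only to instances whose readiness check currently passes. A minimal sketch, with the check function standing in for an HTTP probe of a `/readyz`-style endpoint (names are illustrative):

```python
def in_rotation(instances, ready_check):
    """Instances a load balancer should route to right now."""
    return [i for i in instances if ready_check(i)]

# Simulated readiness state (in reality: HTTP 200 from a readiness endpoint).
state = {"app-1": True, "app-2": False, "app-3": True}

print(in_rotation(state, ready_check=lambda i: state[i]))  # ['app-1', 'app-3']
```

The unhealthy instance drops out of rotation on the next poll with no human intervention, which is the entire point of the principle.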

5. Recovery Runbooks Must Be Scripted

Manual recovery procedures with more than five steps will fail under pressure. Automate recovery steps as runbooks (Ansible, SSM Documents, Terraform) that operators can execute with a single command.
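A scripted runbook is, at its core, an ordered list of steps executed by one command, stopping at the first failure so operators always know exactly where recovery stands. A minimal sketch (the step names are illustrative placeholders for real Ansible plays or SSM documents):

```python
def run_runbook(steps):
    """Execute steps in order; return (completed_steps, failed_step_or_None)."""
    done = []
    for name, action in steps:
        if not action():  # a step reports success (True) or failure (False)
            return done, name
        done.append(name)
    return done, None

steps = [
    ("promote-dr-database", lambda: True),
    ("scale-up-app-tier",   lambda: True),
    ("flip-dns-to-dr",      lambda: True),
]

completed, failed = run_runbook(steps)
print(completed, failed)
```

Stopping at the first failed step matters: a runbook that barrels on past a failure leaves the system in a state no one can reason about under pressure.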

⚙️ Implementation Steps

1. Define RTO/RPO by System Tier

Classify systems: Tier 1 (core banking, payments) = RTO 1hr/RPO 15min. Tier 2 (digital channels) = RTO 4hr/RPO 1hr. Tier 3 (analytics, reporting) = RTO 24hr/RPO 4hr. Different tiers justify different architecture and cost.

2. Identify and Eliminate SPOFs

Draw the architecture diagram. Circle every component where failure would cause an outage. That is your SPOF inventory. Prioritize elimination by impact: start with the components in every critical path.
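The "circle every component" exercise can be done mechanically: a node is a SPOF if removing it disconnects the entry point from the datastore. A sketch over a toy topology (the graph and node names are illustrative):

```python
def reachable(graph, start, goal, removed):
    """Is goal reachable from start with one node removed?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen or node == removed:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

def spofs(graph, entry, datastore):
    """Nodes whose individual removal severs the critical path."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(n for n in nodes - {entry, datastore}
                  if not reachable(graph, entry, datastore, removed=n))

# A single load balancer fronting two app servers and one database.
topology = {
    "client": ["lb"],
    "lb":     ["app-1", "app-2"],
    "app-1":  ["db"],
    "app-2":  ["db"],
}
print(spofs(topology, entry="client", datastore="db"))  # -> ['lb']
```

The app servers are redundant so neither is a SPOF, but the lone load balancer is, which is exactly what the diagram exercise should surface.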

3. Choose the Replication Mode per Component

Synchronous replication = zero data loss but adds write latency (5–10ms for same-region Multi-AZ). Asynchronous = potential data loss measured by replication lag but no write latency impact. Match to your RPO: zero RPO requires synchronous replication.

4. Implement Automated Health Check Failover

Configure Route 53 health checks for DNS failover. Configure ALB target group health checks for application failover. Configure RDS Multi-AZ for database failover. All should be automatic — human-initiated failover adds 15–30 minutes to your RTO.
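The DNS-level failover policy reduces to one decision: answer with the primary while its health check passes, otherwise with the secondary. A simplified stand-in for Route 53 failover routing (the resolver shape and record names are illustrative):

```python
def resolve(primary, secondary, primary_healthy):
    """Failover routing policy: primary if healthy, else secondary."""
    return primary if primary_healthy() else secondary

PRIMARY, DR = "alb-ap-southeast-1", "alb-ap-east-1"

print(resolve(PRIMARY, DR, primary_healthy=lambda: True))   # normal operations
print(resolve(PRIMARY, DR, primary_healthy=lambda: False))  # after primary failure
```

In the real service the health check is an external probe with configurable thresholds, and the time to detect failure plus the DNS TTL bounds how fast clients actually move, so keep TTLs short on failover records.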

5. Run Monthly Chaos Experiments

Use AWS Fault Injection Simulator, LitmusChaos, or Netflix Chaos Monkey to terminate instances, inject network latency, and simulate AZ failures in staging. Document findings. Fix gaps. Repeat.

✅ Governance Checkpoints

| Checkpoint | Owner | Gate Criteria | Status |
| --- | --- | --- | --- |
| RTO/RPO Defined and Documented | Solution Architect | RTO and RPO defined per system tier in NFR catalog | Required |
| SPOF Inventory Complete | Solution Architect | All SPOFs identified and remediation plan in place | Required |
| Multi-AZ Deployment Verified | Platform Engineer | All Tier 1 components deployed across minimum 2 AZs | Required |
| Failover Tested and Documented | SRE / DR Lead | Automated failover tested; actual RTO/RPO measured and within targets | Required |
| DR Drill Results Submitted to BSP | Technology Risk Officer | Annual DR drill results documented and submitted (BSP requirement) | Required |

◈ Recommended Patterns

✦ Active-Active Multi-AZ

Application servers in all AZs serve live traffic behind a load balancer. Database uses synchronous Multi-AZ replication with automatic failover. Zero RTO for AZ failures. The standard pattern for 99.99% availability.

✦ Active-Passive Multi-Region

Primary region serves all traffic. DR region runs a warm standby with replicated data. Route 53 failover routing activates the DR region when primary health checks fail. Achieves RTO of 5–15 minutes with full automation.

✦ Pilot Light

A minimal version of the DR environment runs continuously (databases replicated, AMIs current). During failover, scale up the DR environment to full production size. Lower cost than warm standby, higher RTO (30–60 minutes).

✦ Backup and Restore

The simplest DR approach: automated backups to S3, with restore procedures documented and tested. Appropriate for Tier 3 systems. RTO measured in hours; RPO equal to backup frequency.

⛔ Anti-Patterns to Avoid

⛔ Untested DR Plans

A DR plan that has never been executed is a fiction. The most common finding in post-incident reviews: 'We had a DR plan but when we tried to execute it, we discovered it was outdated and incomplete.' Test quarterly.

⛔ Symmetric RTO/RPO for All Systems

Setting the same aggressive RTO/RPO for every system drives unnecessary cost for non-critical systems and dilutes the budget available for the genuinely critical ones. Tier your systems by business impact.

🤖 AI Augmentation Extensions

🤖 AI-Assisted Chaos Experiment Design

LLM agents analyze system architecture diagrams and generate a targeted chaos experiment plan — identifying which failure injection scenarios would most effectively validate HA/DR designs based on the specific topology.

⚡ AI-generated chaos plans are starting points. Have your SRE team review and adjust blast radius before executing any experiment in staging.
🤖 Automated DR Drill Documentation

After each DR drill, AI agents parse structured test results, compare actual RTO/RPO against targets, generate the drill report in BSP-required format, and flag gaps needing remediation.

⚡ DR drill reports submitted to BSP must be reviewed and signed by the Technology Risk Officer before submission.
