High Availability & DR
Designing systems that survive failures, disasters, and the unexpected
High Availability means your service keeps running even when parts of it break. Disaster Recovery means you can restore service quickly after a catastrophic failure. Together, they answer: 'How do we make sure our system is always there when users need it?'
For Philippine banks under BSP Circular 982, HA/DR is a regulatory requirement: RTO ≤ 4 hours, RPO ≤ 2 hours for core systems. For e-commerce, every minute of downtime costs thousands in lost revenue. Netflix's architecture absorbs the failure of entire AWS availability zones without customer impact.
📖 Detailed Explanation
High Availability (HA) and Disaster Recovery (DR) are often conflated but solve different problems. HA is about *eliminating* single points of failure within normal operations — redundant components, automatic failover, load-balanced clusters. DR is about *recovering from* catastrophic failures — data center loss, region-wide outages, ransomware attacks. Both require explicit design; neither emerges from good intentions.
The Four Nines Problem: A system with 99% uptime is down ~88 hours per year. 99.9% is ~8.7 hours. 99.99% is ~52 minutes. 99.999% ("five nines") is ~5 minutes. Each additional nine requires fundamentally different architecture — not just "more servers." Moving from 99.9% to 99.99% requires eliminating all single points of failure, automated health checks with sub-second failover, and stateless application tiers that hold no session state (no sticky sessions).
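The downtime arithmetic behind each nine is easy to verify; a minimal sketch (assuming a 365-day year):

```python
# Annual downtime implied by an availability target ("the nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, target in [("99%", 0.99), ("99.9%", 0.999),
                      ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    hours, minutes = divmod(annual_downtime_minutes(target), 60)
    print(f"{label:>8} uptime -> ~{int(hours)}h {minutes:.1f}m downtime/year")
```

Each added nine cuts the downtime budget by 10×, which is why the architecture — not just the server count — has to change at each step.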
RTO and RPO are the two metrics that drive DR architecture decisions. Recovery Time Objective (RTO): how long can the business tolerate downtime? Recovery Point Objective (RPO): how much data loss is acceptable? RTO = 4 hours and RPO = 2 hours (BSP requirement for core banking) implies different architecture than RTO = 1 minute and RPO = 0 (financial trading systems). The more aggressive the targets, the more expensive and complex the architecture. A common mistake is setting aggressive targets without funding the architecture to achieve them.
Multi-AZ vs. Multi-Region: Multi-AZ (deploying across multiple Availability Zones within a region) protects against data center failures — a power outage, cooling failure, or network partition within one AZ. This achieves 99.99%+ availability for most systems. Multi-region protects against AWS region failures (extremely rare) and satisfies geographic compliance requirements (Philippines data residency). Multi-region adds significant complexity: cross-region replication consistency, DNS failover, and more elaborate runbooks. Only pursue multi-region when business requirements genuinely demand it.
Active-Active vs. Active-Passive: Active-Active means both regions (or AZs) serve live traffic. Failover is automatic and instant — load balancers stop routing to the failed region. Active-Passive means one region serves traffic while the other is on standby, ready to receive traffic after a failover operation. Active-Active achieves near-zero RTO but requires solving write conflict resolution for shared data. Active-Passive achieves RTO in minutes to hours depending on automation level.
Data Replication Strategy is where most HA/DR designs get complicated. For RDS/Aurora in AWS: use Multi-AZ with synchronous replication for the primary database (zero data loss, ~60-second automated failover). Use Aurora Global Database for cross-region replication with ~1-second RPO. For Kafka: use MirrorMaker 2 for cross-region topic replication. For Redis: use Redis replication with Sentinel for failover, or Redis Cluster. The key principle: replication must be synchronous to achieve zero RPO; asynchronous replication always implies potential data loss.
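The sync-vs-async trade-off can be made concrete: under asynchronous replication, every write inside the lag window is at risk on failover. The lag and write-rate figures below are illustrative assumptions, not measured values:

```python
# Worst-case data loss on failover = replication lag x write rate.
# Synchronous replication forces the lag to zero, hence zero RPO.

def transactions_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Writes that would be lost if the primary died right now."""
    return replication_lag_s * writes_per_s

# Synchronous Multi-AZ: zero lag by definition.
print(transactions_at_risk(0.0, 500))  # 0.0

# Aurora Global Database (async, ~1 s typical lag) at an assumed 500 writes/s:
print(transactions_at_risk(1.0, 500))  # 500.0
```

Stating RPO as "transactions at risk" rather than seconds is often what finally convinces stakeholders to fund synchronous replication for Tier 1 data.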
Chaos Engineering is the practice of deliberately injecting failures to validate that HA/DR designs actually work under production conditions. Netflix's Simian Army, AWS Fault Injection Simulator, and LitmusChaos (for Kubernetes) allow teams to terminate instances, inject network latency, and simulate AZ failures in staging environments. The finding in almost every chaos experiment: "It doesn't fail the way we thought it would." Run chaos experiments before a real disaster does it for you.
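A chaos experiment in miniature: terminate a random instance and assert that the surviving fleet still covers peak load. The fleet names, capacities, and peak-load figure are illustrative assumptions, not a real Chaos Monkey or AWS FIS configuration:

```python
import random

# A three-instance fleet spread across AZs; each instance serves 1,000 req/s.
fleet = {"app-az1-a": 1000, "app-az1-b": 1000, "app-az2-a": 1000}
PEAK_LOAD = 1800  # req/s the service must sustain during the experiment

def chaos_terminate(fleet: dict) -> str:
    """Kill one random instance, as Chaos Monkey would."""
    victim = random.choice(list(fleet))
    del fleet[victim]
    return victim

victim = chaos_terminate(fleet)
surviving = sum(fleet.values())
print(f"Terminated {victim}; remaining capacity {surviving} req/s")
# The experiment's pass/fail criterion:
assert surviving >= PEAK_LOAD, "FAIL: fleet cannot absorb one instance loss"
```

Re-running the same assertion while terminating two instances (leaving 1,000 req/s against an 1,800 req/s peak) is exactly the kind of "it doesn't fail the way we thought" finding these experiments exist to surface.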
📈 Architecture Diagram
```mermaid
graph TB
    subgraph REGION_A["🌏 AWS ap-southeast-1 (Primary)"]
        ALB_A[Application Load Balancer]
        subgraph AZ_1["Availability Zone 1"]
            APP_A1[App Server 1]
            DB_A1[(RDS Primary)]
        end
        subgraph AZ_2["Availability Zone 2"]
            APP_A2[App Server 2]
            DB_A2[("RDS Standby<br/>Synchronous Replica")]
        end
        ALB_A --> APP_A1
        ALB_A --> APP_A2
        DB_A1 -.->|Sync Replication| DB_A2
    end
    subgraph REGION_B["🌏 AWS ap-east-1 (DR)"]
        ALB_B[Application Load Balancer]
        APP_B[App Servers]
        DB_B[("Aurora Global<br/>Async Replica<br/>RPO ~1s")]
        ALB_B --> APP_B
        APP_B --> DB_B
    end
    R53["Route 53<br/>Health Checks + Failover DNS"]
    R53 --> ALB_A
    R53 -.->|Failover if Primary unhealthy| ALB_B
    DB_A1 -.->|Global Replication| DB_B
    style REGION_A fill:#0f172a,color:#f8fafc
    style REGION_B fill:#1e1b4b,color:#f8fafc
    style R53 fill:#1e3a5f,color:#f8fafc
```
Multi-AZ + Multi-Region HA/DR architecture on AWS: synchronous Multi-AZ replication within the primary region for zero data loss, Aurora Global Database for cross-region RPO of ~1 second, and Route 53 health-check-based DNS failover.
🌎 Real-World Examples
Netflix runs active-active across 3 AWS regions. Route 53 health checks detect region failure in 10 seconds and shift traffic. 'Chaos Kong' exercises (simulating full region loss) run monthly to validate failover under real production load — not just in staging. Netflix pioneered the practice of treating DR testing as continuous engineering, not annual drills.
✓ Result: 99.97%+ availability during multiple AWS regional outages; failover validated monthly under production traffic
AWS Multi-AZ deployment with Aurora Global Database is the industry reference for RTO/RPO in cloud-native systems. Aurora Global provides < 1 second RPO and < 1 minute RTO for cross-region failover. AWS publishes their architecture in the Well-Architected Framework Reliability Pillar, used as the design reference by 10,000+ enterprise architects.
✓ Result: Aurora Global Database: < 1s RPO, < 1 min RTO for 15 global regions; reference architecture cited in 10,000+ Well-Architected Reviews
Cloudflare's anycast network routes every request to the nearest of 285+ data centers globally. Single datacenter failure: traffic reroutes in < 1 second with no DNS TTL delay. Their architecture means no single datacenter is ever critical — all are equally disposable. This is the architectural ideal of Design for Failure applied at internet infrastructure scale.
✓ Result: 13 consecutive years of 99.99%+ global availability; zero customer impact from any single datacenter failure
Monzo's core banking runs on Kubernetes (EKS) across multiple AWS AZs. Rolling deployments with readiness probes ensure zero-downtime updates. Their on-call model: every engineer owns their service's availability — creating direct incentive to build resilient systems. FCA examination rated their availability higher than 3 major traditional UK banks.
✓ Result: 99.99% banking availability on microservices; FCA 2022: availability rated higher than traditional core banking peers
🌟 Core Principles
Every component has a backup. If removing one instance causes an outage, it is a single point of failure (SPOF). Load balancers, databases, message brokers, API gateways — all must run in at least two AZs with automatic failover.
Define RTO and RPO before designing the system, not after. They are business requirements that directly determine database replication mode, failover automation, and infrastructure cost. Document them in the NFR catalog.
An untested failover path is not a failover path — it is a theoretical one. Automate failover testing in staging (monthly) and run annual production failover drills. Chaos Engineering is not optional for high-availability systems.
Every component must expose liveness and readiness health endpoints. Orchestrators (Kubernetes, ECS, ALB) use these to automatically remove unhealthy instances from rotation without human intervention.
Manual recovery procedures with more than five steps will fail under pressure. Automate recovery steps as runbooks (Ansible, SSM Documents, Terraform) that operators can execute with a single command.
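The health-endpoint principle above can be sketched as the threshold state machine a load balancer runs per target; the 2-failure/2-success thresholds below are assumptions that mirror common ALB defaults:

```python
# ALB-style health tracking: an instance leaves rotation only after N
# consecutive failed probes, and rejoins only after M consecutive passes,
# so a single flaky probe never flaps the fleet.

class HealthTracker:
    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes contradicting the current state

    def record_probe(self, passed: bool) -> bool:
        """Feed one probe result; return whether the instance is in rotation."""
        if passed == self.healthy:
            self._streak = 0  # current state confirmed; reset the streak
        else:
            self._streak += 1
            threshold = (self.healthy_threshold if not self.healthy
                         else self.unhealthy_threshold)
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy

t = HealthTracker()
print(t.record_probe(False))  # True  - one failure is not enough
print(t.record_probe(False))  # False - second consecutive failure: removed
print(t.record_probe(True))   # False - needs two passes to rejoin
print(t.record_probe(True))   # True  - back in rotation
```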
⚙️ Implementation Steps
Define RTO/RPO by System Tier
Classify systems: Tier 1 (core banking, payments) = RTO 1hr/RPO 15min. Tier 2 (digital channels) = RTO 4hr/RPO 1hr. Tier 3 (analytics, reporting) = RTO 24hr/RPO 4hr. Different tiers justify different architecture and cost.
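The tiering above can be captured as a lookup table so RTO/RPO targets live as data in the NFR catalog rather than tribal knowledge; tier names and figures are taken directly from the step above:

```python
# RTO/RPO targets per system tier, in hours.
TIERS = {
    "tier1": {"systems": "core banking, payments", "rto_h": 1.0,  "rpo_h": 0.25},
    "tier2": {"systems": "digital channels",       "rto_h": 4.0,  "rpo_h": 1.0},
    "tier3": {"systems": "analytics, reporting",   "rto_h": 24.0, "rpo_h": 4.0},
}

def cheapest_adequate_tier(max_outage_h: float, max_data_loss_h: float) -> str:
    """Pick the least expensive tier whose targets still satisfy the business."""
    for name in ("tier3", "tier2", "tier1"):  # cheapest architecture first
        t = TIERS[name]
        if t["rto_h"] <= max_outage_h and t["rpo_h"] <= max_data_loss_h:
            return name
    raise ValueError("No standard tier fits; custom architecture required")

print(cheapest_adequate_tier(24, 4))   # tier3
print(cheapest_adequate_tier(2, 0.5))  # tier1
```

Searching cheapest-first makes the cost discipline explicit: a system only earns Tier 1 architecture when its business limits rule out everything cheaper.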
Identify and Eliminate SPOFs
Draw the architecture diagram. Circle every component where failure would cause an outage. That is your SPOF inventory. Prioritize elimination by impact: start with the components in every critical path.
Choose the Replication Mode per Component
Synchronous replication = zero data loss but adds write latency (5–10ms for same-region Multi-AZ). Asynchronous = potential data loss measured by replication lag but no write latency impact. Match to your RPO: zero RPO requires synchronous replication.
Implement Automated Health Check Failover
Configure Route 53 health checks for DNS failover. Configure ALB target group health checks for application failover. Configure RDS Multi-AZ for database failover. All should be automatic — human-initiated failover adds 15–30 minutes to your RTO.
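The Route 53 half of that automation reduces to a simple decision rule; the 3-consecutive-failure threshold below is an illustrative assumption (the real health-check failure threshold is configurable):

```python
# Route 53 failover routing in miniature: DNS answers switch to the DR
# endpoint only after several consecutive failed health checks on primary.

PRIMARY = "alb-primary.ap-southeast-1"
SECONDARY = "alb-dr.ap-east-1"
FAILURE_THRESHOLD = 3

def resolve(recent_primary_checks: list) -> str:
    """Return the endpoint DNS should hand out, given recent health probes."""
    recent = recent_primary_checks[-FAILURE_THRESHOLD:]
    primary_down = len(recent) == FAILURE_THRESHOLD and not any(recent)
    return SECONDARY if primary_down else PRIMARY

print(resolve([True, True, False]))    # primary endpoint - one blip ignored
print(resolve([False, False, False]))  # DR endpoint - failover triggered
```

Everything here runs without a human in the loop, which is the point: any step that waits for a person adds its paging-and-decision time directly to your RTO.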
Run Monthly Chaos Experiments
Use AWS Fault Injection Simulator, LitmusChaos, or Netflix Chaos Monkey to terminate instances, inject network latency, and simulate AZ failures in staging. Document findings. Fix gaps. Repeat.
✅ Governance Checkpoints
| Checkpoint | Owner | Gate Criteria | Status |
|---|---|---|---|
| RTO/RPO Defined and Documented | Solution Architect | RTO and RPO defined per system tier in NFR catalog | Required |
| SPOF Inventory Complete | Solution Architect | All SPOFs identified and remediation plan in place | Required |
| Multi-AZ Deployment Verified | Platform Engineer | All Tier 1 components deployed across minimum 2 AZs | Required |
| Failover Tested and Documented | SRE / DR Lead | Automated failover tested; actual RTO/RPO measured and within targets | Required |
| DR Drill Results Submitted to BSP | Technology Risk Officer | Annual DR drill results documented and submitted (BSP requirement) | Required |
◈ Recommended Patterns
✦ Active-Active Multi-AZ
Application servers in all AZs serve live traffic behind a load balancer. Database uses synchronous Multi-AZ replication with automatic failover. Zero RTO for AZ failures. The standard pattern for 99.99% availability.
✦ Active-Passive Multi-Region
Primary region serves all traffic. DR region runs a warm standby with replicated data. Route 53 failover routing activates the DR region when primary health checks fail. Achieves RTO of 5–15 minutes with full automation.
✦ Pilot Light
A minimal version of the DR environment runs continuously (databases replicated, AMIs current). During failover, scale up the DR environment to full production size. Lower cost than warm standby, higher RTO (30–60 minutes).
✦ Backup and Restore
The simplest DR approach: automated backups to S3, with restore procedures documented and tested. Appropriate for Tier 3 systems. RTO measured in hours; RPO equal to backup frequency.
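The "RPO equal to backup frequency" relationship is worth making explicit, since it is what sizes the backup schedule for a Tier 3 system:

```python
# Backup-and-restore DR: a failure just before the next backup loses
# everything written since the last one, so worst-case RPO = backup interval.

def worst_case_rpo_hours(backups_per_day: int) -> float:
    return 24 / backups_per_day

print(worst_case_rpo_hours(6))  # 4.0  -> meets a 4-hour Tier 3 RPO
print(worst_case_rpo_hours(1))  # 24.0 -> nightly-only backups miss it
```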
⛔ Anti-Patterns to Avoid
⛔ Untested DR Plans
A DR plan that has never been executed is a fiction. The most common finding in post-incident reviews: 'We had a DR plan but when we tried to execute it, we discovered it was outdated and incomplete.' Test quarterly.
⛔ Symmetric RTO/RPO for All Systems
Setting the same aggressive RTO/RPO for every system drives unnecessary cost for non-critical systems and dilutes the budget available for the genuinely critical ones. Tier your systems by business impact.
🤖 AI Augmentation Extensions
LLM agents analyze system architecture diagrams and generate a targeted chaos experiment plan — identifying which failure injection scenarios would most effectively validate HA/DR designs based on the specific topology.
After each DR drill, AI agents parse structured test results, compare actual RTO/RPO against targets, generate the drill report in BSP-required format, and flag gaps needing remediation.
🔗 Related Sections
📚 References & Further Reading
- Site Reliability Engineering — Google↗ sre.google
- Designing Distributed Systems — Brendan Burns↗ O'Reilly
- AWS Well-Architected Framework — Amazon Web Services↗ aws.amazon.com
- BSP Circular 982 — Technology Risk Management↗ bsp.gov.ph
- Building Evolutionary Architectures — Ford, Parsons, Kua↗ O'Reilly