Section:
infra/| Subsection:resilience/
Alignment: TOGAF ADM | NIST CSF | ISO 27001 | AWS Well-Architected | AI-Native Extensions
Overview
Multi-AZ deployment, auto-scaling policies, chaos engineering practices, and SRE toil reduction.
This document is part of the Infrastructure Architecture body of knowledge within the Ascendion Architecture Best-Practice Library. It provides comprehensive, practitioner-grade guidance aligned to industry standards and extended for AI-augmented, agentic, and LLM-driven design contexts.
Core Principles
1. Intentional Design for Infrastructure Resilience
Every aspect of infrastructure resilience must be deliberately designed, not discovered after deployment. Document design decisions as ADRs with explicit rationale.
2. Consistency Across the Portfolio
Apply infrastructure resilience practices consistently across all systems. Inconsistent application creates governance blind spots and makes incident investigation unpredictable.
3. Alignment to Business Outcomes
Infrastructure Resilience practices must demonstrably contribute to business outcomes: reduced downtime, faster delivery, lower operational cost, or improved compliance posture.
4. Evidence-Based Quality Assessment
Quality of infrastructure resilience implementation must be measurable. Define specific metrics and collect evidence continuously — not only at audit or review time.
5. Continuous Evolution
Standards for infrastructure resilience evolve as technology and threat landscapes change. Schedule quarterly reviews of applicable standards and update practices accordingly.
Implementation Guide
Step 1: Current State Assessment
Document the current state of infrastructure resilience practice: what is implemented, what is missing, what is inconsistent across teams. Use the governance/scorecards section for a structured assessment framework.
Step 2: Gap Analysis Against Standards
Compare current state against the standards in this section and applicable frameworks (CNCF Cloud Native Landscape, SRE Book — Google). Prioritize gaps by business impact and remediation effort.
Step 3: Design the Target State
Define the target infrastructure resilience state: which patterns will be adopted, which anti-patterns eliminated, which governance mechanisms introduced. Express as a time-bound roadmap.
Step 4: Incremental Implementation
Implement infrastructure resilience improvements incrementally: pilot with one team or system, measure outcomes, refine the approach, then expand. Avoid big-bang transformations.
Step 5: Validate and Iterate
Measure the impact of implemented changes against defined success criteria. Incorporate lessons learned into the practice standards. Contribute improvements back to this library.
Governance Checkpoints
| Checkpoint | Owner | Gate Criteria | Status |
|---|---|---|---|
| Current State Documented | Solution Architect | Infrastructure Resilience current state assessment completed and reviewed | Required |
| Gap Analysis Reviewed | Architecture Review Board | Gap analysis reviewed and prioritization approved | Required |
| Implementation Plan Approved | Enterprise Architect | Target state and roadmap approved by ARB | Required |
| Quality Metrics Defined | Solution Architect | Measurable success criteria defined for infrastructure resilience improvements | Required |
Recommended Patterns
Reference Architecture Adoption
Start from an established reference architecture for infrastructure resilience rather than designing from scratch. Adapt to organizational context rather than rebuilding proven foundations.
Pattern Library Contribution
When your team solves a recurring infrastructure resilience problem with a novel approach, document it as a pattern for the library. This compounds organizational knowledge over time.
Fitness Function Testing
Encode infrastructure resilience standards as automated architectural fitness functions — tests that run in CI/CD and fail builds when standards are violated. This makes governance continuous rather than periodic.
Anti-Patterns to Avoid
⚠️ Standards Theater
Documenting infrastructure resilience standards in architecture policies that no one reads and no one enforces. Standards without automated validation or governance gates are not operational standards.
⚠️ Copy-Paste Architecture
Adopting another organization's infrastructure resilience patterns wholesale without adapting to organizational context, team capability, or regulatory environment. Always adapt; never just copy.
AI Augmentation Extensions
AI-Assisted Standards Review
LLM agents analyze design documents against infrastructure resilience standards, generating structured gap reports with cited evidence and suggested remediation approaches.
Note: AI review accelerates governance but does not replace expert architectural judgment. Use as a first-pass filter before human review.
RAG Integration for Infrastructure Resilience
This section is optimized for vector ingestion into an AI-powered architecture assistant. Semantic search enables architects to retrieve relevant infrastructure resilience guidance through natural language queries.
Note: Reindex the vector store whenever section content is updated to ensure retrieved guidance reflects current standards.
Related Sections
principles/foundational | patterns/structural | governance/review-templates | adrs/platform
References
- CNCF Cloud Native Landscape — cncf.io
- SRE Book — Google — sre.google
- AWS Infrastructure Best Practices — docs.aws.amazon.com
- HashiCorp Terraform Best Practices — IEEE Xplore
- Documenting Software Architectures — Bass, Clements, Kazman — Amazon
- Building Evolutionary Architectures — Ford, Parsons, Kua — O'Reilly
Last updated: 2025 | Maintained by: Ascendion Solutions Architecture Practice
Section: infra/resilience/ | Aligned to TOGAF · NIST · ISO 27001 · AWS Well-Architected