1. Overview
2. Architecture Overview
3. Implementation Guide
4. Decision Criteria — Pillar Trade-offs
5. Cost Model
6. Anti-Patterns to Avoid
7. References
Overview
Microsoft distilled lessons from thousands of customer architectures into five pillars: Reliability, Security, Cost Optimisation, Operational Excellence, and Performance Efficiency. Each pillar contains design principles, recommendation guides, and trade-off analysis. The framework does not prescribe a specific architecture — it provides lenses through which to evaluate the trade-offs in any Azure workload.
Reliability — Design for zone failure, not just instance failure. Zone-redundant services (AKS, Service Bus Premium, Event Hubs Premium, Azure SQL, Cosmos DB, Azure Cache for Redis) must be enabled at provisioning — they cannot be retrofitted cheaply. Use paired regions for disaster recovery, configure health probes and auto-failover groups, and validate recoverability with Azure Chaos Studio. Azure Service Health alerts notify on impairments before customers report them.
Security — Aligns directly to the Microsoft Cloud Security Benchmark (MCSB). Apply Zero Trust identity: Managed Identity for service-to-service auth, PIM for privileged human access, no standing admin roles. Enforce network segmentation via private endpoints, Azure Firewall, and NSG rules. Store all secrets in Azure Key Vault, authorised through Azure RBAC rather than legacy access policies. Enable Microsoft Defender for Cloud and route signals to Microsoft Sentinel for centralised SIEM/SOAR.
Cost Optimisation — Azure Reservations and Savings Plans can reduce compute cost by up to 72% versus pay-as-you-go. Azure Cost Management provides budget alerts, anomaly detection, and invoice-level breakdowns. Tagging is the prerequisite for cost attribution — enforce Environment, Team, Project, and CostCenter tags via Azure Policy with a deny effect. Azure Advisor surfaces right-sizing recommendations continuously; treat them as a weekly operational input, not an optional report.
Operational Excellence — All resources are provisioned via Infrastructure as Code (Bicep primary, Terraform supported). Every change goes through Azure DevOps or GitHub Actions pipelines — no manual portal deployments in production. Azure Monitor and Application Insights provide unified observability. Runbooks are Azure Automation documents, not Word files. Chaos engineering with Azure Chaos Studio is scheduled, not reactive.
Performance Efficiency — Select the right SKU before deployment: resizing after go-live is disruptive, and overprovisioning hides latency problems rather than fixing them. Use VMSS, AKS cluster autoscaler, Container Apps, and App Service autoscale to match capacity to demand. Azure Cache for Redis reduces repeated database round-trips for frequently read data. Azure Front Door provides global CDN, WAF, and TCP/TLS termination at the edge. All workloads are load-tested to at least 2× expected peak before production.
Azure Advisor is the built-in recommendation engine — it runs continuously across all pillars and surfaces actionable findings in the portal, API, and Azure Policy. Integrating Advisor findings into sprint backlogs makes the Well-Architected posture a living artefact rather than a point-in-time report.
Architecture Overview
    %%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
    flowchart TD
      CLIENT([Global Client Traffic])
      CLIENT --> AFD[Azure Front Door\nWAF + CDN + Global failover\nTLS termination at edge]

      subgraph PRIMARY["Primary Region — 3 Availability Zones"]
        AKS_P[Zone-redundant AKS\nSystem + user node pools\nCluster autoscaler enabled]
        SQL_P[Azure SQL auto-failover group\nZone-redundant Business Critical\nAuto-backup 35-day retention]
        KV_P[Azure Key Vault\nRBAC model\nPurge protection enabled]
      end

      subgraph SECONDARY["Secondary Region — Passive"]
        AKS_S[Passive AKS cluster\nScaled to zero until failover\nSynced via GitOps]
        SQL_S[SQL read replica\nAuto-failover group secondary\nCross-region replication]
      end

      subgraph GOVERNANCE["Governance + Observability"]
        ADVISOR[Azure Advisor\nContinuous pillar recommendations\nWeekly review cadence]
        COST[Cost Management\nBudget alerts + anomaly detection\nTag-based attribution]
        POLICY[Azure Policy\nTag enforcement deny effect\nCompliance dashboard]
        MONITOR[Azure Monitor + App Insights\nMetrics logs traces\nAlert rules + action groups]
      end

      AFD -->|Primary| AKS_P
      AFD -->|Failover| AKS_S
      AKS_P --> SQL_P
      AKS_P --> KV_P
      SQL_P -->|Replication| SQL_S
      AKS_S --> SQL_S
      PRIMARY & SECONDARY --> GOVERNANCE

      style CLIENT fill:#4f8ef7,color:#fff
      style AFD fill:#DBEAFE
      style AKS_P fill:#DBEAFE
      style SQL_P fill:#DBEAFE
      style KV_P fill:#DBEAFE
      style AKS_S fill:#F9FAFB
      style SQL_S fill:#F9FAFB
      style ADVISOR fill:#fef3c7
      style COST fill:#fef3c7
      style POLICY fill:#fef3c7
      style MONITOR fill:#fef3c7
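The diagram pairs the primary Azure SQL server with a secondary-region server through an auto-failover group. A minimal Bicep sketch of that pairing follows. The parameter names, the failover group name, and the 60-minute grace period are assumptions for illustration, and both logical servers and the database are assumed to exist already.

```bicep
// Sketch only: parameter names and values are hypothetical.
param primarySqlServerName string
param secondarySqlServerId string // resource ID of the secondary-region logical server
param databaseName string

resource primarySqlServer 'Microsoft.Sql/servers@2021-11-01' existing = {
  name: primarySqlServerName
}

resource database 'Microsoft.Sql/servers/databases@2021-11-01' existing = {
  parent: primarySqlServer
  name: databaseName
}

// Failover group owned by the primary server; the listener endpoint follows the active server
resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2021-11-01' = {
  parent: primarySqlServer
  name: 'fog-${primarySqlServerName}'
  properties: {
    partnerServers: [
      { id: secondarySqlServerId }
    ]
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60 // assumed value; derive from your RPO
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Disabled'
    }
    databases: [
      database.id
    ]
  }
}
```

Applications would then connect through the failover group listener rather than the individual server name, so a failover does not require a connection string change.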
Implementation Guide
Reliability — Zone-Redundant AKS Cluster
Availability zones must be specified when the cluster or node pool is created. The availabilityZones property cannot be changed on an existing node pool; adding zones requires a new node pool or a cluster rebuild.
    // Zone-redundant AKS cluster with system and user node pools
    resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
      name: 'aks-${workloadName}-${env}'
      location: location
      identity: { type: 'SystemAssigned' }
      properties: {
        dnsPrefix: 'aks-${workloadName}-${env}'
        agentPoolProfiles: [
          {
            name: 'system'
            count: 3
            vmSize: 'Standard_D4ds_v5'
            availabilityZones: ['1', '2', '3'] // Zone-redundant — must be set at creation
            mode: 'System'
            enableAutoScaling: true
            minCount: 3
            maxCount: 6
            osDiskType: 'Ephemeral'
          }
          {
            name: 'workload'
            count: 3
            vmSize: 'Standard_D8ds_v5'
            availabilityZones: ['1', '2', '3']
            mode: 'User'
            enableAutoScaling: true
            minCount: 3
            maxCount: 20
            osDiskType: 'Ephemeral'
          }
        ]
        networkProfile: {
          networkPlugin: 'azure'
          networkPolicy: 'azure'
          loadBalancerSku: 'standard'
        }
        addonProfiles: {
          omsagent: {
            enabled: true
            config: { logAnalyticsWorkspaceResourceID: logAnalyticsWorkspace.id }
          }
        }
      }
    }
Terraform equivalent:
azurerm_kubernetes_cluster with default_node_pool.zones = ["1","2","3"] and an auto_scaler_profile block. The zones argument cannot be changed in place after creation; plan node pool or cluster rebuilds during maintenance windows.
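Because zones cannot be added to an existing pool, one way to retrofit zone redundancy is to add a new zone-spanning user pool beside the old one and then migrate workloads. The sketch below assumes an existing cluster; the parameter name and the pool name workloadz are hypothetical.

```bicep
// Sketch only: adds a zone-spanning user pool to an existing cluster.
param clusterName string // hypothetical parameter

resource existingCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' existing = {
  name: clusterName
}

resource zonalPool 'Microsoft.ContainerService/managedClusters/agentPools@2024-02-01' = {
  parent: existingCluster
  name: 'workloadz' // hypothetical name; Linux pool names are lowercase alphanumeric, max 12 characters
  properties: {
    mode: 'User'
    vmSize: 'Standard_D8ds_v5'
    availabilityZones: ['1', '2', '3'] // set at pool creation; cannot be changed afterwards
    enableAutoScaling: true
    count: 3
    minCount: 3
    maxCount: 20
    osDiskType: 'Ephemeral'
  }
}
```

Once the new pool is healthy, the old single-zone pool can be cordoned and drained with kubectl and then deleted.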
Security — Key Vault with RBAC and Private Endpoint
Access policies are the legacy model. RBAC provides audit trails, conditional access integration, and deny assignments that access policies cannot express.
    // Key Vault with RBAC model and private endpoint — no access policies
    resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
      name: 'kv-${workloadName}-${env}-${uniqueSuffix}'
      location: location
      properties: {
        sku: { family: 'A', name: 'standard' }
        tenantId: subscription().tenantId
        enableRbacAuthorization: true // Replaces legacy access policies
        enableSoftDelete: true
        softDeleteRetentionInDays: 90
        enablePurgeProtection: true
        publicNetworkAccess: 'Disabled' // Private endpoint only
        networkAcls: {
          defaultAction: 'Deny'
          bypass: 'AzureServices'
        }
      }
    }

    resource kvPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-11-01' = {
      name: 'pe-${keyVault.name}'
      location: location
      properties: {
        subnet: { id: privateEndpointSubnetId }
        privateLinkServiceConnections: [
          {
            name: 'kv-connection'
            properties: {
              privateLinkServiceId: keyVault.id
              groupIds: ['vault']
            }
          }
        ]
      }
    }
Terraform equivalent:
azurerm_key_vault with enable_rbac_authorization = true and public_network_access_enabled = false, paired with azurerm_private_endpoint.
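With the RBAC model enabled, a workload reads secrets only after its identity has been granted a data-plane role on the vault. The following is a minimal sketch of such a grant, assuming an existing vault and a managed identity whose object ID is passed in; both parameter names are hypothetical. The GUID is the built-in Key Vault Secrets User role.

```bicep
// Sketch only: grants read access to secrets for one managed identity.
param keyVaultName string // hypothetical parameter
param workloadPrincipalId string // object ID of the workload's managed identity (hypothetical parameter)

resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

// Built-in role: Key Vault Secrets User
var secretsUserRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b874-0445c86b69e6')

resource secretsUserAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(keyVault.id, workloadPrincipalId, secretsUserRoleId) // deterministic name keeps redeployments idempotent
  scope: keyVault
  properties: {
    roleDefinitionId: secretsUserRoleId
    principalId: workloadPrincipalId
    principalType: 'ServicePrincipal'
  }
}
```

Scoping the assignment to the vault rather than the resource group keeps the grant as narrow as the workload needs.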
Cost Optimisation — Azure Policy Tagging Enforcement
Tagging must be enforced at resource creation via policy with a deny effect. Post-hoc tagging campaigns have poor coverage and break cost attribution.
    // Azure Policy requiring cost-attribution tags on all resources
    resource taggingPolicy 'Microsoft.Authorization/policyDefinitions@2023-04-01' = {
      name: 'require-cost-attribution-tags'
      properties: {
        displayName: 'Require cost attribution tags on all resources'
        policyType: 'Custom'
        mode: 'Indexed'
        parameters: {}
        policyRule: {
          if: {
            // anyOf: deny when any one of the required tags is missing
            // (allOf would deny only when all four are missing at once)
            anyOf: [
              {
                field: 'tags[Environment]'
                exists: 'false'
              }
              {
                field: 'tags[Team]'
                exists: 'false'
              }
              {
                field: 'tags[Project]'
                exists: 'false'
              }
              {
                field: 'tags[CostCenter]'
                exists: 'false'
              }
            ]
          }
          then: {
            effect: 'deny'
          }
        }
      }
    }
Terraform equivalent:
azurerm_policy_definition with policy_rule JSON. Assign via azurerm_subscription_policy_assignment with enforce = true (the provider's equivalent of enforcementMode 'Default').
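A policy definition has no effect until it is assigned. A minimal Bicep sketch of a subscription-scope assignment follows; the parameter name and assignment name are hypothetical, and the definition is assumed to have been deployed to the same subscription.

```bicep
// Sketch only: assigns the custom tagging definition at subscription scope.
targetScope = 'subscription'

param taggingPolicyDefinitionId string // resource ID of the custom definition (hypothetical parameter)

resource taggingAssignment 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'require-cost-tags' // hypothetical assignment name
  properties: {
    displayName: 'Require cost attribution tags'
    policyDefinitionId: taggingPolicyDefinitionId
    enforcementMode: 'Default' // 'DoNotEnforce' evaluates compliance without blocking deployments
  }
}
```

Starting with enforcementMode set to 'DoNotEnforce' is one way to measure how many existing deployments would be blocked before switching the deny effect on.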
Operational Excellence — Diagnostic Settings and Budget Alerts
All resources route logs and metrics to a central Log Analytics workspace. Monthly budgets with forecasted and actual thresholds prevent surprise invoices.
    // Diagnostic settings routing to central Log Analytics workspace
    resource diagnosticSettings 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
      name: 'diag-${aksCluster.name}'
      scope: aksCluster
      properties: {
        workspaceId: logAnalyticsWorkspace.id
        // retentionPolicy is omitted: it is deprecated in diagnostic settings and
        // never applied to a workspace destination; set retention on the workspace
        logs: [
          {
            categoryGroup: 'allLogs'
            enabled: true
          }
        ]
        metrics: [
          {
            category: 'AllMetrics'
            enabled: true
          }
        ]
      }
    }
    // Monthly consumption budget with forecasted and actual alert thresholds
    resource monthlyBudget 'Microsoft.Consumption/budgets@2023-05-01' = {
      name: 'budget-${workloadName}-monthly'
      properties: {
        category: 'Cost'
        amount: budgetAmountGBP
        timeGrain: 'Monthly'
        timePeriod: {
          startDate: '2026-01-01'
        }
        notifications: {
          forecastedEightyPercent: {
            enabled: true
            operator: 'GreaterThan'
            threshold: 80
            thresholdType: 'Forecasted'
            contactEmails: alertEmailAddresses
          }
          actualOneHundredPercent: {
            enabled: true
            operator: 'GreaterThan'
            threshold: 100
            thresholdType: 'Actual'
            contactEmails: alertEmailAddresses
          }
        }
      }
    }
Terraform equivalent:
azurerm_monitor_diagnostic_setting and azurerm_consumption_budget_subscription. Note that azurerm_consumption_budget_subscription requires the Microsoft.Consumption resource provider to be registered on the subscription.
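The snippets above reference logAnalyticsWorkspace without defining it. A minimal sketch of that central workspace follows, assuming pay-as-you-go pricing and 90-day interactive retention; the naming pattern and parameters are assumptions rather than part of the original guide.

```bicep
// Sketch only: central workspace that the diagnostic settings send data to.
param location string = resourceGroup().location
param workloadName string
param env string

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-${workloadName}-${env}' // hypothetical naming pattern
  location: location
  properties: {
    sku: { name: 'PerGB2018' } // pay-as-you-go pricing tier
    retentionInDays: 90 // assumed value; retention is governed here, at the workspace
  }
}
```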
Performance Efficiency — Azure Front Door with WAF Policy
Azure Front Door is the single global entry point. WAF policy in Prevention mode blocks known attack patterns before they reach the origin. Custom rules enforce geo-filtering and rate limiting.
    // Azure Front Door Standard/Premium with WAF policy
    resource frontDoorProfile 'Microsoft.Cdn/profiles@2023-05-01' = {
      name: 'afd-${workloadName}-${env}'
      location: 'Global'
      sku: { name: 'Premium_AzureFrontDoor' } // Premium required for managed WAF rule sets and Private Link origins
    }

    resource wafPolicy 'Microsoft.Network/FrontDoorWebApplicationFirewallPolicies@2022-05-01' = {
      name: 'waf${workloadName}${env}' // WAF policy names allow letters and digits only
      location: 'Global'
      sku: { name: 'Premium_AzureFrontDoor' }
      properties: {
        policySettings: {
          enabledState: 'Enabled'
          mode: 'Prevention' // Block — not just detect
          requestBodyCheck: 'Enabled'
        }
        managedRules: {
          managedRuleSets: [
            {
              ruleSetType: 'Microsoft_DefaultRuleSet'
              ruleSetVersion: '2.1'
              ruleSetAction: 'Block' // DRS 2.x uses anomaly scoring; the action is set on the rule set
            }
            { ruleSetType: 'Microsoft_BotManagerRuleSet', ruleSetVersion: '1.1' }
          ]
        }
      }
    }
Terraform equivalent:
azurerm_cdn_frontdoor_profile (SKU Premium_AzureFrontDoor) and azurerm_cdn_frontdoor_firewall_policy (azurerm_frontdoor_firewall_policy targets the classic Front Door service). Use azurerm_cdn_frontdoor_security_policy to associate the WAF with the Front Door endpoint.
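In Bicep, the association between WAF policy and endpoint is a securityPolicies child resource of the profile. The sketch below assumes the profile and WAF policy already exist; the parameter names, the endpoint name, and the policy name are hypothetical.

```bicep
// Sketch only: attaches an existing WAF policy to a Front Door endpoint.
param frontDoorProfileName string // hypothetical parameter
param wafPolicyId string // resource ID of the WAF policy (hypothetical parameter)

resource frontDoorProfile 'Microsoft.Cdn/profiles@2023-05-01' existing = {
  name: frontDoorProfileName
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoorProfile
  name: 'ep-${frontDoorProfileName}'
  location: 'Global'
  properties: {
    enabledState: 'Enabled'
  }
}

resource securityPolicy 'Microsoft.Cdn/profiles/securityPolicies@2023-05-01' = {
  parent: frontDoorProfile
  name: 'secpol-waf'
  properties: {
    parameters: {
      type: 'WebApplicationFirewall'
      wafPolicy: { id: wafPolicyId }
      associations: [
        {
          domains: [
            { id: endpoint.id }
          ]
          patternsToMatch: ['/*'] // apply the WAF to every path on the endpoint
        }
      ]
    }
  }
}
```

Without this association the WAF policy exists but inspects no traffic, so it is worth treating the security policy as part of the same deployment as the WAF itself.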
Decision Criteria — Pillar Trade-offs
Every architecture involves trade-offs across pillars. Document these explicitly in ADRs before deployment.
| Decision | Pillar Gained | Pillar Cost | Accepted Trade-off |
|---|---|---|---|
| Zone-redundant AKS across 3 AZs | Reliability | Cost | ~15% node pool overhead accepted for 99.99% zone SLA |
| Azure SQL Business Critical tier | Reliability, Performance | Cost | In-memory replica and auto-failover justify 3× cost over General Purpose for production OLTP |
| Azure Front Door Premium + WAF | Security, Performance | Cost | Global CDN + WAF prevention mode accepted over Application Gateway for multi-region workloads |
| Azure Reservations 3-year | Cost | Flexibility | Baseline compute is predictable; up to 72% savings justified against locked-in commitment |
| Private endpoints for all PaaS | Security | Operational Complexity | No public network access to Key Vault, SQL, Storage; private DNS zones required in every spoke VNet |
| Single-region deployment | Cost, Simplicity | Reliability | RPO/RTO requirements do not justify paired-region standby; revisit if SLA requires 99.99% |
Cost Model
| Resource | Tier | Approximate Monthly Cost | Optimisation Lever |
|---|---|---|---|
| AKS node pool (6 × D4ds_v5) | Standard | ~£480–£640 | 3-year Reserved VM Instances (up to 72% saving) |
| Azure SQL (Business Critical, 4 vCores) | Business Critical | ~£900–£1,100 | Reserved capacity 1-yr (up to 40%) |
| Azure Front Door Premium | Premium | ~£150–£250 + egress | Standard tier if managed WAF rule sets and Private Link origins are not required |
| Azure Key Vault | Standard | ~£5–£20 | Operations-based billing; negligible vs. compute |
| Log Analytics (50 GB/month) | Pay-as-you-go | ~£90–£120 | Commitment tiers at 100 GB+/day; archive tier for >30 days |
| Azure Backup (1 TB) | Standard | ~£15–£30 | LRS redundancy for non-critical; GRS for production |
| Microsoft Defender for Containers | Standard | ~£6.50/vCore/month | Scope to production clusters only during pilot |
Cost optimisation levers:
- Azure Reservations (VMs, SQL, Cosmos DB) and Savings Plans for compute reduce pay-as-you-go rates by up to 72%; 1-year commitment is the minimum — 3-year where baseline load is stable.
- Azure Advisor right-sizing recommendations are surfaced continuously; schedule a weekly review in team stand-ups and act on low-risk (dev/test) resizes immediately.
- Budget alerts at 80% forecasted and 100% actual prevent month-end surprises; pair with anomaly detection alerts for spike identification within hours.
- Tag enforcement via Azure Policy (Environment/Team/Project/CostCenter) is the prerequisite for per-team cost attribution and chargeback reports in Cost Management.
- Dev/test subscriptions benefit from the Azure Dev/Test pricing offer (up to 55% on Windows VMs, no charge for SQL Server licences) — do not run dev workloads on production subscriptions.
- Enable auto-shutdown for non-production VMs. For dev AKS clusters, scale user node pools to zero overnight (for example, KEDA scaling workloads to zero so the cluster autoscaler can remove the nodes) or stop the whole cluster with the AKS stop/start feature.
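The auto-shutdown lever in the list above can itself be expressed as code. The following is a minimal sketch of a daily shutdown schedule for one existing non-production VM; the parameter names, the 19:00 shutdown time, and the UK time zone are assumptions for illustration.

```bicep
// Sketch only: daily auto-shutdown for a single non-production VM.
param vmName string // hypothetical parameter
param location string = resourceGroup().location

resource vm 'Microsoft.Compute/virtualMachines@2023-09-01' existing = {
  name: vmName
}

resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = {
  name: 'shutdown-computevm-${vmName}' // the platform expects this exact naming pattern
  location: location
  properties: {
    status: 'Enabled'
    taskType: 'ComputeVmShutdownTask'
    dailyRecurrence: { time: '1900' } // assumed shutdown time, 24-hour HHmm
    timeZoneId: 'GMT Standard Time' // assumed time zone
    targetResourceId: vm.id
    notificationSettings: { status: 'Disabled' }
  }
}
```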
Anti-Patterns to Avoid
Deploying AKS node pools, Azure SQL, or Service Bus to a single availability zone because zone-redundancy appears optional in the portal. A single AZ failure takes the entire workload offline without any cross-zone replication.
Specify availabilityZones: ['1', '2', '3'] for all node pools, use zone-redundant Azure SQL (Business Critical or General Purpose), and enable zone redundancy for Service Bus Premium and Event Hubs Premium at provisioning. AKS node pool zones cannot be changed after creation; for the other services, confirm whether zone redundancy can be enabled in place before assuming it can be added later.
Deploying resources without enforcing Environment, Team, Project, and CostCenter tags. Cost Management cost reports are meaningless without consistent tags; finance cannot attribute spend to business units; anomalies are invisible until the invoice arrives.
Enforce tags via Azure Policy with a deny effect before any resources are provisioned. Define the tag taxonomy in a policy initiative and assign it at the management group level so it applies to all subscriptions automatically.
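The management-group approach described above might look like the following sketch. It assumes the custom tagging definition has been deployed at the same management group or above; the parameter name, initiative name, and assignment name are hypothetical.

```bicep
// Sketch only: initiative plus assignment at management group scope.
targetScope = 'managementGroup'

param taggingPolicyDefinitionId string // resource ID of the custom definition (hypothetical parameter)

resource tagInitiative 'Microsoft.Authorization/policySetDefinitions@2023-04-01' = {
  name: 'cost-attribution-tags'
  properties: {
    displayName: 'Cost attribution tag taxonomy'
    policyType: 'Custom'
    policyDefinitions: [
      {
        policyDefinitionReferenceId: 'requireCostTags'
        policyDefinitionId: taggingPolicyDefinitionId
      }
    ]
  }
}

resource tagInitiativeAssignment 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'cost-tags-mg' // management group assignment names are limited to 24 characters
  properties: {
    displayName: 'Enforce cost attribution tags'
    policyDefinitionId: tagInitiative.id
    enforcementMode: 'Default'
  }
}
```

Subscriptions created under the management group later inherit the assignment, which is what makes the enforcement automatic.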
Clicking through the Azure portal to create or modify production resources. Manual changes are not auditable, not reproducible, and not reviewable. Configuration drift is undetectable until a failure exposes it.
All production resources are defined in Bicep or Terraform and deployed through Azure DevOps or GitHub Actions pipelines. Run az deployment group what-if (or terraform plan) as a required pipeline gate before every apply. Direct portal access to production is read-only.
Relying on the monthly invoice as the primary cost signal. Cloud costs can spike by 10× in hours from a runaway autoscale event, a data egress misconfiguration, or a forgotten GPU VM; discovering this 30 days later has no remediation value.
Create a monthly budget with forecasted-80% and actual-100% email notifications for every workload subscription. Enable Cost Management anomaly detection alerts. Review Advisor cost recommendations weekly as a team ritual.
Treating the Well-Architected Review as optional documentation rather than a structured risk gate. Architectures that skip the review regularly go live with High Risk Issues in Security and Reliability that are far cheaper to address before production than after an incident.
Run the Azure Well-Architected Review at aka.ms/azurewaf at project start (shape the architecture), before go-live (final risk gate), and annually (detect drift). Treat High Risk Issues as P1 blockers for go-live sign-off. Track Medium Risk Issues in the team backlog with owners and target quarters.
References
- Azure Well-Architected Framework — https://learn.microsoft.com/azure/well-architected/
- Azure Well-Architected Review tool — https://aka.ms/azurewaf
- Azure Advisor overview — https://learn.microsoft.com/azure/advisor/advisor-overview
- Availability Zones overview — https://learn.microsoft.com/azure/reliability/availability-zones-overview
- Azure Cost Management overview — https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management
- Microsoft Cloud Security Benchmark — https://learn.microsoft.com/security/benchmark/azure/introduction
- Azure Policy built-in definitions — https://learn.microsoft.com/azure/governance/policy/samples/built-in-policies
- Azure Front Door documentation — https://learn.microsoft.com/azure/frontdoor/front-door-overview
- Bicep documentation — https://learn.microsoft.com/azure/azure-resource-manager/bicep/overview