Overview

Microsoft's Azure Well-Architected Framework distils lessons from thousands of customer architectures into five pillars: Reliability, Security, Cost Optimisation, Operational Excellence, and Performance Efficiency. Each pillar contains design principles, recommendation guides, and trade-off analysis. The framework does not prescribe a specific architecture; it provides lenses through which to evaluate the trade-offs in any Azure workload.

Reliability — Design for zone failure, not just instance failure. Zone-redundant services (AKS, Service Bus Premium, Event Hubs Premium, Azure SQL, Cosmos DB, Azure Cache for Redis) must be enabled at provisioning — they cannot be retrofitted cheaply. Use paired regions for disaster recovery, configure health probes and auto-failover groups, and validate recoverability with Azure Chaos Studio. Azure Service Health alerts notify on impairments before customers report them.

Security — Aligns directly with the Microsoft Cloud Security Benchmark (MCSB). Apply Zero Trust identity: Managed Identity for service-to-service auth, PIM for privileged human access, no standing admin roles. Enforce network segmentation via private endpoints, Azure Firewall, and NSG rules. Store all secrets in Azure Key Vault using the RBAC permission model, not access policies. Enable Microsoft Defender for Cloud and route signals to Microsoft Sentinel for centralised SIEM/SOAR.

Cost Optimisation — Azure Reservations can reduce compute cost by up to 72% versus pay-as-you-go, and Azure savings plans for compute by up to 65%. Azure Cost Management provides budget alerts, anomaly detection, and invoice-level breakdowns. Tagging is the prerequisite for cost attribution: enforce Environment, Team, Project, and CostCenter tags via Azure Policy with a deny effect. Azure Advisor surfaces right-sizing recommendations continuously; treat them as a weekly operational input, not an optional report.

Operational Excellence — All resources are provisioned via Infrastructure as Code (Bicep primary, Terraform supported). Every change goes through Azure DevOps or GitHub Actions pipelines — no manual portal deployments in production. Azure Monitor and Application Insights provide unified observability. Runbooks are Azure Automation documents, not Word files. Chaos engineering with Azure Chaos Studio is scheduled, not reactive.

Performance Efficiency — Select the right SKU before deployment; resizing after the fact rarely performs as well, and overprovisioning masks latency issues. Use VMSS, AKS cluster autoscaler, Container Apps, and App Service autoscale to match capacity to demand. Azure Cache for Redis eliminates repeated database round-trips. Azure Front Door provides global CDN, WAF, and TCP/TLS termination at the edge. All workloads are load-tested to at least 2× expected peak before production.

Azure Advisor is the built-in recommendation engine. It runs continuously across all pillars and surfaces actionable findings in the portal and through its API and CLI. Integrating Advisor findings into sprint backlogs makes the Well-Architected posture a living artefact rather than a point-in-time report.

Architecture Overview

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    CLIENT([Global Client Traffic])
    CLIENT --> AFD[Azure Front Door\nWAF + CDN + Global failover\nTLS termination at edge]
    subgraph PRIMARY["Primary Region — 3 Availability Zones"]
        AKS_P[Zone-redundant AKS\nSystem + user node pools\nCluster autoscaler enabled]
        SQL_P[Azure SQL auto-failover group\nZone-redundant Business Critical\nAuto-backup 35-day retention]
        KV_P[Azure Key Vault\nRBAC model\nPurge protection enabled]
    end
    subgraph SECONDARY["Secondary Region — Passive"]
        AKS_S[Passive AKS cluster\nScaled to zero until failover\nSynced via GitOps]
        SQL_S[SQL read replica\nAuto-failover group secondary\nCross-region replication]
    end
    subgraph GOVERNANCE["Governance + Observability"]
        ADVISOR[Azure Advisor\nContinuous pillar recommendations\nWeekly review cadence]
        COST[Cost Management\nBudget alerts + anomaly detection\nTag-based attribution]
        POLICY[Azure Policy\nTag enforcement deny effect\nCompliance dashboard]
        MONITOR[Azure Monitor + App Insights\nMetrics logs traces\nAlert rules + action groups]
    end
    AFD -->|Primary| AKS_P
    AFD -->|Failover| AKS_S
    AKS_P --> SQL_P
    AKS_P --> KV_P
    SQL_P -->|Replication| SQL_S
    AKS_S --> SQL_S
    PRIMARY & SECONDARY --> GOVERNANCE
    style CLIENT fill:#4f8ef7,color:#fff
    style AFD fill:#DBEAFE
    style AKS_P fill:#DBEAFE
    style SQL_P fill:#DBEAFE
    style KV_P fill:#DBEAFE
    style AKS_S fill:#F9FAFB
    style SQL_S fill:#F9FAFB
    style ADVISOR fill:#fef3c7
    style COST fill:#fef3c7
    style POLICY fill:#fef3c7
    style MONITOR fill:#fef3c7
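
The Azure SQL auto-failover group shown in the diagram might be declared along the following lines. This is a minimal sketch rather than the deployment behind the diagram: the parameter names (primarySqlServerName, secondarySqlServerId, databaseIds), the group name, and the 60-minute grace period are all assumptions for illustration.

```bicep
// Sketch only: parameter names and values are illustrative assumptions.
param primarySqlServerName string   // hypothetical: existing logical server in the primary region
param secondarySqlServerId string   // hypothetical: resource ID of the secondary-region server
param databaseIds array             // hypothetical: resource IDs of the databases to replicate

resource primarySqlServer 'Microsoft.Sql/servers@2021-11-01' existing = {
  name: primarySqlServerName
}

resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2021-11-01' = {
  parent: primarySqlServer
  name: 'fog-${primarySqlServerName}'
  properties: {
    partnerServers: [
      { id: secondarySqlServerId }
    ]
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60   // assumed grace period
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Disabled'
    }
    databases: databaseIds
  }
}
```

Applications would then connect through the failover group listener rather than an individual server name, so a failover does not require a connection string change.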

Implementation Guide

Reliability — Zone-Redundant AKS Cluster

Zone-redundant deployments must be specified at cluster creation. The zones property cannot be added to an existing node pool — it requires a new pool or cluster rebuild.

// Zone-redundant AKS cluster with system and user node pools
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: 'aks-${workloadName}-${env}'
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    dnsPrefix: 'aks-${workloadName}-${env}'
    agentPoolProfiles: [
      {
        name: 'system'
        count: 3
        vmSize: 'Standard_D4ds_v5'
        availabilityZones: ['1', '2', '3']   // Zone-redundant — must be set at creation
        mode: 'System'
        enableAutoScaling: true
        minCount: 3
        maxCount: 6
        osDiskType: 'Ephemeral'
      }
      {
        name: 'workload'
        count: 3
        vmSize: 'Standard_D8ds_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'User'
        enableAutoScaling: true
        minCount: 3
        maxCount: 20
        osDiskType: 'Ephemeral'
      }
    ]
    networkProfile: {
      networkPlugin: 'azure'
      networkPolicy: 'azure'
      loadBalancerSku: 'standard'
    }
    addonProfiles: {
      omsagent: {
        enabled: true
        config: { logAnalyticsWorkspaceResourceID: logAnalyticsWorkspace.id }
      }
    }
  }
}

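Terraform equivalent: azurerm_kubernetes_cluster with default_node_pool.zones = ["1","2","3"] and autoscaling enabled on the node pool through min_count and max_count; the optional auto_scaler_profile block only tunes autoscaler behaviour. The zones argument is immutable post-creation, so plan node pool or cluster rebuilds during maintenance windows.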
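
Because zones cannot be added to an existing node pool, the usual route for a cluster that was created without them is to add a new zone-spanning pool and migrate workloads onto it. The fragment below is a sketch of that step under stated assumptions: clusterName and the pool name workloadz are hypothetical, and the sizing simply mirrors the workload pool above.

```bicep
// Sketch only: adds a new zone-spanning user pool to an existing cluster.
param clusterName string   // hypothetical: name of the existing AKS cluster

resource existingCluster 'Microsoft.ContainerService/managedClusters@2024-02-01' existing = {
  name: clusterName
}

resource zonalPool 'Microsoft.ContainerService/managedClusters/agentPools@2024-02-01' = {
  parent: existingCluster
  name: 'workloadz'   // assumed name; Linux pool names are lowercase alphanumeric, max 12 characters
  properties: {
    mode: 'User'
    vmSize: 'Standard_D8ds_v5'
    availabilityZones: ['1', '2', '3']
    enableAutoScaling: true
    count: 3
    minCount: 3
    maxCount: 20
    osDiskType: 'Ephemeral'
  }
}
```

Once the new pool is healthy, the old single-zone pool can be cordoned and drained before it is deleted.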

Security — Key Vault with RBAC and Private Endpoint

Access policies are the legacy model. RBAC provides audit trails, conditional access integration, and deny assignments that access policies cannot express.

// Key Vault with RBAC model and private endpoint — no access policies
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-${workloadName}-${env}-${uniqueSuffix}'
  location: location
  properties: {
    sku: { family: 'A', name: 'standard' }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true           // Replaces legacy access policies
    enableSoftDelete: true
    softDeleteRetentionInDays: 90
    enablePurgeProtection: true
    publicNetworkAccess: 'Disabled'         // Private endpoint only
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
    }
  }
}

resource kvPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-11-01' = {
  name: 'pe-${keyVault.name}'
  location: location
  properties: {
    subnet: { id: privateEndpointSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'kv-connection'
        properties: {
          privateLinkServiceId: keyVault.id
          groupIds: ['vault']
        }
      }
    ]
  }
}
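
With public network access disabled, clients can only reach the vault if its name resolves to the private endpoint's IP address. A possible way to wire that up is sketched below; vnetId and privateEndpointName are hypothetical parameters, and the zone name shown is the one used in the Azure public cloud.

```bicep
// Sketch only: private DNS wiring for a Key Vault private endpoint.
param vnetId string                // hypothetical: resource ID of the VNet that must resolve the vault
param privateEndpointName string   // hypothetical: name of the existing private endpoint

resource kvDnsZone 'Microsoft.Network/privateDnsZones@2020-06-01' = {
  name: 'privatelink.vaultcore.azure.net'
  location: 'global'
}

resource kvDnsLink 'Microsoft.Network/privateDnsZones/virtualNetworkLinks@2020-06-01' = {
  parent: kvDnsZone
  name: 'link-keyvault'
  location: 'global'
  properties: {
    virtualNetwork: { id: vnetId }
    registrationEnabled: false   // records come from the zone group, not VM auto-registration
  }
}

resource existingEndpoint 'Microsoft.Network/privateEndpoints@2023-11-01' existing = {
  name: privateEndpointName
}

resource kvDnsZoneGroup 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2023-11-01' = {
  parent: existingEndpoint
  name: 'default'
  properties: {
    privateDnsZoneConfigs: [
      {
        name: 'vaultcore'
        properties: { privateDnsZoneId: kvDnsZone.id }
      }
    ]
  }
}
```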

Terraform equivalent: azurerm_key_vault with enable_rbac_authorization = true and public_network_access_enabled = false, paired with azurerm_private_endpoint.
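
Under the RBAC model, access is granted through role assignments scoped to the vault instead of access policy entries. The sketch below grants a workload's managed identity read access to secrets; keyVaultName and workloadPrincipalId are hypothetical parameters, and the GUID is the built-in Key Vault Secrets User role.

```bicep
// Sketch only: grants a managed identity read access to secrets via RBAC.
param keyVaultName string          // hypothetical: name of the existing vault
param workloadPrincipalId string   // hypothetical: object ID of the workload's managed identity

// Built-in "Key Vault Secrets User" role definition
var secretsUserRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b874-0445c86b69e6')

resource vault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

resource secretsUserAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic name so that redeployments are idempotent
  name: guid(vault.id, workloadPrincipalId, secretsUserRoleId)
  scope: vault
  properties: {
    roleDefinitionId: secretsUserRoleId
    principalId: workloadPrincipalId
    principalType: 'ServicePrincipal'
  }
}
```

Scoping the assignment to the vault rather than the resource group keeps the grant as narrow as the workload needs.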

Cost Optimisation — Azure Policy Tagging Enforcement

Tagging must be enforced at resource creation via policy with a deny effect. Post-hoc tagging campaigns have poor coverage and break cost attribution.

// Azure Policy requiring cost-attribution tags on all resources
resource taggingPolicy 'Microsoft.Authorization/policyDefinitions@2023-04-01' = {
  name: 'require-cost-attribution-tags'
  properties: {
    displayName: 'Require cost attribution tags on all resources'
    policyType: 'Custom'
    mode: 'Indexed'
    parameters: {}
    policyRule: {
      if: {
        anyOf: [   // deny when any one of the required tags is missing
          {
            field: 'tags[Environment]'
            exists: 'false'
          }
          {
            field: 'tags[Team]'
            exists: 'false'
          }
          {
            field: 'tags[Project]'
            exists: 'false'
          }
          {
            field: 'tags[CostCenter]'
            exists: 'false'
          }
        ]
      }
      then: {
        effect: 'deny'
      }
    }
  }
}

Terraform equivalent: azurerm_policy_definition with policy_rule JSON. Assign via azurerm_subscription_policy_assignment with enforce = true (the default).
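
A policy definition has no effect until it is assigned. The following sketch shows one way the assignment could look in Bicep at subscription scope; the parameter taggingPolicyDefinitionId and the assignment name are assumptions for illustration.

```bicep
// Sketch only: assigns the custom tagging policy at subscription scope.
targetScope = 'subscription'

param taggingPolicyDefinitionId string   // hypothetical: resource ID of the custom policy definition

resource taggingAssignment 'Microsoft.Authorization/policyAssignments@2023-04-01' = {
  name: 'require-cost-tags'
  properties: {
    displayName: 'Require cost attribution tags'
    policyDefinitionId: taggingPolicyDefinitionId
    enforcementMode: 'Default'   // 'DoNotEnforce' reports compliance without blocking deployments
  }
}
```

Starting with enforcementMode set to 'DoNotEnforce' is a common way to measure the impact on existing pipelines before switching the deny effect on.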

Operational Excellence — Diagnostic Settings and Budget Alerts

All resources route logs and metrics to a central Log Analytics workspace. Monthly budgets with forecasted and actual thresholds prevent surprise invoices.

// Diagnostic settings routing to central Log Analytics workspace
resource diagnosticSettings 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'diag-${aksCluster.name}'
  scope: aksCluster
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    // Retention is configured on the Log Analytics workspace itself; the per-category
    // retentionPolicy setting only ever applied to storage account destinations.
    logs: [
      {
        categoryGroup: 'allLogs'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}
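
The snippets in this guide reference a logAnalyticsWorkspace symbol without declaring it. A minimal sketch of such a workspace follows; workspaceName is a hypothetical parameter and the 90-day retention is an assumed value, not a recommendation from the source.

```bicep
// Sketch only: central Log Analytics workspace with retention set at workspace level.
param location string = resourceGroup().location
param workspaceName string   // hypothetical: name of the central workspace

resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: workspaceName
  location: location
  properties: {
    sku: { name: 'PerGB2018' }   // pay-as-you-go pricing tier
    retentionInDays: 90          // assumed retention period
  }
}
```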

// Monthly consumption budget with forecasted and actual alert thresholds
resource monthlyBudget 'Microsoft.Consumption/budgets@2023-05-01' = {
  name: 'budget-${workloadName}-monthly'
  properties: {
    category: 'Cost'
    amount: budgetAmountGBP
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2026-01-01'
    }
    notifications: {
      forecastedEightyPercent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        thresholdType: 'Forecasted'
        contactEmails: alertEmailAddresses
      }
      actualOneHundredPercent: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 100
        thresholdType: 'Actual'
        contactEmails: alertEmailAddresses
      }
    }
  }
}

Terraform equivalent: azurerm_monitor_diagnostic_setting and azurerm_consumption_budget_subscription. Note that azurerm_consumption_budget_subscription requires the Microsoft.Consumption resource provider to be registered on the subscription.

Performance Efficiency — Azure Front Door with WAF Policy

Azure Front Door is the single global entry point. WAF policy in Prevention mode blocks known attack patterns before they reach the origin. Custom rules enforce geo-filtering and rate limiting.

// Azure Front Door Standard/Premium with WAF policy
resource frontDoorProfile 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'afd-${workloadName}-${env}'
  location: 'Global'
  sku: { name: 'Premium_AzureFrontDoor' }   // Premium required for managed WAF rule sets + Private Link origins
}

resource wafPolicy 'Microsoft.Network/FrontDoorWebApplicationFirewallPolicies@2022-05-01' = {
  name: 'waf${workloadName}${env}'
  location: 'Global'
  sku: { name: 'Premium_AzureFrontDoor' }
  properties: {
    policySettings: {
      enabledState: 'Enabled'
      mode: 'Prevention'                    // Block — not just detect
      requestBodyCheck: 'Enabled'
    }
    managedRules: {
      managedRuleSets: [
        { ruleSetType: 'Microsoft_DefaultRuleSet', ruleSetVersion: '2.1' }
        { ruleSetType: 'Microsoft_BotManagerRuleSet', ruleSetVersion: '1.1' }
      ]
    }
  }
}
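
The policy above enables managed rule sets only. Custom rules for geo-filtering and rate limiting could be added along the lines of the sketch below, shown as a standalone policy for brevity. The rule names, the country list, and the threshold of 1000 requests per minute are assumptions chosen for illustration, as is the wafPolicyName parameter.

```bicep
// Sketch only: WAF policy with a geo allow-list and a rate limit as custom rules.
param wafPolicyName string   // hypothetical; WAF policy names allow letters and digits only

resource wafWithCustomRules 'Microsoft.Network/FrontDoorWebApplicationFirewallPolicies@2022-05-01' = {
  name: wafPolicyName
  location: 'Global'
  sku: { name: 'Premium_AzureFrontDoor' }
  properties: {
    policySettings: {
      enabledState: 'Enabled'
      mode: 'Prevention'
    }
    customRules: {
      rules: [
        {
          name: 'GeoAllowList'
          priority: 100
          enabledState: 'Enabled'
          ruleType: 'MatchRule'
          matchConditions: [
            {
              matchVariable: 'SocketAddr'
              operator: 'GeoMatch'
              negateCondition: true      // match traffic NOT from the listed countries
              matchValue: ['GB', 'IE']   // assumed allow-list
            }
          ]
          action: 'Block'
        }
        {
          name: 'RateLimitAllRequests'
          priority: 200
          enabledState: 'Enabled'
          ruleType: 'RateLimitRule'
          rateLimitDurationInMinutes: 1
          rateLimitThreshold: 1000       // assumed threshold per client per window
          matchConditions: [
            {
              matchVariable: 'RequestUri'
              operator: 'Contains'
              matchValue: ['/']          // every URI contains '/', so the rule covers all requests
            }
          ]
          action: 'Block'
        }
      ]
    }
  }
}
```

Custom rules are evaluated in priority order before managed rule sets, so the geo filter rejects out-of-region traffic before any further inspection takes place.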

Terraform equivalent: azurerm_cdn_frontdoor_profile (SKU Premium_AzureFrontDoor) and azurerm_cdn_frontdoor_firewall_policy (azurerm_frontdoor_firewall_policy targets the classic Front Door). Use azurerm_cdn_frontdoor_security_policy to associate the WAF with the Front Door endpoint.
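
In Bicep, the same association is made with a security policy child resource on the Front Door profile. The sketch below assumes an existing profile and WAF policy; frontDoorProfileName, wafPolicyId, and the endpoint and policy names are hypothetical.

```bicep
// Sketch only: attaches a WAF policy to a Front Door endpoint.
param frontDoorProfileName string   // hypothetical: name of the existing Front Door profile
param wafPolicyId string            // hypothetical: resource ID of the WAF policy

resource profile 'Microsoft.Cdn/profiles@2023-05-01' existing = {
  name: frontDoorProfileName
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: profile
  name: 'ep-${frontDoorProfileName}'
  location: 'Global'
  properties: {
    enabledState: 'Enabled'
  }
}

resource securityPolicy 'Microsoft.Cdn/profiles/securityPolicies@2023-05-01' = {
  parent: profile
  name: 'secpol-waf'
  properties: {
    parameters: {
      type: 'WebApplicationFirewall'
      wafPolicy: { id: wafPolicyId }
      associations: [
        {
          domains: [
            { id: endpoint.id }
          ]
          patternsToMatch: ['/*']   // apply the WAF to every path on the endpoint
        }
      ]
    }
  }
}
```

Without this association the WAF policy exists but inspects no traffic.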

Decision Criteria — Pillar Trade-offs

Every architecture involves trade-offs across pillars. Document these explicitly in ADRs before deployment.

| Decision | Pillar Gained | Pillar Cost | Accepted Trade-off |
|---|---|---|---|
| Zone-redundant AKS across 3 AZs | Reliability | Cost | ~15% node pool overhead accepted for 99.99% zone SLA |
| Azure SQL Business Critical tier | Reliability, Performance | Cost | In-memory replica and auto-failover justify 3× cost over General Purpose for production OLTP |
| Azure Front Door Premium + WAF | Security, Performance | Cost | Global CDN + WAF prevention mode accepted over Application Gateway for multi-region workloads |
| Azure Reservations 3-year | Cost | Flexibility | Baseline compute is predictable; up to 72% savings justified against locked-in commitment |
| Private endpoints for all PaaS | Security | Operational Complexity | No public network access to Key Vault, SQL, Storage; private DNS zones required in every spoke VNet |
| Single-region deployment | Cost, Simplicity | Reliability | RPO/RTO requirements do not justify paired-region standby; revisit if SLA requires 99.99% |

Cost Model

| Resource | Tier | Approximate Monthly Cost | Optimisation Lever |
|---|---|---|---|
| AKS node pool (6 × D4ds_v5) | Standard | ~£480–£640 | 3-year Reserved VM Instances (up to 72% saving) |
| Azure SQL (Business Critical, 4 vCores) | Business Critical | ~£900–£1,100 | Reserved capacity 1-yr (up to 40%) |
| Azure Front Door Premium | Premium | ~£150–£250 + egress | Standard tier if managed WAF rules and Private Link are not required |
| Azure Key Vault | Standard | ~£5–£20 | Operations-based billing; negligible vs. compute |
| Log Analytics (50 GB/day) | Pay-as-you-go | ~£90–£120 | Commitment tiers at 100 GB+/day; archive tier for >30 days |
| Azure Backup (1 TB) | Standard | ~£15–£30 | LRS redundancy for non-critical; GRS for production |
| Microsoft Defender for Containers | Standard | 6.50/node/month | Scope to production clusters only during pilot |

Cost optimisation levers:

  • Azure Reservations (VMs, SQL, Cosmos DB) reduce pay-as-you-go rates by up to 72%, and savings plans for compute by up to 65%; a 1-year commitment is the minimum, with 3-year where baseline load is stable.
  • Azure Advisor right-sizing recommendations are surfaced continuously; schedule a weekly review in team stand-ups and act on low-risk (dev/test) resizes immediately.
  • Budget alerts at 80% forecasted and 100% actual prevent month-end surprises; pair with anomaly detection alerts for spike identification within hours.
  • Tag enforcement via Azure Policy (Environment/Team/Project/CostCenter) is the prerequisite for per-team cost attribution and chargeback reports in Cost Management.
  • Dev/test subscriptions benefit from the Azure Dev/Test pricing offer (up to 55% on Windows VMs, no charge for SQL Server licences) — do not run dev workloads on production subscriptions.
  • Enable auto-shutdown for non-production VMs; for dev AKS clusters, scale user node pools to zero overnight (KEDA-driven workload scaling combined with the cluster autoscaler) or use the AKS stop/start feature.
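
As an illustration of the auto-shutdown lever, a daily shutdown schedule for a non-production VM might look like the sketch below. The vmName parameter, the 19:00 shutdown time, and the time zone are assumptions; the schedule name follows the shutdown-computevm-<vmName> pattern the resource type expects.

```bicep
// Sketch only: daily auto-shutdown for an existing non-production VM.
param location string = resourceGroup().location
param vmName string   // hypothetical: name of the existing dev/test VM

resource devVm 'Microsoft.Compute/virtualMachines@2023-09-01' existing = {
  name: vmName
}

resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = {
  name: 'shutdown-computevm-${vmName}'
  location: location
  properties: {
    status: 'Enabled'
    taskType: 'ComputeVmShutdownTask'
    dailyRecurrence: { time: '1900' }   // assumed: 19:00 in the time zone below
    timeZoneId: 'GMT Standard Time'     // assumed time zone
    targetResourceId: devVm.id
    notificationSettings: { status: 'Disabled' }
  }
}
```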

Anti-Patterns to Avoid

⚠ 1. Single Availability Zone for Production Workloads

Deploying AKS node pools, Azure SQL, or Service Bus to a single availability zone because zone-redundancy appears optional in the portal. A single AZ failure takes the entire workload offline without any cross-zone replication.

↺ Correct Approach

Specify availabilityZones: ['1', '2', '3'] for all node pools, use Azure SQL zone-redundant Business Critical or General Purpose, and enable zone redundancy for Service Bus Premium and Event Hubs Premium at provisioning. Node pool zones cannot be changed after creation, and retrofitting zone redundancy onto the other services is costly at best.

⚠ 2. No Resource Tagging Strategy

Deploying resources without enforcing Environment, Team, Project, and CostCenter tags. Cost Management cost reports are meaningless without consistent tags; finance cannot attribute spend to business units; anomalies are invisible until the invoice arrives.

↺ Correct Approach

Enforce tags via Azure Policy with a deny effect before any resources are provisioned. Define the tag taxonomy in a policy initiative and assign it at the management group level so it applies to all subscriptions automatically.
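
A management-group-level initiative of this kind could be sketched as follows. The names and the taggingPolicyDefinitionId parameter are hypothetical, and the sketch assumes the referenced policy definition is itself defined at or above the target management group.

```bicep
// Sketch only: tag initiative defined and assigned at management group scope.
targetScope = 'managementGroup'

param taggingPolicyDefinitionId string   // hypothetical: resource ID of the custom tag policy definition

resource tagInitiative 'Microsoft.Authorization/policySetDefinitions@2023-04-01' = {
  name: 'cost-attribution-tags'
  properties: {
    displayName: 'Cost attribution tag taxonomy'
    policyType: 'Custom'
    policyDefinitions: [
      {
        policyDefinitionReferenceId: 'requireCostTags'
        policyDefinitionId: taggingPolicyDefinitionId
      }
    ]
  }
}

resource tagInitiativeAssignment 'Microsoft.Authorization/policyAssignments@2023-04-01' = {
  name: 'cost-tags-mg'   // management group assignment names are limited to 24 characters
  properties: {
    displayName: 'Enforce cost attribution tags'
    policyDefinitionId: tagInitiative.id
    enforcementMode: 'Default'
  }
}
```

Subscriptions created under the management group later inherit the assignment without any further deployment.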

⚠ 3. Manual Portal Deployments for Production Changes

Clicking through the Azure portal to create or modify production resources. Manual changes are not auditable, not reproducible, and not reviewable. Configuration drift is undetectable until a failure exposes it.

↺ Correct Approach

All production resources are defined in Bicep or Terraform and deployed through Azure DevOps or GitHub Actions pipelines. Run az deployment group what-if (or terraform plan) as a required pipeline gate before every apply. Direct portal access to production is read-only.

⚠ 4. No Budget Alerts or Cost Anomaly Detection

Relying on the monthly invoice as the primary cost signal. Cloud costs can spike by 10× in hours from a runaway autoscale event, a data egress misconfiguration, or a forgotten GPU VM; discovering this 30 days later has no remediation value.

↺ Correct Approach

Create a monthly budget with forecasted-80% and actual-100% email notifications for every workload subscription. Enable Cost Management anomaly detection alerts. Review Advisor cost recommendations weekly as a team ritual.

⚠ 5. Skipping the Azure Well-Architected Review Before Go-Live

Treating the Well-Architected Review as optional documentation rather than a structured risk gate. Architectures that skip the review regularly go live with High Risk Issues in Security and Reliability that are far cheaper to address before production than after an incident.

↺ Correct Approach

Run the Azure Well-Architected Review at aka.ms/azurewaf at project start (shape the architecture), before go-live (final risk gate), and annually (detect drift). Treat High Risk Issues as P1 blockers for go-live sign-off. Track Medium Risk Issues in the team backlog with owners and target quarters.

References

  1. Azure Well-Architected Framework — https://learn.microsoft.com/azure/well-architected/
  2. Azure Well-Architected Review tool — https://aka.ms/azurewaf
  3. Azure Advisor overview — https://learn.microsoft.com/azure/advisor/advisor-overview
  4. Availability Zones overview — https://learn.microsoft.com/azure/reliability/availability-zones-overview
  5. Azure Cost Management overview — https://learn.microsoft.com/azure/cost-management-billing/costs/overview-cost-management
  6. Microsoft Cloud Security Benchmark — https://learn.microsoft.com/security/benchmark/azure/introduction
  7. Azure Policy built-in definitions — https://learn.microsoft.com/azure/governance/policy/samples/built-in-policies
  8. Azure Front Door documentation — https://learn.microsoft.com/azure/frontdoor/front-door-overview
  9. Bicep documentation — https://learn.microsoft.com/azure/azure-resource-manager/bicep/overview
  10. Portal: AWS Well-Architected comparison
Ascendion Engineering Knowledge Base