Lifecycle Phases

Platform Lifecycle governance on Azure follows a five-phase model. Each phase maps to specific IDP modules, Terraform modules, Azure services, and compliance controls that ensure infrastructure is provisioned, operated, and retired consistently.

  • Plan: Define & Design
  • Provision: Deploy IaC
  • Scale: Grow & Optimize
  • Maintain: Patch & Upgrade
  • Retire: Deprecate & Remove

Lifecycle phases explained

What is done in each phase of the platform lifecycle on Azure.

Plan

Requirements, design, and approval for new or changed infrastructure. Teams define what they need, align with golden path templates, and get required approvals. Outputs include Terraform module selection and environment strategy.

Provision

Infrastructure is created using Terraform (AzureRM). Resources are deployed via pipelines to dev, staging, or production. Includes applying IaC, configuring networking and integrations, and validation.

Scale

Capacity and performance are adjusted via quotas, auto-scaling, and capacity planning. Platform versioning and upgrade policies ensure consistent behavior across environments.

Maintain

Ongoing operations: monitoring, patching, upgrades, and incident response. Teams apply security updates and follow deprecation and upgrade runbooks. Compliance and audit controls are maintained.

Retire

Infrastructure is decommissioned when no longer needed. Data is archived or migrated, resources are destroyed via Terraform, and dependencies are updated following documented procedures.

Target Platform

This specification targets Microsoft Azure with Terraform (AzureRM provider) as the IaC toolchain, AKS (Azure Kubernetes Service) as the primary compute platform, and Azure Policy + Defender for Cloud for governance.

Module Summary

Azure-Native Differentiators
  • Azure Deployment Environments (ADE) — Self-service environment provisioning through DevCenter
  • Microsoft Dev Box — Cloud-powered developer workstations managed via DevCenter
  • Azure Managed Grafana — Fully managed observability dashboards
  • Microsoft Defender for Cloud — Unified security posture & compliance scoring
  • Azure DevCenter — Centralized project organization for developer teams

01 Infrastructure-as-Code Template Management & Golden Path Definitions

Golden Path Philosophy

Golden paths are opinionated but optional infrastructure templates that encode organizational best practices. They provide the fastest, most secure route to production — while allowing teams to diverge when justified.

  • Voluntary adoption target: >80% of new services should use a golden path
  • Teams can override any default via escape hatches (override variables)
  • High escape-hatch usage (>20% of deployments) triggers a platform review
  • Golden paths are maintained by the Platform Engineering team with a published SLA

IaC Standards

All infrastructure is defined in Terraform using the AzureRM provider. Modules follow semantic versioning (SemVer) and are published to the Terraform Cloud private registry or Azure DevOps Artifacts.

  • Provider pinning: azurerm ~> 3.90
  • State backend: Azure Storage (blob) with state locking via lease
  • Workspace strategy: one workspace per environment per service
  • Plan/apply runs through CI/CD — no manual terraform apply in production
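The pinning rule above relies on Terraform's pessimistic constraint operator (~>), which lets only the rightmost listed version component increase. The following is a minimal Python sketch of that rule for illustration. The function names are hypothetical, it assumes the constraint lists at least two numeric components, and it ignores pre-release suffixes.

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a dotted version string into integer components."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(constraint: str, candidate: str) -> bool:
    """Return True if `candidate` satisfies a `~> X.Y[.Z]` constraint."""
    base = parse(constraint.removeprefix("~>").strip())
    cand = parse(candidate)
    # Pad both sides so "3.90" and "3.90.0" compare as equal.
    width = max(len(base), len(cand))
    lower = base + (0,) * (width - len(base))
    cand = cand + (0,) * (width - len(cand))
    # Upper bound: increment the second-to-last listed component.
    upper = base[:-2] + (base[-2] + 1,)
    upper = upper + (0,) * (width - len(upper))
    return lower <= cand < upper


if __name__ == "__main__":
    for version in ("3.89.0", "3.90.0", "3.117.0", "4.0.0"):
        print(version, satisfies_pessimistic("~> 3.90", version))
```

Under this reading, azurerm ~> 3.90 accepts any 3.x release from 3.90 onward and rejects 4.0, which is why a major provider upgrade needs an explicit change to the pin.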

Module Structure

Every Terraform module follows a standard layout:

Standard Module Layout
module-name/
├── main.tf           # Primary resource definitions
├── variables.tf      # Input variables with descriptions
├── outputs.tf        # Output values
├── versions.tf       # Provider and Terraform version constraints
├── README.md         # Usage documentation
└── examples/
    ├── basic/        # Minimal working example
    └── complete/     # Full-featured example

Naming Conventions

Resources follow the pattern: <org>-<resource-type>-<purpose>

Example | Resource
acme-aks-cluster | AKS cluster module
acme-postgresql-flexible | PostgreSQL Flexible Server module
acme-storage-datalake | Data Lake storage module
acme-keyvault-service | Key Vault module

Azure Verified Modules (AVM) [Recommended]

Encourage use of Azure Verified Modules from the AVM registry as base modules, wrapped with organization-specific defaults. AVM modules are maintained by Microsoft and community contributors, follow consistent patterns, and include built-in testing.

Review Process

  • All module changes require PR review from the Platform team
  • Production infrastructure changes require 2 approvals
  • Automated plan output attached to PR for review
  • Sentinel/OPA policy checks run on every plan

Escape Hatches

Override variables allow teams to diverge from golden path defaults. Every override is tracked and reported. If a module's escape-hatch usage exceeds 20%, the Platform team reviews whether the golden path needs updating.
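As an illustration of the 20% review trigger, the sketch below computes the share of a module's deployments that set at least one override variable. The Deployment record and its fields are hypothetical; the source does not describe how override usage is actually tracked or stored.

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.20  # from the text: usage above 20% triggers a review


@dataclass
class Deployment:
    module: str
    overrides: list[str] = field(default_factory=list)  # override variables that were set


def escape_hatch_rate(deployments: list[Deployment], module: str) -> float:
    """Fraction of a module's deployments that used at least one override."""
    relevant = [d for d in deployments if d.module == module]
    if not relevant:
        return 0.0
    return sum(1 for d in relevant if d.overrides) / len(relevant)


def needs_review(deployments: list[Deployment], module: str) -> bool:
    """True when escape-hatch usage strictly exceeds the review threshold."""
    return escape_hatch_rate(deployments, module) > REVIEW_THRESHOLD


if __name__ == "__main__":
    history = [
        Deployment("web-service", ["replicas"]),
        Deployment("web-service", ["cpu_request", "memory_request"]),
        Deployment("web-service"),
        Deployment("web-service"),
        Deployment("web-service"),
    ]
    print(escape_hatch_rate(history, "web-service"), needs_review(history, "web-service"))
```

The comparison is strict, so a module sitting at exactly 20% does not trigger a review; whether the boundary should be inclusive is a policy choice the source leaves open.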

Golden Path: Web Service (AKS Deployment) [Required]

golden-paths/web-service/main.tf
# golden-paths/web-service/main.tf
module "web_service" {
  source  = "app.terraform.io/acme/web-service/azurerm"
  version = "~> 2.0"

  service_name    = var.service_name
  team            = var.team
  container_image = var.container_image
  environment     = var.environment  # dev | staging | production

  # AKS cluster reference
  aks_cluster_name    = var.aks_cluster_name
  aks_resource_group  = var.aks_resource_group

  # Optional overrides
  replicas        = var.replicas        # default: based on environment
  cpu_request     = var.cpu_request     # default: based on environment (see Environment Defaults)
  memory_request  = var.memory_request  # default: based on environment (see Environment Defaults)
  custom_domain   = var.custom_domain
}

variables.tf
variable "service_name" {
  description = "Name of the service"
  type        = string
}
variable "team" {
  description = "Owning team"
  type        = string
}
variable "container_image" {
  description = "Container image URI"
  type        = string
}
variable "environment" {
  description = "Target environment"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Must be dev, staging, or production."
  }
}
variable "aks_cluster_name" {
  description = "AKS cluster name"
  type        = string
}
variable "aks_resource_group" {
  description = "Resource group containing AKS cluster"
  type        = string
}
variable "replicas" {
  description = "Number of replicas (overrides env default)"
  type        = number
  default     = null
}
variable "cpu_request" {
  description = "CPU request per pod"
  type        = string
  default     = null
}
variable "memory_request" {
  description = "Memory request per pod"
  type        = string
  default     = null
}
variable "custom_domain" {
  description = "Custom domain for ingress"
  type        = string
  default     = null
}

Environment Defaults

Parameter | Dev | Staging | Production
replicas | 1 | 2 | 3
min_replicas | 1 | 2 | 3
max_replicas | 2 | 4 | 10
cpu_request | 100m | 250m | 500m
memory_request | 128Mi | 256Mi | 512Mi
HPA enabled | false | true | true
PDB enabled | false | false | true
Monitoring alerts | none | team-webhook | PagerDuty
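The optional override variables default to null, so one plausible way for the module to pick effective values is to start from the environment defaults and apply only the overrides that were actually set. The Python sketch below illustrates that resolution order. It is an assumption about the module's internals, not a description of them, and the function and dictionary names are hypothetical.

```python
# Subset of the environment defaults listed in this section.
ENV_DEFAULTS = {
    "dev": {"replicas": 1, "cpu_request": "100m", "memory_request": "128Mi"},
    "staging": {"replicas": 2, "cpu_request": "250m", "memory_request": "256Mi"},
    "production": {"replicas": 3, "cpu_request": "500m", "memory_request": "512Mi"},
}


def resolve_settings(environment: str, overrides: dict) -> dict:
    """Merge per-environment defaults with non-null overrides."""
    if environment not in ENV_DEFAULTS:
        raise ValueError("Must be dev, staging, or production.")
    defaults = ENV_DEFAULTS[environment]
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown override(s): {sorted(unknown)}")
    resolved = dict(defaults)
    # None mirrors a Terraform variable left at `default = null`.
    resolved.update({k: v for k, v in overrides.items() if v is not None})
    return resolved


if __name__ == "__main__":
    print(resolve_settings("staging", {"replicas": 4, "cpu_request": None}))
```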

Generated: deployment.yaml
# Generated: deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ service_name }}
  namespace: {{ team }}
  labels:
    app.kubernetes.io/name: {{ service_name }}
    app.kubernetes.io/managed-by: terraform
    acme.com/team: {{ team }}
    acme.com/golden-path: web-service-v2
spec:
  replicas: {{ replicas }}
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ service_name }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ service_name }}
    spec:
      containers:
        - name: {{ service_name }}
          image: {{ container_image }}
          resources:
            requests:
              cpu: {{ cpu_request }}
              memory: {{ memory_request }}
            limits:
              cpu: {{ cpu_limit }}
              memory: {{ memory_limit }}
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080

Golden Path: Data Store (PostgreSQL Flexible Server) [Required]

golden-paths/data-store/main.tf
module "database" {
  source  = "app.terraform.io/acme/postgresql-flexible/azurerm"
  version = "~> 1.0"

  db_name         = var.db_name
  team            = var.team
  environment     = var.environment
  size            = var.size  # small | medium | large

  resource_group_name = var.resource_group_name
  vnet_id            = var.vnet_id
  subnet_id          = var.delegated_subnet_id

  # Defaults enforced by golden path:
  # - ssl_enforcement_enabled = true
  # - geo_redundant_backup = (production only)
  # - threat_detection_policy = enabled
  # - azure_ad_authentication = enabled
  # - private_dns_zone = auto-configured
}

Size Mapping

Size | SKU | Storage | IOPS | Backup Retention
small | B_Standard_B1ms | 32 GB | baseline | 7 days
medium | GP_Standard_D2ds_v4 | 128 GB | 3,000 | 14 days
large | MO_Standard_E4ds_v4 | 512 GB | 10,000 | 35 days

Golden Path: Azure Storage Account [Recommended]

Data Lake or static asset storage with encryption, private endpoint, and lifecycle management enforced by default.

golden-paths/storage-account/main.tf
module "storage" {
  source  = "app.terraform.io/acme/storage-account/azurerm"
  version = "~> 1.0"

  storage_name        = var.storage_name
  team                = var.team
  environment         = var.environment
  resource_group_name = var.resource_group_name
  location            = var.location

  # Enforced defaults:
  # - account_tier            = "Standard"
  # - account_replication     = env == "production" ? "GRS" : "LRS"
  # - min_tls_version         = "TLS1_2"
  # - enable_https_only       = true
  # - public_network_access   = false
  # - private_endpoint        = auto-configured
  # - lifecycle_management    = 90-day archive, 365-day delete
  # - blob_versioning         = true
}

Scaffolding Template (Backstage Software Template)

scaffolding/web-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: aks-web-service
  title: AKS Web Service
  description: Deploy a web service to AKS using the golden path
  tags: ['aks', 'web-service', 'golden-path', 'azure']
spec:
  owner: platform-engineering
  type: service
  parameters:
    - title: Service Details
      required: [service_name, team]
      properties:
        service_name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]{2,28}[a-z0-9]$'
        team:
          title: Team
          type: string
          enum: [payments, catalog, orders, identity]
        environment:
          title: Initial Environment
          type: string
          enum: [dev, staging]
          default: dev
  steps:
    - id: terraform-init
      name: Initialize Terraform
      action: terraform:init
    - id: terraform-apply
      name: Apply Infrastructure
      action: terraform:apply
    - id: register-catalog
      name: Register in Software Catalog
      action: catalog:register

Glossary

Term | Definition
Golden Path | An opinionated, supported template that encodes best practices for a common infrastructure pattern. Voluntary but strongly encouraged.
Escape Hatch | Override variables that let teams diverge from golden path defaults. Tracked and reported.
Terraform Module | A reusable, versioned unit of Terraform configuration that provisions a set of related resources.
Azure Verified Module (AVM) | Microsoft-maintained Terraform/Bicep modules from the AVM registry with standardized testing and interfaces.
Module Registry | Terraform Cloud private registry or Azure DevOps Artifacts where versioned modules are published.
Azure Policy | Azure-native governance engine that enforces organizational rules on resources at scale.
Infrastructure Drift | When the actual state of deployed resources diverges from the declared Terraform state.
Terraform State | A file tracking the mapping between configuration and real-world resources.
Remote Backend | Azure Storage account used to store Terraform state centrally with locking.
Workspace | An isolated Terraform state instance, typically one per environment per service.
Plan/Apply Cycle | The two-phase Terraform workflow: preview changes (plan), then execute (apply).
Azure DevCenter | Centralized management for developer environments and project resources.
Deployment Environment | An Azure DevCenter resource that allows developers to self-serve infrastructure stacks.

Demo: Self-Service Infrastructure Provisioning

This demo screen shows the developer experience for provisioning infrastructure through the IDP portal.

Wireframe Description

Screen Layout: A card grid displaying available golden path templates. Each card shows template name, version, description, team adoption percentage, and last updated date. Clicking a card opens a provisioning form that triggers a simulated Terraform plan with a progress timeline: Validating → Planning → Applying → Registering in Catalog → Complete.

Mock Data

mock-data/golden-paths.json
{
  "golden_paths": [
    {
      "id": "aks-web-service",
      "name": "AKS Web Service",
      "version": "2.1.0",
      "description": "Deploy a containerized web service to AKS",
      "adoption_pct": 87,
      "last_updated": "2024-01-10",
      "tags": ["aks", "web", "production-ready"]
    },
    {
      "id": "postgresql-flexible",
      "name": "PostgreSQL Flexible Server",
      "version": "1.3.0",
      "description": "Managed PostgreSQL with VNet integration",
      "adoption_pct": 92,
      "last_updated": "2024-01-08",
      "tags": ["database", "postgresql", "managed"]
    },
    {
      "id": "storage-datalake",
      "name": "Storage Account (Data Lake)",
      "version": "1.1.0",
      "description": "ADLS Gen2 with private endpoint and lifecycle",
      "adoption_pct": 78,
      "last_updated": "2024-01-05",
      "tags": ["storage", "data-lake", "analytics"]
    }
  ]
}

02 Environment Standardization Across Dev, Staging, and Production

Environment Parity

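Staging mirrors production in architecture, NSGs, RBAC, and deployment tooling. Only scale differs. This parity means that a change validated in staging is exercised against the same topology, access rules, and pipelines it will meet in production, so remaining differences in behavior should be attributable to scale or load.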

Environment Tiers

Tier | Purpose | Lifecycle | SLA
Dev | Exploration, testing, experimentation | Ephemeral (auto-teardown) | None
Staging | Pre-production validation | Long-lived | Internal
Production | Customer-facing workloads | Permanent | 99.9%+

Azure Deployment Environments (ADE) [Recommended]

Use Azure Deployment Environments for self-service environment provisioning via DevCenter. Platform engineers define environment types and templates; developers self-serve within guardrails.

Naming Conventions

Resources follow: <env>-<region>-<service>

  • Example: dev-eastus2-payment-api
  • Resource groups: rg-<env>-<service> (e.g., rg-dev-payment-api)

Ephemeral Environments

PR-based environments provisioned via ADE or Terraform workspaces. Auto-teardown on merge. Maximum TTL: 72 hours.
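The teardown rule combines two conditions: the pull request has merged, or the environment has outlived its TTL, which is capped at 72 hours. A minimal Python sketch of that decision follows. The function names and the idea of a requested TTL per environment are assumptions for illustration; the source only states the cap and the merge trigger.

```python
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=72)  # maximum TTL stated in the text


def expires_at(created_at: datetime, requested_ttl: timedelta) -> datetime:
    """Expiry time, with the requested TTL clamped to the 72-hour cap."""
    return created_at + min(requested_ttl, MAX_TTL)


def should_tear_down(created_at: datetime, requested_ttl: timedelta,
                     now: datetime, merged: bool = False) -> bool:
    """Tear down when the PR has merged or the (capped) TTL has elapsed."""
    return merged or now >= expires_at(created_at, requested_ttl)


if __name__ == "__main__":
    created = datetime(2024, 1, 14, 10, 0, tzinfo=timezone.utc)
    print(expires_at(created, timedelta(hours=96)))  # clamped to 72 hours
    print(should_tear_down(created, timedelta(hours=48), created + timedelta(hours=50)))
```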

Promotion Flow [Required]

Dev → Staging → Production. No direct-to-prod deployments. Each promotion requires passing the previous environment's validation gates.
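The promotion rule can be expressed as a simple check: a release may move only to the next environment in the sequence, and only if the current environment's validation gates have passed. The Python sketch below shows one way to encode this; the function name and the single boolean for gate status are simplifying assumptions.

```python
PROMOTION_ORDER = ["dev", "staging", "production"]


def can_promote(source: str, target: str, gates_passed: bool) -> bool:
    """Allow promotion only to the next environment, and only with gates passed."""
    if source not in PROMOTION_ORDER or target not in PROMOTION_ORDER:
        raise ValueError(f"unknown environment: {source!r} or {target!r}")
    is_next_step = PROMOTION_ORDER.index(target) == PROMOTION_ORDER.index(source) + 1
    return is_next_step and gates_passed


if __name__ == "__main__":
    print(can_promote("dev", "staging", gates_passed=True))
    print(can_promote("dev", "production", gates_passed=True))  # skips staging
```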

Secrets & Configuration

  • Secrets: Azure Key Vault. Naming: kv-<env>-<service>
  • Feature flags & config: Azure App Configuration. Keys: <service>/<env>/<key>
  • No secrets in code, environment variables, or App Configuration
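Key Vault names are limited to 3 to 24 characters (letters, digits, and hyphens, starting with a letter, ending with a letter or digit, with no consecutive hyphens), so the kv-<env>-<service> convention can overflow for longer combinations: kv-production-payment-api is 25 characters. The Python sketch below builds names from the conventions above and rejects ones that would fail at deployment time. The helper names are hypothetical, and how to shorten an overlong name (for example, abbreviating the environment) is left open because the source does not specify it.

```python
import re

# Azure Key Vault naming rule: 3-24 chars, letters/digits/hyphens,
# starts with a letter, ends with a letter or digit, no "--".
KEY_VAULT_NAME = re.compile(r"^[a-zA-Z](?!.*--)[a-zA-Z0-9-]{1,22}[a-zA-Z0-9]$")


def key_vault_name(environment: str, service: str) -> str:
    """Build kv-<env>-<service> and fail early if Azure would reject it."""
    name = f"kv-{environment}-{service}"
    if not KEY_VAULT_NAME.fullmatch(name):
        raise ValueError(f"{name!r} is not a valid Key Vault name ({len(name)} chars)")
    return name


def app_config_key(service: str, environment: str, key: str) -> str:
    """Build an App Configuration key as <service>/<env>/<key>."""
    return f"{service}/{environment}/{key}"


if __name__ == "__main__":
    print(key_vault_name("dev", "payment-api"))
    print(app_config_key("payment-api", "dev", "request-timeout"))
```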

Access Control

Environment | Access
Dev | All engineers (Contributor role)
Staging | Team members + CI/CD service principal
Production | CI/CD only + PIM for break-glass

Environment Definition (Dev)

environments/dev/main.tf
# environments/dev/main.tf
module "dev_environment" {
  source = "../../modules/environment"

  environment     = "dev"
  location        = "eastus2"
  subscription_id = var.dev_subscription_id

  vnet_config = {
    address_space   = ["10.0.0.0/16"]
    aks_subnet      = "10.0.1.0/24"
    db_subnet       = "10.0.2.0/24"
    services_subnet = "10.0.3.0/24"
  }

  aks_config = {
    kubernetes_version  = "1.29"
    node_pool_vm_size   = "Standard_D2s_v5"
    node_count          = 2
    max_count           = 4
    availability_zones  = []  # No AZ requirement for dev
  }

  monitoring = {
    log_analytics_retention_days = 30
    alert_action_group          = null  # No paging in dev
    grafana_enabled             = false
  }

  cost_management = {
    budget_amount    = 500
    budget_currency  = "USD"
    alert_thresholds = [80, 100]
  }

  tags = {
    Environment = "dev"
    CostCenter  = "engineering"
    ManagedBy   = "terraform"
  }
}

Environment Definition (Production)

environments/production/main.tf
# environments/production/main.tf
module "prod_environment" {
  source = "../../modules/environment"

  environment     = "production"
  location        = "eastus2"
  subscription_id = var.prod_subscription_id

  vnet_config = {
    address_space   = ["10.2.0.0/16"]
    aks_subnet      = "10.2.1.0/24"
    db_subnet       = "10.2.2.0/24"
    services_subnet = "10.2.3.0/24"
  }

  aks_config = {
    kubernetes_version  = "1.29"
    node_pool_vm_size   = "Standard_D4s_v5"
    node_count          = 3
    max_count           = 20
    availability_zones  = ["1", "2", "3"]
  }

  monitoring = {
    log_analytics_retention_days = 90
    alert_action_group          = "pagerduty-critical"
    grafana_enabled             = true
  }

  cost_management = {
    budget_amount    = 8000
    budget_currency  = "USD"
    alert_thresholds = [70, 85, 100]
  }

  tags = {
    Environment = "production"
    CostCenter  = "engineering"
    ManagedBy   = "terraform"
    Compliance  = "soc2"
  }
}

Azure Deployment Environment Definition

environment-definitions/dev-aks/environment.yaml
# environment-definitions/dev-aks/environment.yaml
name: dev-aks-environment
version: 1.0.0
summary: Development AKS environment with supporting services
description: |
  Provisions an AKS cluster with Azure Database for PostgreSQL,
  Key Vault, and Container Registry for development workloads.
runner: Terraform
templatePath: ./main.tf

parameters:
  - id: serviceName
    name: Service Name
    description: Name of the service to deploy
    type: string
    required: true
  - id: teamName
    name: Team
    description: Owning team
    type: string
    required: true
  - id: nodeCount
    name: Node Count
    description: Number of AKS nodes
    type: integer
    default: 2

Key Vault + App Configuration

modules/config-management/main.tf
# modules/config-management/main.tf
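# Supplies the tenant ID of the identity running Terraform.
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "service" {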
  name                = "kv-${var.environment}-${var.service_name}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  purge_protection_enabled   = var.environment == "production"
  soft_delete_retention_days = var.environment == "production" ? 90 : 7

  network_acls {
    default_action = "Deny"
    bypass         = "AzureServices"
    ip_rules       = var.allowed_ips
    virtual_network_subnet_ids = [var.aks_subnet_id]
  }
}

resource "azurerm_app_configuration" "service" {
  name                = "appconf-${var.environment}-${var.service_name}"
  location            = var.location
  resource_group_name = var.resource_group_name
  sku                 = "standard"

  identity {
    type = "SystemAssigned"
  }
}

Glossary

Term | Definition
Environment Parity | The principle that all environments share the same architecture, differing only in scale.
Ephemeral Environment | A short-lived environment created for PR validation, automatically torn down on merge.
Azure Deployment Environment (ADE) | A DevCenter resource enabling developers to self-serve pre-defined infrastructure stacks.
DevCenter | Azure's centralized management service for developer environments and project resources.
Promotion Flow | The mandatory sequence Dev → Staging → Production for all deployments.
Blue-Green Deployment | Running two identical production environments and switching traffic between them.
Canary Deployment | Rolling out changes to a small subset of traffic before full rollout.
Feature Flag | A runtime toggle in Azure App Configuration to enable/disable features per environment.
Configuration Drift | When actual resource configuration diverges from declared state.
Key Vault | Azure's managed secrets store with RBAC, soft-delete, and private endpoint support.
App Configuration | Azure service for centralized management of application settings and feature flags.
Privileged Identity Management (PIM) | Azure AD feature providing just-in-time privileged access with approval workflows.
Break-Glass Access | Emergency production access via PIM, requiring approval and full audit logging.
Environment Tier | Classification of environments by purpose: Dev, Staging, Production.
Resource Group Strategy | One resource group per environment per service for isolation and cost tracking.

Demo: Environment Dashboard & Provisioning

Screen Layout: A matrix/grid view with rows representing services and columns for environments (Dev, Staging, Production). Each cell shows deployment version, health status indicator, and last deployed timestamp. Drift detection with Azure Policy compliance indicators is shown per cell.

Actions include "Promote to Staging" and "Promote to Production" buttons with an approval workflow modal. An Azure Deployment Environments panel shows active environments with TTL countdown and a DevCenter project selector dropdown.

Mock Data

mock-data/environments.json
{
  "services": [
    {
      "name": "payment-api",
      "team": "payments",
      "environments": {
        "dev":        { "version": "3.2.1", "health": "healthy", "last_deployed": "2024-01-14T10:30:00Z", "policy_compliant": true },
        "staging":    { "version": "3.2.0", "health": "healthy", "last_deployed": "2024-01-13T14:00:00Z", "policy_compliant": true },
        "production": { "version": "3.1.8", "health": "healthy", "last_deployed": "2024-01-10T09:00:00Z", "policy_compliant": true }
      }
    },
    {
      "name": "catalog-service",
      "team": "catalog",
      "environments": {
        "dev":        { "version": "2.5.0", "health": "healthy", "last_deployed": "2024-01-14T11:00:00Z", "policy_compliant": true },
        "staging":    { "version": "2.4.3", "health": "degraded", "last_deployed": "2024-01-12T16:00:00Z", "policy_compliant": false },
        "production": { "version": "2.4.2", "health": "healthy", "last_deployed": "2024-01-08T08:00:00Z", "policy_compliant": true }
      }
    }
  ],
  "ephemeral_environments": [
    { "name": "pr-1234-payment-api", "owner": "jsmith", "ttl_hours": 48, "remaining_hours": 22 },
    { "name": "pr-1267-catalog-svc", "owner": "adoe", "ttl_hours": 72, "remaining_hours": 65 }
  ]
}

03 Platform Versioning, Upgrades, and Capacity Governance

Platform Versioning

All platform components are versioned: AKS cluster version, Terraform module versions, node image versions, and Helm chart versions. Teams pin to a major version (for example, version = "~> 2.0" on a module) and receive compatible minor and patch updates within that major version automatically.

AKS Upgrade Strategy [Required]

Follow Azure's AKS release calendar with environment-specific channels:

Environment | Channel | Behavior
Dev | rapid | Auto-upgrades to latest
Staging | stable | Auto-upgrades after stabilization
Production | none (manual) | Manual approval with 2-week testing window

Capacity Governance

  • Namespace quotas: CPU, memory, and storage limits per namespace in AKS
  • Subscription quotas: Azure-level limits monitored and raised proactively
  • Budget management: Azure Cost Management alerts at 80% and 100% thresholds

Cost Allocation [Required]

All resources tagged with: CostCenter, Team, Service, Environment. Use Azure Cost Management + Billing for team/service attribution.
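A tag check of this kind is easy to automate in a pipeline step ahead of Azure Policy enforcement. The Python sketch below reports which of the four required tags are absent or empty on a resource; the function name is hypothetical, and the sample tag set mirrors the tags block of the dev environment definition, which carries Environment and CostCenter but not Team or Service.

```python
REQUIRED_TAGS = ("CostCenter", "Team", "Service", "Environment")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tags that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_TAGS if not tags.get(name)]


if __name__ == "__main__":
    dev_tags = {"Environment": "dev", "CostCenter": "engineering", "ManagedBy": "terraform"}
    print(missing_tags(dev_tags))
```

Tag names are compared case-sensitively here as a deliberate simplification, so "costcenter" would be reported as missing "CostCenter".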

Deprecation Policy

  • 90-day sunset period with portal warnings
  • Migration guides published at announcement
  • Hard cutoff enforced after sunset period
  • 2-week advance notice for all changes, release notes in portal
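The sunset arithmetic is worth making explicit, because a 90-day window does not always land on a round calendar date. The Python sketch below derives the hard-cutoff date from an announcement date and reports the days remaining; the function names are hypothetical, and it assumes the 90 days are counted in calendar days from the announcement.

```python
from datetime import date, timedelta

SUNSET_PERIOD = timedelta(days=90)  # sunset period stated in the policy


def sunset_date(announced: date) -> date:
    """Hard-cutoff date: announcement date plus the 90-day sunset period."""
    return announced + SUNSET_PERIOD


def days_remaining(announced: date, today: date) -> int:
    """Days left before the cutoff, never negative."""
    return max((sunset_date(announced) - today).days, 0)


if __name__ == "__main__":
    announced = date(2024, 1, 1)
    print(sunset_date(announced))
    print(days_remaining(announced, date(2024, 3, 1)))
```

For an announcement on 2024-01-01 this yields a cutoff of 2024-03-31, so a notice that sets the sunset to the first of the following month grants one extra day.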

Platform Version Manifest

platform-manifest.yaml
# platform-manifest.yaml
apiVersion: platform.acme.com/v1
kind: PlatformManifest
metadata:
  version: "2024.Q1"
  effective_date: "2024-01-15"

components:
  terraform:
    version: "1.7.x"
    provider_azurerm: "~> 3.90"
    state_backend: "azurerm"

  aks:
    kubernetes_version: "1.29"
    node_image_channel: "NodeImage"
    upgrade_channel:
      dev: "rapid"
      staging: "stable"
      production: "none"

  databases:
    postgresql_flexible:
      versions: ["15", "16"]
    cosmos_db:
      api: ["SQL", "MongoDB"]

  monitoring:
    grafana_version: "10.x"
    log_analytics_agent: "ama-logs"
    prometheus: "azure-managed"

  security:
    defender_plans: ["Containers", "KeyVaults", "DNS", "Databases"]
    tls_minimum: "1.2"
    azure_policy_set: "acme-baseline-v2"

  container_registry:
    acr_sku: "Premium"
    geo_replication: ["eastus2", "westus2"]
    content_trust: true

AKS Namespace Resource Quota

quotas/namespace-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payment-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "20"
    services: "10"
    persistentvolumeclaims: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

Azure Budget & Cost Alert

modules/cost-management/main.tf
resource "azurerm_consumption_budget_resource_group" "team" {
  name              = "budget-${var.team}-${var.environment}"
  resource_group_id = var.resource_group_id

  amount     = var.monthly_budget
  time_grain = "Monthly"

  time_period {
    start_date = "2024-01-01T00:00:00Z"
  }

  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    contact_emails = var.team_leads
    contact_groups = [var.action_group_id]
  }

  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    contact_emails = var.team_leads
    contact_groups = [var.action_group_id, var.finops_action_group_id]
  }
}

Deprecation Notice Template

notices/deprecation.yaml
apiVersion: platform.acme.com/v1
kind: DeprecationNotice
metadata:
  id: "DEP-2024-003"
  severity: warning
spec:
  component: "aks"
  affected_version: "1.27"
  replacement_version: "1.29"
  announcement_date: "2024-01-01"
  sunset_date: "2024-04-01"
  migration_guide: "https://portal.acme.com/guides/aks-1.27-to-1.29"
  affected_teams: ["payments", "catalog", "orders"]
  breaking_changes:
    - "Deprecated PodSecurityPolicy removed"
    - "API version batch/v1beta1 removed"

AKS Upgrade Runbook Template

runbooks/aks-upgrade.yaml
apiVersion: platform.acme.com/v1
kind: UpgradeRunbook
metadata:
  name: aks-cluster-upgrade
  version: "1.0"
spec:
  pre_checks:
    - name: "Verify PDB coverage"
      command: "kubectl get pdb --all-namespaces"
    - name: "Check deprecated API usage"
      command: "kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis"
    - name: "Validate node pool health"
      command: "az aks nodepool list --cluster-name $CLUSTER --resource-group $RG"
  upgrade_steps:
    - name: "Upgrade control plane"
      command: "az aks upgrade --resource-group $RG --name $CLUSTER --kubernetes-version $TARGET --control-plane-only"
    - name: "Upgrade system node pool"
      command: "az aks nodepool upgrade --cluster-name $CLUSTER --name system --resource-group $RG --kubernetes-version $TARGET"
    - name: "Upgrade user node pools (rolling)"
      command: "az aks nodepool upgrade --cluster-name $CLUSTER --name $POOL --resource-group $RG --kubernetes-version $TARGET --max-surge 33%"
  post_checks:
    - name: "Verify all nodes ready"
      command: "kubectl get nodes"
    - name: "Run smoke tests"
      command: "./scripts/smoke-test.sh"
  rollback:
    - name: "Scale up old node pool"
    - name: "Cordon new nodes"
    - name: "Drain and delete new node pool"

Glossary

Term | Definition
SemVer | Semantic Versioning (MAJOR.MINOR.PATCH), the versioning standard for all modules.
Platform Manifest | YAML document declaring all approved component versions for a given quarter.
Resource Quota | Kubernetes object limiting total resource consumption per namespace.
LimitRange | Kubernetes object setting default and maximum resource requests/limits per container.
Capacity Governance | Policies and automation ensuring resource usage stays within budget and quotas.
Deprecation Notice | Formal announcement of a component version reaching end-of-support.
Sunset Period | The 90-day window between deprecation announcement and hard removal.
AKS Upgrade Channel | Azure setting (rapid/stable/none) controlling automatic Kubernetes version upgrades.
Node Image Channel | Azure setting controlling automatic OS image updates on AKS nodes.
Maintenance Window | Scheduled time range when AKS can apply automatic upgrades.
PodDisruptionBudget | Kubernetes object ensuring minimum pod availability during voluntary disruptions.
Azure Cost Management | Azure-native service for budget tracking, cost alerts, and spend analysis.
FinOps | Financial operations discipline for cloud cost optimization and accountability.

Demo: Platform Version & Capacity Dashboard

Screen Layout: Dashboard with AKS cluster version cards showing current/available/deprecated status. Namespace quota usage displays as progress bars (CPU, memory, pods used vs. allocated). An Azure Cost Management section shows team spend vs budget with burn rate trending. An AKS upgrade scheduler with maintenance window picker lets platform engineers schedule rolling upgrades. A deprecation timeline shows upcoming sunsets.

Mock Data

mock-data/platform-versions.json
{
  "clusters": [
    { "name": "aks-dev-eastus2", "current": "1.29.2", "available": "1.30.0", "channel": "rapid", "status": "up-to-date" },
    { "name": "aks-staging-eastus2", "current": "1.29.0", "available": "1.29.2", "channel": "stable", "status": "update-available" },
    { "name": "aks-prod-eastus2", "current": "1.28.5", "available": "1.29.2", "channel": "none", "status": "upgrade-required" }
  ],
  "quotas": {
    "payments": { "cpu_used": 6.2, "cpu_limit": 8, "memory_used_gi": 12.4, "memory_limit_gi": 16, "pods_used": 14, "pods_limit": 20 },
    "catalog":  { "cpu_used": 3.1, "cpu_limit": 8, "memory_used_gi": 5.8, "memory_limit_gi": 16, "pods_used": 8, "pods_limit": 20 }
  },
  "budgets": [
    { "team": "payments", "budget": 3000, "spent": 2340, "burn_rate": 78 },
    { "team": "catalog",  "budget": 2500, "spent": 1875, "burn_rate": 75 }
  ]
}
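The burn_rate values in the mock data equal spend divided by budget, expressed as a percentage (2340 / 3000 = 78%, 1875 / 2500 = 75%). The Python sketch below reproduces that figure and lists which alert thresholds a given spend has crossed, using the 80% and 100% thresholds and the GreaterThan comparison from the budget configuration. The function names are hypothetical, and treating burn_rate as a simple utilization percentage is an inference from the sample numbers.

```python
ALERT_THRESHOLDS = (80, 100)  # percent of budget, as in the budget notifications


def utilization_pct(spent: float, budget: float) -> int:
    """Spend as a whole-number percentage of budget."""
    return round(spent / budget * 100)


def triggered_thresholds(spent: float, budget: float,
                         thresholds: tuple[int, ...] = ALERT_THRESHOLDS) -> list[int]:
    """Thresholds strictly exceeded by current spend (GreaterThan semantics)."""
    pct = spent / budget * 100
    return [t for t in thresholds if pct > t]


if __name__ == "__main__":
    for team, budget, spent in (("payments", 3000, 2340), ("catalog", 2500, 1875)):
        print(team, utilization_pct(spent, budget), triggered_thresholds(spent, budget))
```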

04 Security, Compliance, and Cost Optimization

Security by Default [Required]

Golden paths enforce security without developer action:

  • Encryption at rest: Azure-managed keys (all services)
  • Encryption in transit: TLS 1.2+ enforced
  • Managed Identity: No service principal secrets in code
  • NSG lockdown: Deny-all default with explicit allow rules
  • Private endpoints: All PaaS services accessible only via VNet

Azure Policy [Required]

All compliance rules defined as Azure Policy definitions and initiatives. Policies assigned at management group or subscription level.

Effect | Use Case
Deny | Critical rules: block non-compliant resource creation
Audit | Advisory rules: flag non-compliance without blocking
DeployIfNotExists | Auto-remediate: deploy missing config (e.g., diagnostic settings)
Modify | Auto-fix: add missing tags or settings on create/update

Defender for Cloud

Enable Defender plans for Containers, Key Vaults, DNS, Databases, and Storage. Use the regulatory compliance dashboard for NIST 800-53, SOC 2, and CIS Azure Benchmarks. Custom compliance standards supported via policy initiatives.

Secrets Management [Required]

  • Azure Key Vault for all secrets
  • No secrets in code, env vars, or App Configuration
  • Managed Identity + Secrets Store CSI Driver for AKS pods
  • Auto-rotation via Key Vault rotation policies

Container Security

  • ACR with content trust enabled
  • Vulnerability scanning via Defender for Containers
  • Critical/High CVEs block deployment (admission controller)
  • Only approved base images from organization ACR
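
The container rules above amount to a gate with two checks: the image must come from the organization registry, and its scan must show no Critical or High CVEs. A minimal sketch of that decision logic follows, assuming a scan summary shape (`ImageScanResult`) and a registry host (`acmeplatform.azurecr.io`) that are both hypothetical. In a real cluster this logic would live in an admission controller fed by Defender for Containers findings.

```typescript
// Assumed summary of a vulnerability scan for one image.
interface ImageScanResult {
  image: string;    // full image reference: registry/repository:tag
  critical: number; // count of Critical CVEs
  high: number;     // count of High CVEs
}

// Hypothetical registry host; substitute the organization's ACR login server.
const APPROVED_REGISTRY = "acmeplatform.azurecr.io";

function admissionDecision(scan: ImageScanResult): { admit: boolean; reason: string } {
  const registryHost = scan.image.split("/")[0];
  if (registryHost !== APPROVED_REGISTRY) {
    return { admit: false, reason: "image is not from the approved registry" };
  }
  if (scan.critical > 0 || scan.high > 0) {
    return { admit: false, reason: `${scan.critical} critical / ${scan.high} high CVEs` };
  }
  return { admit: true, reason: "approved registry, no critical or high CVEs" };
}

console.log(admissionDecision({ image: "acmeplatform.azurecr.io/payment-api:1.4.0", critical: 0, high: 0 }).admit); // true
```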

Network Security

  • Private endpoints for all PaaS services
  • AKS with Azure CNI + Network Policies (Calico)
  • NSG flow logs enabled
  • Azure Firewall for egress filtering in production

Cost Optimization

  • Dev/Staging use spot node pools where feasible
  • Production uses Reserved Instances for baseline capacity
  • AKS cluster autoscaler or node auto-provisioning (Karpenter-based) for efficient scaling
  • Unused resources auto-flagged via Azure Advisor
  • Monthly FinOps review with team leads

Azure Policy: Encryption Required

policies/require-storage-encryption.json
{
  "properties": {
    "displayName": "Require encryption at rest for storage accounts",
    "policyType": "Custom",
    "mode": "All",
    "parameters": {},
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.Storage/storageAccounts"
          },
          {
            "field": "Microsoft.Storage/storageAccounts/encryption.services.blob.enabled",
            "notEquals": true
          }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}

Azure Policy: Approved AKS Node Sizes

policies/allowed-vm-sizes.json
{
  "properties": {
    "displayName": "Allowed AKS node pool VM sizes",
    "policyType": "Custom",
    "mode": "All",
    "parameters": {
      "allowedSizes": {
        "type": "Array",
        "metadata": {
          "displayName": "Allowed VM Sizes",
          "description": "Approved VM sizes for AKS node pools"
        },
        "defaultValue": [
          "Standard_D2s_v5",
          "Standard_D4s_v5",
          "Standard_D8s_v5",
          "Standard_E4ds_v5",
          "Standard_E8ds_v5"
        ]
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.ContainerService/managedClusters/agentPools"
          },
          {
            "field": "Microsoft.ContainerService/managedClusters/agentPools/vmSize",
            "notIn": "[parameters('allowedSizes')]"
          }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
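
A console or pipeline could run the same check before a request reaches Azure, so that users see the rejection early. The sketch below mirrors the `notIn` condition with a plain list lookup; `ALLOWED_NODE_SIZES` copies the policy's default values, and the function name `isNodeSizeDenied` is an assumption. It uses exact string comparison, which may differ from how Azure Policy compares strings.

```typescript
// Default values copied from the policy's allowedSizes parameter.
const ALLOWED_NODE_SIZES: string[] = [
  "Standard_D2s_v5",
  "Standard_D4s_v5",
  "Standard_D8s_v5",
  "Standard_E4ds_v5",
  "Standard_E8ds_v5",
];

// True when the requested size would hit the policy's deny effect.
function isNodeSizeDenied(vmSize: string, allowed: string[] = ALLOWED_NODE_SIZES): boolean {
  return allowed.indexOf(vmSize) === -1;
}

console.log(isNodeSizeDenied("Standard_D4s_v5")); // false
console.log(isNodeSizeDenied("Standard_NC6s_v3")); // true
```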

Azure Policy: Required Tags

policies/require-tags.json
{
  "properties": {
    "displayName": "Require mandatory resource tags",
    "policyType": "Custom",
    "mode": "Indexed",
    "parameters": {},
    "policyRule": {
      "if": {
        "anyOf": [
          { "field": "tags['CostCenter']", "exists": false },
          { "field": "tags['Team']", "exists": false },
          { "field": "tags['Service']", "exists": false },
          { "field": "tags['Environment']", "exists": false },
          { "field": "tags['ManagedBy']", "exists": false }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
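
The same five tag names can drive a tag compliance percentage like the one shown on the scorecard. The following sketch lists the tags a resource is missing and computes the share of fully tagged resources. The function names are assumptions, and the lookup is case-sensitive for simplicity.

```typescript
// Tag names taken from the policy rule above.
const REQUIRED_TAGS: string[] = ["CostCenter", "Team", "Service", "Environment", "ManagedBy"];

// Required tags that are absent from a resource's tag map.
function missingRequiredTags(tags: Record<string, string>): string[] {
  const present = Object.keys(tags);
  return REQUIRED_TAGS.filter((tag) => present.indexOf(tag) === -1);
}

// Percentage of resources that carry every required tag.
function tagCompliancePct(resources: Array<Record<string, string>>): number {
  if (resources.length === 0) return 100; // nothing to violate the rule
  const compliant = resources.filter((tags) => missingRequiredTags(tags).length === 0).length;
  return Math.round((compliant / resources.length) * 100);
}

console.log(missingRequiredTags({ Team: "payments", Service: "payment-api" }));
// [ 'CostCenter', 'Environment', 'ManagedBy' ]
```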

NSG Baseline (AKS)

modules/networking/nsg.tf
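resource "azurerm_network_security_group" "aks" {
  name                = "nsg-${var.service_name}-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name

  # The AzureLoadBalancer service tag covers Azure's health-probe source, not
  # client traffic forwarded by the load balancer. Client and intra-VNet sources
  # need their own explicit allow rules, because DenyAllInbound below is evaluated
  # before the platform default AllowVnetInBound rule (priority 65000).
  security_rule {
    name                       = "AllowHTTPSInbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = var.aks_subnet_prefix
    description                = "Allow Azure Load Balancer health probes on 443"
  }

  # Evaluated last among custom rules: 4096 is the highest priority number allowed.
  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
    description                = "Deny all other inbound"
  }

  tags = var.common_tags
}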
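
NSG rules are processed in priority order, lowest number first, and the first matching rule decides the outcome. The sketch below models that evaluation for the two inbound rules in this baseline. It is a simplification under stated assumptions: sources are compared as service tag strings only (no CIDR matching), ports are single values or `*`, and the platform default rules are reduced to a final implicit deny. All type and function names are illustrative.

```typescript
interface NsgRule {
  name: string;
  priority: number;
  access: "Allow" | "Deny";
  protocol: "Tcp" | "Udp" | "*";
  source: string;          // service tag or "*"
  destinationPort: string; // single port or "*"
}

interface InboundFlow { protocol: "Tcp" | "Udp"; source: string; destinationPort: number }

function ruleMatches(rule: NsgRule, flow: InboundFlow): boolean {
  const protocolOk = rule.protocol === "*" || rule.protocol === flow.protocol;
  const sourceOk = rule.source === "*" || rule.source === flow.source;
  const portOk = rule.destinationPort === "*" || rule.destinationPort === String(flow.destinationPort);
  return protocolOk && sourceOk && portOk;
}

// First match in ascending priority order wins; no match falls through to deny.
function evaluateInbound(rules: NsgRule[], flow: InboundFlow): "Allow" | "Deny" {
  const ordered = rules.slice().sort((a, b) => a.priority - b.priority);
  for (const rule of ordered) {
    if (ruleMatches(rule, flow)) return rule.access;
  }
  return "Deny";
}

const aksBaseline: NsgRule[] = [
  { name: "DenyAllInbound", priority: 4096, access: "Deny", protocol: "*", source: "*", destinationPort: "*" },
  { name: "AllowHTTPSInbound", priority: 100, access: "Allow", protocol: "Tcp", source: "AzureLoadBalancer", destinationPort: "443" },
];

console.log(evaluateInbound(aksBaseline, { protocol: "Tcp", source: "AzureLoadBalancer", destinationPort: 443 })); // Allow
console.log(evaluateInbound(aksBaseline, { protocol: "Tcp", source: "VirtualNetwork", destinationPort: 443 }));    // Deny
```

The second call shows the practical consequence of a deny-all rule at priority 4096: traffic from inside the VNet is rejected unless an explicit allow rule with a lower priority number is added.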

Defender for Cloud Custom Compliance

modules/compliance/initiative.tf
resource "azurerm_policy_set_definition" "compliance" {
  name         = "acme-security-standard-v1"
  policy_type  = "Custom"
  display_name = "Acme Security Standard v1"
  description  = "Internal security requirements for all Azure workloads"

  metadata = jsonencode({
    category = "Regulatory Compliance"
    ASC = {
      complianceStandard = {
        displayName = "Acme Security Standard v1"
        version     = "1.0"
      }
    }
  })

  policy_definition_group {
    name         = "ACME-1.1"
    display_name = "ACME-1.1: Data encrypted at rest"
    category     = "Data Protection"
  }

  policy_definition_group {
    name         = "ACME-1.2"
    display_name = "ACME-1.2: Private endpoints only"
    category     = "Network Security"
  }

  policy_definition_group {
    name         = "ACME-2.1"
    display_name = "ACME-2.1: Privileged MFA required"
    category     = "Identity"
  }

  policy_definition_reference {
    policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/6fac406b-40ca-413b-bf8e-0bf964659c25"
    reference_id         = "storageEncryption"
    policy_group_names   = ["ACME-1.1"]
  }
}

FinOps Cost Anomaly Alert

modules/finops/anomaly-alert.tf
resource "azurerm_cost_anomaly_alert" "finops" {
  name            = "anomaly-alert-${var.team}"
  display_name    = "Cost anomaly alert for ${var.team}"
  email_subject   = "Azure Cost Anomaly Detected - ${var.team}"
  email_addresses = var.finops_team_emails
  message         = "An unexpected cost increase was detected."

  subscription_id = var.subscription_id
}

Glossary

| Term | Definition |
|---|---|
| Azure Policy (Deny) | Blocks resource creation or update that violates the policy rule. |
| Azure Policy (Audit) | Flags non-compliant resources without blocking their creation. |
| Azure Policy (DeployIfNotExists) | Automatically deploys a related resource if missing (e.g., diagnostic settings). |
| Azure Policy (Modify) | Adds or corrects properties on resources at create/update time. |
| Policy Initiative | A collection of policy definitions grouped under a single assignment. |
| Defender for Cloud | Unified security posture management and threat protection for Azure resources. |
| Regulatory Compliance Dashboard | Defender for Cloud view mapping policy compliance to standards (NIST, SOC 2, CIS). |
| Managed Identity | Microsoft Entra ID (formerly Azure AD) identity assigned to a resource, eliminating the need for stored credentials. |
| Encryption at Rest | Data encrypted when stored on disk using Azure-managed or customer-managed keys. |
| Secrets Store CSI Driver | Kubernetes driver that mounts Key Vault secrets directly into pod volumes. |
| Content Trust (ACR) | Docker Content Trust for verifying image integrity and publisher identity. |
| FinOps | Practice of bringing financial accountability to cloud spend through visibility, optimization, and governance. |
| Reserved Instances | 1- or 3-year Azure VM commitments at discounted rates for predictable workloads. |
| Spot Node Pools | AKS node pools using Azure Spot VMs at significant discount for interruptible workloads. |
| NIST 800-53 | US federal information security standard with comprehensive security controls. |
| SOC 2 | Audit framework for service organizations covering security, availability, processing integrity, confidentiality, and privacy. |
| CIS Azure Benchmark | Center for Internet Security benchmark for Azure configuration best practices. |
| Network Security Group | Azure-native firewall rules filtering traffic to/from resources within a VNet. |
| Private Endpoint | Private IP address within a VNet for accessing Azure PaaS services without public internet. |
| Azure Firewall | Managed cloud firewall for controlling outbound traffic from VNets. |

Demo: Security & Compliance Scorecard

Screen Layout: A service-level compliance scorecard showing Defender for Cloud secure score, policy violations count, CVE count, tag compliance percentage, and secrets rotation status per service. A portfolio-level compliance heatmap (services × controls matrix) uses color coding for pass/fail/warning. A FinOps panel shows team spend vs. budget with cost trend charts and optimization recommendations from Azure Advisor.

Mock Data

mock-data/compliance.json
{
  "secure_score": 82,
  "services": [
    {
      "name": "payment-api",
      "policy_violations": 0,
      "cve_count": 2,
      "tag_compliance_pct": 100,
      "secrets_rotated": true,
      "defender_score": 95
    },
    {
      "name": "catalog-service",
      "policy_violations": 3,
      "cve_count": 7,
      "tag_compliance_pct": 85,
      "secrets_rotated": true,
      "defender_score": 78
    }
  ],
  "finops": {
    "total_budget": 15000,
    "total_spent": 11200,
    "recommendations": [
      { "type": "right-sizing", "savings": 320, "resource": "Standard_D8s_v5 -> Standard_D4s_v5" },
      { "type": "reserved-instance", "savings": 1200, "resource": "3x Standard_D4s_v5 (1yr)" }
    ]
  }
}
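
The heatmap needs a rule that turns each service's raw numbers into pass, warning, or fail. The specification does not define the thresholds, so the ones below are assumptions chosen only to illustrate the mapping: any policy violation or unrotated secret is a fail, and any open CVE or incomplete tagging is a warning. The sketch also totals the FinOps recommendation savings.

```typescript
type CellStatus = "pass" | "warning" | "fail";

// Shape of one entry in the "services" array of the mock data.
interface ServiceScore {
  name: string;
  policy_violations: number;
  cve_count: number;
  tag_compliance_pct: number;
  secrets_rotated: boolean;
  defender_score: number;
}

// Assumed thresholds; adjust to the organization's own standard.
function scorecardStatus(s: ServiceScore): CellStatus {
  if (s.policy_violations > 0 || !s.secrets_rotated) return "fail";
  if (s.cve_count > 0 || s.tag_compliance_pct < 100) return "warning";
  return "pass";
}

// Sum of the "savings" fields across recommendations.
function totalSavings(recommendations: Array<{ savings: number }>): number {
  return recommendations.reduce((sum, r) => sum + r.savings, 0);
}

const paymentApi: ServiceScore = {
  name: "payment-api", policy_violations: 0, cve_count: 2,
  tag_compliance_pct: 100, secrets_rotated: true, defender_score: 95,
};
console.log(scorecardStatus(paymentApi));                        // warning
console.log(totalSavings([{ savings: 320 }, { savings: 1200 }])); // 1520
```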

Unified Demo Application Specification

Acme Platform Console (Azure)

Tech Stack: React 18, TypeScript, Tailwind CSS, Recharts, Mock API layer

Key Screens

  1. Home Dashboard — Platform health, active golden paths, environment matrix, Defender secure score
  2. Golden Path Catalog — Card grid with self-service provisioning via Azure Deployment Environments
  3. Environment Manager — Environment matrix with Azure Policy compliance, promotion workflows, ADE panel
  4. Platform Versions — AKS version tracker, upgrade scheduler (rapid/stable/none channels), deprecation timeline
  5. Compliance Center — Defender for Cloud integration, service scorecards, FinOps dashboard

Data Model

data-model.ts
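// Core entities. Field types are inferred from the mock data and are assumptions.
interface Service { id: string; name: string; team: string; golden_path_id: string; environments: Environment[] }
interface Environment { id: string; name: string; tier: string; region: string; health: string; version: string; policy_compliant: boolean }
interface GoldenPathTemplate { id: string; name: string; version: string; description: string; adoption_pct: number; tags: string[] }
// Memory fields use the _gi suffix to match mock quota data (values in GiB).
interface ResourceQuota { namespace: string; cpu_used: number; cpu_limit: number; memory_used_gi: number; memory_limit_gi: number; pods_used: number; pods_limit: number }
interface ComplianceControl { id: string; name: string; category: string; status: string; framework: string }
interface PolicyViolation { id: string; resource_id: string; policy_name: string; severity: string; detected_at: string }
interface CostRecommendation { type: string; savings: number; resource: string }
interface CostReport { team: string; budget: number; spent: number; burn_rate: number; recommendations: CostRecommendation[] }
interface DeploymentEnvironment { id: string; name: string; owner: string; ttl_hours: number; remaining_hours: number; status: string }
interface DevCenterProject { id: string; name: string; environment_types: string[]; teams: string[] }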

Mock API Schema (REST Endpoints)

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/services | List all services with environment status |
| GET | /api/golden-paths | List available golden path templates |
| POST | /api/golden-paths/{id}/provision | Trigger provisioning workflow |
| GET | /api/environments | List all environments with health |
| POST | /api/environments/{id}/promote | Promote deployment to next tier |
| GET | /api/platform/manifest | Get current platform version manifest |
| GET | /api/clusters | List AKS clusters with version info |
| POST | /api/clusters/{id}/upgrade | Schedule AKS upgrade |
| GET | /api/quotas/{namespace} | Get namespace quota usage |
| GET | /api/compliance/score | Get Defender secure score |
| GET | /api/compliance/violations | List policy violations |
| GET | /api/costs/{team} | Get team cost report |
| GET | /api/deployment-environments | List active deployment environments |
| POST | /api/deployment-environments | Create new deployment environment |
| DELETE | /api/deployment-environments/{id} | Tear down deployment environment |
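
A mock API layer has to map an incoming method and path onto one of these endpoint templates and pull out the `{…}` parameters. The sketch below shows one way to do that with plain string splitting, using three of the endpoints from the table. The names `MockRoute`, `matchTemplate`, and `resolveRoute` are assumptions; a real implementation might rely on a mocking library instead.

```typescript
type HttpMethod = "GET" | "POST" | "DELETE";
interface MockRoute { method: HttpMethod; template: string }

// A subset of the endpoints listed above.
const MOCK_ROUTES: MockRoute[] = [
  { method: "GET", template: "/api/quotas/{namespace}" },
  { method: "POST", template: "/api/clusters/{id}/upgrade" },
  { method: "DELETE", template: "/api/deployment-environments/{id}" },
];

// Returns the extracted parameters, or null when the path does not fit the template.
function matchTemplate(template: string, path: string): Record<string, string> | null {
  const templateParts = template.split("/");
  const pathParts = path.split("/");
  if (templateParts.length !== pathParts.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < templateParts.length; i++) {
    const segment = templateParts[i] ?? "";
    const value = pathParts[i] ?? "";
    const isParam = segment.charAt(0) === "{" && segment.charAt(segment.length - 1) === "}";
    if (isParam) {
      if (value === "") return null; // a parameter must not be empty
      params[segment.slice(1, -1)] = value;
    } else if (segment !== value) {
      return null;
    }
  }
  return params;
}

function resolveRoute(method: HttpMethod, path: string): { route: MockRoute; params: Record<string, string> } | null {
  for (const route of MOCK_ROUTES) {
    if (route.method !== method) continue;
    const params = matchTemplate(route.template, path);
    if (params !== null) return { route, params };
  }
  return null;
}

console.log(resolveRoute("GET", "/api/quotas/payments")?.params); // { namespace: 'payments' }
```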

AWS → Azure Service Mapping

| Capability | AWS Version | Azure Version |
|---|---|---|
| Compute | ECS Fargate | AKS (Azure Kubernetes Service) |
| IaC | Terraform + AWS provider | Terraform + AzureRM provider |
| Policy Engine | Sentinel (Terraform Cloud) | Azure Policy + Defender for Cloud |
| Secrets | AWS Secrets Manager | Azure Key Vault |
| Config | SSM Parameter Store | Azure App Configuration |
| Monitoring | CloudWatch | Azure Monitor + Managed Grafana |
| Container Registry | ECR | Azure Container Registry (ACR) |
| Compliance | AWS Config | Defender for Cloud Regulatory Compliance |
| Cost Management | AWS Cost Explorer | Azure Cost Management + Billing |
| Self-Service Envs | (custom) | Azure Deployment Environments |
| Dev Workstations | (none) | Microsoft Dev Box |