Platform Lifecycle — IDP Module Specification
Developer guidelines, templates, and specifications for managing platform infrastructure on Microsoft Azure through an Internal Developer Portal
Lifecycle Phases
Platform Lifecycle governance on Azure follows a five-phase model. Each phase maps to specific IDP modules, Terraform modules, Azure services, and compliance controls that ensure infrastructure is provisioned, operated, and retired consistently.
Lifecycle phases explained
What is done in each phase of the platform lifecycle on Azure.
Plan
Requirements, design, and approval for new or changed infrastructure. Teams define what they need, align with golden path templates, and get required approvals. Outputs include Terraform module selection and environment strategy.
Provision
Infrastructure is created using Terraform (AzureRM). Resources are deployed via pipelines to dev, staging, or production. Includes applying IaC, configuring networking and integrations, and validation.
Scale
Capacity and performance are adjusted via quotas, auto-scaling, and capacity planning. Platform versioning and upgrade policies ensure consistent behavior across environments.
Maintain
Ongoing operations: monitoring, patching, upgrades, and incident response. Teams apply security updates and follow deprecation and upgrade runbooks. Compliance and audit controls are maintained.
Retire
Infrastructure is decommissioned when no longer needed. Data is archived or migrated, resources are destroyed via Terraform, and dependencies are updated following documented procedures.
This specification targets Microsoft Azure with Terraform (AzureRM provider) as the IaC toolchain, AKS (Azure Kubernetes Service) as the primary compute platform, and Azure Policy + Defender for Cloud for governance.
Module Summary
IaC Template Management & Golden Paths
Terraform module standards, golden path definitions, scaffolding templates, and the Backstage integration for self-service provisioning.
Environment Standardization
Dev, Staging, and Production environment parity. Azure Deployment Environments, Key Vault, App Configuration, and promotion workflows.
Versioning, Upgrades & Capacity
Platform manifests, AKS upgrade channels, resource quotas, cost management, and deprecation governance.
Security, Compliance & Cost Optimization
Azure Policy, Defender for Cloud, Key Vault secrets management, container security, network isolation, and FinOps.
- Azure Deployment Environments (ADE) — Self-service environment provisioning through DevCenter
- Microsoft Dev Box — Cloud-powered developer workstations managed via DevCenter
- Azure Managed Grafana — Fully managed observability dashboards
- Microsoft Defender for Cloud — Unified security posture & compliance scoring
- Azure DevCenter — Centralized project organization for developer teams
01 Infrastructure-as-Code Template Management & Golden Path Definitions
Golden Path Philosophy
Golden paths are opinionated but optional infrastructure templates that encode organizational best practices. They provide the fastest, most secure route to production — while allowing teams to diverge when justified.
- Voluntary adoption target: >80% of new services should use a golden path
- Teams can override any default via escape hatches (override variables)
- High escape-hatch usage (>20% of deployments) triggers a platform review
- Golden paths are maintained by the Platform Engineering team with a published SLA
IaC Standards
All infrastructure is defined in Terraform using the AzureRM provider. Modules follow semantic versioning (SemVer) and are published to the Terraform Cloud private registry or Azure DevOps Artifacts.
- Provider pinning: azurerm ~> 3.90
- State backend: Azure Storage (blob) with state locking via lease
- Workspace strategy: one workspace per environment per service
- Plan/apply runs through CI/CD; no manual terraform apply in production
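The provider pin above uses Terraform's pessimistic constraint operator, which lets only the rightmost listed version component increase. The following is a minimal Python sketch of that rule for plain numeric versions; the function name is hypothetical, and pre-release suffixes and single-component constraints are deliberately not handled.

```python
def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Return True if `version` satisfies a Terraform-style `~>` constraint.

    Illustrative helper (not part of Terraform): `~> 3.90` is treated as
    `>= 3.90, < 4.0`, and `~> 1.0.4` as `>= 1.0.4, < 1.1.0`.
    """
    base = [int(part) for part in constraint.replace("~>", "").strip().split(".")]
    if len(base) < 2:
        raise ValueError("this sketch expects at least two version components")
    candidate = [int(part) for part in version.split(".")]

    # Pad with zeros so that 3.90 and 3.90.0 compare as equal.
    width = max(len(base), len(candidate))
    lower = base + [0] * (width - len(base))
    candidate = candidate + [0] * (width - len(candidate))

    # Upper bound: increment the second-to-last listed component, zero the rest.
    upper = base[:-2] + [base[-2] + 1]
    upper = upper + [0] * (width - len(upper))

    return lower <= candidate < upper
```

Under this rule a 3.x provider release at or above 3.90 is accepted, while any 4.x release is rejected until the pin is changed deliberately.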
Module Structure
Every Terraform module follows a standard layout:
module-name/
├── main.tf # Primary resource definitions
├── variables.tf # Input variables with descriptions
├── outputs.tf # Output values
├── versions.tf # Provider and Terraform version constraints
├── README.md # Usage documentation
└── examples/
├── basic/ # Minimal working example
└── complete/ # Full-featured example
Naming Conventions
Resources follow the pattern: <org>-<resource-type>-<purpose>
| Example | Resource |
|---|---|
| acme-aks-cluster | AKS cluster module |
| acme-postgresql-flexible | PostgreSQL Flexible Server module |
| acme-storage-datalake | Data Lake storage module |
| acme-keyvault-service | Key Vault module |
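A naming pattern like this is easy to check in CI. The sketch below parses a module name into its three parts. It assumes lowercase alphanumeric segments and allows extra hyphens only inside the purpose segment; neither rule is stated in this specification, and the names `ModuleName` and `parse_module_name` are hypothetical.

```python
import re
from typing import NamedTuple


class ModuleName(NamedTuple):
    org: str
    resource_type: str
    purpose: str


# Assumption: segments are lowercase alphanumerics; only the purpose may contain hyphens.
_MODULE_NAME = re.compile(r"^([a-z0-9]+)-([a-z0-9]+)-([a-z0-9]+(?:-[a-z0-9]+)*)$")


def parse_module_name(name: str) -> ModuleName:
    """Split `<org>-<resource-type>-<purpose>` or raise ValueError."""
    match = _MODULE_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow <org>-<resource-type>-<purpose>")
    return ModuleName(*match.groups())
```

For example, `parse_module_name("acme-postgresql-flexible")` yields the org `acme`, the resource type `postgresql`, and the purpose `flexible`.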
Azure Verified Modules (AVM) Recommended
Encourage use of Azure Verified Modules from the AVM registry as base modules, wrapped with organization-specific defaults. AVM modules are maintained by Microsoft and community contributors, follow consistent patterns, and include built-in testing.
Review Process
- All module changes require PR review from the Platform team
- Production infrastructure changes require 2 approvals
- Automated plan output attached to PR for review
- Sentinel/OPA policy checks run on every plan
Escape Hatches
Override variables allow teams to diverge from golden path defaults. Every override is tracked and reported. If overrides appear in more than 20% of a module's deployments, the Platform team reviews whether the golden path needs updating.
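The 20% review trigger can be computed from deployment records. This is a rough sketch, assuming each record lists the override variables that were set; the `Deployment` record shape and function names are hypothetical and not part of any existing tooling.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.20  # from the spec: more than 20% of deployments


@dataclass(frozen=True)
class Deployment:
    module: str
    overrides: tuple = ()  # names of override variables that were set


def escape_hatch_rate(deployments, module: str) -> float:
    """Fraction of a module's deployments that set at least one override."""
    relevant = [d for d in deployments if d.module == module]
    if not relevant:
        return 0.0
    overridden = sum(1 for d in relevant if d.overrides)
    return overridden / len(relevant)


def needs_platform_review(deployments, module: str) -> bool:
    """True when escape-hatch usage is strictly above the review threshold."""
    return escape_hatch_rate(deployments, module) > REVIEW_THRESHOLD
```

The comparison is strict, so a module sitting at exactly 20% does not trigger a review; whether the boundary should be inclusive is a policy decision this sketch does not settle.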
Golden Path: Web Service (AKS Deployment) Required
# golden-paths/web-service/main.tf
module "web_service" {
source = "app.terraform.io/acme/web-service/azurerm"
version = "~> 2.0"
service_name = var.service_name
team = var.team
container_image = var.container_image
environment = var.environment # dev | staging | production
# AKS cluster reference
aks_cluster_name = var.aks_cluster_name
aks_resource_group = var.aks_resource_group
# Optional overrides
replicas = var.replicas # default: based on environment
cpu_request = var.cpu_request # default: based on environment
memory_request = var.memory_request # default: based on environment
custom_domain = var.custom_domain
}
variable "service_name" {
description = "Name of the service"
type = string
}
variable "team" {
description = "Owning team"
type = string
}
variable "container_image" {
description = "Container image URI"
type = string
}
variable "environment" {
description = "Target environment"
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Must be dev, staging, or production."
}
}
variable "aks_cluster_name" {
description = "AKS cluster name"
type = string
}
variable "aks_resource_group" {
description = "Resource group containing AKS cluster"
type = string
}
variable "replicas" {
description = "Number of replicas (overrides env default)"
type = number
default = null
}
variable "cpu_request" {
description = "CPU request per pod"
type = string
default = null
}
variable "memory_request" {
description = "Memory request per pod"
type = string
default = null
}
variable "custom_domain" {
description = "Custom domain for ingress"
type = string
default = null
}
Environment Defaults
| Parameter | Dev | Staging | Production |
|---|---|---|---|
| replicas | 1 | 2 | 3 |
| min_replicas | 1 | 2 | 3 |
| max_replicas | 2 | 4 | 10 |
| cpu_request | 100m | 250m | 500m |
| memory_request | 128Mi | 256Mi | 512Mi |
| HPA enabled | false | true | true |
| PDB enabled | false | false | true |
| Monitoring alerts | none | team-webhook | PagerDuty |
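The override variables default to null, which means "use the environment default". A minimal sketch of that resolution follows, using three rows from the table above; the function name and dictionary layout are assumptions made for illustration, not the module's actual implementation.

```python
# Values copied from the Environment Defaults table (subset of rows).
ENVIRONMENT_DEFAULTS = {
    "dev": {"replicas": 1, "cpu_request": "100m", "memory_request": "128Mi"},
    "staging": {"replicas": 2, "cpu_request": "250m", "memory_request": "256Mi"},
    "production": {"replicas": 3, "cpu_request": "500m", "memory_request": "512Mi"},
}


def resolve_settings(environment: str, **overrides) -> dict:
    """Merge per-environment defaults with any non-None overrides."""
    if environment not in ENVIRONMENT_DEFAULTS:
        raise ValueError("Must be dev, staging, or production.")
    settings = dict(ENVIRONMENT_DEFAULTS[environment])

    unknown = set(overrides) - set(settings)
    if unknown:
        raise KeyError(f"unknown override(s): {sorted(unknown)}")

    for key, value in overrides.items():
        if value is not None:  # None mirrors Terraform's null: keep the default
            settings[key] = value
    return settings
```

Calling `resolve_settings("dev", replicas=2)` keeps the dev CPU and memory requests and changes only the replica count.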
# Generated: deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ service_name }}
namespace: {{ team }}
labels:
app.kubernetes.io/name: {{ service_name }}
app.kubernetes.io/managed-by: terraform
acme.com/team: {{ team }}
acme.com/golden-path: web-service-v2
spec:
replicas: {{ replicas }}
selector:
matchLabels:
app.kubernetes.io/name: {{ service_name }}
template:
metadata:
labels:
app.kubernetes.io/name: {{ service_name }}
spec:
containers:
- name: {{ service_name }}
image: {{ container_image }}
resources:
requests:
cpu: {{ cpu_request }}
memory: {{ memory_request }}
limits:
cpu: {{ cpu_limit }}
memory: {{ memory_limit }}
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 15
readinessProbe:
httpGet:
path: /ready
port: 8080
Golden Path: Data Store (PostgreSQL Flexible Server) Required
module "database" {
source = "app.terraform.io/acme/postgresql-flexible/azurerm"
version = "~> 1.0"
db_name = var.db_name
team = var.team
environment = var.environment
size = var.size # small | medium | large
resource_group_name = var.resource_group_name
vnet_id = var.vnet_id
subnet_id = var.delegated_subnet_id
# Defaults enforced by golden path:
# - ssl_enforcement_enabled = true
# - geo_redundant_backup = (production only)
# - threat_detection_policy = enabled
# - azure_ad_authentication = enabled
# - private_dns_zone = auto-configured
}
Size Mapping
| Size | SKU | Storage | IOPS | Backup Retention |
|---|---|---|---|---|
| small | B_Standard_B1ms | 32 GB | baseline | 7 days |
| medium | GP_Standard_D2ds_v4 | 128 GB | 3,000 | 14 days |
| large | MO_Standard_E4ds_v4 | 512 GB | 10,000 | 35 days |
Golden Path: Azure Storage Account Recommended
Data Lake or static asset storage with encryption, private endpoint, and lifecycle management enforced by default.
module "storage" {
source = "app.terraform.io/acme/storage-account/azurerm"
version = "~> 1.0"
storage_name = var.storage_name
team = var.team
environment = var.environment
resource_group_name = var.resource_group_name
location = var.location
# Enforced defaults:
# - account_tier = "Standard"
# - account_replication = env == "production" ? "GRS" : "LRS"
# - min_tls_version = "TLS1_2"
# - enable_https_only = true
# - public_network_access = false
# - private_endpoint = auto-configured
# - lifecycle_management = 90-day archive, 365-day delete
# - blob_versioning = true
}
Scaffolding Template (Backstage Software Template)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: aks-web-service
title: AKS Web Service
description: Deploy a web service to AKS using the golden path
tags: ['aks', 'web-service', 'golden-path', 'azure']
spec:
owner: platform-engineering
type: service
parameters:
- title: Service Details
required: [service_name, team]
properties:
service_name:
title: Service Name
type: string
pattern: '^[a-z][a-z0-9-]{2,28}[a-z0-9]$'
team:
title: Team
type: string
enum: [payments, catalog, orders, identity]
environment:
title: Initial Environment
type: string
enum: [dev, staging]
default: dev
steps:
- id: terraform-init
name: Initialize Terraform
action: terraform:init
- id: terraform-apply
name: Apply Infrastructure
action: terraform:apply
- id: register-catalog
name: Register in Software Catalog
action: catalog:register
Glossary
| Term | Definition |
|---|---|
| Golden Path | An opinionated, supported template that encodes best practices for a common infrastructure pattern. Voluntary but strongly encouraged. |
| Escape Hatch | Override variables that let teams diverge from golden path defaults. Tracked and reported. |
| Terraform Module | A reusable, versioned unit of Terraform configuration that provisions a set of related resources. |
| Azure Verified Module (AVM) | Microsoft-maintained Terraform/Bicep modules from the AVM registry with standardized testing and interfaces. |
| Module Registry | Terraform Cloud private registry or Azure DevOps Artifacts where versioned modules are published. |
| Azure Policy | Azure-native governance engine that enforces organizational rules on resources at scale. |
| Infrastructure Drift | When the actual state of deployed resources diverges from the declared Terraform state. |
| Terraform State | A file tracking the mapping between configuration and real-world resources. |
| Remote Backend | Azure Storage account used to store Terraform state centrally with locking. |
| Workspace | An isolated Terraform state instance, typically one per environment per service. |
| Plan/Apply Cycle | The two-phase Terraform workflow: preview changes (plan), then execute (apply). |
| Azure DevCenter | Centralized management for developer environments and project resources. |
| Deployment Environment | An Azure DevCenter resource that allows developers to self-serve infrastructure stacks. |
Demo: Self-Service Infrastructure Provisioning
This demo screen shows the developer experience for provisioning infrastructure through the IDP portal.
Wireframe Description
Screen Layout: A card grid displaying available golden path templates. Each card shows template name, version, description, team adoption percentage, and last updated date. Clicking a card opens a provisioning form that triggers a simulated Terraform plan with a progress timeline: Validating → Planning → Applying → Registering in Catalog → Complete.
Mock Data
{
"golden_paths": [
{
"id": "aks-web-service",
"name": "AKS Web Service",
"version": "2.1.0",
"description": "Deploy a containerized web service to AKS",
"adoption_pct": 87,
"last_updated": "2024-01-10",
"tags": ["aks", "web", "production-ready"]
},
{
"id": "postgresql-flexible",
"name": "PostgreSQL Flexible Server",
"version": "1.3.0",
"description": "Managed PostgreSQL with VNet integration",
"adoption_pct": 92,
"last_updated": "2024-01-08",
"tags": ["database", "postgresql", "managed"]
},
{
"id": "storage-datalake",
"name": "Storage Account (Data Lake)",
"version": "1.1.0",
"description": "ADLS Gen2 with private endpoint and lifecycle",
"adoption_pct": 78,
"last_updated": "2024-01-05",
"tags": ["storage", "data-lake", "analytics"]
}
]
}
02 Environment Standardization Across Dev, Staging, and Production
Environment Parity
Staging mirrors production in architecture, NSGs, RBAC, and deployment tooling; only scale differs. This parity makes a change validated in staging a reliable predictor of its behavior in production, although issues that depend on scale can still surface only in production.
Environment Tiers
| Tier | Purpose | Lifecycle | SLA |
|---|---|---|---|
| Dev | Exploration, testing, experimentation | Ephemeral (auto-teardown) | None |
| Staging | Pre-production validation | Long-lived | Internal |
| Production | Customer-facing workloads | Permanent | 99.9%+ |
Azure Deployment Environments (ADE) Recommended
Use Azure Deployment Environments for self-service environment provisioning via DevCenter. Platform engineers define environment types and templates; developers self-serve within guardrails.
Naming Conventions
Resources follow: <env>-<region>-<service>
- Example: dev-eastus2-payment-api
- Resource groups: rg-<env>-<service> (e.g., rg-dev-payment-api)
Ephemeral Environments
PR-based environments provisioned via ADE or Terraform workspaces. Auto-teardown on merge. Maximum TTL: 72 hours.
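The teardown rule combines two conditions: the PR has merged, or the TTL has elapsed. A small sketch of that check follows, assuming the requested TTL is clamped to the 72-hour maximum rather than rejected; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=72)  # maximum TTL from the spec


def expiry_time(created_at: datetime, requested_ttl_hours: int) -> datetime:
    """Expiry timestamp, with the requested TTL capped at MAX_TTL."""
    if requested_ttl_hours <= 0:
        raise ValueError("TTL must be positive")
    ttl = min(timedelta(hours=requested_ttl_hours), MAX_TTL)
    return created_at + ttl


def should_tear_down(created_at: datetime, requested_ttl_hours: int,
                     now: datetime, merged: bool = False) -> bool:
    """True once the PR is merged or the (capped) TTL has elapsed."""
    return merged or now >= expiry_time(created_at, requested_ttl_hours)


if __name__ == "__main__":
    created = datetime(2024, 1, 14, 10, 0, tzinfo=timezone.utc)
    print(expiry_time(created, 100))  # capped at 72 hours after creation
```

A scheduled job could evaluate `should_tear_down` for every active environment and destroy those for which it returns True.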
Promotion Flow Required
Dev → Staging → Production. No direct-to-prod deployments. Each promotion requires passing the previous environment's validation gates.
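The promotion rule can be expressed as a simple guard: a promotion is allowed only to the next tier, and only if the source tier's validation gates have passed. The sketch below assumes gate results are available as a set of environment names; that representation and the function name are hypothetical.

```python
PROMOTION_ORDER = ("dev", "staging", "production")


def can_promote(source: str, target: str, gates_passed: set) -> bool:
    """Allow only single-step promotions whose source gates have passed."""
    if source not in PROMOTION_ORDER or target not in PROMOTION_ORDER:
        raise ValueError("unknown environment")
    is_next_tier = PROMOTION_ORDER.index(target) == PROMOTION_ORDER.index(source) + 1
    return is_next_tier and source in gates_passed
```

With this guard, a request to go from dev straight to production is refused even when every gate has passed.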
Secrets & Configuration
- Secrets: Azure Key Vault. Naming: kv-<env>-<service>
- Feature flags & config: Azure App Configuration. Keys: <service>/<env>/<key>
- No secrets in code, environment variables, or App Configuration
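These naming schemes interact with Azure's own limits. Key Vault names are restricted to 3-24 characters (letters, digits, and non-consecutive hyphens, starting with a letter), so a long environment or service name can produce an invalid vault name. The sketch below builds both names and validates the vault name up front; the helper names are hypothetical.

```python
import re

# Key Vault naming rule: 3-24 chars, alphanumerics and hyphens, starts with a
# letter, ends with a letter or digit, no consecutive hyphens.
_KEY_VAULT_NAME = re.compile(r"^[a-zA-Z](?!.*--)[a-zA-Z0-9-]{1,22}[a-zA-Z0-9]$")


def key_vault_name(env: str, service: str) -> str:
    """Build kv-<env>-<service> and reject names Azure would refuse."""
    name = f"kv-{env}-{service}"
    if not _KEY_VAULT_NAME.match(name):
        raise ValueError(f"{name!r} is not a valid Key Vault name ({len(name)} chars)")
    return name


def app_config_key(service: str, env: str, key: str) -> str:
    """Build the <service>/<env>/<key> App Configuration key."""
    return f"{service}/{env}/{key}"
```

Note that `key_vault_name("production", "payment-api")` is rejected because the result is 25 characters long, so the convention may need an abbreviated environment segment (for example `prod`) for longer service names.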
Access Control
| Environment | Access |
|---|---|
| Dev | All engineers (Contributor role) |
| Staging | Team members + CI/CD service principal |
| Production | CI/CD only + PIM for break-glass |
Environment Definition (Dev)
# environments/dev/main.tf
module "dev_environment" {
source = "../../modules/environment"
environment = "dev"
location = "eastus2"
subscription_id = var.dev_subscription_id
vnet_config = {
address_space = ["10.0.0.0/16"]
aks_subnet = "10.0.1.0/24"
db_subnet = "10.0.2.0/24"
services_subnet = "10.0.3.0/24"
}
aks_config = {
kubernetes_version = "1.29"
node_pool_vm_size = "Standard_D2s_v5"
node_count = 2
max_count = 4
availability_zones = [] # No AZ requirement for dev
}
monitoring = {
log_analytics_retention_days = 30
alert_action_group = null # No paging in dev
grafana_enabled = false
}
cost_management = {
budget_amount = 500
budget_currency = "USD"
alert_thresholds = [80, 100]
}
tags = {
Environment = "dev"
CostCenter = "engineering"
ManagedBy = "terraform"
}
}
Environment Definition (Production)
# environments/production/main.tf
module "prod_environment" {
source = "../../modules/environment"
environment = "production"
location = "eastus2"
subscription_id = var.prod_subscription_id
vnet_config = {
address_space = ["10.2.0.0/16"]
aks_subnet = "10.2.1.0/24"
db_subnet = "10.2.2.0/24"
services_subnet = "10.2.3.0/24"
}
aks_config = {
kubernetes_version = "1.29"
node_pool_vm_size = "Standard_D4s_v5"
node_count = 3
max_count = 20
availability_zones = ["1", "2", "3"]
}
monitoring = {
log_analytics_retention_days = 90
alert_action_group = "pagerduty-critical"
grafana_enabled = true
}
cost_management = {
budget_amount = 8000
budget_currency = "USD"
alert_thresholds = [70, 85, 100]
}
tags = {
Environment = "production"
CostCenter = "engineering"
ManagedBy = "terraform"
Compliance = "soc2"
}
}
Azure Deployment Environment Definition
# environment-definitions/dev-aks/environment.yaml
name: dev-aks-environment
version: 1.0.0
summary: Development AKS environment with supporting services
description: |
Provisions an AKS cluster with Azure Database for PostgreSQL,
Key Vault, and Container Registry for development workloads.
runner: Terraform
templatePath: ./main.tf
parameters:
- id: serviceName
name: Service Name
description: Name of the service to deploy
type: string
required: true
- id: teamName
name: Team
description: Owning team
type: string
required: true
- id: nodeCount
name: Node Count
description: Number of AKS nodes
type: int
default: 2
Key Vault + App Configuration
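# modules/config-management/main.tf
# Supplies the tenant ID referenced by the Key Vault below
data "azurerm_client_config" "current" {}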
resource "azurerm_key_vault" "service" {
name = "kv-${var.environment}-${var.service_name}"
location = var.location
resource_group_name = var.resource_group_name
tenant_id = data.azurerm_client_config.current.tenant_id
sku_name = "standard"
purge_protection_enabled = var.environment == "production"
soft_delete_retention_days = var.environment == "production" ? 90 : 7
network_acls {
default_action = "Deny"
bypass = "AzureServices"
ip_rules = var.allowed_ips
virtual_network_subnet_ids = [var.aks_subnet_id]
}
}
resource "azurerm_app_configuration" "service" {
name = "appconf-${var.environment}-${var.service_name}"
location = var.location
resource_group_name = var.resource_group_name
sku = "standard"
identity {
type = "SystemAssigned"
}
}
Glossary
| Term | Definition |
|---|---|
| Environment Parity | The principle that all environments share the same architecture, differing only in scale. |
| Ephemeral Environment | A short-lived environment created for PR validation, automatically torn down on merge. |
| Azure Deployment Environment (ADE) | A DevCenter resource enabling developers to self-serve pre-defined infrastructure stacks. |
| DevCenter | Azure's centralized management service for developer environments and project resources. |
| Promotion Flow | The mandatory sequence Dev → Staging → Production for all deployments. |
| Blue-Green Deployment | Running two identical production environments and switching traffic between them. |
| Canary Deployment | Rolling out changes to a small subset of traffic before full rollout. |
| Feature Flag | A runtime toggle in Azure App Configuration to enable/disable features per environment. |
| Configuration Drift | When actual resource configuration diverges from declared state. |
| Key Vault | Azure's managed secrets store with RBAC, soft-delete, and private endpoint support. |
| App Configuration | Azure service for centralized management of application settings and feature flags. |
| Privileged Identity Management (PIM) | Microsoft Entra ID (formerly Azure AD) feature providing just-in-time privileged access with approval workflows. |
| Break-Glass Access | Emergency production access via PIM, requiring approval and full audit logging. |
| Environment Tier | Classification of environments by purpose: Dev, Staging, Production. |
| Resource Group Strategy | One resource group per environment per service for isolation and cost tracking. |
Demo: Environment Dashboard & Provisioning
Screen Layout: A matrix/grid view with rows representing services and columns for environments (Dev, Staging, Production). Each cell shows deployment version, health status indicator, and last deployed timestamp. Drift detection with Azure Policy compliance indicators is shown per cell.
Actions include "Promote to Staging" and "Promote to Production" buttons with an approval workflow modal. An Azure Deployment Environments panel shows active environments with TTL countdown and a DevCenter project selector dropdown.
Mock Data
{
"services": [
{
"name": "payment-api",
"team": "payments",
"environments": {
"dev": { "version": "3.2.1", "health": "healthy", "last_deployed": "2024-01-14T10:30:00Z", "policy_compliant": true },
"staging": { "version": "3.2.0", "health": "healthy", "last_deployed": "2024-01-13T14:00:00Z", "policy_compliant": true },
"production": { "version": "3.1.8", "health": "healthy", "last_deployed": "2024-01-10T09:00:00Z", "policy_compliant": true }
}
},
{
"name": "catalog-service",
"team": "catalog",
"environments": {
"dev": { "version": "2.5.0", "health": "healthy", "last_deployed": "2024-01-14T11:00:00Z", "policy_compliant": true },
"staging": { "version": "2.4.3", "health": "degraded", "last_deployed": "2024-01-12T16:00:00Z", "policy_compliant": false },
"production": { "version": "2.4.2", "health": "healthy", "last_deployed": "2024-01-08T08:00:00Z", "policy_compliant": true }
}
}
],
"ephemeral_environments": [
{ "name": "pr-1234-payment-api", "owner": "jsmith", "ttl_hours": 48, "remaining_hours": 22 },
{ "name": "pr-1267-catalog-svc", "owner": "adoe", "ttl_hours": 72, "remaining_hours": 65 }
]
}
03 Platform Versioning, Upgrades, and Capacity Governance
Platform Versioning
All platform components are versioned: AKS cluster version, Terraform module versions, node image versions, and Helm chart versions. Teams pin to the major version and receive automatic patch updates.
AKS Upgrade Strategy Required
Follow Azure's AKS release calendar with environment-specific channels:
| Environment | Channel | Behavior |
|---|---|---|
| Dev | rapid | Auto-upgrades to latest |
| Staging | stable | Auto-upgrades after stabilization |
| Production | none (manual) | Manual approval with 2-week testing window |
Capacity Governance
- Namespace quotas: CPU, memory, and storage limits per namespace in AKS
- Subscription quotas: Azure-level limits monitored and raised proactively
- Budget management: Azure Cost Management alerts at 80% and 100% thresholds
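The budget alert logic amounts to comparing spend against percentage thresholds. A minimal sketch follows, using integer arithmetic to avoid floating-point edge cases and a strict comparison to mirror a "greater than" alert operator; the function name is hypothetical.

```python
def crossed_thresholds(spent: int, budget: int, thresholds=(80, 100)) -> list:
    """Return the percentage thresholds that current spend has exceeded."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    # spent / budget > t / 100, rearranged to stay in integers.
    return [t for t in sorted(thresholds) if spent * 100 > t * budget]


if __name__ == "__main__":
    # Illustrative figures: 2,600 spent against a 3,000 budget.
    print(crossed_thresholds(2600, 3000))
```

A team that has spent 2,340 of a 3,000 budget (78%) has crossed no threshold yet, while 3,100 of 3,000 crosses both.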
Cost Allocation Required
All resources tagged with: CostCenter, Team, Service, Environment. Use Azure Cost Management + Billing for team/service attribution.
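A pre-deployment check for the tagging rule can be as small as the sketch below. The required set is a parameter and defaults to the four tags named above; the required-tags policy in section 04 additionally checks ManagedBy, so callers may want to pass a longer tuple. The function name is hypothetical.

```python
REQUIRED_TAGS = ("CostCenter", "Team", "Service", "Environment")


def missing_tags(tags: dict, required=REQUIRED_TAGS) -> list:
    """Return required tag names that are absent from a resource's tags.

    Like an "exists: false" policy condition, this checks presence only and
    does not validate the tag values.
    """
    return [name for name in required if name not in tags]
```

An empty result means the resource would pass the tag check; anything else can be reported back to the owning team before the plan is applied.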
Deprecation Policy
- 90-day sunset period with portal warnings
- Migration guides published at announcement
- Hard cutoff enforced after sunset period
- 2-week advance notice for all changes, release notes in portal
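The sunset rule is plain date arithmetic. The sketch below computes the earliest permissible sunset date for an announcement and the days left before a cutoff; it assumes the 90 days are calendar days, and the function names are hypothetical.

```python
from datetime import date, timedelta

SUNSET_PERIOD = timedelta(days=90)  # assumption: calendar days, not business days


def earliest_sunset(announced: date) -> date:
    """Earliest date on which a hard cutoff may be enforced."""
    return announced + SUNSET_PERIOD


def is_valid_sunset(announced: date, sunset: date) -> bool:
    """True if the sunset date leaves at least the full sunset period."""
    return sunset >= earliest_sunset(announced)


def days_remaining(sunset: date, today: date) -> int:
    """Days until the cutoff, floored at zero once it has passed."""
    return max((sunset - today).days, 0)
```

For an announcement on 2024-01-01, the earliest valid sunset is 2024-03-31, so a sunset date of 2024-04-01 satisfies the policy.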
Platform Version Manifest
# platform-manifest.yaml
apiVersion: platform.acme.com/v1
kind: PlatformManifest
metadata:
version: "2024.Q1"
effective_date: "2024-01-15"
components:
terraform:
version: "1.7.x"
provider_azurerm: "~> 3.90"
state_backend: "azurerm"
aks:
kubernetes_version: "1.29"
node_image_channel: "NodeImage"
upgrade_channel:
dev: "rapid"
staging: "stable"
production: "none"
databases:
postgresql_flexible:
versions: ["15", "16"]
cosmos_db:
api: ["SQL", "MongoDB"]
monitoring:
grafana_version: "10.x"
log_analytics_agent: "ama-logs"
prometheus: "azure-managed"
security:
defender_plans: ["Containers", "KeyVaults", "DNS", "Databases", "Storage"]
tls_minimum: "1.2"
azure_policy_set: "acme-baseline-v2"
container_registry:
acr_sku: "Premium"
geo_replication: ["eastus2", "westus2"]
content_trust: true
AKS Namespace Resource Quota
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-payment-quota
namespace: payments
spec:
hard:
requests.cpu: "8"
requests.memory: 16Gi
limits.cpu: "16"
limits.memory: 32Gi
pods: "20"
services: "10"
persistentvolumeclaims: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: payments
spec:
limits:
- default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
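Before a deployment is applied, it can be useful to check whether its requests fit in the remaining namespace quota. The following sketch parses a subset of Kubernetes quantity strings (millicore CPU values and binary memory suffixes only) and compares the totals against the request quotas shown above. The function names are hypothetical and current usage is passed in by the caller.

```python
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '8' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '16Gi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def fits_quota(replicas: int, cpu_request: str, memory_request: str,
               used_cpu: str, used_memory: str,
               quota_cpu: str = "8", quota_memory: str = "16Gi") -> bool:
    """True if the new pods' requests fit within the namespace request quota."""
    cpu_total = parse_cpu(used_cpu) + replicas * parse_cpu(cpu_request)
    memory_total = parse_memory(used_memory) + replicas * parse_memory(memory_request)
    return cpu_total <= parse_cpu(quota_cpu) and memory_total <= parse_memory(quota_memory)
```

With 6200m CPU already requested in the namespace, three replicas at 500m each still fit under the 8-core quota, but a fourth replica would not.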
Azure Budget & Cost Alert
resource "azurerm_consumption_budget_resource_group" "team" {
name = "budget-${var.team}-${var.environment}"
resource_group_id = var.resource_group_id
amount = var.monthly_budget
time_grain = "Monthly"
time_period {
start_date = "2024-01-01T00:00:00Z"
}
notification {
enabled = true
threshold = 80
operator = "GreaterThan"
contact_emails = var.team_leads
contact_groups = [var.action_group_id]
}
notification {
enabled = true
threshold = 100
operator = "GreaterThan"
contact_emails = var.team_leads
contact_groups = [var.action_group_id, var.finops_action_group_id]
}
}
Deprecation Notice Template
apiVersion: platform.acme.com/v1
kind: DeprecationNotice
metadata:
id: "DEP-2024-003"
severity: warning
spec:
component: "aks"
affected_version: "1.27"
replacement_version: "1.29"
announcement_date: "2024-01-01"
sunset_date: "2024-04-01"
migration_guide: "https://portal.acme.com/guides/aks-1.27-to-1.29"
affected_teams: ["payments", "catalog", "orders"]
breaking_changes:
- "Deprecated PodSecurityPolicy removed"
- "API version batch/v1beta1 removed"
AKS Upgrade Runbook Template
apiVersion: platform.acme.com/v1
kind: UpgradeRunbook
metadata:
name: aks-cluster-upgrade
version: "1.0"
spec:
pre_checks:
- name: "Verify PDB coverage"
command: "kubectl get pdb --all-namespaces"
- name: "Check deprecated API usage"
command: "kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis"
- name: "Validate node pool health"
command: "az aks nodepool list --cluster-name $CLUSTER --resource-group $RG"
upgrade_steps:
- name: "Upgrade control plane"
command: "az aks upgrade --resource-group $RG --name $CLUSTER --kubernetes-version $TARGET --control-plane-only"
- name: "Upgrade system node pool"
command: "az aks nodepool upgrade --cluster-name $CLUSTER --name system --resource-group $RG --kubernetes-version $TARGET"
- name: "Upgrade user node pools (rolling)"
command: "az aks nodepool upgrade --cluster-name $CLUSTER --name $POOL --resource-group $RG --kubernetes-version $TARGET --max-surge 33%"
post_checks:
- name: "Verify all nodes ready"
command: "kubectl get nodes"
- name: "Run smoke tests"
command: "./scripts/smoke-test.sh"
rollback:
- name: "Scale up old node pool"
- name: "Cordon new nodes"
- name: "Drain and delete new node pool"
Glossary
| Term | Definition |
|---|---|
| SemVer | Semantic Versioning (MAJOR.MINOR.PATCH) — the versioning standard for all modules. |
| Platform Manifest | YAML document declaring all approved component versions for a given quarter. |
| Resource Quota | Kubernetes object limiting total resource consumption per namespace. |
| LimitRange | Kubernetes object setting default and maximum resource requests/limits per container. |
| Capacity Governance | Policies and automation ensuring resource usage stays within budget and quotas. |
| Deprecation Notice | Formal announcement of a component version reaching end-of-support. |
| Sunset Period | The 90-day window between deprecation announcement and hard removal. |
| AKS Upgrade Channel | Azure setting (rapid/stable/none) controlling automatic Kubernetes version upgrades. |
| Node Image Channel | Azure setting controlling automatic OS image updates on AKS nodes. |
| Maintenance Window | Scheduled time range when AKS can apply automatic upgrades. |
| PodDisruptionBudget | Kubernetes object ensuring minimum pod availability during voluntary disruptions. |
| Azure Cost Management | Azure-native service for budget tracking, cost alerts, and spend analysis. |
| FinOps | Financial operations discipline for cloud cost optimization and accountability. |
Demo: Platform Version & Capacity Dashboard
Screen Layout: Dashboard with AKS cluster version cards showing current/available/deprecated status. Namespace quota usage displays as progress bars (CPU, memory, pods used vs. allocated). An Azure Cost Management section shows team spend vs budget with burn rate trending. An AKS upgrade scheduler with maintenance window picker lets platform engineers schedule rolling upgrades. A deprecation timeline shows upcoming sunsets.
Mock Data
{
"clusters": [
{ "name": "aks-dev-eastus2", "current": "1.29.2", "available": "1.30.0", "channel": "rapid", "status": "up-to-date" },
{ "name": "aks-staging-eastus2", "current": "1.29.0", "available": "1.29.2", "channel": "stable", "status": "update-available" },
{ "name": "aks-prod-eastus2", "current": "1.28.5", "available": "1.29.2", "channel": "none", "status": "upgrade-required" }
],
"quotas": {
"payments": { "cpu_used": 6.2, "cpu_limit": 8, "memory_used_gi": 12.4, "memory_limit_gi": 16, "pods_used": 14, "pods_limit": 20 },
"catalog": { "cpu_used": 3.1, "cpu_limit": 8, "memory_used_gi": 5.8, "memory_limit_gi": 16, "pods_used": 8, "pods_limit": 20 }
},
"budgets": [
{ "team": "payments", "budget": 3000, "spent": 2340, "burn_rate": 78 },
{ "team": "catalog", "budget": 2500, "spent": 1875, "burn_rate": 75 }
]
}
04 Security, Compliance, and Cost Optimization
Security by Default Required
Golden paths enforce security without developer action:
- Encryption at rest: Azure-managed keys (all services)
- Encryption in transit: TLS 1.2+ enforced
- Managed Identity: No service principal secrets in code
- NSG lockdown: Deny-all default with explicit allow rules
- Private endpoints: All PaaS services accessible only via VNet
Azure Policy Required
All compliance rules defined as Azure Policy definitions and initiatives. Policies assigned at management group or subscription level.
| Effect | Use Case |
|---|---|
| Deny | Critical rules: block non-compliant resource creation |
| Audit | Advisory rules: flag non-compliance without blocking |
| DeployIfNotExists | Auto-remediate: deploy missing config (e.g., diagnostic settings) |
| Modify | Auto-fix: add missing tags or settings on create/update |
Defender for Cloud
Enable Defender plans for Containers, Key Vaults, DNS, Databases, and Storage. Use the regulatory compliance dashboard for NIST 800-53, SOC 2, and CIS Azure Benchmarks. Custom compliance standards supported via policy initiatives.
Secrets Management Required
- Azure Key Vault for all secrets
- No secrets in code, env vars, or App Configuration
- Managed Identity + CSI Secret Store Driver for AKS pods
- Auto-rotation via Key Vault rotation policies
Container Security
- ACR with content trust enabled
- Vulnerability scanning via Defender for Containers
- Critical/High CVEs block deployment (admission controller)
- Only approved base images from organization ACR
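The CVE gate reduces to filtering scan findings by severity. This is a rough sketch, assuming findings arrive as dictionaries with `id` and `severity` keys; that shape is hypothetical and does not reflect the actual Defender for Containers output format.

```python
BLOCKING_SEVERITIES = frozenset({"critical", "high"})


def blocking_findings(findings: list) -> list:
    """Return the IDs of findings severe enough to block a deployment."""
    return [
        finding["id"]
        for finding in findings
        if finding["severity"].lower() in BLOCKING_SEVERITIES
    ]


def is_deployable(findings: list) -> bool:
    """True when no Critical or High findings are present."""
    return not blocking_findings(findings)
```

An admission check would call `is_deployable` with the scan result for the image digest and reject the workload when it returns False, listing the blocking IDs in the rejection message.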
Network Security
- Private endpoints for all PaaS services
- AKS with Azure CNI + Network Policies (Calico)
- NSG flow logs enabled
- Azure Firewall for egress filtering in production
Cost Optimization
- Dev/Staging use spot node pools where feasible
- Production uses Reserved Instances for baseline capacity
- AKS cluster autoscaler + Karpenter for efficient scaling
- Unused resources auto-flagged via Azure Advisor
- Monthly FinOps review with team leads
Azure Policy: Encryption Required
{
"properties": {
"displayName": "Require encryption at rest for storage accounts",
"policyType": "Custom",
"mode": "All",
"parameters": {},
"policyRule": {
"if": {
"allOf": [
{
"field": "type",
"equals": "Microsoft.Storage/storageAccounts"
},
{
"field": "Microsoft.Storage/storageAccounts/encryption.services.blob.enabled",
"notEquals": true
}
]
},
"then": {
"effect": "deny"
}
}
}
}
Azure Policy: Approved AKS Node Sizes
{
  "properties": {
    "displayName": "Allowed AKS node pool VM sizes",
    "policyType": "Custom",
    "mode": "All",
    "parameters": {
      "allowedSizes": {
        "type": "Array",
        "metadata": {
          "displayName": "Allowed VM Sizes",
          "description": "Approved VM sizes for AKS node pools"
        },
        "defaultValue": [
          "Standard_D2s_v5",
          "Standard_D4s_v5",
          "Standard_D8s_v5",
          "Standard_E4ds_v5",
          "Standard_E8ds_v5"
        ]
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.ContainerService/managedClusters/agentPools"
          },
          {
            "field": "Microsoft.ContainerService/managedClusters/agentPools/vmSize",
            "notIn": "[parameters('allowedSizes')]"
          }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
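A portal form could reject a disallowed node size before the request ever reaches Azure Policy. The sketch below mirrors the policy's default `allowedSizes` list. The `NodePoolRequest` shape and `findDisallowedPools` function are hypothetical, and the comparison is an exact string match.

```typescript
// Copied from the policy's allowedSizes default value.
const ALLOWED_NODE_SIZES: string[] = [
  "Standard_D2s_v5",
  "Standard_D4s_v5",
  "Standard_D8s_v5",
  "Standard_E4ds_v5",
  "Standard_E8ds_v5",
];

// Hypothetical request shape for a node pool in a provisioning form.
interface NodePoolRequest {
  poolName: string;
  vmSize: string;
}

// Returns the pools whose VM size is not on the approved list.
function findDisallowedPools(pools: NodePoolRequest[]): NodePoolRequest[] {
  return pools.filter((p) => ALLOWED_NODE_SIZES.indexOf(p.vmSize) === -1);
}
```

The policy remains the enforcement point. A client-side check like this only shortens the feedback loop.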
Azure Policy: Required Tags
{
  "properties": {
    "displayName": "Require mandatory resource tags",
    "policyType": "Custom",
    "mode": "Indexed",
    "parameters": {},
    "policyRule": {
      "if": {
        "anyOf": [
          { "field": "tags['CostCenter']", "exists": false },
          { "field": "tags['Team']", "exists": false },
          { "field": "tags['Service']", "exists": false },
          { "field": "tags['Environment']", "exists": false },
          { "field": "tags['ManagedBy']", "exists": false }
        ]
      },
      "then": {
        "effect": "deny"
      }
    }
  }
}
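The `anyOf` rule denies a resource as soon as one of the five tags is absent. A pre-flight check in the portal can report which tags are missing before deployment. The sketch below is illustrative and the `missingTags` function is hypothetical. It compares names case-insensitively because Azure treats tag names as case-insensitive.

```typescript
// Tag names taken from the policy rule.
const REQUIRED_TAGS: string[] = ["CostCenter", "Team", "Service", "Environment", "ManagedBy"];

// Returns the required tag names that are absent from the given tag map.
function missingTags(tags: Record<string, string>): string[] {
  const present = Object.keys(tags).map((k) => k.toLowerCase());
  return REQUIRED_TAGS.filter((t) => present.indexOf(t.toLowerCase()) === -1);
}

// The policy would deny this resource: three required tags are absent.
const missingForExample = missingTags({ CostCenter: "1234", team: "payments" });
// missingForExample is ["Service", "Environment", "ManagedBy"]
```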
NSG Baseline (AKS)
resource "azurerm_network_security_group" "aks" {
  name                = "nsg-${var.service_name}-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name

  # The AzureLoadBalancer service tag matches Azure health probe traffic.
  # Client traffic forwarded by a load balancer keeps its original source
  # address and needs its own allow rule.
  security_rule {
    name                       = "AllowHTTPSInbound"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "AzureLoadBalancer"
    destination_address_prefix = var.aks_subnet_prefix
    description                = "Allow HTTPS from Azure LB"
  }

  # Evaluated before the default AllowVnetInBound rule (priority 65000), so
  # required intra-VNet traffic also needs explicit allow rules above this one.
  security_rule {
    name                       = "DenyAllInbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
    description                = "Deny all other inbound"
  }

  tags = var.common_tags
}
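NSG rules are evaluated in ascending priority order and the first matching rule decides the outcome. The sketch below models that behavior for the two baseline rules. It is deliberately simplified: sources are compared as literal strings or `*`, ports are single values, and the `NsgRule`, `InboundPacket`, and `firstMatch` names are hypothetical.

```typescript
interface NsgRule {
  ruleName: string;
  priority: number;
  access: "Allow" | "Deny";
  protocol: "Tcp" | "Udp" | "*";
  destinationPort: string; // a single port or "*"
  source: string;          // a service tag or "*"
}

interface InboundPacket {
  protocol: "Tcp" | "Udp";
  destinationPort: number;
  source: string;
}

// The two inbound rules from the baseline, reduced to the fields used here.
const baselineRules: NsgRule[] = [
  { ruleName: "DenyAllInbound", priority: 4096, access: "Deny", protocol: "*", destinationPort: "*", source: "*" },
  { ruleName: "AllowHTTPSInbound", priority: 100, access: "Allow", protocol: "Tcp", destinationPort: "443", source: "AzureLoadBalancer" },
];

// Returns the first rule that matches, scanning from lowest priority number up.
function firstMatch(rules: NsgRule[], packet: InboundPacket): NsgRule | undefined {
  const ordered = rules.slice().sort((a, b) => a.priority - b.priority);
  for (const rule of ordered) {
    const protocolOk = rule.protocol === "*" || rule.protocol === packet.protocol;
    const portOk = rule.destinationPort === "*" || rule.destinationPort === String(packet.destinationPort);
    const sourceOk = rule.source === "*" || rule.source === packet.source;
    if (protocolOk && portOk && sourceOk) {
      return rule;
    }
  }
  return undefined;
}
```

With these rules, a TCP packet to port 443 from `AzureLoadBalancer` matches `AllowHTTPSInbound`, while any other inbound packet falls through to `DenyAllInbound` before Azure's default rules are consulted.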
Defender for Cloud Custom Compliance
resource "azurerm_policy_set_definition" "compliance" {
  name         = "acme-security-standard-v1"
  policy_type  = "Custom"
  display_name = "Acme Security Standard v1"
  description  = "Internal security requirements for all Azure workloads"

  metadata = jsonencode({
    category = "Regulatory Compliance"
    ASC = {
      complianceStandard = {
        displayName = "Acme Security Standard v1"
        version     = "1.0"
      }
    }
  })

  policy_definition_group {
    name         = "ACME-1.1"
    display_name = "ACME-1.1: Data encrypted at rest"
    category     = "Data Protection"
  }

  policy_definition_group {
    name         = "ACME-1.2"
    display_name = "ACME-1.2: Private endpoints only"
    category     = "Network Security"
  }

  policy_definition_group {
    name         = "ACME-2.1"
    display_name = "ACME-2.1: Privileged MFA required"
    category     = "Identity"
  }

  policy_definition_reference {
    policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/6fac406b-40ca-413b-bf8e-0bf964659c25"
    reference_id         = "storageEncryption"
    policy_group_names   = ["ACME-1.1"]
  }
}
FinOps Cost Anomaly Alert
resource "azurerm_cost_anomaly_alert" "finops" {
  name            = "anomaly-alert-${var.team}"
  display_name    = "Cost anomaly alert for ${var.team}"
  email_subject   = "Azure Cost Anomaly Detected - ${var.team}"
  email_addresses = var.finops_team_emails
  message         = "An unexpected cost increase was detected."
  subscription_id = var.subscription_id
}
Glossary
| Term | Definition |
|---|---|
| Azure Policy (Deny) | Blocks resource creation or update that violates the policy rule. |
| Azure Policy (Audit) | Flags non-compliant resources without blocking their creation. |
| Azure Policy (DeployIfNotExists) | Automatically deploys a related resource if missing (e.g., diagnostic settings). |
| Azure Policy (Modify) | Adds or corrects properties on resources at create/update time. |
| Policy Initiative | A collection of policy definitions grouped under a single assignment. |
| Defender for Cloud | Unified security posture management and threat protection for Azure resources. |
| Regulatory Compliance Dashboard | Defender for Cloud view mapping policy compliance to standards (NIST, SOC 2, CIS). |
| Managed Identity | Microsoft Entra ID (formerly Azure AD) identity assigned to a resource, eliminating the need for stored credentials. |
| Encryption at Rest | Data encrypted when stored on disk using Azure-managed or customer-managed keys. |
| CSI Secret Store Driver | Kubernetes driver that mounts Key Vault secrets directly into pod volumes. |
| Content Trust (ACR) | Docker Content Trust for verifying image integrity and publisher identity. |
| FinOps | Practice of bringing financial accountability to cloud spend through visibility, optimization, and governance. |
| Reserved Instances | 1- or 3-year Azure VM commitments at discounted rates for predictable workloads. |
| Spot Node Pools | AKS node pools using Azure Spot VMs at significant discount for interruptible workloads. |
| NIST 800-53 | NIST catalog of security and privacy controls for US federal information systems. |
| SOC 2 | Audit framework for service organizations covering security, availability, processing integrity, confidentiality, and privacy. |
| CIS Azure Benchmark | Center for Internet Security benchmark for Azure configuration best practices. |
| Network Security Group | Azure-native firewall rules filtering traffic to/from resources within a VNet. |
| Private Endpoint | Private IP address within a VNet for accessing Azure PaaS services without public internet. |
| Azure Firewall | Managed cloud firewall for controlling outbound traffic from VNets. |
Demo: Security & Compliance Scorecard
Screen Layout: A service-level compliance scorecard showing Defender for Cloud secure score, policy violations count, CVE count, tag compliance percentage, and secrets rotation status per service. A portfolio-level compliance heatmap (services × controls matrix) uses color coding for pass/fail/warning. A FinOps panel shows team spend vs. budget with cost trend charts and optimization recommendations from Azure Advisor.
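The pass/fail/warning color coding of the heatmap could be derived from raw scorecard metrics as sketched below. The thresholds are assumptions chosen for illustration (the specification does not define them), and all type and function names are hypothetical.

```typescript
type CellStatus = "pass" | "warning" | "fail";

interface ScorecardInput {
  service: string;
  policyViolations: number;
  cveCount: number;
  tagCompliancePct: number;
}

interface HeatmapRow {
  service: string;
  policy: CellStatus;
  cves: CellStatus;
  tags: CellStatus;
}

// Lower is better: pass up to passMax, warning up to warnMax, otherwise fail.
function gradeLowerIsBetter(value: number, passMax: number, warnMax: number): CellStatus {
  if (value <= passMax) return "pass";
  if (value <= warnMax) return "warning";
  return "fail";
}

// Higher is better: pass from passMin, warning from warnMin, otherwise fail.
function gradeHigherIsBetter(value: number, passMin: number, warnMin: number): CellStatus {
  if (value >= passMin) return "pass";
  if (value >= warnMin) return "warning";
  return "fail";
}

// Builds one heatmap row per service using the assumed thresholds.
function buildHeatmap(rows: ScorecardInput[]): HeatmapRow[] {
  return rows.map((r) => ({
    service: r.service,
    policy: gradeLowerIsBetter(r.policyViolations, 0, 2),
    cves: gradeLowerIsBetter(r.cveCount, 0, 5),
    tags: gradeHigherIsBetter(r.tagCompliancePct, 100, 80),
  }));
}
```

Keeping the thresholds as arguments makes it straightforward to move them into per-control configuration later.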
Mock Data
{
  "secure_score": 82,
  "services": [
    {
      "name": "payment-api",
      "policy_violations": 0,
      "cve_count": 2,
      "tag_compliance_pct": 100,
      "secrets_rotated": true,
      "defender_score": 95
    },
    {
      "name": "catalog-service",
      "policy_violations": 3,
      "cve_count": 7,
      "tag_compliance_pct": 85,
      "secrets_rotated": true,
      "defender_score": 78
    }
  ],
  "finops": {
    "total_budget": 15000,
    "total_spent": 11200,
    "recommendations": [
      { "type": "right-sizing", "savings": 320, "resource": "Standard_D8s_v5 -> Standard_D4s_v5" },
      { "type": "reserved-instance", "savings": 1200, "resource": "3x Standard_D4s_v5 (1yr)" }
    ]
  }
}
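The FinOps panel figures can be derived from the `finops` object in the mock data. The sketch below computes budget utilization, remaining budget, and total potential savings. The interface and function names are hypothetical, and the mock data does not state a currency or billing period.

```typescript
interface CostRecommendation {
  type: string;
  savings: number;
  resource: string;
}

interface FinopsPanel {
  total_budget: number;
  total_spent: number;
  recommendations: CostRecommendation[];
}

interface FinopsSummary {
  utilizationPct: number; // rounded to one decimal place
  remaining: number;
  potentialSavings: number;
}

// Values copied from the mock data.
const finopsMock: FinopsPanel = {
  total_budget: 15000,
  total_spent: 11200,
  recommendations: [
    { type: "right-sizing", savings: 320, resource: "Standard_D8s_v5 -> Standard_D4s_v5" },
    { type: "reserved-instance", savings: 1200, resource: "3x Standard_D4s_v5 (1yr)" },
  ],
};

function summarizeFinops(panel: FinopsPanel): FinopsSummary {
  const utilizationPct = Math.round((panel.total_spent / panel.total_budget) * 1000) / 10;
  const potentialSavings = panel.recommendations.reduce((sum, r) => sum + r.savings, 0);
  return {
    utilizationPct,
    remaining: panel.total_budget - panel.total_spent,
    potentialSavings,
  };
}

const finopsSummary = summarizeFinops(finopsMock);
// finopsSummary is { utilizationPct: 74.7, remaining: 3800, potentialSavings: 1520 }
```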
Unified Demo Application Specification
Acme Platform Console (Azure)
Tech Stack: React 18, TypeScript, Tailwind CSS, Recharts, Mock API layer
Key Screens
- Home Dashboard — Platform health, active golden paths, environment matrix, Defender secure score
- Golden Path Catalog — Card grid with self-service provisioning via Azure Deployment Environments
- Environment Manager — Environment matrix with Azure Policy compliance, promotion workflows, ADE panel
- Platform Versions — AKS version tracker, upgrade scheduler (rapid/stable/none channels), deprecation timeline
- Compliance Center — Defender for Cloud integration, service scorecards, FinOps dashboard
Data Model
// Core entities
Service: { id, name, team, golden_path_id, environments[] }
Environment: { id, name, tier, region, health, version, policy_compliant }
GoldenPathTemplate: { id, name, version, description, adoption_pct, tags[] }
ResourceQuota: { namespace, cpu_used, cpu_limit, memory_used, memory_limit, pods_used, pods_limit }
ComplianceControl: { id, name, category, status, framework }
PolicyViolation: { id, resource_id, policy_name, severity, detected_at }
CostReport: { team, budget, spent, burn_rate, recommendations[] }
DeploymentEnvironment: { id, name, owner, ttl_hours, remaining_hours, status }
DevCenterProject: { id, name, environment_types[], teams[] }
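As one way to type these entities, the sketch below declares `ResourceQuota` as a TypeScript interface and derives per-dimension utilization for a quota widget. The field types are assumptions (the data model lists names only), and `usagePct` and `quotaUsage` are hypothetical helpers.

```typescript
// Field names follow the data model; numeric types are assumed.
interface ResourceQuota {
  namespace: string;
  cpu_used: number;
  cpu_limit: number;
  memory_used: number;
  memory_limit: number;
  pods_used: number;
  pods_limit: number;
}

type QuotaDimension = "cpu" | "memory" | "pods";

interface QuotaUsage {
  dimension: QuotaDimension;
  pct: number;
}

// Whole-number percentage; a zero or negative limit is reported as 0.
function usagePct(used: number, limit: number): number {
  return limit > 0 ? Math.round((used / limit) * 100) : 0;
}

// Returns utilization per dimension, most constrained first.
function quotaUsage(q: ResourceQuota): QuotaUsage[] {
  const rows: QuotaUsage[] = [
    { dimension: "cpu", pct: usagePct(q.cpu_used, q.cpu_limit) },
    { dimension: "memory", pct: usagePct(q.memory_used, q.memory_limit) },
    { dimension: "pods", pct: usagePct(q.pods_used, q.pods_limit) },
  ];
  return rows.sort((a, b) => b.pct - a.pct);
}
```

Sorting by utilization puts the dimension closest to its limit first, which is usually the one a capacity warning should name.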
Mock API Schema (REST Endpoints)
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/services | List all services with environment status |
| GET | /api/golden-paths | List available golden path templates |
| POST | /api/golden-paths/{id}/provision | Trigger provisioning workflow |
| GET | /api/environments | List all environments with health |
| POST | /api/environments/{id}/promote | Promote deployment to next tier |
| GET | /api/platform/manifest | Get current platform version manifest |
| GET | /api/clusters | List AKS clusters with version info |
| POST | /api/clusters/{id}/upgrade | Schedule AKS upgrade |
| GET | /api/quotas/{namespace} | Get namespace quota usage |
| GET | /api/compliance/score | Get Defender secure score |
| GET | /api/compliance/violations | List policy violations |
| GET | /api/costs/{team} | Get team cost report |
| GET | /api/deployment-environments | List active deployment environments |
| POST | /api/deployment-environments | Create new deployment environment |
| DELETE | /api/deployment-environments/{id} | Tear down deployment environment |
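A mock API layer for these endpoints can be as small as a route table plus a path matcher that understands the `{param}` placeholders. The sketch below is one possible shape, not the console's actual implementation. The response bodies, status codes, and all type and function names are assumptions.

```typescript
type HttpMethod = "GET" | "POST" | "DELETE";
type PathParams = { [key: string]: string };

interface MockRoute {
  method: HttpMethod;
  pattern: string; // e.g. "/api/quotas/{namespace}"
  handler: (params: PathParams) => unknown;
}

interface MockResult {
  status: number;
  body: unknown;
}

// Matches a concrete path against a pattern; returns captured params or null.
function matchPath(pattern: string, path: string): PathParams | null {
  const patternParts = pattern.split("/");
  const pathParts = path.split("/");
  if (patternParts.length !== pathParts.length) return null;

  const params: PathParams = {};
  for (let i = 0; i < patternParts.length; i++) {
    const segment = patternParts[i];
    const isParam = segment.charAt(0) === "{" && segment.charAt(segment.length - 1) === "}";
    if (isParam) {
      params[segment.slice(1, -1)] = pathParts[i];
    } else if (segment !== pathParts[i]) {
      return null;
    }
  }
  return params;
}

// Runs the first route whose method and pattern match; 404 otherwise.
function dispatch(routes: MockRoute[], method: HttpMethod, path: string): MockResult {
  for (const route of routes) {
    if (route.method !== method) continue;
    const params = matchPath(route.pattern, path);
    if (params !== null) {
      return { status: 200, body: route.handler(params) };
    }
  }
  return { status: 404, body: { error: "not found" } };
}

// Three of the endpoints above, with placeholder response bodies.
const mockRoutes: MockRoute[] = [
  { method: "GET", pattern: "/api/compliance/score", handler: () => ({ secure_score: 82 }) },
  { method: "GET", pattern: "/api/quotas/{namespace}", handler: (p) => ({ namespace: p.namespace }) },
  { method: "DELETE", pattern: "/api/deployment-environments/{id}", handler: (p) => ({ id: p.id, status: "deleting" }) },
];
```

Calling `dispatch(mockRoutes, "GET", "/api/quotas/payments")` returns status 200 with the captured namespace in the body, which is enough for the React screens to develop against before a real backend exists.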
AWS → Azure Service Mapping
| Capability | AWS Version | Azure Version |
|---|---|---|
| Compute | ECS Fargate | AKS (Azure Kubernetes Service) |
| IaC | Terraform + AWS provider | Terraform + AzureRM provider |
| Policy Engine | Sentinel (Terraform Cloud) | Azure Policy + Defender for Cloud |
| Secrets | AWS Secrets Manager | Azure Key Vault |
| Config | SSM Parameter Store | Azure App Configuration |
| Monitoring | CloudWatch | Azure Monitor + Managed Grafana |
| Container Registry | ECR | Azure Container Registry (ACR) |
| Compliance | AWS Config | Defender for Cloud Regulatory Compliance |
| Cost Management | AWS Cost Explorer | Azure Cost Management + Billing |
| Self-Service Envs | (custom) | Azure Deployment Environments |
| Dev Workstations | (none) | Microsoft Dev Box |