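Prepare the infrastructure for deploying Kafka on Azure Kubernetes Service (AKS)

In this article, you prepare the infrastructure for deploying a Kafka cluster on Azure Kubernetes Service (AKS).

Architecture overview

The target AKS architecture for the Kafka deployment prioritizes high availability through a comprehensive zone-redundant design. The design requires three node pools, one per availability zone, to maintain workload distribution and storage alignment. This zonal configuration is critical because persistent volumes in this architecture have zonal affinity. Any new node that the cluster autoscaler provisions must be created in the appropriate zone. Without this zonal specificity, pods with zone-bound persistent volumes remain in a pending state. Multiple replicas of the Strimzi Cluster Operator and the Kafka broker instances are defined and distributed across zones, which provides resiliency against node failures and entire zone failures within the target region. To prevent resource contention and ensure predictable performance, we strongly recommend dedicated node pools for Kafka workloads.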

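Prerequisites

If you haven't already, review the overview for deploying Kafka on Azure Kubernetes Service (AKS) by using Strimzi.
Terraform v1.3.0 or later is installed.
The Azure CLI is installed and authenticated.
The following roles, which provide sufficient permissions to create infrastructure resources and assign RBAC to managed identities: Network Contributor, Azure Kubernetes Service Contributor Role, and Role Based Access Control Administrator.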

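Deploy the infrastructure

The following steps walk you through deploying the AKS cluster and the supporting infrastructure that the Kafka deployment requires.

Tip

If you have an existing AKS cluster or existing supporting infrastructure, you can skip the full deployment steps or adapt the code, but make sure that the infrastructure and AKS cluster meet the following requirements:

A virtual network with a subnet for the nodes.
The Azure Disk CSI driver enabled on the AKS cluster (enabled by default).
A node pool per availability zone (zones 1, 2, and 3).
Dedicated node pools for Kafka with a VM size that suits your workload requirements.
Azure Managed Prometheus and Azure Managed Grafana configured.

Set environment variables

Before you run any CLI commands, set the following environment variables for use throughout this guide, with values that meet your requirements: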

export RESOURCE_GROUP_NAME="rg-kafka"  
export LOCATION="canadacentral"  
export VNET_NAME="vnet-aks-kafka"  
export SUBNET_NAME="node-subnet"  
export AKS_CLUSTER_NAME="aks-kafka-cluster"  
export AKS_TIER=standard
export NAT_GATEWAY_NAME="nat-kafka"  
export ADDRESS_SPACE="10.31.0.0/20"  
export SUBNET_PREFIX="10.31.0.0/21"  
export SYSTEM_NODE_COUNT_MIN=3
export SYSTEM_NODE_COUNT_MAX=6
export SYSTEM_NODE_VM_SIZE="Standard_D4ds_v5"   
export KAFKA_NODE_COUNT_MIN=1  
export KAFKA_NODE_COUNT_MAX=3 
export KAFKA_NODE_COUNT=1
export KAFKA_NODE_VM_SIZE="Standard_D16ds_v5"  
export LOG_ANALYTICS_WORKSPACE_NAME="law-monitoring"  
export DIAGNOSTIC_SETTINGS_NAME="aks-diagnostic-settings"  
export ACR_NAME="aksacr123"  
export ACR_SKU="Premium"  
export USER_ASSIGNED_IDENTITY_NAME="uami-aks"  
export KUBERNETES_VERSION="1.30.0"  
export AAD_ADMIN_GROUP_OBJECT_IDS="<your-admin-group-object-id>"    
export AAD_TENANT_ID="<your-tenant-id>"  
export GRAFANA_NAME="grafana-kafka-aks"  
export PROMETHEUS_WORKSPACE_NAME="prometheus-aks"
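
Because the later commands interpolate these variables directly, an unset value or a leftover <placeholder> tends to surface only as a confusing CLI error. The following is a minimal sketch of a pre-flight check. The helper name check_required_vars is hypothetical and is not part of the Azure CLI.

```bash
# Hypothetical helper: report variables that are unset, empty, or still hold
# a "<placeholder>" value. Prints one line per problem and returns non-zero
# if any problem was found. Variable names must be trusted identifiers,
# because they are expanded through eval for portability.
check_required_vars() {
  local name value status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    case "$value" in
      "")      echo "missing: $name";     status=1 ;;
      "<"*">") echo "placeholder: $name"; status=1 ;;
    esac
  done
  return "$status"
}
```

For example, running check_required_vars RESOURCE_GROUP_NAME LOCATION AAD_ADMIN_GROUP_OBJECT_IDS AAD_TENANT_ID before the first az command would flag the two placeholder values shown above until you replace them.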

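Pre-cluster network deployment

Before you deploy the AKS cluster for Kafka, deploy the prerequisite network resources that support the AKS cluster deployment.

Create a resource group by using the az group create command.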

az group create --name $RESOURCE_GROUP_NAME --location $LOCATION

Create a virtual network by using the az network vnet create command.

az network vnet create \
--resource-group $RESOURCE_GROUP_NAME \
--name $VNET_NAME \
--address-prefix $ADDRESS_SPACE \
--location $LOCATION

Create a subnet by using the az network vnet subnet create command.

az network vnet subnet create \
--resource-group $RESOURCE_GROUP_NAME \
--vnet-name $VNET_NAME \
--name $SUBNET_NAME \
--address-prefix $SUBNET_PREFIX
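
The subnet prefix has to fall inside the virtual network's address space, otherwise the subnet creation fails. As a rough illustration of that constraint, the sketch below checks CIDR containment for IPv4 in plain shell arithmetic. The function names ip_to_int and cidr_contains are hypothetical, and the code assumes well-formed input without leading zeros in the octets.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# cidr_contains OUTER INNER
# Succeeds if the INNER CIDR block lies entirely within the OUTER block.
cidr_contains() {
  local outer_ip outer_len inner_ip inner_len mask
  outer_ip=${1%/*}; outer_len=${1#*/}
  inner_ip=${2%/*}; inner_len=${2#*/}
  # A block with a shorter prefix is larger and cannot fit inside OUTER.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$outer_ip") & mask )) -eq $(( $(ip_to_int "$inner_ip") & mask )) ]
}
```

With the values from this guide, cidr_contains "$ADDRESS_SPACE" "$SUBNET_PREFIX" succeeds, because 10.31.0.0/21 is the lower half of 10.31.0.0/20.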

Create a public IP address for the NAT gateway by using the az network public-ip create command.

az network public-ip create \
--resource-group $RESOURCE_GROUP_NAME \
--name ${NAT_GATEWAY_NAME}-public-ip \
--sku Standard \
--location $LOCATION

Create a NAT gateway by using the az network nat gateway create command.

az network nat gateway create \
--resource-group $RESOURCE_GROUP_NAME \
--name $NAT_GATEWAY_NAME \
--public-ip-addresses ${NAT_GATEWAY_NAME}-public-ip \
--location $LOCATION

Associate the NAT gateway with the node subnet by using the az network vnet subnet update command.

az network vnet subnet update \
--resource-group $RESOURCE_GROUP_NAME \
--vnet-name $VNET_NAME \
--name $SUBNET_NAME \
--nat-gateway $NAT_GATEWAY_NAME

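Pre-cluster monitoring and governance deployment

Before you deploy the AKS cluster for Kafka, deploy the prerequisite monitoring and governance resources that support the AKS cluster deployment.

Create a Log Analytics workspace by using the az monitor log-analytics workspace create command.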

az monitor log-analytics workspace create \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name $LOG_ANALYTICS_WORKSPACE_NAME \
--location $LOCATION

Create an Azure Monitor workspace for Prometheus by using the az monitor account create command.

az monitor account create \
--resource-group $RESOURCE_GROUP_NAME \
--name $PROMETHEUS_WORKSPACE_NAME \
--location $LOCATION

Create an Azure Managed Grafana instance by using the az grafana create command.

az grafana create \
--resource-group $RESOURCE_GROUP_NAME \
--name $GRAFANA_NAME \
--location $LOCATION \
--api-key Enabled \
--deterministic-outbound-ip Enabled \
--public-network-access Enabled \
--grafana-major-version 11

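Note

Azure Managed Grafana offers zone redundancy in select regions. If your target region supports zone redundancy, add the --zone-redundancy Enabled parameter.

Create an Azure Container Registry by using the az acr create command.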

az acr create \
--resource-group $RESOURCE_GROUP_NAME \
--name $ACR_NAME \
--sku $ACR_SKU \
--location $LOCATION \
--admin-enabled false \
--zone-redundancy Enabled

Create a user-assigned managed identity by using the az identity create command.

az identity create \
--resource-group $RESOURCE_GROUP_NAME \
--name $USER_ASSIGNED_IDENTITY_NAME \
--location $LOCATION

Assign RBAC permissions to the managed identity of the Grafana instance by using the az role assignment create command.

az role assignment create \
--assignee $(az grafana show --resource-group $RESOURCE_GROUP_NAME --name $GRAFANA_NAME --query identity.principalId -o tsv) \
--role "Monitoring Reader" --scope $(az group show --name $RESOURCE_GROUP_NAME --query id -o tsv)

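AKS cluster deployment

Use the Azure CLI to deploy the AKS cluster with a dedicated node pool for Kafka in each availability zone.

Assign the Network Contributor role to the user-assigned managed identity for AKS by using the az role assignment create command.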

az role assignment create \
--assignee $(az identity show --resource-group $RESOURCE_GROUP_NAME --name $USER_ASSIGNED_IDENTITY_NAME --query principalId -o tsv) \
--role "Network Contributor" \
--scope $(az group show --name $RESOURCE_GROUP_NAME --query id -o tsv)

Create the AKS cluster by using the az aks create command.

az aks create \
--name $AKS_CLUSTER_NAME \
--aad-admin-group-object-ids $AAD_ADMIN_GROUP_OBJECT_IDS \
--aad-tenant-id $AAD_TENANT_ID \
--assign-identity $(az identity show --resource-group $RESOURCE_GROUP_NAME --name $USER_ASSIGNED_IDENTITY_NAME --query id -o tsv) \
--attach-acr $(az acr show --resource-group $RESOURCE_GROUP_NAME --name $ACR_NAME --query id -o tsv) \
--auto-upgrade-channel patch \
--enable-aad \
--enable-addons monitoring \
--enable-azure-monitor-metrics \
--enable-cluster-autoscaler \
--enable-managed-identity \
--enable-oidc-issuer \
--enable-workload-identity \
--kubernetes-version $KUBERNETES_VERSION \
--load-balancer-sku standard \
--location $LOCATION \
--max-count $SYSTEM_NODE_COUNT_MAX \
--max-pods 110 \
--min-count $SYSTEM_NODE_COUNT_MIN \
--network-dataplane cilium \
--network-plugin azure \
--network-plugin-mode overlay \
--network-policy cilium \
--node-osdisk-type Ephemeral \
--node-os-upgrade-channel NodeImage \
--node-vm-size $SYSTEM_NODE_VM_SIZE \
--nodepool-labels "role=system" \
--nodepool-name systempool \
--nodepool-tags "env=production" \
--os-sku AzureLinux \
--outbound-type userAssignedNATGateway \
--pod-cidr 10.244.0.0/16 \
--resource-group $RESOURCE_GROUP_NAME \
--tags "env=production" \
--tier $AKS_TIER \
--vnet-subnet-id $(az network vnet subnet show --resource-group $RESOURCE_GROUP_NAME --vnet-name $VNET_NAME --name $SUBNET_NAME --query id -o tsv) \
--workspace-resource-id $(az monitor log-analytics workspace show --resource-group $RESOURCE_GROUP_NAME --workspace-name $LOG_ANALYTICS_WORKSPACE_NAME --query id -o tsv) \
--zones 1 2 3

Create an additional node pool for each availability zone by using a for loop and the az aks nodepool add command.

for zone in 1 2 3; do
  az aks nodepool add \
  --cluster-name $AKS_CLUSTER_NAME \
  --enable-cluster-autoscaler \
  --labels app=kafka \
  --max-count $KAFKA_NODE_COUNT_MAX \
  --max-surge 10% \
  --min-count $KAFKA_NODE_COUNT_MIN \
  --node-count $KAFKA_NODE_COUNT \
  --mode User \
  --name "kafka$zone" \
  --node-osdisk-type Ephemeral \
  --node-vm-size $KAFKA_NODE_VM_SIZE \
  --os-sku AzureLinux \
  --resource-group $RESOURCE_GROUP_NAME \
  --vnet-subnet-id $(az network vnet subnet show --resource-group $RESOURCE_GROUP_NAME --vnet-name $VNET_NAME --name $SUBNET_NAME --query id -o tsv) \
  --zones $zone
done
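
The loop above derives each pool name from a prefix plus the zone number. AKS restricts Linux node pool names to lowercase letters and digits, starting with a letter, with at most 12 characters, so a longer prefix can make the loop fail partway through. The sketch below previews the names before any az call is made. The helpers valid_pool_name and plan_pool_names are hypothetical and only encode that naming rule.

```bash
# Succeeds if $1 looks like a valid Linux node pool name:
# lowercase letters and digits only, first character a letter, 1-12 characters.
valid_pool_name() {
  local LC_ALL=C   # keep [a-z] strictly ASCII regardless of the caller's locale
  case "$1" in
    ""|[!a-z]*|*[!a-z0-9]*) return 1 ;;
  esac
  [ "${#1}" -le 12 ]
}

# plan_pool_names PREFIX ZONE...
# Prints the pool name that would be used for each zone and flags invalid ones.
plan_pool_names() {
  local prefix zone name
  prefix=$1; shift
  for zone in "$@"; do
    name="${prefix}${zone}"
    if valid_pool_name "$name"; then
      echo "zone $zone -> $name"
    else
      echo "zone $zone -> $name (invalid)"
    fi
  done
}
```

Running plan_pool_names kafka 1 2 3 lists kafka1, kafka2, and kafka3, which match the names the loop creates.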

Enable the Azure Managed Prometheus and Grafana integration by using the az aks update command.

az aks update \
--name $AKS_CLUSTER_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--enable-azure-monitor-metrics \
--azure-monitor-workspace-resource-id $(az monitor account show --resource-group $RESOURCE_GROUP_NAME --name $PROMETHEUS_WORKSPACE_NAME --query id -o tsv) \
--grafana-resource-id $(az grafana show --resource-group $RESOURCE_GROUP_NAME --name $GRAFANA_NAME --query id -o tsv)

Optional: Configure diagnostic settings for the AKS cluster by using the az monitor diagnostic-settings create command.

az monitor diagnostic-settings create \
--resource $(az aks show --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --query id -o tsv) \
--name $DIAGNOSTIC_SETTINGS_NAME \
--workspace $(az monitor log-analytics workspace show --resource-group $RESOURCE_GROUP_NAME --workspace-name $LOG_ANALYTICS_WORKSPACE_NAME --query id -o tsv) \
--logs '[{"category": "kube-apiserver", "enabled": true}, {"category": "kube-audit", "enabled": true}, {"category": "kube-audit-admin", "enabled": true}, {"category": "kube-controller-manager", "enabled": true}, {"category": "kube-scheduler", "enabled": true}, {"category": "cluster-autoscaler", "enabled": true}, {"category": "cloud-controller-manager", "enabled": true}, {"category": "guard", "enabled": true}, {"category": "csi-azuredisk-controller", "enabled": true}, {"category": "csi-azurefile-controller", "enabled": true}, {"category": "csi-snapshot-controller", "enabled": true}]' \
--metrics '[{"category": "AllMetrics", "enabled": true}]'
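
The --logs value above is a long hand-written JSON array, which is easy to break when adding or removing a category. One possible way to keep it manageable is to generate the array from a plain list of category names. The helper build_logs_json below is a hypothetical sketch that reproduces the same JSON shape as the command above. It does not escape special characters, so it assumes simple category names like the ones shown.

```bash
# build_logs_json CATEGORY...
# Prints a JSON array with one {"category": ..., "enabled": true} entry per argument.
build_logs_json() {
  local sep="" category
  printf '['
  for category in "$@"; do
    printf '%s{"category": "%s", "enabled": true}' "$sep" "$category"
    sep=", "
  done
  printf ']\n'
}
```

The result could then be passed as --logs "$(build_logs_json kube-apiserver kube-audit cluster-autoscaler)" in place of the literal string.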

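As an alternative to the Azure CLI steps above, in this section you use Terraform to deploy the AKS cluster and the supporting infrastructure resources:

A private AKS cluster with node pools in each availability zone, created by using the Azure Verified Module (AVM).
Virtual network and subnet configuration.
A NAT gateway for outbound connectivity.
Azure Container Registry with a private endpoint.
A user-assigned managed identity for AKS.
An Azure Monitor workspace for Prometheus metrics.
An Azure Managed Grafana dashboard with Prometheus integration.
Dedicated node pools for Kafka workloads with appropriate labels.
The Azure Disk CSI driver for persistent volumes (enabled by default).

Note

This Terraform deployment uses the Azure Verified Module for production AKS clusters. As a result, the cluster is deployed as a private cluster with preset configurations. You must establish appropriate connectivity to the cluster before you can run the subsequent kubectl commands.

To customize the module configuration to meet your needs, fork or clone the repository and update the module source reference.

Copy the variables.tf file to your Terraform directory.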

variable "azure_subscription_id" {
  type        = string
  description = "The Azure subscription ID to use for the resources."
}

variable "enable_telemetry" {
  type        = bool
  default     = true
  description = "This variable controls whether or not telemetry is enabled for the module."
}

variable "kubernetes_cluster_name" {
  type        = string
  default     = "kafka-cluster"
  description = "The name of the Kubernetes cluster."
}

variable "kubernetes_version" {
  type        = string
  default     = "1.30"
  description = "The version of Kubernetes to use for the cluster."
}

variable "resource_group_name" {
  type        = string
  description = "The name of the resource group in which to create the resources."
}

variable "rbac_aad_admin_group_object_ids" {
  type        = list(string)
  description = "The object IDs of the Azure AD groups that should be granted admin access to the Kubernetes cluster."
}

variable "location" {
  type        = string
  description = "The location in which to create the resources."
}

Review the variables and create a kafka.tfvars file as needed. Update it with values that meet your requirements:

# Replace placeholder values with your actual configuration

azure_subscription_id = "00000000-0000-0000-0000-000000000000" # Replace with your actual subscription ID
location              = "Canada Central"
enable_telemetry      = true
kubernetes_cluster_name = "kafka-aks-cluster"
kubernetes_version    = "1.30"
resource_group_name   = "rg-kafka-prod"
rbac_aad_admin_group_object_ids = [
"0000-0000-0000-0000", 
# Add additional admin group object IDs as needed
]
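
The all-zero IDs in this file are placeholders, and Terraform typically reports them only once a provider call fails. As a small guard, the sketch below scans a tfvars file for quoted values made up solely of zeros and hyphens. The function name has_placeholder_ids is hypothetical, and the pattern is an assumption about what a placeholder looks like in this guide.

```bash
# has_placeholder_ids FILE
# Succeeds (exit 0) if FILE still contains a quoted value of at least eight
# characters consisting only of zeros and hyphens, such as "0000-0000-0000-0000".
has_placeholder_ids() {
  grep -Eq '"[0-]{8,}"' "$1"
}
```

A possible use is to run has_placeholder_ids kafka.tfvars before terraform plan and stop if it succeeds.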

Copy the main.tf file to your Terraform directory.

terraform {
  required_version = ">= 1.3.0"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 4, <5"
    }
  }
}
provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
  subscription_id = var.azure_subscription_id
}
module "naming" {
  source  = "Azure/naming/azurerm"
  version = ">= 0.3.0"
}

resource "azurerm_user_assigned_identity" "this" {
  location            = var.location
  name                = "uami-${var.kubernetes_cluster_name}"
  resource_group_name = var.resource_group_name
}

data "azurerm_client_config" "current" {}

module "avm-ptn-aks-production" {
  source = "github.com/Azure/terraform-azurerm-avm-ptn-aks-production"
  kubernetes_version  = var.kubernetes_version
  enable_telemetry    = var.enable_telemetry 
  name                = var.kubernetes_cluster_name
  resource_group_name = var.resource_group_name
  location = var.location 
  default_node_pool_vm_sku = "Standard_D8ds_v5"
  network = {
    name                = module.avm_res_network_virtualnetwork.name
    resource_group_name = var.resource_group_name
    node_subnet_id      = module.avm_res_network_virtualnetwork.subnets["subnet"].resource_id
    pod_cidr            = "192.168.0.0/16"
  }
  acr = {
    name                          = module.naming.container_registry.name_unique
    subnet_resource_id            = module.avm_res_network_virtualnetwork.subnets["private_link_subnet"].resource_id
    private_dns_zone_resource_ids = [azurerm_private_dns_zone.this.id]
  }
  managed_identities = {
    user_assigned_resource_ids = [
      azurerm_user_assigned_identity.this.id
    ]
  }
  rbac_aad_tenant_id = data.azurerm_client_config.current.tenant_id
  rbac_aad_admin_group_object_ids =  var.rbac_aad_admin_group_object_ids
  rbac_aad_azure_rbac_enabled = true

  node_pools = {
    kafka = {
      name                 = "kafka"
      vm_size              = "Standard_D16ds_v5"
      orchestrator_version = var.kubernetes_version
      max_count            = 3
      min_count            = 1
      os_sku               = "AzureLinux"
      mode                 = "User"
      os_disk_size_gb      = 128
      labels = {
        "app" = "kafka"
      }
    }
  }
}

resource "azurerm_private_dns_zone" "this" {
  name                = "privatelink.azurecr.io"
  resource_group_name = var.resource_group_name
}

resource "azurerm_nat_gateway" "this" {
  location            = var.location
  name                = module.naming.nat_gateway.name_unique
  resource_group_name = var.resource_group_name
}

resource "azurerm_public_ip" "this" {
  name                = module.naming.public_ip.name_unique
  location            = var.location
  resource_group_name = var.resource_group_name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway_public_ip_association" "this" {
  nat_gateway_id       = azurerm_nat_gateway.this.id
  public_ip_address_id = azurerm_public_ip.this.id  
}

module "avm_res_network_virtualnetwork" {
  source  = "Azure/avm-res-network-virtualnetwork/azurerm"
  version = "0.7.1"

  address_space       = ["10.31.0.0/16"]
  location            = var.location
  name                = "vnet-aks-lab"
  resource_group_name = var.resource_group_name
  subnets = {
    "subnet" = {
      name             = "nodecidr"
      address_prefixes = ["10.31.0.0/17"]
      nat_gateway = {
        id = azurerm_nat_gateway.this.id
      }
      private_link_service_network_policies_enabled = false
    }
    "private_link_subnet" = {
      name             = "private_link_subnet"
      address_prefixes = ["10.31.129.0/24"]
    }
  }
}

resource "azurerm_monitor_workspace" "this" {
  name                = "prometheus-aks"
  location            = var.location
  resource_group_name = var.resource_group_name
}

resource "azurerm_monitor_data_collection_endpoint" "dataCollectionEndpoint" {
  name                = "prom-aks-endpoint"
  location            = var.location
  resource_group_name = var.resource_group_name
  kind                = "Linux"
}

resource "azurerm_monitor_data_collection_rule" "dataCollectionRule" {
  name      = "prom-aks-dcr"
  location            = var.location
  resource_group_name = var.resource_group_name
  data_collection_endpoint_id = azurerm_monitor_data_collection_endpoint.dataCollectionEndpoint.id
  kind                        = "Linux"
  description = "DCR for Azure Monitor Metrics Profile (Managed Prometheus)"
  destinations {
    monitor_account {
      monitor_account_id = azurerm_monitor_workspace.this.id
      name               = "PrometheusAzMonitorAccount"
    }
  }
  data_flow {
    streams      = ["Microsoft-PrometheusMetrics"]
    destinations = ["PrometheusAzMonitorAccount"]
  }
  data_sources {
    prometheus_forwarder {
      streams = ["Microsoft-PrometheusMetrics"]
      name    = "PrometheusDataSource"
    }
  }

}

resource "azurerm_monitor_data_collection_rule_association" "dataCollectionRuleAssociation" {
  name                    = "prom-aks-dcra"
  target_resource_id      = module.avm-ptn-aks-production.resource_id
  data_collection_rule_id = azurerm_monitor_data_collection_rule.dataCollectionRule.id
  description             = "Association of data collection rule. Deleting this association will break the data collection for this AKS Cluster."
}

resource "azurerm_dashboard_grafana" "this" {
  name                              = "grafana-kafka-aks"
  location                          = var.location
  resource_group_name               = var.resource_group_name
  api_key_enabled                   = true
  deterministic_outbound_ip_enabled = true
  public_network_access_enabled     = true
  grafana_major_version             = 11

  azure_monitor_workspace_integrations {
    resource_id = azurerm_monitor_workspace.this.id
  }

  identity {
    type = "SystemAssigned"
  }
}

data "azurerm_resource_group" "current" {
  name       = var.resource_group_name
  depends_on = [azurerm_dashboard_grafana.this]
}

resource "azurerm_role_assignment" "grafana_monitoring_reader" {
  scope                            = data.azurerm_resource_group.current.id
  role_definition_name             = "Monitoring Reader"
  principal_id                     = azurerm_dashboard_grafana.this.identity[0].principal_id
  skip_service_principal_aad_check = true
}

resource "azurerm_kubernetes_cluster_extension" "container_storage" {
  name           = "microsoft-azurecontainerstorage"
  cluster_id     = module.avm-ptn-aks-production.resource_id
  extension_type = "microsoft.azurecontainerstorage"
  configuration_settings = {
    "enable-azure-container-storage" : "azureDisk",
  }
}

Initialize Terraform by using the terraform init command.
```
terraform init  
```
Create a deployment plan by using the terraform plan command.
```
terraform plan -var-file="kafka.tfvars"
```

Apply the configuration by using the terraform apply command.

terraform apply -var-file="kafka.tfvars"

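Validate the deployment and connect to the cluster

After you deploy the AKS cluster, use the following steps to validate the deployment and connect to the AKS API server.

Verify the deployment of the AKS cluster by using the az aks show command.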

az aks show --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME --output table

After you validate the deployment, connect to the AKS cluster by using the az aks get-credentials command.

az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $AKS_CLUSTER_NAME

Verify connectivity by listing the nodes with the kubectl get command.
```
kubectl get nodes  
```
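
Because the persistent volumes are zone-bound, it can be worth confirming that Kafka nodes actually exist in all three zones. The sketch below counts nodes per zone from kubectl output read on standard input. It assumes the zone appears in the last column, as it does with kubectl get nodes -l app=kafka -L topology.kubernetes.io/zone --no-headers when every node carries that label. The function names are hypothetical.

```bash
# Reads node listing lines on stdin, treats the last column as the zone,
# and prints "<zone> <node-count>" sorted by zone.
count_nodes_per_zone() {
  awk 'NF { count[$NF]++ } END { for (z in count) print z, count[z] }' | sort
}

# has_min_zones N
# Succeeds only if at least N distinct zones appear in the listing on stdin.
has_min_zones() {
  local zones
  zones=$(count_nodes_per_zone | wc -l | tr -d ' ')
  [ "$zones" -ge "$1" ]
}
```

Piping the kubectl command above into has_min_zones 3 would then fail if any zone is missing Kafka nodes.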

Create a storage class for Kafka

Create a storage class for Premium SSD v2 disks by using the kubectl apply command.

kubectl apply -f - <<EOF  
---  
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-premium-ssd-v2
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  diskIOPSReadWrite: "5000"
  diskMBpsReadWrite: "200"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

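Important

The preceding storage configuration is a starting point. For production deployments, adjust the diskIOPSReadWrite and diskMBpsReadWrite values based on your expected Kafka cluster size and workload requirements.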
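
As a rough aid for choosing diskMBpsReadWrite, the sketch below estimates the write throughput each broker disk has to absorb. It relies on a simplifying assumption that is not from this article: every produced byte is written once per replica, and writes are spread evenly across brokers. It ignores consumer reads, compaction, and headroom, so treat the result as a lower bound. The function name per_broker_write_mbps is hypothetical.

```bash
# per_broker_write_mbps INGRESS_MBPS REPLICATION_FACTOR BROKER_COUNT
# Prints ceil(INGRESS_MBPS * REPLICATION_FACTOR / BROKER_COUNT) using integer math.
per_broker_write_mbps() {
  local ingress=$1 rf=$2 brokers=$3
  echo $(( (ingress * rf + brokers - 1) / brokers ))
}
```

For instance, 150 MB/s of producer traffic with a replication factor of 3 across 3 brokers works out to 150 MB/s per broker, which still fits within the 200 MB/s configured in the storage class above.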

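Next steps

Deploy Strimzi and Kafka on Azure Kubernetes Service (AKS)

Contributors

Microsoft maintains this article. The following contributors originally wrote it:

Sergio Navar | Senior Customer Engineer
Erin Schaffer | Content Developer 2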

Last updated on 2025-11-24
