This article describes how to configure node pools for Azure Kubernetes Service (AKS) node auto provisioning (NAP), including SKU selectors, resource limits, and priority weights.

## Node pool overview

NodePools set constraints on the nodes that node auto provisioning creates and on the pods that run on those nodes. A default NodePool is created when you first install node auto provisioning. You can modify this NodePool or add more NodePools.

Key behaviors of NodePools:

- Node auto provisioning requires at least one NodePool to function.
- Node auto provisioning evaluates each configured NodePool.
- Node auto provisioning skips NodePools that have taints the pod doesn't tolerate.
- Node auto provisioning applies startup taints to provisioned nodes, but pods don't need to tolerate them.
- Node auto provisioning works best with mutually exclusive NodePools. When multiple NodePools match, the NodePool with the highest weight is used.

Node auto provisioning uses virtual machine (VM) stock-keeping unit (SKU) requirements to select the best VM for pending workloads. You can configure SKU families, VM types, spot or on-demand instances, architectures, and resource limits.
## Default node pool configuration

### Understand the default NodePools

AKS creates a default NodePool configuration that you can customize:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.affinity.nodeAffinity rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
```
A `system-surge` NodePool is also created to automatically scale system pool nodes.
### Control default NodePool creation

The `--node-provisioning-default-pools` flag controls which default node auto provisioning NodePools are created:

- `Auto` (default): Creates the two standard NodePools for immediate use.
- `None`: No default NodePools are created; you must define your own NodePools.
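As a sketch of how the flag might be used, the following command switches an existing cluster to `None`. The resource group and cluster names are placeholders, and this assumes `az aks update` accepts the `--node-provisioning-default-pools` flag named above:

```bash
az aks update \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --node-provisioning-default-pools None
```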
Warning

Changing from `Auto` to `None` on an existing cluster doesn't automatically delete the default NodePools. You must delete them manually if you no longer want them.
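To remove the previously created pools by hand, you can delete them by name. This is a sketch that assumes the default pools are named `default` and `system-surge`, as described earlier in this article:

```bash
# Delete the two default NodePools created under the Auto setting
kubectl delete nodepool default system-surge
```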
## Comprehensive NodePool example

The following example demonstrates all of the available NodePool configuration options with detailed comments:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # Template section that describes how to template out NodeClaim resources that Karpenter will provision
  # Karpenter will consider this template to be the minimum requirements needed to provision a Node using this NodePool
  # It will overlay this NodePool with Pods that need to schedule to further constrain the NodeClaims
  # Karpenter will provision to launch new Nodes for the cluster
  template:
    metadata:
      # Labels are arbitrary key-values that are applied to all nodes
      labels:
        billing-team: my-team
      # Annotations are arbitrary key-values that are applied to all nodes
      annotations:
        example.com/owner: "my-team"
    spec:
      nodeClassRef:
        apiVersion: karpenter.azure.com/v1beta1
        kind: AKSNodeClass
        name: default
      # Provisioned nodes will have these taints
      # Taints may prevent pods from scheduling if they are not tolerated by the pod.
      taints:
        - key: example.com/special-taint
          effect: NoSchedule
      # Provisioned nodes will have these taints, but pods do not need to tolerate these taints to be provisioned by this
      # NodePool. These taints are expected to be temporary and some other entity (e.g. a DaemonSet) is responsible for
      # removing the taint after it has finished initializing the node.
      startupTaints:
        - key: example.com/another-taint
          effect: NoSchedule
      # Requirements that constrain the parameters of provisioned nodes.
      # These requirements are combined with pod.spec.topologySpreadConstraints, pod.spec.affinity.nodeAffinity, pod.spec.affinity.podAffinity, and pod.spec.nodeSelector rules.
      # Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.
      # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators
      requirements:
        - key: "karpenter.azure.com/sku-family"
          operator: In
          values: ["D", "E", "F"]
          # minValues here enforces the scheduler to consider at least that number of unique sku-family to schedule the pods.
          # This field is ALPHA and can be dropped or replaced at any time
          minValues: 2
        - key: "karpenter.azure.com/sku-name"
          operator: In
          values: ["Standard_D2s_v3","Standard_D4s_v3","Standard_E2s_v3","Standard_E4s_v3","Standard_F2s_v2","Standard_F4s_v2"]
          minValues: 5
        - key: "karpenter.azure.com/sku-cpu"
          operator: In
          values: ["2", "4", "8", "16"]
        - key: "karpenter.azure.com/sku-version"
          operator: Gt
          values: ["2"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["chinanorth3-1", "chinanorth3-2"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64", "amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
      # ExpireAfter is the duration the controller will wait
      # before terminating a node, measured from when the node is created. This
      # is useful to implement features like eventually consistent node upgrade,
      # memory leak protection, and disruption testing.
      expireAfter: 720h
      # TerminationGracePeriod is the maximum duration the controller will wait before forcefully deleting the pods on a node, measured from when deletion is first initiated.
      #
      # Warning: this feature takes precedence over a Pod's terminationGracePeriodSeconds value, and bypasses any blocked PDBs or the karpenter.sh/do-not-disrupt annotation.
      #
      # This field is intended to be used by cluster administrators to enforce that nodes can be cycled within a given time period.
      # When set, drifted nodes will begin draining even if there are pods blocking eviction. Draining will respect PDBs and the do-not-disrupt annotation until the TGP is reached.
      #
      # Karpenter will preemptively delete pods so their terminationGracePeriodSeconds align with the node's terminationGracePeriod.
      # If a pod would be terminated without being granted its full terminationGracePeriodSeconds prior to the node timeout,
      # that pod will be deleted at T = node timeout - pod terminationGracePeriodSeconds.
      #
      # The feature can also be used to allow maximum time limits for long-running jobs which can delay node termination with preStop hooks.
      # If left undefined, the controller will wait indefinitely for pods to be drained.
      terminationGracePeriod: 30s
  # Disruption section which describes the ways in which Karpenter can disrupt and replace Nodes
  # Configuration in this section constrains how aggressive Karpenter can be with performing operations
  # like rolling Nodes due to them hitting their maximum lifetime (expiry) or scaling down nodes to reduce cluster cost
  disruption:
    # Describes which types of Nodes Karpenter should consider for consolidation
    # If using 'WhenEmptyOrUnderutilized', Karpenter will consider all nodes for consolidation and attempt to remove or replace Nodes when it discovers that the Node is underutilized and could be changed to reduce cost
    # If using `WhenEmpty`, Karpenter will only consider nodes for consolidation that contain no workload pods
    consolidationPolicy: WhenEmptyOrUnderutilized | WhenEmpty
    # The amount of time Karpenter should wait after discovering a consolidation decision
    # This value can currently only be set when the consolidationPolicy is 'WhenEmpty'
    # You can choose to disable consolidation entirely by setting the string value 'Never' here
    consolidateAfter: 30s
    # Budgets control the speed Karpenter can scale down nodes.
    # Karpenter will respect the minimum of the currently active budgets, and will round up
    # when considering percentages. Duration and Schedule must be set together.
    budgets:
      - nodes: 10%
      # On Weekdays during business hours, don't do any deprovisioning.
      - schedule: "0 9 * * mon-fri"
        duration: 8h
        nodes: "0"
  # Resource limits constrain the total size of the pool.
  # Limits prevent Karpenter from creating new instances once the limit is exceeded.
  limits:
    cpu: "1000"
    memory: 1000Gi
  # Priority given to the NodePool when the scheduler considers which NodePool
  # to select. Higher weights indicate higher priority when comparing NodePools.
  # Specifying no weight is equivalent to specifying a weight of 0.
  weight: 10
```
## Supported node provisioner requirements

Kubernetes defines well-known labels, and Azure implements them. These labels are defined in the `spec.requirements` section of the NodePool API.

In addition to the Kubernetes well-known labels, node auto provisioning supports Azure-specific labels for more advanced scheduling.
### SKU selectors with well-known labels

| Selector | Description | Example |
|---|---|---|
| `karpenter.azure.com/sku-family` | VM SKU family | D, F, L, etc. |
| `karpenter.azure.com/sku-name` | Explicit SKU name | Standard_A1_v2 |
| `karpenter.azure.com/sku-version` | SKU version (without "v"; can use 1) | 1, 2 |
| `karpenter.sh/capacity-type` | VM allocation type (spot / on-demand) | spot or on-demand |
| `karpenter.azure.com/sku-cpu` | Number of CPUs in the VM | 16 |
| `karpenter.azure.com/sku-memory` | Memory in the VM (MiB) | 131072 |
| `karpenter.azure.com/sku-gpu-name` | GPU name | A100 |
| `karpenter.azure.com/sku-gpu-manufacturer` | GPU manufacturer | nvidia |
| `karpenter.azure.com/sku-gpu-count` | GPU count per VM | 2 |
| `karpenter.azure.com/sku-networking-accelerated` | Whether the VM has accelerated networking | [true, false] |
| `karpenter.azure.com/sku-storage-premium-capable` | Whether the VM supports premium IO storage | [true, false] |
| `karpenter.azure.com/sku-storage-ephemeralos-maxsize` | Size limit for the ephemeral OS disk (GB) | 92 |
| `topology.kubernetes.io/zone` | Availability zone | [chinanorth3-1, chinanorth3-2, chinanorth3-3] |
| `kubernetes.io/os` | Operating system (Linux only during preview) | linux |
| `kubernetes.io/arch` | CPU architecture (AMD64 or ARM64) | [amd64, arm64] |
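These selectors can be combined in a single `requirements` list. The following sketch restricts provisioning to VMs with more than four vCPUs that support premium storage; the threshold values are illustrative, not recommendations:

```yaml
requirements:
  - key: karpenter.azure.com/sku-cpu
    operator: Gt
    values: ["4"]
  - key: karpenter.azure.com/sku-storage-premium-capable
    operator: In
    values: ["true"]
```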
### Well-known labels

#### Instance types

- key: `node.kubernetes.io/instance-type`
- key: `karpenter.azure.com/sku-family`
- key: `karpenter.azure.com/sku-name`
- key: `karpenter.azure.com/sku-version`

In general, instance types should be a list rather than a single value. Leaving these requirements undefined is recommended, because doing so maximizes the choices for efficiently placing pods.

Most Azure VM sizes are supported, excluding specialized sizes that don't support AKS.
#### SKU family examples

The `karpenter.azure.com/sku-family` selector lets you target specific VM families:

- D-series: General-purpose VMs with a balanced CPU-to-memory ratio
- F-series: Compute-optimized VMs with a higher CPU-to-memory ratio
- E-series: Memory-optimized VMs for memory-intensive applications
- L-series: Storage-optimized VMs with high disk throughput
- N-series: GPU-enabled VMs for compute-intensive workloads
Example configuration:

```yaml
requirements:
  - key: karpenter.azure.com/sku-family
    operator: In
    values:
      - D
      - F
```
#### SKU name examples

The `karpenter.azure.com/sku-name` selector lets you specify exact VM instance types:

```yaml
requirements:
  - key: karpenter.azure.com/sku-name
    operator: In
    values:
      - Standard_D4s_v3
      - Standard_F8s_v2
```
#### SKU version examples

The `karpenter.azure.com/sku-version` selector targets specific generations:

```yaml
requirements:
  - key: karpenter.azure.com/sku-version
    operator: In
    values:
      - "3" # v3 generation
      - "5" # v5 generation
```
#### Availability zones

- key: `topology.kubernetes.io/zone`
- value example: `chinanorth3-1`
- value list: run `az account list-locations --output table` to view the available zones

You can configure node auto provisioning to create nodes in a specific zone. Note that availability zone `chinanorth3-1` for your Azure subscription might not correspond to the same physical location as `chinanorth3-1` for another Azure subscription.
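A minimal zone constraint looks like the following; the zones shown are examples from the table above:

```yaml
requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values:
      - chinanorth3-1
      - chinanorth3-2
```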
#### Architecture

- key: `kubernetes.io/arch`
- values:
  - `amd64`
  - `arm64`

Node auto provisioning supports both `amd64` and `arm64` nodes.
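For example, a sketch that restricts a NodePool to Arm-based VMs only:

```yaml
requirements:
  - key: kubernetes.io/arch
    operator: In
    values:
      - arm64
```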
#### Operating system

- key: `kubernetes.io/os`
- values:
  - `linux`

Node auto provisioning supports the `linux` operating system (Ubuntu + AzureLinux). Windows support is coming soon.
#### Capacity type

- key: `karpenter.sh/capacity-type`
- values:
  - `spot`
  - `on-demand`

If a NodePool allows both spot and on-demand instances, node auto provisioning prioritizes spot offerings.
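The following sketch prefers spot capacity while allowing fallback to on-demand when spot capacity is unavailable:

```yaml
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - spot
      - on-demand # used as fallback when spot capacity is unavailable
```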
## Node pool limits

By default, node auto provisioning attempts to schedule workloads within the available Azure quota. You can also specify an upper bound on the resources a node pool uses by setting limits in the node pool spec:

```yaml
spec:
  # Resource limits constrain the total size of the cluster.
  # Limits prevent Node Auto Provisioning from creating new instances once the limit is exceeded.
  limits:
    cpu: "1000"
    memory: 1000Gi
```
## Node pool weights

When you define multiple node pools, you can set a preference for where workloads should be scheduled by assigning relative weights to the node pool definitions:

```yaml
spec:
  # Priority given to the node pool when the scheduler considers which to select.
  # Higher weights indicate higher priority when comparing node pools.
  # Specifying no weight is equivalent to specifying a weight of 0.
  weight: 10
```
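As a concrete sketch, a pool with a higher weight is tried before any lower-weight pool whenever both match a pending pod. The pool name `preferred-pool` below is illustrative:

```yaml
# A hypothetical higher-priority pool; when this pool and a lower-weight
# pool both match a pending pod, node auto provisioning uses this one.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: preferred-pool
spec:
  weight: 50
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
```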
## Example node pool configurations

### GPU-enabled node pool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  weight: 20
  limits:
    cpu: "500"
    memory: 500Gi
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.azure.com/sku-gpu-manufacturer
          operator: In
          values:
            - nvidia
        - key: karpenter.azure.com/sku-gpu-count
          operator: Gt
          values:
            - "0"
```
### Spot instance node pool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-pool
spec:
  weight: 5
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand # Spot node pools with on-demand configured will fall back to on-demand capacity
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
            - D
            - F
```
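To use one of these examples, save the manifest to a file, apply it, and verify that the pool was created. This is a sketch; the filename is a placeholder:

```bash
# Apply the NodePool manifest and list the pools the cluster knows about
kubectl apply -f spot-pool.yaml
kubectl get nodepools
```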