
Stateful workload upgrade patterns

Use these proven patterns to upgrade clusters running databases and stateful applications without losing data.

What this article covers

This article provides database-specific upgrade patterns for Azure Kubernetes Service (AKS) clusters that run stateful workloads, such as:

  • The PostgreSQL Ferris wheel pattern for roughly 30 seconds of downtime.
  • Redis rolling replacement for zero-downtime cache upgrades.
  • The MongoDB stepdown cascade for replica set safety.
  • An emergency upgrade checklist for security response.
  • Validation and rollback procedures for data protection.

These patterns work best for database administrators and for applications with persistent data and mission-critical stateful services.

Choose your pattern

Database type | Upgrade pattern | Downtime | Complexity | Best for
PostgreSQL | Ferris wheel | ~30 seconds | Medium | Production databases
Redis | Rolling replacement | None | - | Cache layers
MongoDB | Stepdown cascade | ~10 seconds | Medium | Document databases
Elasticsearch | Shard rebalancing (coming soon) | None | - | Search clusters
Any database | Backup/restore (coming soon) | 2 to 5 minutes | - | Simple setups

Emergency upgrade checklist

Do you need to upgrade right now because of a security issue?

  1. Run the following commands for an immediate security check (two minutes):

    # Verify all replicas are healthy
    kubectl get pods -l tier=database -o wide
    # Check replication lag
    ./scripts/check-replica-health.sh
    # Ensure recent backup exists
    kubectl get job backup-job -o jsonpath='{.status.completionTime}'
    
  2. Choose your emergency pattern (one minute):

    • PostgreSQL/MySQL: Use the Ferris wheel (30-second downtime).
    • Redis/Memcached: Use rolling replacement (zero downtime).
    • MongoDB/CouchDB: Use the stepdown cascade (10-second downtime).
  3. Run with safety nets (15-minute to 30-minute window):

    • Always test the rollback procedure in advance.
    • Monitor application metrics during the upgrade; a minimal watch loop is sketched after this list.
    • Keep the database team on standby.
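
For the monitoring item, the following is a minimal sketch of a loop you can keep running in a second terminal during the upgrade window. The tier=database label matches the check above; the replica pod name and the replay-lag query are PostgreSQL-specific assumptions, so adapt them for other engines.

#!/bin/bash
# upgrade-watch.sh (sketch): poll pod health and replication lag during the window.
# REPLICA is an assumed standby pod name; adjust for your cluster.
REPLICA=${REPLICA:-postgres-cluster-1}
while true; do
    kubectl get pods -l tier=database -o wide
    # Replay lag in seconds, measured on a standby (PostgreSQL-specific)
    kubectl exec "$REPLICA" -- psql -t -c \
        "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());" || true
    sleep 15
done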

Ferris wheel pattern: PostgreSQL

This pattern works best for 3-node PostgreSQL clusters with a primary/replica setup across availability zones.

Visual pattern:

Initial: [PRIMARY] [REPLICA-1] [REPLICA-2]
Step 1:  [PRIMARY] [REPLICA-1] [NEW-NODE]  ← Add new node
Step 2:  [REPLICA-1] [NEW-NODE] [REPLICA-2] ← Promote & remove old primary  
Step 3:  [NEW-PRIMARY] [NEW-NODE] [REPLICA-2] ← Complete rotation

Quick implementation (20 minutes)

# 1. Add new node to cluster
kubectl scale statefulset postgres-cluster --replicas=4
# 2. Wait for new replica to sync
kubectl wait --for=condition=ready pod postgres-cluster-3 --timeout=300s
# 3. Promote new primary and failover (30-second downtime window)
kubectl exec postgres-cluster-3 -- pg_ctl promote -D /var/lib/postgresql/data
# 4. Update service endpoint
kubectl patch service postgres-primary --patch '{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgres-cluster-3"}}}'
# 5. Remove old primary node
kubectl delete pod postgres-cluster-0

Detailed step-by-step guide

Prerequisite validation

#!/bin/bash
# pre-upgrade-validation.sh

echo "=== PostgreSQL Cluster Health Check ==="
# Check replication status
kubectl exec postgres-primary-0 -- psql -c "SELECT * FROM pg_stat_replication;"
# Verify sync replication (must show 'sync' state)
SYNC_COUNT=$(kubectl exec postgres-primary-0 -- psql -t -c "SELECT count(*) FROM pg_stat_replication WHERE sync_state='sync';")
if [ "$SYNC_COUNT" -lt 2 ]; then
    echo "ERROR: Need at least 2 synchronous replicas"
    exit 1
fi
# Confirm recent backup exists
LAST_BACKUP=$(kubectl get job postgres-backup -o jsonpath='{.status.completionTime}')
echo "Last backup: $LAST_BACKUP"
# Test failover capability in staging first
echo "✅ Prerequisites validated"

Step 1: Scale up with a new node

# Add new node with upgraded Kubernetes version
kubectl patch statefulset postgres-cluster --patch '{
  "spec": {
    "replicas": 4,
    "template": {
      "spec": {
        "nodeSelector": {
          "kubernetes.io/arch": "amd64",
          "aks-nodepool": "upgraded-pool"
        }
      }
    }
  }
}'
# Monitor new pod startup
kubectl get pods -l app=postgres-cluster -w
# Verify new replica joins cluster
kubectl exec postgres-cluster-3 -- psql -c "SELECT * FROM pg_stat_replication;"

Step 2: Run a controlled failover

#!/bin/bash
# controlled-failover.sh

echo "=== Starting Controlled Failover ==="
# Ensure minimal replication lag on the new replica (< 0.1 seconds);
# pg_last_xact_replay_timestamp() is only meaningful on a standby
LAG=$(kubectl exec postgres-cluster-3 -- psql -t -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());")
if (( $(echo "$LAG > 0.1" | bc -l) )); then
    echo "ERROR: Replication lag too high ($LAG seconds)"
    exit 1
fi
# Pause application writes (use connection pool drain)
kubectl patch configmap pgbouncer-config --patch '{"data":{"pgbouncer.ini":"[databases]\napp_db = host=postgres-primary port=5432 dbname=appdb pool_mode=statement max_db_connections=0"}}'
# Wait for active transactions to complete
sleep 10
# Promote new primary (this is the 30-second downtime window)
kubectl exec postgres-cluster-3 -- pg_ctl promote -D /var/lib/postgresql/data
# Update service selector to new primary
kubectl patch service postgres-primary --patch '{
  "spec": {
    "selector": {
      "statefulset.kubernetes.io/pod-name": "postgres-cluster-3"
    }
  }
}'
# Resume application writes
kubectl patch configmap pgbouncer-config --patch '{"data":{"pgbouncer.ini":"[databases]\napp_db = host=postgres-primary port=5432 dbname=appdb pool_mode=statement"}}'
echo "✅ Failover completed"

Step 3: Clean up and validate

# Remove old primary node
kubectl delete pod postgres-cluster-0 --force
# Scale back to 3 replicas
kubectl patch statefulset postgres-cluster --patch '{"spec":{"replicas":3}}'
# Validate cluster health
kubectl exec postgres-cluster-3 -- psql -c "SELECT * FROM pg_stat_replication;"
# Test application connectivity
kubectl run test-db-connection --image=postgres:15 --rm -it -- psql -h postgres-primary -U app_user -d app_db -c "SELECT version();"

Advanced configuration

For mission-critical databases that require less than 10 seconds of downtime:

# Use synchronous replication with multiple standbys
# postgresql.conf
synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'
synchronous_commit = 'remote_apply'
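
To confirm the settings took effect after a configuration reload, you can query them directly. A quick sketch, assuming the same postgres-cluster pods as in the steps above; note that with ANY n quorum replication, replicas report sync_state as quorum rather than sync.

# Verify synchronous replication settings are active (pod name assumed)
kubectl exec postgres-cluster-3 -- psql -c "SHOW synchronous_standby_names;"
kubectl exec postgres-cluster-3 -- psql -c "SHOW synchronous_commit;"
# Confirm replicas participate in synchronous commit
kubectl exec postgres-cluster-3 -- psql -c "SELECT application_name, sync_state FROM pg_stat_replication;"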

Success validation

To validate progress, use the following checklist (a scripted version of the first two checks is sketched after the list):

  • The new primary accepts reads and writes.
  • All replicas show healthy replication.
  • Applications reconnect automatically.
  • No data loss is detected.
  • Backup/restore is tested on the new primary.
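
The following is a minimal sketch that automates the first two checklist items. It assumes the post-failover pod names from the steps above (postgres-cluster-3 as the new primary); adjust the names for your cluster.

#!/bin/bash
# success-validation.sh (sketch): check writes and replication on the new primary.
NEW_PRIMARY=postgres-cluster-3  # assumed new primary from the steps above

# 1. New primary accepts reads and writes
kubectl exec "$NEW_PRIMARY" -- psql -c "CREATE TABLE IF NOT EXISTS upgrade_check (id serial PRIMARY KEY, ts timestamptz DEFAULT now());"
kubectl exec "$NEW_PRIMARY" -- psql -c "INSERT INTO upgrade_check DEFAULT VALUES;"
kubectl exec "$NEW_PRIMARY" -- psql -c "SELECT count(*) FROM upgrade_check;"

# 2. All replicas show healthy streaming replication
kubectl exec "$NEW_PRIMARY" -- psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"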

Emergency rollback

Immediate issues (<2 minutes)

Redirect traffic to the previous primary:

kubectl patch service postgres-primary --patch '{
  "spec": {
    "selector": {
      "statefulset.kubernetes.io/pod-name": "postgres-cluster-1"
    }
  }
}'
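
After the selector patch, confirm that the pod now behind the service is writable. A quick sketch, assuming the pod name from the patch above:

# pg_is_in_recovery() returns 'f' on a writable primary, 't' on a standby
kubectl exec postgres-cluster-1 -- psql -t -c "SELECT pg_is_in_recovery();"
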
Full failover recovery (5-10 minutes)

  1. Stop writes to the current primary:

    kubectl exec postgres-primary-0 -- psql -c "ALTER SYSTEM SET default_transaction_read_only = on;"
    kubectl exec postgres-primary-0 -- psql -c "SELECT pg_reload_conf();"
    
  2. Redirect the service to a healthy replica:

    kubectl patch service postgres-primary --patch '{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgres-replica-1-0"}}}'
    
  3. Promote the replica to be the new primary:

    kubectl exec postgres-replica-1-0 -- pg_ctl promote -D /var/lib/postgresql/data
    kubectl wait --for=condition=ready pod postgres-replica-1-0 --timeout=60s
    
  4. Update connection strings:

    kubectl patch configmap postgres-config --patch '{"data":{"primary-host":"postgres-replica-1-0.postgres"}}'
    
  5. Verify that the new primary accepts writes:

    kubectl exec postgres-replica-1-0 -- psql -c "CREATE TABLE upgrade_test (id serial, timestamp timestamp default now());"
    kubectl exec postgres-replica-1-0 -- psql -c "INSERT INTO upgrade_test DEFAULT VALUES;"
    

Expected outcome: approximately 30 seconds of downtime, zero data loss, and upgraded nodes running the current Kubernetes version.

Step 3: Upgrade Node1 (former primary)

#!/bin/bash
# upgrade-node1.sh

echo "=== Step 3: Upgrade Node1 (Former Primary) ==="

# Drain Node1 gracefully
kubectl drain aks-nodepool1-12345678-vmss000000 --grace-period=300 --delete-emptydir-data --ignore-daemonsets

# Trigger node upgrade
az aks nodepool upgrade \
    --resource-group production-rg \
    --cluster-name aks-prod \
    --name nodepool1 \
    --kubernetes-version 1.29.0 \
    --max-surge 0 \
    --max-unavailable 1

# Monitor upgrade progress
while kubectl get node aks-nodepool1-12345678-vmss000000 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' | grep -q "False"; do
    echo "Waiting for node upgrade to complete..."
    sleep 30
done

echo "Node1 upgrade completed"

Step 4: Rejoin Node1 as a replica

#!/bin/bash
# rejoin-node1.sh

echo "=== Step 4: Rejoin Node1 as Replica ==="

# Wait for postgres pod to be scheduled on upgraded node
kubectl wait --for=condition=ready pod postgres-primary-0 --timeout=300s

# Reconfigure as replica pointing to new primary (Node2)
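# NOTE: recovery.conf with standby_mode applies to PostgreSQL 11 and earlier.
# On PostgreSQL 12 and later, create an empty standby.signal file and put
# primary_conninfo in postgresql.conf instead.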
kubectl exec postgres-primary-0 -- bash -c "
echo 'standby_mode = on' >> /var/lib/postgresql/data/recovery.conf
echo 'primary_conninfo = \"host=postgres-replica-1-0.postgres port=5432\"' >> /var/lib/postgresql/data/recovery.conf
"

# Restart postgres to apply replica configuration
kubectl delete pod postgres-primary-0
kubectl wait --for=condition=ready pod postgres-primary-0 --timeout=120s

# Verify replication is working
kubectl exec postgres-replica-1-0 -- psql -c "SELECT * FROM pg_stat_replication WHERE application_name='postgres-primary-0';"

echo "Node1 successfully rejoined as replica"

Step 5: Upgrade Node3 (Replica-2)

#!/bin/bash
# upgrade-node3.sh

echo "=== Step 5: Upgrade Node3 (Replica-2) ==="

# Similar process for Node3
kubectl drain aks-nodepool1-12345678-vmss000002 --grace-period=300 --delete-emptydir-data --ignore-daemonsets

az aks nodepool upgrade \
    --resource-group production-rg \
    --cluster-name aks-prod \
    --name nodepool1 \
    --kubernetes-version 1.29.0 \
    --max-surge 0 \
    --max-unavailable 1

# Wait for upgrade and pod readiness
kubectl wait --for=condition=ready pod postgres-replica-2-0 --timeout=300s

# Verify all replicas are in sync
kubectl exec postgres-replica-1-0 -- psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

Step 6: Final failover (Node2 → Node3)

#!/bin/bash
# final-failover.sh

echo "=== Step 6: Final Failover and Node2 Upgrade ==="

# Failover primary from Node2 to Node3
kubectl patch service postgres-primary --patch '{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgres-replica-2-0"}}}'
kubectl exec postgres-replica-2-0 -- pg_ctl promote -D /var/lib/postgresql/data

# Upgrade Node2
kubectl drain aks-nodepool1-12345678-vmss000001 --grace-period=300 --delete-emptydir-data --ignore-daemonsets

az aks nodepool upgrade \
    --resource-group production-rg \
    --cluster-name aks-prod \
    --name nodepool1 \
    --kubernetes-version 1.29.0 \
    --max-surge 0 \
    --max-unavailable 1

# Rejoin Node2 as replica
kubectl wait --for=condition=ready pod postgres-replica-1-0 --timeout=300s

echo "All nodes upgraded successfully. PostgreSQL cluster operational."

Validation and monitoring

#!/bin/bash
# post-upgrade-validation.sh

echo "=== Post-Upgrade Validation ==="

# Verify cluster topology
kubectl get pods -l app=postgres -o wide

# Check all replicas are connected
kubectl exec postgres-replica-2-0 -- psql -c "SELECT application_name, client_addr, state FROM pg_stat_replication;"

# Validate data integrity
kubectl exec postgres-replica-2-0 -- psql -c "SELECT COUNT(*) FROM upgrade_test;"

# Performance validation
kubectl exec postgres-replica-2-0 -- psql -c "EXPLAIN ANALYZE SELECT * FROM pg_stat_activity;"

echo "Upgrade validation completed successfully"

Redis cluster rolling replacement

In this scenario, a six-node Redis cluster (three masters and three replicas) requires zero downtime.

Implementation

#!/bin/bash
# redis-cluster-upgrade.sh

echo "=== Redis Cluster Rolling Upgrade ==="

# Get cluster topology
kubectl exec redis-0 -- redis-cli cluster nodes

# Upgrade replica nodes first (no impact to writes)
for replica in redis-1 redis-3 redis-5; do
    echo "Upgrading replica: $replica"
    
    # Remove replica from cluster temporarily; 'cluster nodes' lists nodes
    # by IP:port, so match on the pod IP rather than the pod name
    REPLICA_IP=$(kubectl get pod $replica -o jsonpath='{.status.podIP}')
    REPLICA_ID=$(kubectl exec redis-0 -- redis-cli cluster nodes | grep "$REPLICA_IP" | cut -d' ' -f1)
    kubectl exec redis-0 -- redis-cli cluster forget $REPLICA_ID
    
    # Drain and upgrade node
    kubectl delete pod $replica
    kubectl wait --for=condition=ready pod $replica --timeout=120s
    
    # Rejoin cluster
    kubectl exec redis-0 -- redis-cli cluster meet $(kubectl get pod $replica -o jsonpath='{.status.podIP}') 6379
    
    echo "Replica $replica upgraded and rejoined"
done

# Upgrade master nodes with failover
for master in redis-0 redis-2 redis-4; do
    echo "Upgrading master: $master"
    
    # Trigger failover: CLUSTER FAILOVER must be issued on a replica of this
    # master (with the pairing above, redis-N's replica is redis-(N+1))
    replica="redis-$(( ${master#redis-} + 1 ))"
    kubectl exec $replica -- redis-cli cluster failover
    
    # Wait for failover completion
    sleep 10
    
    # Upgrade the demoted master (now replica)
    kubectl delete pod $master
    kubectl wait --for=condition=ready pod $master --timeout=120s
    
    echo "Master $master upgraded"
done

echo "Redis cluster upgrade completed"

MongoDB replica set stepdown

In this scenario, a three-member MongoDB replica set requires a coordinated primary stepdown.

Implementation

#!/bin/bash
# MongoDB upgrade script

echo "=== MongoDB Replica Set Upgrade ==="

# Check replica set status
kubectl exec mongo-0 -- mongo --eval "rs.status()"
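
The script is truncated here in this snapshot. The following is a minimal sketch of how the stepdown sequence typically continues, under these assumptions: pods mongo-0 through mongo-2, mongo-0 currently primary, and the legacy mongo shell used above. Secondaries upgrade first so the primary keeps serving writes until the final stepdown.

# --- Continuation sketch (assumptions noted above; not from the original) ---
# Upgrade the secondaries first
for member in mongo-2 mongo-1; do
    kubectl delete pod $member
    kubectl wait --for=condition=ready pod $member --timeout=180s
    # Confirm the member reports SECONDARY before moving on
    kubectl exec mongo-0 -- mongo --quiet --eval "rs.status().members.forEach(function(m){ print(m.name, m.stateStr) })"
done

# Step down the primary; clients reconnect to the new primary (~10-second window).
# The shell's connection drops on stepdown, so tolerate the resulting error.
kubectl exec mongo-0 -- mongo --eval "rs.stepDown(60)" || true

# Upgrade the former primary, now a secondary
kubectl delete pod mongo-0
kubectl wait --for=condition=ready pod mongo-0 --timeout=180s

echo "MongoDB replica set upgrade completed"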