Validate the resiliency of a Valkey cluster on Azure Kubernetes Service (AKS)

This article shows you how to validate the resiliency of a Valkey cluster on Azure Kubernetes Service (AKS).

Note

This article contains references to the term "master," a term that Microsoft no longer uses. When the term is removed from the Valkey software, we'll remove it from this article.

Build and run a sample client application for Valkey

The following steps show how to build a sample client application for Valkey and push the application's Docker image to an Azure Container Registry (ACR).

The sample client application uses the Locust load testing framework to simulate a workload on the Valkey cluster.

  1. Create a Dockerfile and a requirements.txt file and place them in a new directory using the following commands:

    mkdir valkey-client
    cd valkey-client
    
    cat > Dockerfile <<EOF
    FROM python:3.10-slim-bullseye
    # Install dependencies before copying the test script so the layer is cached when only locustfile.py changes
    COPY requirements.txt .
    RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
    COPY locustfile.py .
    # No CMD needed; the Pod spec supplies the locust command
    EOF
    
    cat > requirements.txt <<EOF
    valkey
    locust
    EOF
    
  2. Create a file named locustfile.py with the following content:

    cat > locustfile.py <<EOF
    import time
    from locust import task, User, events, tag, constant_throughput
    from valkey import ValkeyCluster
    from random import randint
    
    class ValkeyLocust(User):
        # Each simulated user issues at most 50 tasks per second
        wait_time = constant_throughput(50)
        host = "valkey-cluster.valkey.svc.cluster.local"
        def __init__(self, *args, **kwargs):
            super(ValkeyLocust, self).__init__(*args, **kwargs)
            self.client = ValkeyClient(host=self.host)
        def on_stop(self):
            self.client.close()
        @task
        @tag("set")
        def set_value(self):
            self.client.set_value("set_value")
        @task
        @tag("get")
        def get_value(self):
            self.client.get_value("get_value")
    
    class ValkeyClient(object):
        def __init__(self, host, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The mounted secret is a config line such as "requirepass <password>";
            # take the second token of the first line as the password
            with open("/etc/valkey-password/valkey-password-file.conf", "r") as f:
                self.password = f.readlines()[0].split(" ")[1].strip()
            self.host = host
            self.vc = ValkeyCluster(host=self.host,
                                    port=6379,
                                    password=self.password,
                                    username="default",
                                    # Fail fast so every error is visible in the Locust statistics
                                    cluster_error_retry_attempts=0,
                                    socket_timeout=2,
                                    # TCP keepalive; socket_keepalive is the valkey-py option name
                                    socket_keepalive=True,
                                    )

        def close(self):
            self.vc.close()
    
        def set_value(self, key, command='SET'):
            # Write a random key/value pair and report the timing to Locust;
            # "key" is only used as the request name in the Locust statistics
            start_time = time.perf_counter()
            try:
                result = self.vc.set(randint(0, 1000), randint(0, 1000))
                if not result:
                    result = ''
                length = len(str(result))
                total_time = (time.perf_counter() - start_time) * 1000
                events.request.fire(
                    request_type=command,
                    name=key,
                    response_time=total_time,
                    response_length=length,
                )
            except Exception as e:
                total_time = (time.perf_counter() - start_time) * 1000
                events.request.fire(
                    request_type=command,
                    name=key,
                    response_time=total_time,
                    response_length=0,
                    exception=e
                )
                result = ''
            return result
        def get_value(self, key, command='GET'):
            # Read a random key and report the timing to Locust
            start_time = time.perf_counter()
            try:
                result = self.vc.get(randint(0, 1000))
                if not result:
                    result = ''
                length = len(str(result))
                total_time = (time.perf_counter() - start_time) * 1000
                events.request.fire(
                    request_type=command,
                    name=key,
                    response_time=total_time,
                    response_length=length,
                )
            except Exception as e:
                total_time = (time.perf_counter() - start_time) * 1000
                events.request.fire(
                    request_type=command,
                    name=key,
                    response_time=total_time,
                    response_length=0,
                    exception=e
                )
                result = ''
            return result
    EOF
    

    This Python code implements a Locust User class that connects to the Valkey cluster and performs SET and GET operations. You can extend this class to implement more complex operations, as shown in the sketch below.
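
    As an illustration only, the following hypothetical subclass adds a weighted INCR task on top of the inherited SET and GET tasks; the class name, the counter_value key, and the task weight are assumptions for this sketch, not part of the sample application:

    class ValkeyCounterLocust(ValkeyLocust):
        # Hypothetical extension: also increment a shared counter
        @task(3)
        @tag("incr")
        def incr_value(self):
            start_time = time.perf_counter()
            try:
                result = self.client.vc.incr("counter_value")
                events.request.fire(
                    request_type="INCR",
                    name="incr_value",
                    response_time=(time.perf_counter() - start_time) * 1000,
                    response_length=len(str(result)),
                )
            except Exception as e:
                events.request.fire(
                    request_type="INCR",
                    name="incr_value",
                    response_time=(time.perf_counter() - start_time) * 1000,
                    response_length=0,
                    exception=e,
                )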

  3. Build the Docker image and push it to ACR using the az acr build command:

    az acr build --image valkey-client --registry ${MY_ACR_REGISTRY} .
    

Test the Valkey cluster on Azure Kubernetes Service (AKS)

  1. Create a Pod that uses the Valkey client image you built in the previous steps using the kubectl apply command. The Pod spec contains the Secrets Store CSI volume with the Valkey password that the client uses to connect to the Valkey cluster.

    kubectl apply -f - <<EOF
    ---
    kind: Pod
    apiVersion: v1
    metadata:
      name: valkey-client
      namespace: valkey
    spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: agentpool
                  operator: In
                  values:
                  - nodepool1
        containers:
        - name: valkey-client
          image: ${MY_ACR_REGISTRY}.azurecr.cn/valkey-client
          command: ["locust", "--processes", "4"]
          volumeMounts:
            - name: valkey-password
              mountPath: "/etc/valkey-password"
        volumes:
        - name: valkey-password
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "valkey-password"
    EOF
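
    Optionally, before starting the load test, you can verify connectivity from inside the client Pod (for example, with kubectl exec -it -n valkey valkey-client -- python). The following minimal smoke test is a sketch that assumes the same service name and password file path used above:

    from valkey import ValkeyCluster

    # Read the password from the mounted Secrets Store CSI volume
    with open("/etc/valkey-password/valkey-password-file.conf") as f:
        password = f.readline().split(" ")[1].strip()

    vc = ValkeyCluster(host="valkey-cluster.valkey.svc.cluster.local",
                       port=6379, username="default", password=password)
    print(vc.ping())  # True indicates the cluster is answering commands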
    
  2. Port forward port 8089 to access the Locust web interface on your local machine using the kubectl port-forward command.

     kubectl port-forward -n valkey valkey-client 8089:8089
    
  3. Browse to the Locust web interface at http://localhost:8089 and start the test. You can adjust the number of users and the spawn rate to simulate a workload on the Valkey cluster. Because the locustfile sets wait_time = constant_throughput(50), each user issues at most 50 requests per second, so 100 users generate roughly 5,000 requests per second. The following example uses 100 users and a spawn rate of 10:

    Screenshot of a web page showing the Locust test dashboard.

  4. Simulate a disruption by deleting the StatefulSet using the kubectl delete command with the --cascade=orphan flag. The goal is to be able to delete a single Pod without the StatefulSet immediately recreating the deleted Pod.

    kubectl delete statefulset valkey-masters --cascade=orphan
    
  5. Delete the valkey-masters-0 Pod using the kubectl delete pod command.

    kubectl delete pod valkey-masters-0
    
  6. Check the list of Pods using the kubectl get pods command.

    kubectl get pods
    

    The output should show that the valkey-masters-0 Pod is no longer in the list:

    NAME                READY   STATUS    RESTARTS   AGE
    valkey-client       1/1     Running   0          6m34s
    valkey-masters-1    1/1     Running   0          16m
    valkey-masters-2    1/1     Running   0          16m
    valkey-replicas-0   1/1     Running   0          16m
    valkey-replicas-1   1/1     Running   0          16m
    valkey-replicas-2   1/1     Running   0          16m
    
  7. Get the logs of the valkey-replicas-0 Pod using the kubectl logs command.

    kubectl logs valkey-replicas-0
    

    In the output, you can see that the entire event lasts around 18 seconds:

    1:S 05 Nov 2024 12:18:53.961 * Connection with primary lost.
    1:S 05 Nov 2024 12:18:53.961 * Caching the disconnected primary state.
    1:S 05 Nov 2024 12:18:53.961 * Reconnecting to PRIMARY 10.224.0.250:6379
    1:S 05 Nov 2024 12:18:53.961 * PRIMARY <-> REPLICA sync started
    1:S 05 Nov 2024 12:18:53.964 # Error condition on socket for SYNC: Connection refused
    1:S 05 Nov 2024 12:18:54.910 * Connecting to PRIMARY 10.224.0.250:6379
    1:S 05 Nov 2024 12:18:54.910 * PRIMARY <-> REPLICA sync started
    1:S 05 Nov 2024 12:18:54.912 # Error condition on socket for SYNC: Connection refused
    1:S 05 Nov 2024 12:18:55.920 * Connecting to PRIMARY 10.224.0.250:6379
    [..CUT..]
    1:S 05 Nov 2024 12:19:10.056 * Connecting to PRIMARY 10.224.0.250:6379
    1:S 05 Nov 2024 12:19:10.057 * PRIMARY <-> REPLICA sync started
    1:S 05 Nov 2024 12:19:10.058 # Error condition on socket for SYNC: Connection refused
    1:S 05 Nov 2024 12:19:10.709 * Node c44d4b682b6fb9b37033d3e30574873545266d67 () reported node 9e7c43890613cc3ad4006a9cdc0b5e5fc5b6d44e () as not reachable.
    1:S 05 Nov 2024 12:19:10.864 * NODE 9e7c43890613cc3ad4006a9cdc0b5e5fc5b6d44e () possibly failing.
    1:S 05 Nov 2024 12:19:11.066 * 10000 changes in 60 seconds. Saving...
    1:S 05 Nov 2024 12:19:11.068 * Background saving started by pid 29
    1:S 05 Nov 2024 12:19:11.068 * Connecting to PRIMARY 10.224.0.250:6379
    1:S 05 Nov 2024 12:19:11.068 * PRIMARY <-> REPLICA sync started
    1:S 05 Nov 2024 12:19:11.069 # Error condition on socket for SYNC: Connection refused
    29:C 05 Nov 2024 12:19:11.090 * DB saved on disk
    29:C 05 Nov 2024 12:19:11.090 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
    1:S 05 Nov 2024 12:19:11.169 * Background saving terminated with success
    1:S 05 Nov 2024 12:19:11.884 * FAIL message received from ba36d5167ee6016c01296a4a0127716f8edf8290 () about 9e7c43890613cc3ad4006a9cdc0b5e5fc5b6d44e ()
    1:S 05 Nov 2024 12:19:11.884 # Cluster state changed: fail
    1:S 05 Nov 2024 12:19:11.974 * Start of election delayed for 510 milliseconds (rank #0, offset 7225807).
    1:S 05 Nov 2024 12:19:11.976 * Node d43f370a417d299b78bd1983792469fe5c39dcdf () reported node 9e7c43890613cc3ad4006a9cdc0b5e5fc5b6d44e () as not reachable.
    1:S 05 Nov 2024 12:19:12.076 * Connecting to PRIMARY 10.224.0.250:6379
    1:S 05 Nov 2024 12:19:12.076 * PRIMARY <-> REPLICA sync started
    1:S 05 Nov 2024 12:19:12.076 * Currently unable to failover: Waiting the delay before I can start a new failover.
    1:S 05 Nov 2024 12:19:12.078 # Error condition on socket for SYNC: Connection refused
    1:S 05 Nov 2024 12:19:12.581 * Starting a failover election for epoch 15.
    1:S 05 Nov 2024 12:19:12.616 * Currently unable to failover: Waiting for votes, but majority still not reached.
    1:S 05 Nov 2024 12:19:12.616 * Needed quorum: 2. Number of votes received so far: 1
    1:S 05 Nov 2024 12:19:12.616 * Failover election won: I'm the new primary.
    1:S 05 Nov 2024 12:19:12.616 * configEpoch set to 15 after successful failover
    1:M 05 Nov 2024 12:19:12.616 * Discarding previously cached primary state.
    1:M 05 Nov 2024 12:19:12.616 * Setting secondary replication ID to c0b5b2df8a43b19a4d43d8f8b272a07139e0ca34, valid up to offset: 7225808. New replication ID is 029fcfbae0e3e4a1dccd73066043deba6140c699
    1:M 05 Nov 2024 12:19:12.616 * Cluster state changed: ok
    

    During this 18-second window, we observe that writes to the shard that belonged to the deleted Pod fail while the Valkey cluster elects a new primary node. During this window, request latency peaks at around 60 milliseconds.

    Screenshot of a graph showing the 95th percentile of request latency spiking to 60 milliseconds.

    After the new primary node is elected, the Valkey cluster continues serving requests with a latency of around 2 milliseconds.
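
    Note that the locustfile deliberately sets cluster_error_retry_attempts=0 so that every failure during the election is visible in the Locust statistics. A production client would normally allow retries so that the application rides out the failover window. A minimal sketch, assuming the valkey-py client and the same connection settings as above:

    from valkey import ValkeyCluster

    # Password read from the mounted secret, as in the locustfile above
    with open("/etc/valkey-password/valkey-password-file.conf") as f:
        password = f.readline().split(" ")[1].strip()

    vc = ValkeyCluster(host="valkey-cluster.valkey.svc.cluster.local",
                       port=6379,
                       username="default",
                       password=password,
                       # Retry a few times on cluster errors instead of failing fast
                       cluster_error_retry_attempts=3,
                       socket_timeout=2,
                       socket_keepalive=True)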

Next steps

In this article, you learned how to build a test application using Locust and how to simulate the failure of a Valkey primary Pod. You observed that the Valkey cluster can recover from the failure and continue serving requests, with a temporary spike in latency.

To learn more about stateful workloads on Azure Kubernetes Service (AKS), see the following article:

[Validate Valkey resiliency during an AKS node pool upgrade][upgrade-valkey-aks-nodepool]

Contributors

Microsoft maintains this article. The following contributors originally wrote it:

  • Nelly Kiboi | Service Engineer
  • Saverio Proto | Principal Customer Experience Engineer