Validate an Azure Stack HCI cluster

Applies to: Azure Stack HCI, version 20H2; Windows Server 2019

This how-to article explains why cluster validation is important and when to run it on an existing Azure Stack HCI cluster. We recommend performing cluster validation in the following primary scenarios:

  • After deploying a server cluster, run the Validate-DCB tool to test networking.

  • After updating a server cluster, depending on your scenario, run both validation options to troubleshoot cluster issues.

  • After setting up replication with Storage Replica, validate that replication is proceeding normally by checking specific events and running a couple of commands.

  • After creating a server cluster, run the Validate-DCB tool before placing the cluster into production.

    To learn more about how to deploy an Azure Stack HCI cluster, see the Deployment overview.

What is cluster validation?

Cluster validation is intended to catch hardware or configuration problems before a cluster goes into production. It helps ensure that the Azure Stack HCI solution you're about to deploy is truly dependable. You can also use cluster validation as a diagnostic tool on failover clusters that are already configured.

Specific validation scenarios

This section describes scenarios in which validation is also needed or useful.

  • Validation before the cluster is configured:

    • A set of servers ready to become a failover cluster: This is the most straightforward validation scenario. The hardware components (systems, networks, and storage) are connected, but the systems aren't yet functioning as a cluster. Running tests in this situation has no effect on availability.

    • Server VMs: For virtualized servers in a cluster, run cluster validation as you would on any other new cluster. The requirement to run the feature is the same whether you have:

      • A "host cluster" where failover occurs between two physical computers.
      • A "guest cluster" where failover occurs between guest operating systems on the same physical computer.
  • Validation after the cluster is configured and in use:

    • Before adding a server to the cluster: When you add a server to a cluster, we strongly recommend validating the cluster. Specify both the existing cluster members and the new server when you run cluster validation.

    • When adding drives: When you add drives to the cluster (as opposed to replacing failed drives or creating virtual disks or volumes that rely on the existing drives), run cluster validation to confirm that the new storage will function correctly.

    • When making changes that affect firmware or drivers: If you upgrade or make changes to the cluster that affect firmware or drivers, you must run cluster validation to confirm that the new combination of hardware, firmware, drivers, and software supports failover cluster functionality.

    • After restoring a system from backup: After you restore a system from backup, run cluster validation to confirm that the system functions correctly as part of a cluster.
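
As an illustration of the "before adding a server" scenario above, the following minimal sketch passes both the existing members and the new server to Test-Cluster. The server names are hypothetical placeholders, not names from this article.

```powershell
# Hypothetical names: Server1 and Server2 are existing cluster members,
# Server3 is the server that is about to be added.
$nodes = 'Server1', 'Server2', 'Server3'

# Validate the existing members and the new server together.
Test-Cluster -Node $nodes -Verbose
```

Run this from an elevated PowerShell session on a machine that has the Failover Clustering PowerShell module installed.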

Validate networking

The Microsoft Validate-DCB tool is designed to validate the Data Center Bridging (DCB) configuration on the cluster. To do this, the tool takes an expected configuration as input and then tests each server in the cluster. This section covers how to install and run the Validate-DCB tool, review the results, and resolve networking errors that the tool identifies.

On the network, remote direct memory access (RDMA) over Converged Ethernet (RoCE) requires DCB technologies to make the network fabric lossless. With iWARP, DCB is optional. However, configuring DCB can be complex, with exact configuration required across:

  • Each server in the cluster
  • Each network port that RDMA traffic passes through on the fabric

Prerequisites

  • Network setup information for the server cluster that you want to validate, including:
    • Host or server cluster name
    • Virtual switch name
    • Network adapter names
    • Priority Flow Control (PFC) and Enhanced Transmission Selection (ETS) settings
  • An internet connection to download the tool module from Microsoft in Windows PowerShell.

Install and run the Validate-DCB tool

To install and run the Validate-DCB tool:

  1. On your management PC, open a Windows PowerShell session as an Administrator, and then use the following command to install the tool.

    Install-Module Validate-DCB

  2. Accept the requests to use the NuGet provider and access the repository to install the tool.

  3. After PowerShell connects to the Microsoft network and downloads the tool, type Validate-DCB and press Enter to start the tool wizard.

    Note

    If you cannot run the Validate-DCB tool script, you might need to adjust your PowerShell execution policies. Use the Get-ExecutionPolicy cmdlet to view your current script execution policy settings. For information on setting execution policies in PowerShell, see About Execution Policies.

  4. On the Welcome to the Validate-DCB configuration wizard page, select Next.

  5. On the Clusters and Nodes page, type the name of the server cluster that you want to validate, select Resolve to list it on the page, and then select Next.

    The Clusters and Nodes page of the Validate-DCB configuration wizard

  6. On the Adapters page:

    1. Select the vSwitch attached checkbox and type the name of the vSwitch.
    2. Under Adapter Name, type the name of each physical NIC; under Host vNIC Name, type the name of each virtual NIC (vNIC); and under VLAN, type the VLAN ID in use for each adapter.
    3. Expand the RDMA Type drop-down list box and select the appropriate protocol: RoCE or iWARP. Also set Jumbo Frames to the appropriate value for your network, and then select Next.

     The Adapters page of the configuration wizard

  7. On the Data Center Bridging page, modify the values to match your organization's settings for Priority, Policy Name, and Bandwidth Reservation, and then select Next.

     The Data Center Bridging page of the configuration wizard

    Note

    If you selected RoCE as the RDMA type on the previous wizard page, DCB is required for network reliability on all NICs and switch ports.

  8. On the Save and Deploy page, in the Configuration File Path box, save the configuration file with a .ps1 extension to a location where you can use it again later if needed, and then select Export to start running the Validate-DCB tool.

    • You can optionally deploy your configuration file by completing the Deploy Configuration to Nodes section of the page, which includes the ability to use an Azure Automation account to deploy the configuration and then validate it. See Create an Azure Automation account to get started with Azure Automation.

     The Save and Deploy page of the configuration wizard
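
The installation steps and the execution policy note above can be sketched in PowerShell as follows. This is a minimal sketch: the choice of RemoteSigned is an assumption, not a recommendation from this article, and a Group Policy setting in your organization may override it.

```powershell
# Show the effective execution policy, then the policy at each scope.
Get-ExecutionPolicy
Get-ExecutionPolicy -List

# Assumption: RemoteSigned is acceptable in your environment. The Process
# scope limits the change to the current PowerShell session only.
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process

# Install the module from the PowerShell Gallery, then start the wizard.
Install-Module -Name Validate-DCB
Validate-DCB
```

Scoping the policy change to the current process means that nothing persists after the PowerShell window is closed.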

Review results and fix errors

The Validate-DCB tool produces results in two units:

  1. [Global Unit] results list prerequisites and requirements to run the modal tests.
  2. [Modal Unit] results provide feedback on each cluster host's configuration and on best practices.

This example shows successful scan results for a single server: a Failed Count of 0 indicates that all prerequisite and modal unit tests passed.

 Global unit and modal unit test results

The following steps show how to identify a Jumbo Packet error from vNIC SMB02 and fix it:

  1. The results of the Validate-DCB tool scan show a Failed Count of 1.

    Validate-DCB scan results showing a Failed Count of 1

  2. Scrolling back through the results shows an error in red indicating that the Jumbo Packet for vNIC SMB02 on Host S046036 is set to the default size of 1514, but should be set to 9014.

    Validate-DCB scan results showing the Jumbo Packet size setting error

  3. Reviewing the Advanced properties of vNIC SMB02 on Host S046036 shows that Jumbo Packet is set to the default of Disabled.

    The Jumbo Packet setting in the Hyper-V Advanced properties of the server host

  4. Fixing the error requires enabling the Jumbo Packet feature and changing its size to 9014 bytes. Running the scan again on host S046036 confirms the change by returning a Failed Count of 0.

    Validate-DCB scan results confirming that the server host's Jumbo Packet setting is fixed
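
The same fix can also be sketched in PowerShell instead of the adapter's Advanced properties dialog. The adapter name "vEthernet (SMB02)" is an assumption about how the host vNIC SMB02 is named on the host, and the values a driver accepts for Jumbo Packet can vary, so check the current value before changing it.

```powershell
# Assumptions: the host is reachable over PowerShell remoting, and the host
# vNIC SMB02 appears as the network adapter "vEthernet (SMB02)".
$hostName    = 'S046036'
$adapterName = 'vEthernet (SMB02)'

Invoke-Command -ComputerName $hostName -ScriptBlock {
    # Show the current Jumbo Packet value for the adapter.
    Get-NetAdapterAdvancedProperty -Name $using:adapterName -RegistryKeyword '*JumboPacket'

    # Set the Jumbo Packet size to 9014 bytes. This can briefly interrupt
    # traffic on the adapter.
    Set-NetAdapterAdvancedProperty -Name $using:adapterName -RegistryKeyword '*JumboPacket' -RegistryValue 9014

    # Read the value back to confirm the change.
    Get-NetAdapterAdvancedProperty -Name $using:adapterName -RegistryKeyword '*JumboPacket'
}
```

After the change, run the Validate-DCB scan again to confirm that the Failed Count returns to 0.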

Validate the cluster

Use the following steps to validate the servers in an existing cluster in Windows Admin Center.

  1. In Windows Admin Center, under All connections, select the Azure Stack HCI cluster that you want to validate, and then select Connect.

    The Cluster Manager Dashboard displays overview information about the cluster.

  2. On the Cluster Manager Dashboard, under Tools, select Servers.

  3. On the Inventory page, select the servers in the cluster, then expand the More submenu and select Validate cluster.

  4. On the Validate Cluster pop-up window, select Yes.

    The Validate Cluster pop-up window

  5. On the Credential Security Service Provider (CredSSP) pop-up window, select Yes.

  6. Provide your credentials to enable CredSSP and then select Continue.
    Cluster validation runs in the background and notifies you when it's complete, at which point you can view the validation report, as described in the next section.

Note

After your cluster servers have been validated, you need to disable CredSSP for security reasons.

Disable CredSSP

After your server cluster is successfully validated, you need to disable the Credential Security Support Provider (CredSSP) protocol on each server for security purposes. For more information, see CVE-2018-0886.

  1. In Windows Admin Center, under All connections, select the first server in your cluster, and then select Connect.

  2. On the Overview page, select Disable CredSSP, and then on the Disable CredSSP pop-up window, select Yes.

    Step 2 removes the red CredSSP ENABLED banner at the top of the server's Overview page and also disables CredSSP on the other servers.
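
If you prefer to check or disable CredSSP from PowerShell rather than from Windows Admin Center, the following minimal sketch uses the built-in WS-Management cmdlets. The node names are hypothetical, and this sketch is an alternative offered under that assumption, not the procedure described by this article.

```powershell
# Hypothetical node names; replace them with the servers in your cluster.
$nodes = 'Server1', 'Server2'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    # Stop this server from accepting delegated credentials over CredSSP.
    Disable-WSManCredSSP -Role Server

    # Show the resulting CredSSP configuration for this server.
    Get-WSManCredSSP
}
```

Run this from an elevated session. If CredSSP was also enabled in the client role on the management machine, `Disable-WSManCredSSP -Role Client` is the corresponding command there.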

View validation reports

Now you're ready to view your cluster validation report.

There are a couple of ways to access validation reports:

  • On the Inventory page, expand the More submenu, and then select View validation reports.

  • At the top right of Windows Admin Center, select the Notifications bell icon to display the Notifications pane. Select the Successfully validated cluster notice, and then select Go to Failover Cluster validation report.

Note

The server cluster validation process may take some time to complete. Don't switch to another tool in Windows Admin Center while the process is running. In the Notifications pane, a status bar below your Validate cluster notice indicates when the process is done.

Validate the cluster using PowerShell

You can also use Windows PowerShell to run validation tests on your server cluster and view the results. You can run tests both before and after a cluster is set up.

To run a validation test on a server cluster, issue the Get-Cluster and Test-Cluster PowerShell cmdlets from your management PC, or run only the Test-Cluster cmdlet directly on the cluster:

$Cluster = Get-Cluster -Name 'server-cluster1'
Test-Cluster -InputObject $Cluster -Verbose

For more examples and usage information, see the Test-Cluster reference documentation.
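
As a further sketch, Test-Cluster can be limited to selected test categories and told where to write its report. The cluster name matches the example above; the chosen categories and the report path are assumptions for illustration, and the target folder must already exist.

```powershell
# Run only selected test categories against the cluster and save the report
# under a chosen name. The categories and path here are illustrative.
Test-Cluster -Cluster 'server-cluster1' `
    -Include 'Inventory', 'Network', 'System Configuration' `
    -ReportName 'C:\Temp\server-cluster1-validation'
```

Restricting the categories shortens the run, but a full validation remains the more complete check.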

Validate replication for Storage Replica

If you're using Storage Replica to replicate volumes in a stretched cluster or cluster-to-cluster, there are several events and cmdlets that you can use to get the state of replication.

In the following scenario, we configured Storage Replica by creating replication groups (RGs) for two sites, and then specified the data volumes and log volumes for both the source server nodes in Site1 (Server1, Server2) and the destination (replicated) server nodes in Site2 (Server3, Server4).

To determine the replication progress for Server1 in Site1, run the Get-WinEvent command and examine events 5015, 5002, 5004, 1237, 5001, and 2200:

Get-WinEvent -ComputerName Server1 -ProviderName Microsoft-Windows-StorageReplica -MaxEvents 20
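
To narrow the output to just the event IDs listed above, one option is the -FilterHashtable parameter of Get-WinEvent. This is a minimal sketch; the limit of 50 events and the output columns are arbitrary choices.

```powershell
# Event IDs named in the text for the source server.
$eventIds = 5015, 5002, 5004, 1237, 5001, 2200

# Query only Storage Replica events with those IDs, oldest first.
Get-WinEvent -ComputerName Server1 -FilterHashtable @{
    ProviderName = 'Microsoft-Windows-StorageReplica'
    Id           = $eventIds
} -MaxEvents 50 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize -Wrap
```

Get-WinEvent reports an error when no events match the filter, so an error here may simply mean that replication has not yet logged any of these events.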

For Server3 in Site2, run the following Get-WinEvent command to see the Storage Replica event that shows the creation of the partnership. This event states the number of bytes copied and the time taken. For example:

Get-WinEvent -ComputerName Server3 -ProviderName Microsoft-Windows-StorageReplica | Where-Object {$_.ID -eq "1215"} | FL

For Server3 in Site2, run the Get-WinEvent command and examine events 5009, 1237, 5001, 5015, 5005, and 2200 to understand the processing progress. There should be no warnings or errors in this sequence. There will be many 1237 events; these indicate progress.

Get-WinEvent -ComputerName Server3 -ProviderName Microsoft-Windows-StorageReplica | FL

Alternatively, the destination server group for the replica reports the number of bytes remaining to copy at all times, and can be queried through PowerShell with Get-SRGroup. For example:

(Get-SRGroup).Replicas | Select-Object numofbytesremaining

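As a progress script that runs until you stop it (for example, with Ctrl+C):

while ($true) {
    # Query the number of bytes remaining for the replication group.
    $v = (Get-SRGroup -Name "Replication2").Replicas | Select-Object NumOfBytesRemaining
    # Rewrite the same console line on each pass.
    [System.Console]::Write("Number of bytes remaining: {0}`r", $v.NumOfBytesRemaining)
    Start-Sleep -Seconds 5
}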
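
A variant that stops on its own once nothing remains to copy might look like the following sketch. It assumes that a remaining byte count of 0 means the initial copy is complete, and it reuses the group name "Replication2" from the script above.

```powershell
# Assumption: "Replication2" is the destination replication group, as above.
$groupName = 'Replication2'

do {
    # Sum the remaining bytes across all replicas in the group.
    $remaining = ((Get-SRGroup -Name $groupName).Replicas |
        Measure-Object -Property NumOfBytesRemaining -Sum).Sum

    Write-Host ("Bytes remaining: {0:N0}" -f $remaining)

    # Wait before polling again, unless the copy has finished.
    if ($remaining -gt 0) { Start-Sleep -Seconds 5 }
} while ($remaining -gt 0)
```

Summing with Measure-Object keeps the loop correct when the group contains more than one replica.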

To get the replication state within the stretched cluster, use Get-SRGroup and Get-SRPartnership:

Get-SRGroup -Cluster ClusterS1
Get-SRPartnership -Cluster ClusterS1
(Get-SRGroup -Cluster ClusterS1).Replicas

After you confirm that data is replicating successfully between sites, you can create your VMs and other workloads.

See also