Monitor Azure Stack HCI clusters

Applies to: Azure Stack HCI, version 20H2; Windows Server 2019

There are three ways to monitor Azure Stack HCI clusters and their underlying components: Windows Admin Center, Azure Monitor, and PowerShell.

Monitor using the Windows Admin Center dashboard

Install Windows Admin Center on a management PC or server, then add and connect to the Azure Stack HCI cluster that you want to monitor. Critical alerts are displayed prominently at the top of the Windows Admin Center dashboard as soon as you log in. For example, the screenshot below indicates that updates need to be installed and that the cluster has one critical drive error:

Example of Windows Admin Center dashboard alerts

Monitor virtual machines

It's important to understand the health of the virtual machines (VMs) on which your applications and databases run. If a VM is not assigned enough CPU or memory for the workloads running on it, performance could slow, or the application could become unavailable. If a VM responds to fewer than three heartbeats over a period of five minutes or longer, there may be a problem.
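The heartbeat rule above can be sketched in a few lines of Python. This is only an illustration: it assumes you already have a list of timestamps at which a VM answered a heartbeat, and it reads the rule as "fewer than three responses in the trailing five minutes." The function name `heartbeat_problem` and the sample data are hypothetical, not part of Windows Admin Center or any Azure Stack HCI API.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # period named in the text
MIN_HEARTBEATS = 3             # fewer responses than this suggests a problem


def heartbeat_problem(heartbeat_times, now):
    """Return True if fewer than MIN_HEARTBEATS responses fall in the last WINDOW."""
    window_start = now - WINDOW
    recent = [t for t in heartbeat_times if window_start <= t <= now]
    return len(recent) < MIN_HEARTBEATS


# Hypothetical sample data: one responsive VM and one that has gone quiet.
now = datetime(2021, 1, 1, 12, 0, 0)
responsive_vm = [now - timedelta(minutes=m) for m in (1, 2, 3, 4)]
quiet_vm = [now - timedelta(minutes=m) for m in (4, 9, 14)]

print(heartbeat_problem(responsive_vm, now))
print(heartbeat_problem(quiet_vm, now))
```

The trailing-window interpretation is an assumption; a monitoring system may evaluate the same rule over fixed intervals instead.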

To monitor VMs in Windows Admin Center, click Virtual machines in the Tools menu on the left. To view a complete inventory of VMs running on the cluster, click Inventory at the top of the page. You'll see a table with information about each VM, including:

  • Name: The name of the VM.
  • State: Indicates whether the VM is running or stopped.
  • Host server: Indicates which server in the cluster the VM is running on.
  • CPU usage: The percentage of the cluster's total CPU resources that the VM is consuming.
  • Memory pressure: The percentage of available memory resources that the VM is consuming.
  • Memory demand: The amount of assigned memory (GB or MB) that the VM is consuming.
  • Assigned memory: The total amount of memory assigned to the VM.
  • Uptime: How long the VM has been running, in days:hours:minutes:seconds.
  • Heartbeat: Indicates whether the cluster can communicate with the VM.
  • Disaster recovery status: Shows whether the VM is signed in to Azure Site Recovery.
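If you export or script against the inventory, the days:hours:minutes:seconds uptime format listed above is easy to produce and parse. The following is a minimal sketch; the helper names `format_uptime` and `parse_uptime` are hypothetical, and the zero-padding of hours, minutes, and seconds is an assumption about the display rather than something the tool is documented to do.

```python
def format_uptime(total_seconds):
    """Format a duration in seconds as days:hours:minutes:seconds."""
    days, rem = divmod(int(total_seconds), 86400)  # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    # Zero-padding the smaller fields is an assumption for readability.
    return f"{days}:{hours:02d}:{minutes:02d}:{seconds:02d}"


def parse_uptime(text):
    """Convert a days:hours:minutes:seconds string back to seconds."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


print(format_uptime(93784))
print(parse_uptime("1:02:03:04"))
```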

Monitor servers

You can monitor the host servers that make up an Azure Stack HCI cluster directly from Windows Admin Center. If host servers are not configured with enough CPU or memory to provide the resources that VMs require, they can become a performance bottleneck.

To monitor servers in Windows Admin Center, click Servers in the Tools menu on the left. To view a complete inventory of servers in the cluster, click Inventory at the top of the page. You'll see a table with information about each server, including:

  • Name: The name of the host server in the cluster.
  • Status: Indicates whether the server is up or down.
  • Uptime: How long the server has been up.
  • Manufacturer: The hardware manufacturer of the server.
  • Model: The model of the server.
  • Serial number: The serial number of the server.
  • CPU usage: The percentage of the host server's CPU that is being utilized. No server in the cluster should use more than 85 percent of its CPU for longer than 10 minutes.
  • Memory usage: The percentage of the host server's memory that is being utilized. If a server has less than 100 MB of memory available for 10 minutes or longer, consider adding memory.
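The CPU and memory guidelines in the list above are both "sustained threshold" rules: a value must stay beyond a limit for a stretch of time before it matters. A minimal sketch of that check follows, assuming you have collected one sample per minute for a server and that each sample summarizes one minute. The function names and sample series are hypothetical; nothing here calls Windows Admin Center.

```python
def longest_breach(samples, breached):
    """Length of the longest run of consecutive samples for which breached(value) is True."""
    longest = current = 0
    for value in samples:
        current = current + 1 if breached(value) else 0
        longest = max(longest, current)
    return longest


def cpu_alert(cpu_percent_per_minute):
    # "more than 85 percent of its CPU for longer than 10 minutes"
    return longest_breach(cpu_percent_per_minute, lambda v: v > 85) > 10


def memory_alert(available_mb_per_minute):
    # "less than 100 MB of memory available for 10 minutes or longer"
    return longest_breach(available_mb_per_minute, lambda v: v < 100) >= 10


# Hypothetical per-minute samples for one server.
busy_cpu = [92] * 11                   # 11 consecutive minutes above 85 percent
spiky_cpu = [92] * 6 + [40] + [92] * 6  # high, but never sustained
print(cpu_alert(busy_cpu), cpu_alert(spiky_cpu))
print(memory_alert([80] * 10))
```

Tracking the longest consecutive run, rather than the total number of high samples, keeps short spikes from triggering the check.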

Monitor volumes

Storage volumes can fill up quickly, so it's important to monitor them regularly to avoid any impact on applications. To monitor volumes in Windows Admin Center, click Volumes in the Tools menu on the left. To view a complete inventory of storage volumes on the cluster, click Inventory at the top of the page. You'll see a table with information about each volume, including:

  • Name: The name of the volume.
  • Status: "OK" indicates that the volume is healthy; otherwise, a warning or error is reported.
  • File system: The file system on the volume (ReFS, CSVFS).
  • Resiliency: Indicates whether the volume is a two-way mirror, three-way mirror, or mirror-accelerated parity.
  • Size: The size of the volume (TB/GB).
  • Storage pool: The storage pool the volume belongs to.
  • Storage usage: The percentage of the volume's storage capacity that is being used.
  • IOPS: The number of input/output operations per second.
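Because volumes can fill quickly, it can help to turn two storage-usage readings into a rough time-to-full estimate. The sketch below is an assumption-laden illustration: it uses simple linear extrapolation between two readings, and the function names and figures are hypothetical rather than taken from any Azure Stack HCI tool.

```python
GB = 1024 ** 3


def usage_percent(used_bytes, size_bytes):
    """Storage usage as a percentage of the volume's capacity."""
    return 100.0 * used_bytes / size_bytes


def hours_until_full(size_bytes, used_earlier, used_now, hours_between):
    """Linear estimate of hours until the volume is full; None if usage is not growing."""
    growth_per_hour = (used_now - used_earlier) / hours_between
    if growth_per_hour <= 0:
        return None
    return (size_bytes - used_now) / growth_per_hour


# Hypothetical readings taken two hours apart on a 1000 GB volume.
size = 1000 * GB
print(usage_percent(750 * GB, size))
print(hours_until_full(size, 700 * GB, 750 * GB, 2))
```

Real workloads rarely grow linearly, so treat the estimate as a prompt to look closer, not as a forecast.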

Monitor drives

Azure Stack HCI virtualizes storage in such a way that losing an individual drive does not significantly impact the cluster. However, failed drives need to be replaced, and drives can affect performance by filling up or introducing latency. If the operating system cannot communicate with a drive, the drive may be loose or disconnected, its connector may have failed, or the drive itself may have failed. Windows automatically retires drives after 15 minutes of lost communication.
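The 15-minute retirement window described above is simple arithmetic, sketched here for anyone building their own alerting around it. The code assumes you know when communication with a drive was lost; `time_until_retirement` is a hypothetical helper and does not reflect how Windows implements retirement internally.

```python
from datetime import datetime, timedelta

RETIRE_AFTER = timedelta(minutes=15)  # window stated in the text


def time_until_retirement(lost_since, now):
    """Time remaining before a drive that lost communication would be retired.

    Returns a zero timedelta once the window has elapsed.
    """
    remaining = RETIRE_AFTER - (now - lost_since)
    return max(remaining, timedelta(0))


# Hypothetical example: communication was lost 10 minutes ago.
now = datetime(2021, 1, 1, 12, 0, 0)
print(time_until_retirement(now - timedelta(minutes=10), now))
```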

To monitor drives in Windows Admin Center, click Drives in the Tools menu on the left. To view a complete inventory of drives on the cluster, click Inventory at the top of the page. You'll see a table with information about each drive, including:

  • Serial number: The serial number of the drive.
  • Status: "OK" indicates that the drive is healthy; otherwise, a warning or error is reported.
  • Model: The model of the drive.
  • Size: The total capacity of the drive (TB/GB).
  • Type: The drive type (SSD, HDD).
  • Used for: Indicates whether the drive is used for cache or capacity.
  • Location: The storage adapter and port the drive is connected to.
  • Server: The name of the server the drive is connected to.
  • Storage pool: The storage pool the drive belongs to.
  • Storage usage: The percentage of the drive's storage capacity that is being used.

Add performance counters

Use the Performance Monitor tool in Windows Admin Center to view and compare performance counters for Windows, apps, or devices in real time.

  1. Select Performance Monitor in the Tools menu on the left.
  2. Click Blank workspace to start a new workspace, or Restore previous to restore a previous workspace.
  3. If you are creating a new workspace, click the Add counter button and select one or more source servers to monitor, or select the entire cluster.
  4. Select the object and instance you want to monitor, as well as the counter and graph type, to view dynamic performance information.
  5. Save the workspace by choosing Save > Save As from the top menu.

For example, the screenshot below shows a performance counter called "Memory usage" that displays information about memory across a two-node cluster.

Example of real-time performance counters in Windows Admin Center

Query and process performance history with PowerShell

You can also monitor Azure Stack HCI clusters using PowerShell cmdlets that return information about the cluster and its components. See Performance history for Storage Spaces Direct.

Use the Health Service feature

Any Health Service fault on the cluster should be investigated. See Health Service in Windows Server to learn how to run reports and identify faults.

Troubleshoot health and operational states

To understand the health and operational states of storage pools, virtual disks, and drives, see Troubleshoot Storage Spaces and Storage Spaces Direct health and operational states.

Monitor performance using storage QoS

Storage Quality of Service (QoS) provides a way to centrally monitor and manage storage I/O for VMs to mitigate noisy neighbor issues and provide consistent performance. See Storage Quality of Service.

Set up alerts in Azure Monitor

Azure Stack HCI integrates with Azure Monitor so that users can set up alerts and be notified if CPU, disk capacity, or memory utilization thresholds are exceeded, if VM heartbeats are not returned, or if there is a system critical error or Health Service fault.

Next steps

For related information, see also: