Monitor health and alerts in Azure Stack

Applies to: Azure Stack integrated systems and Azure Stack Development Kit

Azure Stack includes infrastructure monitoring capabilities that help you view health and alerts for an Azure Stack region. The Region management tile, pinned by default in the administrator portal for the Default Provider Subscription, lists all the deployed regions of Azure Stack. The tile shows the number of active critical and warning alerts for each region, and it is your entry point into the health and alert functionality of Azure Stack.

The Region management tile

Understand health in Azure Stack

The Health resource provider manages health and alerts. Azure Stack infrastructure components register with the Health resource provider during Azure Stack deployment and configuration. This registration enables the display of health and alerts for each component. Health in Azure Stack is a simple concept: if alerts exist for a registered instance of a component, the health state of that component reflects the worst active alert severity, either warning or critical.

Alert severity definition

Azure Stack raises alerts with only two severities: warning and critical.

  • Warning
    An operator can address a warning alert in a scheduled manner. The alert typically does not impact user workloads.

  • Critical
    An operator should address a critical alert with urgency. Critical alerts indicate issues that currently impact, or will soon impact, Azure Stack users.
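
The rule that a component's health reflects its worst active alert severity can be illustrated with a short sketch. This is not the Health resource provider's implementation; the `Severity` enum, the `health_state` function, and the "Healthy" label for a component with no active alerts are assumptions made for illustration only.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The two alert severities, ordered so that CRITICAL ranks worst."""
    WARNING = 1
    CRITICAL = 2


def health_state(active_alert_severities):
    """Return the worst active severity as a label.

    "Healthy" is a hypothetical label for the no-active-alerts case.
    """
    if not active_alert_severities:
        return "Healthy"
    # max() picks the highest-ranked (worst) severity among active alerts.
    return max(active_alert_severities).name.capitalize()


if __name__ == "__main__":
    print(health_state([Severity.WARNING, Severity.CRITICAL]))
    print(health_state([Severity.WARNING]))
    print(health_state([]))
```

Ordering the severities numerically keeps the aggregation to a single `max()` call, so one critical alert outweighs any number of warnings.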

View and manage component health state

You can view the health state of components in the administrator portal, through the REST API, and through PowerShell.

To view the health state in the portal, click the region that you want to view in the Region management tile. You can view the health state of infrastructure roles and of resource providers.

List of infrastructure roles

You can click a resource provider or infrastructure role to view more detailed information.

Warning

If you click an infrastructure role, and then click the role instance, there are options to Start, Restart, or Shutdown. Do not use these actions when you apply updates to an integrated system. Also, do not use these options in an Azure Stack Development Kit environment. These options are designed only for an integrated systems environment, where there is more than one role instance per infrastructure role. Restarting a role instance (especially AzS-Xrp01) in the development kit causes system instability. For troubleshooting assistance, post your issue to the Azure Stack forum.

View alerts

The list of active alerts for each Azure Stack region is available directly from the Region management blade. The first tile in the default configuration is the Alerts tile, which displays a summary of the critical and warning alerts for the region. You can pin the Alerts tile, like any other tile on this blade, to the dashboard for quick access.

Alerts tile that shows a warning

Selecting the top part of the Alerts tile takes you to the list of all active alerts for the region. Selecting either the Critical or Warning line item within the tile takes you to a list of alerts filtered by that severity.

The Alerts blade supports filtering on both status (Active or Closed) and severity (Critical or Warning). The default view displays all active alerts. All closed alerts are removed from the system after seven days.

Filter pane, used to filter by critical or warning status
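
The filtering and retention behavior described above can be modeled in a few lines. The sketch below is illustrative only: the `Alert` record, its field names, and the `filter_alerts` and `purge_expired` helpers are hypothetical and do not reflect how Azure Stack stores or processes alerts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Closed alerts are removed from the system after seven days.
CLOSED_RETENTION = timedelta(days=7)


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions."""
    title: str
    severity: str                      # "Critical" or "Warning"
    state: str                         # "Active" or "Closed"
    closed_at: Optional[datetime] = None


def filter_alerts(alerts: List[Alert], state: str = "Active",
                  severity: Optional[str] = None) -> List[Alert]:
    """Filter by state and, optionally, by severity.

    The default (all active alerts) mirrors the blade's default view.
    """
    return [
        a for a in alerts
        if a.state == state and (severity is None or a.severity == severity)
    ]


def purge_expired(alerts: List[Alert], now: datetime) -> List[Alert]:
    """Drop closed alerts whose retention window has elapsed."""
    return [
        a for a in alerts
        if not (a.state == "Closed"
                and a.closed_at is not None
                and now - a.closed_at >= CLOSED_RETENTION)
    ]
```

For example, `filter_alerts(alerts, severity="Critical")` corresponds to selecting the Critical line item in the Alerts tile.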

The View API action displays the REST API that was used to generate the list view. This action provides a quick way to become familiar with the REST API syntax that you can use to query alerts. You can use this API in automation or for integration with your existing datacenter monitoring, reporting, and ticketing solutions.
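
As a starting point for automation, the following sketch issues a GET request against a URL copied from the View API action. It makes several assumptions that are not stated in this article: that you already hold a valid bearer token for the administrator endpoint, that the response is JSON, and that list results arrive under a top-level `value` key. The function names are hypothetical.

```python
import json
import urllib.request


def build_alert_request(view_api_url: str, bearer_token: str) -> urllib.request.Request:
    """Build an authenticated GET request for a URL copied from View API."""
    request = urllib.request.Request(view_api_url, method="GET")
    # Assumption: the endpoint accepts a standard OAuth bearer token.
    request.add_header("Authorization", f"Bearer {bearer_token}")
    request.add_header("Accept", "application/json")
    return request


def fetch_alerts(view_api_url: str, bearer_token: str) -> list:
    """Send the request and return the list of alert objects."""
    request = build_alert_request(view_api_url, bearer_token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumption: list results are wrapped in a top-level "value" array.
    return payload.get("value", [])
```

Copying the URL from the portal rather than constructing it by hand avoids guessing the resource path and API version for your deployment.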

You can click a specific alert to view the alert details. The alert details show all fields that are associated with the alert, and enable quick navigation to the affected component and the source of the alert. For example, the following alert occurs if one of the infrastructure role instances goes offline or is not accessible.

The Alert details blade

Repair alerts

You can select Repair in some alerts.

When selected, the Repair action performs steps specific to the alert to attempt to resolve the issue. The status of the Repair action is then available as a portal notification.

Repair action in progress

The Repair action reports successful completion or failure in the same portal notification blade. If a Repair action fails for an alert, you can rerun the Repair action from the alert details. If the Repair action completes successfully, do not rerun it.

Repair action completed successfully

After the infrastructure role instance is back online, this alert closes automatically. Many alerts, but not all, close automatically when the underlying issue is resolved. Alerts that provide a Repair action button close automatically if Azure Stack resolves the issue. For all other alerts, select Close Alert after you perform remediation steps. If the issue persists, Azure Stack generates a new alert. If you resolve the issue, the alert remains closed and requires no further steps.

Next steps

Manage updates in Azure Stack

Region management in Azure Stack