Monitoring and diagnostics for Azure Service Fabric

This article provides an overview of monitoring and diagnostics for Azure Service Fabric. Monitoring and diagnostics are critical to developing, testing, and deploying workloads in any cloud environment. For example, you can track how your applications are used, the actions taken by the Service Fabric platform, your resource utilization through performance counters, and the overall health of your cluster. You can use this information to diagnose and correct issues, and to prevent them from recurring. The next few sections briefly explain each area of Service Fabric monitoring to consider for production workloads.

Note

This article was recently updated to use the term Azure Monitor logs instead of Log Analytics. Log data is still stored in a Log Analytics workspace and is still collected and analyzed by the same Log Analytics service. We are updating the terminology to better reflect the role of logs in Azure Monitor. See Azure Monitor terminology changes for details.

Application monitoring

Application monitoring tracks how the features and components of your application are being used. You monitor your applications to make sure that issues affecting users are caught. Responsibility for application monitoring lies with the developers of an application and its services, since it is specific to the application's business logic. Monitoring your applications can be useful in the following scenarios:

  • How much traffic is my application experiencing? Do you need to scale your services to meet user demand or address a potential bottleneck in your application?
  • Are my service-to-service calls successful and tracked?
  • What actions are taken by the users of my application? Collecting telemetry can guide future feature development and improve diagnostics for application errors.
  • Is my application throwing unhandled exceptions?
  • What is happening within the services running inside my containers?

Because application monitoring lives within the context of your application, developers can use whatever tools and frameworks they prefer. You can learn more about the Azure solution for application monitoring with Azure Monitor - Application Insights in Event analysis with Application Insights. There is also a tutorial on setting this up for .NET applications. The tutorial covers how to install the right tools, shows an example of writing custom telemetry in your application, and explains how to view application diagnostics and telemetry in the Azure portal.

Platform (cluster) monitoring

Users control what telemetry comes from their application because they write the code themselves, but what about diagnostics from the Service Fabric platform? One of Service Fabric's goals is to keep applications resilient to hardware failures. This goal is achieved through the ability of the platform's system services to detect infrastructure issues and rapidly fail over workloads to other nodes in the cluster. But what if the system services themselves have issues? Or what if, in attempting to deploy or move a workload, rules for the placement of services are violated? Service Fabric provides diagnostics for these cases and more, so that you are informed about activity taking place in your cluster. Some sample scenarios for cluster monitoring include:

Service Fabric provides a comprehensive set of events out of the box. These Service Fabric events can be accessed through the EventStore or the operational channel (the event channel exposed by the platform).

  • Service Fabric event channels - On Windows, Service Fabric events are available from a single ETW provider, with a set of relevant logLevelKeywordFilters used to pick between the Operational and the Data & Messaging channels. This is how outgoing Service Fabric events are separated so that they can be filtered as needed. On Linux, Service Fabric events come through LTTng and are put into one Storage table, from where they can be filtered as needed. These channels contain curated, structured events that can be used to better understand the state of your cluster. Diagnostics are enabled by default at cluster creation time, which creates an Azure Storage table where the events from these channels are sent for you to query later.

  • EventStore - The EventStore is a feature offered by the platform that makes Service Fabric platform events available in Service Fabric Explorer and through a REST API. You can see a snapshot view of what is going on in your cluster for each entity (for example, node, service, or application) and query based on the time of the event. You can also read more about the EventStore at the EventStore Overview.

Screenshot showing the Events tab of the Nodes pane with several events, including a NodeDown event.
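As a rough sketch of how the EventStore REST API might be queried for node events in a time window, the Python snippet below only builds the request URL and filters an already-parsed event list; it does not contact a cluster. The endpoint path `/EventsStore/Nodes/Events`, the `api-version` value, the `StartTimeUtc` and `EndTimeUtc` parameter names, the `Kind` field, and the cluster address are assumptions that do not come from this article, so check them against the Service Fabric REST API reference for your runtime version.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_node_events_url(cluster_endpoint, start, end, api_version="6.4"):
    """Build a URL asking the EventStore for node events between start and end.

    The path and parameter names are assumptions; verify them against the
    Service Fabric REST API reference before use.
    """
    time_format = "%Y-%m-%dT%H:%M:%SZ"
    query = urlencode({
        "api-version": api_version,
        "StartTimeUtc": start.astimezone(timezone.utc).strftime(time_format),
        "EndTimeUtc": end.astimezone(timezone.utc).strftime(time_format),
    })
    return f"{cluster_endpoint.rstrip('/')}/EventsStore/Nodes/Events?{query}"


def filter_events_by_kind(events, kind):
    """Keep only events whose "Kind" field (assumed name) matches, e.g. "NodeDown"."""
    return [event for event in events if event.get("Kind") == kind]


if __name__ == "__main__":
    # Hypothetical local cluster address; query the last hour of node events.
    now = datetime.now(timezone.utc)
    print(build_node_events_url("http://localhost:19080", now - timedelta(hours=1), now))
```

Keeping URL construction separate from the HTTP call makes it easy to plug in whichever HTTP client and authentication scheme your cluster requires.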

The diagnostics are provided as a comprehensive set of events out of the box. These Service Fabric events describe actions taken by the platform on different entities such as nodes, applications, services, and partitions. In the last scenario above, if a node were to go down, the platform would emit a NodeDown event and you could be notified immediately by your monitoring tool of choice. Other common examples include ApplicationUpgradeRollbackStarted or PartitionReconfigured during a failover. The same events are available on both Windows and Linux clusters.

The events are sent through standard channels on both Windows and Linux and can be read by any monitoring tool that supports those channels. More cluster monitoring concepts are available at Platform level event and log generation.

Health monitoring

The Service Fabric platform includes a health model, which provides extensible health reporting for the status of entities in a cluster. Each node, application, service, partition, replica, or instance has a continuously updatable health status. The health status can be "OK", "Warning", or "Error". Think of Service Fabric events as verbs (actions the cluster performs on various entities) and health as an adjective describing each entity. Each time the health of a particular entity transitions, an event is also emitted. This way you can set up queries and alerts for health events in your monitoring tool of choice, just like any other event.
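To make the "adjective" and "transition" ideas concrete, the sketch below models the three health states and lists the entities whose state changed between two snapshots. It is a simplified illustration and not Service Fabric's actual evaluation logic, which also applies configurable health policies. The names `HealthState`, `aggregate_health`, and `health_transitions` are hypothetical, and the worst-state-wins roll-up is an assumption made for the example.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """The three health states named above, ordered by severity."""
    OK = 0
    WARNING = 1
    ERROR = 2


def aggregate_health(child_states):
    """Return the most severe state among the children; OK if there are none.

    Worst-state-wins is an assumption for illustration only.
    """
    return max(child_states, default=HealthState.OK)


def health_transitions(previous, current):
    """List (entity, old_state, new_state) for every entity whose health changed."""
    return [
        (entity, previous.get(entity), state)
        for entity, state in current.items()
        if previous.get(entity) != state
    ]
```

Each tuple returned by `health_transitions` corresponds to the kind of change for which a health event would be emitted, and on which a monitoring tool could alert.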

Additionally, users can override the health of entities. If your application is going through an upgrade and validation tests are failing, you can write to Service Fabric Health using the Health API to indicate that your application is no longer healthy, and Service Fabric will automatically roll back the upgrade. For more on the health model, check out the introduction to Service Fabric health monitoring.

SFX health dashboard

Watchdogs

Generally, a watchdog is a separate service that can watch health and load across services, ping endpoints, and report health for anything in the cluster. This can help catch errors that would not be detected from the view of a single service. Watchdogs are also a good place to host code that performs remedial actions without user interaction (for example, cleaning up log files in storage at certain time intervals). You can find a sample watchdog service implementation here.
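The watchdog pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the sample implementation referenced above: the `probe` and `report` callables are supplied by the caller (for instance, an HTTP check and a function that forwards the report to the cluster), and the report field names `SourceId`, `Property`, and `HealthState` are assumed for the example.

```python
import time


def check_endpoints(endpoints, probe):
    """Probe each endpoint once and return {name: "Ok" | "Error"}.

    `probe` is any callable that takes a URL and returns True when healthy.
    """
    results = {}
    for name, url in endpoints.items():
        try:
            healthy = bool(probe(url))
        except Exception:
            # A probe that raises counts as an unhealthy endpoint.
            healthy = False
        results[name] = "Ok" if healthy else "Error"
    return results


def build_health_report(source_id, name, state):
    """Build a dictionary describing one health report (assumed field names)."""
    return {"SourceId": source_id, "Property": f"Endpoint:{name}", "HealthState": state}


def run_watchdog(endpoints, probe, report, interval_seconds=30.0, iterations=None):
    """Repeatedly probe endpoints and hand one report per endpoint to `report`.

    Runs forever when `iterations` is None; otherwise stops after that many rounds.
    """
    completed = 0
    while iterations is None or completed < iterations:
        for name, state in check_endpoints(endpoints, probe).items():
            report(build_health_report("EndpointWatchdog", name, state))
        completed += 1
        if iterations is None or completed < iterations:
            time.sleep(interval_seconds)
```

Passing the probe and the reporter in as callables keeps the loop testable without a cluster and lets the same skeleton host remedial actions, such as periodic log cleanup, alongside health checks.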

Infrastructure (performance) monitoring

Now that we've covered the diagnostics in your application and the platform, how do we know the hardware is functioning as expected? Monitoring your underlying infrastructure is a key part of understanding the state of your cluster and your resource utilization. Measuring system performance depends on many factors, and what counts as good performance varies with your workloads. These factors are typically measured through performance counters. Performance counters can come from a variety of sources, including the operating system, the .NET Framework, or the Service Fabric platform itself. Some scenarios in which they are useful are:

  • Am I utilizing my hardware efficiently? Do you want to run your hardware at 90% CPU or at 10% CPU? This comes in handy when scaling your cluster or optimizing your application's processes.
  • Can I predict infrastructure issues proactively? Many issues are preceded by sudden changes (drops) in performance, so you can use performance counters such as network I/O and CPU utilization to predict and diagnose issues proactively.
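As an illustration of the second scenario, the sketch below flags samples in a series of performance counter readings that deviate sharply from the recent average. It is a deliberately simple heuristic and an assumption on our part, not a method prescribed by Service Fabric or Azure Monitor; the function name, the window size, and the 50% threshold are hypothetical and would need tuning for a real workload.

```python
from statistics import mean


def find_sudden_changes(samples, window=5, threshold=0.5):
    """Return indexes where a sample deviates from the mean of the preceding
    `window` samples by more than `threshold` (as a fraction of that mean)."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline == 0:
            continue  # Relative change is undefined against a zero baseline.
        if abs(samples[i] - baseline) / abs(baseline) > threshold:
            flagged.append(i)
    return flagged
```

In practice the same kind of check is usually expressed as an alert rule in the monitoring tool that collects the counters, rather than in custom code.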

A list of performance counters that should be collected at the infrastructure level can be found at Performance metrics.

Service Fabric also provides a set of performance counters for the Reliable Services and Reliable Actors programming models. If you are using either of these models, these performance counters can provide information to help ensure that your actors are spinning up and down correctly, or that your reliable service requests are being handled fast enough. For more information, see Monitoring for Reliable Service Remoting and Performance monitoring for Reliable Actors.

As with platform-level monitoring, the Azure Monitor solution for collecting these is Azure Monitor logs. You should use the Log Analytics agent to collect the appropriate performance counters, and view them in Azure Monitor logs.

Now that we've gone over each area of monitoring and example scenarios, here is a summary of the Azure monitoring tools and the setup needed to monitor all of the areas above.

You can also use and modify the sample ARM template located here to automate deployment of all necessary resources and agents.

Other logging solutions

Although the solution we recommend, Application Insights, has built-in integration with Service Fabric, many events are written out through ETW providers and can therefore be consumed by other logging solutions. You should also look into the Elastic Stack (especially if you are considering running a cluster in an offline environment), Dynatrace, or any other platform of your preference. We have a list of integrated partners available here.

The key points to weigh for any platform you choose include how comfortable you are with its user interface and querying capabilities, the custom visualizations and dashboards available, and the additional tools it provides to enhance your monitoring experience.

Next steps