Monitoring and diagnostic best practices for Azure Service Fabric

Monitoring and diagnostics are critical to developing, testing, and deploying workloads in any cloud environment. For example, you can track how your applications are used, the actions taken by the Service Fabric platform, your resource utilization (through performance counters), and the overall health of your cluster. You can use this information to diagnose and correct issues, and to prevent them from recurring.

Application monitoring

Application monitoring tracks how the features and components of your application are being used. Monitor your applications so that issues affecting your users are caught. Application monitoring is the responsibility of those developing the application and its services, because it is specific to the business logic of your application. It is recommended that you set up application monitoring with Application Insights, Azure's application monitoring tool.

Cluster monitoring

One of Service Fabric's goals is to make applications resilient to hardware failures. It achieves this through the platform's system services, which detect infrastructure issues and rapidly fail over workloads to other nodes in the cluster. But what if the system services themselves have issues? Or what if, in attempting to deploy or move a workload, the rules for the placement of services are violated? Service Fabric provides diagnostics for these and other issues, so that you stay informed about how the Service Fabric platform interacts with your applications, services, containers, and nodes.

For Windows clusters, it is recommended that you set up cluster monitoring with Diagnostics Agent and Azure Monitor logs.

For Linux clusters, Azure Monitor logs is also the recommended tool for Azure platform and infrastructure monitoring. Linux platform diagnostics require a different configuration, as noted in Service Fabric Linux cluster events in Syslog.

Infrastructure monitoring

Azure Monitor logs is recommended for monitoring cluster-level events. Once you configure the Log Analytics agent with your workspace as described in the previous link, you will be able to collect performance metrics such as CPU utilization, .NET performance counters such as process-level CPU utilization, Service Fabric performance counters such as the number of exceptions from a reliable service, and container metrics such as CPU utilization. Container logs must be written to stdout or stderr to be available in Azure Monitor logs.
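As an illustration of the stdout/stderr requirement for container logs, the following is a minimal Python sketch, assuming a containerized service that uses the standard `logging` module. The names `MaxLevelFilter` and `build_container_logger` are hypothetical helpers introduced here, not part of Service Fabric or Azure Monitor. Informational and warning records go to stdout, and errors go to stderr, so nothing is written to a file inside the container.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Pass only records at or below a given level (hypothetical helper)."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_container_logger(name="app"):
    """Return a logger that writes INFO/WARNING to stdout and ERROR+ to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep records out of the root logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")

    # stdout: INFO and WARNING only
    out_handler = logging.StreamHandler(sys.stdout)
    out_handler.setLevel(logging.INFO)
    out_handler.addFilter(MaxLevelFilter(logging.WARNING))
    out_handler.setFormatter(formatter)

    # stderr: ERROR and above
    err_handler = logging.StreamHandler(sys.stderr)
    err_handler.setLevel(logging.ERROR)
    err_handler.setFormatter(formatter)

    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger


if __name__ == "__main__":
    log = build_container_logger("orders-service")  # service name is illustrative
    log.info("service started")
    log.error("failed to reach downstream dependency")
```

Splitting the streams this way is a design choice rather than a requirement; writing everything to stdout alone would also satisfy the condition described above.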

Watchdogs

Generally, a watchdog is a separate service that watches health and load across services, pings endpoints, and reports unexpected health events in the cluster. This can help catch errors that would not be detected based only on the performance of a single service. Watchdogs are also a good place to host code that performs remedial actions that don't require user interaction, such as cleaning up log files in storage at certain time intervals. See a sample watchdog service implementation in Service Fabric Linux cluster events in Syslog.
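The two watchdog duties described above, pinging endpoints and periodically cleaning up log files, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the sample implementation referenced above: the function names, the endpoint dictionary, the `*.log` file pattern, and the use of plain HTTP probes are all hypothetical, and a real watchdog would report through the Service Fabric health APIs rather than printing to stderr.

```python
import sys
import time
import urllib.error
import urllib.request
from pathlib import Path


def http_probe(url, timeout=5.0):
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


def find_unhealthy(endpoints, probe=http_probe):
    """Return the names of endpoints whose probe failed.

    `endpoints` maps a service name to a URL (both assumed for illustration).
    """
    return [name for name, url in endpoints.items() if not probe(url)]


def clean_old_logs(log_dir, max_age_seconds, now=None):
    """Delete *.log files older than max_age_seconds; return deleted file names."""
    now = time.time() if now is None else now
    deleted = []
    for path in sorted(Path(log_dir).glob("*.log")):
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted.append(path.name)
    return deleted


def run_watchdog(endpoints, log_dir, interval_seconds=60,
                 max_age_seconds=7 * 24 * 3600, cycles=None):
    """Probe endpoints and clean logs every interval; cycles=None runs forever."""
    completed = 0
    while cycles is None or completed < cycles:
        for name in find_unhealthy(endpoints):
            # Placeholder for a real health report.
            print(f"unhealthy endpoint: {name}", file=sys.stderr)
        clean_old_logs(log_dir, max_age_seconds)
        completed += 1
        if cycles is None or completed < cycles:
            time.sleep(interval_seconds)
```

Passing the probe as a parameter keeps the health check separate from the transport, so the same loop could be reused with a TCP or application-specific check.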

Next steps