Monitoring the cluster

It is important to monitor at the cluster level to determine whether your hardware and cluster are behaving as expected. Although Service Fabric can keep applications running during a hardware failure, you still need to diagnose whether an error is occurring in an application or in the underlying infrastructure. You should also monitor your clusters to plan capacity better, which helps when deciding whether to add or remove hardware.

Service Fabric exposes several structured platform events, as Service Fabric events, through the EventStore and various log channels out of the box.

On Windows, Service Fabric events are available from a single ETW provider. A set of relevant logLevelKeywordFilters selects between the Operational and Data & Messaging channels, which is how outgoing Service Fabric events are separated so that they can be filtered as needed.

  • Operational
    High-level operations performed by Service Fabric and the cluster, including events for a node coming up, a new application being deployed, or an upgrade being rolled back. See the full list of Service Fabric events.

  • Operational - detailed
    Health reports and load balancing decisions.

The Operational channel can be accessed in several ways, including ETW/Windows Event Logs and the EventStore (available for Windows clusters in versions 6.2 and later). The EventStore gives you access to your cluster's events on a per-entity basis (entities include the cluster, nodes, applications, services, partitions, replicas, and containers) and exposes them through REST APIs and the Service Fabric client library. Use the EventStore to monitor your dev/test clusters and to get a point-in-time understanding of the state of your production clusters.
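As a rough illustration of per-entity access over REST, the following Python sketch only builds a query URL for a time window; it does not contact a cluster. The gateway address, the API version, the `/EventsStore/.../Events` path layout, and the `StartTimeUtc`/`EndTimeUtc` parameter names are assumptions to verify against the EventStore REST reference for your runtime version, and `build_events_url` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumed values for illustration; adjust to your cluster and runtime.
CLUSTER_ENDPOINT = "http://localhost:19080"
API_VERSION = "6.4"


def build_events_url(entity_path, start, end, endpoint=CLUSTER_ENDPOINT):
    """Build a URL that asks the EventStore for events of one entity type.

    entity_path: assumed entity segment such as "Cluster", "Nodes",
        or "Applications".
    start, end: timezone-aware datetimes bounding the query window (UTC).
    """
    query = urlencode({
        "api-version": API_VERSION,
        "StartTimeUtc": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "EndTimeUtc": end.strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    return f"{endpoint}/EventsStore/{entity_path}/Events?{query}"


# Example: node events for a one-hour window.
window_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=1)
print(build_events_url("Nodes", window_start, window_end))
```

The resulting URL can be passed to any HTTP client that is configured with the credentials your cluster requires.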

  • Data & Messaging
    Critical logs and events generated in the messaging path (currently only the ReverseProxy) and the data path (Reliable Services models).

  • Data & Messaging - detailed
    Verbose channel that contains all the non-critical logs from data and messaging in the cluster (this channel has a very high volume of events).

In addition to these channels, two structured EventSource channels are provided, along with logs that are collected for support purposes.

  • Reliable Services events
    Programming-model-specific events.

  • Reliable Actors events
    Programming-model-specific events and performance counters.

  • Support logs
    System logs generated by Service Fabric, intended only for use by us when providing support.

These channels cover most of the recommended platform-level logging. To improve platform-level logging, consider investing in a better understanding of the health model, adding custom health reports, and adding custom performance counters to build a real-time understanding of the impact of your services and applications on the cluster.

To take advantage of these logs, it is highly recommended to leave "Diagnostics" enabled during cluster creation in the Azure portal. With diagnostics turned on, when the cluster is deployed, Windows Azure Diagnostics can recognize the Operational, Reliable Services, and Reliable Actors channels and store the data as explained further in Aggregate events with Azure Diagnostics.

Azure Service Fabric health and load reporting

Service Fabric has its own health model, which is described in detail in these articles:

Health monitoring is critical to multiple aspects of operating a service, especially during an application upgrade. After each upgrade domain of the service is upgraded, the upgrade domain must pass health checks before the deployment moves to the next upgrade domain. If OK health status cannot be achieved, the deployment is rolled back, so that the application remains in a known OK state. Although some customers might be affected before the services are rolled back, most customers won't experience an issue. Also, a resolution occurs relatively quickly, without waiting for action from a human operator. The more health checks you incorporate into your code, the more resilient your service is to deployment issues.
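The gating behavior described above can be pictured with a small simulation. This is a minimal sketch of the idea only, not Service Fabric's upgrade implementation: `rolling_upgrade`, the status strings, and the domain names are hypothetical, and the real platform evaluates health policies rather than a single callback.

```python
def rolling_upgrade(upgrade_domains, is_healthy):
    """Upgrade domains one at a time and roll back on the first failed check.

    upgrade_domains: ordered list of upgrade domain names.
    is_healthy: callable that takes a domain name and returns True if the
        domain passes its health checks after being upgraded.
    Returns (status, upgraded), where `upgraded` lists the domains that
    remain on the new version.
    """
    upgraded = []
    for domain in upgrade_domains:
        upgraded.append(domain)  # apply the new version to this domain
        if not is_healthy(domain):
            # Health check failed: revert every domain touched so far,
            # leaving the application in its last known OK state.
            upgraded.clear()
            return "RolledBack", upgraded
    return "Completed", upgraded


# Example: the second domain fails its health check, so the third
# domain is never upgraded and the first one is reverted.
status, on_new_version = rolling_upgrade(
    ["UD0", "UD1", "UD2"], lambda domain: domain != "UD1"
)
print(status, on_new_version)
```

Because the check runs after each domain, a bad version reaches only the domains upgraded before the failure, which is why most customers are unaffected.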

Another aspect of service health is reporting metrics from the service. Metrics are important in Service Fabric because they are used to balance resource usage. Metrics can also be an indicator of system health. For example, you might have an application that has many services, and each instance reports a requests per second (RPS) metric. If one service is using more resources than another service, Service Fabric moves service instances around the cluster to try to maintain even resource utilization. For a more detailed explanation of how resource utilization works, see Manage resource consumption and load in Service Fabric with metrics.

Metrics can also give you insight into how your service is performing. Over time, you can use metrics to check that the service is operating within expected parameters. For example, if trends show that at 9 AM on Monday morning the average RPS is 1,000, then you might set up a health report that alerts you if the RPS is below 500 or above 1,500. Everything might be perfectly fine, but it might be worth a look to be sure that your customers are having a great experience. Your service can define a set of metrics that are reported for health check purposes but that don't affect the resource balancing of the cluster. To do this, set the metric weight to zero. We recommend that you start all metrics with a weight of zero, and not increase the weight until you are sure that you understand how weighting the metrics affects resource balancing for your cluster.
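The RPS example above amounts to a simple range check. The following Python sketch shows one way to express it; `rps_health_state` is a hypothetical helper, the bounds come from the example in the text, and the returned strings only mirror health state names. Actually submitting a health report to the cluster is outside the scope of this sketch.

```python
def rps_health_state(rps, low=500, high=1500):
    """Classify a requests-per-second reading against an expected range.

    Returns "Ok" when the reading is within [low, high] and "Warning"
    when it is below `low` or above `high`.
    """
    return "Ok" if low <= rps <= high else "Warning"


# Example readings around the Monday 9 AM average of 1,000 RPS.
for reading in (1000, 450, 1600):
    print(reading, rps_health_state(reading))
```

A "Warning" here means only that the reading is outside the usual range and deserves a look; it does not by itself indicate a failure.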

Tip

Don't use too many weighted metrics. It can be difficult to understand why service instances are being moved around for balancing. A few metrics can go a long way!

Any information that can indicate the health and performance of your application is a candidate for metrics and health reports. A CPU performance counter can tell you how your node is utilized, but it doesn't tell you whether a particular service is healthy, because multiple services might be running on a single node. However, metrics like RPS, items processed, and request latency can all indicate the health of a specific service.

Service Fabric support logs

If you need to contact Azure support for help with your Azure Service Fabric cluster, support logs are almost always required. If your cluster is hosted in Azure, support logs are automatically configured and collected as part of creating a cluster. The logs are stored in a dedicated storage account in your cluster's resource group. The storage account doesn't have a fixed name, but in the account you see blob containers and tables with names that start with fabric. For information about setting up log collection for a standalone cluster, see Create and manage a standalone Azure Service Fabric cluster and Configuration settings for a standalone Windows cluster. For standalone Service Fabric instances, the logs should be sent to a local file share. You are required to have these logs for support, but they are not intended to be usable by anyone outside of the Azure customer support team.

Measuring performance

Measuring the performance of your cluster helps you understand how well it handles load and informs decisions about scaling your cluster (see more about scaling a cluster on Azure, or on-premises). Performance data is also useful later, when you analyze logs, for comparison against actions that you or your applications and services have taken.

For a list of performance counters to collect when using Service Fabric, see Performance Counters in Service Fabric.

Here are two common ways to set up the collection of performance data for your cluster:

  • Using an agent
    This is the preferred way of collecting performance data from a machine, since agents usually have a list of possible performance metrics that can be collected, and it is relatively easy to choose or change the metrics you want to collect. Read about the Azure Monitor logs offering in Azure Monitor, and see Setting up the Log Analytics agent, to learn more about the Log Analytics agent, one such monitoring agent that can pick up performance data for cluster VMs and deployed containers.

  • Performance counters to Azure Table Storage
    You can also send performance metrics to the same table storage as the events. This requires changing the Azure Diagnostics configuration to pick up the appropriate performance counters from the VMs in your cluster, and enabling it to pick up docker stats if you will be deploying any containers. Read about configuring Performance Counters in WAD in Service Fabric to set up performance counter collection.
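For the second approach, the change to the Azure Diagnostics configuration is a performance counters section that lists the counters to read. The Python sketch below generates such a fragment as JSON. The key names (`PerformanceCounters`, `scheduledTransferPeriod`, `PerformanceCounterConfiguration`, `counterSpecifier`, `sampleRate`) follow the diagnostics configuration schema as best understood here and should be checked against the schema version you deploy; `performance_counters_section` is a hypothetical helper, and the two counters are examples only.

```python
import json


def performance_counters_section(counter_specifiers,
                                 sample_rate="PT1M",
                                 transfer_period="PT1M"):
    """Build a performance counters fragment for a diagnostics config.

    counter_specifiers: Windows performance counter paths to collect.
    sample_rate, transfer_period: ISO 8601 durations (assumed defaults).
    """
    return {
        "PerformanceCounters": {
            "scheduledTransferPeriod": transfer_period,
            "PerformanceCounterConfiguration": [
                {"counterSpecifier": spec, "sampleRate": sample_rate}
                for spec in counter_specifiers
            ],
        }
    }


# Example counters; pick the ones relevant to your workload.
section = performance_counters_section([
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
])
print(json.dumps(section, indent=2))
```

Generating the fragment from a list keeps the counter set in one place, which makes it easier to review before merging it into the cluster's diagnostics settings.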

Next steps

  • Learn about Service Fabric's built-in diagnostic experience, the EventStore