Diagnose common scenarios with Service Fabric

This article walks through common scenarios that users encounter when monitoring and diagnosing Service Fabric. The scenarios cover all three layers of Service Fabric: application, cluster, and infrastructure. Each solution uses the Azure monitoring tools Application Insights and Azure Monitor logs, and its steps introduce how to use those tools in the context of Service Fabric.

Note

This article was recently updated to use the term Azure Monitor logs instead of Log Analytics. Log data is still stored in a Log Analytics workspace and is still collected and analyzed by the same Log Analytics service. The terminology is being updated to better reflect the role of logs in Azure Monitor. See Azure Monitor terminology changes for details.

Prerequisites and recommendations

The solutions in this article use the following tools. We recommend that you have them set up and configured:

How can I see unhandled exceptions in my application?

  1. Navigate to the Application Insights resource that your application is configured with.

  2. Click Search in the top left, then click the filter on the next panel.

  3. You will see many types of events (traces, requests, custom events). Choose "Exception" as your filter.

    Click an exception in the list to see more details. If you are using the Service Fabric Application Insights SDK, the details include the service context.

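The portal filter has a query equivalent. As a minimal sketch, assuming the resource exposes the standard Application Insights `exceptions` table and that a 24-hour window suits your investigation (both are assumptions, not part of the original steps), the following query, run from the resource's Logs view, groups recent exceptions by type and by the service that reported them. `cloud_RoleName` is only useful if your SDK configuration populates it.

```kusto
// Sketch only: the 24h window is an arbitrary choice for illustration.
exceptions
| where timestamp > ago(24h)
// Count exceptions per exception type and reporting role (service).
| summarize ExceptionCount = count() by type, cloud_RoleName
| order by ExceptionCount desc
```

Sorting by count puts the noisiest exception types first, which is usually where to start drilling into individual records.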

How do I view which HTTP calls are used in my services?

  1. In the same Application Insights resource, filter on "requests" instead of exceptions to view all requests made.

  2. If you are using the Service Fabric Application Insights SDK, you can see a visual representation of how your services connect to one another, along with the number of succeeded and failed requests. Click "Application Map" on the left.

    For more information, see the Application Map documentation.
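For a tabular view of the same request data, a query along the following lines may help. It is a sketch that assumes the standard Application Insights `requests` table; the one-hour window is an illustrative choice, and `cloud_RoleName` is meaningful only if your services report it.

```kusto
// Sketch only: lists which operations each service handled and how they ended.
requests
| where timestamp > ago(1h)
// resultCode holds the HTTP status code returned for the request.
| summarize RequestCount = count() by cloud_RoleName, name, resultCode
| order by RequestCount desc
```

Grouping by `resultCode` makes failing endpoints stand out without relying on the map view.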

How do I create an alert when a node goes down?

  1. Node events are tracked by your Service Fabric cluster. Navigate to the Service Fabric Analytics solution resource named ServiceFabric(NameofResourceGroup).

  2. Click the graph at the bottom of the blade titled "Summary".

  3. This view contains many graphs and tiles that display various metrics. Click one of the graphs to open Log Search, where you can query for any cluster events or performance counters.

  4. Enter the following query. The event IDs are listed in the Node events reference.

    ServiceFabricOperationalEvent
    | where EventID >= 25622 and EventID <= 25626
    
  5. Click "New Alert Rule" at the top. From then on, whenever an event matching this query arrives, you will receive an alert through your chosen method of communication.

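Before creating the alert rule, it can help to check how often these events actually occur. The sketch below reuses the event ID range and the `EventID` column name from the query in step 4; the `Computer` column and the one-hour bucket are assumptions, so verify the column names against your workspace schema.

```kusto
// Sketch only: column names follow the article's query and may need
// adjusting to match your workspace schema.
ServiceFabricOperationalEvent
| where EventID >= 25622 and EventID <= 25626
// Bucket node events per machine and per hour to see how noisy an alert would be.
| summarize EventCount = count() by Computer, EventID, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
```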

How can I be alerted of application upgrade rollbacks?

  1. In the same Log Search window as before, enter the following query for upgrade rollbacks. The event IDs are listed in the Application events reference.

    ServiceFabricOperationalEvent
    | where EventID == 29623 or EventID == 29624
    
  2. Click "New Alert Rule" at the top. From then on, whenever an event matching this query arrives, you will receive an alert.
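When triaging a rollback it is often useful to see the events themselves rather than only the alert. The following sketch uses the same two event IDs written with the `in` operator; the `EventMessage` column is an assumption about the workspace schema, and `EventID` follows the column name used in the article's query.

```kusto
// Sketch only: verify column names against your workspace before relying on this.
ServiceFabricOperationalEvent
| where EventID in (29623, 29624)
// Show the most recent rollback events first, with their message text.
| project TimeGenerated, EventID, EventMessage
| order by TimeGenerated desc
```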

How can I monitor performance counters?

  1. Once you have added the Log Analytics agent to your cluster, you need to add the specific performance counters that you want to track. Navigate to the Log Analytics workspace's page in the portal. From the solution's page, the workspace tab is in the left menu.

  2. On the workspace's page, click "Advanced settings" in the same left menu.

  3. Click Data > Windows Performance Counters (Data > Linux Performance Counters for Linux machines) to start collecting specific counters from your nodes via the Log Analytics agent. Counters are added in the following format:

    • .NET CLR Memory(<ProcessNameHere>)\# Total committed Bytes

    • Processor(_Total)\% Processor Time

      In the quickstart, VotingData and VotingWeb are the process names used, so the counters to track are:

    • .NET CLR Memory(VotingData)\# Total committed Bytes

    • .NET CLR Memory(VotingWeb)\# Total committed Bytes

  4. Collecting these counters lets you see how your infrastructure handles your workloads and set alerts based on resource utilization. For example, you may want an alert if total processor utilization goes above 90% or below 5%. The counter to use for this is "% Processor Time". You can create an alert rule for the following query:

    Perf
    | where CounterName == "% Processor Time" and InstanceName == "_Total"
    | where CounterValue >= 90 or CounterValue <= 5
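A query on raw samples fires on any single reading outside the range. As a hedged variant, assuming the standard `Perf` table columns and an illustrative five-minute averaging window (the window length is an assumption, not a recommendation from the original article), averaging first can reduce alerts caused by short spikes:

```kusto
// Sketch only: averages CPU per node over 5-minute windows before applying
// the same 90% / 5% thresholds as the query above.
Perf
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| where AvgCpu >= 90 or AvgCpu <= 5
```

The trade-off is latency: a longer window suppresses more noise but delays the alert by up to the window length.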
    

How do I track performance of my Reliable Services and Actors?

To track the performance of Reliable Services or Actors in your applications, you should also collect the Service Fabric Actor, Actor Method, Service, and Service Method counters. The counters listed below are examples of Reliable Service and Actor performance counters to collect.

Note

Service Fabric performance counters cannot currently be collected by the Log Analytics agent, but they can be collected by other diagnostic solutions.

  • Service Fabric Service(*)\Average milliseconds per request
  • Service Fabric Service Method(*)\Invocations/Sec
  • Service Fabric Actor(*)\Average milliseconds per request
  • Service Fabric Actor Method(*)\Invocations/Sec

See the Reliable Services and Actors documentation for the full list of performance counters.

Next steps