Event aggregation and collection using Windows Azure Diagnostics

When you're running an Azure Service Fabric cluster, it's a good idea to collect the logs from all the nodes in a central location. Having the logs in a central location helps you analyze and troubleshoot issues in your cluster, or issues in the applications and services running in that cluster.

One way to upload and collect logs is to use the Windows Azure Diagnostics (WAD) extension, which uploads logs to Azure Storage and can optionally send logs to Azure Event Hubs.

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Prerequisites

The following tools are used in this article:

Service Fabric platform events

Service Fabric sets you up with a few out-of-the-box logging channels, of which the following channels are pre-configured with the extension to send monitoring and diagnostics data to a storage table or elsewhere:

Deploy the Diagnostics extension through the portal

The first step in collecting logs is to deploy the Diagnostics extension on the virtual machine scale set nodes in the Service Fabric cluster. The Diagnostics extension collects logs on each VM and uploads them to the storage account that you specify. The following steps outline how to accomplish this for new and existing clusters through the Azure portal and Azure Resource Manager templates.

Deploy the Diagnostics extension as part of cluster creation through Azure portal

When creating your cluster, in the cluster configuration step, expand the optional settings and ensure that Diagnostics is set to On (the default setting).

Figure: Azure Diagnostics settings in the portal for cluster creation

We highly recommend that you download the template before you click Create in the final step. For details, refer to Set up a Service Fabric cluster by using an Azure Resource Manager template. You need the template to change which of the channels listed above data is gathered from.

Figure: Cluster template

Note

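Currently, there is no way to filter or groom the events that are sent to the tables. If you don't implement a process to remove events from the table, the table will continue to grow (the default cap is 50 GB). Instructions on how to change this cap appear later in this article. Additionally, the Watchdog sample includes an example of a running data grooming service, and we recommend that you write one for yourself unless you have a good reason to store logs beyond a 30 or 90 day timeframe.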

Deploy the Diagnostics extension through Azure Resource Manager

Create a cluster with the diagnostics extension

To create a cluster by using Resource Manager, you need to add the Diagnostics configuration JSON to the full Resource Manager template. We provide a sample five-VM cluster Resource Manager template with Diagnostics configuration added to it as part of our Resource Manager template samples. You can see it at this location in the Azure Samples gallery: Five-node cluster with Diagnostics Resource Manager template sample.

To see the Diagnostics setting in the Resource Manager template, open the azuredeploy.json file and search for IaaSDiagnostics. To create a cluster by using this template, select the Deploy to Azure button available at the previous link.

Alternatively, you can download the Resource Manager sample, make changes to it, and create a cluster with the modified template by using the New-AzResourceGroupDeployment command in an Azure PowerShell window. The command takes the target resource group (-ResourceGroupName), the template file (-TemplateFile), and, optionally, a parameter file (-TemplateParameterFile). For detailed information on how to deploy a resource group by using PowerShell, see the article Deploy a resource group with the Azure Resource Manager template.

Add the diagnostics extension to an existing cluster

If you have an existing cluster that doesn't have Diagnostics deployed, you can add or update it via the cluster template. Modify the Resource Manager template that was used to create the existing cluster, or download the template from the portal as described earlier. Modify the template.json file by performing the following tasks:

Add a new storage resource to the template by adding the following to the resources section.

{
    "apiVersion": "2018-07-01",
    "type": "Microsoft.Storage/storageAccounts",
    "name": "[parameters('applicationDiagnosticsStorageAccountName')]",
    "location": "[parameters('computeLocation')]",
    "sku": {
        "name": "[parameters('applicationDiagnosticsStorageAccountType')]",
        "tier": "standard"
    },
    "tags": {
        "resourceType": "Service Fabric",
        "clusterName": "[parameters('clusterName')]"
    }
},

Next, add the following entries to the parameters section, just after the existing storage account definitions (near supportLogStorageAccountName). Replace the placeholder text **STORAGE ACCOUNT NAME GOES HERE** with the name of the storage account you'd like.

    "applicationDiagnosticsStorageAccountType": {
      "type": "string",
      "allowedValues": [
        "Standard_LRS",
        "Standard_GRS"
      ],
      "defaultValue": "Standard_LRS",
      "metadata": {
        "description": "Replication option for the application diagnostics storage account"
      }
    },
    "applicationDiagnosticsStorageAccountName": {
      "type": "string",
      "defaultValue": "**STORAGE ACCOUNT NAME GOES HERE**",
      "metadata": {
        "description": "Name for the storage account that contains application diagnostics data from the cluster"
      }
    },

Then, update the VirtualMachineProfile section of the template.json file by adding the following code within the extensions array. Be sure to add a comma at the beginning or the end, depending on where it's inserted.

{
    "name": "[concat(parameters('vmNodeType0Name'),'_Microsoft.Insights.VMDiagnosticsSettings')]",
    "properties": {
        "type": "IaaSDiagnostics",
        "autoUpgradeMinorVersion": true,
        "protectedSettings": {
            "storageAccountName": "[parameters('applicationDiagnosticsStorageAccountName')]",
            "storageAccountKey": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('applicationDiagnosticsStorageAccountName')),'2015-05-01-preview').key1]",
            "storageAccountEndPoint": "https://core.chinacloudapi.cn/"
        },
        "publisher": "Microsoft.Azure.Diagnostics",
        "settings": {
            "WadCfg": {
                "DiagnosticMonitorConfiguration": {
                    "overallQuotaInMB": "50000",
                    "EtwProviders": {
                        "EtwEventSourceProviderConfiguration": [
                            {
                                "provider": "Microsoft-ServiceFabric-Actors",
                                "scheduledTransferKeywordFilter": "1",
                                "scheduledTransferPeriod": "PT5M",
                                "DefaultEvents": {
                                    "eventDestination": "ServiceFabricReliableActorEventTable"
                                }
                            },
                            {
                                "provider": "Microsoft-ServiceFabric-Services",
                                "scheduledTransferPeriod": "PT5M",
                                "DefaultEvents": {
                                    "eventDestination": "ServiceFabricReliableServiceEventTable"
                                }
                            }
                        ],
                        "EtwManifestProviderConfiguration": [
                            {
                                "provider": "cbd93bc2-71e5-4566-b3a7-595d8eeca6e8",
                                "scheduledTransferLogLevelFilter": "Information",
                                "scheduledTransferKeywordFilter": "4611686018427387904",
                                "scheduledTransferPeriod": "PT5M",
                                "DefaultEvents": {
                                    "eventDestination": "ServiceFabricSystemEventTable"
                                }
                            }
                        ]
                    }
                }
            },
            "StorageAccount": "[parameters('applicationDiagnosticsStorageAccountName')]"
        },
        "typeHandlerVersion": "1.5"
    }
}
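
Hand-editing the extensions array makes it easy to misplace a comma, so the same edit can also be scripted. The following is a minimal Python sketch and not part of the original procedure. It assumes the scale set resource has the type Microsoft.Compute/virtualMachineScaleSets and that the array lives at properties.virtualMachineProfile.extensionProfile.extensions; the helper name add_extension, the sample template, and the extension names in it are hypothetical.

```python
import json

# Assumed resource type of the scale set entries in the template.
VMSS_TYPE = "Microsoft.Compute/virtualMachineScaleSets"


def add_extension(template: dict, extension: dict) -> int:
    """Add or replace `extension` (matched by name) in every scale set resource.

    Returns the number of scale set resources that were updated.
    """
    updated = 0
    for resource in template.get("resources", []):
        if resource.get("type") != VMSS_TYPE:
            continue
        # Assumed path to the extensions array inside the scale set resource.
        profile = resource.setdefault("properties", {}).setdefault("virtualMachineProfile", {})
        extensions = profile.setdefault("extensionProfile", {}).setdefault("extensions", [])
        # Drop any existing entry with the same name so the edit is repeatable.
        extensions[:] = [e for e in extensions if e.get("name") != extension["name"]]
        extensions.append(extension)
        updated += 1
    return updated


# Hypothetical, heavily trimmed template used only to exercise the helper.
template = {
    "resources": [
        {
            "type": VMSS_TYPE,
            "properties": {
                "virtualMachineProfile": {
                    "extensionProfile": {"extensions": [{"name": "ServiceFabricNode"}]}
                }
            },
        }
    ]
}
# Hypothetical, trimmed stand-in for the diagnostics extension JSON shown above.
diagnostics = {"name": "VMDiagnosticsVmExt", "properties": {"type": "IaaSDiagnostics"}}

add_extension(template, diagnostics)
print(json.dumps(template, indent=4))  # the serializer places the commas
```

In practice the template would be loaded with json.load from template.json and written back with json.dump, with the full extension definition in place of the trimmed stand-in.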

After you modify the template.json file as described, republish the Resource Manager template. If the template was exported, running the deploy.ps1 file republishes the template. After you deploy, ensure that ProvisioningState is Succeeded.

Tip

If you're going to deploy containers to your cluster, enable WAD to pick up docker stats by adding the following to your "WadCfg > DiagnosticMonitorConfiguration" section.

"DockerSources": {
   "Stats": {
       "enabled": true,
       "sampleRate": "PT1M"
   }
},

Update storage quota

Because the tables populated by the extension grow until the quota is hit, you may want to consider decreasing the quota size. The default value is 50 GB and is configurable in the template in the overallQuotaInMB field under DiagnosticMonitorConfiguration.

"overallQuotaInMB": "50000",

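If the template is being edited by script, the quota can be changed the same way. This is a small Python sketch under stated assumptions: the helper name set_quota_mb is hypothetical, the extension dictionary is a trimmed stand-in for the diagnostics extension entry shown earlier, and 10000 is only an example value. The value is written as a string because that is how it appears in the template.

```python
def set_quota_mb(extension: dict, quota_mb: int) -> str:
    """Set overallQuotaInMB on a diagnostics extension entry.

    Returns the previous value ("" if the field was not set).
    """
    config = extension["properties"]["settings"]["WadCfg"]["DiagnosticMonitorConfiguration"]
    previous = config.get("overallQuotaInMB", "")
    config["overallQuotaInMB"] = str(quota_mb)  # stored as a string in the template
    return previous


# Trimmed stand-in for the diagnostics extension entry (hypothetical sample data).
extension = {
    "properties": {
        "settings": {
            "WadCfg": {"DiagnosticMonitorConfiguration": {"overallQuotaInMB": "50000"}}
        }
    }
}

previous = set_quota_mb(extension, 10000)  # 10000 MB is an example value only
print(f"quota changed from {previous} MB to 10000 MB")
```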
Log collection configurations

Logs from additional channels are also available for collection. Here are some of the most common configurations you can make in the template for clusters running in Azure.

  • Operational Channel - Base: Enabled by default. Covers high-level operations performed by Service Fabric and the cluster, including events for a node coming up, a new application being deployed, or an upgrade rollback. For a list of events, refer to Operational Channel Events.

    scheduledTransferKeywordFilter: "4611686018427387904"
    
  • Operational Channel - Detailed: This includes health reports and load balancing decisions, plus everything in the base operational channel. These events are generated by either the system or your code by using the health or load reporting APIs such as ReportPartitionHealth or ReportLoad. To view these events in Visual Studio's Diagnostic Event Viewer, add "Microsoft-ServiceFabric:4:0x4000000000000008" to the list of ETW providers.

    scheduledTransferKeywordFilter: "4611686018427387912"
    
  • Data and Messaging Channel - Base: Critical logs and events generated in the messaging (currently only the ReverseProxy) and data path, in addition to detailed operational channel logs. These events are request processing failures and other critical issues in the ReverseProxy, as well as requests processed. This is our recommendation for comprehensive logging. To view these events in Visual Studio's Diagnostic Event Viewer, add "Microsoft-ServiceFabric:4:0x4000000000000010" to the list of ETW providers.

    scheduledTransferKeywordFilter: "4611686018427387928"
    
  • Data and Messaging Channel - Detailed: Verbose channel that contains all the non-critical logs from data and messaging in the cluster, plus the detailed operational channel. For detailed troubleshooting of all reverse proxy events, refer to the reverse proxy diagnostics guide. To view these events in Visual Studio's Diagnostic Event Viewer, add "Microsoft-ServiceFabric:4:0x4000000000000020" to the list of ETW providers.

    scheduledTransferKeywordFilter: "4611686018427387944"
    

Note

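This channel has a very high volume of events. Enabling event collection from this detailed channel generates a lot of traces quickly and can consume storage capacity. Only turn it on if absolutely necessary.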
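
The decimal scheduledTransferKeywordFilter values in the list above are easier to compare with the hexadecimal keywords used for the Diagnostic Event Viewer once they are converted. The short Python sketch below only does that arithmetic; the dictionary labels and the helper name set_bits are illustrative, and the sketch makes no claim about what each individual bit means beyond what the list states.

```python
# Decimal scheduledTransferKeywordFilter values from the channel list above.
FILTERS = {
    "Operational Channel - Base": 4611686018427387904,
    "Operational Channel - Detailed": 4611686018427387912,
    "Data and Messaging Channel - Base": 4611686018427387928,
    "Data and Messaging Channel - Detailed": 4611686018427387944,
}


def set_bits(value: int) -> list[str]:
    """Return the individual bit flags (as hex strings) that are set in `value`."""
    return [hex(1 << i) for i in range(value.bit_length()) if (value >> i) & 1]


for name, value in FILTERS.items():
    print(f"{name}: {value} = {hex(value)} -> bits {set_bits(value)}")
```

Read this way, every filter contains the base bit 0x4000000000000000, the detailed operational filter adds 0x8, and the two data and messaging filters combine 0x8 with 0x10 and 0x20 respectively, which matches the statement that those channels include the detailed operational channel.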

To enable the Base Operational Channel, which is our recommendation for comprehensive logging with the least amount of noise, the EtwManifestProviderConfiguration in the WadCfg of your template would look like the following:

  "WadCfg": {
        "DiagnosticMonitorConfiguration": {
          "overallQuotaInMB": "50000",
          "EtwProviders": {
            "EtwEventSourceProviderConfiguration": [
              {
                "provider": "Microsoft-ServiceFabric-Actors",
                "scheduledTransferKeywordFilter": "1",
                "scheduledTransferPeriod": "PT5M",
                "DefaultEvents": {
                  "eventDestination": "ServiceFabricReliableActorEventTable"
                }
              },
              {
                "provider": "Microsoft-ServiceFabric-Services",
                "scheduledTransferPeriod": "PT5M",
                "DefaultEvents": {
                  "eventDestination": "ServiceFabricReliableServiceEventTable"
                }
              }
            ],
            "EtwManifestProviderConfiguration": [
              {
                "provider": "cbd93bc2-71e5-4566-b3a7-595d8eeca6e8",
                "scheduledTransferLogLevelFilter": "Information",
                "scheduledTransferKeywordFilter": "4611686018427387904",
                "scheduledTransferPeriod": "PT5M",
                "DefaultEvents": {
                  "eventDestination": "ServiceFabricSystemEventTable"
                }
              }
            ]
          }
        }
      },

Collect from new EventSource channels

To update Diagnostics to collect logs from new EventSource channels that represent a new application that you're about to deploy, perform the same steps as previously described for the setup of Diagnostics for an existing cluster.

Update the EtwEventSourceProviderConfiguration section in the template.json file to add entries for the new EventSource channels before you apply the configuration update by using the New-AzResourceGroupDeployment PowerShell command. The name of the event source is defined as part of your code in the Visual Studio-generated ServiceEventSource.cs file.

For example, if your event source is named My-Eventsource, add the following code to place the events from My-Eventsource into a table named MyDestinationTableName.

{
    "provider": "My-Eventsource",
    "scheduledTransferPeriod": "PT5M",
    "DefaultEvents": {
        "eventDestination": "MyDestinationTableName"
    }
}
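
When several applications each bring their own event source, adding these entries by script keeps the section free of duplicates. The following Python sketch is one possible approach, not the documented procedure: the helper name add_event_source is hypothetical, and wad_cfg is a trimmed stand-in for the WadCfg object from the template.

```python
def add_event_source(wad_cfg: dict, provider: str, table: str, period: str = "PT5M") -> bool:
    """Append an EventSource provider entry unless one with that name already exists.

    Returns True if an entry was added, False if the provider was already listed.
    """
    providers = (
        wad_cfg["DiagnosticMonitorConfiguration"]["EtwProviders"]
        .setdefault("EtwEventSourceProviderConfiguration", [])
    )
    if any(p.get("provider") == provider for p in providers):
        return False
    providers.append(
        {
            "provider": provider,
            "scheduledTransferPeriod": period,
            "DefaultEvents": {"eventDestination": table},
        }
    )
    return True


# Trimmed stand-in for the WadCfg object (sample data for illustration only).
wad_cfg = {
    "DiagnosticMonitorConfiguration": {
        "EtwProviders": {
            "EtwEventSourceProviderConfiguration": [
                {"provider": "Microsoft-ServiceFabric-Services"}
            ]
        }
    }
}

added = add_event_source(wad_cfg, "My-Eventsource", "MyDestinationTableName")
print("added" if added else "already present")
```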

To collect performance counters or event logs, modify the Resource Manager template by using the examples provided in Create a Windows virtual machine with monitoring and diagnostics by using an Azure Resource Manager template. Then, republish the Resource Manager template.

Collect Performance Counters

To collect performance metrics from your cluster, add the performance counters to your "WadCfg > DiagnosticMonitorConfiguration" in the Resource Manager template for your cluster. See Performance monitoring with WAD for steps on modifying your WadCfg to collect specific performance counters. See Service Fabric Performance Counters for a list of performance counters that we recommend collecting.

Next steps

Once you have correctly configured Azure diagnostics, you will see data in your Storage tables from the ETW and EventSource logs.

Note

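Currently, there is no way to filter or groom the events that have been sent to the tables. If you don't implement a process to remove events from the table, the table will continue to grow. The Watchdog sample currently includes an example of a running data grooming service, and we recommend that you write one for yourself unless you have a good reason to store logs beyond a 30 or 90 day timeframe.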
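
The core of such a grooming process is choosing a cutoff and selecting everything older than it. The Python sketch below only builds the selection filter and does not call Azure. It assumes that the tables are queried with the Azure Table storage OData filter syntax on the system Timestamp property; the function name grooming_filter is hypothetical, and the query and delete calls against the table are left to whichever storage client you use.

```python
from datetime import datetime, timedelta, timezone


def grooming_filter(now: datetime, retention_days: int) -> str:
    """Build an OData filter that matches entities older than the retention window.

    `now` must be timezone-aware; the cutoff is expressed in UTC.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    cutoff = now.astimezone(timezone.utc) - timedelta(days=retention_days)
    stamp = cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Assumed Azure Table storage filter syntax for date-time comparisons.
    return f"Timestamp lt datetime'{stamp}'"


# Example: keep 30 days of events relative to a fixed reference time.
reference = datetime(2024, 4, 30, 12, 0, 0, tzinfo=timezone.utc)
print(grooming_filter(reference, 30))
```

A scheduled job could run a query with this filter against each diagnostics table and delete the returned entities in batches.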