Diagnostic functionality for Stateful Reliable Services

The Azure Service Fabric Stateful Reliable Services StatefulServiceBase class emits EventSource events that can be used to debug the service, gain insight into how the runtime is operating, and troubleshoot problems.

EventSource events

The EventSource name for the Stateful Reliable Services StatefulServiceBase class is "Microsoft-ServiceFabric-Services". Events from this event source appear in the Diagnostics Events window when the service is being debugged in Visual Studio.

Tools and technologies that help collect and view EventSource events include PerfView, Azure Diagnostics, and the Microsoft TraceEvent Library.

Events

| Event name | Event ID | Level | Event description |
| --- | --- | --- | --- |
| StatefulRunAsyncInvocation | 1 | Informational | Emitted when the service RunAsync task is started |
| StatefulRunAsyncCancellation | 2 | Informational | Emitted when the service RunAsync task is canceled |
| StatefulRunAsyncCompletion | 3 | Informational | Emitted when the service RunAsync task is finished |
| StatefulRunAsyncSlowCancellation | 4 | Warning | Emitted when the service RunAsync task takes too long to complete cancellation |
| StatefulRunAsyncFailure | 5 | Error | Emitted when the service RunAsync task throws an exception |

Interpret events

The StatefulRunAsyncInvocation, StatefulRunAsyncCompletion, and StatefulRunAsyncCancellation events help the service writer understand the lifecycle of a service and the timing of when it starts, cancels, or finishes. This information is useful when debugging service issues.

Service writers should pay close attention to StatefulRunAsyncSlowCancellation and StatefulRunAsyncFailure events because they indicate issues with the service.
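The triage described above can be sketched as a small lookup. This is a minimal Python sketch, assuming the events have already been collected (for example with PerfView or the TraceEvent library) and reduced to a list of event IDs. The event names, IDs, and levels come from the events table in this article; `needs_attention` and `summarize` are hypothetical helper names used only for illustration.

```python
# Events emitted by StatefulServiceBase, keyed by event ID: (name, level).
EVENTS = {
    1: ("StatefulRunAsyncInvocation", "Informational"),
    2: ("StatefulRunAsyncCancellation", "Informational"),
    3: ("StatefulRunAsyncCompletion", "Informational"),
    4: ("StatefulRunAsyncSlowCancellation", "Warning"),
    5: ("StatefulRunAsyncFailure", "Error"),
}


def needs_attention(event_id):
    """Return True for events that indicate a problem with the service."""
    _name, level = EVENTS[event_id]
    return level in ("Warning", "Error")


def summarize(event_ids):
    """Return the names of problem events, in the order they occurred."""
    return [EVENTS[i][0] for i in event_ids if needs_attention(i)]


if __name__ == "__main__":
    # Hypothetical sequence of collected event IDs.
    print(summarize([1, 2, 4, 3, 5]))
```

Filtering on the level rather than on hard-coded names keeps the sketch in step with the table if further warning or error events are added to it.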

StatefulRunAsyncFailure is emitted whenever the service RunAsync() task throws an exception. Typically, a thrown exception indicates an error or bug in the service. The exception also causes the service to fail, so it is moved to a different node. This operation can be expensive and can delay incoming requests while the service is moved. Service writers should determine the cause of the exception and, if possible, mitigate it.

StatefulRunAsyncSlowCancellation is emitted whenever a cancellation request for the RunAsync task takes longer than four seconds. When a service takes too long to complete cancellation, it cannot be restarted quickly on another node, which might affect the overall availability of the service.
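The four-second rule can be expressed as a simple duration check. The following is a minimal Python sketch, assuming a service's own logging records when cancellation was requested and when RunAsync actually returned; the timestamps and the `is_slow_cancellation` helper are hypothetical, and only the four-second threshold comes from this article.

```python
from datetime import datetime, timedelta

# Threshold stated in this article: cancellation that takes longer than
# four seconds triggers StatefulRunAsyncSlowCancellation.
SLOW_CANCELLATION_THRESHOLD = timedelta(seconds=4)


def is_slow_cancellation(requested_at, completed_at):
    """Return True if cancellation took longer than the threshold."""
    return (completed_at - requested_at) > SLOW_CANCELLATION_THRESHOLD


if __name__ == "__main__":
    # Hypothetical timestamps taken from a service's own logs.
    requested = datetime(2024, 1, 1, 12, 0, 0)
    print(is_slow_cancellation(requested, requested + timedelta(seconds=6)))
```

A check like this can be run over a service's own logs during testing to spot slow cancellation paths before they show up as warning events in a cluster.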

Performance counters

The Reliable Services runtime defines the following performance counter categories:

| Category | Description |
| --- | --- |
| Service Fabric Transactional Replicator | Counters specific to the Azure Service Fabric Transactional Replicator |
| Service Fabric TStore | Counters specific to the Azure Service Fabric TStore |

The Service Fabric Transactional Replicator is used by the Reliable State Manager to replicate transactions within a given set of replicas.

The Service Fabric TStore is a component used in Reliable Collections for storing and retrieving key-value pairs.

The Windows Performance Monitor application, which is available by default in the Windows operating system, can be used to collect and view performance counter data. Azure Diagnostics is another option for collecting performance counter data and uploading it to Azure tables.

Performance counter instance names

A cluster that has a large number of reliable services or reliable service partitions has a large number of Transactional Replicator performance counter instances. The same is true of TStore performance counters, whose instance count is further multiplied by the number of Reliable Dictionaries and Reliable Queues used. The performance counter instance name identifies the specific partition, service replica, and (in the case of TStore) state provider that the instance is associated with.
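The scaling described above can be illustrated with a rough back-of-the-envelope estimate. This Python sketch assumes one Transactional Replicator instance per replica and one TStore instance per Reliable Dictionary or Reliable Queue in each replica; that simplification and the `estimate_instances` helper are assumptions for illustration, and real instance counts may differ.

```python
def estimate_instances(replicas, collections_per_replica):
    """Roughly estimate counter instances for the replicas on one machine.

    Assumes one Transactional Replicator instance per replica and one
    TStore instance per reliable collection in each replica.
    """
    replicator_instances = replicas
    tstore_instances = replicas * collections_per_replica
    return replicator_instances, tstore_instances


if __name__ == "__main__":
    # Hypothetical node hosting 20 replicas, each using 5 reliable collections.
    replicator, tstore = estimate_instances(20, 5)
    print(replicator, tstore)
```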

Service Fabric Transactional Replicator category

For the category Service Fabric Transactional Replicator, the counter instance names are in the following format:

ServiceFabricPartitionId:ServiceFabricReplicaId

ServiceFabricPartitionId is the string representation of the Service Fabric partition ID that the performance counter instance is associated with. The partition ID is a GUID, and its string representation is generated through Guid.ToString with format specifier "D".

ServiceFabricReplicaId is the ID associated with a given replica of a reliable service. The replica ID is included in the performance counter instance name to ensure its uniqueness and avoid conflict with other performance counter instances generated by the same partition. Further details about replicas and their role in Reliable Services are available in the Service Fabric documentation.

The following counter instance name is typical for a counter under the Service Fabric Transactional Replicator category:

00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571

In the preceding example, 00d0126d-3e36-4d68-98da-cc4f7195d85e is the string representation of the Service Fabric partition ID, and 131652217797162571 is the replica ID.
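Splitting such an instance name back into its parts is straightforward. The following is a minimal Python sketch, assuming the instance name has been read from Performance Monitor output or a similar source; `ReplicatorInstance` and `parse_replicator_instance` are hypothetical names for illustration. Python's `uuid.UUID` accepts the hyphenated form that the "D" format specifier produces.

```python
import uuid
from typing import NamedTuple


class ReplicatorInstance(NamedTuple):
    partition_id: uuid.UUID
    replica_id: int


def parse_replicator_instance(name):
    """Parse 'ServiceFabricPartitionId:ServiceFabricReplicaId'."""
    partition_text, replica_text = name.split(":", 1)
    # uuid.UUID raises ValueError if the partition ID is not a valid GUID.
    return ReplicatorInstance(uuid.UUID(partition_text), int(replica_text))


if __name__ == "__main__":
    instance = parse_replicator_instance(
        "00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571"
    )
    print(instance.partition_id, instance.replica_id)
```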

Service Fabric TStore category

For the category Service Fabric TStore, the counter instance names are in the following format:

ServiceFabricPartitionId:ServiceFabricReplicaId:StateProviderId_PerformanceCounterInstanceDifferentiator_StateProviderName

ServiceFabricPartitionId is the string representation of the Service Fabric partition ID that the performance counter instance is associated with. The partition ID is a GUID, and its string representation is generated through Guid.ToString with format specifier "D".

ServiceFabricReplicaId is the ID associated with a given replica of a reliable service. The replica ID is included in the performance counter instance name to ensure its uniqueness and avoid conflict with other performance counter instances generated by the same partition. Further details about replicas and their role in Reliable Services are available in the Service Fabric documentation.

StateProviderId is the ID associated with a state provider within a reliable service. The state provider ID is included in the performance counter instance name to differentiate one TStore from another.

PerformanceCounterInstanceDifferentiator is a differentiating ID associated with a performance counter instance within a state provider. This differentiator is included in the performance counter instance name to ensure its uniqueness and avoid conflict with other performance counter instances generated by the same state provider.

StateProviderName is the name associated with a state provider within a reliable service. The state provider name is included in the performance counter instance name so that users can easily identify what state it provides.

The following counter instance name is typical for a counter under the Service Fabric TStore category:

00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571:142652217797162571_1337_urn:MyReliableDictionary/dataStore

In the preceding example, 00d0126d-3e36-4d68-98da-cc4f7195d85e is the string representation of the Service Fabric partition ID, 131652217797162571 is the replica ID, 142652217797162571 is the state provider ID, and 1337 is the performance counter instance differentiator. urn:MyReliableDictionary/dataStore is the name of the state provider that stores data for the collection named urn:MyReliableDictionary.
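Because the state provider name can itself contain ":" and "_", a parser has to limit how many times it splits. The following is a minimal Python sketch under that observation; `TStoreInstance` and `parse_tstore_instance` are hypothetical names, and treating the state provider ID as an integer is an assumption based on the example above. The differentiator is kept as text because its format is not specified here.

```python
import uuid
from typing import NamedTuple


class TStoreInstance(NamedTuple):
    partition_id: uuid.UUID
    replica_id: int
    state_provider_id: int
    differentiator: str
    state_provider_name: str


def parse_tstore_instance(name):
    """Parse 'PartitionId:ReplicaId:StateProviderId_Differentiator_Name'."""
    # Split on the first two ':' only; the provider name may contain ':'.
    partition_text, replica_text, rest = name.split(":", 2)
    # Split on the first two '_' only; the provider name may contain '_'.
    provider_id, differentiator, provider_name = rest.split("_", 2)
    return TStoreInstance(
        uuid.UUID(partition_text),
        int(replica_text),
        int(provider_id),
        differentiator,
        provider_name,
    )


if __name__ == "__main__":
    instance = parse_tstore_instance(
        "00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571:"
        "142652217797162571_1337_urn:MyReliableDictionary/dataStore"
    )
    print(instance.state_provider_name)
```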

Transactional Replicator performance counters

The Reliable Services runtime emits the following counters under the Service Fabric Transactional Replicator category:

| Counter name | Description |
| --- | --- |
| Begin Txn Operations/sec | The number of new write transactions created per second. |
| Txn Operations/sec | The number of add/update/delete operations performed on reliable collections per second. |
| Log Flush Bytes/sec | The number of bytes flushed to disk by the Transactional Replicator per second. |
| Throttled Operations/sec | The number of operations rejected per second by the Transactional Replicator due to throttling. |
| Avg. Transaction ms/Commit | The average commit latency per transaction, in milliseconds. |
| Avg. Flush Latency (ms) | The average duration, in milliseconds, of disk flush operations initiated by the Transactional Replicator. |

TStore performance counters

The Reliable Services runtime emits the following counters under the Service Fabric TStore category:

| Counter name | Description |
| --- | --- |
| Item Count | The number of items in the store. |
| Disk Size | The total disk size, in bytes, of checkpoint files for the store. |
| Checkpoint File Write Bytes/sec | The number of bytes written per second for the most recent checkpoint file. |
| Copy Disk Transfer Bytes/sec | The number of disk bytes read (on the primary replica) or written (on a secondary replica) per second during a store copy. |
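When collecting these counters with Windows tooling, each one is usually addressed by a counter path of the form \Category(Instance)\Counter. The following Python sketch builds such paths from the category names, counter names, and instance-name format described in this article. The `counter_path` helper is hypothetical, and the sketch passes the instance name through unchanged, which is an assumption: some tools may display or escape certain characters in instance names differently.

```python
REPLICATOR_CATEGORY = "Service Fabric Transactional Replicator"
TSTORE_CATEGORY = "Service Fabric TStore"


def counter_path(category, instance, counter):
    """Build a Windows counter path: \\Category(Instance)\\Counter."""
    return "\\{}({})\\{}".format(category, instance, counter)


if __name__ == "__main__":
    # Instance name taken from the Transactional Replicator example above.
    instance = "00d0126d-3e36-4d68-98da-cc4f7195d85e:131652217797162571"
    for counter in ("Throttled Operations/sec", "Avg. Flush Latency (ms)"):
        print(counter_path(REPLICATOR_CATEGORY, instance, counter))
```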

Next steps

EventSource providers in PerfView