Troubleshoot Azure Cache for Redis client-side issues

This section covers issues caused by a condition on the Redis client that your application uses.

Memory pressure on Redis client

Memory pressure on the client machine leads to a range of performance problems that can delay processing of responses from the cache. When memory pressure hits, the system may page data to disk. This page faulting causes the system to slow down significantly.

To detect memory pressure on the client:

  • Monitor memory usage on the machine to make sure that it doesn't exceed available memory.
  • Monitor the client's Page Faults/Sec performance counter. During normal operation, most systems have some page faults. Spikes in page faults that coincide with request timeouts can indicate memory pressure.
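
The correlation described in the second bullet can be checked offline once both series are logged. The following is a minimal sketch, not a monitoring tool: the sample data, the two-sigma spike rule, and the five-second window are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def find_spikes(samples, threshold_sigmas=2.0):
    """Return timestamps whose value exceeds mean + threshold_sigmas * stddev."""
    values = [value for _, value in samples]
    cutoff = mean(values) + threshold_sigmas * pstdev(values)
    return [timestamp for timestamp, value in samples if value > cutoff]


def timeouts_near_spikes(spike_times, timeout_times, window_seconds=5):
    """Return the timeouts logged within window_seconds of any spike."""
    return [
        t for t in timeout_times
        if any(abs(t - s) <= window_seconds for s in spike_times)
    ]


# Hypothetical samples: (seconds since start, Page Faults/sec reading).
page_faults = [
    (0, 120), (10, 130), (20, 110), (30, 4800),
    (40, 125), (50, 115), (60, 118), (70, 122),
]
# Hypothetical times (seconds since start) at which request timeouts were logged.
timeouts = [12, 31, 33]

spikes = find_spikes(page_faults)
correlated = timeouts_near_spikes(spikes, timeouts)
print("spikes at:", spikes)
print("timeouts near a spike:", correlated)
```

If most timeouts fall near a spike, memory pressure is a plausible cause; if they don't, the other sections of this article are more likely to apply.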

High memory pressure on the client can be mitigated in several ways:

  • Dig into your memory usage patterns to reduce memory consumption on the client.
  • Upgrade your client VM to a larger size with more memory.

Traffic burst

Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis server but not yet consumed on the client side.

Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger. You can use TimeoutException messages from StackExchange.Redis, like the one below, to investigate further:

System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)

In the preceding exception, several details are worth noting:

  • In both the IOCP section and the WORKER section, the Busy value is greater than the Min value. This difference means your ThreadPool settings need adjusting.
  • You can also see in: 64221. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.

You can configure your ThreadPool settings to make sure that your thread pool scales up quickly under burst scenarios.
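
The two checks above can be automated when scanning logs. The following is a minimal sketch that assumes the message has the same layout as the sample exception shown earlier; the helper names and the 1,024-byte threshold for an "interesting" in: value are assumptions for illustration, not part of StackExchange.Redis.

```python
import re

# The sample message from this section, joined into one string.
MESSAGE = (
    "System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, "
    "queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, "
    "IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)"
)


def parse_timeout_message(message):
    """Extract the 'in:' byte count and the IOCP/WORKER thread pool counters."""
    result = {}
    in_match = re.search(r"\bin: (\d+)", message)
    if in_match:
        result["in"] = int(in_match.group(1))
    for name in ("IOCP", "WORKER"):
        pool = re.search(
            rf"{name}: \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)", message
        )
        if pool:
            busy, free, minimum, maximum = map(int, pool.groups())
            result[name] = {"Busy": busy, "Free": free, "Min": minimum, "Max": maximum}
    return result


def diagnose(parsed, unread_bytes_threshold=1024):
    """Flag Busy > Min for each pool and a large unread 'in:' value.

    unread_bytes_threshold is an arbitrary value chosen for this sketch.
    """
    findings = []
    for name in ("IOCP", "WORKER"):
        pool = parsed.get(name)
        if pool and pool["Busy"] > pool["Min"]:
            findings.append(f"{name}: Busy={pool['Busy']} exceeds Min={pool['Min']}")
    if parsed.get("in", 0) > unread_bytes_threshold:
        findings.append(f"in: {parsed['in']} bytes received but not yet read")
    return findings


parsed = parse_timeout_message(MESSAGE)
findings = diagnose(parsed)
for finding in findings:
    print(finding)
```

For the sample message, all three conditions are flagged: both pools have Busy above Min, and the unread byte count is well above the threshold.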

High client CPU usage

High client CPU usage indicates the system can't keep up with the work it's been asked to do. Even though the cache sent the response quickly, the client may fail to process the response in a timely fashion.

Monitor the client's system-wide CPU usage using metrics available in the Azure portal or through performance counters on the machine. Monitor system-wide CPU rather than per-process CPU, because a single process can have low CPU usage while the system-wide CPU is high. Watch for spikes in CPU usage that correspond with timeouts. High CPU may also cause high in: XXX values in TimeoutException error messages, as described in the Traffic burst section.

Note

StackExchange.Redis 1.1.603 and later includes the local-cpu metric in TimeoutException error messages. Ensure you're using the latest version of the StackExchange.Redis NuGet package. Bugs are regularly fixed in the code to make it more robust to timeouts, so having the latest version is important.

To mitigate a client's high CPU usage:

  • Investigate what is causing CPU spikes.
  • Upgrade your client to a larger VM size with more CPU capacity.

Client-side bandwidth limitation

Depending on their architecture, client machines may have limits on how much network bandwidth they have available. If the client exceeds the available bandwidth by overloading network capacity, then data isn't processed on the client side as quickly as the server is sending it. This situation can lead to timeouts.

Monitor how your bandwidth usage changes over time using an example BandwidthLogger. This code may not run successfully in some environments with restricted permissions (like Azure web sites).

To mitigate, reduce network bandwidth consumption or increase the client VM size to one with more network capacity.
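
To judge how close a client is to its limit, two readings of a cumulative bytes-received counter can be turned into an average rate and compared with the bandwidth available to the VM size. The following is a minimal sketch of that arithmetic only; the counter readings and the 500 Mbps limit are hypothetical, and how the counter is read depends on your platform.

```python
def bandwidth_mbps(bytes_start, bytes_end, interval_seconds):
    """Average throughput in megabits per second between two counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (bytes_end - bytes_start) * 8 / interval_seconds / 1_000_000


def utilization(bytes_start, bytes_end, interval_seconds, limit_mbps):
    """Fraction of the assumed bandwidth limit used during the interval."""
    return bandwidth_mbps(bytes_start, bytes_end, interval_seconds) / limit_mbps


# Hypothetical readings of a cumulative bytes-received counter, 10 seconds apart.
rate = bandwidth_mbps(1_000_000_000, 1_562_500_000, 10)
# Hypothetical limit for the client VM size.
used = utilization(1_000_000_000, 1_562_500_000, 10, limit_mbps=500)
print(f"{rate:.0f} Mbps, {used:.0%} of the assumed limit")
```

Sustained utilization near 100% during the periods when timeouts occur suggests the client's network capacity is the bottleneck.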

Large request or response size

A large request/response can cause timeouts. As an example, suppose your timeout value configured on your client is 1 second. Your application requests two keys (for example, 'A' and 'B') at the same time (using the same physical network connection). Most clients support request "pipelining", where both requests 'A' and 'B' are sent one after the other without waiting for their responses. The server sends the responses back in the same order. If response 'A' is large, it can eat up most of the timeout for later requests.

In the following example, requests 'A' and 'B' are sent quickly to the server. The server starts sending responses 'A' and 'B' quickly. Because of data transfer times, response 'B' must wait behind response 'A' and times out, even though the server responded quickly.

|-------- 1 Second Timeout (A)----------|
|-Request A-|
     |-------- 1 Second Timeout (B) ----------|
     |-Request B-|
            |- Read Response A --------|
                                       |- Read Response B-| (**TIMEOUT**)
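
The timeline above can be reproduced with simple arithmetic. The following is a minimal sketch in which every number (response sizes, a 10 MB/s link, send times) is an assumption chosen so that response 'A' just fits within its timeout; the function name is hypothetical.

```python
def response_finish_times(response_sizes_bytes, bandwidth_bytes_per_sec, first_byte_at=0.0):
    """Return when each response finishes arriving on one shared connection.

    Responses arrive in order, so each one starts only after the previous
    one has been fully transferred.
    """
    finish_times = []
    clock = first_byte_at
    for size in response_sizes_bytes:
        clock += size / bandwidth_bytes_per_sec
        finish_times.append(clock)
    return finish_times


TIMEOUT_SECONDS = 1.0
sent_at = [0.0, 0.01]  # requests A and B are pipelined almost back to back

# Assumed sizes: response A is 9.5 MB, response B is 1 MB, over a 10 MB/s link.
finish = response_finish_times(
    [9_500_000, 1_000_000], 10_000_000, first_byte_at=0.02
)
timed_out = [done > sent + TIMEOUT_SECONDS for done, sent in zip(finish, sent_at)]

for name, done, late in zip("AB", finish, timed_out):
    print(f"response {name} finished at {done:.2f}s, timed out: {late}")
```

Under these assumptions, response 'A' finishes inside its own one-second window, while the much smaller response 'B' misses its deadline purely because it queued behind 'A'.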

This kind of request/response problem is difficult to measure. You could instrument your client code to track large requests and responses.

Resolutions for large response sizes vary, but include:

  1. Optimize your application for a large number of small values, rather than a few large values.
    • The preferred solution is to break up your data into related smaller values.
  2. Increase the size of your VM to get higher bandwidth capabilities.
    • More bandwidth on your client or server VM may reduce data transfer times for larger responses.
    • Compare your current network usage on both machines to the limits of your current VM size. More bandwidth on only the server or only the client may not be enough.
  3. Increase the number of connection objects your application uses.
    • Use a round-robin approach to make requests over different connection objects.
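
Items 1 and 3 can be sketched independently of any particular Redis client. The following is a minimal, single-threaded illustration: split_value and RoundRobinPool are hypothetical helpers, the key:index naming scheme is an assumption, and the connection objects are stand-in strings rather than real client connections.

```python
from itertools import cycle


def split_value(key, value, chunk_size):
    """Split one large value into smaller values stored under related keys."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        f"{key}:{index}": value[start:start + chunk_size]
        for index, start in enumerate(range(0, len(value), chunk_size))
    }


class RoundRobinPool:
    """Hand out connection objects in rotation so requests are spread evenly."""

    def __init__(self, connections):
        if not connections:
            raise ValueError("at least one connection is required")
        self._cycle = cycle(connections)

    def next_connection(self):
        return next(self._cycle)


# A 2,500-byte value becomes three related values of at most 1,000 bytes each.
chunks = split_value("report", b"x" * 2500, 1000)
print({name: len(data) for name, data in chunks.items()})

# Stand-in connection objects; a real application would hold client connections here.
pool = RoundRobinPool(["conn-0", "conn-1", "conn-2"])
picks = [pool.next_connection() for _ in range(5)]
print(picks)
```

With several connections, a large response on one connection no longer delays every other request, because later requests are issued on different connections.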

Additional information