Troubleshoot Azure Cache for Redis server-side issues

This section covers troubleshooting issues caused by a condition on an Azure Cache for Redis instance or on the virtual machine(s) hosting it.


Several of the troubleshooting steps in this guide include instructions to run Redis commands and monitor various performance metrics. For more information and instructions, see the articles in the Additional information section.

Memory pressure on Redis server

Memory pressure on the server side leads to a range of performance problems that can delay processing of requests. When memory pressure hits, the system may page data to disk. This page faulting slows the system down significantly. There are several possible causes of this memory pressure:

  • The cache is filled with data near its maximum capacity.
  • Redis is seeing high memory fragmentation. This fragmentation is most often caused by storing large objects, since Redis is optimized for small objects.

Redis exposes two stats through the INFO command that can help you identify this issue: "used_memory" and "used_memory_rss". You can view these metrics using the portal.
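As a rough illustration of how these two stats can be compared, the sketch below parses the text returned by INFO and computes the ratio used_memory_rss / used_memory. The function names and the sample text are hypothetical, and reading a ratio well above 1 as a sign of fragmentation is a rule of thumb rather than a threshold taken from this article.

```python
def parse_info(info_text):
    """Parse the text returned by the Redis INFO command into a dict of strings."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value
    return stats


def fragmentation_ratio(info_text):
    """Return used_memory_rss / used_memory, or None if either value is unusable."""
    stats = parse_info(info_text)
    try:
        used = int(stats["used_memory"])
        rss = int(stats["used_memory_rss"])
    except (KeyError, ValueError):
        return None
    if used == 0:
        return None
    return rss / used


# Hypothetical sample; real INFO output contains many more fields.
sample = "# Memory\r\nused_memory:1000000\r\nused_memory_rss:1500000\r\n"
print(fragmentation_ratio(sample))
```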

There are several possible changes you can make to help keep memory usage healthy:

  • Configure a memory policy and set expiration times on your keys. This policy may not be sufficient if you have fragmentation.
  • Configure a maxmemory-reserved value that is large enough to compensate for memory fragmentation.
  • Break up your large cached objects into smaller related objects.
  • Create alerts on metrics like used memory to be notified early about potential impacts.
  • Scale to a larger cache size with more memory capacity.
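One possible way to break up a large cached object is to split its serialized bytes into fixed-size chunks stored under related keys. The sketch below shows only the splitting and reassembly. The key naming scheme ("<key>:chunk:<index>") and the chunk size are assumptions made for illustration, not a convention defined by Redis or by this article.

```python
def split_value(key, value, chunk_size):
    """Split a bytes value into (chunk_key, chunk) pairs of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(value), chunk_size)):
        # Hypothetical naming scheme: "<key>:chunk:<index>".
        chunks.append((f"{key}:chunk:{index}", value[start:start + chunk_size]))
    return chunks


def join_chunks(chunks):
    """Reassemble the original value from chunks kept in index order."""
    return b"".join(chunk for _, chunk in chunks)


payload = b"x" * 2500
parts = split_value("report:42", payload, 1024)
print([name for name, _ in parts])
```

Each pair could then be written as its own key, ideally with an expiration time, so that no single value is large.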

High CPU usage or server load

A high server load or CPU usage means the server can't process requests in a timely fashion. The server may be slow to respond and unable to keep up with request rates.

Monitor metrics such as CPU or server load. Watch for spikes in CPU usage that correspond with timeouts.

There are several changes you can make to mitigate high server load:

  • Investigate what is causing CPU spikes, such as the long-running commands noted below or page faulting because of high memory pressure.
  • Create alerts on metrics like CPU or server load to be notified early about potential impacts.
  • Scale to a larger cache size with more CPU capacity.

Long-running commands

Some Redis commands are more expensive to execute than others. The Redis commands documentation shows the time complexity of each command. Because Redis command processing is single-threaded, a command that takes a long time to run blocks all others that come after it. You should review the commands that you're issuing to your Redis server to understand their performance impacts. For instance, the KEYS command is often used without knowing that it's an O(N) operation. You can avoid KEYS by using SCAN to reduce CPU spikes.
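A minimal sketch of the SCAN pattern follows: rather than one KEYS call that walks the whole keyspace, the loop asks for a small batch at a time and continues until the server returns cursor 0. It assumes a client object whose scan(cursor, match, count) method returns a (next_cursor, keys) pair, which is how the redis-py client behaves. FakeClient is a hypothetical stand-in so the sketch runs without a server.

```python
import fnmatch


def iter_keys(client, pattern="*", count=100):
    """Yield keys matching pattern using incremental SCAN calls instead of one KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        for key in keys:
            yield key
        # A returned cursor of 0 means the iteration is complete.
        if cursor == 0:
            break


class FakeClient:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self, keys):
        self._keys = list(keys)

    def scan(self, cursor=0, match=None, count=10):
        page = self._keys[cursor:cursor + count]
        if match is not None:
            page = [k for k in page if fnmatch.fnmatch(k, match)]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, page


client = FakeClient([f"user:{i}" for i in range(5)] + ["order:1"])
print(list(iter_keys(client, "user:*", count=2)))
```

Against a real server, SCAN may return the same key more than once, so callers that need uniqueness should deduplicate the results.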

Using the SLOWLOG command, you can identify and measure expensive commands being executed against the server.
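To illustrate how slow log entries might be reviewed, the sketch below ranks entries by execution time. It assumes each entry is a dict with a "duration" field in microseconds and a "command" field, which is the shape the redis-py slowlog_get() call returns. The function name and the sample entries are hypothetical.

```python
def slowest_commands(entries, top=5):
    """Return (duration_ms, command) pairs for the slowest entries, longest first."""
    ranked = sorted(entries, key=lambda entry: entry["duration"], reverse=True)
    result = []
    for entry in ranked[:top]:
        command = entry["command"]
        # Clients may return the command as bytes; decode it for display.
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # SLOWLOG reports durations in microseconds; convert to milliseconds.
        result.append((entry["duration"] / 1000.0, command))
    return result


# Hypothetical sample entries (not real measurements).
entries = [
    {"id": 1, "duration": 1200, "command": b"GET user:1"},
    {"id": 2, "duration": 250000, "command": b"KEYS *"},
]
print(slowest_commands(entries, top=1))
```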

Server-side bandwidth limitation

Different cache sizes have different network bandwidth capacities. If the server exceeds the available bandwidth, then data won't be sent to the client as quickly. Client requests could time out because the server can't push data to the client fast enough.

The "Cache Read" and "Cache Write" metrics can be used to see how much server-side bandwidth is being used. You can view these metrics in the portal.

To mitigate situations where network bandwidth usage is close to maximum capacity:

  • Change client call behavior to reduce network demand.
  • Create alerts on metrics like cache read or cache write to be notified early about potential impacts.
  • Scale to a larger cache size with more network bandwidth capacity.
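As a small worked example of judging how close usage is to capacity, the sketch below converts a throughput reading into a fraction of a bandwidth limit. It assumes the reading is expressed in bytes per second and the limit in megabits per second. The function name and both sample figures are hypothetical and do not describe any particular cache size.

```python
def bandwidth_utilization(bytes_per_second, capacity_megabits_per_second):
    """Return observed throughput as a fraction of the given bandwidth capacity."""
    if capacity_megabits_per_second <= 0:
        raise ValueError("capacity must be positive")
    # 1 byte = 8 bits; 1 megabit = 1,000,000 bits.
    megabits_per_second = bytes_per_second * 8 / 1_000_000
    return megabits_per_second / capacity_megabits_per_second


# Hypothetical figures: 50 MB/s observed against an assumed 500 Mbps limit.
utilization = bandwidth_utilization(50_000_000, 500)
print(f"{utilization:.0%}")
```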

Additional information