Troubleshoot Azure Files performance issues

This article lists some common problems related to Azure file shares, along with potential causes and workarounds for each.

High latency, low throughput, and general performance issues

Cause 1: Share experiencing throttling

The default quota on a premium share is 100 GiB, which provides 100 baseline IOPS (with the potential to burst up to 300 for an hour). For more information about provisioning and its relationship to IOPS, see the Provisioned shares section of the planning guide.
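
As a rough illustration of how quota relates to IOPS, the sketch below extrapolates from the single 100 GiB example above. The linear scaling (1 baseline IOPS per provisioned GiB) and the 3x burst factor are assumptions inferred from that one data point, and the function name is hypothetical; the planning guide remains the authority on actual limits and caps.

```python
def estimate_premium_iops(provisioned_gib: int) -> dict:
    """Estimate baseline and burst IOPS for a premium share.

    Assumption: 1 baseline IOPS per provisioned GiB and a burst limit of
    3x baseline, extrapolated from the 100 GiB -> 100/300 IOPS example.
    Real limits may be capped or follow a different formula.
    """
    if provisioned_gib <= 0:
        raise ValueError("provisioned_gib must be positive")
    baseline = provisioned_gib  # assumed: 1 IOPS per GiB
    burst = baseline * 3        # assumed: burst is 3x baseline
    return {"baseline_iops": baseline, "burst_iops": burst}
```

For example, `estimate_premium_iops(100)` reproduces the default-quota figures quoted above.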

To confirm whether your share is being throttled, you can use Azure Metrics in the portal.

  1. Sign in to the Azure portal.

  2. Select All services, and then search for Metrics.

  3. Select Metrics.

  4. Select your storage account as the resource.

  5. Select File as the metric namespace.

  6. Select Transactions as the metric.

  7. Add a filter for ResponseType and check whether any requests have a response code of SuccessWithThrottling (for SMB) or ClientThrottlingError (for REST).

Metrics options for premium file shares
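
If you export the Transactions metric split by ResponseType, the same check can be done offline. The following is a minimal sketch that assumes you already have the counts in a plain dictionary keyed by response type; it does not call any Azure API, and the function name and input shape are hypothetical.

```python
# Response types that indicate throttling, as named in step 7 above.
THROTTLED_RESPONSE_TYPES = {"SuccessWithThrottling", "ClientThrottlingError"}


def throttled_fraction(transactions_by_response_type: dict) -> float:
    """Return the fraction of transactions that were throttled.

    The input maps a ResponseType value to a transaction count
    (an assumed export format, not an Azure SDK structure).
    """
    total = sum(transactions_by_response_type.values())
    if total == 0:
        return 0.0
    throttled = sum(
        count
        for response_type, count in transactions_by_response_type.items()
        if response_type in THROTTLED_RESPONSE_TYPES
    )
    return throttled / total
```

Any value above zero means at least some requests were throttled during the sampled period.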

Note

To receive an alert if a file share is throttled, see How to create an alert if a file share is throttled.

Solution

  • Increase the share's provisioned capacity by specifying a higher quota on the share.

Cause 2: Metadata/namespace heavy workload

If the majority of your requests are metadata-centric (such as createfile/openfile/closefile/queryinfo/querydirectory), latency will be worse than for read/write operations.

To confirm whether most of your requests are metadata-centric, use the same steps as above, but add a filter for API Name instead of a filter for ResponseType.

Filter for API Name in your metrics
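
To judge whether a workload is metadata heavy, you can compare the metadata operation count against the total once the Transactions metric is split by API name. This sketch assumes the counts are available as a dictionary and that the API names match the operation names listed above (case-insensitively); the exact dimension values reported by the service may differ, so treat the name set as illustrative.

```python
# Metadata-centric operations named in the text above. The exact API name
# values reported in metrics may differ; adjust this set accordingly.
METADATA_OPERATIONS = {
    "createfile",
    "openfile",
    "closefile",
    "queryinfo",
    "querydirectory",
}


def metadata_fraction(transactions_by_api_name: dict) -> float:
    """Return the fraction of transactions that are metadata operations."""
    total = sum(transactions_by_api_name.values())
    if total == 0:
        return 0.0
    metadata = sum(
        count
        for api_name, count in transactions_by_api_name.items()
        if api_name.lower() in METADATA_OPERATIONS
    )
    return metadata / total
```

A result well above one half suggests that most requests are metadata-centric and that the workarounds below are worth trying.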

Workaround

  • Check whether the application can be modified to reduce the number of metadata operations.
  • Add a VHD on the file share and mount the VHD over SMB from the client to perform file operations against the data. This approach works for single-writer and multiple-reader scenarios and allows metadata operations to be local, offering performance similar to local direct-attached storage.

Cause 3: Single-threaded application

If the application in use is single-threaded, it can achieve significantly lower IOPS/throughput than the maximum possible for your provisioned share size.

Solution

  • Increase application parallelism by increasing the number of threads.
  • Switch to applications where parallelism is possible. For example, for copy operations, you could use AzCopy or RoboCopy from Windows clients or the parallel command on Linux clients.
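
To illustrate what increasing parallelism can look like in application code, here is a minimal sketch that copies a directory tree with a thread pool instead of one file at a time. The function name, the default of 16 workers, and the assumption that the destination is a mounted file share path are all illustrative; the right worker count depends on your workload and share limits.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parallel_copy(source_dir, dest_dir, max_workers=16):
    """Copy every file under source_dir to dest_dir using a thread pool."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    files = [path for path in source.rglob("*") if path.is_file()]

    def copy_one(path):
        # Recreate the relative directory structure under the destination.
        target = dest / path.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        return target

    # Each worker thread issues its own I/O, so several requests are in
    # flight at once instead of a single sequential stream.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(copy_one, files))
```

Threads suit this case because the work is I/O bound; purpose-built tools such as AzCopy or RoboCopy are still likely to be the better choice for large transfers.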

Very high latency for requests

Cause

The client VM could be located in a different region than the file share.

Solution

  • Run the application from a VM that is located in the same region as the file share.

Client unable to achieve maximum throughput supported by the network

One potential cause is the lack of SMB multi-channel support. Currently, Azure file shares support only a single channel, so there is only one connection from the client VM to the server. This single connection is pegged to a single core on the client VM, so the maximum throughput achievable from a VM is bound by a single core.

Workaround

  • Obtaining a VM with a more powerful core may help improve throughput.

  • Running the client application from multiple VMs will increase throughput.

  • Use REST APIs where possible.

Throughput on Linux clients is significantly lower than on Windows clients

Cause

This is a known issue with the implementation of the SMB client on Linux.

Workaround

  • Spread the load across multiple VMs.
  • On the same VM, use multiple mount points with the nosharesock option, and spread the load across these mount points.
  • On Linux, try mounting with the nostrictsync option to avoid forcing an SMB flush on every fsync call. For Azure Files, this option does not interfere with data consistency, but it may result in stale file metadata in directory listings (ls -l command). Directly querying the metadata of a file (stat command) returns the most up-to-date file metadata.
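
The mount-point workaround can be scripted. The sketch below only builds the mount command strings and does not run them; the share path, mount directory prefix, credentials file path, and the vers=3.0 option are assumptions for illustration, while nosharesock and nostrictsync are the options discussed above.

```python
def build_mount_commands(
    unc_path,
    mount_base,
    count,
    credentials_file,
    extra_options=("nosharesock", "nostrictsync"),
):
    """Build one `mount -t cifs` command per mount point.

    Mount points are named <mount_base>0, <mount_base>1, and so on.
    vers=3.0 and the credentials file are assumed values, not taken
    from the article.
    """
    options = ",".join(
        ["vers=3.0", f"credentials={credentials_file}", *extra_options]
    )
    return [
        f"mount -t cifs {unc_path} {mount_base}{index} -o {options}"
        for index in range(count)
    ]
```

With nosharesock, each mount point gets its own connection to the server, so the application can spread its I/O across the resulting directories.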

High latencies for metadata heavy workloads involving extensive open/close operations

Cause

Lack of support for directory leases.

Workaround

  • If possible, avoid excessive opening and closing of handles on the same directory within a short period of time.
  • For Linux VMs, increase the directory entry cache timeout by specifying actimeo=<sec> as a mount option. By default, it is one second, so a larger value such as three or five might help.
  • For Linux VMs, upgrade the kernel to 4.20 or higher.
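
To check whether a Linux VM already meets the kernel recommendation, you can compare its release string against 4.20. This is a minimal sketch; the function name is hypothetical, and it assumes the release string starts with "major.minor", as the output of uname -r normally does.

```python
import re


def kernel_at_least(release, minimum=(4, 20)):
    """Return True if a kernel release string is at or above `minimum`.

    `release` is a string such as "5.4.0-1045-azure". Only the leading
    major and minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    return version >= minimum
```

On the VM itself, you could pass in the output of uname -r, or `platform.release()` from Python's standard library.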

Low IOPS on CentOS/RHEL

Cause

IO depth greater than one is not supported on CentOS/RHEL.

Workaround

  • Upgrade to CentOS 8 / RHEL 8.
  • Change to Ubuntu.

Slow file copying to and from Azure Files in Linux

If you are experiencing slow file copying to and from Azure Files, see the Slow file copying to and from Azure Files in Linux section in the Linux troubleshooting guide.

Jittery/saw-tooth pattern for IOPS

Cause

The client application consistently exceeds baseline IOPS. Currently, there is no service-side smoothing of the request load, so if the client exceeds baseline IOPS, the service throttles it. That throttling can result in the client experiencing a jittery/saw-tooth IOPS pattern. In this case, the average IOPS achieved by the client might be lower than the baseline IOPS.

Workaround

  • Reduce the request load from the client application so that the share does not get throttled.
  • Increase the quota of the share so that the share does not get throttled.
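
One way to reduce the request load is to pace requests on the client so that they stay at or below the baseline. The sketch below is a simple fixed-interval pacer, which is one possible technique and not something the service or this article prescribes; the class name and the injectable clock and sleep parameters are illustrative choices.

```python
import time


class RequestPacer:
    """Space out requests so the rate stays at or below a target IOPS."""

    def __init__(self, target_iops, clock=time.monotonic, sleep=time.sleep):
        if target_iops <= 0:
            raise ValueError("target_iops must be positive")
        self._interval = 1.0 / target_iops
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait(self):
        """Block until the next request may be sent; return the delay."""
        now = self._clock()
        delay = 0.0
        if now < self._next_slot:
            delay = self._next_slot - now
            self._sleep(delay)
            now = self._next_slot
        # Schedule the following slot one interval later. Idle time is not
        # banked, so a pause is never followed by a burst.
        self._next_slot = now + self._interval
        return delay
```

Call `wait()` before each request to the share. Setting `target_iops` somewhat below the share's baseline leaves headroom for other clients that use the same share.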

Excessive DirectoryOpen/DirectoryClose calls

Cause

If DirectoryOpen/DirectoryClose calls are among the top API calls and you don't expect the client to be making that many calls, the antivirus software installed on the Azure client VM may be the cause.

Workaround

File creation is slower than expected

Cause

Workloads that rely on creating a large number of files will not see a substantial difference between the performance of premium file shares and standard file shares.

Workaround

  • None.

Slow performance from Windows 8.1 or Server 2012 R2

Cause

Higher than expected latency when accessing Azure Files for IO-intensive workloads.

Workaround

How to create an alert if a file share is throttled

  1. Go to your storage account in the Azure portal.
  2. In the Monitoring section, click Alerts, and then click + New alert rule.
  3. Click Edit resource, select the File resource type for the storage account, and then click Done. For example, if the storage account name is contoso, select the contoso/file resource.
  4. Click Select Condition to add a condition.
  5. In the list of signals supported for the storage account, select the Transactions metric.
  6. On the Configure signal logic blade, click the Dimension name drop-down and select Response type.
  7. Click the Dimension values drop-down and select SuccessWithThrottling (for SMB) or ClientThrottlingError (for REST).

Note

If the SuccessWithThrottling or ClientThrottlingError dimension value is not listed, the resource has not been throttled. To add the dimension value, click Add custom value beside the Dimension values drop-down, type SuccessWithThrottling or ClientThrottlingError, click OK, and then repeat step 7.

  8. Click the Dimension name drop-down and select File Share.
  9. Click the Dimension values drop-down and select the file share(s) that you want to alert on.

Note

If the file share is a standard file share, select All current and future values. The Dimension values drop-down will not list the file share(s) because per-share metrics are not available for standard file shares. Throttling alerts for standard file shares are triggered if any file share within the storage account is throttled, and the alert does not identify which file share was throttled. Because per-share metrics are not available for standard file shares, the recommendation is to have one file share per storage account.

  10. Define the alert parameters (threshold value, operator, aggregation granularity, and frequency of evaluation) and click Done.

Tip

If you are using a static threshold and the file share is currently being throttled, the metric chart can help you determine a reasonable threshold value. If you are using a dynamic threshold, the metric chart displays the calculated thresholds based on recent data.

  11. Click Select action group to add an action group (email, SMS, etc.) to the alert, either by selecting an existing action group or by creating a new one.
  12. Fill in the alert details, such as Alert rule name, Description, and Severity.
  13. Click Create alert rule to create the alert.

To learn more about configuring alerts in Azure Monitor, see Overview of alerts in Microsoft Azure.

See also