Troubleshooting guide for Azure SignalR Service common issues

This guide provides troubleshooting guidance based on the common issues that customers have encountered and resolved over the past years.

Access token too long

Possible errors:

  • Client-side ERR_CONNECTION_
  • 414 URI Too Long
  • 413 Payload Too Large
  • Access Token must not be longer than 4K. 413 Request Entity Too Large

Root cause:

For HTTP/2, the maximum length of a single header is 4 K, so a browser that accesses the Azure service hits this limit and reports an ERR_CONNECTION_ error.

For HTTP/1.1 or C# clients, the maximum URI length is 12 K and the maximum header length is 16 K.

With SDK version 1.0.6 or higher, /negotiate throws 413 Payload Too Large when the generated access token is larger than 4 K.

Solution:

By default, claims from context.User.Claims are included when generating the JWT access token to ASRS (Azure SignalR Service), so that the claims are preserved and can be passed from ASRS to the Hub when the client connects to the Hub.

In some cases, context.User.Claims is used to store a lot of information for the app server, most of which is not used by Hubs but by other components.

The generated access token is passed through the network, and for WebSocket/SSE connections, access tokens are passed through query strings. As a best practice, we suggest passing only the claims that the Hub needs from the client through ASRS to your app server.

A ClaimsProvider lets you customize the claims passed to ASRS inside the access token.

For ASP.NET Core:

services.AddSignalR()
        .AddAzureSignalR(options =>
            {
                // pick up necessary claims
                options.ClaimsProvider = context => context.User.Claims.Where(...);
            });

For ASP.NET:

app.MapAzureSignalR(GetType().FullName, options =>
            {
                // pick up necessary claims
                options.ClaimsProvider = context => context.Authentication?.User.Claims.Where(...);
            });
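
As an illustration of what "pick up necessary claims" could look like, here is a minimal sketch that keeps only claims whose type is on an allow-list. The class name HubClaimsFilter, the method SelectHubClaims, and the two allowed claim types are assumptions made for this example; replace them with the claim types your Hubs actually read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

public static class HubClaimsFilter
{
    // Hypothetical allow-list: only these claim types are kept in the access token.
    private static readonly HashSet<string> AllowedClaimTypes =
        new HashSet<string>(StringComparer.Ordinal)
        {
            ClaimTypes.NameIdentifier,
            ClaimTypes.Name,
        };

    // Returns only the claims whose type is on the allow-list.
    // A null principal yields an empty sequence instead of throwing.
    public static IEnumerable<Claim> SelectHubClaims(ClaimsPrincipal user)
    {
        if (user == null)
        {
            return Enumerable.Empty<Claim>();
        }

        return user.Claims.Where(claim => AllowedClaimTypes.Contains(claim.Type));
    }
}
```

In ASP.NET Core, such a helper could then be wired in as options.ClaimsProvider = context => HubClaimsFilter.SelectHubClaims(context.User);. An allow-list is used here rather than a deny-list so that claims added later by other components do not silently grow the token.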

TLS 1.2 required

Possible errors:

  • ASP.NET "No server available" error #279
  • ASP.NET "The connection is not active, data cannot be sent to the service." error #324
  • "An error occurred while making the HTTP request to https://. This error could be due to the fact that the server certificate is not configured properly with HTTP.SYS in the HTTPS case. This error could also be caused by a mismatch of the security binding between the client and the server."

Root cause:

For security reasons, the Azure service supports only TLS 1.2. With .NET Framework, TLS 1.2 might not be the default protocol. As a result, the server connections to ASRS cannot be established.

Troubleshooting guide

  1. If this error can be reproduced locally, uncheck Just My Code, break on all CLR exceptions, and debug the app server locally to see which exception is thrown.

    • Uncheck Just My Code

    • Throw CLR exceptions

    • Inspect the exceptions thrown while debugging the app server-side code.

  2. For ASP.NET, you can also add the following code to your Startup.cs to enable detailed tracing and see the errors in the log.

app.MapAzureSignalR(this.GetType().FullName);
// Make sure this switch is called after MapAzureSignalR
GlobalHost.TraceManager.Switch.Level = SourceLevels.Information;

Solution:

Add the following code to your Startup.cs:

ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

400 Bad Request returned for client requests

Root cause

Check whether your client request has multiple hub query strings. hub is a reserved query parameter, and the service returns 400 if it detects more than one hub in the query.
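
To check a captured request URL for this condition, a small helper along the following lines can count how many times hub appears in the query string. This is a sketch only: HubQueryCheck and CountHubParameters are names invented for this example, and matching the key case-insensitively is an assumption made to err on the side of detection.

```csharp
using System;
using System.Linq;

public static class HubQueryCheck
{
    // Counts query-string keys named "hub" in the given absolute request URL.
    // Case-insensitive matching is an assumption, not documented service behavior.
    public static int CountHubParameters(Uri requestUri)
    {
        return requestUri.Query
            .TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(pair => Uri.UnescapeDataString(pair.Split('=')[0]))
            .Count(key => string.Equals(key, "hub", StringComparison.OrdinalIgnoreCase));
    }
}
```

A result greater than 1 for the URL of the failing request points to the duplicate hub parameter as the likely cause of the 400.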

401 Unauthorized returned for client requests

Root cause

Currently the default value of the JWT token's lifetime is 1 hour.

For ASP.NET Core SignalR, this is not a problem when the WebSocket transport type is used.

For ASP.NET Core SignalR's other transport types, SSE and long polling, this means that by default the connection can persist for at most 1 hour.

For ASP.NET SignalR, the client sends a /ping KeepAlive request to the service from time to time. When the /ping fails, the client aborts the connection and never reconnects. This means that, for ASP.NET SignalR, the default token lifetime makes the connection last for at most 1 hour for all transport types.

Solution

For security reasons, extending the TTL is not encouraged. We suggest adding reconnect logic to the client to restart the connection when such a 401 occurs. When the client restarts the connection, it negotiates with the app server again and gets a renewed JWT token.

Check here for how to restart client connections.
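
For the ASP.NET Core SignalR C# client, restart logic of this kind could be sketched as follows. It assumes the Microsoft.AspNetCore.SignalR.Client package; RestartingClient, ConnectAsync, and the hubUrl parameter are names chosen for this example only.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

public static class RestartingClient
{
    // hubUrl is a placeholder for your app server's hub endpoint.
    public static async Task<HubConnection> ConnectAsync(string hubUrl)
    {
        var connection = new HubConnectionBuilder()
            .WithUrl(hubUrl)
            .Build();

        // Closed is raised when the connection ends, for example after the
        // access token expires. StartAsync negotiates with the app server
        // again, so the restarted connection gets a renewed token.
        // Note: Closed is also raised after an intentional StopAsync, so
        // unsubscribe this handler before stopping the connection on purpose.
        connection.Closed += async error =>
        {
            // Single restart attempt; if it throws, no further attempt is made.
            await connection.StartAsync();
        };

        await connection.StartAsync();
        return connection;
    }
}
```

This sketch makes only one restart attempt per close. A production client would typically wrap the restart in a retry loop with a delay.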

404 returned for client requests

A SignalR persistent connection first calls /negotiate on Azure SignalR Service and then establishes the real connection to Azure SignalR Service.

Troubleshooting guide

  • Follow How to view outgoing requests to get the request from the client to the service.
  • Check the URL of the request when the 404 occurs. If the URL targets your web app and looks like {your_web_app}/hubs/{hubName}, check whether the client SkipNegotiation is true. When using Azure SignalR, the client receives a redirect URL when it first negotiates with the app server. The client should NOT skip negotiation when using Azure SignalR.
  • Another 404 can happen when the connect request is handled more than 5 seconds after /negotiate is called. Check the timestamp of the client request, and open an issue with us if the request to the service has a slow response.

404 returned for ASP.NET SignalR's reconnect request

For ASP.NET SignalR, when the client connection drops, it reconnects using the same connectionId three times before stopping the connection. /reconnect can help if the connection is dropped because of intermittent network issues, in which case /reconnect can reestablish the persistent connection successfully. Under other circumstances, for example when the client connection is dropped because the routed server connection is dropped, or when SignalR Service has internal errors such as instance restart, failover, or deployment, the connection no longer exists, so /reconnect returns 404. This is the expected behavior for /reconnect, and after three retries the connection stops. We suggest having connection restart logic for when the connection stops.

429 (Too Many Requests) returned for client requests

There are two cases.

Concurrent connection count exceeds limit.

For Free instances, the concurrent connection count limit is 20. For Standard instances, the concurrent connection count limit per unit is 1 K, which means Unit100 allows 100 K concurrent connections.

The connections include both client and server connections. Check here for how connections are counted.

500 Error when negotiate: Azure SignalR Service is not connected yet, please try again later.

Root cause

This error is reported when there is no server connection to Azure SignalR Service.

Troubleshooting guide

Enable server-side trace to find out the error details when the server tries to connect to Azure SignalR Service.

Enable server-side logging for ASP.NET Core SignalR

Server-side logging for ASP.NET Core SignalR integrates with the ILogger-based logging provided in the ASP.NET Core framework. You can enable server-side logging by using ConfigureLogging. A sample usage is as follows:

.ConfigureLogging((hostingContext, logging) =>
        {
            logging.AddConsole();
            logging.AddDebug();
        })

Logger categories for Azure SignalR always start with Microsoft.Azure.SignalR. To enable detailed logs from Azure SignalR, configure the preceding prefix to the Debug level in your appsettings.json file, as shown below:

{
    "Logging": {
        "LogLevel": {
            ...
            "Microsoft.Azure.SignalR": "Debug",
            ...
        }
    }
}

Enable server-side traces for ASP.NET SignalR

When using SDK version >= 1.0.0, you can enable traces by adding the following to web.config: (Details)

<system.diagnostics>
    <sources>
      <source name="Microsoft.Azure.SignalR" switchName="SignalRSwitch">
        <listeners>
          <add name="ASRS" />
        </listeners>
      </source>
    </sources>
    <!-- Sets the trace verbosity level -->
    <switches>
      <add name="SignalRSwitch" value="Information" />
    </switches>
    <!-- Specifies the trace writer for output -->
    <sharedListeners>
      <add name="ASRS" type="System.Diagnostics.TextWriterTraceListener" initializeData="asrs.log.txt" />
    </sharedListeners>
    <trace autoflush="true" />
  </system.diagnostics>

Client connection drops

When the client is connected to Azure SignalR, the persistent connection between the client and Azure SignalR can sometimes drop for different reasons. This section describes several possibilities that cause such connection drops and provides some guidance on how to identify the root cause.

Possible errors seen from the client side

  • The remote party closed the WebSocket connection without completing the close handshake
  • Service timeout. 30.00ms elapsed without receiving a message from service.
  • {"type":7,"error":"Connection closed with an error."}
  • {"type":7,"error":"Internal server error."}

Root cause:

Client connections can drop under various circumstances:

  • When the Hub throws exceptions on an incoming request.
  • When the server connection that the client is routed to drops. See the section below for details on server connection drops.
  • When a network connectivity issue happens between the client and SignalR Service.
  • When SignalR Service has internal errors such as instance restart, failover, deployment, and so on.

Troubleshooting guide

  1. Open the app server-side log to see if anything abnormal took place
  2. Check the app server-side event log to see if the app server restarted
  3. Create an issue with us providing the time frame, and email the resource name to us

Client connection increases constantly

It might be caused by improper usage of the client connection. If someone forgets to stop/dispose the SignalR client, the connection remains open.

Possible errors seen from SignalR's metrics in the Monitoring section of the Azure portal resource menu

Client connections rise constantly for a long time in Azure SignalR's Metrics.

Root cause:

The SignalR client connection's DisposeAsync is never called, so the connection stays open.

Troubleshooting guide

  1. Check whether the SignalR client is ever closed.

Solution

Check whether you close the connection. Manually call HubConnection.DisposeAsync() to stop the connection after using it.

For example:

var connection = new HubConnectionBuilder()
    .WithUrl(...)
    .Build();
try
{
    await connection.StartAsync();
    // Do your stuff
    await connection.StopAsync();
}
finally
{
    await connection.DisposeAsync();
}

Common improper client connection usage

Azure Function example

This issue often occurs when someone establishes a SignalR client connection inside an Azure Function method instead of making it a static member of the Function class. You might expect only one client connection to be established, but you see the client connection count increase constantly in Metrics, in the Monitoring section of the Azure portal resource menu. All these connections drop only after the Azure Function or Azure SignalR Service restarts. This is because Azure Function creates one client connection for each request, and if you don't stop the client connection in the Function method, the client keeps the connections to Azure SignalR Service alive.

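Solution

Stop the client connection in the Function method when you are done with it, or make the SignalR client a static member of your Function class.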
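
One way to avoid creating a connection per invocation is to hold a single client in a static member and reuse it, as in the sketch below. It assumes the Microsoft.AspNetCore.SignalR.Client package; SharedHubClient, GetAsync, and the HUB_URL setting name are hypothetical and exist only for this example.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

public static class SharedHubClient
{
    // Created and started once per process, on first use. Lazy<T> is
    // thread-safe by default, so concurrent invocations share one connection.
    // Caveat: if the first start fails, the faulted task stays cached.
    private static readonly Lazy<Task<HubConnection>> Connection =
        new Lazy<Task<HubConnection>>(CreateAndStartAsync);

    public static Task<HubConnection> GetAsync() => Connection.Value;

    private static async Task<HubConnection> CreateAndStartAsync()
    {
        // HUB_URL is a hypothetical app setting that holds the hub endpoint.
        var hubUrl = Environment.GetEnvironmentVariable("HUB_URL");

        var connection = new HubConnectionBuilder()
            .WithUrl(hubUrl)
            .Build();

        await connection.StartAsync();
        return connection;
    }
}
```

Inside the Function method, the shared client would then be obtained with var connection = await SharedHubClient.GetAsync(); rather than built anew for each request.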

Server connection drops

When the app server starts, the Azure SDK starts initiating server connections to the remote Azure SignalR in the background. As described in Internals of Azure SignalR Service, Azure SignalR routes incoming client traffic to these server connections. Once a server connection is dropped, all the client connections it serves are closed too.

As the connections between the app server and SignalR Service are persistent connections, they may experience network connectivity issues. In the Server SDK, we have an Always Reconnect strategy for server connections. As a best practice, we also encourage users to add continuous reconnect logic to the clients with a random delay time to avoid massive simultaneous requests to the server.
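
The continuous reconnect with a random delay described above could be sketched as a small retry helper. RetryWithJitter and RunAsync are names invented for this example, and choosing the delay uniformly between zero and maxDelay is one possible policy, not something the SDK prescribes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RetryWithJitter
{
    // Random is not thread-safe, so access to it is serialized with a lock.
    private static readonly Random Jitter = new Random();

    // Calls startAsync repeatedly until it succeeds or cancellation is requested.
    // After each failure it waits a random time in [0, maxDelay) so that many
    // clients dropped at the same moment do not all retry at the same moment.
    public static async Task RunAsync(
        Func<Task> startAsync,
        TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await startAsync();
                return;
            }
            catch (Exception) when (!cancellationToken.IsCancellationRequested)
            {
                double factor;
                lock (Jitter)
                {
                    factor = Jitter.NextDouble();
                }

                var delay = TimeSpan.FromMilliseconds(maxDelay.TotalMilliseconds * factor);
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

With the ASP.NET Core SignalR C# client, this could be attached as connection.Closed += _ => RetryWithJitter.RunAsync(() => connection.StartAsync(), TimeSpan.FromSeconds(10), CancellationToken.None);, where the 10-second upper bound is an arbitrary example value.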

On a regular basis, there are new version releases for Azure SignalR Service, and sometimes there are Azure-wide OS patches or upgrades, or occasional interruptions from our dependent services. These may cause a short period of service disruption, but as long as the client side has a disconnect/reconnect mechanism, the impact is minimal, like any client-side caused disconnect-reconnect.

This section describes several possibilities that lead to server connection drops and provides some guidance on how to identify the root cause.

Possible errors seen from the server side:

  • [Error]Connection "..." to the service was dropped
  • The remote party closed the WebSocket connection without completing the close handshake
  • Service timeout. 30.00ms elapsed without receiving a message from service.

Root cause:

The server-service connection is closed by ASRS (Azure SignalR Service).

Troubleshooting guide

  1. Open the app server-side log to see if anything abnormal took place
  2. Check the app server-side event log to see if the app server restarted
  3. Create an issue with us providing the time frame, and email the resource name to us

Tips

  • How to view the outgoing request from the client? Take ASP.NET Core as an example (ASP.NET is similar):
    • From the browser:

      Taking Chrome as an example, press F12 to open the developer tools window and switch to the Network tab. You might need to refresh the page with F5 to capture network traffic from the very beginning.

    • From the C# client:

      You can view local web traffic using Fiddler. WebSocket traffic is supported since Fiddler 4.5.

Next steps

In this guide, you learned how to handle the common issues. You can also learn more generic troubleshooting methods.