Troubleshoot Live Video Analytics on IoT Edge

This article covers troubleshooting steps for Live Video Analytics (LVA) on Azure IoT Edge.

Troubleshoot deployment issues

Diagnostics

As part of your Live Video Analytics deployment, you set up Azure resources such as IoT Hub and IoT Edge devices. As a first step in diagnosing problems, always ensure that the Edge device is properly set up by following these instructions:

  1. Run the check command.
  2. Check your IoT Edge version.
  3. Check the status of the IoT Edge security manager and its logs.
  4. View the messages that are going through the IoT Edge hub.
  5. Restart containers.
  6. Check your firewall and port configuration rules.

Pre-deployment issues

If the edge infrastructure is fine, you can look for issues with the deployment manifest file. To deploy the Live Video Analytics on IoT Edge module on the IoT Edge device alongside any other IoT modules, you use a deployment manifest that contains the IoT Edge hub, IoT Edge agent, and other modules and their properties. If the JSON in the manifest isn't well formed, the following command might fail with the error shown after it:

az iot edge set-modules --hub-name <iot-hub-name> --device-id lva-sample-device --content <path-to-deployment_manifest.json>

Failed to parse JSON from file: '' for argument 'content' with exception: "Extra data: line 101 column 1 (char 5325)"

If you encounter this error, we recommend that you check the JSON for missing brackets or other issues with the structure of the file. To validate the file structure, you can use a client such as Notepad++ with the JSON Viewer plug-in or an online tool such as the JSON Formatter & Validator.
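The manifest can also be checked locally before you run the deployment command. The following is a minimal sketch that uses Python's standard json module, whose error text has the same "Extra data: line ... column ... (char ...)" shape as the message above. The function names check_manifest_text and check_manifest_file are hypothetical helpers, not part of the Azure CLI.

```python
import json


def check_manifest_text(text):
    """Return None if text is well-formed JSON, else (line, column, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are the 1-based position where parsing failed.
        return (err.lineno, err.colno, err.msg)
    return None


def check_manifest_file(path):
    """Read a deployment manifest from disk and check that it parses."""
    with open(path, encoding="utf-8") as handle:
        return check_manifest_text(handle.read())
```

Calling check_manifest_file with the path to your deployment manifest returns None for a well-formed file. Otherwise it returns the line and column to inspect. This confirms only that the file is valid JSON, not that it is a valid deployment manifest.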

During deployment: Diagnose with media graph direct methods

After the Live Video Analytics on IoT Edge module is deployed correctly on the IoT Edge device, you can create and run the media graph by invoking direct methods. You can use the Azure portal to run a diagnosis of the media graph via direct methods:

  1. In the Azure portal, go to the IoT hub that's connected to your IoT Edge device.

  2. Look for Automatic device management, and then select IoT Edge.

  3. In the list of Edge devices, select the device that you want to diagnose.

    Screenshot of the Azure portal displaying a list of Edge devices.

  4. Check to see whether the response code is 200-OK. Other response codes for the IoT Edge runtime include:

    • 400 - The deployment configuration is malformed or invalid.
    • 417 - The device doesn't have a deployment configuration set.
    • 412 - The schema version in the deployment configuration is invalid.
    • 406 - The IoT Edge device is offline or not sending status reports.
    • 500 - An error occurred in the IoT Edge runtime.
  5. If the Specified in deployment and Reported by device columns indicate Yes, you can invoke direct methods on the Live Video Analytics on IoT Edge module. Select the module to go to a page where you can check the desired and reported properties and invoke direct methods. Keep in mind the following:

    • If you get a status 501 code, check that the direct method name is accurate. If the method name and request payload are accurate, you should get results along with a success code of 200. If the request payload is inaccurate, you get a status of 400 and a response payload that contains an error code and message, which should help you diagnose the issue with your direct method call.

    • Checking the reported and desired properties can help you understand whether the module properties have synced with the deployment. If they haven't, you can restart your IoT Edge device.

    • Use the Direct methods guide to call a few methods, especially simple ones such as GraphTopologyList. The guide also specifies the expected request and response payloads and error codes. After the simple direct methods succeed, you can be assured that the Live Video Analytics IoT Edge module is functionally OK.

      Screenshot of the "Direct method" pane for the IoT Edge module.
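When you script these checks, it can help to keep the status codes from this section in one lookup. The following is a small sketch in Python. The table and function names are hypothetical, and the meanings are copied from the lists in this section rather than from any SDK.

```python
# Status codes and meanings copied from the lists in this section.
IOT_EDGE_RUNTIME_CODES = {
    200: "OK.",
    400: "The deployment configuration is malformed or invalid.",
    417: "The device doesn't have a deployment configuration set.",
    412: "The schema version in the deployment configuration is invalid.",
    406: "The IoT Edge device is offline or not sending status reports.",
    500: "An error occurred in the IoT Edge runtime.",
}

DIRECT_METHOD_CODES = {
    200: "Success: the method name and request payload are accurate.",
    400: "The request payload is inaccurate; read the error code and message in the response payload.",
    501: "Check that the direct method name is accurate.",
}


def explain_status(code, table):
    """Return the documented meaning of a status code, or a fallback hint."""
    return table.get(code, "Status %d is not listed in this article." % code)
```

For example, explain_status(417, IOT_EDGE_RUNTIME_CODES) returns the text for a device that has no deployment configuration set.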

Post-deployment: Diagnose logs for issues during the run

The container logs for your IoT Edge module should contain diagnostics information to help you debug issues during module runtime. You can check the container logs and diagnose the issue yourself.

If you've run all the preceding checks and are still encountering issues, gather logs from the IoT Edge device with the support bundle command for further analysis by the Azure team. You can contact us for support and to submit the collected logs.

Common error resolutions

Live Video Analytics is deployed as an IoT Edge module on the IoT Edge device, and it works collaboratively with the IoT Edge agent and hub modules. Some of the common errors that you might encounter with the Live Video Analytics deployment are caused by issues with the underlying IoT infrastructure. The errors include:

Edge setup script issues

As part of our documentation, we've provided a setup script to deploy edge and cloud resources and get you started with Live Video Analytics on IoT Edge. This section presents some script errors that you might encounter, along with solutions for debugging them.

Issue: The script runs and creates some of the resources, but then it fails with the following message:

registering device...

Unable to load extension 'eventgrid: unrecognized kwargs: ['min_profile']'. Use --debug for more information.
The command failed with an unexpected error. Here is the traceback:

No module named 'azure.mgmt.iothub.iot_hub_client'
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/knack/cli.py", line 215, in invoke
  cmd_result = self.invocation.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 631, in execute
  raise ex
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 695, in _run_jobs_serially
  results.append(self._run_job(expanded_arg, cmd_copy))
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 688, in _run_job
  six.reraise(*sys.exc_info())
File "/opt/az/lib/python3.6/site-packages/six.py", line 693, in reraise
  raise value
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 665, in _run_job
  result = cmd_copy(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 324, in __call__
  return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/__init__.py", line 574, in default_command_handler
  return op(**command_args)
File "/home/.azure/cliextensions/azure-cli-iot-ext/azext_iot/operations/hub.py", line 75, in iot_device_list
  result = iot_query(cmd, query, hub_name, top, resource_group_name, login=login)
File "/home/.azure/cliextensions/azure-cli-iot-ext/azext_iot/operations/hub.py", line 45, in iot_query
  target = get_iot_hub_connection_string(cmd, hub_name, resource_group_name, login=login)
File "/home/.azure/cliextensions/azure-cli-iot-ext/azext_iot/common/_azure.py", line 112, in get_iot_hub_connection_string
  client = iot_hub_service_factory(cmd.cli_ctx)
File "/home/.azure/cliextensions/azure-cli-iot-ext/azext_iot/_factory.py", line 28, in iot_hub_service_factory
  from azure.mgmt.iothub.iot_hub_client import IotHubClient
ModuleNotFoundError: No module named 'azure.mgmt.iothub.iot_hub_client'

To fix this issue:

  1. Run the following command:

    az --version
    
  2. Ensure that you have the following extensions installed. As of the publication of this article, the extensions and their versions are:

    Extension               Version
    azure-cli               2.5.1*
    command-modules-nspkg   2.0.3
    core                    2.5.1*
    nspkg                   3.0.4
    telemetry               1.0.4
    storage-preview         0.2.10
    azure-cli-iot-ext       0.8.9
    eventgrid               0.4.9
    azure-iot               0.9.2
  3. If you have an installed extension whose version is earlier than the one listed here, update the extension by using the following command:

    az extension update --name <Extension name>
    

    For example, you might run az extension update --name azure-iot.

Sample app issues

As part of our release, we've provided some .NET sample code to help get our developer community bootstrapped. This section presents some errors that you might encounter when you run the sample code, along with solutions for debugging them.

Issue: Program.cs fails with the following error on the direct method invocation:

Unhandled exception. Microsoft.Azure.Devices.Common.Exceptions.UnauthorizedException: {"Message":"{\"errorCode\":401002,\"trackingId\":\"b1da85801b2e4faf951a2291a2c467c3-G:32-TimeStamp:04/06/2020 17:15:11\",\"message\":\"Unauthorized\",\"timestampUtc\":\"2020-04-06T17:15:11.6990676Z\"}","ExceptionMessage":""}
    
        at Microsoft.Azure.Devices.HttpClientHelper.ExecuteAsync(HttpClient httpClient, HttpMethod httpMethod, Uri requestUri, Func`3 modifyRequestMessageAsync, Func`2 isMappedToException, Func`3 processResponseMessageAsync, IDictionary`2 errorMappingOverrides, CancellationToken cancellationToken)
    
        at Microsoft.Azure.Devices.HttpClientHelper.ExecuteAsync(HttpMethod httpMethod, Uri requestUri, Func`3 modifyRequestMessageAsync, Func`3 processResponseMessageAsync, IDictionary`2 errorMappingOverrides, CancellationToken cancellationToken)
        
        at Microsoft.Azure.Devices.HttpClientHelper.PostAsync[T,T2](Uri requestUri, T entity, TimeSpan operationTimeout, IDictionary`2 errorMappingOverrides, IDictionary`2 customHeaders, CancellationToken cancellationToken)…
  1. Ensure that you have Azure IoT Tools installed in your Visual Studio Code environment, and that you've set up the connection to your IoT hub. To do so, press Ctrl+Shift+P, and then choose Select IoT Hub method.

  2. Check to see whether you can invoke a direct method on the IoT Edge module via Visual Studio Code. For example, call GraphTopologyList with the following payload: { "@apiVersion": "1.0" }. You should receive the following response:

    {
      "status": 200,
      "payload": {
        "values": [
          {…
    …}
          ]
        }
    }
    

    Screenshot of the response in Visual Studio Code.

  3. If the preceding solution fails, try the following:

    a. Go to the command prompt on your IoT Edge device, and run the following command:

    sudo systemctl restart iotedge
    

    This command restarts the IoT Edge runtime (the iotedge service) and all the modules. Wait a few minutes, and then, before you try the direct method again, confirm that the modules are running by using the following command:

    sudo iotedge list
    

    b. If the preceding approach also fails, try rebooting your virtual machine or computer.

    c. If all approaches fail, run the following command to obtain a zipped file with all relevant logs, and attach it to a support ticket.

    sudo iotedge support-bundle --since 2h
    
  4. If you get an error response 400 code, ensure that your method invocation payload is well formed, as per the Direct methods guide.

  5. If you get a status 200 code, your hub is functioning well and your module deployment is correct and responsive.

  6. Check to see whether the app configuration is accurate. Your app configuration consists of the following fields in the appsettings.json file. Double-check that deviceId and moduleId are accurate. An easy way to check is to go to the Azure IoT Hub extension section in Visual Studio Code. The values in the appsettings.json file and in the IoT Hub section should match.

    {
        "IoThubConnectionString" : "<your IoT hub connection string>",
        "deviceId" : "<your device ID>",
        "moduleId" : "<your module ID>"
    }
    
  7. In the appsettings.json file, ensure that you've provided the IoT Hub connection string and not the IoT Hub device connection string, because the connection string formats are different.
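The last two checks can be automated. The following is a minimal sketch in Python. It assumes the usual semicolon-separated key=value layout of IoT Hub connection strings, where a hub-level string carries a SharedAccessKeyName field and a device-level string carries a DeviceId field. The function names are hypothetical, and the sketch is not part of the sample app.

```python
REQUIRED_KEYS = ("IoThubConnectionString", "deviceId", "moduleId")


def parse_connection_string(value):
    """Split 'Key=Value;Key=Value' into a dict; values may contain '='."""
    parts = {}
    for segment in value.strip().split(";"):
        if segment:
            key, _, val = segment.partition("=")
            parts[key.strip()] = val
    return parts


def connection_string_kind(value):
    """Classify a connection string as 'hub', 'device', or 'invalid'."""
    parts = parse_connection_string(value)
    if "HostName" not in parts or "SharedAccessKey" not in parts:
        return "invalid"
    if "DeviceId" in parts:
        return "device"
    if "SharedAccessKeyName" in parts:
        return "hub"
    return "invalid"


def check_appsettings(settings):
    """Return a list of problems found in a parsed appsettings.json dict."""
    problems = ["missing or empty field: " + key
                for key in REQUIRED_KEYS if not settings.get(key)]
    connection = settings.get("IoThubConnectionString") or ""
    kind = connection_string_kind(connection)
    if kind == "device":
        problems.append("IoThubConnectionString is a device connection string")
    elif kind == "invalid" and connection:
        problems.append("IoThubConnectionString is not a recognizable connection string")
    return problems
```

An empty list from check_appsettings means that the fields are present and the connection string looks like a hub-level one. It doesn't verify that the device ID and module ID match what IoT Hub reports.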

Live Video Analytics working with external modules

Live Video Analytics via the HTTP extension processor can extend the media graph to send and receive data from other IoT Edge modules over HTTP by using REST. As a specific example, the media graph can send video frames as images to an external inference module such as Yolo v3 and receive JSON-based analytics results. In such a topology, the destination for the events is usually the IoT hub. If you don't see the inference events on the hub, check the following:

  • Check to see whether the hub that the media graph is publishing to and the hub that you're examining are the same. As you create multiple deployments, you might end up with multiple hubs and mistakenly check the wrong hub for events.

  • In Visual Studio Code, check to see whether the external module is deployed and running. In the example image here, rtspsim and cv are IoT Edge modules running external to the lvaEdge module.

    Screenshot that displays the running status of modules in Azure IoT Hub.

  • Check to see whether you're sending events to the correct URL endpoint. The external AI container exposes a URL and a port through which it receives and returns the data from POST requests. This URL is specified as an endpoint: url property for the HTTP extension processor. As seen in the topology URL, the endpoint is set to the inferencing URL parameter. Ensure that the default value for the parameter or the passed-in value is accurate. You can test whether it's working by using Client URL (cURL).

    As an example, here is a Yolo v3 container that's running on a local machine with an IP address of 172.17.0.3. Use docker inspect to find the IP address.

    curl -X POST http://172.17.0.3/score -H "Content-Type: image/jpeg" --data-binary @<fullpath to jpg>
    

    Result returned:

    {"inferences": [{"type": "entity", "entity": {"tag": {"value": "car", "confidence": 0.8668569922447205}, "box": {"l": 0.3853073438008626, "t": 0.6063712999658677, "w": 0.04174524943033854, "h": 0.02989496027381675}}}]}
    
  • If you're running one or more instances of a graph that uses the HTTP extension processor, you should have a frame rate filter before each HTTP extension processor to manage the frames per second (fps) rate of the video feed.

    In certain situations, where the CPU or memory of the edge machine is highly utilized, you can lose some inference events. To address this issue, set a low value for the maximumFps property on the frame rate filter. You can set it to 0.5 ("maximumFps": 0.5) on each instance of the graph and then rerun the instance to check for inference events on the hub.

    Alternatively, you can obtain a more powerful edge machine with higher CPU and memory.
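When you test the inference endpoint directly, it can help to reduce the JSON response to the detected tags and their confidence values. The following is a minimal sketch in Python. It assumes that the response has the same shape as the Yolo v3 result shown earlier in this section. The function name summarize_inferences and the min_confidence parameter are hypothetical.

```python
import json


def summarize_inferences(response_text, min_confidence=0.0):
    """Return (tag, confidence) pairs for the entity inferences in a response."""
    result = json.loads(response_text)
    summary = []
    for item in result.get("inferences", []):
        # Skip anything that is not an entity detection.
        if item.get("type") != "entity":
            continue
        tag = item["entity"]["tag"]
        if tag["confidence"] >= min_confidence:
            summary.append((tag["value"], round(tag["confidence"], 3)))
    return summary
```

An empty list means that the container responded but detected nothing above the threshold. That differs from the endpoint being unreachable, which makes the cURL call itself fail.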

Multiple direct methods in parallel – timeout failure

Live Video Analytics on IoT Edge provides a direct method-based programming model that allows you to set up multiple topologies and multiple graph instances. As part of the topology and graph setup, you invoke multiple direct method calls on the IoT Edge module. If you invoke these method calls in parallel, especially the ones that start and stop the graphs, you might experience a timeout failure such as the following:

Assembly Initialization method Microsoft.Media.LiveVideoAnalytics.Test.Feature.Edge.AssemblyInitializer.InitializeAssemblyAsync threw exception. Microsoft.Azure.Devices.Common.Exceptions.IotHubException: Microsoft.Azure.Devices.Common.Exceptions.IotHubException:
{"Message":"{\"errorCode\":504101,\"trackingId\":\"55b1d7845498428593c2738d94442607-G:32-TimeStamp:05/15/2020 20:43:10-G:10-TimeStamp:05/15/2020 20:43:10\",\"message\":\"Timed out waiting for the response from device.\",\"info\":{},\"timestampUtc\":\"2020-05-15T20:43:10.3899553Z\"}","ExceptionMessage":""}. Aborting test execution.

We recommend that you don't call direct methods in parallel. Call them sequentially; that is, make each direct method call only after the previous one has finished.
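The sequential pattern can be sketched as follows in Python. The invoke argument is a hypothetical callable that you supply, for example a thin wrapper around whichever IoT Hub service SDK you use. It is assumed to block until the module responds and to return a (status, payload) pair. Stopping at the first non-200 status is a design choice of this sketch, not a requirement of the module.

```python
def run_sequentially(invoke, calls):
    """Invoke direct methods one at a time, in order.

    invoke: callable(method_name, payload) -> (status, response_payload);
            it must return only after the call has completed.
    calls:  iterable of (method_name, payload) pairs.
    """
    results = []
    for method_name, payload in calls:
        # The next call starts only after this one has returned.
        status, response = invoke(method_name, payload)
        results.append((method_name, status, response))
        if status != 200:
            # Later calls usually depend on earlier ones, so stop here.
            break
    return results
```

For example, passing [("GraphTopologyList", {"@apiVersion": "1.0"})] as calls issues a single call and returns its status and payload. Longer setup sequences run strictly one after another.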

Next steps

Tutorial: Event-based video recording to cloud and playback from cloud