Detect Face and Emotion with Azure Media Analytics

Overview

The Azure Media Face Detector media processor (MP) enables you to count faces, track their movements, and even gauge audience participation and reaction via facial expressions. This service contains two features:

  • Face detection


    Face detection finds and tracks human faces within a video. Multiple faces can be detected and subsequently tracked as they move around, with the time and location metadata returned in a JSON file. During tracking, the service attempts to give a consistent ID to the same face while the person moves around on screen, even if they are obstructed or briefly leave the frame.

    Note

    This service does not perform facial recognition. An individual who leaves the frame or becomes obstructed for too long will be given a new ID when they return.

  • Emotion detection


    Emotion detection is an optional component of the Face Detection media processor that returns analysis on multiple emotional attributes of the detected faces, including happiness, sadness, fear, anger, and more.

The Azure Media Face Detector MP is currently in preview.

This article gives details about Azure Media Face Detector and shows how to use it with the Media Services SDK for .NET.

Face Detector input files

Video files. Currently, the following formats are supported: MP4, MOV, and WMV.

Face Detector output files

The face detection and tracking API provides high-precision face location detection and tracking, and can detect up to 64 human faces in a video. Frontal faces give the best results, while side faces and small faces (less than or equal to 24x24 pixels) might not be as accurate.

Detected and tracked faces are returned with coordinates (left, top, width, and height) indicating the location of the face in the image in pixels, as well as a face ID number that tracks that individual. Face ID numbers are prone to reset when a frontal face is lost or overlapped in the frame, resulting in some individuals being assigned multiple IDs.

Elements of the output JSON file

The job produces a JSON output file that contains metadata about the detected and tracked faces. The metadata includes coordinates indicating the location of each face, as well as a face ID number that tracks that individual. Face ID numbers are prone to reset when a frontal face is lost or overlapped in the frame, resulting in some individuals being assigned multiple IDs.

The output JSON includes the following elements:

Root JSON elements

Element          Description
version          The version of the Video API.
timescale        "Ticks" per second of the video.
offset           The time offset for timestamps. In version 1.0 of the Video APIs, this is always 0. In future scenarios that we support, this value may change.
width, height    The width and height of the output video frame, in pixels.
framerate        Frames per second of the video.
fragments        The metadata is chunked up into different segments called fragments. Each fragment contains a start, duration, interval number, and event(s).
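
For reference, the following is a minimal C# sketch of reading these root elements from an output file. It assumes the JSON is parsed with Newtonsoft.Json (Json.NET); the class name and file path are illustrative only.

    using System;
    using System.IO;
    using Newtonsoft.Json.Linq;

    class RootElementsExample
    {
        static void Main()
        {
            // Hypothetical path to a Face Detector output file downloaded by the sample below.
            string outputPath = @"C:\supportFiles\FaceDetection\Output\BigBuckBunny_metadata.json";

            JObject root = JObject.Parse(File.ReadAllText(outputPath));

            int version = (int)root["version"];            // version of the Video API
            int timescale = (int)root["timescale"];        // "ticks" per second
            double framerate = (double)root["framerate"];  // frames per second
            int width = (int)root["width"];                // output frame width, in pixels
            int height = (int)root["height"];              // output frame height, in pixels

            Console.WriteLine($"v{version}: {width}x{height} @ {framerate} fps, timescale {timescale}");
        }
    }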

Fragments JSON elements

Element          Description
start            The start time of the first event, in "ticks."
duration         The length of the fragment, in "ticks."
index            (Applies to Azure Media Redactor only) Defines the frame index of the current event.
interval         The interval of each event entry within the fragment, in "ticks."
events           Each event contains the faces detected and tracked within that time duration. It is an array of events. The outer array represents one interval of time. The inner array consists of 0 or more events that happened at that point in time. An empty bracket [] means no faces were detected.
id               The ID of the face that is being tracked. This number may inadvertently change if a face becomes undetected. A given individual should have the same ID throughout the overall video, but this cannot be guaranteed due to limitations in the detection algorithm (occlusion, and so on).
x, y             The upper-left X and Y coordinates of the face bounding box, on a normalized scale of 0.0 to 1.0. X and Y coordinates are always relative to landscape orientation, so if the video is portrait (or upside-down, in the case of iOS), you have to transpose the coordinates accordingly.
width, height    The width and height of the face bounding box, on a normalized scale of 0.0 to 1.0.
facesDetected    Found at the end of the JSON results; this summarizes the number of faces that the algorithm detected during the video. Because the IDs can be reset inadvertently if a face becomes undetected (for example, the face goes off screen or looks away), this number may not always equal the true number of faces in the video.
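
Because x, y, width, and height are normalized, converting a face bounding box to pixel coordinates only requires multiplying by the frame width and height from the root elements. The following is a small sketch; the helper name and tuple return type are illustrative, not part of the SDK.

    // Converts a face event's normalized bounding box to pixel coordinates.
    static (int Left, int Top, int Width, int Height) ToPixels(
        double x, double y, double width, double height, int frameWidth, int frameHeight)
    {
        // x, y, width, and height come from a face event (0.0 to 1.0, relative to
        // landscape orientation); for portrait video, transpose the coordinates first.
        return ((int)(x * frameWidth), (int)(y * frameHeight),
                (int)(width * frameWidth), (int)(height * frameHeight));
    }

For example, with the 1280x720 frame in the sample output below, x = 0.519531 corresponds to a left edge of roughly 665 pixels.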

Face Detector uses techniques of fragmentation (where the metadata can be broken up into time-based chunks so you can download only what you need) and segmentation (where the events are broken up if they get too large). Some simple calculations can help you transform the data. For example, if an event started at 6300 (ticks), with a timescale of 2997 (ticks/sec) and a framerate of 29.97 (frames/sec), then (see the sketch after this list):

  • Start/Timescale = 2.1 seconds
  • Seconds x Framerate = 63 frames
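
The same arithmetic, expressed as a short C# sketch (the values are taken from the example above):

    // The same calculation in C#; the values come from the example above.
    long startTicks = 6300;      // event start, in ticks
    double timescale = 2997;     // ticks per second
    double framerate = 29.97;    // frames per second

    double startSeconds = startTicks / timescale;                  // ≈ 2.1 seconds
    int startFrame = (int)Math.Round(startSeconds * framerate);    // ≈ 63 frames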

Face detection input and output example

Input video

Input Video

Task configuration (preset)

When creating a task with Azure Media Face Detector, you must specify a configuration preset. The following configuration preset is just for face detection.

    {
      "version":"1.0",
      "options":{
          "TrackingMode": "Fast"
      }
    }
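
This preset can be stored in a configuration file and read at run time (as the .NET sample later in this article does), or passed inline as a string. The following is a sketch of the inline variant; it assumes the job and processor objects are set up as in that sample.

    // The preset can also be passed inline as a string instead of reading it from a file.
    string configuration = @"{ ""version"":""1.0"", ""options"": { ""TrackingMode"": ""Fast"" } }";

    ITask task = job.Tasks.AddNew("My Face Detection Task",
        processor,
        configuration,
        TaskOptions.None);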

Attribute descriptions

Attribute name   Description
Mode             Fast - fast processing speed, but less accurate (default).

JSON output

The following example of JSON output has been truncated.

    {
    "version": 1,
    "timescale": 30000,
    "offset": 0,
    "framerate": 29.97,
    "width": 1280,
    "height": 720,
    "fragments": [
        {
        "start": 0,
        "duration": 60060
        },
        {
        "start": 60060,
        "duration": 60060,
        "interval": 1001,
        "events": [
            [
            {
                "id": 0,
                "x": 0.519531,
                "y": 0.180556,
                "width": 0.0867188,
                "height": 0.154167
            }
            ],
            [
            {
                "id": 0,
                "x": 0.517969,
                "y": 0.181944,
                "width": 0.0867188,
                "height": 0.154167
            }
            ],
            [
            {
                "id": 0,
                "x": 0.517187,
                "y": 0.183333,
                "width": 0.0851562,
                "height": 0.151389
            }
            ],

Emotion detection input and output example

Input video

Input Video

Task configuration (preset)

When creating a task with Azure Media Face Detector, you must specify a configuration preset. The following configuration preset specifies creating JSON based on emotion detection.

    {
      "version": "1.0",
      "options": {
        "aggregateEmotionWindowMs": "987",
        "mode": "aggregateEmotion",
        "aggregateEmotionIntervalMs": "342"
      }
    }

Attribute descriptions

Attribute name               Description
Mode                         Faces: Only face detection.
                             PerFaceEmotion: Return emotion independently for each face detection.
                             AggregateEmotion: Return average emotion values for all faces in the frame.
AggregateEmotionWindowMs     Use if AggregateEmotion mode is selected. Specifies the length of video used to produce each aggregate result, in milliseconds.
AggregateEmotionIntervalMs   Use if AggregateEmotion mode is selected. Specifies how frequently to produce aggregate results.

Aggregate defaults

Below are the recommended values for the aggregate window and interval settings. AggregateEmotionWindowMs should be longer than AggregateEmotionIntervalMs.

                             Default (s)   Max (s)   Min (s)
AggregateEmotionWindowMs     0.5           2         0.25
AggregateEmotionIntervalMs   0.5           1         0.25

JSON output

JSON output for aggregate emotion (truncated):

    {
     "version": 1,
     "timescale": 30000,
     "offset": 0,
     "framerate": 29.97,
     "width": 1280,
     "height": 720,
     "fragments": [
       {
         "start": 0,
         "duration": 60060,
         "interval": 15015,
         "events": [
           [
             {
               "windowFaceDistribution": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               },
               "windowMeanScores": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               }
             }
           ],
           [
             {
               "windowFaceDistribution": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               },
               "windowMeanScores": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               }
             }
           ],
           [
             {
               "windowFaceDistribution": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               },
               "windowMeanScores": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               }
             }
           ],
           [
             {
               "windowFaceDistribution": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               },
               "windowMeanScores": {
                 "neutral": 0,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               }
             }
           ]
         ]
       },
       {
         "start": 60060,
         "duration": 60060,
         "interval": 15015,
         "events": [
           [
             {
               "windowFaceDistribution": {
                 "neutral": 1,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,
                 "contempt": 0
               },
               "windowMeanScores": {
                 "neutral": 0.688541,
                 "happiness": 0.0586323,
                 "surprise": 0.227184,
                 "sadness": 0.00945675,
                 "anger": 0.00592107,
                 "disgust": 0.00154993,
                 "fear": 0.00450447,
                 "contempt": 0.0042109
               }
             }
           ],
           [
             {
               "windowFaceDistribution": {
                 "neutral": 1,
                 "happiness": 0,
                 "surprise": 0,
                 "sadness": 0,
                 "anger": 0,
                 "disgust": 0,
                 "fear": 0,

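Given a complete aggregate-emotion output file, the windowMeanScores values can be used to find the dominant emotion in each window. The following is a minimal sketch; it assumes Newtonsoft.Json, and the output file path is hypothetical.

    using System;
    using System.IO;
    using System.Linq;
    using Newtonsoft.Json.Linq;

    class DominantEmotionExample
    {
        static void Main()
        {
            // Hypothetical path to an aggregate-emotion output file.
            JObject root = JObject.Parse(
                File.ReadAllText(@"C:\supportFiles\FaceDetection\Output\emotion_metadata.json"));

            foreach (JObject fragment in root["fragments"])
            {
                if (fragment["events"] == null) continue;   // fragments without events carry no results

                foreach (JArray window in fragment["events"])
                {
                    foreach (JObject entry in window)
                    {
                        // Pick the emotion with the highest mean score in this window.
                        var scores = (JObject)entry["windowMeanScores"];
                        var dominant = scores.Properties()
                                             .OrderByDescending(p => (double)p.Value)
                                             .First();
                        Console.WriteLine($"{dominant.Name}: {(double)dominant.Value:F3}");
                    }
                }
            }
        }
    }
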
Limitations

  • The supported input video formats are MP4, MOV, and WMV.
  • The detectable face size range is 24x24 to 2048x2048 pixels. Faces outside of this range will not be detected.
  • For each video, the maximum number of faces returned is 64.
  • Some faces may not be detected due to technical challenges, such as very large face angles (head pose) and large occlusion. Frontal and near-frontal faces give the best results.

.NET sample code

The following program shows how to:

  1. Create an asset and upload a media file into the asset.

  2. Create a job with a face detection task based on a configuration file that contains the following JSON preset:

            {
                "version": "1.0"
            }
    
  3. Download the output JSON files.

Create and configure a Visual Studio project

Set up your development environment and populate the app.config file with connection information, as described in Media Services development with .NET.

Example

using System;
using System.Configuration;
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.MediaServices.Client;
using System.Threading;
using System.Threading.Tasks;

namespace FaceDetection
{
    class Program
    {
        private static readonly string _AADTenantDomain =
            ConfigurationManager.AppSettings["AMSAADTenantDomain"];
        private static readonly string _RESTAPIEndpoint =
            ConfigurationManager.AppSettings["AMSRESTAPIEndpoint"];
        private static readonly string _AMSClientId =
            ConfigurationManager.AppSettings["AMSClientId"];
        private static readonly string _AMSClientSecret =
            ConfigurationManager.AppSettings["AMSClientSecret"];

        // Field for service context.
        private static CloudMediaContext _context = null;

        static void Main(string[] args)
        {
            AzureAdTokenCredentials tokenCredentials =
                new AzureAdTokenCredentials(_AADTenantDomain,
                    new AzureAdClientSymmetricKey(_AMSClientId, _AMSClientSecret),
                    AzureEnvironments.AzureChinaCloudEnvironment);

            var tokenProvider = new AzureAdTokenProvider(tokenCredentials);

            _context = new CloudMediaContext(new Uri(_RESTAPIEndpoint), tokenProvider);

            // Run the FaceDetection job.
            var asset = RunFaceDetectionJob(@"C:\supportFiles\FaceDetection\BigBuckBunny.mp4",
                                        @"C:\supportFiles\FaceDetection\config.json");

            // Download the job output asset.
            DownloadAsset(asset, @"C:\supportFiles\FaceDetection\Output");
        }

        static IAsset RunFaceDetectionJob(string inputMediaFilePath, string configurationFile)
        {
            // Create an asset and upload the input media file to storage.
            IAsset asset = CreateAssetAndUploadSingleFile(inputMediaFilePath,
                "My Face Detection Input Asset",
                AssetCreationOptions.None);

            // Declare a new job.
            IJob job = _context.Jobs.Create("My Face Detection Job");

            // Get a reference to Azure Media Face Detector.
            string MediaProcessorName = "Azure Media Face Detector";

            var processor = GetLatestMediaProcessorByName(MediaProcessorName);

            // Read configuration from the specified file.
            string configuration = File.ReadAllText(configurationFile);

            // Create a task with the encoding details, using a string preset.
            ITask task = job.Tasks.AddNew("My Face Detection Task",
                processor,
                configuration,
                TaskOptions.None);

            // Specify the input asset.
            task.InputAssets.Add(asset);

            // Add an output asset to contain the results of the job.
            task.OutputAssets.AddNew("My Face Detection Output Asset", AssetCreationOptions.None);

            // Use the following event handler to check job progress.  
            job.StateChanged += new EventHandler<JobStateChangedEventArgs>(StateChanged);

            // Launch the job.
            job.Submit();

            // Check job execution and wait for job to finish.
            Task progressJobTask = job.GetExecutionProgressTask(CancellationToken.None);

            progressJobTask.Wait();

            // If job state is Error, the event handling
            // method for job progress should log errors.  Here we check
            // for error state and exit if needed.
            if (job.State == JobState.Error)
            {
                ErrorDetail error = job.Tasks.First().ErrorDetails.First();
                Console.WriteLine(string.Format("Error: {0}. {1}",
                                                error.Code,
                                                error.Message));
                return null;
            }

            return job.OutputMediaAssets[0];
        }

        static IAsset CreateAssetAndUploadSingleFile(string filePath, string assetName, AssetCreationOptions options)
        {
            IAsset asset = _context.Assets.Create(assetName, options);

            var assetFile = asset.AssetFiles.Create(Path.GetFileName(filePath));
            assetFile.Upload(filePath);

            return asset;
        }

        static void DownloadAsset(IAsset asset, string outputDirectory)
        {
            foreach (IAssetFile file in asset.AssetFiles)
            {
                file.Download(Path.Combine(outputDirectory, file.Name));
            }
        }

        static IMediaProcessor GetLatestMediaProcessorByName(string mediaProcessorName)
        {
            var processor = _context.MediaProcessors
                .Where(p => p.Name == mediaProcessorName)
                .ToList()
                .OrderBy(p => new Version(p.Version))
                .LastOrDefault();

            if (processor == null)
                throw new ArgumentException(string.Format("Unknown media processor {0}",
                                                           mediaProcessorName));

            return processor;
        }

        static private void StateChanged(object sender, JobStateChangedEventArgs e)
        {
            Console.WriteLine("Job state changed event:");
            Console.WriteLine("  Previous state: " + e.PreviousState);
            Console.WriteLine("  Current state: " + e.CurrentState);

            switch (e.CurrentState)
            {
                case JobState.Finished:
                    Console.WriteLine();
                    Console.WriteLine("Job is finished.");
                    Console.WriteLine();
                    break;
                case JobState.Canceling:
                case JobState.Queued:
                case JobState.Scheduled:
                case JobState.Processing:
                    Console.WriteLine("Please wait...\n");
                    break;
                case JobState.Canceled:
                case JobState.Error:
                    // Cast sender as a job.
                    IJob job = (IJob)sender;
                    // Display or log error details as needed.
                    // LogJobStop(job.Id);
                    break;
                default:
                    break;
            }
        }
    }
}

Media Services learning paths

Media Services v3 (latest)

Check out the latest version of Azure Media Services!

Media Services v2 (legacy)

Azure Media Services Analytics Overview

Azure Media Analytics demos