Use Azure Media Analytics to convert text content in video files into digital text

Overview

If you need to extract text content from your video files and generate an editable, searchable digital text, you should use Azure Media Analytics OCR (optical character recognition). This Azure Media Processor detects text content in your video files and generates text files for your use. OCR enables you to automate the extraction of meaningful metadata from the video signal of your media.

Used in conjunction with a search engine, OCR lets you easily index your media by text and enhance the discoverability of your content. This is extremely useful in highly textual video, such as a video recording or screen capture of a slideshow presentation. The Azure OCR Media Processor is optimized for digital text.

The Azure Media OCR media processor is currently in Preview.

This article gives details about Azure Media OCR and shows how to use it with the Media Services SDK for .NET. For more information and examples, see this blog.

OCR input files

Video files. Currently, the following formats are supported: MP4, MOV, and WMV.

Task configuration

Task configuration (preset). When creating a task with Azure Media OCR, you must specify a configuration preset using JSON or XML.

Note

The OCR engine accepts an image region between 40 and 32,000 pixels in both height and width as valid input.
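
For illustration only, the following minimal sketch (a hypothetical helper, not part of the Media Services SDK) shows one way to check a candidate region size against these bounds before writing it into a preset:

    // Minimal sketch (hypothetical helper, not part of the Media Services SDK):
    // verify that a candidate region size falls within the 40–32,000 pixel range
    // the OCR engine accepts in both dimensions.
    static class OcrRegionValidator
    {
        private const int MinPixels = 40;
        private const int MaxPixels = 32000;

        public static bool IsValidRegionSize(int width, int height)
        {
            return width >= MinPixels && width <= MaxPixels
                && height >= MinPixels && height <= MaxPixels;
        }
    }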

Attribute descriptions

AdvancedOutput – If you set AdvancedOutput to true, the JSON output will contain positional data for every single word (in addition to phrases and regions). If you do not want to see these details, set the flag to false. The default value is false. For more information, see this blog.
Language – (optional) Describes the language of text for which to look. One of the following: AutoDetect (default), Arabic, ChineseSimplified, ChineseTraditional, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, SerbianCyrillic, SerbianLatin, Slovak, Spanish, Swedish, Turkish.
TextOrientation – (optional) Describes the orientation of text for which to look. "Left" means that the top of all letters is pointed towards the left. Default text (like that found in a book) is "Up" oriented. One of the following: AutoDetect (default), Up, Right, Down, Left.
TimeInterval – (optional) Describes the sampling rate. The default is every 1/2 second.
JSON format – HH:mm:ss.SSS (default 00:00:00.500)
XML format – W3C XSD duration primitive (default PT0.5S)
DetectRegions – (optional) An array of DetectRegion objects specifying regions within the video frame in which to detect text.
A DetectRegion object is made of the following four integer values:
Left – pixels from the left margin
Top – pixels from the top margin
Width – width of the region in pixels
Height – height of the region in pixels

JSON preset example

    {
        "Version":1.0, 
        "Options": 
        {
            "AdvancedOutput":"true",
            "Language":"English", 
            "TimeInterval":"00:00:01.5",
            "TextOrientation":"Up",
            "DetectRegions": [
                    {
                       "Left": 10,
                       "Top": 10,
                       "Width": 100,
                       "Height": 50
                    }
             ]
        }
    }

XML preset example

    <?xml version="1.0" encoding="utf-16"?>
    <VideoOcrPreset xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="https://www.w3.org/2001/XMLSchema" Version="1.0" xmlns="https://www.windowsazure.com/media/encoding/Preset/2014/03">
      <Options>
         <AdvancedOutput>true</AdvancedOutput>
         <Language>English</Language>
         <TimeInterval>PT1.5S</TimeInterval>
         <DetectRegions>
             <DetectRegion>
                   <Left>10</Left>
                   <Top>10</Top>
                   <Width>100</Width>
                   <Height>50</Height>
            </DetectRegion>
       </DetectRegions>
       <TextOrientation>Up</TextOrientation>
      </Options>
    </VideoOcrPreset>

OCR output files

The output of the OCR media processor is a JSON file.

Elements of the output JSON file

The Video OCR output provides time-segmented data on the characters found in your video. You can use attributes such as language or orientation to hone in on exactly the words that you are interested in analyzing.

The output contains the following attributes:

Timescale – "ticks" per second of the video
Offset – time offset for timestamps. In version 1.0 of Video APIs, this will always be 0.
Framerate – frames per second of the video
width – width of the video in pixels
height – height of the video in pixels
Fragments – array of time-based chunks of video into which the metadata is chunked
start – start time of a fragment in "ticks"
duration – length of a fragment in "ticks"
interval – interval of each event within the given fragment
events – array containing regions
region – object representing detected words or phrases
language – language of the text detected within a region
orientation – orientation of the text detected within a region
lines – array of lines of text detected within a region
text – the actual text

JSON output example

The following output example contains the general video information and several video fragments. Every video fragment contains each region detected by the OCR MP, together with its language and text orientation. Each region also contains every word line in that region, with the line's text, the line's position, and the information (content, position, and confidence) for every word in the line. Some explanatory comments are embedded inline in the example below.

    {
        "version": 1, 
        "timescale": 90000, 
        "offset": 0, 
        "framerate": 30, 
        "width": 640, 
        "height": 480,  // general video information
        "fragments": [
            {
                "start": 0, 
                "duration": 180000, 
                "interval": 90000,  // the time information about this fragment
                "events": [
                    [
                       { 
                            "region": { // the detected region array in this fragment 
                                "language": "English",  // region language
                                "orientation": "Up",  // text orientation
                                "lines": [  // line information array in this region, including the text and the position
                                    {
                                        "text": "One Two", 
                                        "left": 10, 
                                        "top": 10, 
                                        "right": 210, 
                                        "bottom": 110, 
                                        "word": [  // word information array in this line
                                            {
                                                "text": "One", 
                                                "left": 10, 
                                                "top": 10, 
                                                "right": 110, 
                                                "bottom": 110, 
                                                "confidence": 900
                                            }, 
                                            {
                                                "text": "Two", 
                                                "left": 110, 
                                                "top": 10, 
                                                "right": 210, 
                                                "bottom": 110, 
                                                "confidence": 910
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    ]
                ]
            }
        ]
    }
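
If you want to consume this output programmatically, the following minimal sketch walks the fragments and prints every detected line of text with its approximate start time. It assumes the Newtonsoft.Json (Json.NET) package; the output file path is a placeholder, and the actual file name of your job output may differ.

    // Minimal sketch: read an OCR result file shaped like the example above and
    // print every detected line of text. Uses Json.NET (Newtonsoft.Json); the file
    // path below is a placeholder and should point at your downloaded output file.
    using System;
    using Newtonsoft.Json.Linq;

    class OcrResultReader
    {
        static void Main()
        {
            string json = System.IO.File.ReadAllText(@"C:\supportFiles\OCR\Output\ocr_output.json");
            JObject result = JObject.Parse(json);

            // "timescale" converts the "ticks" used for timing into seconds.
            double timescale = (double)result["timescale"];

            foreach (JObject fragment in result["fragments"])
            {
                double startSeconds = (double)fragment["start"] / timescale;

                // "events" is an array of event arrays; each event holds a "region".
                foreach (JArray eventGroup in fragment["events"] ?? new JArray())
                {
                    foreach (JObject ev in eventGroup)
                    {
                        JObject region = (JObject)ev["region"];
                        foreach (JObject line in region["lines"])
                        {
                            Console.WriteLine("{0,7:F1}s [{1}] {2}",
                                startSeconds,
                                (string)region["language"],
                                (string)line["text"]);
                        }
                    }
                }
            }
        }
    }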

.NET sample code

The following program shows how to:

  1. Create an asset and upload a media file into the asset.
  2. Create a job with an OCR configuration/preset file.
  3. Download the output JSON files.

Create and configure a Visual Studio project

Set up your development environment and populate the app.config file with connection information, as described in Media Services development with .NET.
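
The sample below reads four values from app.config. Here is a minimal app.config sketch containing those settings; the values shown are placeholders and must be replaced with the values for your own Media Services account and Azure AD application.

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <appSettings>
        <!-- Placeholder values only; replace with your own account settings. -->
        <add key="AMSAADTenantDomain" value="your-tenant.onmicrosoft.com" />
        <add key="AMSRESTAPIEndpoint" value="https://your-account-rest-endpoint/api/" />
        <add key="AMSClientId" value="your-aad-application-client-id" />
        <add key="AMSClientSecret" value="your-aad-application-client-secret" />
      </appSettings>
    </configuration>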

Example

using System;
using System.Configuration;
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.MediaServices.Client;
using System.Threading;
using System.Threading.Tasks;

namespace OCR
{
    class Program
    {
        // Read values from the App.config file.
        private static readonly string _AADTenantDomain =
            ConfigurationManager.AppSettings["AMSAADTenantDomain"];
        private static readonly string _RESTAPIEndpoint =
            ConfigurationManager.AppSettings["AMSRESTAPIEndpoint"];
        private static readonly string _AMSClientId =
            ConfigurationManager.AppSettings["AMSClientId"];
        private static readonly string _AMSClientSecret =
            ConfigurationManager.AppSettings["AMSClientSecret"];

        // Field for service context.
        private static CloudMediaContext _context = null;

        static void Main(string[] args)
        {
            AzureAdTokenCredentials tokenCredentials =
                new AzureAdTokenCredentials(_AADTenantDomain,
                    new AzureAdClientSymmetricKey(_AMSClientId, _AMSClientSecret),
                    AzureEnvironments.AzureChinaCloudEnvironment);

            var tokenProvider = new AzureAdTokenProvider(tokenCredentials);

            _context = new CloudMediaContext(new Uri(_RESTAPIEndpoint), tokenProvider);

            // Run the OCR job.
            var asset = RunOCRJob(@"C:\supportFiles\OCR\presentation.mp4",
                                        @"C:\supportFiles\OCR\config.json");

            // Download the job output asset.
            DownloadAsset(asset, @"C:\supportFiles\OCR\Output");
        }

        static IAsset RunOCRJob(string inputMediaFilePath, string configurationFile)
        {
            // Create an asset and upload the input media file to storage.
            IAsset asset = CreateAssetAndUploadSingleFile(inputMediaFilePath,
                "My OCR Input Asset",
                AssetCreationOptions.None);

            // Declare a new job.
            IJob job = _context.Jobs.Create("My OCR Job");

            // Get a reference to Azure Media OCR.
            string MediaProcessorName = "Azure Media OCR";

            var processor = GetLatestMediaProcessorByName(MediaProcessorName);

            // Read configuration from the specified file.
            string configuration = File.ReadAllText(configurationFile);

            // Create a task with the encoding details, using a string preset.
            ITask task = job.Tasks.AddNew("My OCR Task",
                processor,
                configuration,
                TaskOptions.None);

            // Specify the input asset.
            task.InputAssets.Add(asset);

            // Add an output asset to contain the results of the job.
            task.OutputAssets.AddNew("My OCR Output Asset", AssetCreationOptions.None);

            // Use the following event handler to check job progress.  
            job.StateChanged += new EventHandler<JobStateChangedEventArgs>(StateChanged);

            // Launch the job.
            job.Submit();

            // Check job execution and wait for job to finish.
            Task progressJobTask = job.GetExecutionProgressTask(CancellationToken.None);

            progressJobTask.Wait();

            // If job state is Error, the event handling
            // method for job progress should log errors.  Here we check
            // for error state and exit if needed.
            if (job.State == JobState.Error)
            {
                ErrorDetail error = job.Tasks.First().ErrorDetails.First();
                Console.WriteLine(string.Format("Error: {0}. {1}",
                                                error.Code,
                                                error.Message));
                return null;
            }

            return job.OutputMediaAssets[0];
        }

        static IAsset CreateAssetAndUploadSingleFile(string filePath, string assetName, AssetCreationOptions options)
        {
            IAsset asset = _context.Assets.Create(assetName, options);

            var assetFile = asset.AssetFiles.Create(Path.GetFileName(filePath));
            assetFile.Upload(filePath);

            return asset;
        }

        static void DownloadAsset(IAsset asset, string outputDirectory)
        {
            foreach (IAssetFile file in asset.AssetFiles)
            {
                file.Download(Path.Combine(outputDirectory, file.Name));
            }
        }

        static IMediaProcessor GetLatestMediaProcessorByName(string mediaProcessorName)
        {
            var processor = _context.MediaProcessors
                .Where(p => p.Name == mediaProcessorName)
                .ToList()
                .OrderBy(p => new Version(p.Version))
                .LastOrDefault();

            if (processor == null)
                throw new ArgumentException(string.Format("Unknown media processor {0}",
                                                           mediaProcessorName));

            return processor;
        }

        static private void StateChanged(object sender, JobStateChangedEventArgs e)
        {
            Console.WriteLine("Job state changed event:");
            Console.WriteLine("  Previous state: " + e.PreviousState);
            Console.WriteLine("  Current state: " + e.CurrentState);

            switch (e.CurrentState)
            {
                case JobState.Finished:
                    Console.WriteLine();
                    Console.WriteLine("Job is finished.");
                    Console.WriteLine();
                    break;
                case JobState.Canceling:
                case JobState.Queued:
                case JobState.Scheduled:
                case JobState.Processing:
                    Console.WriteLine("Please wait...\n");
                    break;
                case JobState.Canceled:
                case JobState.Error:
                    // Cast sender as a job.
                    IJob job = (IJob)sender;
                    // Display or log error details as needed.
                    // LogJobStop(job.Id);
                    break;
                default:
                    break;
            }
        }

    }
}

Media Services learning paths

Media Services v3 (latest)

Check out the latest version of Azure Media Services!

Media Services v2 (legacy)

Azure Media Services Analytics Overview