What is the Text Analytics API?

The Text Analytics API is a cloud-based service that provides advanced natural language processing over raw text. It includes four main functions: sentiment analysis, key phrase extraction, language detection, and entity recognition.

The API is part of Azure Cognitive Services, a collection of machine learning and AI algorithms in the cloud for your development projects.

Text analysis can mean different things, but in Cognitive Services, the Text Analytics API provides the four types of analysis described below.

Sentiment Analysis

Use sentiment analysis to find out what customers think of your brand or topic by analyzing raw text for clues about positive or negative sentiment. The API returns a sentiment score between 0 and 1 for each document, where 1 is the most positive.
The analysis models are pretrained using an extensive body of text and natural language technologies from Microsoft. For selected languages, the API can analyze and score any raw text that you provide, returning results directly to the calling application. You can use the REST API or the .NET SDK.
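
As a rough illustration of how a caller might consume these scores, the sketch below maps each score to a coarse label. The response shape (a `documents` list whose items carry `id` and `score`) and the 0.4 and 0.6 thresholds are assumptions made for this example, not values defined in this article.

```python
import json

# Hypothetical response body. The field names "documents", "id", and "score"
# are assumed for illustration and may differ from the real service output.
SAMPLE_RESPONSE = """
{"documents": [
    {"id": "1", "score": 0.92},
    {"id": "2", "score": 0.08},
    {"id": "3", "score": 0.5}
]}
"""


def label_sentiment(score, low=0.4, high=0.6):
    """Map a 0-to-1 sentiment score to a coarse label.

    The thresholds are arbitrary illustrative choices, not API values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= high:
        return "positive"
    if score <= low:
        return "negative"
    return "neutral"


def labels_by_id(response_text):
    """Return {document id: label} for every scored document."""
    body = json.loads(response_text)
    return {doc["id"]: label_sentiment(doc["score"]) for doc in body["documents"]}


if __name__ == "__main__":
    print(labels_by_id(SAMPLE_RESPONSE))
```

Keeping the thresholds as parameters makes it easy to tune them against your own labeled data instead of relying on these placeholder values.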

Key Phrase Extraction

Automatically extract key phrases to quickly identify the main points. For example, for the input text "The food was delicious and there were wonderful staff", the API returns the main talking points: "food" and "wonderful staff". You can use the REST API or the .NET SDK.

Language Detection

You can detect which language the input text is written in, for up to 120 languages, and get a single language code for every document submitted in the request. The language code is paired with a score indicating the strength of the detection. You can use the REST API or the .NET SDK.
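
The sketch below shows one way a caller might use the language code together with its score: it keeps a code only when the score clears a caller-chosen cutoff. The response field names (`detectedLanguages`, `iso6391Name`, `score`) and the 0.5 cutoff are assumptions for illustration.

```python
# Hypothetical parsed response. The field names are assumed for illustration
# and may not match the actual service output.
SAMPLE_RESPONSE = {
    "documents": [
        {"id": "1", "detectedLanguages": [
            {"name": "English", "iso6391Name": "en", "score": 1.0}]},
        {"id": "2", "detectedLanguages": [
            {"name": "French", "iso6391Name": "fr", "score": 0.3}]},
    ]
}


def confident_languages(response, min_score=0.5):
    """Return {document id: language code, or None if the score is too low}."""
    result = {}
    for doc in response["documents"]:
        candidates = doc.get("detectedLanguages", [])
        if not candidates:
            result[doc["id"]] = None
            continue
        # Pick the highest-scoring candidate for this document.
        best = max(candidates, key=lambda c: c["score"])
        result[doc["id"]] = best["iso6391Name"] if best["score"] >= min_score else None
    return result


if __name__ == "__main__":
    print(confident_languages(SAMPLE_RESPONSE))
```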

Named Entity Recognition

Identify and categorize entities in your text as people, places, organizations, date/time, quantities, percentages, currencies, and more. Well-known entities are also recognized and linked to more information on the web. You can use the REST API.

Use containers

Use the Text Analytics containers to extract key phrases, detect language, and analyze sentiment locally by installing standardized Docker containers closer to your data.

Typical workflow

The workflow is simple: you submit data for analysis and handle the outputs in your code. Analyzers are consumed as-is, with no additional configuration or customization.

  1. Sign up for an access key. The key must be passed on each request.

  2. Formulate a request containing your data as raw unstructured text, in JSON.

  3. Post the request to the endpoint established during sign-up, appending the desired resource: sentiment analysis, key phrase extraction, language detection, or entity recognition.

  4. Stream or store the response locally. Depending on the request, results are a sentiment score, a collection of extracted key phrases, a language code, or a set of recognized entities.

Output is returned as a single JSON document, with results for each text document you posted, matched by ID. You can subsequently analyze, visualize, or categorize the results into actionable insights.
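
The four steps above can be sketched roughly as follows, using only the Python standard library. The header name `Ocp-Apim-Subscription-Key`, the request fields (`documents`, `id`, `language`, `text`), and the way the resource name is appended to the endpoint are assumptions for this sketch; check the REST API reference for the exact contract. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request_body(texts, language="en"):
    """Step 2: wrap raw texts in a JSON-serializable body, one ID per document."""
    return {
        "documents": [
            {"id": str(i), "language": language, "text": text}
            for i, text in enumerate(texts, start=1)
        ]
    }


def post_for_analysis(endpoint, resource, access_key, body):
    """Steps 1 and 3: pass the key on the request and POST to endpoint + resource.

    The header name is an assumption; it is not stated in this article.
    """
    url = endpoint.rstrip("/") + "/" + resource
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": access_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def join_results_by_id(body, response):
    """Step 4: pair each submitted text with its result using the shared ID."""
    results = {doc["id"]: doc for doc in response.get("documents", [])}
    return [(doc["text"], results.get(doc["id"])) for doc in body["documents"]]
```

A caller would build the body once, pass it to `post_for_analysis` with its own endpoint, resource name, and key, and then hand both the body and the parsed response to `join_results_by_id`. Documents with no matching result are paired with `None`, which makes per-document failures easy to spot.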

Data is not stored in your account. Operations performed by the Text Analytics API are stateless, which means the text you provide is processed and results are returned immediately.

Text Analytics for multiple programming experience levels

You can start using the Text Analytics API in your processes even if you don't have much programming experience. Use these tutorials to learn how to use the API to analyze text in different ways that fit your experience level.

Supported languages

This section has been moved to a separate article for better discoverability. Refer to Supported languages in Text Analytics API for this content.

Data limits

All of the Text Analytics API endpoints accept raw text data. The current limit is 5,120 characters per document; if you need to analyze larger documents, you can break them up into smaller chunks.

Limit                                       Value
Maximum size of a single document           5,120 characters as measured by StringInfo.LengthInTextElements
Maximum size of entire request              1 MB
Maximum number of documents in a request    1,000 documents
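
One way to break a larger document into chunks that respect the per-document limit is sketched below. It counts characters with Python's `len`, which counts Unicode code points; because a text element is made of one or more code points, this count is never smaller than the text-element count, so the chunks stay within the limit. The function name and the choice to cut at spaces are illustrative assumptions.

```python
MAX_CHARS = 5120  # per-document limit from the table above


def split_document(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` code points.

    Cuts at the last space inside each window when one exists; otherwise
    falls back to a hard cut, which may separate a combining sequence.
    """
    chunks = []
    remaining = text
    while len(remaining) > limit:
        # Look for a space at index <= limit so the chunk itself fits.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space found: hard cut
        chunks.append(remaining[:cut])
        remaining = remaining[cut:].lstrip(" ")
    if remaining:
        chunks.append(remaining)
    return chunks
```

Cutting at spaces keeps words intact, but sentence boundaries would usually give the analyzers better context; a sentence-aware splitter could be swapped in without changing the length check.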

The rate limit is 100 requests per second and 1,000 requests per minute. You can submit a large number of documents in a single call (up to 1,000 documents).
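
Because each call can carry many documents, grouping documents into as few requests as possible helps stay under the rate limit. The sketch below groups documents so that each batch respects both the document-count and request-size limits. It assumes that 1 MB means 1,048,576 bytes and that size is measured on the UTF-8 encoded JSON body; neither detail is stated in this article.

```python
import json

MAX_DOCS = 1000
MAX_BYTES = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes of encoded JSON


def batch_documents(documents, max_docs=MAX_DOCS, max_bytes=MAX_BYTES):
    """Group document dicts into batches that fit both request limits.

    Sizes are estimated from json.dumps with its default settings, so the
    request must be serialized the same way for the estimate to hold.
    """
    overhead = len(json.dumps({"documents": []}).encode("utf-8"))
    batches, current, size = [], [], overhead
    for doc in documents:
        # +2 covers the ", " separator json.dumps puts between list items.
        doc_size = len(json.dumps(doc).encode("utf-8")) + 2
        if current and (len(current) >= max_docs or size + doc_size > max_bytes):
            batches.append(current)
            current, size = [], overhead
        current.append(doc)
        size += doc_size
    if current:
        batches.append(current)
    return batches
```

A single document that alone exceeds `max_bytes` still ends up in its own batch here; a production version would likely reject or split it first.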

Unicode encoding

The Text Analytics API uses Unicode encoding for text representation and character count calculations. Requests can be submitted in either UTF-8 or UTF-16 with no measurable difference in the character count. Unicode codepoints are used as the heuristic for character length and are considered equivalent for the purposes of text analytics data limits. If you use StringInfo.LengthInTextElements to get the character count, you are using the same method we use to measure data size.
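
The short sketch below illustrates why the encoding does not change the character count: the same string occupies a different number of bytes in UTF-8 and UTF-16, but decodes to the same number of code points. Python's `len` counts code points, which is used here as an approximation; it may differ from StringInfo.LengthInTextElements for text that contains combining sequences.

```python
def code_point_count(text):
    """Count Unicode code points, independent of how the text was encoded."""
    return len(text)


# "naive cafe" with accented letters plus an emoji outside the BMP,
# written with escapes so the exact code points are unambiguous.
SAMPLE = "na\u00efve caf\u00e9 \U0001F600"

UTF8_BYTES = SAMPLE.encode("utf-8")
UTF16_BYTES = SAMPLE.encode("utf-16-le")

if __name__ == "__main__":
    print("UTF-8 bytes: ", len(UTF8_BYTES))
    print("UTF-16 bytes:", len(UTF16_BYTES))
    print("code points: ", code_point_count(SAMPLE))
```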

Next steps

  • Sign up for an access key and review the steps for calling the API.

  • The quickstart is a walkthrough of the REST API calls written in C#. Learn how to submit text, choose an analysis, and view results with minimal code. If you prefer, you can start with the Python quickstart instead.