What is the Text Analytics API?

The Text Analytics API is a cloud-based service that provides advanced natural language processing over raw text. It includes four main functions: sentiment analysis, key phrase extraction, language detection, and named entity recognition.

The API is part of Azure Cognitive Services, a collection of machine learning and AI algorithms in the cloud for your development projects.

Text analysis can mean different things, but in Cognitive Services, the Text Analytics API provides the four types of analysis described below. You can use these features through the REST API or the client library.

Sentiment Analysis

Use sentiment analysis to find out what customers think of your brand or topic by analyzing raw text for clues about positive or negative sentiment. The API returns a sentiment score between 0 and 1 for each document, where 1 is the most positive.
The analysis models are pretrained using an extensive body of text and natural language technologies from Microsoft. For selected languages, the API can analyze and score any raw text that you provide, returning results directly to the calling application.
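
As a rough illustration of working with these scores, the sketch below turns each score into a coarse label. The response shape (a `documents` list whose items carry `id` and `score`) and the 0.4 and 0.6 cut-offs are assumptions made for this example; the service defines only the 0-to-1 scale, not where "negative" ends or "positive" begins.

```python
from typing import Dict

# Hypothetical thresholds chosen for illustration only.
NEGATIVE_BELOW = 0.4
POSITIVE_ABOVE = 0.6


def label_sentiment(score: float) -> str:
    """Map a 0-to-1 sentiment score to a coarse label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < NEGATIVE_BELOW:
        return "negative"
    if score > POSITIVE_ABOVE:
        return "positive"
    return "neutral"


def label_documents(response: Dict) -> Dict[str, str]:
    """Return {document id: label} for the assumed response shape."""
    return {doc["id"]: label_sentiment(doc["score"]) for doc in response["documents"]}


# Made-up scores, not real service output.
sample_response = {
    "documents": [
        {"id": "1", "score": 0.92},
        {"id": "2", "score": 0.15},
        {"id": "3", "score": 0.50},
    ]
}
labels = label_documents(sample_response)
```

Where to draw the lines between labels depends on your data, so treat the thresholds as tuning parameters.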

Key Phrase Extraction

Automatically extract key phrases to quickly identify the main points. For example, for the input text "The food was delicious and there were wonderful staff", the API returns the main talking points: "food" and "wonderful staff".

Language Detection

Detect which language the input text is written in. The API reports a single language code for every document submitted in the request, covering a wide range of languages, variants, dialects, and some regional/cultural languages. Each language code is paired with a score indicating how confident the detection is.

Named Entity Recognition

Identify and categorize entities in your text as people, places, organizations, date/time, quantities, percentages, currencies, and more. Well-known entities are also recognized and linked to more information on the web.

Use containers

Use the Text Analytics containers to extract key phrases, detect language, and analyze sentiment locally by installing standardized Docker containers closer to your data.

Typical workflow

The workflow is simple: you submit data for analysis and handle the output in your code. Analyzers are consumed as-is, with no additional configuration or customization.

  1. Create an Azure resource for Text Analytics. Afterwards, get the key generated for you to authenticate your requests.

  2. Formulate a request containing your data as raw unstructured text, in JSON.

  3. Post the request to the endpoint established during sign-up, appending the desired resource: sentiment analysis, key phrase extraction, language detection, or named entity recognition.

  4. Stream or store the response locally. Depending on the request, the results are a sentiment score, a collection of extracted key phrases, a language code, or a set of recognized entities.

Output is returned as a single JSON document containing a result for each text document you posted, matched by ID. You can subsequently analyze, visualize, or categorize the results into actionable insights.
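
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive client: the endpoint host, the key, the `v2.1` path, the resource names passed as `feature` (for example `keyPhrases` or `languages`), the `Ocp-Apim-Subscription-Key` header, and the `documents` request shape are all assumptions here and should be checked against the REST reference for your resource. The request is built but not sent, so the block runs without network access.

```python
import json
import urllib.request
from typing import Dict, List, Optional

# Placeholder values: substitute the endpoint and key from your own resource.
ENDPOINT = "https://example-region.api.cognitive.microsoft.com/text/analytics/v2.1"
API_KEY = "your-key-here"


def build_request(feature: str, texts: List[str],
                  language: Optional[str] = "en") -> urllib.request.Request:
    """Step 2 and 3: wrap raw texts in JSON and target one feature's resource."""
    documents = []
    for i, text in enumerate(texts, start=1):
        doc = {"id": str(i), "text": text}
        if language is not None:  # omit the hint, e.g. for language detection
            doc["language"] = language
        documents.append(doc)
    body = json.dumps({"documents": documents}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ENDPOINT}/{feature}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": API_KEY,
        },
        method="POST",
    )


def results_by_id(response_body: str) -> Dict[str, Dict]:
    """Step 4: index the single JSON response by document ID."""
    payload = json.loads(response_body)
    return {doc["id"]: doc for doc in payload.get("documents", [])}


request = build_request(
    "keyPhrases", ["The food was delicious and there were wonderful staff"]
)
# To actually send it (requires a real endpoint and key):
# with urllib.request.urlopen(request) as resp:
#     results = results_by_id(resp.read().decode("utf-8"))
```

Indexing the response by ID makes it straightforward to join each result back to the text that produced it.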

Data is not stored in your account. Operations performed by the Text Analytics API are stateless, which means the text you provide is processed and results are returned immediately.

Text Analytics for multiple programming experience levels

You can start using the Text Analytics API in your processes even if you don't have much programming experience. Use these tutorials to learn how you can use the API to analyze text in different ways to fit your experience level.

Supported languages

This section has been moved to a separate article for better discoverability. Refer to Supported languages in the Text Analytics API for this content.

Data limits

All of the Text Analytics API endpoints accept raw text data. The current limit is 5,120 characters for each document; if you need to analyze larger documents, you can break them up into smaller chunks.

Limit                                      Value
Maximum size of a single document          5,120 characters as measured by StringInfo.LengthInTextElements
Maximum size of entire request             1 MB
Maximum number of documents in a request   Up to 1,000 documents (varies for each feature)
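
One way to stay within these limits is sketched below: split long text into chunks and group documents into batches. The function names are made up for this example. Python's `len` counts Unicode code points rather than .NET text elements, so it is used here as a conservative approximation (a text element never contains fewer than one code point). A hard cut can still land inside a combining sequence, and the 1 MB request cap is not checked.

```python
from typing import Iterator, List

MAX_CHARS_PER_DOCUMENT = 5120     # per-document limit from the table above
MAX_DOCUMENTS_PER_REQUEST = 1000  # upper bound; the table notes it varies by feature


def split_text(text: str, limit: int = MAX_CHARS_PER_DOCUMENT) -> List[str]:
    """Split text into chunks of at most `limit` code points.

    Prefers to break just after a space; falls back to a hard cut
    when the window contains no usable space.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + limit, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks


def batches(documents: List[str],
            size: int = MAX_DOCUMENTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` documents."""
    for i in range(0, len(documents), size):
        yield documents[i:i + size]


long_text = "word " * 3000  # 15,000 characters
chunks = split_text(long_text)
```

Because each chunk keeps its trailing space, joining the chunks reproduces the original text exactly.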

Your rate limit varies with your pricing tier.

Tier                Requests per second   Requests per minute
S / Multi-service   1000                  1000
S0 / F0             100                   300
S1                  200                   300
S2                  300                   300
S3                  500                   500
S4                  1000                  1000

Requests are measured for each Text Analytics feature separately. For example, you can send the maximum number of requests for your pricing tier to each feature at the same time.
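
A client can avoid exceeding these limits by throttling itself. The sketch below is one possible approach, not something the service prescribes: a small sliding-window limiter that enforces a per-second and a per-minute cap together, with one limiter per feature because requests are measured separately. The class name, the feature keys, and the choice of the S0 / F0 row are assumptions for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# (max requests, window in seconds) taken from the S0 / F0 row above.
S0_LIMITS: List[Tuple[int, float]] = [(100, 1.0), (300, 60.0)]


class RateLimiter:
    """Allow a request only if every (max, window) limit still has room."""

    def __init__(self, limits: List[Tuple[int, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limits = limits
        self.clock = clock
        self._sent: Deque[float] = deque()  # timestamps of accepted requests

    def try_acquire(self) -> bool:
        now = self.clock()
        longest = max(window for _, window in self.limits)
        # Forget timestamps that have left even the longest window.
        while self._sent and now - self._sent[0] >= longest:
            self._sent.popleft()
        for max_requests, window in self.limits:
            recent = sum(1 for t in self._sent if now - t < window)
            if recent >= max_requests:
                return False
        self._sent.append(now)
        return True


# Hypothetical feature keys; one independent limiter per feature.
FEATURES = ["sentiment", "keyPhrases", "languages", "entities"]
limiters: Dict[str, RateLimiter] = {name: RateLimiter(S0_LIMITS) for name in FEATURES}
```

A caller would check `limiters[feature].try_acquire()` before each request and wait briefly when it returns `False`. The injectable `clock` exists so the limiter can be tested without sleeping.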

Unicode encoding

The Text Analytics API uses Unicode encoding for text representation and character count calculations. Requests can be submitted in either UTF-8 or UTF-16 with no measurable difference in the character count. Unicode code points are used as the heuristic for character length and are considered equivalent for the purposes of Text Analytics data limits. If you use StringInfo.LengthInTextElements to get the character count, you are using the same method we use to measure data size.
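
To see why the character count does not depend on the submitted encoding, the sketch below compares three measures of the same string. It is an illustration in Python rather than .NET: `len` on a Python string counts code points, which matches the heuristic described above but can differ from StringInfo.LengthInTextElements for combining sequences. The helper names are invented for this example.

```python
def code_point_count(text: str) -> int:
    """Number of Unicode code points, independent of any encoding."""
    return len(text)


def utf8_byte_count(text: str) -> int:
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_code_unit_count(text: str) -> int:
    """Number of 16-bit code units when encoded as UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le")) // 2


# "naïve café" followed by an emoji outside the Basic Multilingual Plane.
sample = "na\u00efve caf\u00e9 \U0001F600"

counts = {
    "code_points": code_point_count(sample),
    "utf8_bytes": utf8_byte_count(sample),
    "utf16_code_units": utf16_code_unit_count(sample),
}
# counts == {"code_points": 12, "utf8_bytes": 17, "utf16_code_units": 13}
```

The byte and code-unit totals differ between encodings, while the code point count stays the same, which is why a limit expressed in characters behaves identically for UTF-8 and UTF-16 requests.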

Next steps