Example: Detect language with Text Analytics

The Language Detection feature of the Azure Text Analytics REST API evaluates the text input for each document and returns language identifiers, each with a score that indicates the strength of the analysis.

This capability is useful for content stores that collect arbitrary text in an unknown language. You can parse the results of this analysis to determine which language is used in the input document. The response also returns a score between 0 and 1 that reflects the confidence of the model.

The Language Detection feature can detect a wide range of languages, variants, dialects, and some regional or cultural languages. The exact list of languages for this feature isn't published.

If you have content expressed in a less frequently used language, you can try the Language Detection feature to see if it returns a code. The response for languages that can't be detected is unknown.

Tip

Text Analytics also provides a Linux-based Docker container image for language detection, so you can install and run the Text Analytics container close to your data.

Preparation

You must have JSON documents in this format: ID and text.

Each document must be under 5,120 characters. You can have up to 1,000 items (IDs) per collection. The collection is submitted in the body of the request. The following sample shows content you might submit for language detection:

    {
        "documents": [
            {
                "id": "1",
                "text": "This document is in English."
            },
            {
                "id": "2",
                "text": "Este documento está en inglés."
            },
            {
                "id": "3",
                "text": "Ce document est en anglais."
            },
            {
                "id": "4",
                "text": "本文件为英文"
            },                
            {
                "id": "5",
                "text": "Этот документ на английском языке."
            }
        ]
    }
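
As a minimal sketch of the preparation step, the following Python helper wraps a list of strings in the documents shape shown above and checks the two limits mentioned in this section. The function name and constants are illustrative, not part of the service. The length check uses Python's `len`, which counts Unicode code points; how the service counts characters is an assumption here.

```python
MAX_CHARS_PER_DOCUMENT = 5120        # documents must be under this length
MAX_DOCUMENTS_PER_COLLECTION = 1000  # maximum items (IDs) per collection


def build_language_request(texts):
    """Wrap plain strings in the {"documents": [...]} request shape."""
    if len(texts) > MAX_DOCUMENTS_PER_COLLECTION:
        raise ValueError("a collection can hold at most 1,000 documents")
    documents = []
    for index, text in enumerate(texts, start=1):
        if len(text) >= MAX_CHARS_PER_DOCUMENT:
            raise ValueError(f"document {index} is not under 5,120 characters")
        # IDs are strings in the sample payload, so convert the counter.
        documents.append({"id": str(index), "text": text})
    return {"documents": documents}


body = build_language_request([
    "This document is in English.",
    "Ce document est en anglais.",
])
```

Sequential numeric IDs are used here only for convenience; any IDs that are unique within the collection follow the same pattern.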

Step 1: Structure the request

For more information on request definition, see Call the Text Analytics API. The following points are restated for convenience:

  • Create a POST request. To review the API documentation for this request, see the Language Detection API.

  • Set the HTTP endpoint for language detection. Use either a Text Analytics resource on Azure or an instantiated Text Analytics container. The endpoint must include the /languages resource: https://chinaeast2.api.cognitive.azure.cn/text/analytics/v2.1/languages.

  • Set a request header to include the access key for Text Analytics operations. For more information, see Find endpoints and access keys.

  • In the request body, provide the JSON documents collection you prepared for this analysis.

Tip

Use Postman or open the API testing console in the documentation to structure a request and POST it to the service.
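
If you prefer to structure the request in code, the points above can be sketched with Python's standard library. This is an illustration under stated assumptions, not an official client: the function names are hypothetical, and the `Ocp-Apim-Subscription-Key` header name is not given in this article, so confirm it against Find endpoints and access keys. The endpoint is the one shown above; substitute the endpoint for your own resource.

```python
import json
import urllib.request

# Endpoint from this article; replace it with your own resource's endpoint.
ENDPOINT = "https://chinaeast2.api.cognitive.azure.cn/text/analytics/v2.1/languages"


def build_request(documents, access_key, endpoint=ENDPOINT):
    """Build (but do not send) the POST request for language detection."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name for the access key; verify for your resource.
        "Ocp-Apim-Subscription-Key": access_key,
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def detect_languages(documents, access_key, endpoint=ENDPOINT):
    """Send the request and return the parsed JSON response."""
    request = build_request(documents, access_key, endpoint)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the URL, headers, and body before any call reaches the service.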

Step 2: POST the request

Analysis is performed upon receipt of the request. For information on the size and number of requests you can send per minute and per second, see the data limits section in the overview.

Recall that the service is stateless. No data is stored in your account. Results are returned immediately in the response.

Step 3: View the results

All POST requests return a JSON-formatted response with the IDs and detected properties.

Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data.

Results for the example request should look like the following JSON. Notice that it's one document with multiple items. Output is in English. Language identifiers include a friendly name and a language code in ISO 639-1 format.

A positive score of 1.0 expresses the highest possible confidence level of the analysis.

    {
        "documents": [
            {
                "id": "1",
                "detectedLanguages": [
                    {
                        "name": "English",
                        "iso6391Name": "en",
                        "score": 1
                    }
                ]
            },
            {
                "id": "2",
                "detectedLanguages": [
                    {
                        "name": "Spanish",
                        "iso6391Name": "es",
                        "score": 1
                    }
                ]
            },
            {
                "id": "3",
                "detectedLanguages": [
                    {
                        "name": "French",
                        "iso6391Name": "fr",
                        "score": 1
                    }
                ]
            },
            {
                "id": "4",
                "detectedLanguages": [
                    {
                        "name": "Chinese_Simplified",
                        "iso6391Name": "zh_chs",
                        "score": 1
                    }
                ]
            },
            {
                "id": "5",
                "detectedLanguages": [
                    {
                        "name": "Russian",
                        "iso6391Name": "ru",
                        "score": 1
                    }
                ]
            }
        ],
        "errors": []
    }
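
To work with a response like this one in code, a small helper can map each document ID to its detected language. The following Python sketch assumes the response has already been parsed into a dictionary and takes the first entry of `detectedLanguages`; the function name is illustrative.

```python
def summarize_detections(response):
    """Map each document ID to (name, ISO 639-1 code, score)."""
    summary = {}
    for document in response.get("documents", []):
        candidates = document.get("detectedLanguages", [])
        if candidates:
            # Use the first listed language for the document.
            top = candidates[0]
            summary[document["id"]] = (top["name"], top["iso6391Name"], top["score"])
    return summary


# A shortened copy of the sample response above.
sample_response = {
    "documents": [
        {"id": "1", "detectedLanguages": [
            {"name": "English", "iso6391Name": "en", "score": 1}]},
        {"id": "2", "detectedLanguages": [
            {"name": "Spanish", "iso6391Name": "es", "score": 1}]},
    ],
    "errors": [],
}

for doc_id, (name, code, score) in summarize_detections(sample_response).items():
    print(doc_id, name, code, score)
```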

Ambiguous content

In some cases, it may be hard to disambiguate languages based on the input. You can use the countryHint parameter to specify a 2-letter country code. By default, the API uses "US" as the countryHint. To remove this behavior, reset the parameter by setting it to an empty string: countryHint = "".

For example, "impossible" is common to both English and French, and if it's given with limited context, the response is based on the "US" country hint. If the text is known to come from France, you can supply that as a hint.

Input

    {
        "documents": [
            {
                "id": "1",
                "text": "impossible"
            },
            {
                "id": "2",
                "text": "impossible",
                "countryHint": "fr"
            }
        ]
    }
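
If country hints are known per document, they can be attached before the request is sent. The following Python sketch uses a hypothetical helper that copies each document and sets `countryHint` only where a hint is supplied; passing an empty string clears the default, as described above.

```python
def with_country_hints(documents, hints):
    """Return copies of the documents with countryHint set where known.

    hints maps a document ID to a 2-letter country code, or to ""
    to remove the default hint for that document.
    """
    result = []
    for document in documents:
        updated = dict(document)  # leave the caller's documents unchanged
        if document["id"] in hints:
            updated["countryHint"] = hints[document["id"]]
        result.append(updated)
    return result


documents = [
    {"id": "1", "text": "impossible"},
    {"id": "2", "text": "impossible"},
]
hinted = with_country_hints(documents, {"2": "fr"})
```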

The service now has additional context to make a better judgment:

Output

    {
        "documents": [
            {
                "id": "1",
                "detectedLanguages": [
                    {
                        "name": "English",
                        "iso6391Name": "en",
                        "score": 1
                    }
                ]
            },
            {
                "id": "2",
                "detectedLanguages": [
                    {
                        "name": "French",
                        "iso6391Name": "fr",
                        "score": 1
                    }
                ]
            }
        ],
        "errors": []
    }

If the analyzer can't parse the input, it returns (Unknown). For example, this happens if you submit a text block that consists solely of Arabic numerals.

    {
        "id": "5",
        "detectedLanguages": [
            {
                "name": "(Unknown)",
                "iso6391Name": "(Unknown)",
                "score": "NaN"
            }
        ]
    }
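
Because the score in this case is the string "NaN" rather than a number, code that reads results may want to check for it explicitly. The following Python sketch uses a hypothetical helper that treats a document as undetected when every entry has the (Unknown) code or a NaN score; it assumes the field names shown in the sample above.

```python
import math


def is_undetected(document):
    """Return True if no usable language was detected for the document."""
    for candidate in document.get("detectedLanguages", []):
        # float() accepts both numeric scores and the string "NaN".
        score = float(candidate["score"])
        if candidate["iso6391Name"] != "(Unknown)" and not math.isnan(score):
            return False
    return True


unknown_document = {
    "id": "5",
    "detectedLanguages": [
        {"name": "(Unknown)", "iso6391Name": "(Unknown)", "score": "NaN"}
    ],
}
known_document = {
    "id": "1",
    "detectedLanguages": [
        {"name": "English", "iso6391Name": "en", "score": 1}
    ],
}
```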

Mixed-language content

Mixed-language content within the same document returns the language with the largest representation in the content, but with a lower positive rating. The rating reflects the marginal strength of the assessment. In the following example, the input is a blend of English, Spanish, and French. The analyzer counts the characters in each segment to determine the predominant language.

Input

    {
      "documents": [
        {
          "id": "1",
          "text": "Hello, I would like to take a class at your University. ¿Se ofrecen clases en español? Es mi primera lengua y más fácil para escribir. Que diriez-vous des cours en français?"
        }
      ]
    }

Output

The resulting output consists of the predominant language, with a score of less than 1.0, which indicates a weaker level of confidence.

    {
      "documents": [
        {
          "id": "1",
          "detectedLanguages": [
            {
              "name": "Spanish",
              "iso6391Name": "es",
              "score": 0.9375
            }
          ]
        }
      ],
      "errors": []
    }
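
One way to act on a lower score is to flag such documents for review, since they may contain mixed-language content. The following Python sketch is only an illustration: the helper name is hypothetical, and the 0.95 threshold is an arbitrary value chosen for this example, not a recommendation from the service.

```python
def low_confidence_ids(response, threshold=0.95):
    """Return IDs of documents whose first detected language scores below threshold."""
    flagged = []
    for document in response.get("documents", []):
        languages = document.get("detectedLanguages", [])
        # float() also handles a "NaN" score; NaN never compares below the threshold.
        if languages and float(languages[0]["score"]) < threshold:
            flagged.append(document["id"])
    return flagged


mixed_response = {
    "documents": [
        {"id": "1", "detectedLanguages": [
            {"name": "Spanish", "iso6391Name": "es", "score": 0.9375}]},
        {"id": "2", "detectedLanguages": [
            {"name": "English", "iso6391Name": "en", "score": 1}]},
    ],
    "errors": [],
}
flagged = low_confidence_ids(mixed_response)
```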

Summary

In this article, you learned the concepts and workflow for language detection by using Text Analytics in Azure Cognitive Services. The following points were explained and demonstrated:

  • Language detection is available for a wide range of languages, variants, dialects, and some regional or cultural languages.
  • JSON documents in the request body include an ID and text.
  • The POST request is sent to a /languages endpoint by using a personalized access key and an endpoint that's valid for your subscription.
  • Response output consists of language identifiers for each document ID. The output can be streamed to any app that accepts JSON. Example apps include Excel and Power BI, to name a few.

See also

Text Analytics overview
Frequently asked questions (FAQ)
Text Analytics product page