Text offsets in the Text Analytics API output

Multilingual and emoji support has led to Unicode encodings that use more than one code point to represent a single displayed character, called a grapheme. For example, emojis like 🌷 and 👍 may use several characters to compose the shape, with additional characters for visual attributes such as skin tone. Similarly, the Hindi word अनुच्छेद is encoded as five letters and three combining marks.
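To make the letter and mark counts above concrete, here is a minimal Python sketch that inspects the code points of the Hindi word using only the standard `unicodedata` module. The variable names are illustrative and are not part of the API.

```python
import unicodedata

# The Hindi word अनुच्छेद written out as explicit code points.
word = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"

# General categories starting with "L" are letters; "M" are combining marks.
letters = [ch for ch in word if unicodedata.category(ch).startswith("L")]
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]

print(len(word), len(letters), len(marks))  # prints: 8 5 3

# A thumbs-up emoji followed by a skin tone modifier is two code points,
# although it is displayed as a single grapheme.
thumbs_up = "\U0001F44D\U0001F3FD"
print([f"U+{ord(ch):04X}" for ch in thumbs_up])
```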

Because of the different lengths of possible multilingual and emoji encodings, the Text Analytics API may return offsets in the response.

Offsets in the API response.

Whenever offsets are returned in the API response, such as for Named Entity Recognition or Sentiment Analysis, remember the following:

  • Elements in the response may be specific to the endpoint that was called.
  • HTTP POST/GET payloads are encoded in UTF-8, which may or may not be the default character encoding on your client-side compiler or operating system.
  • Offsets refer to grapheme counts based on the Unicode 8.0.0 standard, not character counts.
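Because the payload encoding may differ from a platform default, it can help to encode the request body explicitly. The following Python sketch assumes a request body with a `documents` list of `id`, `language`, and `text` fields; treat that shape as an illustration rather than a reference for any particular endpoint.

```python
import json

# Hypothetical request body; the field names are assumed for illustration.
payload = {
    "documents": [
        {"id": "1", "language": "en", "text": "Caf\u00e9 \U0001F44D"}
    ]
}

# ensure_ascii=False keeps non-ASCII characters as-is in the JSON text,
# and encode("utf-8") makes the byte encoding explicit instead of
# relying on a platform default.
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

text = payload["documents"][0]["text"]
# The number of code points and the number of UTF-8 bytes differ.
print(len(text), len(text.encode("utf-8")))  # prints: 6 10
```

Neither of these two numbers is a grapheme count, which is why offsets in the response should not be treated as byte positions.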

Extracting substrings from text with offsets

Offsets can cause problems when using character-based substring methods, for example the .NET Substring() method. One problem is that an offset may cause a substring method to end in the middle of a multi-character grapheme encoding instead of at its end.
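The mismatch is easy to reproduce. In the Python sketch below, the offset and length are hypothetical values, chosen as if an entity had been reported in grapheme units. Python indexes strings by code point, so applying those values directly gives the wrong substring.

```python
# Thumbs-up emoji + skin tone modifier (one grapheme, two code points),
# followed by a space and the word "Seattle".
text = "\U0001F44D\U0001F3FD Seattle"

# Hypothetical grapheme-based values for the entity "Seattle":
# grapheme 0 is the emoji, grapheme 1 is the space, grapheme 2 is "S".
offset, length = 2, 7

# Python slices by code point, so the slice starts one position too early.
wrong = text[offset:offset + length]
print(repr(wrong))  # prints: ' Seattl'

# A code-point slice can also cut a grapheme in half:
# this keeps the thumbs-up but drops its skin tone modifier.
half_grapheme = text[:1]
print(len(half_grapheme))  # prints: 1
```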

In .NET, consider using the StringInfo class, which enables you to work with a string as a series of textual elements, rather than individual character objects. You can also look for grapheme splitter libraries in your preferred software environment.

The Text Analytics API returns these textual elements as well, for convenience.
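As a rough illustration of working with text elements instead of individual characters, the Python sketch below groups each base character with the combining marks, emoji skin tone modifiers, and zero width joiners that follow it. This is a deliberately simplified approximation written for this example. It is not the Unicode text segmentation algorithm, and it is not what StringInfo or the API does internally. A dedicated grapheme library is the better choice for real use. All function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner, used inside emoji sequences


def is_extender(ch):
    """Return True if ch should stay attached to the preceding character."""
    return (
        unicodedata.category(ch).startswith("M")  # combining marks
        or "\U0001F3FB" <= ch <= "\U0001F3FF"     # emoji skin tone modifiers
        or ch == ZWJ
    )


def split_text_elements(text):
    """Approximate grapheme split (simplified; not full Unicode rules)."""
    elements = []
    for ch in text:
        # Attach ch to the previous element if it extends it, or if the
        # previous element ends with a joiner that glues the next symbol on.
        if elements and (is_extender(ch) or elements[-1].endswith(ZWJ)):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def substring_by_text_element(text, offset, length):
    """Extract a substring using offsets counted in text elements."""
    elements = split_text_elements(text)
    return "".join(elements[offset:offset + length])


text = "\U0001F44D\U0001F3FD Seattle"
print(substring_by_text_element(text, 2, 7))  # prints: Seattle
```

With this grouping, the emoji and its skin tone modifier count as one element, so an offset of 2 lands on the "S" as intended.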

Offsets in API version 3.1-preview

Beginning with API version 3.1-preview.1, all Text Analytics API endpoints that return an offset will support the stringIndexType parameter. This parameter adjusts the offset and length attributes in the API output to match the requested string iteration scheme. Currently, we support three types:

  1. textElement_v8 (default): iterates over graphemes as defined by the Unicode 8.0.0 standard
  2. unicodeCodePoint: iterates over Unicode code points, the default scheme for Python 3
  3. utf16CodeUnit: iterates over UTF-16 code units, the default scheme for JavaScript, Java, and .NET
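The difference between the last two schemes can be checked with the Python standard library alone. The sketch below is only an illustration with hypothetical helper names. It measures the same string in code points and in UTF-16 code units, then shows where a word would start under each scheme. Counting by text element needs a grapheme-aware library and is left out here.

```python
def code_point_length(text):
    """Length as Python 3 sees it: one unit per Unicode code point."""
    return len(text)


def utf16_code_unit_length(text):
    """Length as JavaScript, Java, and .NET strings see it."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


# Thumbs-up emoji plus skin tone modifier: one grapheme on screen.
sample = "\U0001F44D\U0001F3FD"
print(code_point_length(sample), utf16_code_unit_length(sample))  # prints: 2 4

# Where "Seattle" starts under each scheme.
text = sample + " Seattle"
cp_offset = text.index("Seattle")
utf16_offset = utf16_code_unit_length(text[:cp_offset])
print(cp_offset, utf16_offset)  # prints: 3 5
```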

If the stringIndexType requested matches the programming environment of choice, substring extraction can be done using standard substring or slice methods.
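In Python 3, for example, offsets requested as unicodeCodePoint map directly onto slice indices. The sketch below shows that case and, for comparison, one possible way to apply utf16CodeUnit offsets from Python by going through the UTF-16 bytes. The helper names and the offset values are assumptions made for this example.

```python
def slice_by_code_point(text, offset, length):
    """Code point offsets match Python 3 string indexing directly."""
    return text[offset:offset + length]


def slice_by_utf16_code_unit(text, offset, length):
    """Apply UTF-16 code unit offsets by slicing the encoded bytes."""
    units = text.encode("utf-16-le")  # two bytes per code unit, no BOM
    start, end = 2 * offset, 2 * (offset + length)
    # Decoding raises UnicodeDecodeError if the range splits a surrogate pair.
    return units[start:end].decode("utf-16-le")


text = "\U0001F44D\U0001F3FD Seattle"

# Hypothetical offsets for the word "Seattle" under each scheme.
print(slice_by_code_point(text, 3, 7))       # prints: Seattle
print(slice_by_utf16_code_unit(text, 5, 7))  # prints: Seattle
```

Requesting the scheme that matches the client language avoids the conversion step entirely, which is the simpler option when it is available.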

See also