Add custom analyzers to string fields in an Azure Cognitive Search index

A custom analyzer is a specific type of text analyzer that consists of a user-defined combination of an existing tokenizer and optional filters. By combining tokenizers and filters in new ways, you can customize text processing in the search engine to achieve specific outcomes. For example, you could create a custom analyzer with a char filter to remove HTML markup before text inputs are tokenized.
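
As a minimal sketch of that example (the analyzer name my_html_analyzer is hypothetical; html_strip, standard_v2, and lowercase are predefined components described later on this page), the definition might look like:

"analyzers":[
   {
      "name":"my_html_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters":[ "html_strip" ],
      "tokenizer":"standard_v2",
      "tokenFilters":[ "lowercase" ]
   }
]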

You can define multiple custom analyzers to vary the combination of filters, but each field can use only one analyzer for indexing analysis and one for search analysis. For an illustration of what a custom analyzer looks like, see Custom analyzer example.

Overview

The role of a full-text search engine, in simple terms, is to process and store documents in a way that enables efficient querying and retrieval. At a high level, it comes down to extracting important words from documents, putting them in an index, and then using the index to find documents that match the words of a given query. The process of extracting words from documents and search queries is called lexical analysis. Components that perform lexical analysis are called analyzers.

In Azure Cognitive Search, you can choose from a set of predefined language-agnostic analyzers in the Analyzers table, or from the language-specific analyzers listed in Language analyzers (Azure Cognitive Search service REST API). You also have the option to define your own custom analyzers.

A custom analyzer allows you to take control over the process of converting text into indexable and searchable tokens. It's a user-defined configuration consisting of a single predefined tokenizer, one or more token filters, and one or more char filters. The tokenizer is responsible for breaking text into tokens, and the token filters for modifying the tokens emitted by the tokenizer. Char filters are applied to prepare input text before it is processed by the tokenizer; for instance, a char filter can replace certain characters or symbols.
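
For instance, a hedged sketch of a custom char filter that rewrites one symbol as another (the name map_dash is hypothetical; the mapping format is described in the Char Filters Reference below):

"charFilters":[
   {
      "name":"map_dash",
      "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
      "mappings":[ "-=>_" ]
   }
]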

Popular scenarios enabled by custom analyzers include:

  • Phonetic search. Add a phonetic filter to enable searching based on how a word sounds, not how it's spelled (see the sketch after this list).

  • Disable lexical analysis. Use the keyword analyzer to create searchable fields that are not analyzed.

  • Fast prefix/suffix search. Add the Edge N-gram token filter to index prefixes of words to enable fast prefix matching. Combine it with the Reverse token filter to do suffix matching.

  • Custom tokenization. For example, use the whitespace tokenizer to break sentences into tokens, using whitespace as a delimiter.

  • ASCII folding. Add the Standard ASCII folding filter to normalize diacritics like ö or ê in search terms.
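
For the phonetic search scenario, a minimal sketch (the names my_phonetic_analyzer and my_phonetic_filter are hypothetical) pairs a predefined tokenizer with a customized phonetic token filter:

"analyzers":[
   {
      "name":"my_phonetic_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard_v2",
      "tokenFilters":[ "lowercase", "my_phonetic_filter" ]
   }
],
"tokenFilters":[
   {
      "name":"my_phonetic_filter",
      "@odata.type":"#Microsoft.Azure.Search.PhoneticTokenFilter",
      "encoder":"doubleMetaphone"
   }
]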

This page provides a list of supported analyzers, tokenizers, token filters, and char filters. You can also find a description of changes to the index definition, along with a usage example. For more background about the underlying technology used in the Azure Cognitive Search implementation, see Analysis package summary (Lucene). For examples of analyzer configurations, see Add analyzers in Azure Cognitive Search.

Validation rules

Names of analyzers, tokenizers, token filters, and char filters must be unique and cannot be the same as any of the predefined analyzers, tokenizers, token filters, or char filters. See the Property reference for names already in use.

Create custom analyzers

You can define custom analyzers at index creation time. The syntax for specifying a custom analyzer is described in this section. You can also familiarize yourself with the syntax by reviewing sample definitions in Add analyzers in Azure Cognitive Search.

An analyzer definition includes a name, a type, one or more char filters, a maximum of one tokenizer, and one or more token filters for post-tokenization processing. Char filters are applied before tokenization. Token filters and char filters are applied from left to right.

tokenizer_name is the name of a tokenizer, token_filter_name_1 and token_filter_name_2 are the names of token filters, and char_filter_name_1 and char_filter_name_2 are the names of char filters (see the Tokenizers, Token filters, and Char filters tables for valid values).

The analyzer definition is part of the larger index. See the Create Index API for information about the rest of the index.

"analyzers":(optional)[
   {
      "name":"name of analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters":[
         "char_filter_name_1",
         "char_filter_name_2"
      ],
      "tokenizer":"tokenizer_name",
      "tokenFilters":[
         "token_filter_name_1",
         "token_filter_name_2"
      ]
   },
   {
      "name":"name of analyzer",
      "@odata.type":"#analyzer_type",
      "option1":value1,
      "option2":value2,
      ...
   }
],
"charFilters":(optional)[
   {
      "name":"char_filter_name",
      "@odata.type":"#char_filter_type",
      "option1":value1,
      "option2":value2,
      ...
   }
],
"tokenizers":(optional)[
   {
      "name":"tokenizer_name",
      "@odata.type":"#tokenizer_type",
      "option1":value1,
      "option2":value2,
      ...
   }
],
"tokenFilters":(optional)[
   {
      "name":"token_filter_name",
      "@odata.type":"#token_filter_type",
      "option1":value1,
      "option2":value2,
      ...
   }
]

Note

Custom analyzers that you create are not exposed in the Azure portal. The only way to add a custom analyzer is through code that calls the API when defining an index.

Within an index definition, you can place this section anywhere in the body of a create index request, but it usually goes at the end:

{
  "name": "name_of_index",
  "fields": [ ],
  "suggesters": [ ],
  "scoringProfiles": [ ],
  "defaultScoringProfile": (optional) "...",
  "corsOptions": (optional) { },
  "analyzers":(optional)[ ],
  "charFilters":(optional)[ ],
  "tokenizers":(optional)[ ],
  "tokenFilters":(optional)[ ]
}

Definitions for char filters, tokenizers, and token filters are added to the index only if you are setting custom options. To use an existing filter or tokenizer as-is, specify it by name in the analyzer definition.
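
For example, a sketch of an analyzer that reuses the predefined whitespace tokenizer and lowercase token filter as-is (only the analyzer name, my_simple_analyzer, is hypothetical) needs no tokenizers or tokenFilters definition sections at all:

"analyzers":[
   {
      "name":"my_simple_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"whitespace",
      "tokenFilters":[ "lowercase" ]
   }
]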

Test custom analyzers

You can use the Test Analyzer operation in the REST API to see how an analyzer breaks given text into tokens.

Request

  POST https://[search service name].search.azure.cn/indexes/[index name]/analyze?api-version=[api-version]
  Content-Type: application/json
    api-key: [admin key]

  {
     "analyzer":"my_analyzer",
     "text": "Vis-à-vis means Opposite"
  }

Response

  {
    "tokens": [
      {
        "token": "vis_a_vis",
        "startOffset": 0,
        "endOffset": 9,
        "position": 0
      },
      {
        "token": "vis_à_vis",
        "startOffset": 0,
        "endOffset": 9,
        "position": 0
      },
      {
        "token": "means",
        "startOffset": 10,
        "endOffset": 15,
        "position": 1
      },
      {
        "token": "opposite",
        "startOffset": 16,
        "endOffset": 24,
        "position": 2
      }
    ]
  }

Update custom analyzers

Once an analyzer, tokenizer, token filter, or char filter is defined, it cannot be modified. New ones can be added to an existing index only if the allowIndexDowntime flag is set to true in the index update request:

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=[api-version]&allowIndexDowntime=true

This operation takes your index offline for at least a few seconds, causing your indexing and query requests to fail. Performance and write availability of the index can be impaired for several minutes after the index is updated, or longer for very large indexes, but these effects are temporary and eventually resolve on their own.
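
As a hedged sketch, assuming the rest of the index definition is resent unchanged, such an update request might look like the following (the index and analyzer names are hypothetical; unchanged sections are elided with "..." as in the schema above):

PUT https://[search service name].search.azure.cn/indexes/[index name]?api-version=[api-version]&allowIndexDowntime=true
Content-Type: application/json
api-key: [admin key]

{
  "name": "name_of_index",
  "fields": [ ... ],
  ...
  "analyzers": [
    {
      "name": "my_new_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "whitespace",
      "tokenFilters": [ "lowercase" ]
    }
  ]
}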

Analyzer reference

The tables below list the configuration properties for the analyzers, tokenizers, token filters, and char filters sections of an index definition. The structure of an analyzer, tokenizer, or filter in your index is composed of these attributes. For value assignment information, see the Property reference.

Analyzers

For analyzers, index attributes vary depending on whether you're using predefined or custom analyzers.

Predefined Analyzers

Type | Description
Name | It must contain only letters, digits, spaces, dashes, or underscores; it can start and end only with alphanumeric characters; and it is limited to 128 characters.
Type | Analyzer type from the list of supported analyzers. See the analyzer_type column in the Analyzers table below.
Options | Must be valid options of a predefined analyzer listed in the Analyzers table below.

Custom Analyzers

Type | Description
Name | It must contain only letters, digits, spaces, dashes, or underscores; it can start and end only with alphanumeric characters; and it is limited to 128 characters.
Type | Must be "#Microsoft.Azure.Search.CustomAnalyzer".
CharFilters | Set to either one of the predefined char filters listed in the Char filters table or a custom char filter specified in the index definition.
Tokenizer | Required. Set to either one of the predefined tokenizers listed in the Tokenizers table below or a custom tokenizer specified in the index definition.
TokenFilters | Set to either one of the predefined token filters listed in the Token filters table or a custom token filter specified in the index definition.

Note

You are required to configure your custom analyzer so that it does not produce tokens longer than 300 characters. Indexing fails for documents with such tokens. To trim or ignore them, use the TruncateTokenFilter and the LengthTokenFilter respectively. See Token filters for reference.
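
As a hedged sketch, customized filters for that purpose (the names my_truncate and my_length are hypothetical; the options are listed in the Token Filters Reference below) could be defined as:

"tokenFilters":[
   {
      "name":"my_truncate",
      "@odata.type":"#Microsoft.Azure.Search.TruncateTokenFilter",
      "length":300
   },
   {
      "name":"my_length",
      "@odata.type":"#Microsoft.Azure.Search.LengthTokenFilter",
      "min":0,
      "max":300
   }
]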

Char Filters

A char filter is used to prepare input text before it is processed by the tokenizer. For instance, char filters can replace certain characters or symbols. You can have multiple char filters in a custom analyzer. Char filters run in the order in which they are listed.

Type | Description
Name | It must contain only letters, digits, spaces, dashes, or underscores; it can start and end only with alphanumeric characters; and it is limited to 128 characters.
Type | Char filter type from the list of supported char filters. See the char_filter_type column in the Char Filters table below.
Options | Must be valid options of a given char filter type.

Tokenizers

A tokenizer divides continuous text into a sequence of tokens, such as breaking a sentence into words.

You can specify exactly one tokenizer per custom analyzer. If you need more than one tokenizer, you can create multiple custom analyzers and assign them on a field-by-field basis in your index schema. A custom analyzer can use a predefined tokenizer with either default or customized options.

Type | Description
Name | It must contain only letters, digits, spaces, dashes, or underscores; it can start and end only with alphanumeric characters; and it is limited to 128 characters.
Type | Tokenizer name from the list of supported tokenizers. See the tokenizer_type column in the Tokenizers table below.
Options | Must be valid options of a given tokenizer type listed in the Tokenizers table below.

Token filters

A token filter is used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase. You can have multiple token filters in a custom analyzer. Token filters run in the order in which they are listed.

Type | Description
Name | It must contain only letters, digits, spaces, dashes, or underscores; it can start and end only with alphanumeric characters; and it is limited to 128 characters.
Type | Token filter name from the list of supported token filters. See the token_filter_type column in the Token filters table below.
Options | Must be valid options of a given token filter type.

Property reference

This section provides the valid values for the attributes specified in the definition of a custom analyzer, tokenizer, char filter, or token filter in your index. Analyzers, tokenizers, and filters that are implemented using Apache Lucene have links to the Lucene API documentation.

Predefined Analyzers Reference

analyzer_name | analyzer_type 1 | Description and Options

keyword | (type applies only when options are available) | Treats the entire content of a field as a single token. This is useful for data like zip codes, IDs, and some product names.

pattern | PatternAnalyzer | Flexibly separates text into terms via a regular expression pattern.

Options

lowercase (type: bool) - Determines whether terms are lowercased. The default is true.

pattern (type: string) - A regular expression pattern to match token separators. The default is \W+, which matches non-word characters.

flags (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES

stopwords (type: string array) - A list of stopwords. The default is an empty list.

simple | (type applies only when options are available) | Divides text at non-letters and converts it to lowercase.

standard (also referred to as standard.lucene) | StandardAnalyzer | Standard Lucene analyzer, composed of the standard tokenizer, lowercase filter, and stop filter.

Options

maxTokenLength (type: int) - The maximum token length. The default is 255. Tokens longer than the maximum length are split. The maximum token length that can be used is 300 characters.

stopwords (type: string array) - A list of stopwords. The default is an empty list.

standardasciifolding.lucene | (type applies only when options are available) | Standard analyzer with ASCII folding filter.

stop | StopAnalyzer | Divides text at non-letters; applies the lowercase and stopword token filters.

Options

stopwords (type: string array) - A list of stopwords. The default is a predefined list for English.

whitespace | (type applies only when options are available) | An analyzer that uses the whitespace tokenizer. Tokens that are longer than 255 characters are split.

1 Analyzer types are always prefixed in code with "#Microsoft.Azure.Search", such that "PatternAnalyzer" would actually be specified as "#Microsoft.Azure.Search.PatternAnalyzer". We removed the prefix for brevity, but the prefix is required in your code.

The analyzer_type is only provided for analyzers that can be customized. If there are no options, as is the case with the keyword analyzer, there is no associated #Microsoft.Azure.Search type.
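
For example, a customized instance of the pattern analyzer (the name my_pattern_analyzer and the separator pattern are illustrative assumptions) could be declared with the options above:

"analyzers":[
   {
      "name":"my_pattern_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.PatternAnalyzer",
      "lowercase":true,
      "pattern":"[,;]+",
      "stopwords":[]
   }
]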

Char Filters Reference

In the table below, the char filters that are implemented using Apache Lucene are linked to the Lucene API documentation.

char_filter_name | char_filter_type 1 | Description and Options

html_strip | (type applies only when options are available) | A char filter that attempts to strip out HTML constructs.

mapping | MappingCharFilter | A char filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string.

Options

mappings (type: string array) - A list of mappings in the following format: "a=>b" (all occurrences of the character "a" are replaced with the character "b"). Required.

pattern_replace | PatternReplaceCharFilter | A char filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, input text = "aa bb aa bb", pattern = "(aa)\\s+(bb)", replacement = "$1#$2", result = "aa#bb aa#bb".

Options

pattern (type: string) - Required.

replacement (type: string) - Required.

1 Char filter types are always prefixed in code with "#Microsoft.Azure.Search", such that "MappingCharFilter" would actually be specified as "#Microsoft.Azure.Search.MappingCharFilter". We removed the prefix to reduce the width of the table, but please remember to include it in your code. Notice that char_filter_type is only provided for filters that can be customized. If there are no options, as is the case with html_strip, there is no associated #Microsoft.Azure.Search type.

Tokenizers Reference

In the table below, the tokenizers that are implemented using Apache Lucene are linked to the Lucene API documentation.

tokenizer_name | tokenizer_type 1 | Description and Options

classic | ClassicTokenizer | Grammar-based tokenizer that is suitable for processing most European-language documents.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.

edgeNGram | EdgeNGramTokenizer | Tokenizes the input from an edge into n-grams of the given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: "letter", "digit", "whitespace", "punctuation", "symbol". Defaults to an empty array, which keeps all characters.

keyword_v2 | KeywordTokenizerV2 | Emits the entire input as a single token.

Options

maxTokenLength (type: int) - The maximum token length. Default: 256, maximum: 300. Tokens longer than the maximum length are split.

letter | (type applies only when options are available) | Divides text at non-letters. Tokens that are longer than 255 characters are split.

lowercase | (type applies only when options are available) | Divides text at non-letters and converts it to lowercase. Tokens that are longer than 255 characters are split.
microsoft_language_tokenizer | MicrosoftLanguageTokenizer | Divides text using language-specific rules.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300, and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer; set to false if used as the indexing tokenizer.

language (type: string) - The language to use; the default is "english". Allowed values include:
"bangla", "bulgarian", "catalan", "chineseSimplified", "chineseTraditional", "croatian", "czech", "danish", "dutch", "english", "french", "german", "greek", "gujarati", "hindi", "icelandic", "indonesian", "italian", "japanese", "kannada", "korean", "malay", "malayalam", "marathi", "norwegianBokmaal", "polish", "portuguese", "portugueseBrazilian", "punjabi", "romanian", "russian", "serbianCyrillic", "serbianLatin", "slovenian", "spanish", "swedish", "tamil", "telugu", "thai", "ukrainian", "urdu", "vietnamese"

microsoft_language_stemming_tokenizer | MicrosoftLanguageStemmingTokenizer | Divides text using language-specific rules and reduces words to their base forms.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300, and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer; set to false if used as the indexing tokenizer.

language (type: string) - The language to use; the default is "english". Allowed values include:
"arabic", "bangla", "bulgarian", "catalan", "croatian", "czech", "danish", "dutch", "english", "estonian", "finnish", "french", "german", "greek", "gujarati", "hebrew", "hindi", "hungarian", "icelandic", "indonesian", "italian", "kannada", "latvian", "lithuanian", "malay", "malayalam", "marathi", "norwegianBokmaal", "polish", "portuguese", "portugueseBrazilian", "punjabi", "romanian", "russian", "serbianCyrillic", "serbianLatin", "slovak", "slovenian", "spanish", "swedish", "tamil", "telugu", "turkish", "ukrainian", "urdu"
nGram | NGramTokenizer | Tokenizes the input into n-grams of the given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: "letter", "digit", "whitespace", "punctuation", "symbol". Defaults to an empty array, which keeps all characters.

path_hierarchy_v2 | PathHierarchyTokenizerV2 | Tokenizer for path-like hierarchies.

Options

delimiter (type: string) - Default: '/'.

replacement (type: string) - If set, replaces the delimiter character. Default: same as the value of delimiter.

maxTokenLength (type: int) - The maximum token length. Default: 300, maximum: 300. Paths longer than maxTokenLength are ignored.

reverse (type: bool) - If true, generates tokens in reverse order. Default: false.

skip (type: int) - The number of initial tokens to skip. The default is 0.
pattern | PatternTokenizer | This tokenizer uses regex pattern matching to construct distinct tokens.

Options

pattern (type: string) - A regular expression pattern to match token separators. The default is \W+, which matches non-word characters.

flags (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES

group (type: int) - Which group to extract into tokens. The default is -1 (split).

standard_v2 | StandardTokenizerV2 | Breaks text following the Unicode Text Segmentation rules.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.

uax_url_email | UaxUrlEmailTokenizer | Tokenizes URLs and emails as one token.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.

whitespace | (type applies only when options are available) | Divides text at whitespace. Tokens that are longer than 255 characters are split.

1 Tokenizer types are always prefixed in code with "#Microsoft.Azure.Search", such that "ClassicTokenizer" would actually be specified as "#Microsoft.Azure.Search.ClassicTokenizer". We removed the prefix to reduce the width of the table, but please remember to include it in your code. Notice that tokenizer_type is only provided for tokenizers that can be customized. If there are no options, as is the case with the letter tokenizer, there is no associated #Microsoft.Azure.Search type.
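
As a sketch using the options above (the tokenizer name my_edge_tokenizer and the specific gram sizes are illustrative assumptions), a customized edgeNGram tokenizer might be declared as:

"tokenizers":[
   {
      "name":"my_edge_tokenizer",
      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram":2,
      "maxGram":10,
      "tokenChars":[ "letter", "digit" ]
   }
]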

Token Filters Reference

In the table below, the token filters that are implemented using Apache Lucene are linked to the Lucene API documentation.

token_filter_name | token_filter_type 1 | Description and Options

arabic_normalization | (type applies only when options are available) | A token filter that applies the Arabic normalizer to normalize the orthography.

apostrophe | (type applies only when options are available) | Strips all characters after an apostrophe (including the apostrophe itself).

asciifolding | AsciiFoldingTokenFilter | Converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

Options

preserveOriginal (type: bool) - If true, the original token is kept. The default is false.

cjk_bigram | CjkBigramTokenFilter | Forms bigrams of CJK terms that are generated from StandardTokenizer.

Options

ignoreScripts (type: string array) - Scripts to ignore. Allowed values include: "han", "hiragana", "katakana", "hangul". The default is an empty list.

outputUnigrams (type: bool) - Set to true if you always want to output both unigrams and bigrams. The default is false.
cjk_width | (type applies only when options are available) | Normalizes CJK width differences. Folds full-width ASCII variants into the equivalent basic Latin, and half-width Katakana variants into the equivalent kana.

classic | (type applies only when options are available) | Removes English possessives, and dots from acronyms.

common_grams | CommonGramTokenFilter | Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid.

Options

commonWords (type: string array) - The set of common words. The default is an empty list. Required.

ignoreCase (type: bool) - If true, matching is case-insensitive. The default is false.

queryMode (type: bool) - Generates bigrams, then removes common words and single terms followed by a common word. The default is false.

dictionary_decompounder | DictionaryDecompounderTokenFilter | Decomposes compound words found in many Germanic languages.

Options

wordList (type: string array) - The list of words to match against. The default is an empty list. Required.

minWordSize (type: int) - Only words longer than this get processed. The default is 5.

minSubwordSize (type: int) - Only subwords longer than this are output. The default is 2.

maxSubwordSize (type: int) - Only subwords shorter than this are output. The default is 15.

onlyLongestMatch (type: bool) - Adds only the longest matching subword to the output. The default is false.
edgeNGram_v2 | EdgeNGramTokenFilterV2 | Generates n-grams of the given size(s), starting from the front or the back of an input token.

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

side (type: string) - Specifies which side of the input the n-gram should be generated from. Allowed values: "front", "back"

elision | ElisionTokenFilter | Removes elisions. For example, "l'avion" (the plane) is converted to "avion" (plane).

Options

articles (type: string array) - A set of articles to remove. The default is an empty list. If no list of articles is set, all French articles are removed by default.

german_normalization | (type applies only when options are available) | Normalizes German characters according to the heuristics of the German2 snowball algorithm.

hindi_normalization | (type applies only when options are available) | Normalizes text in Hindi to remove some differences in spelling variations.

indic_normalization | IndicNormalizationTokenFilter | Normalizes the Unicode representation of text in Indian languages.

keep | KeepTokenFilter | A token filter that keeps only tokens whose text is contained in a specified list of words.

Options

keepWords (type: string array) - A list of words to keep. The default is an empty list. Required.

keepWordsCase (type: bool) - If true, lowercases all words first. The default is false.

keyword_marker | KeywordMarkerTokenFilter | Marks terms as keywords.

Options

keywords (type: string array) - A list of words to mark as keywords. The default is an empty list. Required.

ignoreCase (type: bool) - If true, lowercases all words first. The default is false.

keyword_repeat | (type applies only when options are available) | Emits each incoming token twice, once as a keyword and once as a non-keyword.

kstem | (type applies only when options are available) | A high-performance kstem filter for English.

length | LengthTokenFilter | Removes words that are too long or too short.

Options

min (type: int) - The minimum length. Default: 0, maximum: 300.

max (type: int) - The maximum length. Default: 300, maximum: 300.
limit | Microsoft.Azure.Search.LimitTokenFilter | Limits the number of tokens while indexing.

Options

maxTokenCount (type: int) - The maximum number of tokens to produce. The default is 1.

consumeAllTokens (type: bool) - Whether all tokens from the input must be consumed even if maxTokenCount is reached. The default is false.

lowercase | (type applies only when options are available) | Normalizes token text to lowercase.

nGram_v2 | NGramTokenFilterV2 | Generates n-grams of the given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

pattern_capture | PatternCaptureTokenFilter | Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.

Options

patterns (type: string array) - A list of patterns to match against each token. Required.

preserveOriginal (type: bool) - Set to true to return the original token even if one of the patterns matches. Default: true

pattern_replace | PatternReplaceTokenFilter | A token filter that applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.

Options

pattern (type: string) - Required.

replacement (type: string) - Required.

persian_normalization | (type applies only when options are available) | Applies normalization for Persian.

phonetic | PhoneticTokenFilter | Creates tokens for phonetic matches.

Options

encoder (type: string) - The phonetic encoder to use. Allowed values include: "metaphone", "doubleMetaphone", "soundex", "refinedSoundex", "caverphone1", "caverphone2", "cologne", "nysiis", "koelnerPhonetik", "haasePhonetik", "beiderMorse". The default is "metaphone". See encoder for more information.

replace (type: bool) - True if encoded tokens should replace original tokens; false if they should be added as synonyms. The default is true.

porter_stem | (type applies only when options are available) | Transforms the token stream as per the Porter stemming algorithm.

reverse | (type applies only when options are available) | Reverses the token string.

scandinavian_normalization | (type applies only when options are available) | Normalizes use of the interchangeable Scandinavian characters.

scandinavian_folding | (type applies only when options are available) | Folds the Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of the double vowels aa, ae, ao, oe, and oo, leaving just the first one.
shingle | ShingleTokenFilter | Creates combinations of tokens as a single token.

Options

maxShingleSize (type: int) - Defaults to 2.

minShingleSize (type: int) - Defaults to 2.

outputUnigrams (type: bool) - If true, the output stream contains the input tokens (unigrams) as well as shingles. The default is true.

outputUnigramsIfNoShingles (type: bool) - If true, overrides the behavior of outputUnigrams==false for those times when no shingles are available. The default is false.

tokenSeparator (type: string) - The string to use when joining adjacent tokens to form a shingle. The default is " ".

filterToken (type: string) - The string to insert for each position at which there is no token. The default is "".

snowball | SnowballTokenFilter | Snowball token filter.

Options

language (type: string) - Allowed values include: "armenian", "basque", "catalan", "danish", "dutch", "english", "finnish", "french", "german", "german2", "hungarian", "italian", "kp", "lovins", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "turkish"

sorani_normalization | SoraniNormalizationTokenFilter | Normalizes the Unicode representation of Sorani text.

Options

None.
stemmer | StemmerTokenFilter | Language-specific stemming filter.

Options

language (type: string) - Allowed values include:
- "arabic"
- "armenian"
- "basque"
- "brazilian"
- "bulgarian"
- "catalan"
- "czech"
- "danish"
- "dutch"
- "dutchKp"
- "english"
- "lightEnglish"
- "minimalEnglish"
- "possessiveEnglish"
- "porter2"
- "lovins"
- "finnish"
- "lightFinnish"
- "french"
- "lightFrench"
- "minimalFrench"
- "galician"
- "minimalGalician"
- "german"
- "german2"
- "lightGerman"
- "minimalGerman"
- "greek"
- "hindi"
- "hungarian"
- "lightHungarian"
- "indonesian"
- "irish"
- "italian"
- "lightItalian"
- "sorani"
- "latvian"
- "norwegian"
- "lightNorwegian"
- "minimalNorwegian"
- "lightNynorsk"
- "minimalNynorsk"
- "portuguese"
- "lightPortuguese"
- "minimalPortuguese"
- "portugueseRslp"
- "romanian"
- "russian"
- "lightRussian"
- "spanish"
- "lightSpanish"
- "swedish"
- "lightSwedish"
- "turkish"
stemmer_override | StemmerOverrideTokenFilter | Any dictionary-stemmed terms are marked as keywords, which prevents stemming down the chain. Must be placed before any stemming filters.

Options

rules (type: string array) - Stemming rules in the format "word => stem", for example "ran => run". The default is an empty list. Required.

stopwords | StopwordsTokenFilter | Removes stop words from a token stream. By default, the filter uses a predefined stop word list for English.

Options

stopwords (type: string array) - A list of stopwords. Cannot be specified if a stopwordsList is specified.

stopwordsList (type: string) - A predefined list of stopwords. Cannot be specified if stopwords is specified. Allowed values include: "arabic", "armenian", "basque", "brazilian", "bulgarian", "catalan", "czech", "danish", "dutch", "english", "finnish", "french", "galician", "german", "greek", "hindi", "hungarian", "indonesian", "irish", "italian", "latvian", "norwegian", "persian", "portuguese", "romanian", "russian", "sorani", "spanish", "swedish", "thai", "turkish". Default: "english".

ignoreCase (type: bool) - If true, all words are lowercased first. The default is false.

removeTrailing (type: bool) - If true, ignores the last search term if it's a stop word. The default is true.
synonym | SynonymTokenFilter | Matches single- or multi-word synonyms in a token stream.

Options

synonyms (type: string array) - Required. A list of synonyms in one of the following two formats:

- incredible, unbelievable, fabulous => amazing - all terms on the left side of the => symbol are replaced with all terms on its right side.

- incredible, unbelievable, fabulous, amazing - a comma-separated list of equivalent words. Set the expand option to change how this list is interpreted.

ignoreCase (type: bool) - Case-folds input for matching. The default is false.

expand (type: bool) - If true, all words in the list of synonyms (when => notation is not used) map to one another.
The following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible, unbelievable, fabulous, amazing

- If false, the following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible.
trim | (type applies only when options are available) | Trims leading and trailing whitespace from tokens.

truncate | TruncateTokenFilter | Truncates terms to a specific length.

Options

length (type: int) - Default: 300, maximum: 300. Required.

unique | UniqueTokenFilter | Filters out tokens with the same text as the previous token.

Options

onlyOnSamePosition (type: bool) - If set, removes duplicates only at the same position. The default is true.

uppercase | (type applies only when options are available) | Normalizes token text to uppercase.

word_delimiter | WordDelimiterTokenFilter | Splits words into subwords and performs optional transformations on subword groups.

Options

generateWordParts (type: bool) - Causes parts of words to be generated; for example, "AzureSearch" becomes "Azure" "Search". The default is true.

generateNumberParts (type: bool) - Causes number subwords to be generated. The default is true.

catenateWords (type: bool) - Causes maximum runs of word parts to be catenated; for example, "Azure-Search" becomes "AzureSearch". The default is false.

catenateNumbers (type: bool) - Causes maximum runs of number parts to be catenated; for example, "1-2" becomes "12". The default is false.

catenateAll (type: bool) - Causes all subword parts to be catenated; for example, "Azure-Search-1" becomes "AzureSearch1". The default is false.

splitOnCaseChange (type: bool) - If true, splits words on case change; for example, "AzureSearch" becomes "Azure" "Search". The default is true.

preserveOriginal (type: bool) - Causes original words to be preserved and added to the subword list. The default is false.

splitOnNumerics (type: bool) - If true, splits on numbers; for example, "Azure1Search" becomes "Azure" "1" "Search". The default is true.

stemEnglishPossessive (type: bool) - Causes a trailing "'s" to be removed from each subword. The default is true.

protectedWords (type: string array) - Tokens to protect from being delimited. The default is an empty list.

1 Token filter types are always prefixed in code with "#Microsoft.Azure.Search", such that "ArabicNormalizationTokenFilter" would actually be specified as "#Microsoft.Azure.Search.ArabicNormalizationTokenFilter". We removed the prefix to reduce the width of the table, but please remember to include it in your code.
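
Tying back to the fast prefix search scenario earlier on this page, a hedged sketch of a customized Edge N-gram token filter (the name my_prefix_filter and the gram sizes are illustrative assumptions) might be:

"tokenFilters":[
   {
      "name":"my_prefix_filter",
      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "minGram":2,
      "maxGram":20,
      "side":"front"
   }
]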

See also

Azure Cognitive Search REST APIs
Analyzers in Azure Cognitive Search > Examples
Create Index (Azure Cognitive Search REST API)