Analyzers for text processing in Azure Cognitive Search

An analyzer is a component of the full text search engine responsible for processing text in query strings and indexed documents. Text processing (also known as lexical analysis) is transformative, modifying a string through actions such as these:

  • Remove non-essential words (stopwords) and punctuation
  • Split up phrases and hyphenated words into component parts
  • Lower-case any upper-case words
  • Reduce words into primitive root forms for storage efficiency and so that matches can be found regardless of tense
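
The four transformations listed above can be sketched in a few lines of Python. This is only a toy model for building intuition, not the Lucene implementation: the stopword list, the suffix rules, and the function name toy_analyze are all assumptions made for the illustration.

```python
import re

# Hypothetical stopword list and suffix rules, chosen only for this illustration.
STOPWORDS = {"a", "an", "and", "are", "the", "of", "in", "is"}
SUFFIXES = ("ing", "ed", "s")


def toy_analyze(text):
    """Apply the four transformations listed above and return the tokens."""
    # Keeping only runs of letters and digits drops punctuation and breaks
    # hyphenated words into their component parts.
    words = re.findall(r"[A-Za-z0-9]+", text)
    tokens = []
    for word in words:
        word = word.lower()                # lower-case any upper-case words
        if word in STOPWORDS:              # remove non-essential words
            continue
        for suffix in SUFFIXES:            # crude stand-in for real stemming
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


print(toy_analyze("The well-known hotels are offering Air-Conditioned rooms."))
```

Real analyzers use far more careful linguistic rules, but the shape is the same: one input string goes in and a list of normalized tokens comes out.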

Analysis applies to Edm.String fields that are marked as "searchable", which indicates full text search. For fields with this configuration, analysis occurs during indexing when tokens are created, and then again during query execution when queries are parsed and the engine scans for matching tokens. A match is more likely to occur when the same analyzer is used for both indexing and queries, but you can set the analyzer for each workload independently, depending on your requirements.

Query types that are not full text search, such as regular expression or fuzzy search, do not go through the analysis phase on the query side. Instead, the parser sends those strings directly to the search engine, using the pattern that you provide as the basis for the match. Typically, these query forms require whole-string tokens to make pattern matching work. To get whole-term tokens during indexing, you might need custom analyzers. For more information about when and why query terms are analyzed, see Full text search in Azure Cognitive Search.
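
To see why pattern queries tend to need whole-string tokens, consider the following toy sketch. It assumes a pattern has to match a complete indexed token; the two tokenizer functions are rough stand-ins for a word-breaking analyzer and a whole-value analyzer, and the SKU value is made up.

```python
import re


def standard_like_tokens(text):
    # Rough stand-in for a word-breaking analyzer: lowercase, then keep runs
    # of letters and digits as separate tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_like_tokens(text):
    # Stand-in for a whole-string analyzer: the entire value is one token.
    return [text]


def regex_matches(pattern, tokens):
    # The pattern itself is not analyzed; it has to match a complete token.
    return [token for token in tokens if re.fullmatch(pattern, token)]


sku = "SKU-1234-XL"            # hypothetical field value
pattern = r"SKU-\d{4}-.*"      # hypothetical regular expression query

print(regex_matches(pattern, standard_like_tokens(sku)))
print(regex_matches(pattern, keyword_like_tokens(sku)))
```

With word-breaking tokens the value is stored as three short lowercase terms, none of which the pattern can match. With a single whole-value token the same pattern succeeds.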

Default analyzer

In Azure Cognitive Search queries, an analyzer is automatically invoked on all string fields marked as searchable.

By default, Azure Cognitive Search uses the Apache Lucene Standard analyzer (standard lucene), which breaks text into elements following the "Unicode Text Segmentation" rules. Additionally, the standard analyzer converts all characters to their lower case form. Both indexed documents and search terms go through the analysis during indexing and query processing.

You can override the default on a field-by-field basis. Alternative analyzers can be a language analyzer for linguistic processing, a custom analyzer, or a predefined analyzer from the list of available analyzers.

Types of analyzers

The following list describes which analyzers are available in Azure Cognitive Search.

  • Standard Lucene analyzer: Default. No specification or configuration is required. This general-purpose analyzer performs well for many languages and scenarios.
  • Predefined analyzers: Offered as a finished product intended to be used as-is. There are two types: specialized and language. What makes them "predefined" is that you reference them by name, with no configuration or customization.
    Specialized (language-agnostic) analyzers are used when text inputs require specialized processing or minimal processing. Non-language predefined analyzers include Asciifolding, Keyword, Pattern, Simple, Stop, and Whitespace.
    Language analyzers are used when you need rich linguistic support for individual languages. Azure Cognitive Search supports 35 Lucene language analyzers and 50 Microsoft natural language processing analyzers.
  • Custom analyzers: A user-defined configuration that combines existing elements, consisting of one tokenizer (required) and optional filters (char or token).

A few predefined analyzers, such as Pattern or Stop, support a limited set of configuration options. To set these options, you effectively create a custom analyzer, consisting of the predefined analyzer and one of the alternative options documented in Predefined Analyzer Reference. As with any custom configuration, provide your new configuration with a name, such as myPatternAnalyzer, to distinguish it from the Lucene Pattern analyzer.
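
As a rough sketch of what such a named configuration might look like, the snippet below builds an index fragment as a Python dictionary and prints it as JSON. The "@odata.type" value and the "lowercase" and "pattern" option names are assumptions that should be checked against the Predefined Analyzer Reference; the "tags" field and the comma pattern are invented for the example.

```python
import json

# Named configuration that wraps the predefined Pattern analyzer and changes
# one option. The type name and option names are assumptions to verify.
my_pattern_analyzer = {
    "name": "myPatternAnalyzer",
    "@odata.type": "#Microsoft.Azure.Search.PatternAnalyzer",
    "lowercase": True,
    "pattern": ",",   # hypothetical choice: split field values on commas
}

# The entry sits in the index's "analyzers" array, and a field refers to it
# by name, exactly as it would refer to a predefined analyzer.
index_fragment = {
    "fields": [
        {
            "name": "tags",   # hypothetical field
            "type": "Edm.String",
            "searchable": True,
            "analyzer": my_pattern_analyzer["name"],
        }
    ],
    "analyzers": [my_pattern_analyzer],
}

print(json.dumps(index_fragment, indent=2))
```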

How to specify analyzers

Setting an analyzer is optional. As a general rule, try using the default standard Lucene analyzer first to see how it performs. If queries fail to return the expected results, switching to a different analyzer is often the right solution.

  1. When creating a field definition in the index, set the analyzer property to one of the following: a predefined analyzer such as keyword, a language analyzer such as en.microsoft, or a custom analyzer (defined in the same index schema).

      "fields": [
        {
          "name": "Description",
          "type": "Edm.String",
          "retrievable": true,
          "searchable": true,
          "analyzer": "en.microsoft",
          "indexAnalyzer": null,
          "searchAnalyzer": null
        }
      ]
    

    If you are using a language analyzer, you must use the analyzer property to specify it. The searchAnalyzer and indexAnalyzer properties do not support language analyzers.

  2. Alternatively, set indexAnalyzer and searchAnalyzer to vary the analyzer for each workload. These properties are set together and replace the analyzer property, which must be null. You might use different analyzers for data preparation and retrieval if one of those activities requires a specific transformation not needed by the other.

      "fields": [
        {
          "name": "Description",
          "type": "Edm.String",
          "retrievable": true,
          "searchable": true,
          "analyzer": null,
          "indexAnalyzer": "keyword",
          "searchAnalyzer": "whitespace"
        }
      ]
    
  3. For custom analyzers only, create an entry in the [analyzers] section of the index, and then assign your custom analyzer to the field definition per either of the previous two steps. For more information, see Create Index and also Add custom analyzers.
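
The rules in the steps above can be turned into a small pre-flight check that you run on a field definition before sending it to the service. This is a sketch under assumptions: check_field and is_language_analyzer are hypothetical helpers, and the way language analyzers are recognized (by a name ending in .microsoft or .lucene) is a heuristic, not something the service exposes.

```python
def is_language_analyzer(name):
    # Heuristic assumption: language analyzers are named "<language>.microsoft"
    # or "<language>.lucene" (for example en.microsoft or fr.lucene).
    prefix, _, suffix = name.rpartition(".")
    return suffix in ("microsoft", "lucene") and not prefix.startswith("standard")


def check_field(field):
    """Return a list of rule violations for one field definition (empty if none)."""
    problems = []
    analyzer = field.get("analyzer")
    index_analyzer = field.get("indexAnalyzer")
    search_analyzer = field.get("searchAnalyzer")

    # indexAnalyzer and searchAnalyzer are set together.
    if (index_analyzer is None) != (search_analyzer is None):
        problems.append("indexAnalyzer and searchAnalyzer must be set together")

    # They replace the analyzer property, which must then be null.
    if analyzer is not None and (index_analyzer is not None or search_analyzer is not None):
        problems.append("analyzer must be null when indexAnalyzer/searchAnalyzer are set")

    # Language analyzers can only be assigned through the analyzer property.
    for prop, value in (("indexAnalyzer", index_analyzer), ("searchAnalyzer", search_analyzer)):
        if value is not None and is_language_analyzer(value):
            problems.append(f"{prop} does not support language analyzers")
    return problems


print(check_field({"name": "Description", "analyzer": "en.microsoft"}))
print(check_field({"name": "Description", "indexAnalyzer": "fr.lucene", "searchAnalyzer": "fr.lucene"}))
```

Both field definitions shown in steps 1 and 2 pass this check. The second call in the sketch does not, because it tries to assign a language analyzer through the split properties.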

When to add analyzers

The best time to add and assign analyzers is during active development, when dropping and recreating indexes is routine.

Because analyzers are used to tokenize terms, you should assign an analyzer when the field is created. In fact, assigning analyzer or indexAnalyzer to a field that has already been physically created is not allowed (although you can change the searchAnalyzer property at any time with no impact to the index).

To change the analyzer of an existing field, you'll have to rebuild the index entirely (you cannot rebuild individual fields). For indexes in production, you can defer a rebuild by creating a new field with the new analyzer assignment, and start using it in place of the old one. Use Update Index to incorporate the new field and mergeOrUpload to populate it. Later, as part of planned index servicing, you can clean up the index to remove obsolete fields.

To add a new field to an existing index, call Update Index to add the field, and mergeOrUpload to populate it.
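
The deferred-rebuild approach comes down to two payloads: an updated index definition that contains the extra field, and an indexing batch that fills it. The sketch below only builds those payloads and does not call the service. The field name text_v2, the helper names, and the sample document are hypothetical; the "value" and "@search.action" shape is assumed from the indexing REST API.

```python
import json

# Hypothetical names: the old field "text" keeps its analyzer, and the new
# field "text_v2" carries the new analyzer assignment.
new_field = {"name": "text_v2", "type": "Edm.String", "searchable": True, "analyzer": "en.microsoft"}


def add_field(index_definition, field):
    """Return a copy of the index definition with one more field (Update Index body)."""
    updated = dict(index_definition)
    updated["fields"] = list(index_definition["fields"]) + [field]
    return updated


def merge_or_upload_batch(documents, key_field, source_field, target_field):
    """Build a batch that copies source_field into target_field for each document."""
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                key_field: doc[key_field],
                target_field: doc[source_field],
            }
            for doc in documents
        ]
    }


index = {
    "name": "myindex",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
        {"name": "text", "type": "Edm.String", "searchable": True},
    ],
}
documents = [{"id": "1", "text": "Grand Hotel"}]   # made-up sample document

print(json.dumps(add_field(index, new_field), indent=2))
print(json.dumps(merge_or_upload_batch(documents, "id", "text", "text_v2"), indent=2))
```

Because mergeOrUpload merges into existing documents by key, only the new field needs to be sent. The remaining fields of each document stay as they are.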

To add a custom analyzer to an existing index, pass the allowIndexDowntime flag in Update Index if you want to avoid this error:

"Index update not allowed because it would cause downtime. In order to add new analyzers, tokenizers, token filters, or character filters to an existing index, set the 'allowIndexDowntime' query parameter to 'true' in the index update request. Note that this operation will put your index offline for at least a few seconds, causing your indexing and query requests to fail. Performance and write availability of the index can be impaired for several minutes after the index is updated, or longer for very large indexes."
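
As the error text says, allowIndexDowntime is a query parameter on the index update request. A minimal sketch of building that request URL follows; the service name, index name, and api-version value are placeholders to replace with your own, and update_index_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def update_index_url(service_name, index_name, api_version="2020-06-30", allow_downtime=False):
    """Build the URL for an Update Index (PUT) request.

    The api_version default is an assumption; use the version your service targets.
    """
    params = {"api-version": api_version}
    if allow_downtime:
        # Opt in to the brief offline period described in the error message.
        params["allowIndexDowntime"] = "true"
    return f"https://{service_name}.search.windows.net/indexes/{index_name}?{urlencode(params)}"


print(update_index_url("my-service", "myindex", allow_downtime=True))
```

Keeping the flag off by default means an update that would take the index offline fails loudly unless you have explicitly accepted the downtime.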

Recommendations for working with analyzers

This section offers advice on how to work with analyzers.

One analyzer for read-write unless you have specific requirements

Azure Cognitive Search lets you specify different analyzers for indexing and search via additional indexAnalyzer and searchAnalyzer field properties. If unspecified, the analyzer set with the analyzer property is used for both indexing and searching. If analyzer is unspecified, the default Standard Lucene analyzer is used.

A general rule is to use the same analyzer for both indexing and querying, unless specific requirements dictate otherwise. Be sure to test thoroughly. When text processing differs between indexing time and search time, query terms can fail to match indexed terms.
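
The risk of a mismatch is easy to reproduce with a toy inverted index. In this sketch the two tokenizer functions are simplified stand-ins for keyword-style and whitespace-style analysis, the documents are made up, and the search requires every query term to match, which is a simplification of real query behavior.

```python
def keyword_tokens(text):
    return [text]            # the whole value becomes a single token


def whitespace_tokens(text):
    return text.split()      # split on whitespace only, case preserved


def build_index(docs, index_analyzer):
    """Map each token to the set of document ids that contain it."""
    inverted = {}
    for doc_id, text in docs.items():
        for token in index_analyzer(text):
            inverted.setdefault(token, set()).add(doc_id)
    return inverted


def search(inverted, query, search_analyzer):
    """Return ids of documents that contain every analyzed query term."""
    hits = [inverted.get(term, set()) for term in search_analyzer(query)]
    return set.intersection(*hits) if hits else set()


docs = {1: "Grand Hotel", 2: "Budget Motel"}   # made-up documents

aligned = build_index(docs, whitespace_tokens)
mismatched = build_index(docs, keyword_tokens)

print(search(aligned, "Grand Hotel", whitespace_tokens))      # same analysis on both sides
print(search(mismatched, "Grand Hotel", whitespace_tokens))   # different analysis on each side
```

In the mismatched case the index holds the single token "Grand Hotel" while the query looks for "Grand" and "Hotel" separately, so nothing lines up even though the text is identical.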

Test during active development

Overriding the standard analyzer requires an index rebuild. If possible, decide on which analyzers to use during active development, before rolling an index into production.

Inspect tokenized terms

If a search fails to return expected results, the most likely scenario is token discrepancies between term inputs on the query, and tokenized terms in the index. If the tokens aren't the same, matches fail to materialize. To inspect tokenizer output, we recommend using the Analyze API as an investigation tool. The response consists of tokens, as generated by a specific analyzer.
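
A small helper can make the comparison between query-side and index-side tokens mechanical. The sketch below assumes the Analyze API takes a body with "text" and "analyzer" properties and returns a "tokens" array whose entries have a "token" property; confirm both against the REST reference. The sample responses are hand-written for illustration, not captured from a service.

```python
import json


def analyze_request_body(text, analyzer):
    """Body for a POST to the index's analyze endpoint (shape is an assumption)."""
    return json.dumps({"text": text, "analyzer": analyzer})


def token_terms(response):
    """Extract the token strings from a parsed Analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


def missing_terms(query_tokens, indexed_tokens):
    """Query tokens that have no identical counterpart among the indexed tokens."""
    indexed = set(indexed_tokens)
    return [token for token in query_tokens if token not in indexed]


# Hand-written sample responses for the text "Wi-Fi" under two different analyzers.
index_side = {
    "tokens": [
        {"token": "wi", "startOffset": 0, "endOffset": 2, "position": 0},
        {"token": "fi", "startOffset": 3, "endOffset": 5, "position": 1},
    ]
}
query_side = {
    "tokens": [
        {"token": "Wi-Fi", "startOffset": 0, "endOffset": 5, "position": 0},
    ]
}

print(analyze_request_body("Wi-Fi", "keyword"))
print(missing_terms(token_terms(query_side), token_terms(index_side)))
```

Any term reported by missing_terms is a query token that cannot find a match in the index, which is the discrepancy described above.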

REST examples

The examples below show analyzer definitions for a few key scenarios.

Custom analyzer example

This example illustrates an analyzer definition with custom options. Custom options for char filters, tokenizers, and token filters are specified separately as named constructs, and then referenced in the analyzer definition. Predefined elements are used as-is and simply referenced by name.

Walking through this example:

  • Analyzers are a property of the field class for a searchable field.
  • A custom analyzer is part of an index definition. It might be lightly customized (for example, customizing a single option in one filter) or customized in multiple places.
  • In this case, the custom analyzer is "my_analyzer", which in turn uses a customized standard tokenizer "my_standard_tokenizer" and two token filters: a customized asciifolding filter "my_asciifolding" and lowercase.
  • It also defines two custom char filters, "map_dash" and "remove_whitespace". The first one replaces all dashes with underscores while the second one removes all spaces. Spaces need to be UTF-8 encoded in the mapping rules. The char filters are applied before tokenization and will affect the resulting tokens (the standard tokenizer breaks on dash and spaces but not on underscore).

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"my_analyzer"
        }
     ],
     "analyzers":[
        {
           "name":"my_analyzer",
           "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
           "charFilters":[
              "map_dash",
              "remove_whitespace"
           ],
           "tokenizer":"my_standard_tokenizer",
           "tokenFilters":[
              "my_asciifolding",
              "lowercase"
           ]
        }
     ],
     "charFilters":[
        {
           "name":"map_dash",
           "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
           "mappings":["-=>_"]
        },
        {
           "name":"remove_whitespace",
           "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
           "mappings":["\\u0020=>"]
        }
     ],
     "tokenizers":[
        {
           "name":"my_standard_tokenizer",
           "@odata.type":"#Microsoft.Azure.Search.StandardTokenizerV2",
           "maxTokenLength":20
        }
     ],
     "tokenFilters":[
        {
           "name":"my_asciifolding",
           "@odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
           "preserveOriginal":true
        }
     ]
  }
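
To get a feel for what "my_analyzer" does to a string, the pipeline can be approximated in plain Python. This is a toy model and not the Lucene implementation: the regular expression stands in for the standard tokenizer, Unicode decomposition stands in for ASCII folding, and splitting over-long tokens at maxTokenLength is an assumption about the tokenizer's behavior.

```python
import re
import unicodedata


def map_chars(text, mappings):
    # Mapping char filter: plain string replacement, applied before tokenization.
    for source, target in mappings:
        text = text.replace(source, target)
    return text


def tokenize(text, max_token_length=20):
    # Stand-in for the standard tokenizer: letters, digits, and underscores stay
    # together; longer runs are cut into pieces of at most max_token_length.
    tokens = []
    for run in re.findall(r"\w+", text):
        tokens.extend(run[i:i + max_token_length] for i in range(0, len(run), max_token_length))
    return tokens


def ascii_fold(token):
    # Approximate ASCII folding by dropping combining marks after decomposition.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_analyzer(text):
    text = map_chars(text, [("-", "_")])     # map_dash
    text = map_chars(text, [(" ", "")])      # remove_whitespace
    tokens = []
    for token in tokenize(text):
        folded = ascii_fold(token)           # my_asciifolding
        tokens.append(folded)
        if folded != token:                  # preserveOriginal keeps the unfolded form too
            tokens.append(token)
    return [token.lower() for token in tokens]   # lowercase


print(my_analyzer("Caf\u00e9-Bar Deluxe"))
```

Because the dash becomes an underscore and the space is removed before tokenization, the whole input survives as a single token, emitted once in folded form and once in its original accented form.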

Per-field analyzer assignment example

The Standard analyzer is the default. Suppose you want to replace the default with a different predefined analyzer, such as the pattern analyzer. If you are not setting custom options, you only need to specify it by name in the field definition.

The "analyzer" element overrides the Standard analyzer on a field-by-field basis. There is no global override. In this example, text1 uses the pattern analyzer and text2, which doesn't specify an analyzer, uses the default.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text1",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"pattern"
        },
        {
           "name":"text2",
           "type":"Edm.String",
           "searchable":true
        }
     ]
  }

Mixing analyzers for indexing and search operations

The APIs include additional index attributes for specifying different analyzers for indexing and search. The searchAnalyzer and indexAnalyzer attributes must be specified as a pair, replacing the single analyzer attribute.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        }
     ]
  }

Language analyzer example

Fields containing strings in different languages can use a language analyzer, while other fields retain the default (or use some other predefined or custom analyzer). If you use a language analyzer, it must be used for both indexing and search operations. Fields that use a language analyzer cannot have different analyzers for indexing and search.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        },
        {
           "name":"text_fr",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"fr.lucene"
        }
     ]
  }

C# examples

If you are using the .NET SDK code samples, you can append these examples to use or configure analyzers.

Assign a language analyzer

Any analyzer that is used as-is, with no configuration, is specified on a field definition. There is no requirement for creating an entry in the [analyzers] section of the index.

This example assigns the Microsoft English analyzer and the Lucene French analyzer to description fields. It's a snippet taken from a larger definition of the hotels index, created using the Hotel class in the hotels.cs file of the DotNetHowTo sample.

Apply the Analyzer attribute, passing a value of the AnalyzerName type that identifies a text analyzer supported in Azure Cognitive Search.

    public partial class Hotel
    {
       . . . 

        [IsSearchable]
        [Analyzer(AnalyzerName.AsString.EnMicrosoft)]
        [JsonProperty("description")]
        public string Description { get; set; }

        [IsSearchable]
        [Analyzer(AnalyzerName.AsString.FrLucene)]
        [JsonProperty("description_fr")]
        public string DescriptionFr { get; set; }

      . . .
    }

Define a custom analyzer

When customization or configuration is required, you will need to add an analyzer construct to an index. Once you define it, you can add it to the field definition as demonstrated in the previous example.

Create a CustomAnalyzer object. For more examples, see CustomAnalyzerTests.cs.

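    var definition = new Index()
    {
        Name = "hotels",
        Fields = FieldBuilder.BuildForType<Hotel>(),
        Analyzers = new[]
        {
            // Custom analyzer: URL/email-aware tokenizer followed by a lowercase token filter.
            new CustomAnalyzer()
            {
                Name = "url-analyze",
                Tokenizer = TokenizerName.UaxUrlEmail,
                TokenFilters = new[] { TokenFilterName.Lowercase }
            }
        },
    };

    // Create the index, including the analyzer definition, on the service.
    serviceClient.Indexes.Create(definition);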

Next steps

See also

Search Documents REST API

Simple query syntax

Full Lucene query syntax

Handle search results