Add language analyzers to string fields in an Azure Cognitive Search index

A language analyzer is a specific type of text analyzer that performs lexical analysis using the linguistic rules of the target language. Every searchable field has an analyzer property. If your content consists of translated strings, such as separate fields for English and Chinese text, you can specify a language analyzer on each field to access the rich linguistic capabilities of those analyzers.

When to use a language analyzer

Consider a language analyzer when awareness of word or sentence structure adds value to text parsing. A common example is the association of irregular verb forms ("bring" and "brought") or plural nouns ("mice" and "mouse"). Without linguistic awareness, these strings are parsed on physical characteristics alone, which fails to catch the connection. Because large chunks of text are more likely to have this content, fields consisting of descriptions, reviews, or summaries are good candidates for a language analyzer.

Also consider language analyzers when content consists of non-Western language strings. While the default analyzer is language-agnostic, the concept of using spaces and special characters (hyphens and slashes) to separate strings tends to be more applicable to Western languages than to non-Western ones.

For example, in Chinese, Japanese, Korean (CJK), and other Asian languages, a space is not necessarily a word delimiter. Consider the following Japanese string. Because it has no spaces, a language-agnostic analyzer would likely analyze the entire string as one token, when in fact the string is a phrase.

これは私たちの銀河系の中ではもっとも重く明るいクラスの球状星団です。
(This is the heaviest and brightest group of spherical stars in our galaxy.)

For the example above, a successful query would have to include the full token, or a partial token using a suffix wildcard, resulting in an unnatural and limiting search experience.

A better experience is to search for individual words: 明るい (Bright), 私たちの (Our), 銀河系 (Galaxy). Using one of the Japanese analyzers available in Cognitive Search is more likely to unlock this behavior, because those analyzers are better equipped to split the chunk of text into meaningful words in the target language.
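The tokenization problem described above can be illustrated with a minimal sketch. The splitter below is a toy rule written for this illustration (it breaks only on whitespace, hyphens, and slashes); it is an assumption and not the service's default analyzer, and the English sample string is made up.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Toy language-agnostic rule: split on whitespace, hyphens, and slashes."""
    return [token for token in re.split(r"[\s\-/]+", text) if token]


# Made-up English sample: the separators produce several searchable tokens.
english = "the heaviest and brightest globular-cluster class"

# Japanese sample from the text: no spaces, so the toy rule finds no boundary.
japanese = "これは私たちの銀河系の中ではもっとも重く明るいクラスの球状星団です。"

english_tokens = whitespace_tokenize(english)
japanese_tokens = whitespace_tokenize(japanese)

print(len(english_tokens))   # 7
print(len(japanese_tokens))  # 1 (the whole sentence is a single token)
```

With only one token, a query for 銀河系 alone would not match under this rule, which is the limitation a Japanese language analyzer is meant to address.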

Comparing Lucene and Microsoft analyzers

Azure Cognitive Search supports 35 language analyzers backed by Lucene, and 50 language analyzers backed by proprietary Microsoft natural language processing technology used in Office and Bing.

Some developers might prefer the more familiar, simple, open-source solution of Lucene. Lucene language analyzers are faster, but the Microsoft analyzers have advanced capabilities, such as lemmatization, word decompounding (in languages like German, Danish, Dutch, Swedish, Norwegian, Estonian, Finnish, Hungarian, and Slovak), and entity recognition (URLs, emails, dates, numbers). If possible, run comparisons of both the Microsoft and Lucene analyzers to decide which one is a better fit.

Indexing with Microsoft analyzers is on average two to three times slower than with their Lucene equivalents, depending on the language. Search performance should not be significantly affected for average-size queries.

English analyzers

The default analyzer is Standard Lucene, which works well for English, but perhaps not as well as Lucene's English analyzer or Microsoft's English analyzer.

  • Lucene's English analyzer extends the standard analyzer. It removes possessives (trailing 's) from words, applies stemming as per the Porter stemming algorithm, and removes English stop words.

  • Microsoft's English analyzer performs lemmatization instead of stemming. This means it can handle inflected and irregular word forms much better, which results in more relevant search results.
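The difference between stemming and lemmatization can be sketched with a toy example. Everything below is an illustration under stated assumptions: `naive_stem` is a simplistic suffix stripper (not the Porter algorithm), and `LEMMAS` is a tiny hand-written lookup table (not the data used by either analyzer).

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper, NOT the Porter algorithm: removes one common ending."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as "bring" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical, hand-written lemma table covering a few irregular forms.
LEMMAS = {"brought": "bring", "mice": "mouse"}


def toy_lemmatize(word: str) -> str:
    """Look up irregular forms first, then fall back to suffix stripping."""
    return LEMMAS.get(word, naive_stem(word))


for word in ("walked", "hotels", "brought", "mice"):
    print(word, naive_stem(word), toy_lemmatize(word))
```

Suffix stripping handles regular forms such as "walked", but it leaves "brought" and "mice" unchanged, so they never meet "bring" and "mouse" in the index. The dictionary-backed step is what connects them.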

Configuring analyzers

Language analyzers are used as-is. For each field in the index definition, you can set the analyzer property to an analyzer name that specifies the language and linguistics stack (Microsoft or Lucene). The same analyzer is applied both when indexing and when querying that field. For example, you can have separate fields for English, French, and Spanish hotel descriptions that exist side by side in the same index.

Note

It is not possible to use a different language analyzer at indexing time than at query time for a field. That capability is reserved for custom analyzers. For this reason, if you try to set the searchAnalyzer or indexAnalyzer properties to the name of a language analyzer, the REST API returns an error response. You must use the analyzer property instead.
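As a rough sketch of the configuration described above, the following builds an index definition body in which each description field sets only the analyzer property. The index name `hotels` and the field names (`hotelId`, `description_en`, `description_fr`, `description_es`) are hypothetical; the analyzer names come from the language analyzer list in this article.

```python
import json

# Hypothetical index with one description field per language.
index_definition = {
    "name": "hotels",
    "fields": [
        {"name": "hotelId", "type": "Edm.String", "key": True, "searchable": False},
        {
            "name": "description_en",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "en.microsoft",  # Microsoft stack, English
        },
        {
            "name": "description_fr",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "fr.microsoft",  # Microsoft stack, French
        },
        {
            "name": "description_es",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "es.lucene",  # Lucene stack, Spanish
        },
    ],
}

# Only "analyzer" is set; searchAnalyzer and indexAnalyzer are left out
# because they cannot name a language analyzer.
print(json.dumps(index_definition, indent=2))
```

The printed JSON is the kind of body that would be sent in a Create Index request; the request URL, API version, and authentication are omitted here.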

Use the searchFields query parameter to specify which language-specific field to search against in your queries. You can review query examples that include the analyzer property in Search Documents.
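A minimal sketch of scoping a query to one language-specific field follows. The helper `build_query` and the field name `description_fr` are hypothetical names chosen for illustration; `search` and `searchFields` are the query parameters, with searchFields given as a comma-separated string of field names.

```python
import json


def build_query(search_text: str, *language_fields: str) -> dict:
    """Build a search request body restricted to the given language-specific fields."""
    return {
        "search": search_text,
        "searchFields": ", ".join(language_fields),
    }


# Search French text only against the (hypothetical) French description field.
french_query = build_query("plage", "description_fr")
print(json.dumps(french_query, ensure_ascii=False))
```

Because the query text is analyzed with the analyzer assigned to each searched field, targeting the field that matches the query language keeps indexing-time and query-time analysis consistent.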

For more information about index properties, see Create Index (Azure Cognitive Search REST API). For more information about analysis in Azure Cognitive Search, see Analyzers in Azure Cognitive Search.

Language analyzer list

Below is the list of supported languages, together with their Lucene and Microsoft analyzer names.

Language                Microsoft analyzer name   Lucene analyzer name
Arabic                  ar.microsoft              ar.lucene
Armenian                                          hy.lucene
Bangla                  bn.microsoft
Basque                                            eu.lucene
Bulgarian               bg.microsoft              bg.lucene
Catalan                 ca.microsoft              ca.lucene
Chinese Simplified      zh-Hans.microsoft         zh-Hans.lucene
Chinese Traditional     zh-Hant.microsoft         zh-Hant.lucene
Croatian                hr.microsoft
Czech                   cs.microsoft              cs.lucene
Danish                  da.microsoft              da.lucene
Dutch                   nl.microsoft              nl.lucene
English                 en.microsoft              en.lucene
Estonian                et.microsoft
Finnish                 fi.microsoft              fi.lucene
French                  fr.microsoft              fr.lucene
Galician                                          gl.lucene
German                  de.microsoft              de.lucene
Greek                   el.microsoft              el.lucene
Gujarati                gu.microsoft
Hebrew                  he.microsoft
Hindi                   hi.microsoft              hi.lucene
Hungarian               hu.microsoft              hu.lucene
Icelandic               is.microsoft
Indonesian (Bahasa)     id.microsoft              id.lucene
Irish                                             ga.lucene
Italian                 it.microsoft              it.lucene
Japanese                ja.microsoft              ja.lucene
Kannada                 kn.microsoft
Korean                  ko.microsoft              ko.lucene
Latvian                 lv.microsoft              lv.lucene
Lithuanian              lt.microsoft
Malayalam               ml.microsoft
Malay (Latin)           ms.microsoft
Marathi                 mr.microsoft
Norwegian               nb.microsoft              no.lucene
Persian                                           fa.lucene
Polish                  pl.microsoft              pl.lucene
Portuguese (Brazil)     pt-Br.microsoft           pt-Br.lucene
Portuguese (Portugal)   pt-Pt.microsoft           pt-Pt.lucene
Punjabi                 pa.microsoft
Romanian                ro.microsoft              ro.lucene
Russian                 ru.microsoft              ru.lucene
Serbian (Cyrillic)      sr-cyrillic.microsoft
Serbian (Latin)         sr-latin.microsoft
Slovak                  sk.microsoft
Slovenian               sl.microsoft
Spanish                 es.microsoft              es.lucene
Swedish                 sv.microsoft              sv.lucene
Tamil                   ta.microsoft
Telugu                  te.microsoft
Thai                    th.microsoft              th.lucene
Turkish                 tr.microsoft              tr.lucene
Ukrainian               uk.microsoft
Urdu                    ur.microsoft
Vietnamese              vi.microsoft

All analyzers with names annotated with Lucene are powered by Apache Lucene's language analyzers.

See also