Synonyms in Azure Cognitive Search

With synonym maps, you can associate equivalent terms, expanding the scope of a query without the user having to provide the term. For example, assuming "dog", "canine", and "puppy" are synonyms, a query on "canine" matches a document containing "dog".

Create synonyms

A synonym map is an asset that can be created once and used by many indexes. The service tier determines how many synonym maps you can create, ranging from three synonym maps for Free and Basic tiers, up to 20 for the Standard tiers.

You might create multiple synonym maps for different languages, such as English and French versions, or for lexicons if your content includes technical or obscure terminology. Although you can create multiple synonym maps in your search service, a field can use only one of them.

A synonym map consists of a name, a format, and rules that function as synonym map entries. The only supported format is solr, and the solr format determines rule construction.

POST /synonymmaps?api-version=2020-06-30
{
    "name": "geo-synonyms",
    "format": "solr",
    "synonyms": "USA, United States, United States of America\nWashington, Wash., WA => WA\n"
}

To create a synonym map, use the Create Synonym Map (REST API) or an Azure SDK. For C# developers, we recommend starting with Add Synonyms in Azure Cognitive Search using C#.
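As a minimal sketch of preparing the request body shown above, the following Python builds the JSON payload from a list of rules. The helper name build_synonym_map is hypothetical, and the snippet only constructs and prints the body; it does not send the request.

```python
import json

def build_synonym_map(name, rules):
    """Return a dict shaped like the Create Synonym Map request body.

    build_synonym_map is a hypothetical helper, not part of any SDK.
    """
    # Rules are delimited by the newline character.
    synonyms = "".join(rule + "\n" for rule in rules)
    return {"name": name, "format": "solr", "synonyms": synonyms}

body = build_synonym_map(
    "geo-synonyms",
    [
        "USA, United States, United States of America",
        "Washington, Wash., WA => WA",
    ],
)

# json.dumps encodes each newline as \n, so the body is valid JSON.
print(json.dumps(body, indent=4))
```

Building the string from a list keeps one rule per element and avoids stray whitespace or raw line breaks inside the JSON string.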

Define rules

Mapping rules adhere to the open-source synonym filter specification of Apache Solr, described in this document: SynonymFilter. The solr format supports two kinds of rules:

  • equivalency (where terms are equal substitutes in the query)

  • explicit mappings (where terms are mapped to one explicit term prior to querying)

Each rule must be delimited by the newline character (\n). You can define up to 5,000 rules per synonym map in a free service and 20,000 rules per map in other tiers. Each rule can have up to 20 expansions (or items in a rule). For more information, see Synonym limits.
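The limits above can be checked locally before uploading a map. The sketch below is illustrative only: count_items and check_limits are hypothetical helpers, and counting every comma-separated item on both sides of => as an expansion is an assumption about how the service counts them.

```python
import re

MAX_ITEMS_PER_RULE = 20

def count_items(rule):
    """Count the items in one rule (assumed counting method).

    Splits on => and on commas that are not preceded by a backslash.
    An escaped backslash followed by a comma is not handled.
    """
    parts = re.split(r"(?<!\\),|=>", rule)
    return sum(1 for part in parts if part.strip())

def check_limits(rules, free_tier=False):
    """Return a list of human-readable problems; empty if none were found."""
    max_rules = 5000 if free_tier else 20000
    problems = []
    if len(rules) > max_rules:
        problems.append(f"{len(rules)} rules exceeds the limit of {max_rules}")
    for index, rule in enumerate(rules):
        items = count_items(rule)
        if items > MAX_ITEMS_PER_RULE:
            problems.append(f"rule {index} has {items} items")
    return problems

print(check_limits(["USA, United States, United States of America"]))
```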

Query parsers will lower-case any upper or mixed case terms, but if you want to preserve special characters in the string, such as a comma or dash, add the appropriate escape characters when creating the synonym map.

Equivalency rules

Rules for equivalent terms are comma-delimited within the same rule. In the first example, a query on USA will expand to USA OR "United States" OR "United States of America". Notice that if you want to match on a phrase, the query itself must be a quote-enclosed phrase query.

In the equivalence case, a query for dog will expand the query to also include puppy and canine.

{
    "format": "solr",
    "synonyms": "USA, United States, United States of America\ndog, puppy, canine\ncoffee, latte, cup of joe, java\n"
}
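To make the expansion concrete, here is a toy Python simulation of an equivalency rule. It is not how the search service is implemented; expand_equivalent is a hypothetical helper that only mimics the documented outcome for single-term queries.

```python
def expand_equivalent(term, rules):
    """Return an OR expression covering the term and its equivalents.

    Toy model: case-insensitive lookup, multi-word synonyms quoted as phrases.
    """
    for rule in rules:
        terms = [t.strip() for t in rule.split(",")]
        if term.lower() in [t.lower() for t in terms]:
            others = [t for t in terms if t.lower() != term.lower()]
            ordered = [term] + others
            return " OR ".join(f'"{t}"' if " " in t else t for t in ordered)
    # No rule mentions the term, so the query is left unchanged.
    return term

rules = [
    "USA, United States, United States of America",
    "dog, puppy, canine",
    "coffee, latte, cup of joe, java",
]

print(expand_equivalent("USA", rules))
# USA OR "United States" OR "United States of America"
print(expand_equivalent("dog", rules))
# dog OR puppy OR canine
```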

Explicit mapping

Rules for an explicit mapping are denoted by an arrow =>. When specified, a term sequence of a search query that matches the left-hand side of => will be replaced with the alternatives on the right-hand side at query time.

In the explicit case, a query for Washington, Wash. or WA will be rewritten as WA, and the query engine will only look for matches on the term WA. Explicit mapping only applies in the direction specified, and does not rewrite the query WA to Washington in this case.

{
    "format": "solr",
    "synonyms": "Washington, Wash., WA => WA\nCalifornia, Calif., CA => CA\n"
}
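The one-directional rewrite can be illustrated with a small Python simulation. This is a toy model rather than the service's implementation: rewrite_explicit is a hypothetical helper, and it assumes a single replacement term on the right-hand side.

```python
def rewrite_explicit(term, rules):
    """Rewrite the term to the right-hand side of the first matching rule."""
    for rule in rules:
        left, arrow, right = rule.partition("=>")
        if not arrow:
            # Not an explicit mapping; this toy model skips it.
            continue
        sources = [t.strip().lower() for t in left.split(",")]
        if term.lower() in sources:
            return right.strip()
    # Terms that appear only on the right-hand side are never rewritten back.
    return term

rules = [
    "Washington, Wash., WA => WA",
    "California, Calif., CA => CA",
]

print(rewrite_explicit("Washington", rules))  # WA
print(rewrite_explicit("Calif.", rules))      # CA
print(rewrite_explicit("WA", rules))          # WA
```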

Escaping special characters

Synonyms are analyzed during query processing. If you need to define synonyms that contain commas or other special characters, you can escape them with a backslash, as in this example:

{
"format": "solr",
"synonyms": "WA\, USA, WA, Washington\n"
}

Since the backslash is itself a special character in other languages like JSON and C#, you will probably need to double-escape it. For example, the JSON sent to the REST API for the above synonym map would look like this:

{
"format":"solr",
"synonyms": "WA\\, USA, WA, Washington"
}
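When the body is produced by a JSON serializer, the second level of escaping happens automatically. The Python sketch below uses the standard json module; the raw string holds the rule with a single backslash, and the serialized output carries the doubled backslash.

```python
import json

# Raw string: the rule contains exactly one backslash before the comma.
rule = r"WA\, USA, WA, Washington"

body = json.dumps({"format": "solr", "synonyms": rule})
print(body)
# {"format": "solr", "synonyms": "WA\\, USA, WA, Washington"}

# Decoding the JSON recovers the single-backslash rule.
assert json.loads(body)["synonyms"] == rule
```

Writing the rule as a raw string and letting the serializer escape it avoids counting backslashes by hand.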

Upload and manage synonym maps

You can create or update a synonym map without disrupting query and indexing workloads. A synonym map is a standalone object (like indexes or data sources), and as long as no field is using it, updates won't cause indexing or queries to fail. However, once you add a synonym map to a field definition, if you then delete the synonym map, any query that includes the fields in question will fail with a 404 error.

Creating, updating, and deleting a synonym map is always a whole-document operation, meaning that you cannot update or delete parts of the synonym map incrementally. Updating even a single rule requires a reload.
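Because updates are whole-document, changing one rule means rebuilding the full synonyms string and sending the entire map again. A minimal Python sketch of that local step follows; replace_rule is a hypothetical helper, and the upload itself is not shown.

```python
def replace_rule(synonym_map, old_rule, new_rule):
    """Return a new, complete synonym map with one rule swapped out."""
    rules = synonym_map["synonyms"].split("\n")
    rules = [new_rule if rule.strip() == old_rule else rule for rule in rules]
    # The whole map is returned; there is no partial update to send.
    return {**synonym_map, "synonyms": "\n".join(rules)}

current = {
    "name": "geo-synonyms",
    "format": "solr",
    "synonyms": "USA, United States\nWashington, Wash., WA => WA\n",
}

updated = replace_rule(
    current,
    "USA, United States",
    "USA, United States, United States of America",
)
print(updated["synonyms"])
```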

Assign synonyms to fields

After uploading a synonym map, you can enable synonyms on fields of type Edm.String or Collection(Edm.String) that have "searchable":true. As noted, a field definition can use only one synonym map.

POST /indexes?api-version=2020-06-30
{
    "name":"hotels-sample-index",
    "fields":[
        {
            "name":"description",
            "type":"Edm.String",
            "searchable":true,
            "synonymMaps":[
            "en-synonyms"
            ]
        },
        {
            "name":"description_fr",
            "type":"Edm.String",
            "searchable":true,
            "analyzer":"fr.microsoft",
            "synonymMaps":[
            "fr-synonyms"
            ]
        }
    ]
}
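The field requirements above can be checked before submitting an index definition. The Python sketch below is illustrative: synonym_field_problems is a hypothetical helper, and it requires "searchable" to be stated explicitly, which is an assumption of the sketch rather than a documented rule.

```python
ALLOWED_TYPES = {"Edm.String", "Collection(Edm.String)"}

def synonym_field_problems(field):
    """Return a list of problems with a field's synonym map assignment."""
    maps = field.get("synonymMaps", [])
    problems = []
    if not maps:
        return problems  # No synonym map assigned, nothing to check.
    if field.get("type") not in ALLOWED_TYPES:
        problems.append("type must be Edm.String or Collection(Edm.String)")
    if not field.get("searchable", False):
        problems.append("field must be searchable")
    if len(maps) > 1:
        problems.append("a field can use only one synonym map")
    return problems

description = {
    "name": "description",
    "type": "Edm.String",
    "searchable": True,
    "synonymMaps": ["en-synonyms"],
}
print(synonym_field_problems(description))  # []
```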

Query on equivalent or mapped fields

Adding synonyms does not impose new requirements on query construction. You can issue term and phrase queries just as you did before the addition of synonyms. The only difference is that if a query term exists in the synonym map, the query engine will either expand or rewrite the term or phrase, depending on the rule.

How synonyms are used during query execution

Synonyms are a query expansion technique that supplements the contents of an index with equivalent terms, but only for fields that have a synonym assignment. If a field-scoped query excludes a synonym-enabled field, you won't see matches from the synonym map.

For synonym-enabled fields, synonyms are subject to the same text analysis as the associated field. For example, if a field is analyzed using the standard Lucene analyzer, synonym terms will also be subject to the standard Lucene analyzer at query time. If you want to preserve punctuation, such as periods or dashes, in the synonym term, apply a content-preserving analyzer on the field.

Internally, the synonyms feature rewrites the original query with synonyms with the OR operator. For this reason, hit highlighting and scoring profiles treat the original term and synonyms as equivalent.

Synonyms apply to free-form text queries only and are not supported for filters, facets, autocomplete, or suggestions. Autocomplete and suggestions are based only on the original term; synonym matches do not appear in the response.

Synonym expansions do not apply to wildcard search terms; prefix, fuzzy, and regex terms aren't expanded.

If you need to do a single query that applies synonym expansion and wildcard, regex, or fuzzy searches, you can combine the queries using the OR syntax. For example, to combine synonyms with wildcards for simple query syntax, the term would be <query> | <query>*.
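The combined form can be generated from a single term. The Python sketch below builds the simple-syntax string described above; with_wildcard is a hypothetical helper and does not escape any special characters in the term.

```python
def with_wildcard(term):
    """Combine a plain term (eligible for synonym expansion) with a prefix wildcard."""
    # In simple query syntax, | is the OR operator and * marks a prefix search.
    return f"{term} | {term}*"

print(with_wildcard("dog"))  # dog | dog*
```

The plain term on the left remains eligible for synonym expansion, while the wildcard term on the right is matched as written.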

If you have an existing index in a development (non-production) environment, experiment with a small dictionary to see how the addition of synonyms changes the search experience, including impact on scoring profiles, hit highlighting, and suggestions.

Next steps