Semantic search in Azure Cognitive Search

Important

Semantic search is in public preview, available through the preview REST API and the portal. Preview features are offered as-is, under Supplemental Terms of Use, and are not guaranteed to have the same implementation at general availability. These features are billable. For more information, see Availability and pricing.

Semantic search is a collection of query-related capabilities that add semantic relevance and language understanding to search results. This article is a high-level introduction to semantic search as a whole, with descriptions of each feature and how the features work together. The embedded video describes the technology, and the section at the end covers availability and pricing.

We recommend reviewing this article for background, but if you'd rather get started right away, follow these steps:

  1. Sign up for the preview, assuming a service that meets the regional and tier requirements.
  2. Create new queries or modify existing ones to return semantic captions and highlights.
  3. Add a few more properties to also return semantic answers.
  4. Optionally, include a spell check query property to maximize precision and recall.

Semantic search is an optional layer of search-related AI that extends the traditional query execution pipeline with a semantic ranking model and returns additional properties that improve the user experience.

Semantic ranking looks for context and relatedness among terms, elevating matches that make more sense given the query. Language understanding finds captions and answers within your content that summarize the matching document or answer a question. These can then be rendered on a search results page for a more productive search experience.

State-of-the-art pretrained models are used for summarization and ranking. To maintain the fast performance that users expect from search, semantic summarization and ranking are applied to just the top 50 results, as scored by the default similarity scoring algorithm. Using those results as the document corpus, semantic ranking re-scores them based on the semantic strength of the match.
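The two-stage flow described above can be sketched in a few lines. This is only an illustration of the shape of the technique, not the service's implementation: the function name, the tuple layout, and the `semantic_score` callable are hypothetical, and the article does not say how results beyond the top 50 are handled, so this sketch simply leaves them out.

```python
from typing import Callable, List, Tuple

# The article states that semantic ranking is applied to the top 50 results.
TOP_K = 50


def rerank_top_k(
    results: List[Tuple[str, float]],
    semantic_score: Callable[[str], float],
    k: int = TOP_K,
) -> List[Tuple[str, float]]:
    """Re-score the k best initial results with a (hypothetical) semantic scorer.

    `results` holds (doc_id, initial_score) pairs from the default
    similarity scoring algorithm. Only the top k are re-scored.
    """
    # Stage 1: order by the initial similarity score and keep the top k.
    head = sorted(results, key=lambda r: r[1], reverse=True)[:k]
    # Stage 2: replace each score with the semantic score and re-sort.
    rescored = [(doc_id, semantic_score(doc_id)) for doc_id, _ in head]
    rescored.sort(key=lambda r: r[1], reverse=True)
    return rescored
```

Because the second stage only sees what the first stage returned, a document that the initial scoring ranks below the cutoff is never considered by the semantic scorer, however relevant it might be.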

The underlying technology is from Bing and Microsoft Research, and is integrated into the Cognitive Search infrastructure as an add-on feature. For more information about the research and AI investments backing semantic search, see How AI from Bing is powering Azure Cognitive Search (Microsoft Research Blog).

Components and workflow

Semantic search improves precision and recall with the addition of the following capabilities:

| Feature | Description |
| --- | --- |
| Spell check | Corrects typos before the query terms reach the search engine. |
| Semantic ranking | Uses the context or semantic meaning to compute a new relevance score. |
| Semantic captions and highlights | Sentences and phrases from a document that best summarize the content, with highlights over key passages for easy scanning. Captions that summarize a result are useful when individual content fields are too dense for the results page. Highlighted text elevates the most relevant terms and phrases so that users can quickly determine why a match was considered relevant. |
| Semantic answers | An optional, additional substructure returned from a semantic query. It provides a direct answer to a query that looks like a question. |

Order of operations

Components of semantic search extend the existing query execution pipeline in both directions. If you enable spelling correction, the speller corrects typos at query onset, before terms reach the search engine.
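To make the ordering concrete, here is a minimal sketch of a correction step that runs before retrieval. It is a stand-in, not the service's speller: it uses Python's standard `difflib` fuzzy matching against a small hypothetical vocabulary, and the function name and cutoff value are assumptions made for illustration.

```python
import difflib
from typing import Iterable


def correct_query(query: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace each query term with its closest vocabulary entry, if any.

    Terms with no sufficiently close match are passed through unchanged.
    This runs before the query is handed to the search engine.
    """
    vocab = list(vocabulary)
    corrected = []
    for term in query.split():
        # get_close_matches returns the best matches at or above the cutoff.
        matches = difflib.get_close_matches(term.lower(), vocab, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else term)
    return " ".join(corrected)
```

The corrected string, rather than the text the user typed, is what the retrieval stage would then parse and match against the index.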

Figure: Semantic components in query execution

Query execution proceeds as usual, with term parsing, analysis, and scans over the inverted indexes. The engine retrieves documents using token matching, and scores the results using the default similarity scoring algorithm. Scores are calculated based on the degree of linguistic similarity between query terms and matching terms in the index. If you defined scoring profiles, they are also applied at this stage. Results are then passed to the semantic search subsystem.

In the preparation step, the document corpus returned from the initial result set is analyzed at the sentence and paragraph level to find passages that summarize each document. In contrast with keyword search, this step uses machine reading and comprehension to evaluate the content. Through this stage of content processing, a semantic query returns captions and answers. To formulate them, semantic search uses language representation to extract and highlight key passages that best summarize a result. If the search query is a question and answers are requested, the response also includes a text passage that best answers the question, as expressed by the search query.

For both captions and answers, existing text is used in the formulation. The semantic models do not compose new sentences or phrases from the available content, nor do they apply logic to arrive at new conclusions. In short, the system will never return content that doesn't already exist.
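The extractive property can be illustrated with a toy example. The sketch below is not the semantic model: it substitutes simple term overlap for machine reading comprehension, and the function names and the `<em>` highlight markers are assumptions chosen for illustration. What it does share with the description above is that the caption is always a sentence copied verbatim from the document.

```python
import re
from typing import List, Set


def _terms(text: str) -> Set[str]:
    """Lowercased word tokens of a string."""
    return {t.lower() for t in re.findall(r"\w+", text)}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_caption(document: str, query: str) -> str:
    """Return the existing sentence that shares the most terms with the query."""
    sentences = split_sentences(document)
    if not sentences:
        return ""
    query_terms = _terms(query)
    # Term overlap stands in for the model's relevance judgment.
    return max(
        sentences,
        key=lambda s: sum(w.lower() in query_terms for w in re.findall(r"\w+", s)),
    )


def highlight(caption: str, query: str) -> str:
    """Wrap query terms in the caption with hypothetical <em> markers."""
    query_terms = _terms(query)
    return re.sub(
        r"\w+",
        lambda m: f"<em>{m.group(0)}</em>" if m.group(0).lower() in query_terms else m.group(0),
        caption,
    )
```

Nothing in this sketch can produce text that is absent from the input document, apart from the highlight markers, which mirrors the guarantee stated above.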

Results are then re-scored based on the conceptual similarity of query terms.

To use semantic capabilities in queries, you'll need to make small modifications to the search request, but no extra configuration or reindexing is required.
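As a rough idea of what those small modifications might look like, the sketch below builds a JSON request body in Python. The property names and values (`queryType`, `queryLanguage`, `searchFields`, `answers`, `speller`) are not given in this article; they are assumptions based on the preview REST API and should be checked against the API reference before use. The helper name is hypothetical, and no request is actually sent.

```python
import json
from typing import Dict, List


def build_semantic_query(
    search_text: str,
    search_fields: List[str],
    want_answers: bool = False,
    use_speller: bool = False,
) -> Dict[str, str]:
    """Build a request body for a semantic query (assumed property names)."""
    body = {
        "search": search_text,
        "queryType": "semantic",      # assumed: switches on semantic ranking
        "queryLanguage": "en-us",     # assumed: language of the query text
        "searchFields": ",".join(search_fields),
    }
    if want_answers:
        body["answers"] = "extractive|count-3"  # assumed: request up to 3 answers
    if use_speller:
        body["speller"] = "lexicon"             # assumed: enable spell check
    return body


if __name__ == "__main__":
    payload = build_semantic_query(
        "how do clouds form", ["title", "content"], want_answers=True
    )
    print(json.dumps(payload, indent=2))
```

The optional flags mirror the steps listed at the start of this article: answers and spell check are opt-in additions to a query that is otherwise an ordinary search request.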

Availability and pricing

Semantic capabilities are available through sign-up, on search services created at a Standard tier (S1, S2, or S3) and located in one of these regions: China North or China North 2.

Spell correction is available in the same regions, but has no tier restrictions. If you have an existing service that meets the tier and region criteria, you only need to sign up.

From the preview launch on March 2 through late April, spell correction and semantic ranking are offered free of charge. Later in April, the computational costs of running this functionality will become a billable event. The expected cost is about USD 500 per month for 250,000 queries. You can find detailed cost information in the Cognitive Search pricing page and in Estimate and manage costs.

Next steps

Sign up for the preview on a search service that meets the tier and regional requirements noted in the previous section.

When your service is ready, create a semantic query to see semantic ranking in action.