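Tutorial: Maximize relevance (RAG in Azure AI Search)

This tutorial shows how to improve the relevance of search results used in a RAG solution. Relevance tuning is an important factor in delivering a RAG solution that meets user expectations. In Azure AI Search, relevance tuning includes L2 semantic ranking and scoring profiles.

To implement these capabilities, you revisit the index schema to add configurations for semantic ranking and scoring profiles. You then rerun the queries against the updated schema.

In this tutorial, you modify the existing search index and queries to use:

  • L2 semantic ranking
  • A scoring profile for document boosting

This tutorial updates the search index created by the indexing pipeline. Updates don't affect existing content, so you don't need to rebuild the index or rerun the indexer.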

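Note

More relevance features are available in preview, including vector query weighting and minimum thresholds. This tutorial omits them because they're still in preview.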

Prerequisites

Download the sample

The sample notebook includes the updated index and query request.

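Run a baseline query for comparison

Let's start with a new query, "Are there any cloud formations specific to oceans and large bodies of water?"

To compare outcomes after adding relevance features, run the query against the existing index schema, before adding semantic ranking or a scoring profile.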

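from azure.identity import get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

# credential, AZURE_OPENAI_ACCOUNT, AZURE_SEARCH_SERVICE, and index_name
# are assumed to be defined in earlier notebook cells.
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)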

GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Focused query on cloud formations and bodies of water
query="Are there any cloud formations specific to oceans and large bodies of water?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

search_results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)

sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)

The output of this request might look similar to the following example.

Yes, there are cloud formations specific to oceans and large bodies of water. 
A notable example is "cloud streets," which are parallel rows of clouds that form over 
the Bering Strait in the Arctic Ocean. These cloud streets occur when wind blows from 
a cold surface like sea ice over warmer, moister air near the open ocean, leading to 
the formation of spinning air cylinders. Clouds form along the upward cycle of these cylinders, 
while skies remain clear along the downward cycle (Source: page-21.pdf).
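
To make the later before-and-after comparison more concrete, it can help to record which chunks were retrieved and with what scores, not only the LLM's answer. The following is a minimal sketch, not part of the sample notebook. The helper name `summarize_hits` and the sample values are hypothetical. It assumes each result behaves like a dictionary with an `@search.score` key, and that `@search.reranker_score` is empty unless the query used semantic ranking.

```python
from typing import Any, Iterable, List, Mapping, Optional, Tuple


def summarize_hits(
    documents: Iterable[Mapping[str, Any]],
) -> List[Tuple[str, float, Optional[float]]]:
    """Return (title, search score, reranker score) for each hit, in rank order."""
    summary = []
    for doc in documents:
        summary.append((
            doc["title"],
            doc["@search.score"],
            # None unless the query used semantic ranking.
            doc.get("@search.reranker_score"),
        ))
    return summary


# Hypothetical hits standing in for list(search_results); not real output.
baseline_hits = [
    {"title": "page-21.pdf", "chunk": "...", "@search.score": 0.033},
    {"title": "page-23.pdf", "chunk": "...", "@search.score": 0.031},
]
for title, score, reranker in summarize_hits(baseline_hits):
    print(f"{title}\tscore={score}\treranker={reranker}")
```

The search results are a one-pass iterator, so if you try this in the notebook, materialize them first (for example, `hits = list(search_results)`) and build both the summary and `sources_formatted` from that list.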

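Update the index for semantic ranking and scoring profiles

In a previous tutorial, you designed an index schema for RAG workloads. We purposely omitted relevance enhancements from that schema so that you could focus on the fundamentals. Deferring relevance to a separate exercise gives you a before-and-after comparison of search result quality once the updates are made.

  1. Update the import statements to include classes for semantic ranking and scoring profiles.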

     from azure.identity import DefaultAzureCredential
     from azure.identity import get_bearer_token_provider
     from azure.search.documents.indexes import SearchIndexClient
     from azure.search.documents.indexes.models import (
         SearchField,
         SearchFieldDataType,
         VectorSearch,
         HnswAlgorithmConfiguration,
         VectorSearchProfile,
         AzureOpenAIVectorizer,
         AzureOpenAIVectorizerParameters,
         SearchIndex,
         SemanticConfiguration,
         SemanticPrioritizedFields,
         SemanticField,
         SemanticSearch,
         ScoringProfile,
         TagScoringFunction,
         TagScoringParameters
     )
    
  2. Add the following semantic configuration to the search index. This example can be found in the update schema step in the notebook.

    # New semantic configuration
    semantic_config = SemanticConfiguration(
        name="my-semantic-config",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=SemanticField(field_name="title"),
            keywords_fields=[SemanticField(field_name="locations")],
            content_fields=[SemanticField(field_name="chunk")]
        )
    )
    
    # Create the semantic settings with the configuration
    semantic_search = SemanticSearch(configurations=[semantic_config])
    

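    A semantic configuration has a name and a prioritized list of fields that helps optimize the inputs to the semantic ranker. For more information, see Configure semantic ranking.

  3. Next, add a scoring profile definition. As with the semantic configuration, a scoring profile can be added to an index schema at any time. This example is also in the update schema step in the notebook, following the semantic configuration.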

    # New scoring profile
    scoring_profiles = [  
        ScoringProfile(  
            name="my-scoring-profile",
            functions=[
                TagScoringFunction(  
                    field_name="locations",  
                    boost=5.0,  
                    parameters=TagScoringParameters(  
                        tags_parameter="tags",  
                    ),  
                ) 
            ]
        )
    ]
    

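    This profile uses the tag function, which boosts the scores of documents where a match is found in the locations field. Recall that the search index has a vector field and multiple nonvector fields for title, chunks, and locations. The locations field is a string collection, and string collections can be boosted using the tag function in a scoring profile. For more information, see Add a scoring profile and Enhancing Search Relevance with Document Boosting (blog post).

  4. Update the index definition on the search service.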

    # Update the search index with the semantic configuration and scoring profile.
    # fields, vector_search, and index_client are assumed to be defined in earlier notebook cells.
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search,
        semantic_search=semantic_search,
        scoring_profiles=scoring_profiles
    )
    result = index_client.create_or_update_index(index)
    print(f"{result.name} updated")
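
After the update, you might want to confirm that both relevance features are present on the index before changing any queries. The sketch below is illustrative only and isn't part of the sample notebook. The helper name `relevance_features` is hypothetical, and it assumes the index object exposes `semantic_search.configurations` and `scoring_profiles`, as the `SearchIndex` built in the steps above does. A stand-in object is used so the example runs without a search service.

```python
from types import SimpleNamespace


def relevance_features(index) -> dict:
    """List the semantic configurations and scoring profiles defined on an index."""
    semantic = getattr(index, "semantic_search", None)
    configurations = getattr(semantic, "configurations", None) or []
    profiles = getattr(index, "scoring_profiles", None) or []
    return {
        "semantic_configurations": [c.name for c in configurations],
        "scoring_profiles": [p.name for p in profiles],
    }


# Stand-in with the same attribute names as SearchIndex, for illustration only.
sample_index = SimpleNamespace(
    semantic_search=SimpleNamespace(
        configurations=[SimpleNamespace(name="my-semantic-config")]
    ),
    scoring_profiles=[SimpleNamespace(name="my-scoring-profile")],
)
print(relevance_features(sample_index))
```

In the notebook, the same helper could be called on the live definition, for example `relevance_features(index_client.get_index(index_name))`.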
    

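Update queries for semantic ranking and scoring profiles

In a previous tutorial, you ran queries that execute on the search engine, passing the response and other information to an LLM for chat completion.

This example modifies the query request to include the semantic configuration and scoring profile.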

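# Import libraries
from azure.identity import get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

# credential, AZURE_OPENAI_ACCOUNT, AZURE_SEARCH_SERVICE, and index_name
# are assumed to be defined in earlier notebook cells.
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)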

# Prompt is unchanged in this update
GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Queries are unchanged in this update
query="Are there any cloud formations specific to oceans and large bodies of water?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Add query_type semantic and semantic_configuration_name
# Add scoring_profile and scoring_parameters
search_results = search_client.search(
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    scoring_profile="my-scoring-profile",
    scoring_parameters=["tags-ocean, 'sea surface', seas, surface"],
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)
sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)

The output of the semantically ranked and boosted query might look similar to the following example.

Yes, there are specific cloud formations influenced by oceans and large bodies of water:

- **Stratus Clouds Over Icebergs**: Low stratus clouds can frame holes over icebergs, 
such as Iceberg A-56 in the South Atlantic Ocean, likely due to thermal instability caused 
by the iceberg (source: page-39.pdf).

- **Undular Bores**: These are wave structures in the atmosphere created by the collision 
of cool, dry air from a continent with warm, moist air over the ocean, as seen off the 
coast of Mauritania (source: page-23.pdf).

- **Ship Tracks**: These are narrow clouds formed by water vapor condensing around tiny 
particles from ship exhaust. They are observed over the oceans, such as in the Pacific Ocean 
off the coast of California (source: page-31.pdf).

These specific formations are influenced by unique interactions between atmospheric conditions 
and the presence of large water bodies or objects within them.

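Adding semantic ranking and scoring profiles positively affects the LLM's response by promoting results that meet the scoring criteria and are semantically relevant.

Now that you have a better understanding of index and query design, let's move on to optimizing for speed and concision. We revisit the schema definition to implement quantization and storage reduction, but the rest of the pipeline and models remain intact.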
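
The `scoring_parameters` value in the updated query packs the parameter name (`tags`, matching `tags_parameter` in the scoring profile) and the tag values into a single string. If the tags come from user input or another system, building that string by hand gets error-prone. The following is a rough sketch of a hypothetical helper, `build_scoring_parameter`, which isn't part of the SDK. It assumes the `name-value1,value2` convention, with single quotes around values that contain commas and embedded single quotes doubled. Quoting values that contain spaces mirrors the `'sea surface'` value used in this tutorial and is a choice made here, not a confirmed requirement.

```python
from typing import Iterable


def build_scoring_parameter(name: str, values: Iterable[str]) -> str:
    """Join a parameter name and its values as 'name-value1,value2,...'."""
    parts = []
    for value in values:
        if any(ch in value for ch in ", '"):
            # Quote values with commas, spaces, or quotes; double any embedded quote.
            value = "'" + value.replace("'", "''") + "'"
        parts.append(value)
    return f"{name}-{','.join(parts)}"


print(build_scoring_parameter("tags", ["ocean", "sea surface", "seas", "surface"]))
# prints: tags-ocean,'sea surface',seas,surface
```

The result could then be passed as `scoring_parameters=[build_scoring_parameter("tags", tags)]`. Unlike the string in the query above, this version puts no spaces after the commas.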