Tutorial: Build an indexing pipeline for RAG on Azure AI Search

Learn how to build an automated indexing pipeline for a RAG solution on Azure AI Search. Indexing automation is driven by an indexer, which runs indexing and skillset execution and provides integrated data chunking and vectorization on a one-time or recurring basis for incremental updates.

In this tutorial, you learn how to:

  • Provide the index schema from the previous tutorial
  • Create a data source connection
  • Create an indexer
  • Create a skillset that chunks, vectorizes, and recognizes entities
  • Run the indexer and check results

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

  • Visual Studio Code with the Python extension and the Jupyter package
  • Azure AI Search
  • Azure OpenAI with a deployment of the text-embedding-3-large embedding model
  • An Azure AI multi-service account, used by the Entity Recognition skill
  • Azure Blob Storage

Download the sample

Download a Jupyter notebook from GitHub to send the requests to Azure AI Search. For more information, see Downloading files from GitHub.

Provide the index schema

Open or create a Jupyter notebook (.ipynb) in Visual Studio Code to contain the scripts that make up the pipeline. Initial steps install packages and collect the variables used for connections. After you complete the setup steps, you're ready to begin with the components of the indexing pipeline.
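
The scripts in this tutorial reference connection variables such as AZURE_SEARCH_SERVICE, AZURE_OPENAI_ACCOUNT, AZURE_AI_MULTISERVICE_KEY, and AZURE_STORAGE_CONNECTION that come from this setup step. A minimal sketch of that setup cell, with placeholder values that you replace with your own endpoints, might look like the following.

# Setup sketch (placeholders, not part of the sample): install packages and
# define the connection variables used by the rest of this notebook.
# In the notebook, run once: %pip install azure-search-documents azure-identity

AZURE_SEARCH_SERVICE = "https://<your-search-service>.search.windows.net"
AZURE_OPENAI_ACCOUNT = "https://<your-azure-openai-resource>.openai.azure.com"
AZURE_AI_MULTISERVICE_KEY = "<your-azure-ai-multiservice-key>"

# For a keyless (managed identity) connection to Blob Storage, the connection
# string is the storage account's ResourceId, as noted later in this tutorial.
AZURE_STORAGE_CONNECTION = "ResourceId=/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>;"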

Let's start with the index schema from the previous tutorial. It's organized around vectorized and nonvectorized chunks. It includes a locations field that stores the AI-generated content created by the skillset.

from azure.identity import DefaultAzureCredential
from azure.identity import get_bearer_token_provider
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex
)

credential = DefaultAzureCredential()

# Create a search index  
index_name = "py-rag-tutorial-idx"
index_client = SearchIndexClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
fields = [
    SearchField(name="parent_id", type=SearchFieldDataType.String),  
    SearchField(name="title", type=SearchFieldDataType.String),
    SearchField(name="locations", type=SearchFieldDataType.Collection(SearchFieldDataType.String), filterable=True),
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),  
    SearchField(name="text_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1024, vector_search_profile_name="myHnswProfile")
    ]  

# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer_name="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            vectorizer_name="myOpenAI",  
            kind="azureOpenAI",  
            parameters=AzureOpenAIVectorizerParameters(  
                resource_url=AZURE_OPENAI_ACCOUNT,  
                deployment_name="text-embedding-3-large",
                model_name="text-embedding-3-large"
            ),
        ),  
    ], 
)  

# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  

Create a data source connection

In this step, set up the sample data and a connection to Azure Blob Storage. The indexer retrieves the PDFs from a container. You create the container and upload the files in this step.

The original ebook is large, at over 100 pages and 35 MB in size. We broke it up into smaller PDFs, one per page of text, to stay under the REST API payload limit of 16 MB per API call. For simplicity, we omit image vectorization for this exercise.
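
The sample download already contains the split pages, so you don't need to split the ebook yourself. If you want to reproduce the split, a sketch using the pypdf library (an assumed tool choice and file name; the tutorial doesn't prescribe either) could look like this:

# Optional sketch: split a large PDF into one-page PDFs with pypdf.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("earth_book_2019.pdf")  # hypothetical source file name
for i, page in enumerate(reader.pages, start=1):
    writer = PdfWriter()
    writer.add_page(page)                  # one page per output PDF
    with open(f"page-{i}.pdf", "wb") as output_file:
        writer.write(output_file)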

  1. Sign in to the Azure portal and find your Azure Storage account.

  2. Create a container and upload the PDFs from earth_book_2019_text_pages. (An optional scripted alternative is sketched after these steps.)

  3. Make sure Azure AI Search has Storage Blob Data Reader permissions on the resource.

  4. Next, in Visual Studio Code, define an indexer data source that provides connection information during indexing.

    from azure.search.documents.indexes import SearchIndexerClient
    from azure.search.documents.indexes.models import (
        SearchIndexerDataContainer,
        SearchIndexerDataSourceConnection
    )
    
    # Create a data source 
    indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)
    container = SearchIndexerDataContainer(name="nasa-ebooks-pdfs-all")
    data_source_connection = SearchIndexerDataSourceConnection(
        name="py-rag-tutorial-ds",
        type="azureblob",
        connection_string=AZURE_STORAGE_CONNECTION,
        container=container
    )
    data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)
    
    print(f"Data source '{data_source.name}' created or updated")
    

If you set up a managed identity in Azure AI Search for the connection, the connection string includes a ResourceId= suffix. It should look similar to the following example: "ResourceId=/subscriptions/FAKE-SUBSCRIPTION-ID/resourceGroups/FAKE-RESOURCE-GROUP/providers/Microsoft.Storage/storageAccounts/FAKE-ACCOUNT;"
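
If you prefer to script step 2 instead of using the Azure portal, a sketch using the azure-storage-blob package (an alternative approach, not part of the tutorial) might look like this:

# Optional sketch: create the container and upload the split PDFs with the Blob Storage SDK.
import os
from azure.storage.blob import BlobServiceClient

# Reuses the DefaultAzureCredential defined earlier; the account URL is a placeholder.
blob_service_client = BlobServiceClient(
    account_url="https://<your-storage-account>.blob.core.windows.net",
    credential=credential
)
container_client = blob_service_client.create_container("nasa-ebooks-pdfs-all")

# Upload every per-page PDF from the downloaded sample folder.
pdf_folder = "earth_book_2019_text_pages"
for file_name in os.listdir(pdf_folder):
    with open(os.path.join(pdf_folder, file_name), "rb") as data:
        container_client.upload_blob(name=file_name, data=data)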

Create a skillset

Skills are the basis for integrated data chunking and vectorization. At a minimum, you need a Text Split skill to chunk your content and an embedding skill to create vector representations of your chunked content.

In this skillset, an extra skill is used to create structured data in the index. The Entity Recognition skill identifies locations, which can range from proper names to generic references, such as "ocean" or "mountain". Having structured data gives you more options for creating interesting queries and boosting relevance.

from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    EntityRecognitionSkill,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset,
    CognitiveServicesAccountKey
)

# Create a skillset  
skillset_name = "py-rag-tutorial-ss"

split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  

embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_url=AZURE_OPENAI_ACCOUNT,  
    deployment_name="text-embedding-3-large",  
    model_name="text-embedding-3-large",
    dimensions=1024,
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="text_vector")  
    ],  
)

entity_skill = EntityRecognitionSkill(
    description="Skill to recognize entities in text",
    context="/document/pages/*",
    categories=["Location"],
    default_language_code="en",
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*")
    ],
    outputs=[
        OutputFieldMappingEntry(name="locations", target_name="locations")
    ]
)

index_projections = SearchIndexerIndexProjection(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
                InputFieldMappingEntry(name="locations", source="/document/pages/*/locations"),
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
) 

cognitive_services_account = CognitiveServicesAccountKey(key=AZURE_AI_MULTISERVICE_KEY)

skills = [split_skill, embedding_skill, entity_skill]

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generating embeddings",  
    skills=skills,  
    index_projection=index_projections,
    cognitive_services_account=cognitive_services_account
)

client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")

Create and run the indexer

An indexer is the component that sets all of the processes in motion. You can create an indexer in a disabled state, but the default is to run it immediately. In this tutorial, create and run the indexer to retrieve the data from Blob storage, execute the skills, including chunking and vectorization, and load the index.

The indexer takes several minutes to run. When it's done, you can move on to the final step: querying your index.

from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping
)

# Create an indexer  
indexer_name = "py-rag-tutorial-idxr" 

indexer_parameters = None

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")],
    parameters=indexer_parameters
)  

# Create and run the indexer  
indexer_client = SearchIndexerClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  

print(f' {indexer_name} is created and running. Give the indexer a few minutes before running a query.')    
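
If you want to check on progress instead of just waiting, you can poll the indexer status. A minimal sketch of such a polling loop (the loop itself is an assumption, not part of the tutorial):

# Optional sketch: poll until the indexer's most recent run finishes.
import time

while True:
    status = indexer_client.get_indexer_status(indexer_name)
    last_result = status.last_result
    # last_result is None until the first run has been recorded.
    if last_result and last_result.status in ("success", "transientFailure"):
        print(f"Indexer finished with status: {last_result.status}")
        break
    print("Indexer still running...")
    time.sleep(30)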

Run a query to check results

Send a query to confirm that your index is operational. This request converts the text string "what's NASA's website?" to a vector for a vector search. Results consist of the fields in the select statement, some of which are printed as output.

There's no chat or generative AI at this point. The results are verbatim content from your search index.

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Vector Search using text-to-vector conversion of the querystring
query = "what's NASA's website?"  

search_client = SearchClient(endpoint=AZURE_SEARCH_SERVICE, credential=credential, index_name=index_name)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["chunk"],
    top=1
)  

for result in results:  
    print(f"Score: {result['@search.score']}")
    print(f"Chunk: {result['chunk']}")

This query returns a single match (top=1), consisting of the one chunk determined by the search engine to be the most relevant. Results from the query should look similar to the following example:

Score: 0.03306011110544205
Title: page-178.pdf
Locations: ['Headquarters', 'Washington']
Content: national Aeronautics and Space Administration

earth Science

NASA Headquarters 

300 E Street SW 

Washington, DC 20546

www.nasa.gov

np-2018-05-2546-hQ

Try a few more queries to get a sense of what the search engine returns directly, so that you can compare it with an LLM-enabled response. Rerun the previous script with this query: "how much of the earth is covered in water"
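
For example, reusing the search client from the previous cell and changing only the query string:

# Rerun the search with a new query string; the client, index, and select list are unchanged.
query = "how much of the earth is covered in water"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["chunk", "title", "locations"],
    top=1
)

for result in results:
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['chunk']}")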

Results from the second query should look similar to the following, which is lightly edited for brevity.

Score: 0.03333333507180214
Content:

Land of Lakes
Canada

During the last Ice Age, nearly all of Canada was covered by a massive ice sheet. Thousands of years later, the landscape still shows 

the scars of that icy earth-mover. Surfaces that were scoured by retreating ice and flooded by Arctic seas are now dotted with 

millions of lakes, ponds, and streams. In this false-color view from the Terra satellite, water is various shades of blue, green, tan, and 

black, depending on the amount of suspended sediment and phytoplankton; vegetation is red.

The region of Nunavut Territory is sometimes referred to as the "Barren Grounds," as it is nearly treeless and largely unsuitable for 

agriculture. The ground is snow-covered for much of the year, and the soil typically remains frozen (permafrost) even during the 

summer thaw. Nonetheless, this July 2001 image shows plenty of surface vegetation in midsummer, including lichens, mosses, 

shrubs, and grasses. The abundant fresh water also means the area is teeming with flies and mosquitoes.

With this example, it's easier to see how chunks are returned verbatim, and how keyword and similarity search identify the top match. This particular chunk definitely has information about water and coverage of the earth, but it's not exactly relevant to the query. A semantic ranker would find a better answer, but as a next step, let's see how to connect Azure AI Search to an LLM for conversational search.

Next step