Agentic retrieval in Azure AI Search

Note

This feature is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Azure Previews.

What is agentic retrieval? In Azure AI Search, agentic retrieval is a new multi-query pipeline designed for complex questions posed by users or agents in chat and copilot apps. It's intended for Retrieval Augmented Generation (RAG) patterns and agent-to-agent workflows.

Here's what it does:

Uses a large language model (LLM) to break down a complex query into smaller, focused subqueries for better coverage over your indexed content. Subqueries can include chat history for extra context.
Runs subqueries in parallel. Each subquery is semantically reranked to promote the most relevant matches.
Combines the best results into a unified response that an LLM can use to generate answers with your proprietary content.
The response is modular yet comprehensive in how it also includes a query plan and source documents. You can choose to use just the search results as grounding data, or invoke the LLM to formulate an answer.

This high-performance pipeline helps you generate high quality grounding data (or an answer) for your chat application, with the ability to answer complex questions quickly.

Programmatically, agentic retrieval is supported through a new Knowledge Base object in the 2025-11-01-preview and in Azure SDK preview packages that provide the feature. A knowledge base's retrieval response is designed for downstream consumption by other agents and chat apps.

Why use agentic retrieval

You should use agentic retrieval when you want to provide agents and apps with the most relevant content for answering harder questions, leveraging chat context and your proprietary content.

The agentic aspect is a reasoning step in query planning processing that's performed by a supported large language model (LLM) that you provide. The LLM analyzes the entire chat thread to identify the underlying information need. Instead of a single, catch-all query, the LLM breaks down compound questions into focused subqueries based on: user questions, chat history, and parameters on the request. The subqueries target your indexed documents (plain text and vectors) in Azure AI Search. This hybrid approach ensures you surface both keyword matches and semantic similarities at once, dramatically improving recall.

The retrieval component is the ability to run subqueries simultaneously, merge results, semantically rank results, and return a three-part response that includes grounding data for the next conversation turn, reference data so that you can inspect the source content, and an activity plan that shows query execution steps.

Query expansion and parallel execution, plus the retrieval response, are the key capabilities of agentic retrieval that make it the best choice for generative AI (RAG) applications.

Agentic retrieval adds latency to query processing, but it makes up for it by adding these capabilities:

Reads in chat history as an input to the retrieval pipeline.
Deconstructs a complex query that contains multiple "asks" into component parts. For example: "find me a hotel near the beach, with airport transportation, and that's within walking distance of vegetarian restaurants."
Rewrites an original query into multiple subqueries using synonym maps (optional) and LLM-generated paraphrasing.
Corrects spelling mistakes.
Executes all subqueries simultaneously.
Outputs a unified result as a single string. Alternatively, you can extract parts of the response for your solution. Metadata about query execution and reference data is included in the response.

Agentic retrieval invokes the entire query processing pipeline multiple times for each subquery, but it does so in parallel, preserving the efficiency and performance necessary for a reasonable user experience.

Note

Including an LLM in query planning adds latency to a query pipeline. You can mitigate the effects by using faster models, such as gpt-4o-mini, and summarizing the message threads. You can minimize latency and costs by setting properties that limit LLM processing. You can also exclude LLM processing altogether for just text and hybrid search and your own query planning logic.

Architecture and workflow

Agentic retrieval is designed for conversational search experiences that use an LLM to intelligently break down complex queries. The system coordinates multiple Azure services to deliver comprehensive search results.

How it works

The agentic retrieval process works as follows:

Workflow initiation: Your application calls a knowledge base with retrieve action that provides a query and conversation history.
Query planning: A knowledge base sends your query and conversation history to an LLM, which analyzes the context and breaks down complex questions into focused subqueries. This step is automated and not customizable.
Query execution: The knowledge base sends the subqueries to your knowledge sources. All subqueries run simultaneously and can be keyword, vector, and hybrid search. Each subquery undergoes semantic reranking to find the most relevant matches. References are extracted and retained for citation purposes.
Result synthesis: The system combines all results into a unified response with three parts: merged content, source references, and execution details.

Your search index determines query execution and any optimizations that occur during query execution. Specifically, if your index includes searchable text and vector fields, a hybrid query executes. If the only searchable field is a vector field, then only pure vector search is used. The index semantic configuration, plus optional scoring profiles, synonym maps, analyzers, and normalizers (if you add filters) are all used during query execution. You must have named defaults for a semantic configuration and a scoring profile.

Required components

Component	Service	Role
LLM	Azure OpenAI	Creates subqueries from conversation context and later uses grounding data for answer generation
Knowledge base	Azure AI Search	Orchestrates the pipeline, connecting to your LLM and managing query parameters
Knowledge source	Azure AI Search	Wraps the search index with properties pertaining to knowledge base usage
Search index	Azure AI Search	Stores your searchable content (text and vectors) with semantic configuration
Semantic ranker	Azure AI Search	Required component that reranks results for relevance (L2 reranking)

Integration requirements

Your application drives the pipeline by calling the knowledge base and handling the response. The pipeline returns grounding data that you pass to an LLM for answer generation in your conversation interface. For implementation details, see Tutorial: Build an end-to-end agentic retrieval solution.

Note

Only gpt-4o, gpt-4.1, and gpt-5 series models are supported for query planning. You can use any model for final answer generation.

How to get started

To create an agentic retrieval solution, you can use the Azure portal, the latest preview REST APIs, or a preview Azure SDK package that provides the functionality.

Currently, Azure portal support for agentic retrieval is limited to the 2025-08-01-preview. We recommend using a programmatic approach to access the latest features.

Availability and pricing

Agentic retrieval is available in selected regions. Knowledge sources and knowledge bases also have maximum limits that vary by service tier.

It has a dependency on premium features. If you disable semantic ranker for your search service, you effectively disable agentic retrieval.

Plan	Description
Free	A free tier search service provides 50 million free agentic reasoning tokens per month. On higher tiers, you can choose between the free plan (default) and the standard plan.
Standard	The standard plan is pay-as-you-go pricing once the monthly free quota is consumed. After the free quota is used up, you are charged an additional fee for each additional one million agentic reasoning tokens. You aren't notified when the transition occurs. For more information about charges by currency, see the Azure AI Search pricing page.

Token-based billing for LLM-based query planning and answer synthesis (optional) is pay-as-you-go in Azure OpenAI. It's token based for both input and output tokens. The model you assign to the knowledge base is the one charged for token usage. For example, if you use gpt-4o, the token charge appears in the bill for gpt-4o.

Token-based billing for agentic retrieval is the number of tokens returned by each subquery.

Aspect	Classic single-query pipeline	Agentic retrieval multi-query pipeline
Unit	Query based (1,000 queries) per unit of currency	Token based (1 million tokens per unit of currency)
Cost per unit	Uniform cost per query	Uniform cost per token
Cost estimation	Estimate query count	Estimate token usage
Free tier	1,000 free queries	50 million free tokens

Last updated on 2025-12-29

Agentic retrieval in Azure AI Search

Why use agentic retrieval

Architecture and workflow

How it works

Required components

Integration requirements

How to get started

Availability and pricing

Additional resources