Set the retrieval reasoning effort

Note

This feature is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Azure Previews.

In agentic retrieval, the retrievalReasoningEffort property controls how much large language model (LLM) processing is applied to query planning and answer formulation. You can set this property in a knowledge base to establish a default, or on a retrieve request to override it.

Levels of reasoning effort include:

| Level | Effort |
| --- | --- |
| minimal | No LLM processing. |
| low | Runs a single pass of LLM-based query planning and knowledge source selection. This is the default. |
| medium | Adds deeper search and an enhanced retrieval stack to maximize completeness. |

Prerequisites

Set retrievalReasoningEffort in a knowledge base

To establish the default behavior, set the property in the knowledge base.

  1. Use Create or Update Knowledge Base to set the retrievalReasoningEffort.

  2. Add the retrievalReasoningEffort property. The following JSON shows the syntax. For more information about knowledge bases, see Create a knowledge base.

    "retrievalReasoningEffort": { /* no other parameters when effort is minimal */
        "kind": "low"
    }
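In practice, you merge this property into the knowledge base definition you send to Create or Update Knowledge Base. The sketch below only assembles the request body; the knowledge base definition shown is a hypothetical minimal stub (real definitions carry more fields, such as knowledge sources and models), and the helper name is an assumption.

```python
import json

def with_reasoning_effort(knowledge_base: dict, kind: str) -> dict:
    """Return a copy of a knowledge base definition with the default effort set."""
    updated = dict(knowledge_base)
    updated["retrievalReasoningEffort"] = {"kind": kind}
    return updated

# Hypothetical minimal definition for illustration only.
kb = {"name": "docs-kb"}
payload = json.dumps(with_reasoning_effort(kb, "low"), indent=2)
```

The resulting payload carries the same `"retrievalReasoningEffort": { "kind": "low" }` object shown above.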
    

Set retrievalReasoningEffort in a retrieve request

To override the default on a query-by-query basis, set the property in the retrieve request.

  1. Modify a retrieve action to override the knowledge base retrievalReasoningEffort default.

  2. Add the retrievalReasoningEffort property. A retrieve request might look similar to the following example.

    {
        "messages": [ /* trimmed for brevity */  ],
        "retrievalReasoningEffort": { "kind": "low" },
        "outputMode": "answerSynthesis",
        "maxRuntimeInSeconds": 30,
        "maxOutputSize": 6000
    }
    

Choose a retrieval reasoning effort

| Level | Description | Recommendation | Limits |
| --- | --- | --- | --- |
| minimal | Disables LLM-based query planning to deliver the lowest cost and latency for agentic retrieval. Direct text and vector searches are issued across the knowledge sources listed in the knowledge base, and the best-matching passages are returned. Because all knowledge sources are always searched and no query expansion is performed, behavior is predictable and easy to control. The alwaysQueryKnowledgeSource property on a retrieve request is therefore ignored. | Use "minimal" for migrations from the Search API or when you want to manage query planning yourself. | outputMode must be set to extractiveData. Answer synthesis and web knowledge aren't supported. |
| low | The default mode of agentic retrieval, which runs a single pass of LLM-based query planning and knowledge source selection. The agentic retrieval engine generates subqueries, fans them out to the selected knowledge sources, and merges the results. You can enable answer synthesis to produce a grounded natural-language response with inline citations. | Use "low" when you want a balance between minimal latency and deeper processing. | 5,000 answer tokens. Maximum of three subqueries across a maximum of three knowledge sources. Maximum of 50 documents for semantic ranking, or 10 documents if the semantic ranker uses L3 classification. |
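The constraint that "minimal" disables answer synthesis can be checked client-side before sending a request. The check below encodes only that rule; the function name is an assumption for this illustrative sketch, and the service performs its own validation.

```python
def check_effort_compatibility(effort_kind: str, output_mode: str) -> None:
    """Reject a combination the limits above mark as unsupported."""
    if effort_kind == "minimal" and output_mode != "extractiveData":
        raise ValueError(
            'retrievalReasoningEffort "minimal" requires outputMode '
            '"extractiveData"; answer synthesis is not supported at this level.'
        )

check_effort_compatibility("low", "answerSynthesis")     # allowed
check_effort_compatibility("minimal", "extractiveData")  # allowed
```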