The RetrievalSufficiency judge evaluates whether the retrieved context (from RAG applications, agents, or any system that retrieves documents) contains enough information to adequately answer the user's request, based on the ground-truth labels provided as expected_facts or expected_response.
This built-in LLM judge is designed for evaluating RAG systems where you need to ensure that your retrieval process is providing all necessary information.
By default, this judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
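For example, the same judge can point at a Databricks serving endpoint or at another LiteLLM-compatible provider. A minimal sketch, where both model names are placeholders rather than required values:

from mlflow.genai.scorers import RetrievalSufficiency

# <provider>:/<model-name> format; the endpoint and model names below are examples.
databricks_judge = RetrievalSufficiency(model="databricks:/databricks-gpt-oss-120b")
openai_judge = RetrievalSufficiency(model="openai:/gpt-4o-mini")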
Prerequisites for running the examples
Install MLflow and required packages
pip install --upgrade "mlflow[databricks]>=3.4.0"

Create an MLflow experiment by following the setup your environment quickstart.
Usage with mlflow.evaluate()
The RetrievalSufficiency judge can be used directly with MLflow's evaluation framework.
Requirements:
- Trace requirements:
  - The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  - inputs and outputs must be on the Trace's root span
- Ground-truth labels: Required - must provide either expected_facts or expected_response in the expectations dictionary (see the minimal sketch after this list)
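The sketch below shows the minimal shape that meets both requirements. The retriever, document, and query are placeholders borrowed from the full example later on this page:

import mlflow
from mlflow.entities import Document
from typing import List

# A span with span_type="RETRIEVER" inside the traced app satisfies the trace requirement.
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [
        Document(
            id="doc_1",
            page_content="Paris is the capital of France.",
            metadata={"source": "geography.txt"}
        )
    ]

# Ground-truth labels live in the expectations dictionary of each evaluation row.
eval_row = {
    "inputs": {"query": "What is the capital of France?"},
    "expectations": {"expected_facts": ["Paris is the capital of France."]},
}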
Initialize an OpenAI client to connect to LLMs hosted by OpenAI.
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Use the judge:
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]
    response = client.chat.completions.create(
        # model_name is defined above; replace with any OpenAI model you have access to, e.g. gpt-4o
        model=model_name,
        messages=messages
    )
    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalSufficiency(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to a Databricks-hosted judge model.
        )
    ]
)
Understanding the results
The RetrievalSufficiency scorer evaluates each retriever span separately. It will:
- Return "yes" if the retrieved documents contain all the information needed to generate the expected facts
- Return "no" if the retrieved documents are missing critical information, along with a rationale explaining what's missing
This helps you identify when your retrieval system is failing to fetch all necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.
Customization
You can customize the judge by providing a different judge model:
from mlflow.genai.scorers import RetrievalSufficiency
# Use a different judge model
sufficiency_judge = RetrievalSufficiency(
model="databricks:/databricks-gpt-5-mini" # Or any LiteLLM-compatible model
)
# Use in evaluation
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=rag_app,
scorers=[sufficiency_judge]
)
Interpreting Results
The judge returns a Feedback object with:
- value: "yes" if context is sufficient, "no" if insufficient
- rationale: Explanation of which expected facts are covered or missing in the context
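For illustration, here is the shape of such a Feedback object, constructed by hand. This is a sketch only: the assessment name, value, and rationale below are made-up examples of what the judge might return, not output from a real run.

from mlflow.entities import Feedback

# Hypothetical example of the judge's output shape; values are illustrative.
feedback = Feedback(
    name="retrieval_sufficiency",
    value="no",
    rationale="The retrieved context covers Tracking and Projects but does not mention the Models or Registry components.",
)
print(feedback.value)      # "no"
print(feedback.rationale)  # explanation of the missing expected facts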
Next Steps
- Evaluate context relevance - Ensure retrieved documents are relevant before checking sufficiency
- Evaluate groundedness - Verify that responses use only the provided context
- Build evaluation datasets - Create ground truth datasets with expected facts for testing