The RetrievalGroundedness judge assesses whether your application's response is factually supported by the provided context (either from a RAG system or generated by a tool call), helping detect hallucinations or statements not backed by that context.
This built-in LLM judge is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.
By default, this judge uses a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
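For example, a minimal sketch of both forms of the model string (the openai provider prefix and gpt-4o-mini model below are illustrative assumptions, not defaults):

```python
from mlflow.genai.scorers import RetrievalGroundedness

# Databricks provider: the model name is the serving endpoint name
databricks_judge = RetrievalGroundedness(model="databricks:/databricks-gpt-5-mini")

# Any other LiteLLM-compatible provider (illustrative; assumes the provider's API key is configured)
openai_judge = RetrievalGroundedness(model="openai:/gpt-4o-mini")
```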
Prerequisites for running the examples
Install MLflow and required packages
```bash
pip install --upgrade "mlflow[databricks]>=3.4.0"
```

Create an MLflow experiment by following the set up your environment quickstart.
Usage with mlflow.genai.evaluate()
The RetrievalGroundedness judge can be used directly with MLflow's evaluation framework.
Requirements:
- Trace requirements (a minimal conforming trace is sketched below):
  - The MLflow Trace must contain at least one span with `span_type` set to `RETRIEVER`
  - `inputs` and `outputs` must be on the Trace's root span
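A minimal sketch of a trace that satisfies both requirements, assuming a trivial hard-coded retriever (the full RAG example later on this page follows the same pattern):

```python
import mlflow
from mlflow.entities import Document
from typing import List

# A span with span_type="RETRIEVER" satisfies the first requirement
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [
        Document(
            id="doc_1",
            page_content="MLflow is an open-source platform for managing the ML lifecycle.",
        )
    ]

# The root span created here carries the app's inputs and outputs,
# satisfying the second requirement
@mlflow.trace
def my_app(query: str) -> dict:
    docs = retrieve_docs(query)
    return {"response": docs[0].page_content}
```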
Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
```python
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
```

Use the judge:
```python
from mlflow.genai.scorers import RetrievalGroundedness
from mlflow.entities import Document
from typing import List

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval based on query
    if "mlflow" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                metadata={"source": "mlflow_docs.txt"}
            ),
            Document(
                id="doc_2",
                page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                metadata={"source": "mlflow_features.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Machine learning involves training models on data.",
                metadata={"source": "ml_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve relevant documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate a response using the LLM selected above
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages
    )
    return {"response": response.choices[0].message.content}

# Create evaluation dataset
eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"}
    },
    {
        "inputs": {"query": "What are the main features of MLflow?"}
    }
]

# Run evaluation with the RetrievalGroundedness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalGroundedness(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-hosted judge model.
        )
    ]
)
```
Customization
You can customize the judge by providing a different judge model:
```python
from mlflow.genai.scorers import RetrievalGroundedness

# Use a different judge model
groundedness_judge = RetrievalGroundedness(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[groundedness_judge]
)
```
Interpreting Results
The judge returns a Feedback object with:
- `value`: "yes" if the response is grounded, "no" if it contains hallucinations
- `rationale`: Detailed explanation identifying:
  - Which statements are supported by context
  - Which statements lack support (hallucinations)
  - Specific quotes from context that support or contradict claims
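As a rough illustration, a "no" result could look like the following sketch. The field values are made up, and constructing a Feedback by hand is only for illustration here; in practice the judge builds this object for you and attaches it to the evaluated trace.

```python
from mlflow.entities import Feedback

# Hypothetical example of a judge result (illustrative values only)
feedback = Feedback(
    name="retrieval_groundedness",
    value="no",
    rationale=(
        "The claim that MLflow 'automatically tunes hyperparameters' is not supported by the "
        "retrieved context, which only mentions experiment tracking, model packaging, and deployment."
    ),
)

print(feedback.value)      # "no" -> the response contains unsupported statements
print(feedback.rationale)  # explains which statements lack support in the context
```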
Next Steps
- Evaluate context sufficiency - Check if your retriever provides adequate information
- Evaluate context relevance - Ensure retrieved documents are relevant to queries
- Run comprehensive RAG evaluation - Combine multiple judges for complete RAG assessment