The RetrievalSufficiency judge assesses whether the retrieved context (whether from a RAG application, an agent, or any system that retrieves documents) contains enough information to adequately answer the user's request, based on the ground-truth labels provided as expected_facts or expected_response.
This built-in LLM judge is designed for evaluating RAG systems, ensuring that your retrieval process surfaces all of the necessary information.
Prerequisites for running the examples
Install MLflow and the required packages
pip install --upgrade "mlflow[databricks]>=3.4.0"
Create an MLflow experiment by following the set up your environment quickstart.
Usage examples
RetrievalSufficiency can be invoked directly for single-trace evaluation, or used with MLflow's evaluation framework for batch evaluation.
Requirements:
- Trace requirements (a minimal sketch of a qualifying trace follows this list):
  - The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  - inputs and outputs must be present on the Trace's root span
- Ground-truth labels: Required. Either expected_facts or expected_response must be provided in the expectations dictionary.
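For illustration only, here is a minimal sketch of an app whose traces satisfy these requirements; the function names and document contents are hypothetical and not part of the examples below. A child span is marked as a RETRIEVER, and the root span records the inputs and outputs of the call.
import mlflow
from mlflow.entities import Document
from typing import List

# Hypothetical retriever; the RETRIEVER span type is what the judge looks for
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [
        Document(
            id="doc_1",
            page_content="Paris is the capital of France.",
            metadata={"source": "example.txt"}
        )
    ]

# Root span: its inputs and outputs are captured from the arguments and return value
@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    return docs[0].page_content

answer("What is the capital of France?")

# Ground-truth labels are supplied separately when scoring, for example:
# expectations={"expected_facts": ["Paris is the capital of France."]}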
Direct invocation
from mlflow.genai.scorers import retrieval_sufficiency
import mlflow
# Get a trace from a previous run
trace = mlflow.get_trace("<your-trace-id>")
# Assess if the retrieved context is sufficient for the expected facts
feedback = retrieval_sufficiency(
    trace=trace,
    expectations={
        "expected_facts": [
            "MLflow has four main components",
            "Components include Tracking",
            "Components include Projects",
            "Components include Models",
            "Components include Registry"
        ]
    }
)
print(feedback)
Invoking with evaluate()
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency
# Evaluate traces from previous runs with ground truth expectations
results = mlflow.genai.evaluate(
    data=eval_dataset,  # Dataset with trace data and expected_facts
    scorers=[RetrievalSufficiency()]
)
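The eval_dataset here is assumed to hold previously logged traces together with their expectations. One possible way to assemble it, sketched below under the assumption that expectations (such as expected_facts) were already logged on those traces, is to pull the traces from the active experiment with mlflow.search_traces():
import mlflow
from mlflow.genai.scorers import RetrievalSufficiency

# Sketch only: assumes the active experiment already contains traces with
# logged expectations. search_traces() returns a pandas DataFrame of traces,
# which can be passed directly as the evaluation data.
traces = mlflow.search_traces()

results = mlflow.genai.evaluate(
    data=traces,
    scorers=[RetrievalSufficiency()]
)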
RAG example
Below is a complete example that builds a RAG application and evaluates whether the retrieved context is sufficient:
Initialize an OpenAI client to connect to an LLM hosted by OpenAI.
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to models hosted by OpenAI. Choose a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
Define and evaluate the RAG application:
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate a response using the OpenAI model selected above
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalSufficiency(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-hosted judge model.
        )
    ]
)
Understanding the results
The RetrievalSufficiency scorer evaluates each retriever span separately. The judge will:
- Return "yes" if the retrieved documents contain all of the information needed to generate the expected facts
- Return "no" if the retrieved documents are missing key information, together with a rationale explaining what is missing
This helps you identify when your retrieval system fails to fetch all of the necessary information, a common cause of incomplete or incorrect responses in RAG applications.
Select the LLM that powers the judge
By default, these judges use a Databricks-hosted LLM designed to perform GenAI quality assessments. You can change the judge model by passing the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
You can customize the judge by supplying a different judge model:
from mlflow.genai.scorers import RetrievalSufficiency
# Use a different judge model
sufficiency_judge = RetrievalSufficiency(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[sufficiency_judge]
)
For a list of supported models, see the MLflow documentation.
Interpreting results
The judge returns a Feedback object containing:
- value: "yes" if the retrieved context is sufficient, "no" if it is not
- rationale: An explanation of which expected facts the retrieved context covers or is missing
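For example, continuing the direct invocation example above, both fields can be read straight off the returned object (a small sketch):
# feedback is the object returned by retrieval_sufficiency(...) above
print(feedback.value)      # "yes" or "no"
print(feedback.rationale)  # explains which expected facts were covered or missing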