The judges.is_context_sufficient() predefined judge assesses whether the context retrieved by your RAG system (or generated by a tool call) contains enough information to adequately answer the user's request, based on the ground-truth labels you provide through expected_facts or expected_response.
This judge is available through the predefined RetrievalSufficiency scorer for evaluating RAG systems where you need to ensure that the retrieval process supplies all of the necessary information.
API signature
For details, see mlflow.genai.judges.is_context_sufficient().
from mlflow.genai.judges import is_context_sufficient

def is_context_sufficient(
    *,
    request: str,  # User's question or query
    context: Any,  # Context to evaluate for sufficiency; any Python primitive or a JSON-serializable dict
    expected_facts: Optional[list[str]] = None,  # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,  # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None,  # Optional custom name for display in the MLflow UIs
    model: Optional[str] = None,  # Optional LiteLLM compatible custom judge model
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""
By default, this judge uses a Databricks-hosted LLM that is specifically optimized for GenAI quality assessments. You can change the judge model with the model argument in the judge call or scorer definition. The model must be specified in the format <provider>:/<model-name>, where the provider is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
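For example, the model argument accepts values like the following; the endpoint and model names are illustrative, so substitute ones available in your workspace or provider account:
# Databricks model serving endpoint: the model name equals the endpoint name
model = "databricks:/databricks-claude-sonnet-4"
# Any other LiteLLM-compatible provider, for example OpenAI
model = "openai:/gpt-4o-mini"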
Prerequisites for running the examples
Install MLflow and the required packages:
pip install --upgrade "mlflow[databricks]>=3.4.0"
Follow the Set up your environment quickstart to create an MLflow experiment.
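As a reminder, a minimal sketch of that setup, assuming you track to a Databricks workspace (the experiment path is a placeholder):
import mlflow

# Point MLflow at your Databricks workspace and select (or create) an experiment
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")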
Use the SDK directly
from mlflow.genai.judges import is_context_sufficient
# Example 1: Context contains sufficient information
feedback = is_context_sufficient(
request="What is the capital of France?",
context=[
{"content": "Paris is the capital of France."},
{"content": "Paris is known for its Eiffel Tower."}
],
expected_facts=["Paris is the capital of France."]
)
print(feedback.value) # "yes"
print(feedback.rationale) # Explanation of sufficiency
# Example 2: Context lacks necessary information
feedback = is_context_sufficient(
request="What are MLflow's components?",
context=[
{"content": "MLflow is an open-source platform."}
],
expected_facts=[
"MLflow has four main components",
"Components include Tracking",
"Components include Projects"
]
)
print(feedback.value) # "no"
print(feedback.rationale) # Explanation of what's missing
# Example 3: Custom judge model
feedback = is_context_sufficient(
request="What are MLflow's components?",
context=[
{"content": "MLflow is an open-source platform."}
],
expected_facts=[
"MLflow has four main components",
"Components include Tracking",
"Components include Projects"
],
model="databricks:/databricks-gpt-oss-120b",
)
Use the prebuilt scorer
The is_context_sufficient judge is available through the RetrievalSufficiency prebuilt scorer.
Requirements:
- Trace requirements:
  - The MLflow trace must contain at least one span with span_type set to RETRIEVER
  - inputs and outputs must be present on the trace's root span
- Ground-truth labels: required. You must provide either expected_facts or expected_response in the expectations dictionary (a minimal sketch that satisfies these requirements follows this list).
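The following sketch shows a trace and ground-truth labels that satisfy these requirements; the document contents and query are placeholders, and the full example later on this page shows a more realistic setup:
import mlflow
from mlflow.entities import Document

# A retriever span: the decorator records this call with span_type=RETRIEVER
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> list[Document]:
    return [
        Document(
            id="doc_1",
            page_content="Paris is the capital of France.",
            metadata={"source": "example.txt"},
        )
    ]

# The root span: its inputs and outputs come from the function arguments and return value
@mlflow.trace
def rag_app(query: str):
    docs = retrieve_docs(query)
    return {"response": docs[0].page_content}

# Ground-truth labels go under "expectations" in each evaluation dataset row
eval_row = {
    "inputs": {"query": "What is the capital of France?"},
    "expectations": {"expected_facts": ["Paris is the capital of France."]},
}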
Initialize an OpenAI client to connect to either a Databricks-hosted LLM or an LLM hosted by OpenAI.
Databricks-hosted LLMs
Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to models hosted by OpenAI. Select a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Use the judge:
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]

    response = client.chat.completions.create(
        # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        model=model_name,
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalSufficiency(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ]
)
Understand the results
The RetrievalSufficiency scorer evaluates each retriever span separately. For each span, it:
- Returns "yes" if the retrieved documents contain all of the information needed to generate the expected facts
- Returns "no" if the retrieved documents are missing critical information, along with a rationale explaining what is missing
This helps you identify when your retrieval system fails to fetch all of the necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.
Use in a custom scorer
When evaluating an application whose data structure differs from what the predefined scorer requires, wrap the judge in a custom scorer:
import mlflow
from mlflow.genai.judges import is_context_sufficient
from mlflow.genai.scorers import scorer
from typing import Dict, Any
eval_dataset = [
{
"inputs": {"query": "What are the benefits of MLflow?"},
"outputs": {
"retrieved_context": [
{"content": "MLflow simplifies ML lifecycle management."},
{"content": "MLflow provides experiment tracking and model versioning."},
{"content": "MLflow enables easy model deployment."}
]
},
"expectations": {
"expected_facts": [
"MLflow simplifies ML lifecycle management",
"MLflow provides experiment tracking",
"MLflow enables model deployment"
]
}
},
{
"inputs": {"query": "How does MLflow handle model versioning?"},
"outputs": {
"retrieved_context": [
{"content": "MLflow is an open-source platform."}
]
},
"expectations": {
"expected_facts": [
"MLflow Model Registry handles versioning",
"Models can have multiple versions",
"Versions can be promoted through stages"
]
}
}
]
@scorer
def context_sufficiency_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
return is_context_sufficient(
request=inputs["query"],
context=outputs["retrieved_context"],
expected_facts=expectations["expected_facts"]
)
# Run evaluation
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[context_sufficiency_scorer]
)
Interpret the results
The judge returns a Feedback object containing:
- value: "yes" if the context is sufficient, "no" if it is not
- rationale: An explanation of which expected facts are covered by, or missing from, the context