Context sufficiency judge & scorer

The judges.is_context_sufficient() predefined judge assesses whether the context retrieved by your RAG system, or generated by tool calls, contains enough information to adequately answer the user's request, based on ground-truth labels provided as expected_facts or an expected_response.

This judge is available through the predefined RetrievalSufficiency scorer for evaluating RAG systems, where you need to ensure that the retrieval process is providing all necessary information.

API Signature

For details, see mlflow.genai.judges.is_context_sufficient().

from mlflow.genai.judges import is_context_sufficient

def is_context_sufficient(
    *,
    request: str,                    # User's question or query
    context: Any,                    # Context to evaluate for sufficiency; can be any Python primitive or a JSON-serializable dict
    expected_facts: Optional[list[str]] = None,  # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,  # Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None       # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Create an MLflow experiment by following the environment setup quickstart.

Direct SDK usage

from mlflow.genai.judges import is_context_sufficient

# Example 1: Context contains sufficient information
feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ],
    expected_facts=["Paris is the capital of France."]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of sufficiency

# Example 2: Context lacks necessary information
feedback = is_context_sufficient(
    request="What are MLflow's components?",
    context=[
        {"content": "MLflow is an open-source platform."}
    ],
    expected_facts=[
        "MLflow has four main components",
        "Components include Tracking",
        "Components include Projects"
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Explanation of what's missing
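
If you have a complete ground-truth answer rather than a list of facts, you can pass expected_response instead of expected_facts; the signature above accepts either. A minimal sketch mirroring Example 1 (the expected_response text here is illustrative):

# Example 3: Using expected_response instead of expected_facts
feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[{"content": "Paris is the capital of France."}],
    expected_response="The capital of France is Paris."
)
print(feedback.value)  # "yes"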

Using the prebuilt scorer

The is_context_sufficient judge is available through the RetrievalSufficiency prebuilt scorer.

Requirements

  • Trace requirements:
    • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be on the Trace's root span
  • Ground-truth labels: Required. You must provide either expected_facts or expected_response in the expectations dictionary, as sketched below
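
For reference, here is a minimal sketch of one evaluation dataset row that satisfies the ground-truth requirement (the query text is illustrative; the full example in step 2 shows this format in context):

# One evaluation row: ground truth goes under "expectations"
{
    "inputs": {"query": "What is the capital of France?"},
    "expectations": {
        # Provide either expected_facts or expected_response
        "expected_facts": ["Paris is the capital of France."]
    }
}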
  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to models hosted by OpenAI. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI SDKs
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Use the judge:

    from mlflow.genai.scorers import RetrievalSufficiency
    from mlflow.entities import Document
    from typing import List
    
    # Define a retriever function with proper span type
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        # Simulated retrieval - some queries return insufficient context
        if "capital of france" in query.lower():
            return [
                Document(
                    id="doc_1",
                    page_content="Paris is the capital of France.",
                    metadata={"source": "geography.txt"}
                ),
                Document(
                    id="doc_2",
                    page_content="France is a country in Western Europe.",
                    metadata={"source": "countries.txt"}
                )
            ]
        elif "mlflow components" in query.lower():
            # Incomplete retrieval - missing some components
            return [
                Document(
                    id="doc_3",
                    page_content="MLflow has multiple components including Tracking and Projects.",
                    metadata={"source": "mlflow_intro.txt"}
                )
            ]
        else:
            return [
                Document(
                    id="doc_4",
                    page_content="General information about data science.",
                    metadata={"source": "ds_basics.txt"}
                )
            ]
    
    # Define your RAG app
    @mlflow.trace
    def rag_app(query: str):
        # Retrieve documents
        docs = retrieve_docs(query)
        context = "\n".join([doc.page_content for doc in docs])
    
        # Generate response
        messages = [
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": query}
        ]
    
        response = client.chat.completions.create(
            # This example uses Databricks hosted Claude.  If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
            model=model_name,
            messages=messages
        )
    
        return {"response": response.choices[0].message.content}
    
    # Create evaluation dataset with ground truth
    eval_dataset = [
        {
            "inputs": {"query": "What is the capital of France?"},
            "expectations": {
                "expected_facts": ["Paris is the capital of France."]
            }
        },
        {
            "inputs": {"query": "What are all the MLflow components?"},
            "expectations": {
                "expected_facts": [
                    "MLflow has four main components",
                    "Components include Tracking",
                    "Components include Projects",
                    "Components include Models",
                    "Components include Registry"
                ]
            }
        }
    ]
    
    # Run evaluation with RetrievalSufficiency scorer
    eval_results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=rag_app,
        scorers=[RetrievalSufficiency()]
    )
    

Understanding the results

The RetrievalSufficiency scorer evaluates each retriever span separately. It will:

  • Return "yes" if the retrieved documents contain all the information needed to generate the expected facts
  • Return "no" if the retrieved documents are missing critical information, along with a rationale explaining what is missing

This helps you identify when your retrieval system fails to fetch all necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.
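
After an evaluation completes, you can pull the judged traces back for inspection. The sketch below assumes the object returned by mlflow.genai.evaluate() exposes a run_id attribute, as in the MLflow 3 API:

import mlflow

# Fetch the traces logged by the evaluation run as a pandas DataFrame
traces = mlflow.search_traces(run_id=eval_results.run_id)

# Column names vary across MLflow versions; inspect them, then look for
# the per-trace assessments carrying the RetrievalSufficiency feedback
print(traces.columns)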

Using in a custom scorer

When evaluating an application whose data structure differs from the predefined scorer's requirements, wrap the judge in a custom scorer:

import mlflow
from mlflow.genai.judges import is_context_sufficient
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What are the benefits of MLflow?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow simplifies ML lifecycle management."},
                {"content": "MLflow provides experiment tracking and model versioning."},
                {"content": "MLflow enables easy model deployment."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow simplifies ML lifecycle management",
                "MLflow provides experiment tracking",
                "MLflow enables model deployment"
            ]
        }
    },
    {
        "inputs": {"query": "How does MLflow handle model versioning?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow is an open-source platform."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow Model Registry handles versioning",
                "Models can have multiple versions",
                "Versions can be promoted through stages"
            ]
        }
    }
]

@scorer
def context_sufficiency_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_context_sufficient(
        request=inputs["query"],
        context=outputs["retrieved_context"],
        expected_facts=expectations["expected_facts"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[context_sufficiency_scorer]
)

Interpreting results

The judge returns a Feedback object containing:

  • value: "yes" if the context is sufficient, "no" if it is insufficient
  • rationale: An explanation of which expected facts are covered by or missing from the context
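
Because the judge returns a plain Feedback object, you can also use it to gate automated checks. A minimal sketch (the helper below is illustrative, not part of the MLflow API):

from mlflow.genai.judges import is_context_sufficient

def retrieval_is_sufficient(query: str, docs: list, facts: list) -> bool:
    """Illustrative helper: True when the judge deems the context sufficient."""
    feedback = is_context_sufficient(
        request=query,
        context=docs,
        expected_facts=facts,
    )
    if feedback.value != "yes":
        # The rationale explains which expected facts are missing
        print(f"Insufficient context: {feedback.rationale}")
    return feedback.value == "yes"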

Next steps