
Retrieval sufficiency judge

The RetrievalSufficiency judge evaluates whether the retrieved context (whether from a RAG application, an agent, or any system that retrieves documents) contains enough information to fully answer the user's request, based on the ground-truth labels provided as expected_facts or an expected_response.

This built-in LLM judge is designed for evaluating RAG systems, ensuring that the retrieval step surfaces all of the necessary information.

By default, this judge uses a Databricks-hosted LLM to perform the GenAI quality assessment. You can change the judge model by passing the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
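
For example, a minimal sketch of overriding the judge model (the endpoint name below mirrors the customization example later on this page; substitute any serving endpoint or LiteLLM-compatible model available to you):

    from mlflow.genai.scorers import RetrievalSufficiency

    # With the "databricks" provider, the model name is the serving endpoint name
    sufficiency_judge = RetrievalSufficiency(model="databricks:/databricks-gpt-oss-120b")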

Prerequisites for running the examples

  1. Install MLflow and the required packages

    pip install --upgrade "mlflow[databricks]>=3.4.0"
    
  2. Follow the set up your environment quickstart to create an MLflow experiment (a minimal setup sketch follows this list).
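
If you are running outside a Databricks notebook, the sketch below shows one way to point MLflow at your workspace. The workspace URL, token, and experiment path are placeholders, and the environment variables assume personal access token authentication:

    import os
    import mlflow

    # Placeholders - assumes personal access token authentication to your workspace
    os.environ["DATABRICKS_HOST"] = "https://<your-workspace-url>"
    os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"

    # Log traces and evaluation runs to an experiment in your workspace
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")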

Usage with mlflow.evaluate()

The RetrievalSufficiency scorer can be used directly with MLflow's evaluation framework.

Requirements

  • Trace requirements
    • The MLflow trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be present on the trace's root span
  • Ground-truth labels: Required - you must provide either expected_facts or expected_response in the expectations dictionary (see the sketch below)
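
A minimal sketch of the two accepted ground-truth formats for a row in the evaluation dataset (the query and values are placeholders):

    # Option 1: ground truth as expected_facts
    row_with_facts = {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    }

    # Option 2: ground truth as expected_response
    row_with_response = {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {"expected_response": "The capital of France is Paris."},
    }
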
  1. Initialize an OpenAI client to connect to an OpenAI-hosted LLM.

    OpenAI-hosted LLM

    Use the native OpenAI SDK to connect to a model hosted by OpenAI. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Use the judge:

    from mlflow.genai.scorers import RetrievalSufficiency
    from mlflow.entities import Document
    from typing import List
    
    # Define a retriever function with proper span type
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        # Simulated retrieval - some queries return insufficient context
        if "capital of france" in query.lower():
            return [
                Document(
                    id="doc_1",
                    page_content="Paris is the capital of France.",
                    metadata={"source": "geography.txt"}
                ),
                Document(
                    id="doc_2",
                    page_content="France is a country in Western Europe.",
                    metadata={"source": "countries.txt"}
                )
            ]
        elif "mlflow components" in query.lower():
            # Incomplete retrieval - missing some components
            return [
                Document(
                    id="doc_3",
                    page_content="MLflow has multiple components including Tracking and Projects.",
                    metadata={"source": "mlflow_intro.txt"}
                )
            ]
        else:
            return [
                Document(
                    id="doc_4",
                    page_content="General information about data science.",
                    metadata={"source": "ds_basics.txt"}
                )
            ]
    
    # Define your RAG app
    @mlflow.trace
    def rag_app(query: str):
        # Retrieve documents
        docs = retrieve_docs(query)
        context = "\n".join([doc.page_content for doc in docs])
    
        # Generate response
        messages = [
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": query}
        ]
    
        response = client.chat.completions.create(
            # model_name is defined above; replace it with any OpenAI model available to you, e.g. gpt-4o
            model=model_name,
            messages=messages
        )
    
        return {"response": response.choices[0].message.content}
    
    # Create evaluation dataset with ground truth
    eval_dataset = [
        {
            "inputs": {"query": "What is the capital of France?"},
            "expectations": {
                "expected_facts": ["Paris is the capital of France."]
            }
        },
        {
            "inputs": {"query": "What are all the MLflow components?"},
            "expectations": {
                "expected_facts": [
                    "MLflow has four main components",
                    "Components include Tracking",
                    "Components include Projects",
                    "Components include Models",
                    "Components include Registry"
                ]
            }
        }
    ]
    
    # Run evaluation with RetrievalSufficiency scorer
    eval_results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=rag_app,
        scorers=[
            RetrievalSufficiency(
                model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to the Databricks-managed judge LLM.
            )
        ]
    )
    

Understanding the results

The RetrievalSufficiency scorer evaluates each retriever span in the trace separately. For each span, it will:

  • Return "yes" if the retrieved documents contain all of the information needed to produce the expected facts
  • Return "no" if the retrieved documents are missing key information, along with a rationale explaining what is missing

This helps you identify when your retrieval system fails to fetch all of the necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.
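
For example, the MLflow-components query above should be judged insufficient, because the stub retriever never returns the Models and Registry components. A purely illustrative sketch of a retrieval result that would cover all of the expected facts for that query, and should therefore be judged sufficient:

    from mlflow.entities import Document

    # Hypothetical document that covers all four expected components
    sufficient_docs = [
        Document(
            id="doc_3",
            page_content="MLflow has four main components: Tracking, Projects, Models, and Registry.",
            metadata={"source": "mlflow_intro.txt"},
        )
    ]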

Customization

You can customize the judge by providing a different judge model:

from mlflow.genai.scorers import RetrievalSufficiency

# Use a different judge model
sufficiency_judge = RetrievalSufficiency(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[sufficiency_judge]
)

Interpreting results

The judge returns a Feedback object containing:

  • value: "yes" if the context is sufficient, "no" if it is not
  • rationale: an explanation of which expected facts are covered by, or missing from, the retrieved context
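
A purely illustrative sketch of the shape of that object (the field values below are hypothetical, not real judge output):

    from mlflow.entities import Feedback

    # Hypothetical example of a judge result for an insufficient retrieval
    feedback = Feedback(
        value="no",
        rationale="The retrieved context does not mention the Models or Registry components.",
    )
    print(feedback.value, feedback.rationale)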

Next steps