Answer and context relevance judges

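MLflow provides two built-in LLM judges for assessing relevance in GenAI applications. These judges help diagnose quality issues: if the retrieved context is not relevant, the generation step cannot produce a useful response.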

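RelevanceToQuery: evaluates whether the app's response directly addresses the user's input
RetrievalRelevance: evaluates whether each document returned by the app's retriever is relevant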

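By default, these judges use a Databricks-hosted LLM to perform GenAI quality assessments. You can change the judge model by passing the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.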
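The <provider>:/<model-name> format can be checked before a judge is constructed. The following is a minimal sketch, not part of MLflow: split_judge_model_uri is a hypothetical helper that splits on the first ":/" and rejects strings that are missing either part. MLflow performs its own parsing, which may differ.

```python
def split_judge_model_uri(model_uri: str) -> tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name).

    Hypothetical helper for illustration only; not an MLflow function.
    """
    provider, sep, model_name = model_uri.partition(":/")
    if not sep or not provider or not model_name:
        raise ValueError(
            f"Expected '<provider>:/<model-name>', got {model_uri!r}"
        )
    return provider, model_name


provider, model_name = split_judge_model_uri("databricks:/databricks-gpt-oss-120b")
print(provider)    # databricks
print(model_name)  # databricks-gpt-oss-120b
```

With the databricks provider, the second element is the serving endpoint name.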

Prerequisites for running the examples

Install MLflow and required packages

pip install --upgrade "mlflow[databricks]>=3.4.0" openai "databricks-connect>=16.1"

Follow the "Set up your environment" quickstart to create an MLflow experiment.

Usage with mlflow.genai.evaluate()

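1. RelevanceToQuery judge

This scorer evaluates whether the app's response directly addresses the user's input without drifting into unrelated topics.

Requirements:

Trace requirements: inputs and outputs must be on the trace's root span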

import mlflow
from mlflow.genai.scorers import RelevanceToQuery

eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the capital of France. It's known for the Eiffel Tower and is a major European city."
        },
    },
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "France is a beautiful country with great wine and cuisine."
        },
    }
]

# Run evaluation with RelevanceToQuery scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        RelevanceToQuery(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ],
)
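Because RelevanceToQuery reads inputs and outputs from the root span, rows shaped like the ones above need both fields when no predict_fn is supplied. As a rough pre-flight check, assuming rows are plain dictionaries, a hypothetical helper such as rows_missing_fields (not an MLflow function) can list the rows that lack either field:

```python
def rows_missing_fields(rows, required=("inputs", "outputs")):
    """Return (row_index, missing_keys) for rows with an absent or empty field.

    Hypothetical pre-flight check; it only inspects plain dictionaries.
    """
    problems = []
    for index, row in enumerate(rows):
        missing = [key for key in required if not row.get(key)]
        if missing:
            problems.append((index, missing))
    return problems


sample_rows = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"response": "Paris is the capital of France."},
    },
    {"inputs": {"query": "What is the capital of France?"}},  # outputs omitted
]

print(rows_missing_fields(sample_rows))  # [(1, ['outputs'])]
```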

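2. RetrievalRelevance judge

This scorer evaluates whether each document returned by the app's retriever(s) is relevant to the input request.

Requirements:

Trace requirements: the MLflow trace must contain at least one span with span_type set to RETRIEVER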

import mlflow
from mlflow.genai.scorers import RetrievalRelevance
from mlflow.entities import Document
from typing import List

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - in practice, this would query a vector database
    if "capital" in query.lower() and "france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="The Eiffel Tower is located in Paris.",
                metadata={"source": "landmarks.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Python is a programming language.",
                metadata={"source": "tech.txt"}
            )
        ]

# Define your app that uses the retriever
@mlflow.trace
def rag_app(query: str):
    docs = retrieve_docs(query)
    # In practice, you would pass these docs to an LLM
    return {"response": f"Found {len(docs)} relevant documents."}

# Create evaluation dataset
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"}
    },
    {
        "inputs": {"query": "How do I use Python?"}
    }
]

# Run evaluation with RetrievalRelevance scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[
        RetrievalRelevance(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ]
)
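The RETRIEVER requirement amounts to a simple rule: at least one span in the trace must carry that span type. The sketch below illustrates the rule with plain dictionaries standing in for spans. The name and span_type keys, the "UNKNOWN" placeholder, and the has_retriever_span helper are assumptions made for illustration, not MLflow's trace API.

```python
def has_retriever_span(spans):
    """Return True if any span record is typed as RETRIEVER."""
    return any(span.get("span_type") == "RETRIEVER" for span in spans)


# Stand-in records for the rag_app example: a root span wrapping one retriever call.
traced_spans = [
    {"name": "rag_app", "span_type": "UNKNOWN"},
    {"name": "retrieve_docs", "span_type": "RETRIEVER"},
]
# The same app if retrieve_docs were traced without span_type="RETRIEVER".
untyped_spans = [
    {"name": "rag_app", "span_type": "UNKNOWN"},
    {"name": "retrieve_docs", "span_type": "UNKNOWN"},
]

print(has_retriever_span(traced_spans))   # True
print(has_retriever_span(untyped_spans))  # False
```

In the second case the scorer would have no retriever span from which to read documents, which is why retrieve_docs is decorated with span_type="RETRIEVER" in the example above.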

Customization

You can customize these judges by providing a different judge model:

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, RetrievalRelevance

# Use different judge models
relevance_judge = RelevanceToQuery(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

retrieval_judge = RetrievalRelevance(
    model="databricks:/databricks-claude-opus-4-1"
)

# Use in evaluation (reuses eval_dataset and rag_app from the RetrievalRelevance example)
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[relevance_judge, retrieval_judge]
)

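Interpreting results

The judge returns a Feedback object containing:

value: "yes" if the context is relevant, "no" otherwise
rationale: an explanation of why the context was judged relevant or irrelevant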
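Once feedback has been collected, the yes/no values can be tallied and the rationales of failing cases pulled out for review. The sketch below is only an illustration: JudgeFeedback is a stand-in dataclass with the two fields described above (it is not MLflow's Feedback class), summarize is a hypothetical helper, and the sample rationales are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JudgeFeedback:
    """Stand-in with the two fields described above; not MLflow's Feedback class."""
    value: str
    rationale: str


def summarize(feedbacks):
    """Count yes/no verdicts and collect the rationales of the 'no' verdicts."""
    counts = Counter(fb.value for fb in feedbacks)
    failures = [fb.rationale for fb in feedbacks if fb.value == "no"]
    return counts, failures


# Invented sample feedback, for illustration only.
feedbacks = [
    JudgeFeedback("yes", "The response names Paris as the capital."),
    JudgeFeedback("no", "The response discusses wine and cuisine, not the capital."),
]
counts, failures = summarize(feedbacks)
print(counts["yes"], counts["no"])  # 1 1
print(failures[0])
```

Reading the rationales of the "no" verdicts first is usually the quickest way to see whether the problem lies in retrieval or in generation.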

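Next steps

Explore other built-in judges: learn about the groundedness, safety, and correctness judges
Create custom judges: build specialized judges for your use case
Evaluate RAG applications: apply relevance judges in a comprehensive RAG evaluation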

Last updated on 2025-10-30