
Correctness judge

The Correctness judge assesses whether your GenAI application's responses are factually correct by comparing them against provided ground-truth information (expected_facts or expected_response).

This built-in LLM judge is designed to evaluate how well an application's response matches a known correct answer.

By default, this judge uses a Databricks-hosted LLM to perform the GenAI quality assessment. You can change the judge model by using the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.
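
For example, a minimal sketch that points the judge at a Databricks model serving endpoint (the endpoint name my-serving-endpoint is a placeholder, not a real endpoint):

from mlflow.genai.scorers import Correctness

# "my-serving-endpoint" is a placeholder; substitute your own serving endpoint name.
correctness_judge = Correctness(model="databricks:/my-serving-endpoint")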

Prerequisites for running the examples

  1. Install MLflow and the required packages:

    pip install --upgrade "mlflow[databricks]>=3.4.0"
    
  2. Create an MLflow experiment by following the Set up your environment quickstart.

Usage examples

from mlflow.genai.scorers import Correctness

correctness_judge = Correctness()

# Example 1: Response contains expected facts
feedback = correctness_judge(
    inputs={"request": "What is MLflow?"},
    outputs={"response": "MLflow is an open-source platform for managing the ML lifecycle."},
    expectations={
        "expected_facts": [
            "MLflow is open-source",
            "MLflow is a platform for ML lifecycle"
        ]
    }
)

print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of which facts are supported

# Example 2: Response missing or contradicting facts
feedback = correctness_judge(
    inputs={"request": "When was MLflow released?"},
    outputs={"response": "MLflow was released in 2017."},
    expectations={"expected_facts": ["MLflow was released in June 2018"]}
)
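
# For this example, the judge is expected to return "no": the response's release
# year (2017) contradicts the expected fact that MLflow was released in June 2018.
print(feedback.value)  # "no"
print(feedback.rationale)  # Explanation of the contradiction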

# Example 3: Using expected_response instead of expected_facts
feedback = correctness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "The capital/major city of France is Paris."},
    expectations={"expected_response": "Paris is the capital/major city of France."}
)

Using with mlflow.genai.evaluate()

The Correctness scorer can be used directly with MLflow's evaluation framework.

Requirements

  • Trace requirements: inputs and outputs must be on the root span of the trace (see the sketch after the evaluation example below)
  • Ground-truth labels: required - you must provide either expected_facts or expected_response in the expectations dictionary
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    }
]

# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ]
)
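
The dataset above supplies pre-computed outputs. To evaluate a live application instead, you can pass a traced prediction function, which satisfies the trace requirement because the function's inputs and outputs are recorded on the root span. A minimal sketch, assuming a stubbed app (the function body and response text are illustrative):

import mlflow
from mlflow.genai.scorers import Correctness

@mlflow.trace
def my_app(query: str) -> dict:
    # Placeholder app logic; the root span of this trace carries the
    # inputs and outputs that the Correctness judge reads.
    return {"response": "MLflow is an open-source platform for the ML lifecycle."}

eval_results = mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"query": "What is MLflow?"},
            "expectations": {"expected_facts": ["MLflow is open-source"]},
        }
    ],
    predict_fn=my_app,
    scorers=[Correctness()],
)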

Alternative: using expected_response

You can also use expected_response instead of expected_facts:

eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()]
)

Tip

Using expected_facts is recommended over expected_response because it allows more flexible evaluation: the response does not need to match word for word, it only needs to contain the key facts.

Customization

You can customize the judge by providing a different judge model:

from mlflow.genai.scorers import Correctness

# Use a different judge model
correctness_judge = Correctness(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_judge]
)

Interpreting results

The judge returns a Feedback object containing:

  • value: "yes" if the response is correct, "no" if it is not
  • rationale: a detailed explanation of which facts are supported or missing
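
For instance, a hypothetical sketch that branches on the verdict to surface failures for debugging (the inputs are illustrative):

from mlflow.genai.scorers import Correctness

correctness_judge = Correctness()
feedback = correctness_judge(
    inputs={"request": "When was MLflow released?"},
    outputs={"response": "MLflow was released in 2017."},
    expectations={"expected_facts": ["MLflow was released in June 2018"]},
)
if feedback.value == "no":
    # The rationale pinpoints which expected facts were missing or contradicted.
    print(f"Incorrect response: {feedback.rationale}")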

Next steps

  • Explore other built-in judges - learn about the other built-in quality judges
  • Create custom judges - build domain-specific evaluation judges
  • Run evaluations - use judges in comprehensive application evaluation