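The Correctness judge evaluates whether your GenAI application's response is factually correct by comparing it against the ground-truth information you provide (expected_facts or expected_response).
This built-in LLM judge is designed to assess how well an application's response matches a known correct answer.
By default, this judge uses a Databricks-hosted LLM to perform GenAI quality evaluations. You can change the judge model by passing the model argument in the judge definition. The model must be specified in the format <provider>:/<model-name>, where <provider> is a LiteLLM-compatible model provider. If you use databricks as the model provider, the model name is the same as the serving endpoint name.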
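To make the <provider>:/<model-name> format concrete, here is a minimal sketch that splits a model string into its two parts. The helper `split_model_uri` is hypothetical and not part of MLflow, which does its own parsing; the sketch only shows the shape the string is expected to have.

```python
def split_model_uri(model: str) -> tuple[str, str]:
    """Split '<provider>:/<model-name>' into (provider, model_name).

    Hypothetical helper for illustration only; not an MLflow API.
    """
    provider, sep, name = model.partition(":/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>:/<model-name>', got {model!r}")
    return provider, name


provider, name = split_model_uri("databricks:/databricks-gpt-oss-120b")
print(provider)  # databricks
print(name)      # databricks-gpt-oss-120b
```

With the databricks provider, the second part is the serving endpoint name, as described above.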
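Prerequisites for running the examples
Install MLflow and the required packages
pip install --upgrade "mlflow[databricks]>=3.4.0"
Follow the environment setup quickstart to create an MLflow experiment.
Usage example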
from mlflow.genai.scorers import Correctness

correctness_judge = Correctness()

# Example 1: Response contains expected facts
feedback = correctness_judge(
    inputs={"request": "What is MLflow?"},
    outputs={"response": "MLflow is an open-source platform for managing the ML lifecycle."},
    expectations={
        "expected_facts": [
            "MLflow is open-source",
            "MLflow is a platform for ML lifecycle"
        ]
    }
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of which facts are supported

# Example 2: Response missing or contradicting facts
feedback = correctness_judge(
    inputs={"request": "When was MLflow released?"},
    outputs={"response": "MLflow was released in 2017."},
    expectations={"expected_facts": ["MLflow was released in June 2018"]}
)

# Example 3: Using expected_response instead of expected_facts
feedback = correctness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "The capital/major city of France is Paris."},
    expectations={"expected_response": "Paris is the capital/major city of France."}
)
Usage with mlflow.genai.evaluate()
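The Correctness judge can be used directly with MLflow's evaluation framework.
Requirements:
- Trace requirements: inputs and outputs must be on the trace's root span
- Ground-truth labels: Required. You must provide either expected_facts or expected_response in the expectations dictionary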
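The requirements above can be checked before an evaluation run. The following is a minimal sketch, assuming the evaluation data is a list of dictionaries shaped like the ones used on this page; `check_row` is a hypothetical pre-flight helper, not an MLflow API.

```python
def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one evaluation row (empty if none).

    Hypothetical pre-flight check for illustration; not an MLflow API.
    """
    problems = []
    # Both inputs and outputs must be present and non-empty.
    for key in ("inputs", "outputs"):
        if not row.get(key):
            problems.append(f"missing '{key}'")
    # At least one kind of ground truth must be supplied.
    expectations = row.get("expectations") or {}
    if not any(k in expectations for k in ("expected_facts", "expected_response")):
        problems.append("expectations needs 'expected_facts' or 'expected_response'")
    return problems


row = {
    "inputs": {"query": "What is the capital of France?"},
    "outputs": {"response": "Paris."},
    "expectations": {},
}
print(check_row(row))
# ["expectations needs 'expected_facts' or 'expected_response'"]
```

A row that passes this check returns an empty list.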
import mlflow
from mlflow.genai.scorers import Correctness

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        },
    },
    {
        "inputs": {"query": "What are the main components of MLflow?"},
        "outputs": {
            "response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
        },
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        },
    },
    {
        "inputs": {"query": "When was MLflow released?"},
        "outputs": {
            "response": "MLflow was released in 2017 by Databricks."
        },
        "expectations": {
            "expected_facts": ["MLflow was released in June 2018"]
        },
    }
]

# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Correctness(
            model="databricks:/databricks-gpt-oss-120b",  # Optional. Defaults to custom Databricks model.
        )
    ]
)
Alternative: using expected_response
You can also use expected_response instead of expected_facts:
eval_dataset_with_response = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {
            "response": "MLflow is an open-source platform for managing the ML lifecycle."
        },
        "expectations": {
            "expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
        },
    }
]

# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
    data=eval_dataset_with_response,
    scorers=[Correctness()]
)
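Tip
Prefer expected_facts over expected_response, because it allows more flexible evaluation: the response does not need to match word for word, it only needs to contain the key facts.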
Customization
You can customize the judge by providing a different judge model:
import mlflow
from mlflow.genai.scorers import Correctness

# Use a different judge model
correctness_judge = Correctness(
    model="databricks:/databricks-gpt-5-mini"  # Or any LiteLLM-compatible model
)

# Use in evaluation (eval_dataset is the list defined in the earlier example)
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[correctness_judge]
)
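Interpreting results
The judge returns a Feedback object containing:
- value: "yes" if the response is correct, "no" if it is incorrect
- rationale: A detailed explanation of which facts are supported or missing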
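When the judge is called on several responses, the two fields above are enough to compute a simple pass rate and collect the explanations for failures. The sketch below is one possible way to do that. It uses a hypothetical `JudgeResult` stand-in so that it runs without a judge model; it assumes real Feedback objects expose `value` and `rationale` in the same way and that `value` compares equal to "yes" for a correct response.

```python
from typing import NamedTuple


class JudgeResult(NamedTuple):
    """Hypothetical stand-in exposing the two Feedback fields described above."""
    value: str
    rationale: str


def summarize(results):
    """Return (fraction judged "yes", rationales of all other results)."""
    if not results:
        return 0.0, []
    passed = sum(1 for r in results if r.value == "yes")
    failures = [r.rationale for r in results if r.value != "yes"]
    return passed / len(results), failures


results = [
    JudgeResult("yes", "All expected facts are supported."),
    JudgeResult("no", "The response gives 2017; the expected fact says June 2018."),
]
pass_rate, failures = summarize(results)
print(pass_rate)  # 0.5
print(failures)   # rationale of the single failing result
```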