Nota
El acceso a esta página requiere autorización. Puede intentar iniciar sesión o cambiar directorios.
El acceso a esta página requiere autorización. Puede intentar cambiar los directorios.
judges.is_correct()
预定义的法官通过将其与提供的基本事实信息expected_facts
(或expected_response
)进行比较,来评估 GenAI 应用程序的响应是否准确无误。
此评估者可通过预定义的 Correctness
评分器来评估应用程序响应与已知正确答案的比较。
API 签名
有关详细信息,请参阅 mlflow.genai.judges.is_correct()
。
from mlflow.genai.judges import is_correct
def is_correct(
*,
request: str, # User's question or query
response: str, # Application's response to evaluate
expected_facts: Optional[list[str]], # List of expected facts (provide either expected_response or expected_facts)
expected_response: Optional[str] = None, # Ground truth response (provide either expected_response or expected_facts)
name: Optional[str] = None # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
"""Returns Feedback with 'yes' or 'no' value and a rationale"""
运行示例的先决条件
安装 MLflow 和所需包
pip install --upgrade "mlflow[databricks]>=3.1.0"
请按照设置环境快速入门创建 MLflow 试验。
直接使用 SDK
from mlflow.genai.judges import is_correct
# Example 1: Response contains expected facts
feedback = is_correct(
request="What is MLflow?",
response="MLflow is an open-source platform for managing the ML lifecycle.",
expected_facts=[
"MLflow is open-source",
"MLflow is a platform for ML lifecycle"
]
)
print(feedback.value) # "yes"
print(feedback.rationale) # Explanation of correctness
# Example 2: Response missing or contradicting facts
feedback = is_correct(
request="When was MLflow released?",
response="MLflow was released in 2017.",
expected_facts=["MLflow was released in June 2018"]
)
print(feedback.value) # "no"
print(feedback.rationale) # Explanation of what's incorrect
使用预构建的评分器
is_correct
判断可通过 Correctness
预构建的评分器获得。
要求:
-
跟踪要求:
inputs
和outputs
必须位于跟踪的根跨度上 -
基本事实标签:必需 - 必须在
expected_facts
字典中提供expected_response
或expectations
之一
from mlflow.genai.scorers import Correctness
# Create evaluation dataset with ground truth
eval_dataset = [
{
"inputs": {"query": "What is the capital of France?"},
"outputs": {
"response": "Paris is the magnificent capital city of France, known for the Eiffel Tower and rich culture."
},
"expectations": {
"expected_facts": ["Paris is the capital of France."]
},
},
{
"inputs": {"query": "What are the main components of MLflow?"},
"outputs": {
"response": "MLflow has four main components: Tracking, Projects, Models, and Registry."
},
"expectations": {
"expected_facts": [
"MLflow has four main components",
"Components include Tracking",
"Components include Projects",
"Components include Models",
"Components include Registry"
]
},
},
{
"inputs": {"query": "When was MLflow released?"},
"outputs": {
"response": "MLflow was released in 2017 by Databricks."
},
"expectations": {
"expected_facts": ["MLflow was released in June 2018"]
},
}
]
# Run evaluation with Correctness scorer
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[Correctness()]
)
替代方法:使用expected_response
还可以使用 expected_response
,而不是 expected_facts
:
eval_dataset_with_response = [
{
"inputs": {"query": "What is MLflow?"},
"outputs": {
"response": "MLflow is an open-source platform for managing the ML lifecycle."
},
"expectations": {
"expected_response": "MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment."
},
}
]
# Run evaluation with expected_response
eval_results = mlflow.genai.evaluate(
data=eval_dataset_with_response,
scorers=[Correctness()]
)
小窍门
建议使用expected_facts
而不是expected_response
,因为它允许更灵活的评估——响应不需要逐字逐句一致,只需包含关键事实。
在自定义评分器中使用
在评估具有与预定义评分器要求不同的数据结构的应用程序时,请将判断包装在自定义评分器中:
from mlflow.genai.judges import is_correct
from mlflow.genai.scorers import scorer
from typing import Dict, Any
eval_dataset = [
{
"inputs": {"question": "What are the main components of MLflow?"},
"outputs": {
"answer": "MLflow has four main components: Tracking, Projects, Models, and Registry."
},
"expectations": {
"facts": [
"MLflow has four main components",
"Components include Tracking",
"Components include Projects",
"Components include Models",
"Components include Registry"
]
}
},
{
"inputs": {"question": "What is MLflow used for?"},
"outputs": {
"answer": "MLflow is used for building websites."
},
"expectations": {
"facts": [
"MLflow is used for managing ML lifecycle",
"MLflow helps with experiment tracking"
]
}
}
]
@scorer
def correctness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
return is_correct(
request=inputs["question"],
response=outputs["answer"],
expected_facts=expectations["facts"]
)
# Run evaluation
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[correctness_scorer]
)
解释结果
法官返回一个 Feedback
对象,其中包含:
-
value
:如果响应正确,则为“是”,如果不正确,则为“否” -
rationale
:详细说明哪些事实受支持或缺失
后续步骤
- 探索其他预定义的评委 - 了解其他内置质量评估评委
- 创建自定义评委 - 构建领域特定的评估评委
- 运行评估 - 在综合应用程序评估中使用判断