Custom scorers offer the ultimate flexibility to define exactly how quality is measured for your GenAI application. They let you define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluation.
Use custom scorers for the following scenarios:
Defining a custom heuristic or code-based evaluation metric.
Customizing how your app's trace data is mapped to the Databricks research-backed LLM judges in the predefined LLM scorers.
Creating an LLM judge with custom prompt text, as described in the prompt-based LLM scorers article.
Using your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation.
Any other scenario where you need more flexibility and control than the predefined abstractions provide.
Note
For a detailed reference on the custom scorer interface, see the scorer concepts page or the API documentation.
Usage overview
Custom scorers are written in Python and give you full control over any data in your app's traces. A single custom scorer works in both evaluate(...) for offline evaluation and in production monitoring. Once defined, a custom scorer can be used just like a predefined scorer.
The following output types are supported (see the sketch after this list):
- Pass/fail strings: the string values "yes" or "no" render as "Pass" or "Fail" in the UI.
- Numeric values: ordinal values, such as integers or floats.
- Boolean values: True or False.
- Feedback objects: return a Feedback object with a score, rationale, and other metadata.
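To illustrate these return types, here is a minimal sketch with one scorer per supported output type. The scorer names and the checks they perform are hypothetical placeholders:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def passes_check(outputs) -> str:
    # Pass/fail string: rendered as "Pass" or "Fail" in the UI
    return "yes" if outputs else "no"

@scorer
def response_length(outputs) -> int:
    # Numeric value
    return len(str(outputs))

@scorer
def is_concise(outputs) -> bool:
    # Boolean value
    return len(str(outputs)) < 500

@scorer
def non_empty_feedback(outputs) -> Feedback:
    # Feedback object with a value, rationale, and other metadata
    return Feedback(value="yes" if outputs else "no", rationale="Checks that the response is non-empty.")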
As input, custom scorers have access to:
- The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instantiated mlflow.entities.Trace class.
- The inputs dictionary, derived from the input dataset or MLflow's post-processing of your trace.
- The outputs value, derived from the input dataset or the trace. If a predict_fn is provided, outputs is the return value of predict_fn.
- The expectations dictionary, derived from the expectations field in the input dataset, or from assessments associated with the trace.
The @scorer decorator allows users to define custom evaluation metrics that can be passed to mlflow.genai.evaluate() using the scorers argument.
The scorer function is called with named arguments based on the signature below. All named arguments are optional, so you can use any combination of them. For example, you can define a scorer that takes only inputs and trace as arguments and omits outputs and expectations (a sketch of this appears after the signature below):
import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback

@scorer
def my_custom_scorer(
    *,  # evaluate(...) harness will always call your scorer with named arguments
    inputs: Optional[dict[str, Any]],  # The agent's raw input, parsed from the Trace or dataset, as a Python dict
    outputs: Optional[Any],  # The agent's raw output, parsed from the Trace or the return value of predict_fn
    expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
    trace: Optional[mlflow.entities.Trace],  # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
    ...
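For example, here is a minimal sketch of a scorer that receives only inputs and trace, as described above. The scorer name and the check it performs are hypothetical placeholders:
import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any, Optional

@scorer
def trace_has_spans(inputs: Optional[dict[str, Any]], trace: Optional[Trace]) -> bool:
    # Hypothetical check: pass if the trace recorded at least one span.
    # outputs and expectations are simply omitted from the signature.
    return trace is not None and len(trace.data.spans) > 0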
Custom scorer development approach
As you develop a metric, you need to iterate on it quickly without having to execute your app every time you change the scorer. To do this, the following steps are recommended:
Step 1: Define your initial metric, app, and evaluation data
import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any

@mlflow.trace
def my_app(input_field_name: str):
    return {'output': input_field_name + '_output'}

@scorer
def my_metric() -> int:
    # placeholder return value
    return 1

eval_set = [{'inputs': {'input_field_name': 'test'}}]
Step 2: Use evaluate() to generate traces from your app
eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]
)
Step 3: Query and store the generated traces
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
Step 4: Pass the resulting traces as input to evaluate() as you iterate on your metric
The search_traces function returns a pandas DataFrame of traces, which you can pass directly to evaluate() as the input dataset. This lets you quickly adjust your metric without having to re-run your application.
@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    return outputs == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)
Custom scorer examples
This section describes different approaches to building custom scorers.
Prerequisite: Create a sample application and get a local copy of its traces
In all of the approaches, we use the following sample application and a copy of its traces (extracted using the approach above).
Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.
Databricks-hosted LLMs
Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs, and select a model from the available foundation models.
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to OpenAI-hosted models, and select a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
Generate traces using a simple scorer.
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model=model_name,  # Select a model
        messages=messages_for_llm,
    )
    return response.choices[0].message.content

# Create a list of messages for the LLM to generate a response
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]

@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1

# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)

generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)
After running the code above, you should have three traces in your experiment.
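As a quick sanity check, you can confirm the count from the generated_traces DataFrame returned by mlflow.search_traces above (a minimal sketch):
# generated_traces is a pandas DataFrame with one row per trace.
print(f"Number of traces: {len(generated_traces)}")  # Expected: 3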
Example 1: Use data from the trace
Access the complete MLflow trace object to use various details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.
Note
The generated_traces from the prerequisite section are used as the input data for these examples.
This scorer checks whether the response time of the LLM call in the trace is within an acceptable limit.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9  # second
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)
Example 2: Wrap a predefined LLM judge
Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use this to pre-process the trace data for the judge or to post-process its feedback.
This example demonstrates how to wrap the is_context_relevant judge, which evaluates whether a given context is relevant to a query, to assess whether the assistant's response is relevant to the user's query.
import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    # The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
    # We need to extract the content of the last user message to pass to the relevance judge.
    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It will return a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)
Example 3: Use expectations
When mlflow.genai.evaluate() is called with the data argument as a list of dictionaries or a Pandas DataFrame, each row can contain an expectations key. The value associated with this key is passed directly to your custom scorers.
This example also demonstrates using a custom scorer alongside a predefined scorer (in this case, the Safety scorer).
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer, Safety
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]
Example 3.1: Exact match with the expected response
This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.
@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app,  # sample_app is from the prerequisite section
    scorers=[exact_match, Safety()]  # You can include any number of scorers
)
Example 3.2: Keyword presence check from expectations
This scorer checks whether all of the expected_keywords from expectations are present in the assistant's response.
@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(
            value=None,  # Undetermined, as no keywords were expected
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app,  # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)
Example 4: Return multiple feedback objects
A single scorer can return a list of Feedback objects, allowing one scorer to assess multiple quality facets (for example, PII, sentiment, conciseness) at the same time. Ideally, each Feedback object should have a unique name (which becomes the metric name in the results); otherwise, they can overwrite each other if the auto-generated names collide. If a name is not provided, MLflow attempts to generate one based on the scorer function name and an index.
This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:
- is_not_empty_check: a boolean indicating whether the response content is non-empty.
- response_char_length: a numeric value for the character length of the response.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace  # Ensure Feedback and Trace are imported
from typing import Any, Optional

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)
The results will have two columns, is_not_empty_check and response_char_length, as assessments.
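To inspect these assessments programmatically, you can retrieve the evaluated traces by run ID, reusing the search_traces pattern from step 3 above. This is a minimal sketch; the exact DataFrame columns depend on your MLflow version:
# Each returned trace carries the "is_not_empty_check" and "response_char_length" assessments.
result_traces = mlflow.search_traces(run_id=multi_feedback_eval_results.run_id)
print(result_traces.columns)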
Example 5: Use your own LLM as a judge
Integrate a custom or externally hosted LLM into a scorer. The scorer handles the API calls and the input/output formatting, and generates Feedback from your LLM's response, giving you full control over the judging process.
You can also set the source field on the Feedback object to indicate that the source of the assessment is an LLM judge.
import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional
# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)
# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.
Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.
Original User Query:
```{user_query}```
AI's Response:
```{llm_response_from_app}```
Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""
@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0,  # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )

# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)
By opening the trace in the UI and clicking on the "answer_quality" assessment, you can see the judge's metadata, such as the rationale, timestamp, and judge model name. If a judge assessment is incorrect, you can override the score by clicking the Edit button.
The new assessment supersedes the original judge assessment, and the edit history is preserved for future reference.
Next steps
Continue your journey with the following actions and tutorials.
- Evaluate with custom LLM scorers - Create semantic evaluations using an LLM.
- Run scorers in production - Deploy your scorers for continuous monitoring.
- Build evaluation datasets - Create test data for your scorers.
Reference guides
Explore detailed documentation for the concepts and features mentioned in this guide.
- Scorers - Deep dive into how scorers work and their architecture.
- Evaluation harness - Learn how mlflow.genai.evaluate() uses scorers.
- LLM judges - Understand the foundations of AI-powered evaluation.