Important
This feature is in Beta.
Overview
With MLflow, you can automatically run scorers on production GenAI application traces to continuously monitor quality. You can schedule any scorer (including custom metrics and built-in or custom LLM judges) to automatically evaluate a sample of production traffic.
Key benefits:
- Automated quality assessment with no manual intervention.
- Flexible sampling to balance coverage against computational cost.
- Consistent evaluation using the same scorers you used during development.
- Continuous monitoring via periodic background execution.
Prerequisites
Before setting up quality monitoring, make sure you have:
- An MLflow experiment: the MLflow experiment that traces are logged to. If not specified, the active experiment is used.
- An instrumented production app: your GenAI app must log traces using MLflow Tracing. See the production tracing guide.
- Defined scorers: scorers tested against your app's trace format.
Tip
If you used your production app as the predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible.
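For example, the following minimal sketch (my_app and the sample data are hypothetical placeholders) shows a traced app reused as predict_fn, so a scorer validated offline can later be registered for monitoring:

import mlflow
from mlflow.genai.scorers import Safety

@mlflow.trace
def my_app(query: str) -> dict:
    # Hypothetical app logic; replace with your real application call
    return {"response": f"Answer to: {query}"}

# The same traced function serves as predict_fn during development,
# so scorers validated here are compatible with production traces.
mlflow.genai.evaluate(
    data=[{"inputs": {"query": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[Safety()],
)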
Get started with production monitoring
This section contains example code demonstrating how to create different types of scorers.
Note
At any given time, at most 20 scorers can be associated with an experiment for continuous quality monitoring.
Use predefined scorers
MLflow provides several predefined scorers that you can use out of the box for monitoring.
from mlflow.genai.scorers import Safety, ScorerSamplingConfig
# Register the scorer with a name and start monitoring
safety_scorer = Safety().register(name="my_safety_scorer") # name must be unique to experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
Use guidelines-based LLM scorers
Guidelines-based LLM scorers evaluate inputs and outputs against pass/fail natural-language criteria.
from mlflow.genai.scorers import Guidelines
# Create and register the guidelines scorer
english_scorer = Guidelines(
    name="english",
    guidelines=["The response must be in English"]
).register(name="is_english")  # name must be unique to experiment
# Start monitoring with the specified sample rate
english_scorer = english_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
Use prompt-based scorers
For more flexibility than guidelines-based LLM scorers, you can use prompt-based scorers, which support multi-level quality assessment with customizable choice categories (for example, excellent/good/poor) and optional numeric scores.
from mlflow.genai.scorers import scorer, ScorerSamplingConfig
@scorer
def formality(inputs, outputs, trace):
    # Must be imported inline within the scorer function body
    from mlflow.genai.judges.databricks import custom_prompt_judge
    from mlflow.entities.assessment import DEFAULT_FEEDBACK_NAME

    formality_prompt = """
You will look at the response and determine the formality of the response.

<request>{{request}}</request>
<response>{{response}}</response>

You must choose one of the following categories.

[[formal]]: The response is very formal.
[[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the response mentions friendship, etc.
[[not_formal]]: The response is not formal.
"""
    my_prompt_judge = custom_prompt_judge(
        name="formality",
        prompt_template=formality_prompt,
        numeric_values={
            "formal": 1,
            "semi_formal": 0.5,
            "not_formal": 0,
        },
    )

    # Pass the trace inputs and outputs to the judge
    result = my_prompt_judge(request=inputs, response=outputs)
    if hasattr(result, "name"):
        result.name = DEFAULT_FEEDBACK_NAME
    return result
# Register the custom scorer and start monitoring
formality_scorer = formality.register(name="my_formality_scorer") # name must be unique to experiment
formality_scorer = formality_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))
Use custom scorer functions
For maximum flexibility, including the option to forgo LLM-based scoring entirely, you can define and use custom scorer functions for monitoring.
When defining a custom scorer, do not use type hints in the function signature that require imports. If the scorer function body uses packages that require imports, import those packages inline within the function so it serializes properly.
Some packages are available by default and do not need to be imported inline. These include databricks-agents, mlflow-skinny, openai, and all packages included in serverless environment version 2.
from mlflow.genai.scorers import scorer, ScorerSamplingConfig
# Custom metric: Check if response mentions Databricks
@scorer
def mentions_databricks(outputs):
"""Check if the response mentions Databricks"""
return "databricks" in str(outputs.get("response", "")).lower()
# Custom metric: Response length check
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
"""Measure response length in characters"""
return len(str(outputs.get("response", "")))
# Custom metric with multiple inputs
@scorer
def response_relevance_score(inputs, outputs):
"""Score relevance based on keyword matching"""
query = str(inputs.get("query", "")).lower()
response = str(outputs.get("response", "")).lower()
# Simple keyword matching (replace with your logic)
query_words = set(query.split())
response_words = set(response.split())
if not query_words:
return 0.0
overlap = len(query_words & response_words)
return overlap / len(query_words)
# Register and start monitoring custom scorers
databricks_scorer = mentions_databricks.register(name="databricks_mentions")
databricks_scorer = databricks_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
length_scorer = response_length.register(name="response_length")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))
relevance_scorer = response_relevance_score.register(name="response_relevance_score") # name must be unique to experiment
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))
Configure multiple scorers
For a comprehensive monitoring setup, register and start multiple scorers individually.
from mlflow.genai.scorers import Safety, RelevanceToQuery, ScorerSamplingConfig
# Configure multiple scorers for comprehensive monitoring
safety_scorer = Safety().register(name="safety_check") # name must be unique within an MLflow experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0)) # Check all traces
relevance_scorer = RelevanceToQuery().register(name="relevance_check")
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5)) # Sample 50%
length_scorer = response_length.register(name="length_analysis")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.3))
Manage scheduled scorers
List current scorers
To view all registered scorers for an experiment:
from mlflow.genai.scorers import list_scorers
# List all registered scorers
scorers = list_scorers()
for scorer in scorers:
print(f"Name: {scorer._server_name}")
print(f"Sample rate: {scorer.sample_rate}")
print(f"Filter: {scorer.filter_string}")
print("---")
Update scorers
Modify an existing scorer's configuration:
from mlflow.genai.scorers import get_scorer, ScorerSamplingConfig
# Get existing scorer and update its configuration (immutable operation)
safety_scorer = get_scorer(name="safety_monitor")
updated_scorer = safety_scorer.update(sampling_config=ScorerSamplingConfig(sample_rate=0.8)) # Increased from 0.5
# Note: The original scorer remains unchanged; update() returns a new scorer instance
print(f"Original sample rate: {safety_scorer.sample_rate}") # Original rate
print(f"Updated sample rate: {updated_scorer.sample_rate}") # New rate
Stop and delete scorers
Stop monitoring entirely, or delete a scorer:
from mlflow.genai.scorers import get_scorer, delete_scorer, ScorerSamplingConfig
# Get existing scorer
databricks_scorer = get_scorer(name="databricks_mentions")
# Stop monitoring (sets sample_rate to 0, keeps scorer registered)
stopped_scorer = databricks_scorer.stop()
print(f"Sample rate after stop: {stopped_scorer.sample_rate}") # 0
# Remove scorer entirely from the server
delete_scorer(name=databricks_scorer.name)
# Or restart monitoring from a stopped scorer
restarted_scorer = stopped_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))
Evaluate historical traces (metrics backfill)
You can retroactively apply new or updated metrics to historical traces.
Basic metrics backfill using current sample rates
from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety()
safety_scorer.register(name="safety_check")
safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length.register(name="response_length")
response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)
Metrics backfill with custom sample rates and a time range
from datetime import datetime

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety()
safety_scorer.register(name="safety_check")
safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length.register(name="response_length")
response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30),
)
Backfill recent data
from datetime import datetime, timedelta

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig

# Backfill last week's data with higher sample rates
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9),
    ],
    start_time=one_week_ago,
)
View results
After scheduling scorers, allow 15-20 minutes for initial processing. Then:
- Navigate to your MLflow experiment.
- Open the Traces tab to see assessments attached to traces.
- Use the monitoring dashboard to track quality trends.
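You can also fetch traces and their attached assessments programmatically. A minimal sketch, using a placeholder experiment ID:

import mlflow

# Fetch recent traces from the experiment; assessments from scheduled
# scorers appear on the returned traces once processing completes
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],  # placeholder
    max_results=10,
)
print(traces.head())  # returns a pandas DataFrame by default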
Best practices
Sampling strategy
Balance coverage against cost, as shown in the following examples:
# High-priority scorers: higher sampling
safety_scorer = Safety().register(name="safety")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0)) # 100% coverage for critical safety
# Expensive scorers: lower sampling
complex_scorer = ComplexCustomScorer().register(name="complex_analysis")
complex_scorer = complex_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.05)) # 5% for expensive operations
Custom scorer design
Keep custom scorers self-contained, as in the following example:
@scorer
def well_designed_scorer(inputs, outputs):
    # ✅ All imports inside the function
    import re
    import json

    # ✅ Handle missing data gracefully
    response = outputs.get("response", "")
    if not response:
        return 0.0

    # ✅ Return consistent types
    return float(len(response) > 100)
Troubleshooting
Scorers not running
If scorers are not executing, check the following:
- Check the experiment: make sure traces are logged to the experiment itself, not to individual runs (see the sketch after this list).
- Sampling rate: with a low sample rate, it can take a while for results to appear.
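A minimal sketch of pinning traces to an experiment (the experiment path is a placeholder):

import mlflow

# Set the target experiment before your app starts logging traces
mlflow.set_experiment("/Shared/my-genai-app")  # placeholder path

@mlflow.trace
def handle_request(query: str) -> str:
    # Traces from this function are logged to the active experiment above
    return f"Response to: {query}"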
Serialization issues
When creating custom scorers, include imports inside the function definition.
# ❌ Avoid external dependencies
import external_library  # Outside function

@scorer
def bad_scorer(outputs):
    return external_library.process(outputs)

# ✅ Include imports in the function definition
@scorer
def good_scorer(outputs):
    import json  # Inside function
    return len(json.dumps(outputs))

# ❌ Avoid type hints in the scorer function signature that require imports
from typing import List

@scorer
def scorer_with_bad_types(outputs: List[str]):
    return False
Metrics backfill issues
"Scheduled scorer 'X' not found in experiment"
- Make sure the scorer name matches a scorer registered in the experiment.
- Use the list_scorers function to check the available scorers, as shown below.
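A quick check, reusing list_scorers from earlier in this guide:

from mlflow.genai.scorers import list_scorers

# Print the names of all scorers registered in the active experiment
print([s._server_name for s in list_scorers()])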
Next steps
Continue with the following tutorial.
- Create custom scorers - Build scorers tailored to your requirements.
Reference guides
Explore detailed documentation for the concepts and features mentioned in this guide.
- Production monitoring - A deep dive into monitoring concepts.
- Scorers - Understand the metrics that power monitoring.
- Evaluation Harness - How offline evaluation relates to production.