Production monitoring

Important

This feature is in Beta.

Overview

With MLflow, you can automatically run scorers on production GenAI application traces to continuously monitor quality. You can schedule any scorer (including custom metrics and built-in or custom LLM judges) to automatically evaluate a sample of production traffic.

Key benefits:

  • Automated quality assessment with no manual intervention
  • Flexible sampling to balance coverage against computational cost
  • Consistent evaluation using the same scorers as in development
  • Continuous monitoring through periodic background execution

Prerequisites

Before setting up quality monitoring, make sure you have:

  1. An MLflow experiment: an MLflow experiment to log traces to. If none is specified, the active experiment is used.
  2. An instrumented production app: your GenAI app must log traces using MLflow Tracing. See the production tracing guide, and the minimal sketch after this list.
  3. Defined scorers: scorers that have been tested against your application's trace format.
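
If your app is not instrumented yet, the following is a minimal sketch of trace logging; the experiment path and app function are hypothetical placeholders.

import mlflow

# Hypothetical experiment path; use your own
mlflow.set_experiment("/Shared/my-genai-app")

@mlflow.trace
def my_app(query: str) -> str:
    # Real application logic (LLM calls, retrieval, etc.) goes here
    return f"Answer to: {query}"

my_app("What is MLflow?")  # Logs a trace to the active experiment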

Tip

If you used your production app as predict_fn in mlflow.genai.evaluate() during development, your scorers are likely already compatible; see the sketch below.
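
For reference, here is a minimal sketch of that development-time setup, where my_app and eval_data are hypothetical stand-ins for your app function and evaluation dataset:

import mlflow
from mlflow.genai.scorers import Safety

# eval_data: e.g. a list of {"inputs": {...}} records (hypothetical)
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,  # the same function that serves production traffic
    scorers=[Safety()],
)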

Get started with production monitoring

This section contains example code that shows how to create different types of scorers.

Note

At any given time, at most 20 scorers can be associated with an experiment for continuous quality monitoring.

Use predefined scorers

MLflow provides several predefined scorers that you can use for monitoring out of the box.

from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer with a name and start monitoring
safety_scorer = Safety().register(name="my_safety_scorer")  # name must be unique to experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))

Use guidelines-based LLM scorers

Guidelines-based LLM scorers evaluate inputs and outputs against pass/fail criteria written in natural language.

from mlflow.genai.scorers import Guidelines

# Create and register the guidelines scorer
english_scorer = Guidelines(
  name="english",
  guidelines=["The response must be in English"]
).register(name="is_english")  # name must be unique to experiment

# Start monitoring with the specified sample rate
english_scorer = english_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))

Use prompt-based scorers

For more flexibility than guidelines-based LLM scorers offer, you can use prompt-based scorers, which support multi-level quality assessment with customizable choice categories (for example, excellent/good/poor) and optional numeric scores.

from mlflow.genai.scorers import scorer, ScorerSamplingConfig

@scorer
def formality(inputs, outputs, trace):
    # Must be imported inline within the scorer function body
    from mlflow.genai.judges.databricks import custom_prompt_judge
    from mlflow.entities.assessment import DEFAULT_FEEDBACK_NAME

    formality_prompt = """
    You will look at the response and determine the formality of the response.

    <request>{{request}}</request>
    <response>{{response}}</response>

    You must choose one of the following categories.

    [[formal]]: The response is very formal.
    [[semi_formal]]: The response is somewhat formal. The response is somewhat formal if the response mentions friendship, etc.
    [[not_formal]]: The response is not formal.
    """

    my_prompt_judge = custom_prompt_judge(
        name="formality",
        prompt_template=formality_prompt,
        numeric_values={
            "formal": 1,
            "semi_formal": 0.5,
            "not_formal": 0,
        },
    )

    result = my_prompt_judge(request=inputs, response=outputs)
    if hasattr(result, "name"):
        result.name = DEFAULT_FEEDBACK_NAME
    return result

# Register the custom scorer and start monitoring
formality_scorer = formality.register(name="my_formality_scorer")  # name must be unique to experiment
formality_scorer = formality_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.1))

Use custom scorer functions

For maximum flexibility, including the option to skip LLM-based scoring altogether, you can define and use custom scorer functions for monitoring.

When defining a custom scorer, do not use type hints in the function signature that require imports. If the scorer function body uses packages that require imports, import those packages inline inside the function so that the scorer serializes properly.

Some packages are available by default and do not need inline imports. These include databricks-agents, mlflow-skinny, openai, and all packages included in serverless environment version 2.

from mlflow.genai.scorers import scorer, ScorerSamplingConfig

# Custom metric: Check if response mentions Databricks
@scorer
def mentions_databricks(outputs):
    """Check if the response mentions Databricks"""
    return "databricks" in str(outputs.get("response", "")).lower()

# Custom metric: Response length check
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(str(outputs.get("response", "")))

# Custom metric with multiple inputs
@scorer
def response_relevance_score(inputs, outputs):
    """Score relevance based on keyword matching"""
    query = str(inputs.get("query", "")).lower()
    response = str(outputs.get("response", "")).lower()

    # Simple keyword matching (replace with your logic)
    query_words = set(query.split())
    response_words = set(response.split())

    if not query_words:
        return 0.0

    overlap = len(query_words & response_words)
    return overlap / len(query_words)

# Register and start monitoring custom scorers
databricks_scorer = mentions_databricks.register(name="databricks_mentions")
databricks_scorer = databricks_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

length_scorer = response_length.register(name="response_length")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))

relevance_scorer = response_relevance_score.register(name="response_relevance_score")  # name must be unique to experiment
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))

Configure multiple scorers

For a comprehensive monitoring setup, you can register and start multiple scorers individually.

from mlflow.genai.scorers import Safety, RelevanceToQuery, ScorerSamplingConfig

# Configure multiple scorers for comprehensive monitoring
safety_scorer = Safety().register(name="safety_check")  # name must be unique within an MLflow experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))  # Check all traces

relevance_scorer = RelevanceToQuery().register(name="relevance_check")
relevance_scorer = relevance_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))  # Sample 50%

length_scorer = response_length.register(name="length_analysis")
length_scorer = length_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.3))

Manage scheduled scorers

List current scorers

To view all registered scorers for an experiment:

from mlflow.genai.scorers import list_scorers

# List all registered scorers
scorers = list_scorers()
for scorer in scorers:
    print(f"Name: {scorer._server_name}")
    print(f"Sample rate: {scorer.sample_rate}")
    print(f"Filter: {scorer.filter_string}")
    print("---")

Update scorers

To modify an existing scorer's configuration:

from mlflow.genai.scorers import get_scorer, ScorerSamplingConfig

# Get existing scorer and update its configuration (immutable operation)
safety_scorer = get_scorer(name="safety_monitor")
updated_scorer = safety_scorer.update(sampling_config=ScorerSamplingConfig(sample_rate=0.8))  # Increased from 0.5

# Note: The original scorer remains unchanged; update() returns a new scorer instance
print(f"Original sample rate: {safety_scorer.sample_rate}")  # Original rate
print(f"Updated sample rate: {updated_scorer.sample_rate}")   # New rate

Stop and delete scorers

To stop monitoring entirely or to delete a scorer:

from mlflow.genai.scorers import get_scorer, delete_scorer, ScorerSamplingConfig

# Get existing scorer
databricks_scorer = get_scorer(name="databricks_mentions")

# Stop monitoring (sets sample_rate to 0, keeps scorer registered)
stopped_scorer = databricks_scorer.stop()
print(f"Sample rate after stop: {stopped_scorer.sample_rate}")  # 0

# Remove scorer entirely from the server
delete_scorer(name=databricks_scorer.name)

# Or restart monitoring from a stopped scorer
restarted_scorer = stopped_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

Evaluate historical traces (metrics backfill)

You can retroactively apply new or updated metrics to historical traces.

Basic metrics backfill using current sample rates

from databricks.agents.scorers import backfill_scorers
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety().register(name="safety_check")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Use existing sample rates for specified scorers
job_id = backfill_scorers(
    scorers=["safety_check", "response_length"]
)

Metrics backfill with custom sample rates and a time range

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime
from mlflow.genai.scorers import Safety, ScorerSamplingConfig, scorer

safety_scorer = Safety().register(name="safety_check")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Custom scorer
@scorer(aggregations=["mean", "min", "max"])
def response_length(outputs):
    """Measure response length in characters"""
    return len(outputs)

response_length = response_length.register(name="response_length")
response_length = response_length.start(sampling_config=ScorerSamplingConfig(sample_rate=0.5))

# Define custom sample rates for backfill
custom_scorers = [
    BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
    BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
]

job_id = backfill_scorers(
    experiment_id=YOUR_EXPERIMENT_ID,  # Replace with your MLflow experiment ID
    scorers=custom_scorers,
    start_time=datetime(2024, 6, 1),
    end_time=datetime(2024, 6, 30)
)

Backfill recent data

from databricks.agents.scorers import backfill_scorers, BackfillScorerConfig
from datetime import datetime, timedelta

# Backfill last week's data with higher sample rates
one_week_ago = datetime.now() - timedelta(days=7)

job_id = backfill_scorers(
    scorers=[
        BackfillScorerConfig(scorer=safety_scorer, sample_rate=0.8),
        BackfillScorerConfig(scorer=response_length, sample_rate=0.9)
    ],
    start_time=one_week_ago
)

View results

After you schedule scorers, allow 15-20 minutes for initial processing. Then:

  1. Navigate to your MLflow experiment.
  2. Open the Traces tab to view the assessments attached to traces; you can also fetch them programmatically, as sketched after this list.
  3. Use the monitoring dashboard to track quality trends.
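
To spot-check results programmatically, here is a minimal sketch, assuming a hypothetical experiment ID:

import mlflow

# Returns a pandas DataFrame of recent traces; assessments logged by
# scorers are attached to each trace
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],  # hypothetical placeholder
    max_results=10,
)
print(traces.head())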

Best practices

Sampling strategy

Balance coverage against cost, as shown in the following example:

from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# High-priority scorers: higher sampling
safety_scorer = Safety().register(name="safety")
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=1.0))  # 100% coverage for critical safety

# Expensive scorers: lower sampling
# ComplexCustomScorer is a placeholder for your own scorer class
complex_scorer = ComplexCustomScorer().register(name="complex_analysis")
complex_scorer = complex_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.05))  # 5% for expensive operations

Custom scorer design

Keep custom scorers self-contained, as shown in the following example:

from mlflow.genai.scorers import scorer

@scorer
def well_designed_scorer(inputs, outputs):
    # ✅ All imports inside the function
    import re
    import json

    # ✅ Handle missing data gracefully
    response = outputs.get("response", "")
    if not response:
        return 0.0

    # ✅ Return consistent types
    return float(len(response) > 100)

Troubleshooting

Scorers not running

If scorers are not executing, check the following:

  1. Check the experiment: make sure traces are logged to the experiment itself, not to individual runs; see the sketch after this list.
  2. Sampling rate: with a low sample rate, it may take some time for results to appear.
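
A minimal sketch for confirming the destination experiment, assuming a hypothetical experiment path:

import mlflow

# set_experiment returns the Experiment object; confirm where traces go
exp = mlflow.set_experiment("/Shared/my-genai-app")  # hypothetical path
print(exp.experiment_id)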

Serialization issues

When creating custom scorers, include imports inside the function definition.

# ❌ Avoid external dependencies
import external_library  # Outside function

@scorer
def bad_scorer(outputs):
    return external_library.process(outputs)

# ✅ Include imports in the function definition
@scorer
def good_scorer(outputs):
    import json  # Inside function
    return len(json.dumps(outputs))

# ❌ Avoid type hints in the scorer function signature that require imports
from typing import List

@scorer
def scorer_with_bad_types(outputs: List[str]):
    return False

Metrics backfill issues

"Scheduled scorer 'X' not found in experiment"

  • Make sure the scorer name matches a scorer registered in the experiment
  • Use the list_scorers method to check the available scorers, as sketched after this list
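
A minimal sketch of that check, assuming the scorer names used earlier on this page:

from mlflow.genai.scorers import list_scorers

# Collect the registered scorer names for the current experiment
registered = {s._server_name for s in list_scorers()}
print("safety_check" in registered)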

Next steps

Continue with the following tutorials.

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.