Scorers

Scorers evaluate the quality of GenAI applications by analyzing outputs and producing structured feedback. Write them once and use them everywhere, in development and in production.

Quick reference

Return type | UI display | Use case
"yes"/"no" | Pass/Fail | Binary evaluation
True/False | True/False | Boolean checks
int/float | Numeric value | Scores, counts
Feedback | Value + rationale | Detailed assessment
List[Feedback] | Multiple metrics | Multi-aspect evaluation

Write once, use everywhere

A key design principle of MLflow scorers is write once, use everywhere. The same scorer function works seamlessly in:

  • development evaluation with mlflow.genai.evaluate()
  • production monitoring of live traffic

This unified approach means you can develop and test quality metrics locally, then deploy exactly the same logic to production without modification.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# Define your scorer once
@scorer
def response_completeness(outputs: str) -> Feedback:
    # Outputs is return value of your app. Here we assume it's a string.
    if len(outputs.strip()) < 10:
        return Feedback(
            value=False,
            rationale="Response too short to be meaningful"
        )

    if outputs.lower().endswith(("...", "etc", "and so on")):
        return Feedback(
            value=False,
            rationale="Response appears incomplete"
        )

    return Feedback(
        value=True,
        rationale="Response appears complete"
    )

# Directly call the scorer function for spot testing
response_completeness(outputs="This is a test response...")

# Use in development evaluation
mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[response_completeness]
)

How scorers work

Scorers analyze traces from your GenAI application and produce quality assessments. The flow works as follows (a minimal end-to-end sketch appears after the list):

  1. Your app runs and produces a trace capturing its execution

  2. MLflow passes the trace to your scorer function

  3. The scorer analyzes the trace's inputs, outputs, and intermediate execution steps using your custom logic

  4. The scorer produces Feedback with a score and rationale

  5. The feedback is attached to the trace for analysis
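
Below is a minimal sketch of this flow. The app, the question, and the conciseness check are illustrative placeholders, and the app is assumed to be traced with @mlflow.trace.

import mlflow
from mlflow.genai.scorers import scorer

# Step 1: a traced app -- each call produces a trace capturing its execution
@mlflow.trace
def my_app(question: str) -> str:
    return f"You asked: {question}"

# Steps 2-4: the scorer receives data extracted from the trace and returns an assessment
@scorer
def is_concise(outputs: str) -> bool:
    return len(outputs.split()) <= 50

# Step 5: evaluate() runs the app, invokes the scorer on each trace,
# and attaches the resulting feedback to the traces for analysis
mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[is_concise],
)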

Inputs

Scorers receive the complete MLflow trace, including all spans, attributes, and outputs. For convenience, MLflow also extracts commonly needed data and passes it as named arguments:

from typing import Any, List, Optional, Union

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def my_custom_scorer(
    *,  # All arguments are keyword-only
    inputs: Optional[dict[str, Any]],       # App's raw input, a dictionary of input argument names and values
    outputs: Optional[Any],                 # App's raw output
    expectations: Optional[dict[str, Any]], # Ground truth, a dictionary of label names and values
    trace: Optional[mlflow.entities.Trace]  # Complete trace with all metadata
) -> Union[int, float, bool, str, Feedback, List[Feedback]]:
    # Your evaluation logic here
    ...

All parameters are optional; declare only what your scorer needs:

  • inputs: the request sent to your app (e.g. user query, context)
  • outputs: the response from your app (e.g. generated text, tool calls)
  • expectations: ground truth or labels (e.g. expected response, guidelines, etc.)
  • trace: the complete execution trace with all spans, allowing analysis of intermediate steps, latency, tool usage, and more

When running mlflow.genai.evaluate(), you can specify inputs, outputs, and expectations in the data argument, or have them parsed from the trace.

Registered scorers (used for production monitoring) always parse inputs and outputs from the trace; expectations is not available.

Outputs

Scorers can return different types depending on your evaluation needs.

Simple values

Return primitive values for straightforward pass/fail or numeric assessments.

  • Pass/fail strings: "yes" or "no", rendered as "Pass" or "Fail" in the UI
  • Booleans: True or False for binary checks
  • Numeric values: integers or floats for scores, counts, or measurements
# These examples assume your app returns a string as a response.
@scorer
def response_length(outputs: str) -> int:
    # Return a numeric metric
    return len(outputs.split())

@scorer
def contains_citation(outputs: str) -> str:
    # Return pass/fail string
    return "yes" if "[source]" in outputs else "no"

Rich feedback

Return a Feedback object for detailed assessments with explanations.

from mlflow.entities import Feedback, AssessmentSource

@scorer
def content_quality(outputs):
    return Feedback(
        value=0.85,  # Can be numeric, boolean, or string
        rationale="Clear and accurate, minor grammar issues",
        # Optional: source of the assessment. Several source types are supported,
        # such as "HUMAN", "CODE", "LLM_JUDGE".
        source=AssessmentSource(
            source_type="HUMAN",
            source_id="grammar_checker_v1"
        ),
        # Optional: additional metadata about the assessment.
        metadata={
            "annotator": "me@example.com",
        }
    )

You can also return multiple Feedback objects as a list. Each feedback appears as a separate metric in the evaluation results.

@scorer
def comprehensive_check(inputs, outputs):
    return [
        Feedback(name="relevance", value=True, rationale="Directly addresses query"),
        Feedback(name="tone", value="professional", rationale="Appropriate for audience"),
        Feedback(name="length", value=150, rationale="Word count within limits")
    ]

Metric naming behavior

When you use the @scorer decorator, metric names in the evaluation results follow these rules:

  1. Primitive values, or a single Feedback without a name: the scorer function name becomes the feedback name

    @scorer
    def word_count(outputs: str) -> int:
        # "word_count" will be used as a metric name
        return len(outputs.split())
    
    @scorer
    def response_quality(outputs: Any) -> Feedback:
        # "response_quality" will be used as a metric name
        return Feedback(value=True, rationale="Good quality")
    
  2. A single Feedback with an explicit name: the name specified in the Feedback object is used as the metric name

    @scorer
    def assess_factualness(outputs: Any) -> Feedback:
        # Name "factual_accuracy" is explicitly specfied, it will be used as a metric name
        return Feedback(name="factual_accuracy", value=True, rationale="Factual accuracy is high")
    
  3. Multiple Feedbacks: the name specified in each Feedback object is preserved. Each feedback must be given a unique name.

    @scorer
    def multi_aspect_check(outputs) -> list[Feedback]:
        # These names ARE used since multiple feedbacks are returned
        return [
            Feedback(name="grammar", value=True, rationale="No errors"),
            Feedback(name="clarity", value=0.9, rationale="Very clear"),
            Feedback(name="completeness", value="yes", rationale="All points addressed")
        ]
    

This naming behavior ensures consistent metric names across evaluation results and dashboards.

Error handling

When a scorer encounters an error, MLflow provides two approaches:

Let exceptions propagate

The simplest approach is to let exceptions raise naturally. MLflow automatically catches the exception and creates a Feedback object containing the error details:

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def is_valid_response(outputs: str) -> Feedback:
    import json

    # Let json.JSONDecodeError propagate if response isn't valid JSON
    data = json.loads(outputs)

    # Let KeyError propagate if required fields are missing
    summary = data["summary"]
    confidence = data["confidence"]

    return Feedback(
        value=True,
        rationale=f"Valid JSON with confidence: {confidence}"
    )

# Run the scorer on invalid data that triggers exceptions
invalid_data = [
    {
        # Valid JSON
        "outputs": '{"summary": "this is a summary", "confidence": 0.95}'
    },
    {
        # Invalid JSON
        "outputs": "invalid json",
    },
    {
        # Missing required fields
        "outputs": '{"summary": "this is a summary"}'
    },
]

mlflow.genai.evaluate(
    data=invalid_data,
    scorers=[is_valid_response],
)

When an exception occurs, MLflow creates a Feedback containing:

  • value: None
  • error: the exception details, such as the exception object, error message, and stack trace

The error information is displayed in the evaluation results. Open the corresponding row to view the error details.

Error details in evaluation results

Handle exceptions explicitly

For custom error handling, or to provide specific error messages, catch the exception and return a Feedback carrying the error details (its value will be None):

from mlflow.entities import AssessmentError, Feedback
from mlflow.genai.scorers import scorer

@scorer
def is_valid_response(outputs):
    import json

    try:
        data = json.loads(outputs)
        required_fields = ["summary", "confidence", "sources"]
        missing = [f for f in required_fields if f not in data]

        if missing:
            return Feedback(
                error=AssessmentError(
                    error_code="MISSING_REQUIRED_FIELDS",
                    error_message=f"Missing required fields: {missing}",
                ),
            )

        return Feedback(
            value=True,
            rationale="Valid JSON with all required fields"
        )

    except json.JSONDecodeError as e:
        return Feedback(error=e)  # Can pass exception object directly to the error parameter

The error parameter accepts:

  • Python exceptions: pass the exception object directly
  • AssessmentError: for structured error reporting with an error code

When expectations are available

Expectations (ground truth or labels) are often important for offline evaluation. You can specify them in two ways when running mlflow.genai.evaluate():

  • Include an expectations column (or field) in the data argument, as in the example below.
  • Associate Expectation objects with traces and pass those traces to the data argument (see the sketch after the example).
from typing import Any

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected = expectations.get("expected_response")
    is_correct = outputs == expected

    return Feedback(
        value=is_correct,
        rationale=f"Response {'matches' if is_correct else 'differs from'} expected"
    )

data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "Paris",
        # Specify expected response in the expectations field
        "expectations": {
            "expected_response": "Paris"
        }
    },
]

mlflow.genai.evaluate(
    data=data,
    scorers=[exact_match],
)
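
The second option, attaching expectations directly to traces, might look like the sketch below. It assumes the mlflow.log_expectation assessment API and a trace ID obtained from a previous run; check your MLflow version's API reference for the exact signature.

import mlflow

# Assumption: you already have the ID of a trace produced by your app
trace_id = "<your-trace-id>"

# Attach ground truth to the trace as an Expectation
# (mlflow.log_expectation is assumed here; verify availability in your MLflow version)
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="Paris",
)

# Traces retrieved afterwards carry the expectation, so scorers like exact_match
# above can read it from their expectations argument
traces = mlflow.search_traces(run_id="<your-run-id>")
mlflow.genai.evaluate(data=traces, scorers=[exact_match])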

Note

Production monitoring typically has no expectations, because you are evaluating live traffic without ground truth. If you intend to use the same scorer for both offline and online evaluation, design it to handle missing expectations gracefully, as in the sketch below.
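
The following is a minimal sketch of a scorer that degrades gracefully when no ground truth is attached; the expected_response field name is only an illustration.

from typing import Any, Optional

from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer

@scorer
def exact_match_or_skip(outputs: str, expectations: Optional[dict[str, Any]] = None) -> Feedback:
    # No ground truth available (e.g. live production traffic): skip instead of failing
    if not expectations or "expected_response" not in expectations:
        return Feedback(value=None, rationale="No expectations available; skipping check")

    is_correct = outputs == expectations["expected_response"]
    return Feedback(
        value=is_correct,
        rationale="Matches expected response" if is_correct else "Differs from expected response",
    )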

Using trace data

Scorers can access the complete trace to evaluate complex application behavior.

from mlflow.entities import Feedback, Trace
from mlflow.genai.scorers import scorer

@scorer
def tool_call_efficiency(trace: Trace) -> Feedback:
    """Evaluate how effectively the app uses tools"""
    # Retrieve all tool call spans from the trace
    tool_calls = trace.search_spans(span_type="TOOL")

    if not tool_calls:
        return Feedback(
            value=None,
            rationale="No tool usage to evaluate"
        )

    # Check for redundant calls
    tool_names = [span.name for span in tool_calls]
    if len(tool_names) != len(set(tool_names)):
        return Feedback(
            value=False,
            rationale=f"Redundant tool calls detected: {tool_names}"
        )

    # Check for errors
    failed_calls = [s for s in tool_calls if s.status.status_code != "OK"]
    if failed_calls:
        return Feedback(
            value=False,
            rationale=f"{len(failed_calls)} tool calls failed"
        )

    return Feedback(
        value=True,
        rationale=f"Efficient tool usage: {len(tool_calls)} successful calls"
    )

When running offline evaluation with mlflow.genai.evaluate(), traces are either:

  • specified in the data argument, if they are already available, or
  • generated by running predict_fn against the inputs in the data argument.

When running production monitoring, the traces collected by the monitor are passed directly to your scorer functions, subject to the sampling and filtering criteria you specify.

Scorer implementation approaches

MLflow provides two ways to implement scorers:

Decorator-based approach

Use the @scorer decorator for simple function-based scorers:

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def response_tone(outputs: str) -> Feedback:
    """Check if response maintains professional tone"""
    informal_phrases = ["hey", "gonna", "wanna", "lol", "btw"]
    found = [p for p in informal_phrases if p in outputs.lower()]

    if found:
        return Feedback(
            value=False,
            rationale=f"Informal language detected: {', '.join(found)}"
        )

    return Feedback(
        value=True,
        rationale="Professional tone maintained"
    )

Class-based approach

Use the Scorer base class for more complex scorers that need state. The Scorer class is a Pydantic object, so you can define additional fields and use them in the __call__ method. A usage sketch follows the example.

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional

# Scorer class is a Pydantic object
class ResponseQualityScorer(Scorer):
    # The `name` field is mandatory
    name: str = "response_quality"
    # Define additional fields
    min_length: int = 50
    required_sections: Optional[list[str]] = None

    # Override the __call__ method to implement the scorer logic
    def __call__(self, outputs: str) -> Feedback:
        issues = []

        # Check length
        if len(outputs.split()) < self.min_length:
            issues.append(f"Too short (minimum {self.min_length} words)")

        # Check required sections (if any were configured)
        missing = [s for s in (self.required_sections or []) if s not in outputs]
        if missing:
            issues.append(f"Missing sections: {', '.join(missing)}")

        if issues:
            return Feedback(
                value=False,
                rationale="; ".join(issues)
            )

        return Feedback(
            value=True,
            rationale="Response meets all quality criteria"
        )
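
A possible usage sketch: instantiate the class with custom settings and pass the instance to evaluate() like any other scorer. Here test_dataset and my_app are assumed to be defined elsewhere.

import mlflow

quality_scorer = ResponseQualityScorer(
    min_length=100,
    required_sections=["Summary", "Recommendations"],
)

mlflow.genai.evaluate(
    data=test_dataset,   # assumed evaluation dataset
    predict_fn=my_app,   # assumed app under evaluation
    scorers=[quality_scorer],
)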

Custom scorer development workflow

When developing custom scorers, you often want to iterate quickly without re-running your app each time. MLflow supports an efficient workflow:

  1. Generate traces once by running your app with mlflow.genai.evaluate()

  2. Store the traces using mlflow.search_traces()

  3. Pass the stored traces to evaluate() without re-running your app

This approach saves time and resources during scorer development:

import mlflow
from mlflow.genai.scorers import scorer

# Step 1: Generate traces with a placeholder scorer
initial_results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[lambda **kwargs: 1]  # Placeholder scorer
)

# Step 2: Store traces for reuse
traces = mlflow.search_traces(run_id=initial_results.run_id)

# Step 3: Iterate on your scorer without re-running the app
@scorer
def my_custom_scorer(outputs):
    # Your evaluation logic here
    pass

# Test scorer on stored traces (no predict_fn needed)
results = mlflow.genai.evaluate(
    data=traces,
    scorers=[my_custom_scorer]
)

Common gotchas

Scorer naming with the decorator

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# GOTCHA: Function name becomes feedback name for single returns
@scorer
def quality_check(outputs):
    # This 'name' parameter is IGNORED
    return Feedback(name="ignored", value=True)
    # Feedback will be named "quality_check"

# CORRECT: Use function name meaningfully
@scorer
def response_quality(outputs):
    return Feedback(value=True, rationale="Good quality")
    # Feedback will be named "response_quality"

# EXCEPTION: Multiple feedbacks preserve their names
@scorer
def multi_check(outputs):
    return [
        Feedback(name="grammar", value=True),      # Name preserved
        Feedback(name="spelling", value=True),     # Name preserved
        Feedback(name="clarity", value=0.9)        # Name preserved
    ]

State management in scorers

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

# WRONG: Don't use mutable class attributes
class BadScorer(Scorer):
    results = []  # Shared across all instances!

    def __call__(self, outputs, **kwargs):
        self.results.append(outputs)  # Causes issues
        return Feedback(value=True)

# CORRECT: Use instance attributes
class GoodScorer(Scorer):
    results: list[str] = None

    def __init__(self):
        super().__init__(name="good_scorer")
        self.results = []  # Per-instance state

    def __call__(self, outputs, **kwargs):
        self.results.append(outputs)  # Safe
        return Feedback(value=True)

Next steps