
Evaluation concepts overview

MLflow's GenAI evaluation concepts: scorers, judges, evaluation datasets, and the systems that use them.

Quick reference

Concept | Purpose | Usage
--- | --- | ---
Scorers | Evaluate trace quality | @scorer decorator or Scorer class
Judges | LLM-based assessment | Wrapped in scorers for use
Evaluation harness | Runs offline evaluation | mlflow.genai.evaluate()
Evaluation datasets | Test data management | mlflow.genai.datasets
Evaluation runs | Store evaluation results | Created by the harness
Production monitoring | Live quality tracking | Scorer.register and Scorer.start

Common patterns

Combining multiple scorers

import mlflow
from mlflow.genai.scorers import scorer, Safety, RelevanceToQuery, ScorerSamplingConfig
from mlflow.entities import Feedback

# Combine predefined and custom scorers
@scorer
def custom_business_scorer(outputs):
    response = outputs.get("response", "")
    # Your business logic
    if "company_name" not in response:
        return Feedback(value=False, rationale="Missing company branding")
    return Feedback(value=True, rationale="Meets business criteria")

# Use same scorers everywhere
scorers = [Safety(), RelevanceToQuery(), custom_business_scorer]

# Offline evaluation
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=scorers
)

# Production monitoring - same scorers!
registered_scorers = [s.register() for s in scorers]
registered_scorers = [
    reg_scorer.start(
        sampling_config=ScorerSamplingConfig(sample_rate=0.1)
    )
    for reg_scorer in registered_scorers
]

Chaining evaluation results

import mlflow
import pandas as pd
from mlflow.genai.scorers import Safety, Correctness

# Run initial evaluation
results1 = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[Safety(), Correctness()]
)

# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)

# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
    lambda x: any(a.name == 'Safety' and a.value == 'no' for a in x)
)]

# Re-evaluate with different scorers or updated app
from mlflow.genai.scorers import Guidelines

results2 = mlflow.genai.evaluate(
    data=safety_failures,
    predict_fn=updated_app,
    scorers=[
        Safety(),
        Guidelines(
            name="content_policy",
            guidelines="Response must follow our content policy"
        )
    ]
)

Error handling in evaluation

import mlflow
from mlflow.genai.scorers import scorer, Safety
from mlflow.entities import Feedback, AssessmentError

@scorer
def resilient_scorer(outputs, trace=None):
    try:
        response = outputs.get("response")
        if not response:
            return Feedback(
                value=None,
                error=AssessmentError(
                    error_code="MISSING_RESPONSE",
                    error_message="No response field in outputs"
                )
            )
        # Your evaluation logic
        return Feedback(value=True, rationale="Valid response")
    except Exception as e:
        # Let MLflow handle the error gracefully
        raise

# Use in evaluation - continues even if some scorers fail
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_app,
    scorers=[resilient_scorer, Safety()]
)

Concepts

Scorers: mlflow.genai.scorers

Functions that evaluate traces and return feedback.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],  # App's input from trace
    outputs: Optional[Dict[Any, Any]],  # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace]  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")

Learn more about scorers

Judges: mlflow.genai.judges

LLM-based quality assessors that must be wrapped in scorers to be used.

from mlflow.genai.judges import is_safe
from mlflow.genai.scorers import scorer

# Direct usage
feedback = is_safe(content="Hello world")

# Wrapped in scorer
@scorer
def safety_scorer(outputs):
    return is_safe(content=outputs["response"])

Learn more about judges

Evaluation harness: mlflow.genai.evaluate(...)

Orchestrates offline evaluation during development.

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,  # Test data
    predict_fn=my_app,  # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1"  # Optional version tracking
)

Learn more about the evaluation harness

Evaluation datasets: mlflow.genai.datasets.EvaluationDataset

Versioned test data with optional ground truth.

import mlflow.genai.datasets

# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)

Learn more about evaluation datasets

Evaluation runs: mlflow.entities.Run

Evaluation results that contain traces with attached feedback.

# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]

Learn more about evaluation runs

Production monitoring

Continuously evaluates deployed applications.

import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer with a name and start monitoring
safety_scorer = Safety().register(name="my_safety_scorer")  # name must be unique within the experiment
safety_scorer = safety_scorer.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))

Learn more about production monitoring

Workflows

Online monitoring (production)

# Production app with tracing → Monitor applies scorers → Feedback on traces → Dashboards

Online monitoring for production
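A minimal sketch of this flow, assuming a hypothetical traced app named my_app; it reuses the Safety scorer, register/start calls, and ScorerSamplingConfig shown earlier:

import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Production app instrumented with tracing (hypothetical example)
@mlflow.trace
def my_app(question: str) -> dict:
    # ... call your model or agent here ...
    return {"response": "answer"}

# Register a scorer and start monitoring; sampled live traces get feedback attached
safety_monitor = Safety().register(name="prod_safety")
safety_monitor = safety_monitor.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.2)
)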

Offline evaluation (development)

# Test data → Evaluation harness runs app → Scorers evaluate traces → Results stored

Offline evaluation
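A minimal end-to-end sketch of this flow, assuming a hypothetical my_app function and inline test records; each record's inputs dictionary is passed to predict_fn as keyword arguments:

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Hypothetical test data: a list of records with an "inputs" field
eval_data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "How do I log a trace?"}},
]

# Hypothetical app under test
def my_app(question: str) -> dict:
    return {"response": f"You asked: {question}"}

# The harness runs the app on each record, scorers evaluate the resulting traces,
# and the results are stored in an MLflow evaluation run
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Safety(), RelevanceToQuery()],
)

# Inspect the stored traces and their feedback
traces = mlflow.search_traces(run_id=results.run_id)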

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation on related concepts.