This page provides reference documentation for MLflow evaluation and monitoring concepts. For guides and tutorials, see Evaluate and monitor AI agents.
Tip
For MLflow 3 evaluation and monitoring API documentation, see the API reference.
Quick reference
| Concept | Purpose | Usage |
|---|---|---|
| Scorers | Evaluate trace quality | `@scorer` decorator or `Scorer` class |
| Judges | LLM-based assessment | Wrapped in scorers for use |
| Evaluation harness | Runs offline evaluation | `mlflow.genai.evaluate()` |
| Evaluation datasets | Test data management | `mlflow.genai.datasets` |
| Evaluation runs | Store evaluation results | Created by the harness |
| Production monitoring | Live quality tracking | `Scorer.register`, `Scorer.start` |
Scorers: mlflow.genai.scorers
```python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],        # App's input from trace
    outputs: Optional[Dict[Any, Any]],       # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace],  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")
```
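The custom scorer plugs into the evaluation harness like any built-in scorer. A minimal sketch, assuming the harness accepts a list of dicts with `inputs` and `expectations` keys and that `my_app` stands in for your application's predict function:

```python
# Hypothetical test data; the list-of-dicts shape with "inputs"/"expectations"
# keys is an assumption for illustration.
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is an open source MLOps platform"]},
    },
]

mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,          # hypothetical application entry point
    scorers=[my_custom_scorer], # the @scorer-decorated function defined above
)
```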
Judges
An LLM judge is an MLflow scorer that uses a large language model for quality assessment. While code-based scorers apply programmatic logic, judges use the reasoning capabilities of an LLM to evaluate criteria such as helpfulness, relevance, and safety.
```python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Initialize judges that will assess different quality aspects
safety_judge = Safety()               # Checks for harmful, toxic, or inappropriate content
relevance_judge = RelevanceToQuery()  # Checks if responses are relevant to user queries

# Run evaluation on your test dataset with multiple judges
mlflow.genai.evaluate(
    data=eval_data,     # Your test cases (inputs, outputs, optional ground truth)
    predict_fn=my_app,  # The application function you want to evaluate
    scorers=[safety_judge, relevance_judge],  # Both judges run on every test case
)
```
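Beyond the predefined judges, prompt-driven judges can encode your own criteria. A sketch assuming the built-in `Guidelines` judge is available and takes `name` and `guidelines` arguments:

```python
from mlflow.genai.scorers import Guidelines  # assumption: available as a built-in judge

# Judge that checks responses against a natural-language rule
tone_judge = Guidelines(
    name="polite_tone",
    guidelines="The response must be polite and must not contain sarcasm.",
)

mlflow.genai.evaluate(data=eval_data, predict_fn=my_app, scorers=[tone_judge])
```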
Evaluation harness: mlflow.genai.evaluate(...)
Orchestrates offline evaluation during development.
```python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,                       # Test data
    predict_fn=my_app,                       # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1",             # Optional version tracking
)
```
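A sketch of what `predict_fn` might look like, assuming the harness calls it with the fields of each row's `inputs` dict as keyword arguments, and that the returned results object exposes `run_id` and aggregated `metrics`:

```python
# Hypothetical application entry point; replace the body with a call to your agent or model.
def my_app(question: str) -> dict:
    return {"response": f"You asked: {question}"}

# Inspect the evaluation run (run_id is reused in the "Evaluation runs" section below).
print(results.run_id)
print(results.metrics)  # assumed: aggregated scorer values per metric
```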
Evaluation datasets: mlflow.genai.datasets.EvaluationDataset
Versioned test data with optional ground truth.
```python
import mlflow.genai.datasets

# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
```
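To reuse the same versioned data in later evaluations, the dataset can be loaded back from Unity Catalog; a sketch assuming `mlflow.genai.datasets.get_dataset()` is available:

```python
from mlflow.genai.scorers import Safety

# Assumption: get_dataset loads a previously created evaluation dataset by UC table name.
existing_dataset = mlflow.genai.datasets.get_dataset(
    uc_table_name="catalog.schema.eval_data"
)
results = mlflow.genai.evaluate(data=existing_dataset, predict_fn=my_app, scorers=[Safety()])
```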
Evaluation runs: mlflow.entities.Run
Evaluation results contain traces with attached feedback.
```python
import mlflow

# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]
```
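Because `mlflow.search_traces` returns a pandas DataFrame, the same `assessments` column can be rolled up into simple aggregate numbers; for example (same assumed assessment structure as above):

```python
# Fraction of traces where the Safety judge recorded a passing value
safety_pass_rate = traces['assessments'].apply(
    lambda assessments: any(a.name == 'Safety' and bool(a.value) for a in assessments)
).mean()
print(f"Safety pass rate: {safety_pass_rate:.0%}")
```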
Production monitoring
Important
This feature is in Beta.
Continuously evaluates deployed applications.
```python
import mlflow
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

# Register the scorer with a name and start monitoring
safety_judge = Safety().register(name="my_safety_judge")  # name must be unique within the experiment
safety_judge = safety_judge.start(sampling_config=ScorerSamplingConfig(sample_rate=0.7))
```
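Custom scorers defined with `@scorer` can be used for monitoring the same way; a sketch assuming they expose the same `register`/`start` methods as the built-in judges:

```python
# Assumption: a @scorer-decorated function yields a Scorer that supports register/start.
my_monitor = my_custom_scorer.register(name="my_custom_monitor")
my_monitor = my_monitor.start(sampling_config=ScorerSamplingConfig(sample_rate=0.2))
```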