Migrate to MLflow 3 from Agent Evaluation: Quick reference

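This quick reference summarizes the key changes when migrating from Agent Evaluation and MLflow 2 to the improved APIs in MLflow 3. For details, see the complete guide to migrating from Agent Evaluation to MLflow 3.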

Import updates

### Old imports ###
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges

from databricks.agents import review_app

### New imports ###
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
# For predefined scorers:
from mlflow.genai.scorers import (
    Correctness, Guidelines, ExpectationsGuidelines,
    RelevanceToQuery, Safety, RetrievalGroundedness,
    RetrievalRelevance, RetrievalSufficiency
)

import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

Evaluation function

MLflow 2.x	MLflow 3.x
`mlflow.evaluate()`	`mlflow.genai.evaluate()`
`model=my_agent`	`predict_fn=my_agent`
`model_type="databricks-agent"`	(not needed)
`extra_metrics=[...]`	`scorers=[...]`
`evaluator_config={...}`	(configured on the scorers themselves)
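The keyword renames in the table above are mechanical, so a migration script can apply them to an existing call. The following is a minimal sketch, not part of MLflow: `migrate_evaluate_kwargs`, `RENAMED_KWARGS`, and `DROPPED_KWARGS` are hypothetical names, and the helper deliberately refuses to translate `evaluator_config`, because that configuration has to be re-expressed as scorer objects by hand.

```python
from typing import Any

# Hypothetical migration helper; not part of MLflow.
# MLflow 2.x keyword -> mlflow.genai.evaluate() keyword, per the table above.
RENAMED_KWARGS = {"model": "predict_fn", "extra_metrics": "scorers"}
# Keywords that MLflow 3.x no longer needs.
DROPPED_KWARGS = {"model_type"}


def migrate_evaluate_kwargs(old_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Rename or drop mlflow.evaluate() kwargs for mlflow.genai.evaluate().

    Values are passed through unchanged: @metric functions listed under
    extra_metrics still have to be rewritten as scorers separately.
    """
    if "evaluator_config" in old_kwargs:
        # Judge configuration now lives on the scorer objects themselves,
        # so it cannot be translated by renaming a keyword.
        raise ValueError("move evaluator_config settings into scorer objects")
    return {
        RENAMED_KWARGS.get(key, key): value
        for key, value in old_kwargs.items()
        if key not in DROPPED_KWARGS
    }
```

For example, `{"data": rows, "model": my_agent, "model_type": "databricks-agent"}` becomes `{"data": rows, "predict_fn": my_agent}`, which can then be passed to `mlflow.genai.evaluate()` together with an explicit `scorers` list.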

Judge selection

MLflow 2.x	MLflow 3.x
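Automatically runs all applicable judges based on the data	You must explicitly specify which scorers to use
Use `evaluator_config` to limit the judges	Pass the desired scorers in the `scorers` parameter
`global_guidelines` in the config	Use the `Guidelines()` scorer
Judges selected based on the available data fields	You control exactly which scorers run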
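Because MLflow 3 no longer picks judges for you, the judges that MLflow 2 ran have to become an explicit `scorers` list. The sketch below assumes the Agent Evaluation config layout `{"databricks-agent": {"metrics": [...], "global_guidelines": ...}}` and returns scorer class names as strings, so it runs without MLflow installed. `scorers_from_evaluator_config` and `JUDGE_TO_SCORER` are hypothetical names, and the judge-name mapping is an assumption that mirrors the LLM judges table in this reference.

```python
from typing import Any

# Assumed mapping from Agent Evaluation built-in judge names to MLflow 3.x
# predefined scorer class names (see the LLM judges table in this reference).
# "guideline_adherence" is handled separately below.
JUDGE_TO_SCORER = {
    "correctness": "Correctness",
    "safety": "Safety",
    "groundedness": "RetrievalGroundedness",
    "relevance_to_query": "RelevanceToQuery",
    "chunk_relevance": "RetrievalRelevance",
    "context_sufficiency": "RetrievalSufficiency",
}


def scorers_from_evaluator_config(
    evaluator_config: dict[str, Any],
) -> list[tuple[str, dict[str, Any]]]:
    """Return (scorer_class_name, constructor_kwargs) pairs for an explicit scorers list."""
    agent_cfg = evaluator_config.get("databricks-agent", {})
    specs: list[tuple[str, dict[str, Any]]] = []
    for judge in agent_cfg.get("metrics", []):
        if judge == "guideline_adherence":
            # Global guidelines are converted below; per-row guidelines need an
            # ExpectationsGuidelines() scorer, which must be added by hand.
            continue
        if judge not in JUDGE_TO_SCORER:
            raise KeyError(f"no predefined scorer known for judge {judge!r}")
        specs.append((JUDGE_TO_SCORER[judge], {}))

    global_guidelines = agent_cfg.get("global_guidelines")
    if isinstance(global_guidelines, dict):
        # Each named group of guidelines becomes its own named Guidelines() scorer.
        for name, rules in global_guidelines.items():
            specs.append(("Guidelines", {"name": name, "guidelines": list(rules)}))
    elif global_guidelines:
        specs.append(("Guidelines", {"guidelines": list(global_guidelines)}))
    return specs
```

With MLflow 3 installed, each pair could then be instantiated with something like `getattr(mlflow.genai.scorers, name)(**kwargs)`, and the resulting objects passed as `scorers=[...]`.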

Data fields

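MLflow 2.x field	MLflow 3.x field	Description
`request`	`inputs`	Agent input
`response`	`outputs`	Agent output
`expected_response`	`expectations`	Ground truth
`retrieved_context`	Accessed through traces	Context taken from the trace
`guidelines`	Part of the scorer configuration	Moved to the scorer level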
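When converting an existing evaluation set, these field changes can be applied row by row. This is a minimal sketch that assumes rows are plain Python dicts. `convert_eval_row` is a hypothetical helper, and `input_key` is an assumption you must adapt: MLflow 3 passes the keys of `inputs` to `predict_fn` as keyword arguments, so they must match your agent's parameter names.

```python
from typing import Any


def convert_eval_row(old_row: dict[str, Any], input_key: str = "question") -> dict[str, Any]:
    """Convert an MLflow 2.x / Agent Evaluation row to the MLflow 3.x layout.

    input_key is an assumption: it must match the parameter name of your
    predict_fn, because the keys of "inputs" are passed to it as kwargs.
    """
    new_row: dict[str, Any] = {"inputs": {input_key: old_row["request"]}}
    if "response" in old_row:
        new_row["outputs"] = old_row["response"]

    expectations: dict[str, Any] = {}
    for key, value in old_row.items():
        if key.startswith("expected_"):
            # expected_response, expected_facts, ... keep their names as keys.
            expectations[key] = value
    # custom_expected entries are merged into the same dict.
    expectations.update(old_row.get("custom_expected", {}))
    if "guidelines" in old_row:
        # Per-row guidelines are read by the ExpectationsGuidelines() scorer.
        expectations["guidelines"] = old_row["guidelines"]
    if expectations:
        new_row["expectations"] = expectations

    # retrieved_context is intentionally dropped: MLflow 3.x scorers read
    # retrieval results from the trace instead.
    return new_row
```

Rows that already contain `outputs` can be evaluated without a `predict_fn`; rows without it need one so that MLflow can generate outputs and traces.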

Custom metrics and scorers

MLflow 2.x	MLflow 3.x	Notes
`@metric` decorator	`@scorer` decorator	New name
`def my_metric(request, response, ...)`	`def my_scorer(inputs, outputs, expectations, trace)`	Simplified
Multiple `expected_*` parameters	A single `expectations` parameter (a dict)	Consolidated
`custom_expected`	Part of the `expectations` dict	Simplified
`request` parameter	`inputs` parameter	Consistent naming
`response` parameter	`outputs` parameter	Consistent naming
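To make the parameter changes concrete, here is the same hypothetical exact-match check written both ways. The decorators are left as comments so the sketch runs without `databricks-agents` or MLflow installed; in real code the first function would carry `@metric` and the second `@scorer`. An MLflow 3 scorer only needs to declare the parameters it actually uses.

```python
from typing import Any, Optional


# MLflow 2.x / Agent Evaluation style.
# In real code: from databricks.agents.evals import metric, then decorate with @metric.
def exact_match_v2(request: Any, response: Any, expected_response: Optional[str] = None) -> bool:
    return response == expected_response


# MLflow 3.x style.
# In real code: from mlflow.genai.scorers import scorer, then decorate with @scorer.
def exact_match_v3(outputs: Any, expectations: dict[str, Any]) -> bool:
    # Former expected_* arguments are now keys of the expectations dict.
    return outputs == expectations.get("expected_response")
```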

Accessing results

MLflow 2.x	MLflow 3.x
`results.tables['eval_results']`	`mlflow.search_traces(run_id=results.run_id)`
Direct DataFrame access	Iterate over traces and their assessments

LLM judges

Use case	MLflow 2.x	Recommended in MLflow 3.x
Basic correctness check	`judges.correctness()` inside `@metric`	`Correctness()` scorer or `judges.is_correct()` judge
Safety evaluation	`judges.safety()` inside `@metric`	`Safety()` scorer or `judges.is_safe()` judge
Global guidelines	`judges.guideline_adherence()`	`Guidelines()` scorer or `judges.meets_guidelines()` judge
Per-row guidelines (per evaluation set row)	`judges.guideline_adherence()` with `expected_*`	`ExpectationsGuidelines()` scorer or `judges.meets_guidelines()` judge
Check groundedness	`judges.groundedness()`	`judges.is_grounded()` or `RetrievalGroundedness()` scorer
Check relevance of the response to the query	`judges.relevance_to_query()`	`judges.is_context_relevant()` or `RelevanceToQuery()` scorer
Check relevance of retrieved chunks	`judges.chunk_relevance()`	`judges.is_context_relevant()` or `RetrievalRelevance()` scorer
Check context sufficiency	`judges.context_sufficiency()`	`judges.is_context_sufficient()` or `RetrievalSufficiency()` scorer
Complex custom logic	Direct judge calls inside `@metric`	Predefined scorers, or `@scorer` with judge calls

Human feedback

MLflow 2.x	MLflow 3.x
`databricks.agents.review_app`	`mlflow.genai.labeling`
`databricks.agents.datasets`	`mlflow.genai.datasets`
`review_app.label_schemas.*`	`mlflow.genai.label_schemas.*`
`app.create_labeling_session()`	`labeling.create_labeling_session()`

Common migration commands

# Find old evaluate calls and imports
grep -rE "mlflow\.evaluate|from mlflow import evaluate" . --include="*.py"

# Find old metric decorators
grep -r "@metric" . --include="*.py"

# Find old data fields
grep -r '"request":\|"response":\|"expected_response":' . --include="*.py"

# Find old imports
grep -r "databricks.agents" . --include="*.py"
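On systems without `grep`, or when you also want a suggested replacement for each hit, the same search can be scripted. This is a hypothetical, standard-library-only sketch: `HINTS`, `scan_source`, and `scan_tree` are illustrative names, and the hints restate this quick reference rather than anything generated by MLflow.

```python
import re
from pathlib import Path

# Hypothetical scanner; the hints restate this quick reference and are not
# produced by MLflow. Maps a regex for an old API to a suggested replacement.
HINTS = {
    r"\bmlflow\.evaluate\(|from mlflow import evaluate": "mlflow.genai.evaluate()",
    r"@metric\b": "@scorer from mlflow.genai.scorers",
    r"databricks\.agents\.evals": "mlflow.genai.scorers / mlflow.genai.judges",
    r"judges\.correctness\(": "Correctness() or judges.is_correct()",
    r"judges\.safety\(": "Safety() or judges.is_safe()",
    r"judges\.guideline_adherence\(": "Guidelines() / ExpectationsGuidelines() or judges.meets_guidelines()",
    r"judges\.groundedness\(": "RetrievalGroundedness() or judges.is_grounded()",
    r"judges\.relevance_to_query\(": "RelevanceToQuery() or judges.is_context_relevant()",
    r"judges\.chunk_relevance\(": "RetrievalRelevance() or judges.is_context_relevant()",
    r"judges\.context_sufficiency\(": "RetrievalSufficiency() or judges.is_context_sufficient()",
    r"databricks\.agents\.review_app|from databricks\.agents import review_app": "mlflow.genai.labeling",
    r"databricks\.agents\.datasets": "mlflow.genai.datasets",
    r'"(request|response|expected_response)":': "inputs / outputs / expectations",
}


def scan_source(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, matched_text, hint) for every old-API match in text."""
    hits: list[tuple[int, str, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, hint in HINTS.items():
            match = re.search(pattern, line)
            if match:
                hits.append((lineno, match.group(0), hint))
    return hits


def scan_tree(root: str = ".") -> None:
    """Print hits for every *.py file under root, like the grep commands."""
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        for lineno, found, hint in scan_source(source):
            print(f"{path}:{lineno}: {found!r} -> {hint}")
```

Running `scan_tree(".")` from the project root prints one line per hit in the form `path:line: match -> hint`.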

Additional resources

For additional help during migration, see the MLflow documentation or contact Databricks support.

Last updated on 2025-09-25
