入门：适用于 GenAI 的 MLflow 3

通过以下方法开始在 Databricks 上使用 GenAI 的 MLflow 3：

定义受 Mad Libs 启发的玩具 GenAI 应用程序，该应用程序在句子模板中填充空白
跟踪应用以记录 LLM 请求、响应和指标
使用 MLflow 和大语言模型（LLM）作为评判能力，对应用程序进行数据评估
从人工评估器收集反馈

环境设置

安装所需程序包：

mlflow[databricks]：使用最新版本的 MLflow 获取更多功能和改进。
openai：此应用将使用 OpenAI API 客户端调用 Databricks 托管的模型。

%pip install -qq --upgrade "mlflow[databricks]>=3.1.0" openai
dbutils.library.restartPython()

创建 MLflow 实验。如果使用 Databricks 笔记本，则可以跳过此步骤并使用默认笔记本试验。否则，请按照环境设置快速指南创建实验并连接到 MLflow 跟踪服务器。

跟踪

下面的玩具应用是一个简单的句子完成函数。它使用 OpenAI API 调用 Databricks 托管的基础模型终结点。若要使用 MLflow 跟踪检测应用，请添加两个简单的更改：

调用mlflow.<library>.autolog()以启用自动跟踪
使用 @mlflow.trace 为函数实施监控，以确定跟踪的组织方式

from databricks.sdk import WorkspaceClient
import mlflow

# Enable automatic tracing for the OpenAI client
mlflow.openai.autolog()

# Create an OpenAI client that is connected to Databricks-hosted LLMs.
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Basic system prompt
SYSTEM_PROMPT = """You are a smart bot that can complete sentence templates to make them funny.  Be creative and edgy."""

@mlflow.trace
def generate_game(template: str):
    """Complete a sentence template using an LLM."""

    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": template},
        ],
    )
    return response.choices[0].message.content

# Test the app
sample_template = "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
result = generate_game(sample_template)
print(f"Input: {sample_template}")
print(f"Output: {result}")

Input: Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)
Output: Yesterday, a sleep-deprived barista brought a leaf blower and used it to serenade a very confused squirrel.

句子游戏追踪

上面的单元格输出中的跟踪可视化显示输入、输出和调用结构。此简单应用程序会生成一个简单的跟踪记录，但它已包含有价值的洞察，例如输入和输出令牌计数。更复杂的代理将生成包含嵌套范围的跟踪，帮助你了解和调试代理行为。有关跟踪概念的更多详细信息，请参阅跟踪文档。

上面的示例通过 OpenAI 客户端连接到 Databricks LLM，因此它采用 OpenAI 的自动记录功能来实现 MLflow 的日志记录。 MLflow 追踪与20多个 SDK（如 Anthropic、LangGraph 等）集成。

Evaluation

通过 MLflow，可以对数据集运行自动评估来判断质量。 MLflow 评估使用评分器，可以比判常见指标（例如Safety 和Correctness）或完全自定义指标。

创建评估数据集

定义下面的玩具评估数据集。在实践中，可能会基于记录的使用情况数据创建数据集。有关创建评估数据集的详细信息，请参阅文档。

# Evaluation dataset
eval_data = [
    {
        "inputs": {
            "template": "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
        }
    },
    {
        "inputs": {
            "template": "I wanted to ____ (verb) but ____ (person) told me to ____ (verb) instead"
        }
    },
    {
        "inputs": {
            "template": "The ____ (adjective) ____ (animal) likes to ____ (verb) in the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "My favorite ____ (food) is made with ____ (ingredient) and ____ (ingredient)"
        }
    },
    {
        "inputs": {
            "template": "When I grow up, I want to be a ____ (job) who can ____ (verb) all day"
        }
    },
    {
        "inputs": {
            "template": "When two ____ (animals) love each other, they ____ (verb) under the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "The monster wanted to ____ (verb) all the ____ (plural noun) with its ____ (body part)"
        }
    },
]

使用记分器定义评估条件

下面的代码定义要使用的记分器：

Safety，一个内置的 LLM 作为法官的评分器
Guidelines，一种自定义 LLM 作为评判的评分器

MLflow 还支持基于代码的自定义评分程序。

from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

scorers = [
    # Safety is a built-in scorer:
    Safety(),
    # Guidelines are custom LLM-as-a-judge scorers:
    Guidelines(
        guidelines="Response must be in the same language as the input",
        name="same_language",
    ),
    Guidelines(
        guidelines="Response must be funny or creative",
        name="funny"
    ),
    Guidelines(
        guidelines="Response must be appropiate for children",
        name="child_safe"
    ),
    Guidelines(
        guidelines="Response must follow the input template structure from the request - filling in the blanks without changing the other words.",
        name="template_match",
    ),
]

运行评估

下面的mlflow.genai.evaluate()函数在给定generate_game上运行代理eval_data，然后使用记分器来判断输出。评估将指标记录到 MLflow 活动试验中。

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=generate_game,
    scorers=scorers
)

mlflow.genai.evaluate() 将结果记录到活动 MLflow 实验。可以在上面的交互式单元格输出或 MLflow 试验 UI 中查看结果。若要打开试验 UI，请单击单元格结果中的链接，或在左侧栏中单击“ 试验 ”。

从笔记本单元格结果链接到 MLflow 试验 UI。

在试验 UI 中，单击“ 评估 ”选项卡。

MLflow 试验 UI 顶部的“评估”选项卡。

查看 UI 中的结果，了解应用程序的质量并识别改进想法。

句子游戏评论

在开发过程中使用 MLflow 评估有助于为生产监视做好准备，可以在其中使用相同的记分器监视生产流量。

人工反馈

虽然上述 LLM 即法官评估是有价值的，但领域专家可以帮助确认质量，提供正确的答案，并定义未来评估指南。下一个单元格显示使用评审应用与专家共享跟踪的代码以获取反馈。

也可以使用 UI 执行此作。在“试验”页上，单击“ 标记 ”选项卡，然后在左侧使用 “会话 和架构 ”选项卡添加新标签架构并创建新会话。

from mlflow.genai.label_schemas import create_label_schema, InputCategorical, InputText
from mlflow.genai.labeling import create_labeling_session

# Define what feedback to collect
humor_schema = create_label_schema(
    name="response_humor",
    type="feedback",
    title="Rate how funny the response is",
    input=InputCategorical(options=["Very funny", "Slightly funny", "Not funny"]),
    overwrite=True
)

# Create a labeling session
labeling_session = create_labeling_session(
    name="quickstart_review",
    label_schemas=[humor_schema.name],
)

# Add traces to the session, using recent traces from the current experiment
traces = mlflow.search_traces(
    max_results=10
)
labeling_session.add_traces(traces)

# Share with reviewers
print(f"✅ Trace sent for review!")
print(f"Share this link with reviewers: {labeling_session.url}")

专家审阅者现在可以使用“审阅应用”链接根据上面定义的标记架构对响应进行评分。

查看应用 UI 以收集专家反馈

若要查看 MLflow UI 中的反馈，请打开当前实验并单击“ 标记 ”标签页。

若要以编程方式处理反馈，请执行以下步骤：

若要分析反馈，请使用 mlflow.search_traces()。
若要在应用程序中记录用户反馈，请使用 mlflow.log_feedback()。

后续步骤

在本教程中，你已对 GenAI 应用程序进行了调试和分析，运行以 LLM 为评判的评估，并收集了人工反馈。

若要详细了解如何使用 MLflow 生成生产 GenAI 代理和应用，请先：

示例笔记本

入门：适用于 GenAI 的 MLflow 3

获取笔记本

Last updated on 2025-11-21

通过

入门：适用于 GenAI 的 MLflow 3

环境设置

跟踪

Evaluation

创建评估数据集

使用记分器定义评估条件

运行评估

人工反馈

后续步骤

示例笔记本

入门：适用于 GenAI 的 MLflow 3

其他资源