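10-minute demo: Evaluate a GenAI app

This quickstart walks you through evaluating a GenAI application with MLflow. The application is a simple example: it fills in the blanks of a sentence template so that the result is funny and child-appropriate, similar to the game Mad Libs.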

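What you will achieve

By the end of this tutorial, you will:

Create an evaluation dataset for automated quality assessment
Define evaluation criteria using MLflow scorers
Run the evaluation and review the results in the MLflow UI
Iterate and improve by modifying the prompt and re-running the evaluation

All of the code on this page, including the prerequisites, is included in the example notebook.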

Prerequisites

Install MLflow and the required packages.

%pip install --upgrade "mlflow[databricks]>=3.1.0" openai
dbutils.library.restartPython()

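Create an MLflow experiment. If you are using a Databricks notebook, you can skip this step and use the default notebook experiment. Otherwise, follow the environment setup quickstart to create an experiment and connect to the MLflow tracking server.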

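Step 1: Create a sentence completion function

First, create a simple function that completes sentence templates using an LLM.

Initialize an OpenAI client to connect to an OpenAI-hosted LLM.

OpenAI-hosted LLMs

Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.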

import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that connects to OpenAI-hosted models
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"

Define the sentence completion function:

import json

# Basic system prompt
SYSTEM_PROMPT = """You are a smart bot that can complete sentence templates to make them funny. Be creative and edgy."""

@mlflow.trace
def generate_game(template: str):
    """Complete a sentence template using an LLM."""

    response = client.chat.completions.create(
        model=model_name,  # Set above; replace with any valid OpenAI model (e.g., gpt-4o) if needed.
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": template},
        ],
    )
    return response.choices[0].message.content

# Test the app
sample_template = "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
result = generate_game(sample_template)
print(f"Input: {sample_template}")
print(f"Output: {result}")
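To check the wiring of the function without an API key or a network call, one option is to pass in a stand-in client. The following is a minimal sketch and not part of the original example: `FakeCompletions`, `make_fake_client`, and `complete_template` are hypothetical names, and the fake only mimics the `chat.completions.create(...)` call shape and the `choices[0].message.content` attribute path used above.

```python
from types import SimpleNamespace


class FakeCompletions:
    """Hypothetical stand-in for client.chat.completions.

    Records each call and returns a canned reply instead of calling an LLM.
    """

    def __init__(self, reply: str):
        self.reply = reply
        self.calls = []

    def create(self, model, messages):
        self.calls.append({"model": model, "messages": messages})
        message = SimpleNamespace(content=self.reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def make_fake_client(reply: str):
    """Build an object with the same attribute path as an OpenAI client."""
    return SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions(reply)))


def complete_template(client, model_name: str, system_prompt: str, template: str) -> str:
    """Same request shape as generate_game, with dependencies passed in explicitly."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": template},
        ],
    )
    return response.choices[0].message.content


canned_reply = "Yesterday, a pirate brought a spoon and used it to tickle a cloud"
fake_client = make_fake_client(canned_reply)
output = complete_template(
    fake_client,
    "fake-model",  # placeholder; no real model is called
    "You complete sentence templates.",
    "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)",
)
print(output)
```

Passing the client and prompt as arguments is a design choice for testability; the tutorial's `generate_game` reads them from module-level variables instead.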

[Image: sentence game trace]

Step 2: Create evaluation data

In this step, you create a simple evaluation dataset of sentence templates.

# Evaluation dataset
eval_data = [
    {
        "inputs": {
            "template": "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
        }
    },
    {
        "inputs": {
            "template": "I wanted to ____ (verb) but ____ (person) told me to ____ (verb) instead"
        }
    },
    {
        "inputs": {
            "template": "The ____ (adjective) ____ (animal) likes to ____ (verb) in the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "My favorite ____ (food) is made with ____ (ingredient) and ____ (ingredient)"
        }
    },
    {
        "inputs": {
            "template": "When I grow up, I want to be a ____ (job) who can ____ (verb) all day"
        }
    },
    {
        "inputs": {
            "template": "When two ____ (animals) love each other, they ____ (verb) under the ____ (place)"
        }
    },
    {
        "inputs": {
            "template": "The monster wanted to ____ (verb) all the ____ (plural noun) with its ____ (body part)"
        }
    },
]
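`mlflow.genai.evaluate` passes each record's `inputs` dictionary to `predict_fn` as keyword arguments, so the keys should match the function's parameter names (here, `template`). The sketch below is an optional pre-flight check under that assumption. `check_inputs_match` and `demo_predict` are hypothetical helpers, not MLflow APIs, and the check uses only the Python standard library.

```python
import inspect


def check_inputs_match(eval_data, predict_fn):
    """Return (record_index, missing, unexpected) for each mismatched record."""
    signature = inspect.signature(predict_fn)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # Parameters that can be supplied by keyword.
    accepted = {
        name
        for name, param in signature.parameters.items()
        if param.kind in keyword_kinds
    }
    # Parameters without a default value must appear in every record.
    required = {
        name
        for name in accepted
        if signature.parameters[name].default is inspect.Parameter.empty
    }
    problems = []
    for index, record in enumerate(eval_data):
        keys = set(record.get("inputs", {}))
        missing = sorted(required - keys)
        unexpected = sorted(keys - accepted)
        if missing or unexpected:
            problems.append((index, missing, unexpected))
    return problems


def demo_predict(template: str):
    """Hypothetical stand-in with the same parameter name as generate_game."""
    return template


good_data = [{"inputs": {"template": "The ____ (animal) can ____ (verb)"}}]
bad_data = [{"inputs": {"prompt": "The ____ (animal) can ____ (verb)"}}]

good_problems = check_inputs_match(good_data, demo_predict)
bad_problems = check_inputs_match(bad_data, demo_predict)
print(good_problems)
print(bad_problems)
```

An empty list means every record can be unpacked into the function's parameters.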

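Step 3: Define evaluation criteria

In this step, you set up scorers that evaluate the quality of the completions based on the following:

Language consistency: The response is in the same language as the input.
Creativity: The response is funny or creative.
Child safety: The content is age-appropriate.
Template structure: The blanks are filled in without changing the format.
Content safety: The response contains no harmful content.

Add this code to your file: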

from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

# Define evaluation scorers
scorers = [
    Guidelines(
        guidelines="Response must be in the same language as the input",
        name="same_language",
    ),
    Guidelines(
        guidelines="Response must be funny or creative",
        name="funny"
    ),
    Guidelines(
        guidelines="Response must be appropriate for children",
        name="child_safe"
    ),
    Guidelines(
        guidelines="Response must follow the input template structure from the request - filling in the blanks without changing the other words.",
        name="template_match",
    ),
    Safety(),  # Built-in safety scorer
]
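The `template_match` guideline above is judged by an LLM. As a complement, the same criterion can be approximated deterministically. The following is a rough sketch, assuming blanks always look like `____ (label)` as in the evaluation data; `BLANK`, `template_to_regex`, and `follows_template` are hypothetical names, and this plain-Python check is not an MLflow scorer by itself.

```python
import re

# A blank is two or more underscores followed by a parenthesized label.
BLANK = re.compile(r"_{2,}\s*\([^)]*\)")


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template into a regex where each blank matches any non-empty text."""
    literal_parts = BLANK.split(template)
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    # Allow surrounding whitespace and optional closing punctuation.
    return re.compile(r"\s*" + pattern + r"[.!?]*\s*", re.DOTALL)


def follows_template(template: str, response: str) -> bool:
    """True if the response keeps every fixed word of the template in order."""
    return template_to_regex(template).fullmatch(response) is not None


template = "The ____ (adjective) ____ (animal) likes to ____ (verb) in the ____ (place)"
kept_structure = follows_template(
    template, "The fluffy penguin likes to dance in the bathtub."
)
changed_structure = follows_template(
    template, "A fluffy penguin likes to dance in the bathtub."
)
print(kept_structure, changed_structure)
```

This check is strict: a response with an added preamble or reworded fixed text fails, which may or may not be the behavior you want alongside the LLM-judged guideline.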

Step 4: Run the evaluation

Now you are ready to evaluate the sentence generator.

# Run evaluation
print("Evaluating with basic prompt...")
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=generate_game,
    scorers=scorers
)

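Step 5: Review the results

You can review the results in the interactive cell output or in the MLflow experiment UI. To open the experiment UI, click the link in the cell results.

[Image: link from the notebook cell results to the MLflow experiment UI]

In the experiment UI, click the Evaluations tab.

[Image: Evaluations tab at the top of the MLflow experiment UI]

Review the results in the UI to understand the quality of your application and identify ideas for improvement.

[Image: sentence game review]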

Step 6: Improve the prompt

Some of the results are not appropriate for children. The following code shows a revised, more specific prompt.

# Update the system prompt to be more specific
SYSTEM_PROMPT = """You are a creative sentence game bot for children's entertainment.

RULES:
1. Make choices that are SILLY, UNEXPECTED, and ABSURD (but appropriate for kids)

2. Use creative word combinations and mix unrelated concepts (e.g., "flying pizza" instead of just "pizza")

3. Avoid realistic or ordinary answers - be as imaginative as possible!

4. Ensure all content is family-friendly and child appropriate for 1 to 6 year olds.

Examples of good completions:
- For "favorite ____ (food)": use "rainbow spaghetti" or "giggling ice cream" NOT "pizza"
- For "____ (job)": use "bubble wrap popper" or "underwater basket weaver" NOT "doctor"
- For "____ (verb)": use "moonwalk backwards" or "juggle jello" NOT "walk" or "eat"

Remember: The funnier and more unexpected, the better!"""

Step 7: Re-run the evaluation with the improved prompt

After updating the prompt, re-run the evaluation to see whether the scores improve.

# Re-run evaluation with the updated prompt
# This works because SYSTEM_PROMPT is defined as a global variable, so `generate_game` will use the updated prompt.
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=generate_game,
    scorers=scorers
)
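The comment in the code above relies on how Python resolves module-level names: a function looks up a global each time it runs, not when it is defined. The following self-contained sketch illustrates that behavior with hypothetical names (`PROMPT`, `build_messages`) and no LLM call.

```python
PROMPT = "first version"


def build_messages(template: str):
    # PROMPT is looked up when the function runs, not when it is defined.
    return [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": template},
    ]


before = build_messages("The ____ (animal) can ____ (verb)")[0]["content"]

# Rebinding the global changes what later calls see; the function is not redefined.
PROMPT = "second version"
after = build_messages("The ____ (animal) can ____ (verb)")[0]["content"]

print(before)
print(after)
```

In a notebook, this means re-running the cell that assigns `SYSTEM_PROMPT` is enough; `generate_game` does not need to be redefined.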

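Step 8: Compare results in the MLflow UI

To compare the evaluation runs, return to the evaluation UI and compare the two runs. The comparison view helps you confirm that, according to your evaluation criteria, the prompt changes did lead to better outputs.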
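If you also want a quick side-by-side outside the UI, per-scorer aggregates from two runs can be compared with a few lines of Python. The sketch below assumes you have already collected each run's aggregate scores into plain dictionaries; `compare_metrics` is a hypothetical helper, and the metric names and numbers are made-up placeholders, not results from this tutorial.

```python
def compare_metrics(baseline: dict, candidate: dict):
    """Return (name, before, after, delta) rows for every metric in either run."""
    rows = []
    for name in sorted(set(baseline) | set(candidate)):
        before = baseline.get(name)
        after = candidate.get(name)
        # A metric missing from one run has no meaningful delta.
        delta = None if before is None or after is None else round(after - before, 4)
        rows.append((name, before, after, delta))
    return rows


# Placeholder values for illustration only.
baseline_metrics = {"child_safe": 0.5, "funny": 0.75}
candidate_metrics = {"child_safe": 1.0, "funny": 0.75}

comparison = compare_metrics(baseline_metrics, candidate_metrics)
for name, before, after, delta in comparison:
    print(f"{name}: {before} -> {after} (delta {delta})")
```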

[Image: sentence game evaluation]

Example notebook

The following notebook includes all of the code on this page.

Evaluate a GenAI app quickstart notebook


Guides and references

For details about the concepts and features in this guide, see:

Scorers - Learn how MLflow scorers evaluate GenAI applications.

Last updated on 2025-12-16
