Tutorial: Gmail email sender operator

Important

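This feature is in Public Preview.

This tutorial walks through creating a python-run-function operator for Lakeflow Designer that sends the contents of a DataFrame as a CSV attachment through Gmail. Use this example to learn how to build YAML-based operators that perform side effects, such as sending notifications or writing to external systems. To learn more, see User-defined operators in Lakeflow Designer.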

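Requirements

  • An Azure Databricks workspace where you have permission to create secret scopes.
  • A Gmail account with a Google app password (required when multi-factor authentication (MFA) is enabled).
  • The Databricks CLI installed on your local development machine.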

Step 1: Set up secrets

Store your Gmail credentials in an Azure Databricks secret scope so that the operator can retrieve them at runtime.

  1. Create a secret scope using the Databricks CLI:

    databricks secrets create-scope my_email_scope
    
  2. Store your Gmail app password in that scope:

    databricks secrets put-secret my_email_scope gmail_app_password
    

    You are prompted to enter the secret value. Paste your Gmail app password and save.

Step 2: Write the run() function

The python-run-function operator type requires a run() function with the following signature:

def run(config: Dict[str, Any], inputs: Dict[str, Any], spark) -> Dict[str, Any]:

  • config: Configuration values that the user provides in the Lakeflow Designer UI.
  • inputs: Input DataFrames, keyed by port name.
  • spark: The active Spark session.

The function must return a dictionary of output DataFrames, keyed by output port name.
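
As a minimal sketch of this contract, the following pass-through function returns its input unchanged under the same port name. The port name data and the plain object used in place of a DataFrame are assumptions for illustration; the function never touches spark, so it can be exercised without a cluster.

```python
from typing import Any, Dict


def run(config: Dict[str, Any], inputs: Dict[str, Any], spark) -> Dict[str, Any]:
    # Read the DataFrame that arrives on the input port named "data" (assumed name).
    input_df = inputs["data"]
    # Return it keyed by the output port name, also "data" in this sketch.
    return {"data": input_df}


# Illustration only: a plain object stands in for a Spark DataFrame.
placeholder_df = object()
outputs = run(config={}, inputs={"data": placeholder_df}, spark=None)
```

The full operator in this tutorial keeps the same shape and adds the email side effect before the return statement.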

Define and test the function in a notebook cell:

from typing import Dict, Any

def run(config: Dict[str, Any], inputs: Dict[str, Any], spark) -> Dict[str, Any]:
    input_df = inputs["data"]

    # Skip side effects during Designer preview
    if config.get("is_preview", False):
        return {"data": input_df}

    import smtplib
    import os
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText
    from email.mime.base import MIMEBase
    from email import encoders

    sender_email = config.get("sender_email", "")
    secret_scope = config.get("secret_scope", "")
    secret_key = config.get("secret_key", "")
    recipients_raw = config.get("recipients", "")
    subject = config.get("subject", "")
    body = config.get("body", "")

    if not sender_email:
        raise ValueError("Sender Email is required.")
    if not secret_scope or not secret_key:
        raise ValueError("Secret Scope and Secret Key are required.")
    if not recipients_raw:
        raise ValueError("At least one recipient is required.")

    recipients = [r.strip() for r in recipients_raw.split(",") if r.strip()]
    if not recipients:
        raise ValueError("At least one valid recipient email is required.")

    # Retrieve password from Databricks secrets
    from pyspark.dbutils import DBUtils
    dbutils = DBUtils(spark)
    sender_password = dbutils.secrets.get(scope=secret_scope, key=secret_key)

    # Convert DataFrame to CSV
    pdf = input_df.toPandas()
    file_path = "/tmp/designer_email_attachment.csv"
    pdf.to_csv(file_path, index=False)

    # Send email to each recipient
    for recipient in recipients:
        msg = MIMEMultipart()
        msg["From"] = sender_email
        msg["To"] = recipient
        msg["Subject"] = subject
        msg.attach(MIMEText(body, "plain"))

        with open(file_path, "rb") as attachment:
            part = MIMEBase("application", "octet-stream")
            part.set_payload(attachment.read())
            encoders.encode_base64(part)
            part.add_header(
                "Content-Disposition",
                f"attachment; filename={os.path.basename(file_path)}",
            )
            msg.attach(part)

        with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
            server.login(sender_email, sender_password)
            server.send_message(msg)

    # Clean up temp file
    if os.path.exists(file_path):
        os.remove(file_path)

    return {"data": input_df}
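
The function above writes the CSV to a fixed path under /tmp before attaching it. As an alternative sketch, not the method this tutorial uses, the attachment can be built in memory with the standard library's email.message.EmailMessage class, which avoids the temporary file. The helper name build_message and the attachment name data.csv are hypothetical; csv_text stands for the string that pdf.to_csv(index=False) returns when no path is given.

```python
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str,
                  body: str, csv_text: str) -> EmailMessage:
    """Hypothetical helper: build one email with the CSV attached from memory."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # plain-text body
    # Attach the CSV bytes directly; no temporary file is written.
    msg.add_attachment(
        csv_text.encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename="data.csv",  # assumed attachment name
    )
    return msg


# Example CSV text; inside run() this would come from pdf.to_csv(index=False).
example_csv = "name,amount\nAlice,100\nBob,200\n"
message = build_message("you@gmail.com", "alice@example.com",
                        "Test", "Test body", example_csv)
```

A message built this way can be passed to server.send_message() in the same way as the MIMEMultipart message above.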

Step 3: Test the function

Test the function with a sample DataFrame:

test_df = spark.createDataFrame(
    [("Alice", 100), ("Bob", 200)],
    ["name", "amount"]
)

# Test in preview mode (no email sent)
result = run(
    config={
        "is_preview": True,
        "sender_email": "you@gmail.com",
        "secret_scope": "my_email_scope",
        "secret_key": "gmail_app_password",
        "recipients": "alice@example.com",
        "subject": "Test",
        "body": "Test body"
    },
    inputs={"data": test_df},
    spark=spark
)

result["data"].show()
# Expected: the original DataFrame, unchanged

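Note

The secret_scope and secret_key values in the configuration are the names of the secret scope and key that you created in step 1, not the actual password. The operator uses these names to retrieve the password from Azure Databricks secrets at runtime.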

Important

Test with is_preview set to True first to verify the pass-through behavior without sending any email. When you are ready to test actual email delivery, set is_preview to False.
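
Between those two stages, it can be useful to check the SMTP calls without contacting Gmail. The following sketch does that by replacing smtplib.SMTP_SSL with a mock from the standard library's unittest.mock module. The helper send_one is hypothetical and only mirrors the login and send calls made inside run(); the password string is a placeholder, not a real credential.

```python
import smtplib
from email.message import EmailMessage
from unittest.mock import patch


def send_one(sender: str, password: str, msg: EmailMessage) -> None:
    """Hypothetical helper mirroring the SMTP calls made inside run()."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(sender, password)
        server.send_message(msg)


msg = EmailMessage()
msg["From"] = "you@gmail.com"
msg["To"] = "alice@example.com"
msg["Subject"] = "Test"
msg.set_content("Test body")

# Swap the real SMTP_SSL class for a mock so that no connection is opened.
with patch("smtplib.SMTP_SSL") as fake_smtp:
    send_one("you@gmail.com", "placeholder-password", msg)
    # The object bound by "with ... as server" inside send_one.
    fake_server = fake_smtp.return_value.__enter__.return_value

# Verify the calls that would have reached Gmail.
fake_smtp.assert_called_once_with("smtp.gmail.com", 465)
fake_server.login.assert_called_once_with("you@gmail.com", "placeholder-password")
fake_server.send_message.assert_called_once_with(msg)
```

This check covers only the call sequence; delivery, authentication, and attachment handling still need a real test with is_preview set to False.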

Step 4: Create the YAML definition

Create a file named gmail_email_sender.yaml with the following content:

schema: user-defined-operator-v0.1.0
id: gmail_email_sender
type: python-run-function
version: '1.0.0'
name: Gmail Email Sender
description: Sends the input DataFrame as a CSV attachment via Gmail SMTP to one or more recipients.

config:
  type: object
  properties:
    is_preview:
      type: boolean
      format: is_preview
      default: false
    sender_email:
      type: string
      title: Sender Email
      default: ''
      examples:
        - 'you@gmail.com'
      x-ui:
        widget: input
    secret_scope:
      type: string
      title: Secret Scope
      default: ''
      examples:
        - 'my_email_scope'
      x-ui:
        widget: input
    secret_key:
      type: string
      title: Secret Key
      default: ''
      examples:
        - 'gmail_app_password'
      x-ui:
        widget: input
    recipients:
      type: string
      title: Recipients
      default: ''
      examples:
        - 'alice@example.com, bob@example.com'
      x-ui:
        widget: textarea
        rows: 2
    subject:
      type: string
      title: Subject
      default: ''
      examples:
        - 'Designer Output Data'
      x-ui:
        widget: input
    body:
      type: string
      title: Email Body
      default: "Hello,\n\nAttached is the latest data.\n\nBest,\nDatabricks Workflow"
      x-ui:
        widget: textarea
        rows: 6
  required:
    - sender_email
    - secret_scope
    - secret_key
    - recipients
    - subject
  additionalProperties: false

ports:
  input:
    - name: data
      title: Input Data
      mime: application/vnd.databricks.dataframe
  output:
    - name: data
      title: Output Data
      mime: application/vnd.databricks.dataframe

run_function:
  type: inline
  code: |
    from typing import Dict, Any

    def run(config: Dict[str, Any], inputs: Dict[str, Any], spark) -> Dict[str, Any]:
        input_df = inputs["data"]

        if config.get("is_preview", False):
            return {"data": input_df}

        import smtplib
        import os
        from email.mime.multipart import MIMEMultipart
        from email.mime.text import MIMEText
        from email.mime.base import MIMEBase
        from email import encoders

        sender_email = config.get("sender_email", "")
        secret_scope = config.get("secret_scope", "")
        secret_key = config.get("secret_key", "")
        recipients_raw = config.get("recipients", "")
        subject = config.get("subject", "")
        body = config.get("body", "")

        if not sender_email:
            raise ValueError("Sender Email is required.")
        if not secret_scope or not secret_key:
            raise ValueError("Secret Scope and Secret Key are required.")
        if not recipients_raw:
            raise ValueError("At least one recipient is required.")

        recipients = [r.strip() for r in recipients_raw.split(",") if r.strip()]
        if not recipients:
            raise ValueError("At least one valid recipient email is required.")

        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
        sender_password = dbutils.secrets.get(scope=secret_scope, key=secret_key)

        pdf = input_df.toPandas()
        file_path = "/tmp/designer_email_attachment.csv"
        pdf.to_csv(file_path, index=False)

        for recipient in recipients:
            msg = MIMEMultipart()
            msg["From"] = sender_email
            msg["To"] = recipient
            msg["Subject"] = subject
            msg.attach(MIMEText(body, "plain"))

            with open(file_path, "rb") as attachment:
                part = MIMEBase("application", "octet-stream")
                part.set_payload(attachment.read())
                encoders.encode_base64(part)
                part.add_header(
                    "Content-Disposition",
                    f"attachment; filename={os.path.basename(file_path)}",
                )
                msg.attach(part)

            with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
                server.login(sender_email, sender_password)
                server.send_message(msg)

        if os.path.exists(file_path):
            os.remove(file_path)

        return {"data": input_df}
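
The config section of the definition appears to follow JSON Schema conventions: required lists the fields that must be present, and additionalProperties: false rejects unknown fields. How Lakeflow Designer enforces the schema is not described here, so the following sketch only illustrates what those two keywords mean for a config dictionary before it reaches run(). The helper check_config is hypothetical; the field names are copied from the YAML above.

```python
from typing import Any, Dict, List

# Field names copied from the properties and required sections of the YAML definition.
ALLOWED_FIELDS = {
    "is_preview", "sender_email", "secret_scope", "secret_key",
    "recipients", "subject", "body",
}
REQUIRED_FIELDS = ["sender_email", "secret_scope", "secret_key", "recipients", "subject"]


def check_config(config: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems found in a config dict."""
    problems = []
    # "required": every listed field must be present.
    for field in REQUIRED_FIELDS:
        if field not in config:
            problems.append(f"missing required field: {field}")
    # "additionalProperties: false": no fields outside "properties" are allowed.
    for field in sorted(set(config) - ALLOWED_FIELDS):
        problems.append(f"unexpected field: {field}")
    return problems


valid_config = {
    "sender_email": "you@gmail.com",
    "secret_scope": "my_email_scope",
    "secret_key": "gmail_app_password",
    "recipients": "alice@example.com",
    "subject": "Test",
}
valid_problems = check_config(valid_config)
invalid_problems = check_config({"sender_email": "you@gmail.com", "cc": "bob@example.com"})
```

Presence is not the same as a usable value: an empty string satisfies required, which is why run() still checks for empty fields itself.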

Step 5: Save and register the operator

  1. Save the YAML file to your Azure Databricks workspace. For example:

    /Workspace/Users/<user-name>/gmail_email_sender.yaml
    
  2. Add the operator to your .user_defined_operators.yaml file:

    operators:
      - /Workspace/Users/<user-name>/gmail_email_sender.yaml
    

For more information about registration options, see Make your operators discoverable.

Permissions

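Users who run workflows that contain this operator need READ access to the secret scope. Alternatively, they can supply their own secret scope and secret key values in the operator configuration. Users also need read access to the YAML file in the workspace.

To grant access to the secret scope: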

databricks secrets put-acl my_email_scope <user-or-group> READ

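Use the operator in Lakeflow Designer

After registration, the operator appears in Lakeflow Designer with an input port for the data source and configuration fields for the sender email, secret scope, secret key, recipients, subject, and body.

When the workflow runs, the operator converts the input DataFrame to CSV, attaches it to an email, and sends it to each recipient. The DataFrame passes through unchanged to the output port, so you can chain other operators downstream. No email is sent during workflow preview.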