Tutorial: Power BI integration - Create the predictive model with a Jupyter Notebook (part 1 of 2)

In part 1 of this tutorial, you train and deploy a predictive machine learning model by using code in a Jupyter Notebook. You also create a scoring script that defines the model's input and output schema for integration into Power BI. In part 2, you use the model to predict outcomes in Microsoft Power BI.

In this tutorial, you:

  • Create a Jupyter Notebook.
  • Create an Azure Machine Learning compute instance.
  • Train a regression model by using scikit-learn.
  • Write a scoring script that defines the input and output for easy integration into Microsoft Power BI.
  • Deploy the model to a real-time scoring endpoint.

There are three ways to create and deploy the model that you'll use in Power BI. This article covers "Option A: Train and deploy models by using notebooks." This option is a code-first authoring experience. It uses Jupyter notebooks that are hosted in Azure Machine Learning Studio.

But you could instead use one of the other options:

Prerequisites

Create a notebook and compute

On the Azure Machine Learning Studio home page, select Create new > Notebook:

Screenshot showing how to create a notebook.

On the Create a new file page:

  1. Name your notebook (for example, my_model_notebook).
  2. Change the File Type to Notebook.
  3. Select Create.

Next, to run code cells, create a compute instance and attach it to your notebook. Start by selecting the plus icon at the top of the notebook:

Screenshot showing how to create a compute instance.

On the Create compute instance page:

  1. Choose a CPU virtual machine size. For this tutorial, you can choose Standard_D11_v2, which has 2 cores and 14 GB of RAM.
  2. Select Next.
  3. On the Configure Settings page, provide a valid Compute name. Valid characters are uppercase and lowercase letters, digits, and hyphens (-).
  4. Select Create.
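
The character rule in step 3 can be checked before you type a name into the portal. The following is a minimal sketch that covers only the characters listed above; the helper name has_only_valid_characters is hypothetical, and the service may enforce further rules (such as length limits) that this check does not cover.

```python
import re

# Characters the tutorial lists as valid: letters, digits, and hyphens.
_VALID_NAME = re.compile(r"[A-Za-z0-9-]+")


def has_only_valid_characters(name: str) -> bool:
    """Return True if `name` is non-empty and uses only the listed characters."""
    return _VALID_NAME.fullmatch(name) is not None


print(has_only_valid_characters("my-compute-01"))  # letters, digits, hyphens only
print(has_only_valid_characters("my_compute"))     # contains an underscore
```

The first name passes the check; the second fails because an underscore is not among the listed characters.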

In the notebook, you might notice that the circle next to Compute has turned cyan. This color change indicates that the compute instance is being created:

Screenshot showing the compute being created.

Note

The compute instance can take 2 to 4 minutes to be provisioned.

After the compute is provisioned, you can use the notebook to run code cells. For example, in the cell you can type the following code:

import numpy as np

np.sin(3)

Then press Shift + Enter (or press Control + Enter, or select the Play button next to the cell). You should see the following output:

Screenshot showing the cell output.
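
Because the screenshot is not reproduced here, the expected value can be confirmed independently. This small sketch uses the standard library's math.sin instead of NumPy (an assumption made only so that it runs anywhere); both functions interpret the argument as radians, so the result should be approximately 0.1411.

```python
import math

# 3 is interpreted as radians, not degrees.
value = math.sin(3)
print(value)
```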

Now you're ready to build a machine learning model.

Build a model by using scikit-learn

In this tutorial, you use the Diabetes dataset. This dataset is available in Azure Open Datasets.

Import data

To import your data, copy the following code and paste it into a new code cell in your notebook.

from azureml.opendatasets import Diabetes

diabetes = Diabetes.get_tabular_dataset()
X = diabetes.drop_columns("Y")
y = diabetes.keep_columns("Y")
X_df = X.to_pandas_dataframe()
y_df = y.to_pandas_dataframe()
X_df.info()

The X_df pandas data frame contains 10 baseline input variables. These variables include age, sex, body mass index, average blood pressure, and six blood serum measurements. The y_df pandas data frame is the target variable. It contains a quantitative measure of disease progression one year after the baseline. The data frame contains 442 records.

Train the model

Create a new code cell in your notebook. Then copy the following code and paste it into the cell. This code snippet constructs a ridge regression model and serializes the model by using the Python pickle format.

import joblib
from sklearn.linear_model import Ridge

model = Ridge().fit(X_df,y_df)
joblib.dump(model, 'sklearn_regression_model.pkl')
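
To make the two ideas in this step concrete (ridge shrinkage and serialization), here is a minimal pure-Python sketch. It is not scikit-learn's implementation: it fits a single feature with an unpenalized intercept, and the names fit_ridge_1d and predict_1d are hypothetical. The fitted parameters are then round-tripped through the standard pickle module, which is the format the tutorial's joblib call relies on.

```python
import pickle


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Fit y = intercept + coef * x with an L2 penalty of `alpha` on coef."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # The alpha term in the denominator shrinks the slope toward zero.
    coef = sxy / (sxx + alpha)
    intercept = y_mean - coef * x_mean
    return {"coef": coef, "intercept": intercept}


def predict_1d(params, xs):
    """Apply the fitted line to new inputs."""
    return [params["intercept"] + params["coef"] * x for x in xs]


# Toy data that lies exactly on y = 2x; the penalty pulls the slope below 2.
params = fit_ridge_1d([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], alpha=1.0)

# Serialize to bytes and restore, as a stand-in for dumping to a .pkl file.
restored = pickle.loads(pickle.dumps(params))
print(params["coef"], predict_1d(restored, [5.0]))
```

The restored parameters produce the same predictions as the originals, which is the property the scoring script depends on when it reloads the model file.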

Register the model

In addition to the content of the model file itself, your registered model stores metadata. The metadata includes the model description, tags, and framework information.

Metadata is useful when you're managing and deploying models in your workspace. By using tags, for instance, you can categorize your models and apply filters when you list models in your workspace. Also, if you mark this model with the scikit-learn framework, you simplify deploying it as a web service.

Copy the following code and then paste it into a new code cell in your notebook.

import sklearn

from azureml.core import Workspace
from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

ws = Workspace.from_config()

model = Model.register(workspace=ws,
                       model_name='my-sklearn-model',                # Name of the registered model in your workspace.
                       model_path='./sklearn_regression_model.pkl',  # Local file to upload and register as a model.
                       model_framework=Model.Framework.SCIKITLEARN,  # Framework used to create the model.
                       model_framework_version=sklearn.__version__,  # Version of scikit-learn used to create the model.
                       sample_input_dataset=X,
                       sample_output_dataset=y,
                       resource_configuration=ResourceConfiguration(cpu=2, memory_in_gb=4),
                       description='Ridge regression model to predict diabetes progression.',
                       tags={'area': 'diabetes', 'type': 'regression'})

print('Name:', model.name)
print('Version:', model.version)
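
The tag-based filtering mentioned earlier can be illustrated without a workspace. The sketch below uses plain dictionaries as stand-ins for registered models; the second entry and the helper filter_by_tags are hypothetical and are not part of the Azure Machine Learning SDK.

```python
# In-memory stand-ins for registered models and their tags.
registered = [
    {"name": "my-sklearn-model",
     "tags": {"area": "diabetes", "type": "regression"}},
    {"name": "hypothetical-churn-model",
     "tags": {"area": "retail", "type": "classification"}},
]


def filter_by_tags(models, **wanted):
    """Keep only the models whose tags match every requested key/value pair."""
    return [
        m for m in models
        if all(m["tags"].get(key) == value for key, value in wanted.items())
    ]


matches = filter_by_tags(registered, area="diabetes")
print([m["name"] for m in matches])
```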

You can also view the model in Azure Machine Learning Studio. In the menu on the left, select Models:

Screenshot showing how to view a model.

Define the scoring script

When you deploy a model that will be integrated into Power BI, you need to define a Python scoring script and a custom environment. The scoring script contains two functions:

  • The init() function runs when the service starts. It loads the model (which is automatically downloaded from the model registry) and deserializes it.
  • The run(data) function runs when a call to the service includes input data that needs to be scored.
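
The division of labor between the two functions can be sketched with a stub in place of a real model. Everything here is illustrative: _stub_model is a hypothetical stand-in for the deserialized model, and the error handling mirrors the pattern of returning the error message as a string.

```python
model = None


def _stub_model(rows):
    """Hypothetical stand-in for a trained model: sums each input row."""
    return [sum(row) for row in rows]


def init():
    """Runs once when the service starts: load the model into a global."""
    global model
    model = _stub_model


def run(data):
    """Runs on every request: score the input and return serializable output."""
    try:
        return model(data)
    except Exception as e:
        # Return the error text instead of raising, so the caller gets a response.
        return str(e)


init()                         # called once by the hosting service
print(run([[1, 2], [3, 4]]))   # a well-formed request
print(run(None))               # a malformed request returns an error string
```

Loading the model once in init() keeps each run() call cheap, because the model is not deserialized again for every request.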

Note

The Python decorators in the code below define the schema of the input and output data, which is important for integration into Power BI.

Copy the following code and paste it into a new code cell in your notebook. The following code snippet has cell magic that writes the code to a file named score.py.

%%writefile score.py

import json
import pickle
import numpy as np
import pandas as pd
import os
import joblib
from azureml.core.model import Model

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


def init():
    global model
    # Replace filename if needed.
    path = os.getenv('AZUREML_MODEL_DIR') 
    model_path = os.path.join(path, 'sklearn_regression_model.pkl')
    # Deserialize the model file back into a sklearn model.
    model = joblib.load(model_path)


input_sample = pd.DataFrame(data=[{
    "AGE": 5,
    "SEX": 2,
    "BMI": 3.1,
    "BP": 3.1,
    "S1": 3.1,
    "S2": 3.1,
    "S3": 3.1,
    "S4": 3.1,
    "S5": 3.1,
    "S6": 3.1
}])

# This is an integer type sample. Use the data type that reflects the expected result.
output_sample = np.array([0])

# The decorators below declare the input and output schema.
# To accept input of variable length, pass enforce_shape=False
# to the parameter type (it is not set here).
@input_schema('data', PandasParameterType(input_sample))
@output_schema(NumpyParameterType(output_sample))
def run(data):
    try:
        print("input_data....")
        print(data.columns)
        print(type(data))
        result = model.predict(data)
        print("result.....")
        print(result)
        # You can return any data type, as long as it can be serialized by JSON.
        return result.tolist()
    except Exception as e:
        error = str(e)
        return error
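
The input sample in score.py fixes the column names that callers are expected to send. As a sketch of what a matching request body could look like, the following builds a JSON payload under the 'data' key and checks that each record carries all ten columns. The helper build_payload is hypothetical, and the sketch assumes the service accepts rows in records form (one dictionary per row).

```python
import json

# Column names taken from input_sample in score.py.
EXPECTED_COLUMNS = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def build_payload(records):
    """Validate column names, then wrap the records under the 'data' key."""
    for i, record in enumerate(records):
        missing = [c for c in EXPECTED_COLUMNS if c not in record]
        if missing:
            raise ValueError(f"record {i} is missing columns: {missing}")
    return json.dumps({"data": records})


# Same values as input_sample: 3.1 everywhere except AGE and SEX.
sample = dict.fromkeys(EXPECTED_COLUMNS, 3.1)
sample.update({"AGE": 5, "SEX": 2})

payload = build_payload([sample])
print(payload)
```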

Define the custom environment

Next, define the environment to score the model. In the environment, define the Python packages, such as pandas and scikit-learn, that the scoring script (score.py) requires.

To define the environment, copy the following code and paste it into a new code cell in your notebook.

from azureml.core.model import InferenceConfig
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

environment = Environment('my-sklearn-environment')
environment.python.conda_dependencies = CondaDependencies.create(pip_packages=[
    'azureml-defaults',
    'inference-schema[numpy-support]',
    'joblib',
    'numpy',
    'pandas',
    'scikit-learn=={}'.format(sklearn.__version__)
])

inference_config = InferenceConfig(entry_script='./score.py',environment=environment)
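
The last entry in the package list pins scikit-learn to the version used for training, so that the pickled model is loaded by the same library version that created it. The string formatting involved can be sketched in isolation; the helper pin is hypothetical, and "1.0.2" is a placeholder version rather than a recommendation.

```python
def pin(package, version):
    """Return a pip requirement string that pins `package` to `version`."""
    return "{}=={}".format(package, version)


# "1.0.2" is a placeholder; the tutorial passes sklearn.__version__ instead.
pip_packages = ["azureml-defaults", "joblib", pin("scikit-learn", "1.0.2")]
print(pip_packages)
```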

Deploy the model

To deploy the model, copy the following code and paste it into a new code cell in your notebook:

service_name = 'my-diabetes-model'

service = Model.deploy(ws, service_name, [model], inference_config, overwrite=True)
service.wait_for_deployment(show_output=True)

Note

The service can take 2 to 4 minutes to deploy.

If the service deploys successfully, you should see the following output:

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running......................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"

You can also view the service in Azure Machine Learning Studio. In the menu on the left, select Endpoints:

Screenshot showing how to view the service.

We recommend that you test the web service to ensure it works as expected. To return to your notebook, in Azure Machine Learning Studio, in the menu on the left, select Notebooks. Then copy the following code and paste it into a new code cell in your notebook to test the service.

import json

input_payload = json.dumps({
    'data': X_df[0:2].values.tolist()
})

output = service.run(input_payload)

print(output)

Because run() returns result.tolist(), the output should be a nested list with one prediction per input row, similar to [[205.59], [68.84]].
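
The same test can in principle be made over HTTP, which is closer to how an external client reaches the endpoint. The sketch below only builds the request and does not send it. The URL is a placeholder (in the notebook you would likely use the service's scoring_uri attribute), the feature values are illustrative, and it assumes the endpoint requires no authentication.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Build a JSON POST request for a scoring endpoint without sending it."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


# Placeholder URI and illustrative feature values (10 columns per row).
request = build_scoring_request(
    "http://example.invalid/score",
    [[59, 2, 32.1, 101.0, 157, 93.2, 38.0, 4.0, 4.86, 87]],
)
print(request.get_method(), request.full_url)

# To send it against a real endpoint:
# response = urllib.request.urlopen(request).read()
```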

Next steps

In this tutorial, you saw how to build and deploy a model so that it can be consumed by Power BI. In the next part, you'll learn how to consume this model in a Power BI report.