Tutorial 2: Experiment and train models by using features

2024/11/11

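This tutorial series shows how features seamlessly integrate all phases of the machine learning lifecycle: prototyping, training, and operationalization.

The first tutorial showed how to create a feature set specification with custom transformations. It then showed how to use that feature set to generate training data, enable materialization, and perform a backfill. This tutorial shows how to experiment with features as a way to improve model performance.

In this tutorial, you learn how to:

- Prototype a new accounts feature set specification by using existing precomputed values as features. Then, register the local feature set specification as a feature set in the feature store. This process differs from the first tutorial, where you created a feature set that had custom transformations.
- Select features for the model from the transactions and accounts feature sets, and save them as a feature retrieval specification.
- Run a training pipeline that uses the feature retrieval specification to train a new model. This pipeline uses the built-in feature retrieval component to generate the training data.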

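Prerequisites

Before you proceed with this tutorial, be sure to complete the first tutorial in the series.

Set up

Configure the Azure Machine Learning Spark notebook.

You can create a new notebook and execute the instructions in this tutorial step by step. You can also open and run the existing notebook named 2.Experiment-train-models-using-features.ipynb from the featurestore_sample/notebooks directory. You can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for documentation links and more explanation.
1. On the top menu, in the "Compute" dropdown list, select "Serverless Spark Compute" under "Azure Machine Learning Serverless Spark".
2. Configure the session:
  1. When the toolbar displays "Configure session", select it.
  2. On the "Python packages" tab, select "Upload Conda file".
  3. Upload the conda.yml file that you uploaded in the first tutorial.
  4. Optionally, increase the session time-out (idle time) to avoid frequent prerequisite reruns.

Start the Spark session.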

```
# Run this cell to start the Spark session (any code block starts the session). This can take around 10 minutes.
print("start spark session")
```

Set up the root directory for the samples.

```
import os

# Update the directory to ./Users/{your-alias} (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav.
root_dir = "./Users/<your user alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")
```

Set up the CLI.

Python SDK: not applicable.

Azure CLI: complete the following steps.

Install the Azure Machine Learning extension.
```
!az extension add --name ml
```
Authenticate.
```
!az login
```

Set the default subscription.

```
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id
```

Initialize the project workspace variables.

This is the current workspace, and the tutorial notebook runs in this resource.

```
# Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)
```
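
The workspace cell reads three environment variables that the notebook environment is expected to provide. As a minimal sketch (the helper name `missing_workspace_vars` is hypothetical and not part of the SDK), you could check for them first, so that a missing variable produces a clear message instead of a `KeyError`:

```python
import os

# Environment variables that the tutorial cells read.
REQUIRED_VARS = (
    "AZUREML_ARM_SUBSCRIPTION",
    "AZUREML_ARM_RESOURCEGROUP",
    "AZUREML_ARM_WORKSPACE_NAME",
)


def missing_workspace_vars(env=None):
    """Return the required variable names that are absent or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_workspace_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All workspace variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try outside a notebook.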

Initialize the feature store variables.

Be sure to update the featurestore_name value to reflect what you created in the first tutorial.

```
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# feature store
featurestore_name = "my-featurestore"  # use the same name from part #1 of the tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# feature store ml client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)
```

Initialize the feature store consumption client.

```
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)
```

Create a compute cluster named cpu-cluster in the project workspace.

You need this compute cluster when you run the training and batch inference jobs.

```
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # you can replace it with other supported VM SKUs
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()
```

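Create the accounts feature set in a local environment

In the first tutorial, you created a transactions feature set that had custom transformations. Now, create an accounts feature set that uses precomputed values.

To onboard precomputed features, you can create a feature set specification without writing any transformation code. Use a feature set specification to develop and test a feature set in a fully local development environment.

You don't need to connect to a feature store. In this procedure, you create the feature set specification locally and then sample the values from it. To benefit from the capabilities of managed feature store, you must use a feature asset definition to register the feature set specification with a feature store. Later steps in this tutorial provide more details.

Explore the source data for the accounts.

Note

This notebook uses sample data hosted in a publicly accessible blob container. In Spark, only the wasbs driver can read it. When you create feature sets from your own source data, host that data in an Azure Data Lake Storage Gen2 account, and use the abfss driver in the data path.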
```
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))
```
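
The sample path uses the `wasbs` scheme with a Blob endpoint, while your own data in Azure Data Lake Storage Gen2 would use the `abfss` scheme with a `dfs` endpoint. As a sketch of the `abfss` URI shape, the hypothetical helper below assembles a path from its parts. The container, account, and folder names are placeholders, not real resources.

```python
def abfss_path(container, account, relative_path):
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


# Placeholder names for illustration only.
print(abfss_path("mycontainer", "mystorageaccount", "accounts/*.parquet"))
```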

Create the accounts feature set specification locally from these precomputed features.

You don't need any transformation code here, because you reference precomputed features.

```
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)


accounts_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # account profiles in the source are updated once a year. set temporal_join_lookback to 365 days
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
accounts_fset_df = accounts_featureset_spec.to_spark_dataframe()
# display a few records
display(accounts_fset_df.head(5))
```
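
The `temporal_join_lookback` setting bounds how far back a point-in-time join may look for a feature value. The following is a conceptual sketch in plain Python, not the feature store's implementation: for one observation timestamp, it picks the most recent feature record at or before that time and ignores records older than the lookback window. The function name, the inclusive boundaries, and the sample records are assumptions for illustration.

```python
from datetime import datetime, timedelta


def latest_within_lookback(records, event_time, lookback):
    """Return the value of the newest record whose timestamp lies in
    [event_time - lookback, event_time], or None if there is none."""
    earliest = event_time - lookback
    candidates = [
        (ts, value) for ts, value in records if earliest <= ts <= event_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# Hypothetical (timestamp, accountAge) records for one account.
records = [
    (datetime(2022, 1, 1), 3),
    (datetime(2023, 1, 1), 4),
]
lookback = timedelta(days=365)

print(latest_within_lookback(records, datetime(2023, 6, 1), lookback))  # 4
print(latest_within_lookback(records, datetime(2024, 6, 1), lookback))  # None
```

In the second call, the newest record is more than 365 days older than the observation, so nothing is returned. This is why the lookback should be at least as long as the interval at which the source data is refreshed.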

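Export as a feature set specification.

To register the feature set specification with the feature store, you must save the specification in a specific format.

After you run the next cell, inspect the generated accounts feature set specification. To see the specification, open the featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml file from the file tree.

The specification has these important elements:
- source: A reference to a storage resource. In this case, it's a parquet file in a blob storage resource.
- features: A list of features and their data types. If you provide transformation code, that code must return a dataframe that maps to the features and data types. If you don't provide transformation code, the system builds the query that maps the features and data types to the source. In this case, the generated accounts feature set specification contains no transformation code, because the features are precomputed.
- index_columns: The join keys required to access values from the feature set.
To learn more, see the resources on understanding top-level entities in managed feature store and on the CLI (v2) feature set specification YAML schema.

As an extra benefit, persisting the specification supports source control.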

```
import os

# create a new folder to dump the feature set spec
accounts_featureset_spec_folder = root_dir + "/featurestore/featuresets/accounts/spec"

# check if the folder exists, create one if not
if not os.path.exists(accounts_featureset_spec_folder):
    os.makedirs(accounts_featureset_spec_folder)

accounts_featureset_spec.dump(accounts_featureset_spec_folder)
```

Experiment locally with unregistered features and register them with the feature store when ready

When you develop features, you might want to test and validate them locally before you register them with the feature store or run training pipelines in the cloud. Combine a local unregistered feature set (accounts) with a feature set registered in the feature store (transactions) to generate training data for the machine learning model.

Select features for the model.

```
# get the registered transactions feature set, version 1
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# Notice that the accounts feature set spec is in your local dev environment (this notebook): it isn't registered with the feature store yet
features = [
    accounts_featureset_spec.get_feature("accountAge"),
    accounts_featureset_spec.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]
```

Generate training data locally.

The training data that this step generates is for illustrative purposes. At this step, you can optionally train models locally. Later steps in this tutorial explain how to train a model in the cloud.

```
from azureml.featurestore import get_offline_features

# Load the observation data. To understand observation data, refer to part 1 of this tutorial
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says the feature set is not materialized (materialization is optional).
display(training_df)
# Note: display(training_df.head(5)) displays the timestamp column in a different format. You can call training_df.show() to see correctly formatted values
```

Register the accounts feature set with the feature store.

After you experiment locally with the feature definitions, and they seem reasonable, you can register a feature set asset definition with the feature store.

```
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())
```

Get the registered feature set, and test it.

```
# look up the featureset by providing name and version
accounts_featureset = featurestore.feature_sets.get("accounts", "1")
# get access to the feature data
accounts_feature_df = accounts_featureset.to_spark_dataframe()
display(accounts_feature_df.head(5))
# Note: Please ignore this warning: Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun
```

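Run a training experiment

In these steps, you select a list of features, run a training pipeline, and register the model. You can repeat these steps until the model performs as you want.

(Optional) Discover features from the feature store UI.

The first tutorial covered this step, when you registered the transactions feature set. Because you now also have an accounts feature set, you can browse the available features:
1. Go to the Azure Machine Learning global landing page.
2. On the left pane, select "Feature stores".
3. In the list of feature stores, select the feature store that you created earlier.
The UI shows the feature sets and entities that you created. Select a feature set to browse its feature definitions. You can use the global search box to search for feature sets across feature stores.

(Optional) Discover features from the SDK.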

```
# List available feature sets
all_featuresets = featurestore.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List of versions for transactions feature set
all_transactions_featureset_versions = featurestore.feature_sets.list(
    name="transactions"
)
for fs in all_transactions_featureset_versions:
    print(fs)

# See properties of the transactions featureset including list of features
featurestore.feature_sets.get(name="transactions", version="1").features
```

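Select features for the model, and export them as a feature retrieval specification.

In the previous steps, you selected features from a combination of registered and unregistered feature sets, for local experimentation and testing. You can now experiment in the cloud. Your model-shipping agility increases if you save the selected features as a feature retrieval specification, and then use that specification in the machine learning operations (MLOps) or continuous integration and continuous delivery (CI/CD) flow for training and inference.
1. Select features for the model.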
```
# you can select features in pythonic way
features = [
     accounts_featureset.get_feature("accountAge"),
     transactions_featureset.get_feature("transaction_amount_7d_sum"),
     transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
     "accounts:1:numPaymentRejects1dPerUser",
     "transactions:1:transaction_amount_7d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)

features.extend(more_features)
```
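2. Export the selected features as a feature retrieval specification.
  
  A feature retrieval specification is a portable definition of the feature list associated with a model. It can help streamline the development and operationalization of a machine learning model. It becomes an input to the training pipeline that generates the training data. It's then packaged with the model.
  
  The inference phase uses the feature retrieval specification to look up the features. The specification therefore ties together all phases of the machine learning lifecycle. Changes to the training and inference pipelines can stay at a minimum as you experiment and deploy.
  
  Use of the feature retrieval specification and the built-in feature retrieval component is optional. You can directly use the get_offline_features() API, as shown earlier in this tutorial. When you package the specification with the model, name it feature_retrieval_spec.yaml so that the system can recognize it.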
```
# Create feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# check if the folder exists, create one if not
if not os.path.exists(feature_retrieval_spec_folder):
    os.makedirs(feature_retrieval_spec_folder)

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
```
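
Because the system recognizes the specification by its file name, a quick check that the expected file exists in the folder can catch packaging mistakes early. This is a minimal sketch: the helper `has_retrieval_spec` is hypothetical, and it assumes the file sits directly in the folder under the name feature_retrieval_spec.yaml. The demonstration uses a temporary folder rather than the real project folder.

```python
import os
import tempfile

SPEC_FILE_NAME = "feature_retrieval_spec.yaml"


def has_retrieval_spec(folder):
    """Return True if folder contains a file with the expected spec name."""
    return os.path.isfile(os.path.join(folder, SPEC_FILE_NAME))


# Demonstrate with a temporary folder instead of the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    print(has_retrieval_spec(tmp))  # False: nothing written yet
    with open(os.path.join(tmp, SPEC_FILE_NAME), "w") as f:
        f.write("# placeholder content\n")
    print(has_retrieval_spec(tmp))  # True
```

In the notebook, you would pass `feature_retrieval_spec_folder` to the helper after the export cell runs.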

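Train in the cloud with pipelines, and register the model

In this procedure, you manually trigger the training pipeline. In a production scenario, a CI/CD pipeline could trigger it, based on changes to the feature retrieval specification in the source repository. You can register the model if it's satisfactory.

Run the training pipeline.

The training pipeline has these steps:
1. Feature retrieval: This built-in component takes the feature retrieval specification, the observation data, and the timestamp column name as inputs. It then generates the training data as output. It runs as a managed Spark job.
2. Training: Based on the training data, this step trains the model and then generates a model (not yet registered).
3. Evaluation: This step validates whether the model performance and quality fall within a threshold. (In this tutorial, it's a placeholder step for illustration purposes.)
4. Register the model: This step registers the model.
  
  Note
  
  In the first tutorial, you ran a backfill job to materialize data for the transactions feature set. The feature retrieval step reads feature values from the offline store for this feature set. The behavior is the same even if you use the get_offline_features() API.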
```
from azure.ai.ml import load_job  # will be used later

training_pipeline_path = (
  root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: First time it runs, each step in pipeline can take ~ 15 mins. However subsequent runs can be faster (assuming spark pool is warm - default timeout is 30 mins)
```
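
The evaluation step in this pipeline is only a placeholder. As a hypothetical sketch of what a threshold gate could look like (the metric names and threshold values here are assumptions, not part of the sample pipeline), a model passes only if every tracked metric meets its minimum:

```python
def passes_thresholds(metrics, minimums):
    """Return True only if every metric named in minimums meets its minimum."""
    return all(
        metrics.get(name, float("-inf")) >= floor
        for name, floor in minimums.items()
    )


# Hypothetical metric values and thresholds, for illustration only.
metrics = {"accuracy": 0.93, "recall": 0.71}
print(passes_thresholds(metrics, {"accuracy": 0.90, "recall": 0.70}))  # True
print(passes_thresholds(metrics, {"accuracy": 0.95}))  # False
```

A metric that is missing from `metrics` counts as a failure, which is the safer default for a registration gate.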
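Inspect the training pipeline and the model.
  - To display the pipeline steps, select the hyperlink for the Web View pipeline, and open it in a new window.
Use the feature retrieval specification in the model artifacts:
1. On the left pane of the current workspace, right-click "Models".
2. Select "Open in new tab or window".
3. Select "fraud_model".
4. Select "Artifacts".
The feature retrieval specification is packaged along with the model. The model registration step in the training pipeline handled this packaging. You created the feature retrieval specification during experimentation. Now it's part of the model definition. The next tutorial shows how the inference process uses it.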

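View the feature set and model dependencies

View the list of feature sets associated with the model.

On the same model page, select the "Feature sets" tab. This tab shows both the transactions and accounts feature sets. This model depends on these feature sets.
View the list of models that use the feature sets:
1. Open the feature store UI (explained earlier in this tutorial).
2. On the left pane, select "Feature sets".
3. Select a feature set.
4. Select the "Models" tab.
The feature retrieval specification determined this list when the model was registered.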
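
Because the feature retrieval specification lists the features a model uses, the feature sets that the model depends on can be derived from that list. The following conceptual sketch in plain Python groups features by feature set and version, using the `featureset:version:feature` string form that appears earlier in this tutorial. The helper `feature_sets_used` is hypothetical and doesn't reflect how the service computes the dependency list.

```python
from collections import defaultdict


def feature_sets_used(feature_uris):
    """Group feature names by (feature set, version)."""
    grouped = defaultdict(list)
    for uri in feature_uris:
        feature_set, version, feature = uri.split(":")
        grouped[(feature_set, version)].append(feature)
    return dict(grouped)


# The five features selected earlier in this tutorial, in string form.
uris = [
    "accounts:1:accountAge",
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_sum",
    "transactions:1:transaction_amount_3d_sum",
    "transactions:1:transaction_amount_7d_avg",
]
for key, names in feature_sets_used(uris).items():
    print(key, names)
```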

Clean up

The fifth tutorial in this series describes how to delete the resources.
