Troubleshoot managed feature store

Article
11/13/2024

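In this article, learn how to troubleshoot common problems you might encounter when you use managed feature store in Azure Machine Learning.

Issues found when creating and updating a feature store

You might encounter these issues when you create or update a feature store:

ARM throttling error
Duplicate materialization identity ARM ID issue
RBAC permission errors
Older versions of the azure-mgmt-authorization package don't work with AzureMLOnBehalfOfCredential

ARM throttling error

Symptom

Feature store creation or update fails. The error might look like this: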

{
  "error": {
    "code": "TooManyRequests",
    "message": "The request is being throttled as the limit has been reached for operation type - 'Write'. ..",
    "details": [
      {
        "code": "TooManyRequests",
        "target": "Microsoft.MachineLearningServices/workspaces",
        "message": "..."
      }
    ]
  }
}

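Solution

Run the feature store create or update operation again at a later time. Because the deployment occurs in multiple steps, the second attempt might fail because some of the resources already exist. Delete those resources and resume the job.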
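
Because throttling is transient, the "run it again later" step can be automated. The following is a minimal sketch, not part of the Azure SDK: retry_with_backoff and is_throttled are hypothetical helper names, and operation stands for whatever callable submits your create or update request. It assumes that the throttling failure surfaces as an exception whose text contains the TooManyRequests code shown in the error above.

```python
import time


def is_throttled(error: Exception) -> bool:
    # Assumption: the ARM error code appears in the exception text.
    return "TooManyRequests" in str(error)


def retry_with_backoff(operation, max_attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call operation(); after a throttling error, wait and try again.

    The wait doubles after every throttled attempt (30 s, 60 s, 120 s, ...).
    Any other error, or a throttling error on the last attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            if not is_throttled(error) or attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in operation that is throttled twice before it succeeds.
attempts = []
waits = []


def fake_create():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("(TooManyRequests) The request is being throttled")
    return "created"


# Passing waits.append as the sleep function records the delays without waiting.
result = retry_with_backoff(fake_create, sleep=waits.append)
```

In a notebook, operation could be a small function that wraps the create or update call and waits for its result. A retry can still fail if an earlier attempt left partial resources behind, so the cleanup described above still applies.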

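Duplicate materialization identity ARM ID issue

After the feature store is updated to enable materialization for the first time, some later updates to the feature store might result in this error.

Symptom

When the feature store is updated with the SDK or CLI, the update fails with this error message: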

{
  "error":{
    "code": "InvalidRequestContent",
    "message": "The request content contains duplicate JSON property names creating ambiguity in paths 'identity.userAssignedIdentities['/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}']'. Please update the request content to eliminate duplicates and try again."
  }
}

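Solution

This issue involves the format of the materialization_identity ARM ID.

In the Azure UI or SDK, the ARM ID of a user-assigned managed identity uses lowercase resourcegroups, as in this example:

(A): /subscriptions/{sub-id}/resourcegroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}

When the feature store uses the user-assigned managed identity as its materialization_identity, its ARM ID is normalized and stored with resourceGroups, as in this example:

(B): /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}

In an update request, you might use a user-assigned identity that matches the materialization identity to update the feature store. If the request specifies that identity with an ARM ID in format (A), the update fails and returns the earlier error message.

To fix the issue, replace the string resourcegroups with resourceGroups in the user-assigned managed identity ARM ID, and then run the feature store update again.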
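
The string replacement can be done in code before the ID is placed in the update request. This is a minimal sketch that uses only the Python standard library; normalize_uai_arm_id is a hypothetical helper name, not an SDK function.

```python
import re


def normalize_uai_arm_id(arm_id: str) -> str:
    """Rewrite the 'resourcegroups' path segment as 'resourceGroups'."""
    # Match the segment name in any letter case, and only as a whole path
    # segment. count=1 leaves the rest of the ID untouched.
    return re.sub(
        r"/resourcegroups/", "/resourceGroups/", arm_id, count=1, flags=re.IGNORECASE
    )


format_a = (
    "/subscriptions/{sub-id}/resourcegroups/{rg}/providers/"
    "Microsoft.ManagedIdentity/userAssignedIdentities/{your-uai}"
)
format_b = normalize_uai_arm_id(format_a)
```

The function is idempotent, so it is safe to apply to an ID that is already in format (B).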

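RBAC permission errors

To create a feature store, the user needs the Contributor and User Access Administrator roles. A custom role that covers the same set of actions also works, as does a role that covers a superset of those actions.

Symptom

If the user doesn't have the required roles, the deployment fails. The error response might look like this example: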

{
  "error": {
    "code": "AuthorizationFailed",
    "message": "The client '{client_id}' with object id '{object_id}' does not have authorization to perform action '{action_name}' over scope '{scope}' or the scope is invalid. If access was recently granted, please refresh your credentials."
  }
}

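Solution

Grant the Contributor and User Access Administrator roles to the user on the resource group where the feature store is to be created. Then, instruct the user to run the deployment again.

For more information, visit the Permissions required for the feature store materialization managed identity role resource.

Older versions of the azure-mgmt-authorization package don't work with AzureMLOnBehalfOfCredential

Symptom

When you use the setup_storage_uai script provided in the featurestore_sample folder of the azureml-examples repository, the script fails with this error message: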

AttributeError: 'AzureMLOnBehalfOfCredential' object has no attribute 'signed_session'

Solution

Check the version of the installed azure-mgmt-authorization package, and verify that you're using a recent version, 3.0.0 or later. Older versions, for example 0.61.0, don't work with AzureMLOnBehalfOfCredential.
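
The version check can be scripted with the standard library. This is a sketch under stated assumptions: release_tuple, meets_minimum, and installed_version are hypothetical helper names, and the comparison looks only at the leading dotted numbers, so a pre-release such as 3.0.0b1 is treated the same as 3.0.0.

```python
import re
from importlib import metadata


def release_tuple(version: str, width: int = 3) -> tuple:
    """Turn '3.0.0' or '4.0.0b1' into a comparable tuple such as (3, 0, 0)."""
    # Keep only the leading dotted numbers; pre-release suffixes are ignored.
    numbers = re.match(r"\d+(?:\.\d+)*", version)
    if numbers is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in numbers.group().split("."))
    # Pad short versions such as '3.0' so that they compare equal to '3.0.0'.
    return parts + (0,) * max(0, width - len(parts))


def meets_minimum(installed: str, minimum: str = "3.0.0") -> bool:
    return release_tuple(installed) >= release_tuple(minimum)


def installed_version(package: str = "azure-mgmt-authorization"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

A typical use is to call installed_version() in the notebook session and pass the result to meets_minimum() before the setup script runs.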

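Feature set spec create errors

Invalid schema in feature set spec

Before you register a feature set in the feature store, define the feature set spec locally, and run <feature_set_spec>.to_spark_dataframe() to validate it.

Symptom

When a user runs <feature_set_spec>.to_spark_dataframe(), various schema validation failures can occur if the feature set dataframe schema doesn't align with the feature set spec definition.

For example:

Error message: azure.ai.ml.exceptions.ValidationException: Schema check errors, timestamp column: timestamp is not in output dataframe
Error message: Exception: Schema check errors, no index column: accountID in output dataframe
Error message: ValidationException: Schema check errors, feature column: transaction_7d_count has data type: ColumnType.long, expected: ColumnType.string

Solution

Check the schema validation failure error, and update the feature set spec definition accordingly, for both the column names and the column types. For example:

Update the source.timestamp_column.name property to correctly define the timestamp column name
Update the index_columns property to correctly define the index columns
Update the features property to correctly define the feature column names and types
If the feature source data is of type csv, verify that the CSV files are generated with column headers

Next, run <feature_set_spec>.to_spark_dataframe() again to check whether the validation passes.

If you define the feature set spec with the SDK, the infer_schema option is the recommended way to autofill the features, instead of typing in the values by hand. The timestamp_column and index_columns can't be autofilled.

For more information, visit the Feature set spec schema resource.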
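
The three example errors correspond to three kinds of mismatch: a missing timestamp column, a missing index column, and a feature column with an unexpected type. The sketch below reproduces that comparison on plain dictionaries so the logic is easy to follow. It is not the SDK's validator; schema_problems is a hypothetical helper, and the type names are plain strings chosen for illustration.

```python
def schema_problems(actual_columns, timestamp_column, index_columns, features):
    """Compare a dataframe schema with what a feature set spec declares.

    actual_columns: {column name: type name} taken from the output dataframe.
    features:       {feature name: type name} as declared in the spec.
    Returns a list of readable problems; an empty list means no mismatch.
    """
    problems = []
    if timestamp_column not in actual_columns:
        problems.append(f"timestamp column '{timestamp_column}' is missing")
    for name in index_columns:
        if name not in actual_columns:
            problems.append(f"index column '{name}' is missing")
    for name, expected_type in features.items():
        if name not in actual_columns:
            problems.append(f"feature column '{name}' is missing")
        elif actual_columns[name] != expected_type:
            problems.append(
                f"feature column '{name}' has type {actual_columns[name]}, "
                f"expected {expected_type}"
            )
    return problems


# Schema of a dataframe, written out by hand for the demo.
actual = {
    "accountID": "string",
    "timestamp": "timestamp",
    "transaction_7d_count": "long",
}
matching = schema_problems(
    actual, "timestamp", ["accountID"], {"transaction_7d_count": "long"}
)
mismatched = schema_problems(
    actual, "ts", ["accountID"], {"transaction_7d_count": "string"}
)
```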

Can't find the transformation class

Symptom

After a user runs <feature_set_spec>.to_spark_dataframe(), it returns this error: AttributeError: module '<...>' has no attribute '<...>'

For example:

AttributeError: module '7780d27aa8364270b6b61fed2a43b749.transaction_transform' has no attribute 'TransactionFeatureTransformer1'

Solution

Define the feature transformation class in a Python file at the root of the code folder. The code folder can contain other files or subfolders.

Set the value of the feature_transformation_code.transformation_class property to <py file name of the transformation class>.<transformation class name>.

For example, if the code folder looks like this:

code/
└── my_transformation_class.py

and the my_transformation_class.py file defines the MyFeatureTransformer class, set feature_transformation_code.transformation_class to my_transformation_class.MyFeatureTransformer.
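
A quick local check can catch a misspelled class name before the spec is validated. The following sketch parses the Python file with the standard ast module, without importing it, and confirms that the class is defined at the top level. transformation_class_value is a hypothetical helper, not part of azureml-featurestore.

```python
import ast
import tempfile
from pathlib import Path


def transformation_class_value(code_folder, py_file_name, class_name):
    """Build '<py file name>.<class name>' after checking that the class exists.

    The file must sit directly under code_folder, and the class must be
    defined at the top level of that file.
    """
    source_file = Path(code_folder) / f"{py_file_name}.py"
    if not source_file.is_file():
        raise FileNotFoundError(f"{source_file} is not in the code folder root")
    tree = ast.parse(source_file.read_text(encoding="utf-8"))
    top_level_classes = {
        node.name for node in tree.body if isinstance(node, ast.ClassDef)
    }
    if class_name not in top_level_classes:
        raise AttributeError(
            f"{py_file_name}.py defines {sorted(top_level_classes)}, "
            f"not '{class_name}'"
        )
    return f"{py_file_name}.{class_name}"


# Demo with a throwaway code folder that mirrors the example layout.
with tempfile.TemporaryDirectory() as code_folder:
    Path(code_folder, "my_transformation_class.py").write_text(
        "class MyFeatureTransformer:\n    pass\n", encoding="utf-8"
    )
    value = transformation_class_value(
        code_folder, "my_transformation_class", "MyFeatureTransformer"
    )
```

Asking for a class name that the file doesn't define, such as a name with a stray trailing digit, raises AttributeError and lists the classes that the file does define.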

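FileNotFoundError on code folder

Symptom

This error can happen if the feature set spec YAML is created manually, and the SDK isn't used to generate the feature set. The command

<feature_set_spec>.to_spark_dataframe()

returns the error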

FileNotFoundError: [Errno 2] No such file or directory: ....

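Solution

Check the code folder. It should be a subfolder under the feature set spec folder. In the feature set spec, set feature_transformation_code.path as a path relative to the feature set spec folder. For example:

feature set spec folder/
├── code/
│ ├── my_transformer.py
│ └── my_other_folder
└── FeatureSetSpec.yaml

In this example, the feature_transformation_code.path property in the YAML should be ./code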
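
When the spec YAML is written by hand, the relative path can be derived instead of typed. This is a minimal sketch with pathlib; code_path_for_spec is a hypothetical helper, and the folder names in the demo are placeholders that don't need to exist on disk.

```python
from pathlib import Path


def code_path_for_spec(spec_folder, code_folder):
    """Return the value for feature_transformation_code.path, such as './code'.

    Raises ValueError when code_folder is not located under spec_folder.
    """
    spec_root = Path(spec_folder).resolve()
    code_root = Path(code_folder).resolve()
    # relative_to() raises ValueError if code_root is outside spec_root.
    relative = code_root.relative_to(spec_root)
    if relative == Path("."):
        raise ValueError("The code folder must be a subfolder of the spec folder")
    return "./" + relative.as_posix()


direct = code_path_for_spec("/specs/transactions", "/specs/transactions/code")
nested = code_path_for_spec("/specs/transactions", "/specs/transactions/src/code")
```

A ValueError from this helper is the local equivalent of the FileNotFoundError above: the code folder isn't where the spec expects it.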

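Note

When you use the create_feature_set_spec function in azureml-featurestore to create a FeatureSetSpec Python object, it can take any local folder as the feature_transformation_code.path value. When the FeatureSetSpec object is dumped to form a feature set spec YAML in a target folder, the code folder is copied into the target folder, and the feature_transformation_code.path property is updated in the spec YAML.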

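Feature set CRUD errors

Feature set GET fails because of an invalid FeatureStoreEntity

Symptom

When you use the feature store CRUD client to GET a feature set, for example fs_client.feature_sets.get(name, version), you might get this error: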


Traceback (most recent call last):
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/operations/_feature_store_entity_operations.py", line 116, in get
    return FeatureStoreEntity._from_rest_object(feature_store_entity_version_resource)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 93, in _from_rest_object
    featurestoreEntity = FeatureStoreEntity(
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/_utils/_experimental.py", line 42, in wrapped
    return func(*args, **kwargs)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/entities/_feature_store_entity/feature_store_entity.py", line 67, in __init__
    raise ValidationException(
azure.ai.ml.exceptions.ValidationException: Stage must be Development, Production, or Archived, found None

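This error can also happen in the feature store materialization job, where the job fails with the same error.

Solution

Start a notebook session with the new version of the SDKs:

If it uses azure-ai-ml, update to azure-ai-ml==1.8.0.
If it uses the feature store dataplane SDK, update it to azureml-featurestore==0.1.0b2.

In the notebook session, update the feature store entity to set its stage property, as shown in this example: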

from azure.ai.ml.entities import DataColumn, DataColumnType, FeatureStoreEntity

# Declare the entity again with its original properties, plus the stage property
account_entity_config = FeatureStoreEntity(
    name="account",
    version="1",
    index_columns=[DataColumn(name="accountID", type=DataColumnType.STRING)],
    stage="Development",
    description="This entity represents user account index key accountID.",
    tags={"data_typ": "nonPII"},
)

poller = fs_client.feature_store_entities.begin_create_or_update(account_entity_config)
print(poller.result())

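When you define the FeatureStoreEntity, set the properties to match the properties used when it was created. The only difference is the added stage property.

After the begin_create_or_update() call returns successfully, the next feature_sets.get() call and the next materialization job should succeed.

Feature retrieval job and query errors

Feature retrieval specification resolution errors
File feature_retrieval_spec.yaml not found when using a model as input to the feature retrieval job
Observation data isn't joined with any feature values
User or managed identity doesn't have proper RBAC permission on the feature store
User or managed identity doesn't have proper RBAC permission to read from the source storage or offline store
Training job fails to read data generated by the built-in feature retrieval component
generate_feature_retrieval_spec() fails because of use of a local feature set specification
get_offline_features() query takes a long time

When a feature retrieval job fails, check the error details. Go to the run detail page, select the Outputs + logs tab, and examine the logs/azureml/driver/stdout file.

If a user runs the get_offline_features() query in a notebook, the cell output shows the error directly.

Feature retrieval specification resolution errors

Symptom

The feature retrieval query or job shows these errors:

Invalid feature: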

code: "UserError"
message: "Feature '<some name>' not found in this featureset."

Invalid feature store URI:

message: "the Resource 'Microsoft.MachineLearningServices/workspaces/<name>' under resource group '<resource group name>' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix",
code: "ResourceNotFound"

Invalid feature set:

code: "UserError"
message: "Featureset with name: <name> and version: <version> not found."

Solution

Check the content of the feature_retrieval_spec.yaml file that the job uses. Verify that each of these values is valid and exists in the feature store:

Feature store URI
Feature set name and version
Feature names

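To select features from a feature store and generate the feature retrieval spec YAML file, we recommend the use of a utility function.

This code snippet uses the generate_feature_retrieval_spec utility function.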

from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

transactions_featureset = featurestore.feature_sets.get(name="transactions", version = "1")

features = [
    transactions_featureset.get_feature('transaction_amount_7d_sum'),
    transactions_featureset.get_feature('transaction_amount_3d_sum')
]

feature_retrieval_spec_folder = "./project/fraud_model/feature_retrieval_spec"
featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)

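File feature_retrieval_spec.yaml not found when using a model as input to the feature retrieval job

Symptom

When you use a registered model as a feature retrieval job input, the job fails with this error: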

ValueError: Failed with visit error: Failed with execution error: error in streaming from input data sources
    VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
    ExecutionError(StreamError(NotFound)); Not able to find path: azureml://subscriptions/{sub_id}/resourcegroups/{rg}/workspaces/{ws}/datastores/workspaceblobstore/paths/LocalUpload/{guid}/feature_retrieval_spec.yaml

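Solution

When you provide a model as input to the feature retrieval step, the job expects to find the retrieval spec YAML file under the model artifact folder. The job fails if that file is missing.

To fix the issue, package the feature_retrieval_spec.yaml file in the root folder of the model artifact folder before you register the model.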
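
The packaging step amounts to a file copy into the root of the model folder. The sketch below shows one way to do it with the standard library before the model is registered; add_retrieval_spec is a hypothetical helper, and the folders in the demo are throwaway stand-ins for a real model folder and spec file.

```python
import shutil
import tempfile
from pathlib import Path

SPEC_FILE_NAME = "feature_retrieval_spec.yaml"


def add_retrieval_spec(model_folder, spec_file):
    """Copy the retrieval spec into the root of the model artifact folder."""
    spec_file = Path(spec_file)
    if spec_file.name != SPEC_FILE_NAME:
        raise ValueError(
            f"Expected a file named {SPEC_FILE_NAME}, got {spec_file.name}"
        )
    destination = Path(model_folder) / SPEC_FILE_NAME
    shutil.copyfile(spec_file, destination)
    return destination


# Demo with temporary folders.
with tempfile.TemporaryDirectory() as workdir:
    model_folder = Path(workdir, "model")
    model_folder.mkdir()
    spec_file = Path(workdir, SPEC_FILE_NAME)
    spec_file.write_text("placeholder: true\n", encoding="utf-8")
    copied = add_retrieval_spec(model_folder, spec_file)
    spec_is_at_model_root = copied.parent == model_folder and copied.is_file()
```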

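Observation data isn't joined with any feature values

Symptom

After users run the feature retrieval query or job, the output data gets no feature values. For example, a user runs the feature retrieval job to retrieve the transaction_amount_3d_avg and transaction_amount_7d_avg features, with these results: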

transactionID	accountID	timestamp	transaction_amount_3d_avg	transaction_amount_7d_avg
83870774-7A98-43B...	A1055520444618950	2023-02-28 04:34:27	Null	Null
25144265-F68B-4FD...	A1055520444618950	2023-02-28 10:44:30	Null	Null
8899ED8C-B295-43F...	A1055520444812380	2023-03-06 00:36:30	Null	Null

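Solution

Feature retrieval does a point-in-time join query. If the join result is empty, try these potential solutions:

Either extend the temporal_join_lookback range in the feature set spec definition, or temporarily remove it. This allows the point-in-time join to look back further (or infinitely) into the past, before the observation event timestamp, to find the feature values.
If source.source_delay is also set in the feature set spec definition, make sure that temporal_join_lookback > source.source_delay.

If none of these solutions work, get the feature set from the feature store, and run <feature_set>.to_spark_dataframe() to manually inspect the feature index columns and timestamps. The failure could happen because:

The index values in the observation data don't exist in the feature set dataframe
No feature value exists with a timestamp earlier than the observation timestamp

In these cases, if the feature has offline materialization enabled, you might need to backfill more feature data.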
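
To see why a short lookback yields Null values, it helps to model the join for a single index value. The sketch below is a simplified illustration in plain Python, not the service's implementation: point_in_time_value is a hypothetical function, the boundary handling (inclusive on both ends) is an assumption, and source_delay isn't modeled. The timestamps and values in the demo are made up.

```python
from datetime import datetime, timedelta


def point_in_time_value(observation_time, feature_rows, lookback=None):
    """Return the newest feature value visible at observation_time, or None.

    feature_rows: list of (feature timestamp, value) for ONE index value.
    lookback:     optional timedelta; rows older than
                  observation_time - lookback are ignored.
    """
    candidates = [
        (stamp, value)
        for stamp, value in feature_rows
        if stamp <= observation_time
        and (lookback is None or stamp >= observation_time - lookback)
    ]
    if not candidates:
        return None  # shows up as Null in the joined output
    return max(candidates, key=lambda row: row[0])[1]


rows = [(datetime(2023, 2, 20), 12.5), (datetime(2023, 2, 25), 40.0)]
observed = datetime(2023, 2, 28, 4, 34, 27)

# One-day lookback: both feature rows are too old, so nothing is joined.
narrow = point_in_time_value(observed, rows, lookback=timedelta(days=1))
# Seven-day lookback: the row from February 25 falls inside the window.
wide = point_in_time_value(observed, rows, lookback=timedelta(days=7))
# No lookback limit: the newest row at or before the observation is used.
unlimited = point_in_time_value(observed, rows)
# Observation older than every feature row: there is nothing to join.
too_early = point_in_time_value(datetime(2023, 2, 1), rows)
```

The narrow and too_early cases correspond to the two causes listed above: a lookback window that is too short, and no feature value earlier than the observation timestamp.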

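User or managed identity doesn't have proper RBAC permission on the feature store

Symptom

The feature retrieval job or query fails with this error message in the logs/azureml/driver/stdout file: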

Traceback (most recent call last):
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/_restclient/v2022_12_01_preview/operations/_workspaces_operations.py", line 633, in get
    raise HttpResponseError(response=response, model=error, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client 'XXXX' with object id 'XXXX' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/XXXX/resourceGroups/XXXX/providers/Microsoft.MachineLearningServices/workspaces/XXXX' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed

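Solution

If the feature retrieval job uses a managed identity, assign the AzureML Data Scientist role on the feature store to the identity.

If the problem happens when

the user runs code in an Azure Machine Learning Spark notebook
that notebook uses the user's own identity to access the Azure Machine Learning service

assign the AzureML Data Scientist role on the feature store to the user's Microsoft Entra identity.

Azure Machine Learning Data Scientist is the recommended role. Users can create their own custom role with these actions: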

Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read

For more information about RBAC setup, visit the Manage access to managed feature store resource.
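
If a custom role is preferred, the three actions can be collected into a role definition document. The sketch below builds one as a Python dictionary and serializes it to JSON. The property names follow the Azure custom role JSON format; the role name, the description, and the scope placeholder are assumptions made for illustration.

```python
import json

# The three actions listed above.
FEATURE_STORE_READ_ACTIONS = [
    "Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action",
    "Microsoft.MachineLearningServices/workspaces/featuresets/read",
    "Microsoft.MachineLearningServices/workspaces/read",
]


def custom_role_definition(role_name, assignable_scope):
    """Build a custom role definition that grants only the listed actions."""
    return {
        "Name": role_name,
        "IsCustom": True,
        "Description": "Read access to a managed feature store.",
        "Actions": list(FEATURE_STORE_READ_ACTIONS),
        "NotActions": [],
        "AssignableScopes": [assignable_scope],
    }


# Hypothetical role name and placeholder scope; replace both with real values.
role = custom_role_definition(
    "Feature Store Reader (custom)",
    "/subscriptions/{sub-id}/resourceGroups/{rg}",
)
role_json = json.dumps(role, indent=2)
```

The JSON text could be saved to a file and handed to whichever tool you use to create role definitions, for example az role definition create with its --role-definition parameter.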

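User or managed identity doesn't have proper RBAC permission to read from the source storage or offline store

Symptom

The feature retrieval job or query fails with this error message in the logs/azureml/driver/stdout file: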

An error occurred while calling o1025.parquet.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://{storage}.dfs.core.chinacloudapi.cn/test?upn=false&resource=filesystem&maxResults=5000&directory=datasources&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:63013315-e01f-005e-577b-7c63b8000000 Time:2023-05-01T22:20:51.1064935Z"
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1203)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:408)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)

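Solution

If the feature retrieval job uses a managed identity, assign the Storage Blob Data Reader role on the source storage and the offline store storage to the identity.
This error can also occur when a notebook uses the user's identity to access the Azure Machine Learning service to run the query. To resolve it, assign the Storage Blob Data Reader role on the source storage and the offline store storage account to the user's identity.

Storage Blob Data Reader is the minimum recommended access requirement. Users can also assign roles with more privileges, for example Storage Blob Data Contributor or Storage Blob Data Owner.

Training job fails to read data generated by the built-in feature retrieval component

Symptom

A training job fails with an error message stating that the training data doesn't exist, that its format is incorrect, or that a parser error occurred. When the data can't be found:

FileNotFoundError: [Errno 2] No such file or directory

When the format isn't correct:

ParserError: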

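Solution

The built-in feature retrieval component has one output, output_data. The output data is a uri_folder data asset. It always has this folder structure:

<training data folder>/
├── data/
│ ├── xxxxx.parquet
│ └── xxxxx.parquet
└── feature_retrieval_spec.yaml

The output data is always in Parquet format. Update the training script to read from the "data" subfolder, and read the data as Parquet.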
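
A training script can guard against both failure modes by locating the Parquet files explicitly before it reads them. This is a sketch under stated assumptions: parquet_files and load_training_data are hypothetical helpers, and load_training_data assumes that pandas and a Parquet engine such as pyarrow are installed in the training environment.

```python
import tempfile
from pathlib import Path


def parquet_files(training_data_folder):
    """List the Parquet files under the 'data' subfolder of the output folder."""
    data_folder = Path(training_data_folder) / "data"
    if not data_folder.is_dir():
        raise FileNotFoundError(
            f"Expected a 'data' subfolder in {training_data_folder}"
        )
    return sorted(data_folder.glob("*.parquet"))


def load_training_data(training_data_folder):
    """Read every Parquet file into one DataFrame (requires pandas)."""
    import pandas as pd  # imported here so parquet_files() works without pandas

    frames = [pd.read_parquet(path) for path in parquet_files(training_data_folder)]
    return pd.concat(frames, ignore_index=True)


# Demo of the file lookup on an empty stand-in for the component output.
with tempfile.TemporaryDirectory() as output_folder:
    Path(output_folder, "data").mkdir()
    Path(output_folder, "data", "part-0.parquet").touch()
    Path(output_folder, "feature_retrieval_spec.yaml").touch()
    found = [path.name for path in parquet_files(output_folder)]
```

Reading the folder root as CSV, or pointing the reader at the folder that also holds feature_retrieval_spec.yaml, is what typically produces the FileNotFoundError or ParserError shown above.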

generate_feature_retrieval_spec() fails because of use of a local feature set specification

Symptom

This Python code generates a feature retrieval spec on a given list of features:

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)

If the features list contains features defined by a local feature set specification, generate_feature_retrieval_spec() fails with this error message:

AttributeError: 'FeatureSetSpec' object has no attribute 'id'

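Solution

A feature retrieval spec can only be generated from feature sets registered in the feature store. To fix the problem:

Register the local feature set specification as a feature set in the feature store
Get the registered feature set
Create the features list again, using only features from registered feature sets
Generate the feature retrieval spec with the new features list

get_offline_features() query takes a long time

Symptom

Running get_offline_features to generate training data, using a few features from the feature store, takes too long to finish.

Solution

Check these configurations:

Verify that each feature set used in the query has temporal_join_lookback set in the feature set specification. Set its value to a smaller value.
If the size and timestamp window of the observation dataframe are large, configure the notebook session (or the job) to increase the size (memory and cores) of the driver and executors, and increase the number of executors.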

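Feature materialization job errors

Invalid offline store configuration
Materialization identity doesn't have proper RBAC permission on the feature store
Materialization identity doesn't have proper RBAC permission to read from the storage
Materialization identity doesn't have proper RBAC permission to write data to the offline store
Streaming job output to a notebook results in failure
Invalid Spark configuration

When the feature materialization job fails, follow these steps to check the job failure details:

Navigate to the feature store page: https://studio.ml.azure.cn/featureStore/{your-feature-store-name}.
Go to the feature set tab, select the relevant feature set, and navigate to the feature set detail page.
From the feature set detail page, select the Materialization jobs tab, then select the failed job to open it in the job details view.
On the job details view, under the Properties card, review the job status and error message.
Additionally, you can go to the Outputs + logs tab, and then review the stdout file at logs/azureml/driver/stdout.

After a fix is applied, you can manually trigger a backfill materialization job to verify that the fix works.

Invalid offline store configuration

Symptom

The materialization job fails with this error message in the logs/azureml/driver/stdout file: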

Caused by: Status code: -1 error code: null error message: InvalidAbfsRestOperationExceptionjava.net.UnknownHostException: adlgen23.dfs.core.chinacloudapi.cn

java.util.concurrent.ExecutionException: Operation failed: "The specified resource name contains invalid characters.", 400, HEAD, https://{storage}.dfs.core.chinacloudapi.cn/{container-name}/{fs-id}/transactions/1/_delta_log?upn=false&action=getStatus&timeout=90

Solution

Use the SDK to check the offline storage target defined in the feature store:


from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

fs_client = MLClient(AzureMLOnBehalfOfCredential(), featurestore_subscription_id, featurestore_resource_group_name, featurestore_name)

featurestore = fs_client.feature_stores.get(name=featurestore_name)
featurestore.offline_store.target

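You can also check the offline storage target on the feature store UI overview page. Verify that both the storage and the container exist, and that the target has this format: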

/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage}/blobServices/default/containers/{container-name}
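
The format check can be expressed as a regular expression. The sketch below only validates the shape of the string; it doesn't confirm that the storage account or container exists. parse_offline_store_target is a hypothetical helper, and the match is deliberately strict about letter case so that it mirrors the format shown above.

```python
import re

OFFLINE_STORE_TARGET = re.compile(
    r"^/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.Storage/storageAccounts/(?P<storage>[^/]+)"
    r"/blobServices/default/containers/(?P<container>[^/]+)$"
)


def parse_offline_store_target(target):
    """Return the parts of a well-formed target, or None if the format is wrong."""
    match = OFFLINE_STORE_TARGET.match(target)
    return match.groupdict() if match else None


# Made-up values, used only to exercise the pattern.
well_formed = parse_offline_store_target(
    "/subscriptions/sub1/resourceGroups/rg1/providers/Microsoft.Storage"
    "/storageAccounts/store1/blobServices/default/containers/offline"
)
# A storage URL is not an ARM ID, so it doesn't match.
url_instead_of_arm_id = parse_offline_store_target(
    "https://store1.dfs.core.windows.net/offline"
)
```

In a notebook, the value of featurestore.offline_store.target from the snippet above could be passed to this helper.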

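Materialization identity doesn't have proper RBAC permission on the feature store

Symptom

The materialization job fails with this error message in the logs/azureml/driver/stdout file: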

Traceback (most recent call last):
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/azure/ai/ml/_restclient/v2022_12_01_preview/operations/_workspaces_operations.py", line 633, in get
    raise HttpResponseError(response=response, model=error, error_format=ARMErrorFormat)
azure.core.exceptions.HttpResponseError: (AuthorizationFailed) The client 'XXXX' with object id 'XXXX' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/XXXX/resourceGroups/XXXX/providers/Microsoft.MachineLearningServices/workspaces/XXXX' or the scope is invalid. If access was recently granted, please refresh your credentials.
Code: AuthorizationFailed

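Solution

Assign the Azure Machine Learning Data Scientist role on the feature store to the materialization identity (a user-assigned managed identity) of the feature store.

Azure Machine Learning Data Scientist is the recommended role. You can create your own custom role with these actions: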

Microsoft.MachineLearningServices/workspaces/datastores/listsecrets/action
Microsoft.MachineLearningServices/workspaces/featuresets/read
Microsoft.MachineLearningServices/workspaces/read

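For more information, visit the Permissions required for the feature store materialization managed identity role resource.

Materialization identity doesn't have proper RBAC permission to read from the storage

Symptom

The materialization job fails with this error message in the logs/azureml/driver/stdout file: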

An error occurred while calling o1025.parquet.
: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://{storage}.dfs.core.chinacloudapi.cn/test?upn=false&resource=filesystem&maxResults=5000&directory=datasources&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:63013315-e01f-005e-577b-7c63b8000000 Time:2023-05-01T22:20:51.1064935Z"
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1203)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:408)
    at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:291)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:202)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2124)

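Solution

Assign the Storage Blob Data Reader role on the source storage to the materialization identity (a user-assigned managed identity) of the feature store.

Storage Blob Data Reader is the minimum recommended access requirement. You can also assign roles with more privileges, for example Storage Blob Data Contributor or Storage Blob Data Owner.

For more information about RBAC configuration, visit the Permissions required for the feature store materialization managed identity role resource.

Materialization identity doesn't have proper RBAC permission to write data to the offline store

Symptom

The materialization job fails with this error message in the logs/azureml/driver/stdout file: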

An error occurred while calling o1162.load.
: java.util.concurrent.ExecutionException: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://featuresotrestorage1.dfs.core.chinacloudapi.cn/offlinestore/fs_xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx_fsname/transactions/1/_delta_log?upn=false&action=getStatus&timeout=90
    at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
    at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
    at com.google.common.cache.LocalCache$S

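Solution

Assign the Storage Blob Data Contributor role on the offline store storage to the materialization identity (a user-assigned managed identity) of the feature store.

Storage Blob Data Contributor is the minimum recommended access requirement. You can also assign roles with more privileges, for example Storage Blob Data Owner.

For more information about RBAC configuration, visit the Permissions required for the feature store materialization managed identity role resource.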

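Streaming job output to a notebook results in failure

Symptom

When you use the feature store CRUD client to stream materialization job results to a notebook, with fs_client.jobs.stream("<job_id>"), the SDK call fails with this error:

HttpResponseError: (UserError) A job was found, but it is not supported in this API version and cannot be accessed.
Code: UserError
Message: A job was found, but it is not supported in this API version and cannot be accessed.

Solution

When the materialization job is created (for example, by a backfill call), it might take a few seconds for the job to properly initialize. Run the jobs.stream() command again a few seconds later. This should resolve the issue.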

Invalid Spark configuration

Symptom

The materialization job fails with this error message:

Synapse job submission failed due to invalid spark configuration request

{
  "Message":"[..] Either the cores or memory of the driver, executors exceeded the SparkPool Node Size.\nRequested Driver Cores:[4]\nRequested Driver Memory:[36g]\nRequested Executor Cores:[4]\nRequested Executor Memory:[36g]\nSpark Pool Node Size:[small]\nSpark Pool Node Memory:[28]\nSpark Pool Node Cores:[4]"
}

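Solution

Update the materialization_settings.spark_configuration{} of the feature set. Make sure that these parameters use memory sizes and total core counts that are both less than what the instance type, as defined by materialization_settings.resource, provides:

spark.driver.cores
spark.driver.memory
spark.executor.cores
spark.executor.memory

For example, for instance type standard_e8s_v3, this Spark configuration is one of the valid options: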


from azure.ai.ml.entities import MaterializationComputeResource, MaterializationSettings

transactions_fset_config.materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.cores": 4,
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.memory": "36g",
        "spark.executor.instances": 2,
    },
    schedule=None,
)

# Submit the update; the call returns a poller for the long-running operation
fs_poller = fs_client.feature_sets.begin_create_or_update(transactions_fset_config)
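
Before an update is submitted, the requested driver and executor sizes can be compared with the node size locally. This is a simplified sketch, not the service's validation: spark_config_problems and parse_memory_gb are hypothetical helpers, the node sizes are passed in by the caller, and a request is flagged only when it is larger than the node. The small node values (4 cores, 28 GB) come from the error message above; the 8 cores and 64 GB used for standard_e8s_v3 are an assumption based on the published size of that VM type.

```python
def parse_memory_gb(value):
    """Convert a Spark memory string such as '36g' to a number of gigabytes."""
    text = str(value).strip().lower()
    if not text.endswith("g"):
        raise ValueError(f"This sketch only understands values like '36g', got {value!r}")
    return int(text[:-1])


def spark_config_problems(spark_configuration, node_cores, node_memory_gb):
    """List the driver and executor settings that exceed the node size."""
    problems = []
    for role in ("driver", "executor"):
        cores = int(spark_configuration[f"spark.{role}.cores"])
        memory = parse_memory_gb(spark_configuration[f"spark.{role}.memory"])
        if cores > node_cores:
            problems.append(f"{role} cores {cores} exceed node cores {node_cores}")
        if memory > node_memory_gb:
            problems.append(
                f"{role} memory {memory}g exceeds node memory {node_memory_gb}g"
            )
    return problems


requested = {
    "spark.driver.cores": 4,
    "spark.driver.memory": "36g",
    "spark.executor.cores": 4,
    "spark.executor.memory": "36g",
}

# Node size reported in the error message: 4 cores, 28 GB of memory.
small_node = spark_config_problems(requested, node_cores=4, node_memory_gb=28)
# Assumed size of standard_e8s_v3: 8 cores, 64 GB of memory.
larger_node = spark_config_problems(requested, node_cores=8, node_memory_gb=64)
```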
