Indexing documents in Azure Data Lake Storage Gen2

Important

Azure Data Lake Storage Gen2 support is currently in public preview. Preview functionality is provided without a service level agreement and is not recommended for production workloads. You can request access to the preview by filling out this form. This feature is available through the REST API version 2020-06-30-Preview and the portal. There is currently no .NET SDK support.

When setting up an Azure storage account, you have the option to enable hierarchical namespace. This allows the content in an account to be organized into a hierarchy of directories and nested subdirectories. Enabling hierarchical namespace is what enables Azure Data Lake Storage Gen2.

This article describes how to get started with indexing documents that are in Azure Data Lake Storage Gen2.

Set up the Azure Data Lake Storage Gen2 indexer

You need to complete a few steps to index content from Data Lake Storage Gen2.

Step 1: Sign up for the preview

Sign up for the Data Lake Storage Gen2 indexer preview by filling out this form. You will receive a confirmation email once you have been accepted into the preview.

Step 2: Follow the Azure Blob storage indexing setup steps

Once you've received confirmation that your preview sign-up was successful, you're ready to create the indexing pipeline.

You can index content and metadata from Data Lake Storage Gen2 by using the REST API version 2020-06-30-Preview or the portal. There is no .NET SDK support at this time.

Indexing content in Data Lake Storage Gen2 is identical to indexing content in Azure Blob storage. To understand how to set up the Data Lake Storage Gen2 data source, index, and indexer, refer to How to index documents in Azure Blob Storage with Azure Cognitive Search. The Blob storage article also covers which document formats are supported, which blob metadata properties are extracted, incremental indexing, and more. That information applies equally to Data Lake Storage Gen2.
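As a rough illustration of the data source, index, and indexer trio, the following Python sketch builds the three REST request bodies without sending them. It follows the general shape used for Blob storage indexing. The service name, resource names, and field names are hypothetical, and the data source `type` value `adlsgen2` is an assumption to verify against the preview documentation. Each request would be sent with an `api-key` header and a JSON content type.

```python
import json

API_VERSION = "2020-06-30-Preview"  # preview version named in this article


def build_request(service_name, resource, body):
    """Return (method, url, json_body) for a create request; nothing is sent."""
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{resource}?api-version={API_VERSION}"
    )
    return "POST", url, json.dumps(body)


def data_source_body(name, connection_string, container, source_type="adlsgen2"):
    # source_type is an assumption; confirm the value the preview expects.
    return {
        "name": name,
        "type": source_type,
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def index_body(name):
    # Illustrative fields only: a key plus the extracted document content.
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
            {"name": "content", "type": "Edm.String", "searchable": True},
        ],
    }


def indexer_body(name, data_source_name, index_name):
    # The indexer ties the data source to the target index.
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
    }
```

The three bodies would be posted to the `datasources`, `indexes`, and `indexers` collections in that order, since the indexer refers to the other two by name.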

Access control

Azure Data Lake Storage Gen2 implements an access control model that supports both Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs). When indexing content from Data Lake Storage Gen2, Azure Cognitive Search does not extract the RBAC and ACL information from the content. As a result, this information is not included in your Azure Cognitive Search index.

If maintaining access control on each document in the index is important, it is up to the application developer to implement security trimming.
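One common way to approach security trimming is to store the identifiers of the groups allowed to read each document in a collection field, and to attach a filter to every query. The sketch below builds such an OData filter string. The field name `group_ids` and the helper name are hypothetical, and the application is assumed to populate that field itself, since the indexer does not extract ACLs.

```python
def build_security_filter(group_ids, field="group_ids"):
    """Build an OData filter matching documents readable by any given group.

    Assumes the index has a Collection(Edm.String) field (hypothetically
    named `group_ids`) that the application fills in.
    """
    ids = list(group_ids)
    if not ids:
        raise ValueError("at least one group id is required")
    for gid in ids:
        if "|" in gid:
            raise ValueError("group ids must not contain the '|' delimiter")
    # Single quotes inside an OData string literal are escaped by doubling.
    joined = "|".join(ids).replace("'", "''")
    return f"{field}/any(g: search.in(g, '{joined}', '|'))"
```

For example, `build_security_filter(["sales", "hr"])` produces a filter that would be passed as the `filter` parameter of a search request, so that only documents tagged with one of the caller's groups are returned.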

Change detection

The Data Lake Storage Gen2 indexer supports change detection. This means that when the indexer runs, it reindexes only the blobs that have changed, as determined by each blob's LastModified timestamp.
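The idea behind timestamp-based change detection can be modeled in a few lines. This is a simplified illustration, not the indexer's actual implementation: the function name and the dictionary of blob paths to timestamps are hypothetical.

```python
from datetime import datetime, timezone


def blobs_to_reindex(blobs, last_run):
    """Return the paths whose LastModified is later than the previous run.

    `blobs` maps a blob path to its LastModified timestamp.
    """
    return sorted(path for path, modified in blobs.items() if modified > last_run)


# Example data: only the blob modified after the last run is selected.
example_blobs = {
    "docs/a.pdf": datetime(2020, 1, 1, tzinfo=timezone.utc),
    "docs/b.pdf": datetime(2020, 3, 1, tzinfo=timezone.utc),
}
example_last_run = datetime(2020, 2, 1, tzinfo=timezone.utc)
changed = blobs_to_reindex(example_blobs, example_last_run)
```

In this model, a blob whose path changes but whose timestamp stays the same is never selected, which is why operations that leave LastModified untouched go unnoticed.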

Note

Data Lake Storage Gen2 allows directories to be renamed. When a directory is renamed, the timestamps for the blobs in that directory are not updated. As a result, the indexer will not reindex those blobs. If you need the blobs in a directory to be reindexed after a rename because they now have new URLs, you must update the LastModified timestamp for all the blobs in the directory so that the indexer knows to reindex them during a future run.
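One possible way to refresh LastModified is to rewrite each blob's existing metadata, because a metadata write counts as a modification of the blob. The sketch below assumes a `ContainerClient`-like object from the `azure-storage-blob` Python package and uses its `list_blobs`, `get_blob_client`, `get_blob_properties`, and `set_blob_metadata` calls. The helper name is hypothetical, and skipping entries marked with `hdi_isfolder` metadata is an assumption about how directory placeholders appear in the listing. Test it on a non-production container first.

```python
def touch_blobs_under(container_client, directory):
    """Rewrite each blob's own metadata so that its LastModified is refreshed.

    `container_client` is assumed to behave like
    azure.storage.blob.ContainerClient. Returns the touched blob names.
    """
    prefix = directory.rstrip("/") + "/"
    touched = []
    for item in container_client.list_blobs(name_starts_with=prefix):
        blob_client = container_client.get_blob_client(item.name)
        metadata = blob_client.get_blob_properties().metadata or {}
        # Assumption: directory placeholders carry hdi_isfolder=true; skip them.
        if metadata.get("hdi_isfolder") == "true":
            continue
        # Writing the unchanged metadata back still counts as a modification.
        blob_client.set_blob_metadata(metadata)
        touched.append(item.name)
    return touched
```

A real client could be created with `ContainerClient.from_connection_string(...)` and passed in together with the new directory name. Taking the client as a parameter keeps the helper easy to exercise against a stand-in object before running it against real storage.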