Azure Data Lake Storage Gen2 hierarchical namespace

A key mechanism that allows Azure Data Lake Storage Gen2 to provide file system performance at object storage scale and prices is the addition of a hierarchical namespace. The namespace lets the objects and files within an account be organized into a hierarchy of directories and nested subdirectories, in the same way that the file system on your computer is organized. With a hierarchical namespace enabled, a storage account provides the scalability and cost-effectiveness of object storage, with file system semantics that are familiar to analytics engines and frameworks.

The benefits of a hierarchical namespace

The following benefits are associated with file systems that implement a hierarchical namespace over blob data:

  • Atomic directory manipulation: Object stores approximate a directory hierarchy by adopting a convention of embedding slashes (/) in the object name to denote path segments. While this convention works for organizing objects, it provides no assistance for actions like moving, renaming, or deleting directories. Without real directories, applications must process potentially millions of individual blobs to achieve directory-level tasks. By contrast, a hierarchical namespace processes these tasks by updating a single entry (the parent directory).

    This dramatic optimization is especially significant for many big data analytics frameworks. Tools like Hive and Spark often write output to a temporary location and then rename that location at the conclusion of the job. Without a hierarchical namespace, this rename can often take longer than the analytics process itself. Lower job latency equals lower total cost of ownership (TCO) for analytics workloads.

  • Familiar interface style: File systems are well understood by developers and users alike. There is no need to learn a new storage paradigm when you move to the cloud, because the file system interface exposed by Data Lake Storage Gen2 follows the same paradigm used by computers large and small.
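
The difference described under atomic directory manipulation can be sketched with a toy model. The Python below does not use the Azure SDK and does not reflect how the service is implemented; the function names, the `job/_temporary/` layout, and the file count of 10,000 are all assumptions chosen for illustration. It only counts how many entries a directory rename has to touch under each naming scheme.

```python
# Toy model only: not the Azure SDK, and not the service's actual implementation.
# All names and the 10,000-file layout are hypothetical.

def rename_flat(store: dict, old_prefix: str, new_prefix: str) -> int:
    """Rename a 'directory' in a flat namespace, where a directory is just a
    shared name prefix. Every matching object must be rewritten under a new key.
    Returns the number of objects touched."""
    touched = 0
    for key in [k for k in store if k.startswith(old_prefix)]:
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)
        touched += 1
    return touched


def rename_hierarchical(parent: dict, old_name: str, new_name: str) -> int:
    """Rename a directory in a hierarchical namespace by relinking a single
    entry in the parent directory. Returns the number of entries touched."""
    parent[new_name] = parent.pop(old_name)
    return 1


# A job that wrote 10,000 output files to a temporary location.
flat = {f"job/_temporary/part-{i:05d}": b"" for i in range(10_000)}
flat_ops = rename_flat(flat, "job/_temporary/", "job/output/")

tree = {"job": {"_temporary": {f"part-{i:05d}": b"" for i in range(10_000)}}}
tree_ops = rename_hierarchical(tree["job"], "_temporary", "output")

print(flat_ops, tree_ops)  # prints: 10000 1
```

In the flat model the cost of the rename grows with the number of objects under the prefix, while in the hierarchical model it stays at one parent-entry update regardless of how many files the directory holds.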

One of the reasons that object stores haven't historically supported a hierarchical namespace is that a hierarchical namespace limits scale. However, the Data Lake Storage Gen2 hierarchical namespace scales linearly and does not degrade either data capacity or performance.

Deciding whether to enable a hierarchical namespace

After you've enabled a hierarchical namespace on your account, you can't revert it to a flat namespace. Therefore, consider whether it makes sense to enable a hierarchical namespace based on the nature of your object store workloads.

Some workloads might not gain any benefit from enabling a hierarchical namespace. Examples include backups, image storage, and other applications where object organization is stored separately from the objects themselves (for example, in a separate database).

Also, while support for Blob storage features and the Azure service ecosystem continues to grow, some features and Azure services are not yet supported in accounts that have a hierarchical namespace. See Known Issues.

In general, we recommend that you turn on a hierarchical namespace for storage workloads that are designed for file systems that manipulate directories. This includes all workloads that are primarily for analytics processing. Datasets that require a high degree of organization will also benefit from a hierarchical namespace.

The case for enabling a hierarchical namespace is determined by a TCO analysis. Generally speaking, when storage acceleration improves workload latency, compute resources are needed for less time. Latency for many workloads may improve because of the atomic directory manipulation that a hierarchical namespace enables. In many workloads, compute represents more than 85% of the total cost, so even a modest reduction in workload latency equates to significant TCO savings. Even in cases where enabling a hierarchical namespace increases storage costs, the TCO is still lowered due to reduced compute costs.
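
The arithmetic behind this argument can be worked through with a small sketch. The function name and every input except the 85% compute share are hypothetical: the 10% latency reduction and the 20% storage cost increase are made-up figures for illustration, not measured or published values, and the model assumes compute cost scales directly with job runtime.

```python
# Illustrative arithmetic only. The 10% and 20% figures are hypothetical inputs;
# only the 85% compute share comes from the text above.

def relative_tco(compute_share: float,
                 latency_reduction: float,
                 storage_cost_increase: float) -> float:
    """Return the new TCO as a fraction of the baseline TCO.

    compute_share:         fraction of baseline cost spent on compute (0..1)
    latency_reduction:     fractional drop in job runtime, assumed to cut
                           compute cost by the same fraction
    storage_cost_increase: fractional rise in the remaining (storage) cost
    """
    storage_share = 1.0 - compute_share
    new_compute = compute_share * (1.0 - latency_reduction)
    new_storage = storage_share * (1.0 + storage_cost_increase)
    return new_compute + new_storage


# 85% compute, jobs finish 10% sooner, storage costs 20% more.
ratio = relative_tco(compute_share=0.85,
                     latency_reduction=0.10,
                     storage_cost_increase=0.20)
print(f"New TCO is {ratio:.1%} of baseline")
```

With these assumed inputs the compute saving (8.5 points of baseline cost) outweighs the storage increase (3 points), so the new TCO comes to about 94.5% of the baseline. Substituting your own shares and rates shows where the break-even point lies for a given workload.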

To analyze differences in data storage prices, transaction prices, and storage capacity reservation pricing between accounts that have a flat namespace and accounts that have a hierarchical namespace, see Azure Data Lake Storage Gen2 pricing.

Next steps