Introduction to Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage.

Designed for enterprise big data analytics

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects and files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory, rather than enumerating and processing all objects that share the name prefix of the directory.
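The difference between the two rename paths can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names `rename_prefix_flat`, `rename_dir_hierarchical`, and `Directory` are hypothetical, nothing here calls Azure Storage, and the model counts the entries touched rather than demonstrating atomicity.

```python
from typing import Dict, Union


def rename_prefix_flat(store: Dict[str, bytes], old: str, new: str) -> int:
    """Flat namespace: the 'directory' is only a name prefix, so every
    object whose key starts with that prefix has to be rewritten."""
    touched = 0
    # Snapshot the matching keys first so the dict is not mutated while iterating.
    for key in [k for k in store if k.startswith(old + "/")]:
        store[new + key[len(old):]] = store.pop(key)
        touched += 1
    return touched


class Directory:
    """Hypothetical directory node: maps a child name to a file or subdirectory."""

    def __init__(self) -> None:
        self.children: Dict[str, Union["Directory", bytes]] = {}


def rename_dir_hierarchical(parent: Directory, old: str, new: str) -> int:
    """Hierarchical namespace: the directory is a real entry, so a rename
    updates one entry in the parent, whatever the directory contains."""
    parent.children[new] = parent.children.pop(old)
    return 1


# Illustrative data only.
flat_store = {
    "logs/2019/a.csv": b"1",
    "logs/2019/b.csv": b"2",
    "logs/2020/c.csv": b"3",
    "other/d.csv": b"4",
}
flat_touched = rename_prefix_flat(flat_store, "logs", "archive")

root = Directory()
logs = Directory()
logs.children["a.csv"] = b"1"
logs.children["b.csv"] = b"2"
root.children["logs"] = logs
tree_touched = rename_dir_hierarchical(root, "logs", "archive")
```

In this toy model the flat rename touches every key under the prefix, while the hierarchical rename updates a single entry in the parent directory regardless of how many files sit beneath it.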

Data Lake Storage Gen2 builds on Blob storage and enhances performance, management, and security in the following ways:

  • Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

  • Management is easier because you can organize and manipulate files through directories and subdirectories.

  • Security is enforceable because you can define POSIX permissions on directories or individual files.

Data Lake Storage Gen2 is also very cost effective because it is built on top of low-cost Azure Blob storage. The additional features further lower the total cost of ownership for running big data analytics on Azure.
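The POSIX-style permissions mentioned above can be pictured with a simplified access check. The sketch below assumes a short-form ACL string such as `user::rwx,group::r-x,other::---`. The `AclEntry` type and the `parse_acl` and `is_allowed` helpers are hypothetical, and the model deliberately ignores masks, default ACLs, and anything specific to Data Lake Storage Gen2.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class AclEntry:
    scope: str      # "user", "group", or "other"
    qualifier: str  # "" means owning user, owning group, or other
    perms: str      # three characters, for example "r-x"


def parse_acl(text: str) -> List[AclEntry]:
    """Parse a comma-separated list of scope:qualifier:perms entries."""
    entries = []
    for item in text.split(","):
        scope, qualifier, perms = item.strip().split(":")
        entries.append(AclEntry(scope, qualifier, perms))
    return entries


def is_allowed(acl: List[AclEntry], user: str, groups: Set[str],
               owner: str, owning_group: str, wanted: str) -> bool:
    """Simplified POSIX-style check; wanted is one of 'r', 'w', 'x'."""
    def grants(entry: AclEntry) -> bool:
        return wanted in entry.perms

    # 1. The owning user is judged only by the owning-user entry.
    if user == owner:
        return any(grants(e) for e in acl
                   if e.scope == "user" and e.qualifier == "")
    # 2. A named-user entry, if present, decides the outcome.
    for e in acl:
        if e.scope == "user" and e.qualifier == user:
            return grants(e)
    # 3. Owning-group and named-group entries that match the caller's groups.
    group_entries = [
        e for e in acl
        if e.scope == "group"
        and ((e.qualifier == "" and owning_group in groups)
             or e.qualifier in groups)
    ]
    if group_entries:
        return any(grants(e) for e in group_entries)
    # 4. Everyone else falls through to "other".
    return any(grants(e) for e in acl if e.scope == "other")


# Illustrative ACL: the principals "alice", "bob", and "eng" are made up.
acl = parse_acl("user::rwx,user:alice:r--,group::r-x,other::---")
```

With this ACL, the owner `bob` gets full access, `alice` can only read, members of the owning group `eng` can read and traverse, and everyone else is denied.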

Key features of Data Lake Storage Gen2

  • Hadoop compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments, including Azure HDInsight and SQL Data Warehouse, to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports ACL and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.

  • Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. As data moves through its complete lifecycle, billing rates change, and built-in features such as Azure Blob storage lifecycle keep costs to a minimum.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint


Azure Storage is scalable by design, whether you access it through the Data Lake Storage Gen2 or the Blob storage interfaces. It is able to store and serve many exabytes of data. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Beyond just persistence, processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

Cost effectiveness

One of the many benefits of building Data Lake Storage Gen2 on top of Azure Blob storage is the low cost of storage capacity and transactions. Unlike other cloud storage services, data stored in Data Lake Storage Gen2 does not need to be moved or transformed before you perform analysis. For more information about pricing, see Azure Storage pricing.

Additionally, features such as the hierarchical namespace significantly improve the overall performance of many analytics jobs. This improvement in performance means that you require less compute power to process the same amount of data, resulting in a lower total cost of ownership (TCO) for the end-to-end analytics job.

One service, multiple concepts

Data Lake Storage Gen2 is an additional capability for big data analytics, built on top of Azure Blob storage. While there are many benefits in leveraging existing platform components of Blobs to create and operate data lakes for analytics, it does lead to multiple concepts describing the same, shared things.

The following are the equivalent entities, as described by different concepts. Unless specified otherwise, these entities are directly synonymous:

Concept | Top Level Organization | Lower Level Organization | Data Container
Blobs - General purpose object storage | Container | Virtual directory (SDK only - does not provide atomic manipulation) | Blob
Azure Data Lake Storage Gen2 - Analytics Storage | Container | Directory | File

Supported Blob storage features

Blob storage features such as diagnostic logging, access tiers, and Blob storage lifecycle management policies now work with accounts that have a hierarchical namespace. Therefore, you can enable hierarchical namespaces on your Blob storage accounts without losing access to these features.

For a list of supported Blob storage features, see Blob Storage features available in Azure Data Lake Storage Gen2.

Supported Azure service integrations

Data Lake Storage Gen2 supports several Azure services that you can use to ingest data, perform analytics, and create visual representations. For a list of supported Azure services, see Azure services that support Azure Data Lake Storage Gen2.

Supported open source platforms

Several open source platforms support Data Lake Storage Gen2. For a complete list, see Open source platforms that support Azure Data Lake Storage Gen2.

See also