Introduction to Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage.

Designed for enterprise big data analytics

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to serve multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 lets you manage massive amounts of data with ease.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects and files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. With Data Lake Storage Gen2, this structure becomes real. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory, rather than requiring the service to enumerate and process every object that shares the directory's name prefix.
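The difference between a prefix-based rename and a directory-level rename can be illustrated with a toy in-memory model. This is only a sketch: the classes `FlatStore` and `HierarchicalStore` are hypothetical, they do not use the Azure SDK, and they do not model atomicity or how the service is implemented. They only count how many entries each approach has to touch.

```python
class FlatStore:
    """Flat object store: 'directories' are only a naming convention."""

    def __init__(self):
        self.objects = {}  # full object name -> data

    def rename_prefix(self, old, new):
        """Rename a 'directory' by rewriting every object under the prefix."""
        touched = 0
        for name in [n for n in self.objects if n.startswith(old + "/")]:
            self.objects[new + name[len(old):]] = self.objects.pop(name)
            touched += 1
        return touched  # one operation per object


class HierarchicalStore:
    """Hierarchical namespace: directories are real nodes (nested dicts)."""

    def __init__(self):
        self.root = {}

    def _parent(self, path):
        """Return (parent directory node, final path component)."""
        parts = path.strip("/").split("/")
        node = self.root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        return node, parts[-1]

    def put(self, path, data):
        parent, name = self._parent(path)
        parent[name] = data

    def rename_dir(self, old, new):
        """Rename a directory by re-linking a single node."""
        old_parent, old_name = self._parent(old)
        new_parent, new_name = self._parent(new)
        new_parent[new_name] = old_parent.pop(old_name)
        return 1  # one metadata operation, regardless of directory size


flat = FlatStore()
tree = HierarchicalStore()
for i in range(1000):
    flat.objects[f"logs/2019/part-{i}"] = b""
    tree.put(f"logs/2019/part-{i}", b"")

flat_ops = flat.rename_prefix("logs/2019", "logs/archive")
tree_ops = tree.rename_dir("logs/2019", "logs/archive")
print(flat_ops, tree_ops)
```

In the flat model the cost of the rename grows with the number of objects under the prefix, while in the hierarchical model it stays constant.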

In the past, cloud-based analytics had to compromise on performance, management, and security. Data Lake Storage Gen2 addresses each of these aspects in the following ways:

  • Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. The hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

  • Management is easier because you can organize and manipulate files through directories and subdirectories.

  • Security is enforceable because you can define POSIX permissions on directories or individual files.

  • Cost effectiveness is made possible because Data Lake Storage Gen2 is built on top of low-cost Azure Blob storage. The additional features further lower the total cost of ownership for running big data analytics on Azure.
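As an illustration of what POSIX permissions on a directory or file mean, the following sketch evaluates the classic owner/group/other mode bits. It is a simplified model under stated assumptions: the function `is_allowed` and the user and group names are hypothetical, and the code covers only basic mode bits, not the full access control model of Data Lake Storage Gen2.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(mode, owner, group, user, user_groups, wanted):
    """Classic POSIX check: exactly one class (owner, group, other) applies."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner bits
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group bits
    else:
        bits = mode & 7          # other bits
    return (bits & wanted) == wanted


# Hypothetical directory with mode rwxr-x--- owned by alice:analysts.
mode = 0o750
print(is_allowed(mode, "alice", "analysts", "alice", {"analysts"}, WRITE))
print(is_allowed(mode, "alice", "analysts", "bob", {"analysts"}, READ))
print(is_allowed(mode, "alice", "analysts", "bob", {"analysts"}, WRITE))
print(is_allowed(mode, "alice", "analysts", "carol", {"guests"}, READ))
```

Here the owner may write, a member of the owning group may read but not write, and everyone else is denied.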

Key features of Data Lake Storage Gen2

  • Hadoop compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments, including Azure HDInsight, to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports ACLs and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.

  • Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. As data moves through its complete lifecycle, billing rates change, and built-in features such as Azure Blob storage lifecycle management keep costs to a minimum.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.chinacloudapi.cn.
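Hadoop-compatible tools address data through the ABFS driver with a URI that names the container, the storage account, and the path. The sketch below assembles such a URI. The `abfs`/`abfss` URI layout comes from the Hadoop ABFS driver documentation rather than from this article, the helper `abfs_uri` is hypothetical, and the account and container names are placeholders. The endpoint suffix is the one quoted above.

```python
def abfs_uri(container, account, path="", secure=True,
             endpoint_suffix="dfs.core.chinacloudapi.cn"):
    """Build an ABFS URI: abfs[s]://<container>@<account>.<suffix>/<path>."""
    scheme = "abfss" if secure else "abfs"  # abfss uses TLS
    return f"{scheme}://{container}@{account}.{endpoint_suffix}/{path.lstrip('/')}"


# Placeholder names; substitute your own container and storage account.
print(abfs_uri("raw", "mystorageacct", "/logs/2019/events.csv"))
```

A URI built this way is what a framework such as Spark or Hive would typically be given as an input or output location.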

Scalability

Azure Storage is scalable by design, whether you access it through the Data Lake Storage Gen2 or the Blob storage interface. It can store and serve many exabytes of data. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Beyond persistence, processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

Cost effectiveness

One of the many benefits of building Data Lake Storage Gen2 on top of Azure Blob storage is the low cost of storage capacity and transactions. Unlike with other cloud storage services, data stored in Data Lake Storage Gen2 does not need to be moved or transformed before analysis. For more information about pricing, see Azure Storage pricing.

Additionally, features such as the hierarchical namespace significantly improve the overall performance of many analytics jobs. This improvement in performance means that you need less compute power to process the same amount of data, resulting in a lower total cost of ownership (TCO) for the end-to-end analytics job.

One service, multiple concepts

Data Lake Storage Gen2 is an additional capability for big data analytics, built on top of Azure Blob storage. While there are many benefits in leveraging existing platform components of Blobs to create and operate data lakes for analytics, it does lead to multiple concepts describing the same, shared things.

The following are the equivalent entities, as described by different concepts. Unless specified otherwise, these entities are directly synonymous:

Concept | Top Level Organization | Lower Level Organization | Data Container
Blobs - General purpose object storage | Container | Virtual directory (SDK only - does not provide atomic manipulation) | Blob
ADLS Gen2 - Analytics Storage | Container | Directory | File

Supported open source platforms

Several open source platforms support Data Lake Storage Gen2. Those platforms appear in the following table.

Note

Only the versions that appear in this table are supported.

Platform | Supported Version(s) | More Information
HDInsight | 3.6+ | What are the Apache Hadoop components and versions available with HDInsight?
Hadoop | 3.2+ | Apache Hadoop releases archive
Cloudera | 6.1+ | Cloudera Enterprise 6.x release notes
Hortonworks | 3.1.x++ | Configuring cloud data access
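A deployment script could guard against unsupported platform versions with a check like the one below. This is a sketch under assumptions: it reads "+" in the table as "this version or later", it treats the Hortonworks entry "3.1.x++" as a minimum of 3.1, and the names `MIN_SUPPORTED`, `parse_version`, and `is_supported` are hypothetical.

```python
# Minimum versions taken from the table above (interpretation is an assumption).
MIN_SUPPORTED = {
    "HDInsight": (3, 6),
    "Hadoop": (3, 2),
    "Cloudera": (6, 1),
    "Hortonworks": (3, 1),
}


def parse_version(text):
    """Turn '3.2.1' into (3, 2, 1); non-numeric parts such as 'x' are dropped."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def is_supported(platform, version):
    """Compare major.minor against the table; unknown platforms are unsupported."""
    minimum = MIN_SUPPORTED.get(platform)
    if minimum is None:
        return False
    return parse_version(version)[:2] >= minimum


print(is_supported("Hadoop", "3.2.1"))
print(is_supported("Hadoop", "2.9.2"))
```

Only the major and minor components are compared, so patch releases of a supported minor version pass the check.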

Next steps

The following articles describe some of the main concepts of Data Lake Storage Gen2 and detail how to store, access, manage, and gain insights from your data: