Information about using HDInsight on Linux

Azure HDInsight clusters provide Apache Hadoop on a familiar Linux environment, running in the Azure cloud. For most things, it should work exactly as any other Hadoop-on-Linux installation. This document calls out specific differences that you should be aware of.

Prerequisites

Many of the steps in this document use the following utilities, which may need to be installed on your system.
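
In particular, the examples in this document rely on curl, jq, and an SSH client. As a minimal sketch, on a Debian-based client (such as Ubuntu) they can be installed with:

sudo apt-get update
sudo apt-get install -y curl jq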

Domain names

The fully qualified domain name (FQDN) to use when connecting to the cluster from the internet is CLUSTERNAME.azurehdinsight.cn or CLUSTERNAME-ssh.azurehdinsight.cn (for SSH only).

Internally, each node in the cluster has a name that is assigned during cluster configuration. To find the node names, see the Hosts page on the Ambari Web UI. You can also use the following to return a list of hosts from the Ambari REST API:

curl -u admin -G "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/hosts" | jq '.items[].Hosts.host_name'

Replace CLUSTERNAME with the name of your cluster. When prompted, enter the password for the admin account. This command returns a JSON document that contains a list of the hosts in the cluster. jq is used to extract the host_name element value for each host.
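
For example, with a hypothetical cluster named mycluster, the call looks like the following:

curl -u admin -G "https://mycluster.azurehdinsight.cn/api/v1/clusters/mycluster/hosts" | jq '.items[].Hosts.host_name'
# Output is one quoted internal host name per line; head node names typically
# begin with "hn" and worker node names with "wn". Exact values depend on your cluster.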

If you need to find the name of the node for a specific service, you can query Ambari for that component. For example, to find the hosts for the HDFS name node, use the following command:

curl -u admin -G "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/services/HDFS/components/NAMENODE" | jq '.host_components[].HostRoles.host_name'

This command returns a JSON document describing the service, and then jq pulls out only the host_name value for the hosts.

Remote access to services

  • Ambari (web) - https://CLUSTERNAME.azurehdinsight.cn

    Authenticate by using the cluster administrator user and password, and then sign in to Ambari.

    Authentication is plaintext - always use HTTPS to help ensure that the connection is secure.

    Important

    Some of the web UIs available through Ambari access nodes by using an internal domain name. Internal domain names are not publicly accessible over the internet. You may receive "server not found" errors when trying to access some features over the internet.

    To use the full functionality of the Ambari web UI, use an SSH tunnel to proxy web traffic to the cluster head node. See Use SSH Tunneling to access Apache Ambari web UI, ResourceManager, JobHistory, NameNode, Oozie, and other web UIs. An example tunnel command is shown at the end of this section.

  • Ambari (REST) - https://CLUSTERNAME.azurehdinsight.cn/ambari

    Note

    Authenticate by using the cluster administrator user and password.

    Authentication is plaintext - always use HTTPS to help ensure that the connection is secure.

  • WebHCat (Templeton) - https://CLUSTERNAME.azurehdinsight.cn/templeton

    Note

    Authenticate by using the cluster administrator user and password.

    Authentication is plaintext - always use HTTPS to help ensure that the connection is secure.

  • SSH - CLUSTERNAME-ssh.azurehdinsight.cn on port 22 or 23. Port 22 is used to connect to the primary head node, while port 23 is used to connect to the secondary head node. For more information on the head nodes, see Availability and reliability of Apache Hadoop clusters in HDInsight.

    Note

    You can only access the cluster head nodes through SSH from a client machine. Once connected, you can then access the worker nodes by using SSH from a head node.

For more information, see the Ports used by Apache Hadoop services on HDInsight document.
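
For example, assuming the SSH user created with the cluster is named sshuser (an assumption; use your own account name), the following commands connect to the primary head node and open the SSH tunnel mentioned above:

# Interactive session on the primary head node (port 22; use -p 23 for the secondary).
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn

# SOCKS proxy on local port 9876 that tunnels browser traffic through the head node;
# configure your browser to use localhost:9876 as a SOCKS5 proxy to reach internal UIs.
ssh -C -N -D 9876 sshuser@CLUSTERNAME-ssh.azurehdinsight.cn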

File locations

Hadoop-related files can be found on the cluster nodes at /usr/hdp. This directory contains the following subdirectories:

  • 2.6.5.3006-29: The directory name is the version of the Hadoop platform used by HDInsight. The number on your cluster may be different from the one listed here.
  • current: This directory contains links to subdirectories under the 2.6.5.3006-29 directory. This directory exists so that you don't have to remember the version number.
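
For example, you can verify this layout on any cluster node (the version directory and symlink target below are illustrative):

ls /usr/hdp
# Illustrative output: 2.6.5.3006-29  current
ls -l /usr/hdp/current
# Each entry is a symlink into the versioned directory, such as:
# hadoop-client -> /usr/hdp/2.6.5.3006-29/hadoop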

Example data and JAR files can be found on Hadoop Distributed File System at /example and /HdiSamples.
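
You can list both locations with the hdfs command:

hdfs dfs -ls /example
hdfs dfs -ls /HdiSamples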

HDFS, Azure Storage, and Data Lake Storage

In most Hadoop distributions, the data is stored in HDFS, which is backed by local storage on the machines in the cluster. Using local storage can be costly for a cloud-based solution, where you are charged hourly or by the minute for compute resources.

When using HDInsight, the data files are stored in a scalable and resilient way in the cloud using Azure Blob Storage or Azure Data Lake Storage. These services provide the following benefits:

  • Cheap long-term storage.
  • Accessibility from external services such as websites, file upload/download utilities, various language SDKs, and web browsers.
  • Large file capacity and large scalable storage.

For more information, see Understanding blobs.

When using Azure Storage, you don't have to do anything special from HDInsight to access the data. For example, the following command lists files in the /example/data folder:

hdfs dfs -ls /example/data

In HDInsight, data storage is decoupled from compute resources. Therefore, you can create HDInsight clusters to do computation as needed, and later delete the cluster when the work is finished, while keeping your data files safely persisted in cloud storage for as long as you need.

URI and scheme

Some commands may require you to specify the scheme as part of the URI when accessing a file. For example, the Storm-HDFS component requires you to specify the scheme. When using non-default storage (storage added as "additional" storage to the cluster), you must always use the scheme as part of the URI.

When using Azure Storage, use one of the following URI schemes:

  • wasb:///: Access default storage using unencrypted communication.

  • wasbs:///: Access default storage using encrypted communication. The wasbs scheme is supported only from HDInsight version 3.6 onwards.

  • wasb://<container-name>@<account-name>.blob.core.chinacloudapi.cn/: Used when communicating with a non-default storage account. For example, when you have an additional storage account or when accessing data stored in a publicly accessible storage account.

When using Azure Data Lake Storage Gen2, use one of the following URI schemes:

  • abfs:///: Access default storage using unencrypted communication.

  • abfss:///: Access default storage using encrypted communication. The abfss scheme is supported only from HDInsight version 3.6 onwards.

  • abfs://<container-name>@<account-name>.dfs.core.chinacloudapi.cn/: Used when communicating with a non-default storage account. For example, when you have an additional storage account or when accessing data stored in a publicly accessible storage account.
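
For example, the following commands list the same default-storage folder using the shorthand scheme and a fully qualified URI (the container and account names in the second command are hypothetical):

hdfs dfs -ls wasbs:///example/data
hdfs dfs -ls wasb://mycontainer@myaccount.blob.core.chinacloudapi.cn/example/data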

Important

When using Data Lake Storage as the default store for HDInsight, you must specify a path within the store to use as the root of HDInsight storage. The default path is /clusters/<cluster-name>/.

What storage is the cluster using?

You can use Ambari to retrieve the default storage configuration for the cluster. Use the following command to retrieve the HDFS configuration information using curl, and filter it using jq:

curl -u admin -G "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/configurations/service_config_versions?service_name=HDFS&service_config_version=1" | jq '.items[].configurations[].properties["fs.defaultFS"] | select(. != null)'

Note

This command returns the first configuration applied to the server (service_config_version=1), which contains this information. You may need to list all configuration versions to find the latest one.
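
As a sketch, dropping the service_config_version parameter returns every configuration version, which you can then inspect to find the most recent one (the jq filter is illustrative):

curl -u admin -G "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/configurations/service_config_versions?service_name=HDFS" | jq '.items[].service_config_version'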

The fs.defaultFS query returns a value similar to the following URIs:

  • wasb://<container-name>@<account-name>.blob.core.chinacloudapi.cn if using an Azure Storage account.

    The account name is the name of the Azure Storage account. The container name is the blob container that is the root of the cluster storage.

You can also find the storage information in the Azure portal by using the following steps:

  1. From the Azure portal, select your HDInsight cluster.

  2. From the Properties section, select Storage Accounts. The storage information for the cluster is displayed.

How do I access files from outside HDInsight?

There are various ways to access data from outside the HDInsight cluster. The following are a few links to utilities and SDKs that can be used to work with your data:

If using Azure Storage, see the following links for ways that you can access your data:

  • Azure CLI: Command-line interface commands for working with Azure. After installing, use the az storage command for help on using storage, or az storage blob for blob-specific commands. A sample listing command follows this list.

  • blobxfer.py: A Python script for working with blobs in Azure Storage.

  • Various SDKs:
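
As a minimal sketch of the Azure CLI option above, assuming a hypothetical account mystorageaccount and container mycontainer, the following lists blobs under example/data:

# Account, container, and prefix are hypothetical; add --account-key or a SAS
# token if your CLI login does not already have access to the account.
az storage blob list --account-name mystorageaccount --container-name mycontainer --prefix example/data --output table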

Scaling your cluster

The cluster scaling feature allows you to dynamically change the number of data nodes used by a cluster. You can perform scaling operations while other jobs or processes are running on a cluster. See also Scale HDInsight clusters.
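
For example, a cluster can be resized from the Azure CLI; the resource group name below is hypothetical, and the exact parameter name for the worker-node count may differ between CLI versions:

# Scale the cluster to 4 worker nodes (resource group name is illustrative).
az hdinsight resize --resource-group myresourcegroup --name CLUSTERNAME --workernode-count 4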

The different cluster types are affected by scaling as follows:

  • Hadoop: When scaling down the number of nodes in a cluster, some of the services in the cluster are restarted. Scaling operations can cause running or pending jobs to fail at the completion of the scaling operation. You can resubmit the jobs once the operation is complete.

  • HBase: Region servers are automatically balanced within a few minutes once the scaling operation completes. To manually balance region servers, use the following steps:

    1. Connect to the HDInsight cluster using SSH. For more information, see Use SSH with HDInsight.

    2. Use the following to start the HBase shell:

       hbase shell
      
    3. Once the HBase shell has loaded, use the following to manually balance the region servers:

       balancer
      
  • Storm: You should rebalance any running Storm topologies after a scaling operation has been performed. Rebalancing allows the topology to readjust parallelism settings based on the new number of nodes in the cluster. To rebalance running topologies, use one of the following options:

    • SSH: Connect to the server and use the following command to rebalance a topology:

        storm rebalance TOPOLOGYNAME
      

      You can also specify parameters to override the parallelism hints originally provided by the topology. For example, storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10 reconfigures the topology to 5 worker processes, 3 executors for the blue-spout component, and 10 executors for the yellow-bolt component.

    • Storm UI: Use the following steps to rebalance a topology using the Storm UI.

      1. Open https://CLUSTERNAME.azurehdinsight.cn/stormui in your web browser, where CLUSTERNAME is the name of your Storm cluster. If prompted, enter the HDInsight cluster administrator (admin) name and password you specified when creating the cluster.
      2. Select the topology you wish to rebalance, then select the Rebalance button. Enter the delay before the rebalance operation is performed.
  • Kafka: You should rebalance partition replicas after scaling operations. For more information, see the High availability of data with Apache Kafka on HDInsight document.

For specific information on scaling your HDInsight cluster, see:

How do I install Hue (or other Hadoop components)?

HDInsight is a managed service. If Azure detects a problem with the cluster, it may delete the failing node and create a new node to replace it. If you manually install components on the cluster, they are not persisted when this operation occurs. Instead, use HDInsight Script Actions. A script action can be used to make the following changes:

  • Install and configure a service or website.
  • Install and configure a component that requires configuration changes on multiple nodes in the cluster.

Script actions are Bash scripts. The scripts run during cluster creation, and are used to install and configure additional components. Example scripts are provided for installing the following components:

For information on developing your own script actions, see Script Action development with HDInsight.
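
As a minimal sketch, a script action is an ordinary Bash script that runs with root privileges on the nodes you select; the example below (illustrative only) installs a single package on each node it runs on:

#!/usr/bin/env bash
# Illustrative script action: install the jq utility on this node.
# Script actions run as root, so no sudo is required.
set -euo pipefail
apt-get update
apt-get install -y jq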

Jar files

Some Hadoop technologies are provided in self-contained jar files that contain functions used as part of a MapReduce job, or from inside Pig or Hive. They often don't require any setup, and can be uploaded to the cluster after creation and used directly. If you want to make sure the component survives reimaging of the cluster, you can store the jar file in the default storage for your cluster (WASB or ADL).

For example, if you want to use the latest version of Apache DataFu, you can download a jar containing the project and upload it to the HDInsight cluster. Then follow the DataFu documentation on how to use it from Pig or Hive.
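
A minimal sketch of such an upload, using a hypothetical jar name and target directory in the cluster's default storage (files stored there survive cluster reimaging):

# The jar file name and the /libs directory are hypothetical; adjust to your own.
hdfs dfs -mkdir -p /libs
hdfs dfs -put datafu-pig-1.6.1.jar /libs/
hdfs dfs -ls /libs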

Important

Some components that are standalone jar files are provided with HDInsight, but are not in the path. If you are looking for a specific component, you can use the following to search for it on your cluster:

find / -name "*componentname*.jar" 2>/dev/null

This command returns the path of any matching jar files.

To use a different version of a component, upload the version you need and use it in your jobs.

Important

Components provided with the HDInsight cluster are fully supported, and Azure Support helps to isolate and resolve issues related to these components.

Custom components receive commercially reasonable support to help you further troubleshoot the issue. This might result in resolving the issue, or you may be asked to engage the available channels for the open source technology in question, where deep expertise for that technology can be found. For example, there are many community sites that can be used, such as the MSDN forum for HDInsight and Azure CSDN. Apache projects also have project sites on https://apache.org, for example: Hadoop, Spark.

Next steps