Manage logs for an HDInsight cluster

An HDInsight cluster produces a variety of log files. For example, Apache Hadoop and related services, such as Apache Spark, produce detailed job execution logs. Log file management is part of maintaining a healthy HDInsight cluster. There can also be regulatory requirements for log archiving. Because of the number and size of log files, optimizing log storage and archiving also helps with service cost management.

Managing HDInsight cluster logs includes retaining information about all aspects of the cluster environment. This information includes all associated Azure service logs, cluster configuration, job execution information, any error states, and other data as needed.

Typical steps in HDInsight log management are:

  • Step 1: Determine log retention policies
  • Step 2: Manage cluster service versions and view logs
  • Step 3: Manage cluster job execution log files
  • Step 4: Forecast log volume storage sizes and costs
  • Step 5: Determine log archive policies and processes

Step 1: Determine log retention policies

The first step in creating an HDInsight cluster log management strategy is to gather information about business scenarios and job execution history storage requirements.

Cluster details

The following cluster details are useful when gathering information for your log management strategy. Gather this information from all HDInsight clusters you have created in a particular Azure account.

  • Cluster name
  • Cluster region and Azure availability zone
  • Cluster state, including details of the last state change
  • Type and number of HDInsight instances specified for the master, core, and task nodes

You can get most of this top-level information using the Azure portal. Alternatively, you can use the Azure CLI to get information about your HDInsight clusters:

    az hdinsight list --resource-group <ResourceGroup>
    az hdinsight show --resource-group <ResourceGroup> --name <ClusterName>

You can also use PowerShell to view this information. For more information, see Manage Apache Hadoop clusters in HDInsight by using Azure PowerShell.

Understand the workloads running on your clusters

It's important to understand the workload types running on your HDInsight clusters so that you can design an appropriate logging strategy for each type.

  • Are the workloads experimental (such as development or test) or production-quality?
  • How often do the production-quality workloads normally run?
  • Are any of the workloads resource-intensive, long-running, or both?
  • Do any of the workloads use a complex set of Hadoop services for which multiple types of logs are produced?
  • Do any of the workloads have associated regulatory execution lineage requirements?

Example log retention patterns and practices

  • Consider maintaining data lineage tracking by adding an identifier to each log entry, or through other techniques. This allows you to trace back the original source of the data and the operation, and follow the data through each stage to understand its consistency and validity.

  • Consider how you can collect logs from the cluster, or from more than one cluster, and collate them for purposes such as auditing, monitoring, planning, and alerting. You might use a custom solution to access and download the log files on a regular basis, and combine and analyze them to provide a dashboard display. You can also add capabilities for security alerting or failure detection. You can build these utilities using PowerShell, the HDInsight SDKs, or code that accesses the Azure classic deployment model.

  • Consider whether a monitoring solution or service would be useful. Microsoft System Center provides an HDInsight management pack. You can also use third-party tools such as Apache Chukwa and Ganglia to collect and centralize logs. Many companies offer services to monitor Hadoop-based big data solutions, for example: Centerity, Compuware APM, Sematext SPM, and Zettaset Orchestrator.

Step 2: Manage cluster service versions and view logs

A typical HDInsight cluster uses several services and open-source software packages (such as Apache HBase, Apache Spark, and so forth). For some workloads, such as bioinformatics, you may be required to retain service configuration log history in addition to job execution logs.

View cluster configuration settings with the Ambari UI

Apache Ambari simplifies the management, configuration, and monitoring of an HDInsight cluster by providing a web UI and a REST API. Ambari is included on Linux-based HDInsight clusters. Select the Cluster Dashboard pane on the Azure portal HDInsight page to open the Cluster Dashboards link page. Next, select the HDInsight cluster dashboard pane to open the Ambari UI. You are prompted for your cluster login credentials.

To open a list of service views, select the Ambari Views pane on the Azure portal page for HDInsight. This list varies, depending on which libraries you've installed. For example, you may see YARN Queue Manager, Hive View, and Tez View. Select any service link to see configuration and service information. The Ambari UI Stack and Versions page provides information about the cluster services' configuration and service version history. To navigate to this section of the Ambari UI, select the Admin menu and then Stack and Versions. Select the Versions tab to see service version information.

[Figure: Stack and Versions]

Using the Ambari UI, you can download the configuration for any (or all) services running on a particular host (or node) in the cluster. Select the Hosts menu, then the link for the host of interest. On that host's page, select the Host Actions button and then Download Client Configs.

[Figure: Host client configs]

View the script action logs

HDInsight script actions run scripts on a cluster, either manually or when specified. For example, script actions can be used to install additional software on the cluster or to alter configuration settings from the default values. Script action logs can provide insight into errors that occurred during setup of the cluster, and into configuration changes that could affect cluster performance and availability. To see the status of a script action, select the ops button on your Ambari UI, or access the status logs in the default storage account. The storage logs are available at /STORAGE_ACCOUNT_NAME/DEFAULT_CONTAINER_NAME/custom-scriptaction-logs/CLUSTER_NAME/DATE.

View Ambari alerts status logs

Apache Ambari writes alert status changes to ambari-alerts.log. The full path is /var/log/ambari-server/ambari-alerts.log. To enable debug logging for alerts, edit /etc/ambari-server/conf/log4j.properties and change the entry under # Log alert state changes from:

    log4j.logger.alerts=INFO,alerts

to

    log4j.logger.alerts=DEBUG,alerts
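
The same edit can be scripted. The following is a minimal sketch, assuming a POSIX shell with sed on the Ambari server host and write access to the file; set_alert_log_level is a hypothetical helper name, not part of Ambari. It leaves a .bak copy of the original file next to the edited one.

```shell
# Hypothetical helper: set the level of the Ambari alerts logger.
# Usage: set_alert_log_level <path-to-log4j.properties> <LEVEL>
set_alert_log_level() {
  conf_file="$1"
  level="$2"
  # Rewrite only the level on the log4j.logger.alerts line; keep the appender name.
  sed -i.bak \
    "s/^log4j\.logger\.alerts=[A-Z]*,alerts/log4j.logger.alerts=${level},alerts/" \
    "$conf_file"
}

# Example:
# set_alert_log_level /etc/ambari-server/conf/log4j.properties DEBUG
```

Calling the helper again with INFO reverts the change. Depending on your setup, Ambari server may need a restart before the new level takes effect.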

Step 3: Manage cluster job execution log files

The next step is reviewing the job execution log files for the various services. Services could include Apache HBase, Apache Spark, and many others. A Hadoop cluster produces a large number of verbose logs, so determining which logs are useful (and which aren't) can be time-consuming. Understanding the logging system is important for targeted management of log files. The following image is an example log file.

[Figure: HDInsight example log file sample output]

Access the Hadoop log files

HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage. Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure Blob storage.

Hadoop runs the work of the jobs as task attempts on various nodes in the cluster. HDInsight can initiate speculative task attempts, terminating any other task attempts that do not complete first. This generates significant activity that is logged to the controller, stderr, and syslog log files on the fly. In addition, multiple task attempts run simultaneously, but a log file can only display results linearly.

HDInsight logs written to Azure Blob storage

HDInsight clusters are configured to write task logs to an Azure Blob storage account for any job that is submitted using the Azure PowerShell cmdlets or the .NET job submission APIs. If you submit jobs through SSH to the cluster, then the execution logging information is stored in the Azure Tables as discussed in the previous section.

In addition to the core log files generated by HDInsight, installed services such as YARN also generate job execution log files. The number and type of log files depend on the services installed. Common services are Apache HBase, Apache Spark, and so on. Investigate the job execution log files for each service to understand the overall set of log files available on your cluster. Each service has its own methods of logging and its own locations for storing log files. As an example, details for accessing the most common service log files (from YARN) are discussed in the following section.

HDInsight logs generated by YARN

YARN aggregates logs across all containers on a worker node and stores those logs as one aggregated log file per worker node. That log is stored on the default file system after an application finishes. Your application may use hundreds or thousands of containers, but logs for all containers that run on a single worker node are always aggregated into a single file, so there is only one log per worker node used by your application. Log aggregation is enabled by default on HDInsight clusters version 3.0 and above. Aggregated logs are located in default storage for the cluster:

    /app-logs/<user>/logs/<applicationId>

The aggregated logs are not directly readable, as they are written in a TFile binary format indexed by container. Use the YARN ResourceManager logs or CLI tools to view these logs as plain text for applications or containers of interest.

YARN CLI tools

To use the YARN CLI tools, you must first connect to the HDInsight cluster using SSH. Specify the <applicationId>, <user-who-started-the-application>, <containerId>, and <worker-node-address> information when running these commands. You can view the logs as plain text with one of the following commands:

    yarn logs -applicationId <applicationId> -appOwner <user-who-started-the-application>
    yarn logs -applicationId <applicationId> -appOwner <user-who-started-the-application> -containerId <containerId> -nodeAddress <worker-node-address>
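
Once the logs are available as plain text, ordinary shell tools can summarize them. The following is a minimal sketch, assuming log4j-style lines in which the level (DEBUG, INFO, WARN, ERROR, or FATAL) appears as its own whitespace-delimited token; summarize_levels is a hypothetical helper, and the line layout is an assumption about your services' log format.

```shell
# Hypothetical helper: count log lines per log4j level, reading plain text from stdin.
# Only the first level token found on each line is counted.
summarize_levels() {
  awk '{
         for (i = 1; i <= NF; i++) {
           if ($i ~ /^(DEBUG|INFO|WARN|ERROR|FATAL)$/) { count[$i]++; break }
         }
       }
       END { for (level in count) print level, count[level] }' | sort
}

# Example (placeholder as in the commands above):
# yarn logs -applicationId <applicationId> | summarize_levels
```

A quick per-level count like this helps decide which application logs deserve a closer look before you download or archive them.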

YARN ResourceManager UI

The YARN ResourceManager UI runs on the cluster head node, and is accessed through the Ambari web UI. Use the following steps to view the YARN logs:

  1. In a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn. Replace CLUSTERNAME with the name of your HDInsight cluster.
  2. From the list of services on the left, select YARN.
  3. From the Quick Links dropdown, select one of the cluster head nodes and then select ResourceManager logs. You are presented with a list of links to YARN logs.

Step 4: Forecast log volume storage sizes and costs

After completing the previous steps, you have an understanding of the types and volumes of log files that your HDInsight clusters are producing.

Next, analyze the volume of log data in key log storage locations over a period of time. For example, you can analyze volume and growth over 30-, 60-, and 90-day periods. Record this information in a spreadsheet or use other tools such as Visual Studio, the Azure Storage Explorer, or Power Query for Excel. For more information, see Analyze HDInsight logs.

You now have enough information to create a log management strategy for the key logs. Use your spreadsheet (or tool of choice) to forecast both log size growth and the resulting Azure storage costs. Also consider any log retention requirements for the set of logs that you are examining. After you determine which log files can be deleted (if any) and which logs should be retained and archived to less expensive Azure storage, you can reforecast future log storage costs.
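
As a rough illustration of such a forecast, the following sketch extrapolates log volume linearly from two measurements and converts a size into a monthly cost. Both helper names are hypothetical, linear growth is an assumption that may not hold for your workloads, and the price per GB is an input you supply from current Azure pricing.

```shell
# Hypothetical helper: linear forecast of log volume in GB.
# Usage: forecast_log_size <earlier_gb> <later_gb> <days_between> <days_ahead>
forecast_log_size() {
  awk -v a="$1" -v b="$2" -v n="$3" -v h="$4" \
    'BEGIN { rate = (b - a) / n; printf "%.1f\n", b + rate * h }'
}

# Hypothetical helper: monthly storage cost for a size and a price per GB per month.
# Usage: monthly_cost <size_gb> <price_per_gb_month>
monthly_cost() {
  awk -v s="$1" -v p="$2" 'BEGIN { printf "%.2f\n", s * p }'
}

# Example: logs grew from 100 GB to 130 GB in 30 days; project 90 days ahead.
# forecast_log_size 100 130 30 90    # prints 220.0
```

Running the forecast once for the current retention policy and once for a reduced set of retained logs gives a simple before-and-after cost comparison.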

Step 5: Determine log archive policies and processes

After you determine which log files can be deleted, you can adjust logging parameters on many Hadoop services to automatically delete log files after a specified time period.

For certain log files, you can use a lower-priced log file archiving approach. For Azure Resource Manager activity logs, you can explore this approach using the Azure portal. Set up archiving of the Resource Manager logs by selecting the Activity Log link in the Azure portal for your HDInsight instance. At the top of the Activity Log search page, select the Export menu item to open the Export activity log pane. Fill in the subscription, region, whether to export to a storage account, and how many days to retain the logs. On this same pane, you can also indicate whether to export to an event hub.

[Figure: Export log file]

Alternatively, you can script log archiving with PowerShell. For an example PowerShell script, see Archive Azure Automation logs to Azure Blob Storage.

Accessing Azure storage metrics

Azure storage can be configured to log storage operations and access. You can use these very detailed logs for capacity monitoring and planning, and for auditing requests to storage. The logged information includes latency details, enabling you to monitor and fine-tune the performance of your solutions. You can use the .NET SDK for Hadoop to examine the log files generated for the Azure storage that holds the data for an HDInsight cluster.

Control the size and number of backup indexes for old log files

To control the size and number of log files retained, set the following properties of the RollingFileAppender:

  • maxFileSize is the critical size of the file, above which the file is rolled. The default value is 10 MB.
  • maxBackupIndex specifies the number of backup files to be created. The default value is 1.
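
To show how these two properties fit into a log4j 1.x properties file, the following sketch prints a configuration fragment for a size-based rolling appender. print_rolling_appender is a hypothetical helper, and the appender name, file path, and values in the example are placeholders rather than HDInsight defaults. The property keys use the capitalized form commonly seen in Hadoop's log4j.properties files.

```shell
# Hypothetical helper: print a log4j 1.x fragment for a RollingFileAppender.
# Usage: print_rolling_appender <appender_name> <log_file> <max_file_size> <max_backup_index>
print_rolling_appender() {
  cat <<EOF
log4j.appender.$1=org.apache.log4j.RollingFileAppender
log4j.appender.$1.File=$2
log4j.appender.$1.MaxFileSize=$3
log4j.appender.$1.MaxBackupIndex=$4
log4j.appender.$1.layout=org.apache.log4j.PatternLayout
log4j.appender.$1.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
EOF
}

# Example: roll at 10 MB and keep 5 backups (placeholder name and path).
# print_rolling_appender RFA /var/log/example-service/service.log 10MB 5
```

With these example values, the service keeps at most six files on disk (the active log plus five backups), which puts an upper bound of roughly 60 MB on that log.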

Other log management techniques

To avoid running out of disk space, you can use OS tools such as logrotate to manage log files. You can configure logrotate to run on a daily basis, compressing log files and removing old ones. Your approach depends on your requirements, such as how long to keep the log files on local nodes.
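
A daily logrotate rule of the kind described above might look like the output of the following sketch. print_logrotate_rule is a hypothetical helper, and the log path in the example is a placeholder; check where each service on your nodes actually writes its logs before using a rule like this.

```shell
# Hypothetical helper: print a logrotate rule for a log file pattern.
# Usage: print_logrotate_rule <log_glob> <rotations_to_keep>
print_logrotate_rule() {
  cat <<EOF
$1 {
    daily
    rotate $2
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Example: keep seven daily, compressed rotations (placeholder path).
# print_logrotate_rule '/var/log/example-service/*.log' 7 > /etc/logrotate.d/example-service
```

The copytruncate directive copies the log and then truncates it in place, which suits services that keep their log file open; missingok and notifempty skip absent or empty files.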

You can also check whether DEBUG logging is enabled for one or more services, which greatly increases the output log size.

To collect the logs from all the nodes to one central location, you can create a data flow, such as ingesting all log entries into Solr.

Next steps