Optimize Apache Pig with Apache Ambari in Azure HDInsight

Apache Ambari is a web interface for managing and monitoring HDInsight clusters. For an introduction to the Ambari Web UI, see Manage HDInsight clusters by using the Apache Ambari Web UI.

You can modify Apache Pig properties from the Ambari Web UI to tune Pig queries. Modifying Pig properties from Ambari directly modifies the Pig properties in the /etc/pig/2.4.2.0-258.0/pig.properties file.

  1. To modify Pig properties, navigate to the Pig Configs tab, and then expand the Advanced pig-properties pane.

  2. Find the property you want to modify, uncomment it, and change its value.

  3. Select Save on the top-right side of the window to save the new value. Some properties may require a service restart to take effect.

    [Image: Advanced Apache Pig properties]

Note

Any session-level settings override property values in the pig.properties file.

Tune execution engine

Two execution engines are available to execute Pig scripts: MapReduce and Tez. Tez is an optimized engine and is much faster than MapReduce.

  1. To modify the execution engine, find the exectype property in the Advanced pig-properties pane.

  2. The default value is MapReduce. Change it to Tez.
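
As a minimal sketch, the edited line in the Advanced pig-properties pane might look like the following. The lowercase spelling of the value is an assumption based on how the stock pig.properties file usually lists the engine names; check the comment next to the property in your own pane.

```properties
# Execution engine used to run Pig scripts.
# Uncommented and changed from the default (MapReduce) to Tez.
# Assumption: the value is written in lowercase.
exectype=tez
```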

Enable local mode

As with Hive, local mode speeds up jobs that process relatively small amounts of data.

  1. To enable local mode, set pig.auto.local.enabled to true. The default value is false.

  2. Jobs with an input data size less than the pig.auto.local.input.maxbytes property value are considered small jobs. The default value is 1 GB.
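
The two local-mode settings above could be sketched together as follows. The byte value is purely illustrative (about 100 MB), not a recommendation; the property name suggests the limit is expressed in bytes.

```properties
# Let Pig automatically run small jobs in local mode.
pig.auto.local.enabled=true

# Upper bound on input size for a job to count as "small", in bytes.
# Illustrative value only (100,000,000 bytes, roughly 100 MB).
pig.auto.local.input.maxbytes=100000000
```

With these hypothetical values, a job reading 40 MB of input would qualify for local mode, while a job reading 2 GB would not.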

Copy user jars to cache

Pig copies the JAR files required by UDFs to a distributed cache to make them available to task nodes. These jars don't change frequently. When enabled, the pig.user.cache.enabled setting places the jars in a cache so that jobs run by the same user can reuse them. This setting results in a minor increase in job performance.

  1. To enable the cache, set pig.user.cache.enabled to true. The default is false.

  2. To set the base path of the cached jars, set pig.user.cache.location to the base path. The default is /tmp.
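
A minimal sketch of the two cache settings, using only the values named in the steps above. The location is shown with its default; any other path you substitute would be your own choice.

```properties
# Reuse UDF jars across jobs run by the same user.
pig.user.cache.enabled=true

# Base path under which the cached jars are stored (default shown).
pig.user.cache.location=/tmp
```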

Optimize performance with memory settings

The following memory settings can help optimize Pig script performance.

  • pig.cachedbag.memusage: The amount of memory given to a bag. A bag is a collection of tuples. A tuple is an ordered set of fields, and a field is a piece of data. If the data in a bag exceeds the given memory, it's spilled to disk. The default value is 0.2, which represents 20 percent of available memory. This memory is shared across all bags in an application.

  • pig.spill.size.threshold: Bags larger than this spill size threshold (in bytes) are spilled to disk. The default value is 5 MB.
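
As a sketch, the two memory settings could be written out with their defaults as shown below. The spill threshold is given in bytes, so the 5 MB default is expressed here as 5,000,000, which assumes decimal megabytes.

```properties
# Fraction of available memory shared by all bags (0.2 = 20 percent).
pig.cachedbag.memusage=0.2

# Bags larger than this many bytes are spilled to disk.
# Assumption: 5 MB written as 5,000,000 bytes (decimal megabytes).
pig.spill.size.threshold=5000000
```

For a rough sense of scale, if 4 GB of memory were available (a hypothetical figure), a value of 0.2 would correspond to about 0.8 GB shared by all bags in the application.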

Compress temporary files

Pig generates temporary files during job execution. Compressing the temporary files improves performance when reading files from or writing files to disk. The following settings can be used to compress temporary files.

  • pig.tmpfilecompression: When true, enables temporary file compression. The default value is false.

  • pig.tmpfilecompression.codec: The compression codec to use for the temporary files. LZO and Snappy are the recommended codecs because they use less CPU.
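
The compression settings might be combined as in the sketch below. It assumes that the codec is named lzo in the property value and that the LZO libraries are available on the cluster nodes; neither is confirmed by this article, so verify both before relying on it.

```properties
# Compress the temporary files Pig writes between jobs.
pig.tmpfilecompression=true

# Codec for the temporary files.
# Assumption: "lzo" is the accepted codec name and LZO is installed.
pig.tmpfilecompression.codec=lzo
```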

Enable split combining

Split combining merges small files so that fewer map tasks are needed, which improves the efficiency of jobs with many small files. The pig.noSplitCombination property controls it: the default value is false, which leaves split combining on, and setting the property to true turns split combining off.

Tune mappers

The number of mappers is controlled by the pig.maxCombinedSplitSize property, which specifies the size of the data to be processed by a single map task. The default value is the filesystem's default block size. Increasing this value results in fewer mapper tasks.
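
A sketch of raising the combined split size is shown below. The value is a hypothetical choice of 256 MB, written in bytes, which is the unit the Apache Pig documentation gives for this property.

```properties
# Amount of data, in bytes, handled by a single map task.
# Hypothetical value: 256 MB (256 * 1024 * 1024).
pig.maxCombinedSplitSize=268435456
```

As a rough estimate only, 10 GB of small input files would need about 40 map tasks at this setting (10,240 MB / 256 MB), compared with about 80 at a 128 MB split size. Actual counts depend on file layout and input format.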

Tune reducers

The number of reducers is calculated from the pig.exec.reducers.bytes.per.reducer parameter, which specifies the number of bytes processed per reducer (1 GB by default). To limit the maximum number of reducers, set the pig.exec.reducers.max property (999 by default).
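
The two reducer settings could be written out with their defaults as in the sketch below. Expressing 1 GB as 1,000,000,000 bytes assumes decimal gigabytes.

```properties
# Bytes of input handled by each reducer.
# Assumption: 1 GB written as 1,000,000,000 bytes (decimal gigabytes).
pig.exec.reducers.bytes.per.reducer=1000000000

# Upper limit on the number of reducers.
pig.exec.reducers.max=999
```

As an approximation, the reducer count is the input size divided by the bytes-per-reducer value, rounded up and capped at the maximum. With these values, 50 GB of input would yield about 50 reducers, and 2 TB would yield 999 because the estimate of about 2,000 exceeds the cap.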

Next steps