Optimize data processing with Azure Machine Learning

In this article, you learn about best practices to help you optimize data processing speeds locally and at scale.

Azure Machine Learning is integrated with open-source packages and frameworks for data processing. By using these integrations and applying the best-practice recommendations in this article, you can improve your data processing speeds both locally and at scale.

Parquet and CSV file formats

Comma-separated values (CSV) files are a common file format for data processing. However, the Parquet file format is recommended for machine learning tasks.

Parquet files store data in a binary columnar format. This format is useful if you need to split the data into multiple files, and it lets you target only the fields relevant to your machine learning experiments. Instead of having to read in an entire 20-GB data file, you can decrease the data load by selecting only the columns necessary to train your ML model. Parquet files can also be compressed to minimize processing power and storage space.

CSV files are commonly used to import and export data, since they're easy to edit and read in Excel. The data in CSVs is stored as strings in a row-based format, and the files can be compressed to lessen data transfer loads. In memory, uncompressed CSVs can expand by a factor of about 2-10, and compressed CSVs even further. So a 5-GB CSV can expand in memory to well over the 8 GB of RAM you have on your machine. This expansion behavior can also increase data transfer latency, which isn't ideal if you have large amounts of data to process.

Pandas dataframe

Pandas dataframes are commonly used for data manipulation and analysis. Pandas works well for data sizes less than 1 GB, but processing times for pandas dataframes slow down when file sizes reach about 1 GB. This slowdown occurs because the size of your data in storage isn't the same as the size of the data in a dataframe. For instance, data in CSV files can expand up to 10 times in a dataframe, so a 1-GB CSV file can become 10 GB in a dataframe.
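You can observe this expansion directly by comparing a CSV's byte size with its dataframe's memory footprint. The tiny in-memory CSV below is illustrative only; string columns inflate because each short text value becomes a full Python string object, and each number becomes an 8-byte int64.

```python
import io
import pandas as pd

# A small CSV with an integer column and a short string column.
csv_text = "id,label\n" + "\n".join(f"{i},yes" for i in range(1000))
raw_bytes = len(csv_text.encode("utf-8"))     # size of the CSV itself

df = pd.read_csv(io.StringIO(csv_text))
in_memory = df.memory_usage(deep=True).sum()  # size once loaded as a dataframe

# Each 3-character "yes" becomes a ~50-byte Python string object and each
# id becomes an 8-byte int64, so the dataframe dwarfs the raw file.
print(in_memory > raw_bytes)
```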

Pandas is single threaded, meaning operations are done one at a time on a single CPU. You can easily parallelize workloads across multiple virtual CPUs on a single Azure Machine Learning compute instance with packages like Modin, which wraps pandas with a distributed backend.

To parallelize your tasks with Modin and Dask, just change this line of code: import pandas as pd becomes import modin.pandas as pd.
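The one-line swap can be written defensively so the same script runs whether or not Modin is installed; the fallback branch is an assumption for environments without Modin, not part of the documented change.

```python
# The only code change is the import; the dataframe API stays the same.
try:
    import modin.pandas as pd  # parallel pandas, backed by Dask or Ray
except ImportError:
    import pandas as pd        # fallback: standard single-threaded pandas

df = pd.DataFrame({"a": [1, 2, 3]})
print(df["a"].sum())
```

Because Modin mirrors the pandas API, the rest of your code, such as `read_csv`, `groupby`, and `merge` calls, runs unchanged on the parallel backend.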

Dataframe: out-of-memory error

Typically, an out-of-memory error occurs when your dataframe expands beyond the available RAM on your machine. This concept also applies to distributed frameworks like Modin or Dask: the operation attempts to load the dataframe into memory on each node in your cluster, but not enough RAM is available to do so.

One solution is to increase your RAM so the dataframe fits in memory. We recommend a compute size with at least two times your data size in RAM. So if your dataframe is 10 GB, use a compute target with at least 20 GB of RAM to ensure that the dataframe can comfortably fit in memory and be processed.

For multiple virtual CPUs (vCPUs), keep in mind that you want one partition to fit comfortably into the RAM available to each vCPU on the machine. That is, if you have 16 GB of RAM and 4 vCPUs, you want dataframe partitions of about 2 GB per vCPU.
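The sizing arithmetic above can be captured in a small helper; the 2x headroom factor reflects the earlier recommendation that compute RAM be about twice the data size, and the function name is illustrative.

```python
def partition_size_gb(ram_gb: float, vcpus: int, headroom: float = 2.0) -> float:
    """Rule-of-thumb partition size: keep total dataframe memory at about
    half of RAM (the 2x headroom), split evenly across the vCPUs."""
    return ram_gb / (headroom * vcpus)

# 16 GB of RAM and 4 vCPUs -> roughly 2-GB partitions per vCPU.
print(partition_size_gb(16, 4))  # -> 2.0
```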

Local vs remote

You may notice that certain pandas dataframe commands perform faster on your local PC than on a remote VM you provisioned with Azure Machine Learning. Your local PC typically has a page file enabled, which lets you load more data than fits in physical memory; that is, your hard drive acts as an extension of your RAM. Currently, Azure Machine Learning VMs run without a page file, so they can only load as much data as the available physical RAM allows.

For compute-heavy jobs, we recommend picking a larger VM to improve processing speeds.

Learn more about the available VM series and sizes for Azure Machine Learning.

For RAM specifications, see the corresponding VM series pages, such as the Dv2-Dsv2 series.

Minimize CPU workloads

If you can't add more RAM to your machine, you can apply the following techniques to help minimize CPU workloads and optimize processing times. These recommendations apply to both single and distributed systems.

  • Compression: Use a different representation for your data that uses less memory and doesn't significantly affect the results of your calculation. Example: instead of storing entries as strings of about 10 bytes or more per entry, store them as booleans (True or False), which you can store in 1 byte.

  • Chunking: Load data into memory in subsets (chunks), processing one subset at a time or multiple subsets in parallel. This method works best if you need to process all the data but don't need to load it all into memory at once. Example: instead of processing a full year's worth of data at once, load and process the data one month at a time.

  • Indexing: Apply and use an index, a summary that tells you where to find the data you care about. Indexing is useful when you only need a subset of the data instead of the full set. Example: if you have a full year's worth of sales data sorted by month, an index helps you quickly find the month you wish to process.
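All three techniques can be sketched with pandas. This is a minimal illustration on a toy dataset; the column names and values are assumptions, not data from the article.

```python
import io
import pandas as pd

# Compression: store a repeated "True"/"False" string column (10+ bytes
# per entry) as a boolean column (1 byte per entry).
flags = pd.Series(["True", "False", "True", "False"] * 1000)
as_bool = flags == "True"
print(as_bool.memory_usage(deep=True) < flags.memory_usage(deep=True))

# Chunking: read a CSV a few rows at a time instead of all at once.
csv_data = io.StringIO("month,sales\n1,100\n2,150\n3,90\n4,120\n")
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += int(chunk["sales"].sum())
print(total)

# Indexing: an index on "month" lets you look up one month directly
# instead of scanning every row.
sales = pd.DataFrame({"month": [1, 2, 3, 4], "sales": [100, 150, 90, 120]})
by_month = sales.set_index("month")
print(int(by_month.loc[3, "sales"]))
```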

Scale data processing

If the previous recommendations aren't enough, and you can't get a virtual machine that fits your data, you can:

  • Use a framework like Spark or Dask to process the data 'out of memory'. With this option, the dataframe is loaded into RAM and processed partition by partition, with the final result gathered at the end.

  • Scale out to a cluster using a distributed framework. With this option, data processing loads are split up and processed on multiple CPUs working in parallel, with the final result gathered at the end.

Based on your code preference or data size, the following recommendations cover distributed frameworks that are integrated with Azure Machine Learning.

  • If you're familiar with pandas: Modin or Dask dataframe
  • If you prefer Spark: PySpark
  • For data less than 1 GB: pandas locally or on a remote Azure Machine Learning compute instance
  • For data larger than 10 GB: move to a cluster using Ray, Dask, or Spark

You can create Dask clusters on an Azure ML compute cluster with the dask-cloudprovider package, or run Dask locally on a compute instance.

Next steps