Sample data in Azure blob containers, SQL Server, and Hive tables
The following articles describe how to sample data that is stored in one of three different Azure locations:
- Azure blob container data is sampled by downloading it programmatically and then sampling it with sample Python code.
- SQL Server data is sampled using both SQL and the Python programming language.
- Hive table data is sampled using Hive queries.
This sampling task is a step in the Team Data Science Process (TDSP).
Why sample data?
If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to a smaller but representative and more manageable size. Working with a smaller dataset can make data understanding, exploration, and feature engineering easier. In the Cortana Analytics Process, the role of sampling is to enable fast prototyping of the data processing functions and machine learning models.
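As a rough illustration of down-sampling, the following Python sketch draws a uniform random subset from an in-memory collection of rows. It is not taken from the linked articles: the function name `down_sample`, its `fraction` and `seed` parameters, and the example data are assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import random


def down_sample(rows, fraction, seed=None):
    """Return a uniform random subset holding roughly `fraction` of `rows`.

    Hypothetical helper for illustration only; sampling is without replacement.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = max(1, round(len(rows) * fraction))  # keep at least one row
    return rng.sample(rows, k)


# Example: reduce 100,000 placeholder rows to about 1% of the original size.
rows = list(range(100_000))
sample = down_sample(rows, fraction=0.01, seed=42)
print(len(sample))
```

A uniform random sample like this keeps every row equally likely to be chosen, which is one simple way to stay representative. Whether it is sufficient depends on the data, for example whether rare classes need to be preserved.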