对 Azure 上 SQL Server 中的数据进行采样Sample data in SQL Server on Azure

本文介绍了如何使用 SQL 或 Python 编程语言对 Azure 上的 SQL Server 中存储的数据进行采样。This article shows how to sample data stored in SQL Server on Azure using either SQL or the Python programming language. 还介绍了如何通过将采样数据保存到文件、上传到 Azure blob,然后读取到 Azure 机器学习工作室,将数据移至 Azure 机器学习中。It also shows how to move sampled data into Azure Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading it into Azure Machine Learning Studio.

Python 采样使用要连接到 Azure 上 SQL Server 的 pyodbc ODBC 和 Pandas 库进行采样。The Python sampling uses the pyodbc ODBC library to connect to SQL Server on Azure and the Pandas library to do the sampling.

备注

本文档中的示例 SQL 代码假设该数据在 Azure 上的 SQL Server 中。The sample SQL code in this document assumes that the data is in a SQL Server on Azure. 如果不存在,请参阅“将数据移动到 Azure 上的 SQL Server”文章,获取有关如何将数据移动到 Azure 上的 SQL Server 的说明。If it is not, refer to Move data to SQL Server on Azure article for instructions on how to move your data to SQL Server on Azure.

为什么对数据进行采样?Why sample your data? 如果计划要分析的数据集很大,通常最好是对数据进行向下采样,以将数据减至较小但具备代表性且更易于管理的规模。If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. 采样有利于数据理解、探索和功能设计。Sampling facilitates data understanding, exploration, and feature engineering. 它在团队数据科学过程 (TDSP) 中的作用是启用数据处理功能和机器学习模型的快速原型设计。Its role in the Team Data Science Process (TDSP) is to enable fast prototyping of the data processing functions and machine learning models.

此采样任务是团队数据科学流程 (TDSP) 中的一个步骤。This sampling task is a step in the Team Data Science Process (TDSP).

使用 SQLUsing SQL

本部分介绍了几种使用 SQL 针对数据库中的数据执行简单随机采样的方法。This section describes several methods using SQL to perform simple random sampling against the data in the database. 请根据数据大小及其分发方式选择一种方法。Choose a method based on your data size and its distribution.

以下两项显示了如何使用 SQL Server 中的 newid 执行采样。The following two items show how to use newid in SQL Server to perform the sampling. 选择的方法取决于想要进行采样的随机程度(假设以下示例代码中的 pk_id 是自动生成的主密钥)。The method you choose depends on how random you want the sample to be (pk_id in the following sample code is assumed to be an autogenerated primary key).

  1. 不太严格的随机采样Less strict random sample

    select  * from <table_name> where <primary_key> in 
    (select top 10 percent <primary_key> from <table_name> order by newid())
    
  2. 更随机的采样More random sample

    SELECT * FROM <table_name>
    WHERE 0.1 >= CAST(CHECKSUM(NEWID(), <primary_key>) & 0x7fffffff AS float)/ CAST (0x7fffffff AS int)
    

Tablesample 也可用于数据采样。Tablesample can be used for sampling the data as well. 如果数据大小较大(假设不同页面上的数据各不相关)并且想要在一个合理时间完成查询,此选项可能更好。This option may be a better approach if your data size is large (assuming that data on different pages is not correlated) and for the query to complete in a reasonable time.

SELECT *
FROM <table_name> 
TABLESAMPLE (10 PERCENT)

备注

可以通过将采样数据存储在新表中从其浏览和生成功能You can explore and generate features from this sampled data by storing it in a new table

连接到 Azure 机器学习Connecting to Azure Machine Learning

可直接在 Azure 机器学习导入数据模块中使用上述采样查询,对数据进行联机低采样并将其引入 Azure 机器学习试验。You can directly use the sample queries above in the Azure Machine Learning Import Data module to down-sample the data on the fly and bring it into an Azure Machine Learning experiment. 使用读取器模块读取采样数据的屏幕截图如下所示:A screenshot of using the reader module to read the sampled data is shown here:

读取器 SQL

使用 Python 编程语言Using the Python programming language

本部分演示了如何使用 pyodbc 库建立 ODBC 与 Python 中 SQL Server 数据库的连接。This section demonstrates using the pyodbc library to establish an ODBC connect to a SQL server database in Python. 数据库连接字符串如下所示:(将 servername、dbname、username 和 password 替换为配置):The database connection string is as follows: (replace servername, dbname, username, and password with your configuration):

#Set up the SQL Azure connection
import pyodbc    
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

Python 中的 Pandas 库提供一组丰富的数据结构,以及针对 Python 编程的数据操作的数据分析工具。The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation for Python programming. 以下代码将 Azure SQL 数据库表中的 0.1% 数据采样读取到 Pandas 数据中:The following code reads a 0.1% sample of the data from a table in Azure SQL Database into a Pandas data:

import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name> tablesample (0.1 percent)''', conn)

现在,可以在 Pandas 数据帧中处理采样的数据。You can now work with the sampled data in the Pandas data frame.

连接到 Azure 机器学习Connecting to Azure Machine Learning

可以使用以下示例代码将低采样的数据保存到文件,并将其上传到 Azure blob。You can use the following sample code to save the down-sampled data to a file and upload it to an Azure blob. 可以使用导入数据模块将 blob 中的数据直接读取到 Azure 机器学习试验。The data in the blob can be directly read into an Azure Machine Learning Experiment using the Import Data module. 步骤如下:The steps are as follows:

  1. 将 Pandas 数据帧写入本地文件Write the pandas data frame to a local file

    dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
    
  2. 将本地文件上传到 Azure BlobUpload local file to Azure blob

    from azure.storage import BlobService
    import tables
    
    STORAGEACCOUNTNAME= <storage_account_name>
    LOCALFILENAME= <local_file_name>
    STORAGEACCOUNTKEY= <storage_account_key>
    CONTAINERNAME= <container_name>
    BLOBNAME= <blob_name>
    
    output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)    
    localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory
    
    try:
    
    #perform upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
    
    except:            
        print ("Something went wrong with uploading blob:"+BLOBNAME)
    
  3. 使用 Azure 机器学习导入数据模块从 Azure blob 读取数据,如下方屏幕截图所示:Read data from Azure blob using Azure Machine Learning Import Data module as shown in the following screen grab:

blob 读取器

运行中的团队数据科学过程示例The Team Data Science Process in Action example

有关使用公用数据集的团队数据科学过程的演练示例,请参阅运行中的团队数据科学过程:使用 SQL ServerTo walkthrough an example of the Team Data Science Process a using a public dataset, see Team Data Science Process in Action: using SQL Server.