在 Azure 上处理 SQL Server 虚拟机中的数据Process Data in SQL Server Virtual Machine on Azure

本文档介绍如何浏览数据,并针对存储在 Azure 的 SQL Server VM 中的数据生成功能。This document covers how to explore data and generate features for data stored in a SQL Server VM on Azure. 可通过使用 SQL 或 Python 等编程语言的数据整理来完成此目标。This goal may be completed by data wrangling using SQL or by using a programming language like Python.

备注

本文档中的示例 SQL 语句假定数据在 SQL Server 中。The sample SQL statements in this document assume that data is in SQL Server. 如果不是这样,请参阅云数据科学进程映射,了解如何将数据移到 SQL Server。If it isn't, refer to the cloud data science process map to learn how to move your data to SQL Server.

使用 SQLUsing SQL

我们介绍本部分中使用 SQL 的以下数据整理任务:We describe the following data wrangling tasks in this section using SQL:

  1. 数据浏览Data Exploration
  2. 功能生成Feature Generation

数据浏览Data Exploration

以下是几个可用于浏览存储在 SQL Server 中的数据的示例 SQL 脚本。Here are a few sample SQL scripts that can be used to explore data stores in SQL Server.

备注

有关实际的示例,可以使用 NYC 出租车数据集,并参考名为使用 IPython Notebook 和 SQL Server 的 NYC 数据整理的IPNB,以获取端到端的演练。For a practical example, you can use the NYC Taxi dataset and refer to the IPNB titled NYC Data wrangling using IPython Notebook and SQL Server for an end-to-end walk-through.

  1. 获取每个工作日观测值的计数Get the count of observations per day

    SELECT CONVERT(date, <date_columnname>) as date, count(*) as c from <tablename> group by CONVERT(date, <date_columnname>)

  2. 获取某分类列中的级别Get the levels in a categorical column

    select distinct <column_name> from <databasename>

  3. 获取两个分类列组合中的级别数Get the number of levels in combination of two categorical columns

    select <column_a>, <column_b>,count(*) from <tablename> group by <column_a>, <column_b>

  4. 获取数字列的分布Get the distribution for numerical columns

    select <column_name>, count(*) from <tablename> group by <column_name>

功能生成Feature Generation

在本部分中,介绍使用 SQL 生成功能的方法:In this section, we describe ways of generating features using SQL:

  1. 生成基于计数的功能Count based Feature Generation
  2. 生成装箱功能Binning Feature Generation
  3. 从单个列推出功能Rolling out the features from a single column

备注

一旦生成其他功能,可将它们作为列添加到现有表格,或使用其他功能和主键来创建可与原始表结合的新表格。Once you generate additional features, you can either add them as columns to the existing table or create a new table with the additional features and primary key, that can be joined with the original table.

基于计数生成功能Count based Feature Generation

以下示例演示两种生成计数功能的方法。The following examples demonstrate two ways of generating count features. 第一种方法是使用条件求和,第二种方法是使用“where”子句。The first method uses conditional sum and the second method uses the 'where' clause. 之后这些结果可与原始表联接(使用主键列),这样就会具有原始数据的计数功能。These results may then be joined with the original table (using primary key columns) to have count features alongside the original data.

select <column_name1>,<column_name2>,<column_name3>, COUNT(*) as Count_Features from <tablename> group by <column_name1>,<column_name2>,<column_name3> 

select <column_name1>,<column_name2> , sum(1) as Count_Features from <tablename> 
where <column_name3> = '<some_value>' group by <column_name1>,<column_name2> 

生成装箱功能Binning Feature Generation

下面的示例演示如何通过将可用作函数的数值列装箱(使用 5 箱),从而生成装箱函数:The following example shows how to generate binned features by binning (using five bins) a numerical column that can be used as a feature instead:

SELECT <column_name>, NTILE(5) OVER (ORDER BY <column_name>) AS BinNumber from <tablename>

从单个列推出功能Rolling out the features from a single column

在此部分中,将演示如何在表格中推出单列以生成其他功能。In this section, we demonstrate how to roll out a single column in a table to generate additional features. 该示例假定用户尝试在其中生成功能的表中,具有一个纬度或经度列。The example assumes that there is a latitude or longitude column in the table from which you are trying to generate features.

下面简要介绍纬度/经度位置数据(来自 stackoverflow 的资源如何测量纬度和经度的准确性?)。Here is a brief primer on latitude/longitude location data (resourced from stackoverflow How to measure the accuracy of latitude and longitude?). 在将位置作为一个或多个特征包括在内之前,可以通过本指南来理解:This guidance is useful to understand before including location as one or more features:

  • 符号指示地球上的四个方向:东、南、西、北。The sign tells us whether we are north or south, east or west on the globe.
  • 非零的百位数代表经度,而不是纬度!A nonzero hundreds digit tells us that we're using longitude, not latitude!
  • 十位数提供约 1000 公里的位置。The tens digit gives a position to about 1,000 kilometers. 提供处于哪个大洲或海洋的有用信息。It gives us useful information about what continent or ocean we are on.
  • 个位(十进制度)代表 111 公里以上的(60 海里,约 69 英里)位置。The units digit (one decimal degree) gives a position up to 111 kilometers (60 nautical miles, about 69 miles). 它可让你大致了解所在的州、国家或地区。It can tell you roughly what state, country, or region you're in.
  • 第一个小数位值达 11.1 km:可将相邻的大城市区分开。The first decimal place is worth up to 11.1 km: it can distinguish the position of one large city from a neighboring large city.
  • 第二位小数值达 1.1 km:可将村庄分开。The second decimal place is worth up to 1.1 km: it can separate one village from the next.
  • 第三位小数值达 110 m:可以定大型农业区域或工业园区。The third decimal place is worth up to 110 m: it can identify a large agricultural field or institutional campus.
  • 第四位小数值达 11 m:可识别小块土地。The fourth decimal place is worth up to 11 m: it can identify a parcel of land. 其准确性相当于未更正的、无干扰的 GPS 部件的典型准确性。It is comparable to the typical accuracy of an uncorrected GPS unit with no interference.
  • 第五位小数值达 1.1 m:可将树与树区分开。The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other. 可通过差异更正获得该级别的、商用 GPS 计价单位的准确性。Accuracy to this level with commercial GPS units can only be achieved with differential correction.
  • 第六个小数位值达 0.11 m:可用于详细布局结构、设计景观和修建道路等。The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. 对于追踪冰川和河流的运动,它是不二之选。It should be more than good enough for tracking movements of glaciers and rivers. 可通过差异更正 GPS 等获得以上数值。This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.

位置信息还具有以下特征:分离地区、位置和城市信息。The location information can be featurized as follows, separating out region, location, and city information. 也可以调用按点查找位置上提供的 REST 终结点(如必应地图 API)来获取区域/地区信息。You can also call a REST end point such as Bing Maps API available at Find a Location by Point to get the region/district information.

select 
    <location_columnname>
    ,round(<location_columnname>,0) as l1        
    ,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end     
    ,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end     
    ,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end     
    ,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end     
    ,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end     
    ,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end     
from <tablename>

如前面所述:可以进一步使用上述基于位置的功能来生成其他计数功能。These location-based features can be further used to generate additional count features as described earlier.

提示

可以使用设定语言,以编程方式插入记录。You can programmatically insert the records using your language of choice. 可能需要以块形式插入数据以提高写入效率(有关如何使用 pyodbc 执行此操作的示例,请参阅使用 python 访问 SQLServer 的 HelloWorld 示例)。You may need to insert the data in chunks to improve write efficiency (for an example of how to do this using pyodbc, see A HelloWorld sample to access SQLServer with python). 另一种方法是使用 BCP 实用工具将数据插入数据库中。Another alternative is to insert data in the database using the BCP utility.

连接到 Azure 机器学习Connecting to Azure Machine Learning

新生成的功能可作为列添加到现有表或存储在新表中,也可与原始表结合以进行机器学习。The newly generated feature can be added as a column to an existing table or stored in a new table and joined with the original table for machine learning. 如果已经创建,可使用 Azure 机器学习中的导入数据模块生成或访问功能,如下所示:Features can be generated or accessed if already created, using the Import Data module in Azure Machine Learning as shown below:

azureml 读取器

使用 Python 等编程语言Using a programming language like Python

如果数据位于 SQL Server 中,使用 Python 浏览数据和生成功能类似于使用 Python处理 Azure blob 中的数据,如处理数据科学环境中的 Azure Blob 数据中所述。Using Python to explore data and generate features when the data is in SQL Server is similar to processing data in Azure blob using Python as documented in Process Azure Blob data in your data science environment. 将数据库中的数据加载到 pandas 数据帧,以便进行更多处理。Load the data from the database into a pandas data frame for more processing. 在本部分中,我们记录连接到数据库并将数据加载到的数据帧的过程。We document the process of connecting to the database and loading the data into the data frame in this section.

以下连接字符串格式可用于使用 pyodbc 从 Python 连接到 SQL Server 数据库(具有特定值的替换服务器名、dbname、用户名和密码):The following connection string format can be used to connect to a SQL Server database from Python using pyodbc (replace servername, dbname, username, and password with your specific values):

#Set up the SQL Azure connection
import pyodbc    
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

Python 中的 Pandas 库提供一组丰富的数据结构,以及针对 Python 编程的数据操作的数据分析工具。The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation for Python programming. 下面的代码读取 SQL Server 数据库返回的结果到 Pandas 数据帧:The code below reads the results returned from a SQL Server database into a Pandas data frame:

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from <tablename>''', conn)

现可使用 Pandas 数据帧,如文章处理数据科学环境中的 Azure Blob 数据所述。Now you can work with the Pandas data frame as covered in the article Process Azure Blob data in your data science environment.

运行中的 Azure 数据科学示例Azure Data Science in Action Example

有关使用公用数据集的 Azure 数据科学过程的端到端演练示例,请参阅运行中的 Azure 数据科学过程For an end-to-end walkthrough example of the Azure Data Science Process using a public dataset, see Azure Data Science Process in Action.