使用 SQL 和 Python 在 SQL Server 中为数据创建功能Create features for data in SQL Server using SQL and Python

本文档演示如何在 Azure 上为存储于 SQL Server VM 中的数据生成功能,用于帮助算法更有效地从数据中进行学习。This document shows how to generate features for data stored in a SQL Server VM on Azure that help algorithms learn more efficiently from the data. 可以使用 SQL 或 Python 等编程语言来完成此任务。You can use SQL or a programming language like Python to accomplish this task. 下面演示了这两种方法。Both approaches are demonstrated here.

此任务是团队数据科学过程 (TDSP) 中的一个步骤。This task is a step in the Team Data Science Process (TDSP).

备注

有关实际的示例,可以参阅 NYC 出租车数据集 使用 IPython Notebook 和 SQL Server 处理 NYC 数据获取端到端的演练。For a practical example, you can consult the NYC Taxi dataset and refer to the IPNB titled NYC Data wrangling using IPython Notebook and SQL Server for an end-to-end walk-through.

必备条件Prerequisites

本文假设用户具备以下条件:This article assumes that you have:

使用 SQL 生成特征Feature generation with SQL

在本部分中,介绍使用 SQL 生成功能的方法:In this section, we describe ways of generating features using SQL:

  1. 生成基于计数的功能Count based Feature Generation
  2. 生成装箱功能Binning Feature Generation
  3. 从单个列推出功能Rolling out the features from a single column

备注

一旦生成其他功能,可将它们作为列添加到现有表格,或使用其他功能和主键来创建可与原始表结合的新表格。Once you generate additional features, you can either add them as columns to the existing table or create a new table with the additional features and primary key, that can be joined with the original table.

基于计数生成特征Count based feature generation

本文档演示两种生成计数功能的方法。This document demonstrates two ways of generating count features. 第一种方法是使用条件求和,第二种方法是使用 where 子句。The first method uses conditional sum and the second method uses the 'where` clause. 之后这些新表格可与原始表结合(使用主键列),使其具有原始数据的计数功能。These can then be joined with the original table (using primary key columns) to have count features alongside the original data.

select <column_name1>,<column_name2>,<column_name3>, COUNT(*) as Count_Features from <tablename> group by <column_name1>,<column_name2>,<column_name3>

select <column_name1>,<column_name2> , sum(1) as Count_Features from <tablename>
where <column_name3> = '<some_value>' group by <column_name1>,<column_name2>

生成装箱功能Binning Feature Generation

下面的示例演示如何通过将可用作函数的数值列装箱(使用 5 箱),从而生成装箱函数:The following example shows how to generate binned features by binning (using 5 bins) a numerical column that can be used as a feature instead:

`SELECT <column_name>, NTILE(5) OVER (ORDER BY <column_name>) AS BinNumber from <tablename>`

从单个列推出功能Rolling out the features from a single column

在此部分中,将演示如何在表格中推出单列以生成其他功能。In this section, we demonstrate how to roll out a single column in a table to generate additional features. 该示例假定用户尝试在其中生成功能的表中,具有一个纬度或经度列。The example assumes that there is a latitude or longitude column in the table from which you are trying to generate features.

下面简要介绍纬度/经度位置数据(来自 stackoverflow 的资源https://gis.stackexchange.com/questions/8650/how-to-measure-the-accuracy-of-latitude-and-longitude)。Here is a brief primer on latitude/longitude location data (resourced from stackoverflow https://gis.stackexchange.com/questions/8650/how-to-measure-the-accuracy-of-latitude-and-longitude). 下面是一些有用的信息,可帮助在从字段创建特征之前,了解位置数据:Here are some useful things to understand about location data before creating features from the field:

  • 符号指示我们在地球上的北部还是南部、东部还是西部。The sign indicates whether we are north or south, east or west on the globe.
  • 非零百位数指示使用的是经度,而不是纬度。A nonzero hundreds digit indicates longitude, not latitude is being used.
  • 十位数提供约 1000 公里的位置。The tens digit gives a position to about 1,000 kilometers. 它提供处于哪个大陆或大洋的有用信息。It gives useful information about what continent or ocean we are on.
  • 个位(十进制度)代表 111 公里以上的(60 海里,约 69 英里)位置。The units digit (one decimal degree) gives a position up to 111 kilometers (60 nautical miles, about 69 miles). 它指示大致处于哪一个大邦或国家/地区。It indicates, roughly, what large state or country/region we are in.
  • 第一个小数位值达 11.1 km:可将相邻的大城市区分开。The first decimal place is worth up to 11.1 km: it can distinguish the position of one large city from a neighboring large city.
  • 第二位小数值达 1.1 km:可将村庄分开。The second decimal place is worth up to 1.1 km: it can separate one village from the next.
  • 第三位小数值达 110 m:可以定大型农业区域或工业园区。The third decimal place is worth up to 110 m: it can identify a large agricultural field or institutional campus.
  • 第四位小数值达 11 m:可识别小块土地。The fourth decimal place is worth up to 11 m: it can identify a parcel of land. 其准确性相当于未更正的、无干扰的 GPS 部件的典型准确性。It is comparable to the typical accuracy of an uncorrected GPS unit with no interference.
  • 第五位小数值达 1.1 m:可将树与树区分开。The fifth decimal place is worth up to 1.1 m: it distinguishes trees from each other. 可通过差异更正获得该级别的、商用 GPS 计价单位的准确性。Accuracy to this level with commercial GPS units can only be achieved with differential correction.
  • 第六个小数位值达 0.11 m:可用于详细布局结构、设计景观和修建道路等。The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. 对于追踪冰川和河流的运动,它是不二之选。It should be more than good enough for tracking movements of glaciers and rivers. 可通过差异更正 GPS 等获得以上数值。This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.

位置信息可具有以下特征:分离地区、位置和城市信息。The location information can be featurized by separating out region, location, and city information. 请注意:也可以调用 https://msdn.microsoft.com/library/ff701710.aspx 上的 Bing 地图 API 等 REST 终结点以获取区地区/区域信息。Note that once can also call a REST end point such as Bing Maps API available at https://msdn.microsoft.com/library/ff701710.aspx to get the region/district information.

select
    <location_columnname>
    ,round(<location_columnname>,0) as l1        
    ,l2=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 1 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),1,1) else '0' end     
    ,l3=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 2 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),2,1) else '0' end     
    ,l4=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 3 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),3,1) else '0' end     
    ,l5=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 4 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),4,1) else '0' end     
    ,l6=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 5 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),5,1) else '0' end     
    ,l7=case when LEN (PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1)) >= 6 then substring(PARSENAME(round(ABS(<location_columnname>) - FLOOR(ABS(<location_columnname>)),6),1),6,1) else '0' end     
from <tablename>

如前面所述:可以进一步使用上述基于位置的特征来生成其他计数特征。These location based features can be further used to generate additional count features as described earlier.

提示

可以使用设定语言,以编程方式插入记录。You can programmatically insert the records using your language of choice. 可能需要将数据插入区块以提高写入效率。You may need to insert the data in chunks to improve write efficiency. 下面是如何使用 pyodbc 执行此操作的示例Here is an example of how to do this using pyodbc. 另一种方法是采用 BCP 实用工具将数据插入数据库中Another alternative is to insert data in the database using BCP utility

连接到 Azure 机器学习Connecting to Azure Machine Learning

新生成的功能可作为列添加到现有表或存储在新表中,也可与原始表结合以进行机器学习。The newly generated feature can be added as a column to an existing table or stored in a new table and joined with the original table for machine learning. 如果已经创建,可使用 Azure ML 中的导入数据模块生成或访问生成功能,如下所示:Features can be generated or accessed if already created, using the Import Data module in Azure ML as shown below:

Azure ML 读取器

使用 Python 等编程语言Using a programming language like Python

如果数据位于 SQL Server 中,使用 Python 生成特征类似于使用 Python处理 Azure blob 中的数据。Using Python to generate features when the data is in SQL Server is similar to processing data in Azure blob using Python. 有关比较,请参阅在数据科学环境中处理 Azure Blob 数据For comparison, see Process Azure Blob data in your data science environment. 将数据库中的数据加载到 pandas 数据帧,以便进一步处理。Load the data from the database into a pandas data frame to process it further. 本部分介绍连接到数据库并将数据加载到数据帧的过程。The process of connecting to the database and loading the data into the data frame is documented in this section.

以下连接字符串格式可用于使用 pyodbc 从 Python 连接到 SQL Server 数据库(具有特定值的替换服务器名、dbname、用户名和密码):The following connection string format can be used to connect to a SQL Server database from Python using pyodbc (replace servername, dbname, username, and password with your specific values):

#Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

Python 中的 Pandas 库提供一组丰富的数据结构,以及针对 Python 编程的数据操作的数据分析工具。The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation for Python programming. 以下代码读取从 SQL Server 数据库返回到 Pandas 数据帧的结果:The following code reads the results returned from a SQL Server database into a Pandas data frame:

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select <columnname1>, <columnname2>... from <tablename>''', conn)

现在可根据 Panda 创建 Azure blob 存储数据 主题中的说明来使用 Pandas 数据框架。Now you can work with the Pandas data frame as covered in topics Create features for Azure blob storage data using Panda.