Process Azure blob data with advanced analytics

This document covers exploring data and generating features from data stored in Azure Blob storage.

Load the data into a Pandas data frame

In order to explore and manipulate a dataset, it must be downloaded from the blob source to a local file, which can then be loaded into a Pandas data frame. Here are the steps to follow for this procedure:

  1. Download the data from an Azure blob with the following sample Python code using the blob service. Replace the variables in the code below with your specific values:

     import time

     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME= <storage_account_name>
     STORAGEACCOUNTKEY= <storage_account_key>
     LOCALFILENAME= <local_file_name>
     CONTAINERNAME= <container_name>
     BLOBNAME= <blob_name>

     #download from blob
     t1=time.time()
     blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
     blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
     t2=time.time()
     print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))
    
  2. Read the data into a Pandas data frame from the downloaded file.

     import pandas as pd

     #LOCALFILENAME is the file path from the download step
     dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    

Now you are ready to explore the data and generate features on this dataset.
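As a minimal, self-contained sketch of the load step, a locally written sample file can stand in for the downloaded blob; the file name and columns here are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Write a tiny sample file to stand in for the downloaded blob (hypothetical data)
LOCALFILENAME = os.path.join(tempfile.gettempdir(), "sample_blob_data.csv")
with open(LOCALFILENAME, "w") as f:
    f.write("id,value\n1,10\n2,20\n3,30\n")

# Read the downloaded file into a Pandas data frame, as in step 2
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
print(dataframe_blobdata.shape)
```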

Data Exploration

Here are a few examples of ways to explore data using Pandas:

  1. Inspect the number of rows and columns:

     print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    
  2. Inspect the first or last few rows in the dataset as below:

     dataframe_blobdata.head(10)
    
     dataframe_blobdata.tail(10)
    
  3. Check the data type each column was imported as, using the following sample code:

     for col in dataframe_blobdata.columns:
         print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    
  4. Check the basic stats for the columns in the data set as follows:

     dataframe_blobdata.describe()
    
  5. Look at the number of entries for each column value as follows:

     dataframe_blobdata['<column_name>'].value_counts()
    
  6. Count missing values versus the actual number of entries in each column using the following sample code:

     miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
     print(miss_num)
    
  7. If you have missing values for a specific column in the data, you can drop them as follows:

     dataframe_blobdata_noNA = dataframe_blobdata.dropna()
     dataframe_blobdata_noNA.shape
    

    Another way to replace missing values is with the mode function:

     dataframe_blobdata_mode = dataframe_blobdata.fillna({'<column_name>':dataframe_blobdata['<column_name>'].mode()[0]})        
    
  8. Create a histogram plot using a variable number of bins to plot the distribution of a variable:

     import numpy as np

     dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

     np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)
    
  9. Look at correlations between variables using a scatterplot or using the built-in correlation function:

     import matplotlib.pyplot as plt

     #relationship between column_a and column_b using scatter plot
     plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])

     #correlation between column_a and column_b
     dataframe_blobdata[['<column_a>', '<column_b>']].corr()
    
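The exploration calls above can be exercised end to end on a small illustrative data frame; the column names trip_distance and passenger_count are hypothetical stand-ins:

```python
import pandas as pd

# Small illustrative data frame with one missing value (made-up columns)
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, 0.5, 3.4, None],
    "passenger_count": [1, 2, 1, 1, 3],
})

print(df.shape)                               # rows and columns
print(df.head(2))                             # first rows
print(df.dtypes)                              # imported data types
print(df.describe())                          # basic stats
print(df["passenger_count"].value_counts())   # entries per value

# Missing values per column
miss_num = df.shape[0] - df.count()
print(miss_num)

# Drop rows with missing values, or fill them with the column mode
df_noNA = df.dropna()
df_mode = df.fillna({"trip_distance": df["trip_distance"].mode()[0]})
```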

Feature Generation

We can generate features using Python as follows:

Indicator value based Feature Generation

Categorical features can be created as follows:

  1. Inspect the distribution of the categorical column:

     dataframe_blobdata['<categorical_column>'].value_counts()
    
  2. Generate indicator values for each of the column values:

     #generate the indicator column
     dataframe_blobdata_identity = pd.get_dummies(dataframe_blobdata['<categorical_column>'], prefix='<categorical_column>_identity')
    
  3. Join the indicator column with the original data frame:

     #Join the dummy variables back to the original data frame
     dataframe_blobdata_with_identity = dataframe_blobdata.join(dataframe_blobdata_identity)
    
  4. Remove the original variable itself:

     #Remove the original column rate_code in df1_with_dummy
     dataframe_blobdata_with_identity.drop('<categorical_column>', axis=1, inplace=True)
    
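Steps 1-4 above can be sketched together as one runnable example; the column names rate_code and fare are hypothetical:

```python
import pandas as pd

# Illustrative frame with one categorical column (hypothetical names)
df = pd.DataFrame({"rate_code": ["A", "B", "A", "C"], "fare": [5.0, 7.5, 6.0, 9.0]})

# 1. Distribution of the categorical column
print(df["rate_code"].value_counts())

# 2. Indicator (dummy) columns
dummies = pd.get_dummies(df["rate_code"], prefix="rate_code_identity")

# 3. Join the indicators back to the original frame, then 4. drop the original column
df_with_identity = df.join(dummies)
df_with_identity.drop("rate_code", axis=1, inplace=True)
print(df_with_identity.columns.tolist())
```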

Binning Feature Generation

For generating binned features, we proceed as follows:

  1. Add a sequence of columns to bin a numeric column:

     bins = [0, 1, 2, 4, 10, 40]
     dataframe_blobdata_bin_id = pd.cut(dataframe_blobdata['<numeric_column>'], bins)
    
  2. Convert the binning to a sequence of boolean variables:

     dataframe_blobdata_bin_bool = pd.get_dummies(dataframe_blobdata_bin_id, prefix='<numeric_column>')
    
  3. Finally, join the dummy variables back to the original data frame:

     dataframe_blobdata_with_bin_bool = dataframe_blobdata.join(dataframe_blobdata_bin_bool)    
    
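The three binning steps can be sketched together; the column name trip_distance is hypothetical, and the bin edges reuse the values shown above:

```python
import pandas as pd

# Illustrative numeric column (hypothetical name and values)
df = pd.DataFrame({"trip_distance": [0.5, 1.5, 3.0, 8.0, 25.0]})

# 1. Bin the numeric column using fixed edges
bins = [0, 1, 2, 4, 10, 40]
bin_id = pd.cut(df["trip_distance"], bins)

# 2. One boolean indicator column per bin
bin_bool = pd.get_dummies(bin_id, prefix="trip_distance")

# 3. Join the indicators back to the original frame
df_with_bin_bool = df.join(bin_bool)
print(df_with_bin_bool.shape)
```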

Writing data back to Azure blob and consuming in Azure Machine Learning

After you have explored the data and created the necessary features, you can upload the data (sampled or featurized) to an Azure blob and consume it in Azure Machine Learning using the following steps. Note that additional features can also be created in Azure Machine Learning Studio (classic).

  1. Write the data frame to a local file:

     import os

     dataframe.to_csv(os.path.join(os.getcwd(),LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
    
  2. Upload the data to Azure blob as follows:

     import os

     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME= <storage_account_name>
     LOCALFILENAME= <local_file_name>
     STORAGEACCOUNTKEY= <storage_account_key>
     CONTAINERNAME= <container_name>
     BLOBNAME= <blob_name>

     output_blob_service=BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
     localfileprocessed = os.path.join(os.getcwd(),LOCALFILENAME) #assuming file is in current working directory

     try:
         #perform upload
         output_blob_service.put_block_blob_from_path(CONTAINERNAME,BLOBNAME,localfileprocessed)
     except:
         print("Something went wrong with uploading blob: "+BLOBNAME)
    
  3. Now the data can be read from the blob using the Azure Machine Learning Import Data module, as shown in the screenshot below:

Blob reader
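A local round trip of the write step (to_csv with tab separation, then reading the file back, without the Azure upload) can be sketched as:

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame and a made-up local file name
dataframe = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
LOCALFILENAME = "processed_data.tsv"

# Write the frame as tab-separated UTF-8, as in step 1 above
path = os.path.join(tempfile.gettempdir(), LOCALFILENAME)
dataframe.to_csv(path, sep='\t', encoding='utf-8', index=False)

# Read it back to confirm the round trip preserves the data
roundtrip = pd.read_csv(path, sep='\t')
print(roundtrip.equals(dataframe))
```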