Explore data in Azure blob storage with pandas

This article shows how to explore data stored in an Azure blob container using the pandas Python package.

This task is a step in the Team Data Science Process.

Prerequisites

This article assumes that you have:

Load the data into a pandas DataFrame

To explore and manipulate a dataset, first download it from the blob source to a local file, then load that file into a pandas DataFrame. Follow these steps:

  1. Download the data from the Azure blob using the blob service, as in the following Python sample. Replace the placeholder values in the code with your own:

    import time

    import pandas as pd
    # BlockBlobService is part of the legacy azure-storage-blob SDK (versions before 12).
    from azure.storage.blob import BlockBlobService

    # Replace each placeholder string with your own value.
    STORAGEACCOUNTNAME = '<storage_account_name>'
    STORAGEACCOUNTKEY = '<storage_account_key>'
    LOCALFILENAME = '<local_file_name>'
    CONTAINERNAME = '<container_name>'
    BLOBNAME = '<blob_name>'

    # Download the blob to a local file and time the transfer.
    t1 = time.time()
    blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
    t2 = time.time()
    print("It takes %s seconds to download %s" % (t2 - t1, BLOBNAME))
    
  2. Read the data into a pandas DataFrame from the downloaded file.

    # LOCALFILENAME is the path of the downloaded file
    dataframe_blobdata = pd.read_csv(LOCALFILENAME)
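
As an alternative to the two steps above, the blob contents can be parsed in memory instead of going through a local file. The following is a minimal sketch, not the method this article prescribes: it assumes the newer azure-storage-blob package (version 12 or later), where `BlobServiceClient` takes the place of `BlockBlobService`. The helper names `download_blob_bytes` and `load_csv_bytes` and the sample bytes are hypothetical, introduced only for illustration.

```python
import io

import pandas as pd


def download_blob_bytes(account_name, account_key, container_name, blob_name):
    """Return a blob's contents as bytes (assumes azure-storage-blob >= 12)."""
    # Imported lazily so the rest of this sketch runs without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url=f"https://{account_name}.blob.core.windows.net",
        credential=account_key,
    )
    blob_client = service.get_blob_client(container=container_name, blob=blob_name)
    return blob_client.download_blob().readall()


def load_csv_bytes(data):
    """Parse CSV bytes into a DataFrame without writing a local file."""
    return pd.read_csv(io.BytesIO(data))


# Stand-in for downloaded blob contents (made-up sample data).
sample_bytes = b"id,value\n1,10.5\n2,20.0\n3,\n"
dataframe_sample = load_csv_bytes(sample_bytes)
print(dataframe_sample.shape)
```

With a real storage account, the result of `download_blob_bytes(...)` would be passed to `load_csv_bytes` in place of `sample_bytes`. Keeping the data in memory avoids a temporary file, but for large blobs the local-file approach shown above may be more practical.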
    

You are now ready to explore the data and generate features from this dataset.

Examples of data exploration using pandas

Here are a few examples of ways to explore data using pandas:

  1. Inspect the number of rows and columns:

    print('the size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    
  2. Inspect the first or last few rows of the dataset:

    dataframe_blobdata.head(10)
    
    dataframe_blobdata.tail(10)
    
  3. Check the data type each column was imported as:

    for col in dataframe_blobdata.columns:
        print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    
  4. Check the basic statistics for the columns in the dataset:

    dataframe_blobdata.describe()
    
  5. Look at the number of entries for each distinct value in a column:

    dataframe_blobdata['<column_name>'].value_counts()
    
  6. Count the missing values in each column by subtracting the number of non-null entries from the total number of rows:

    miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
    print(miss_num)
    
  7. If the data has missing values, you can drop the rows that contain them as follows. Called without arguments, dropna() removes every row with a missing value in any column, not only in a specific one:

    dataframe_blobdata_noNA = dataframe_blobdata.dropna()
    dataframe_blobdata_noNA.shape
    

Another way to handle missing values is to replace them with the column's mode (its most frequent value):

    dataframe_blobdata_mode = dataframe_blobdata.fillna(
        {'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})

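
The missing-value steps above can be tried on a small made-up DataFrame to see what each call does. This is an illustrative sketch only: the column names `color` and `size` and all values are invented, and the `subset=` variant is shown as one possible way to limit the drop to a single column.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both columns.
df = pd.DataFrame({
    "color": ["red", "blue", None, "red", "red"],
    "size": [1.0, np.nan, 3.0, 4.0, np.nan],
})

# Missing entries per column: total rows minus non-null count.
miss_num = df.shape[0] - df.count()
print(miss_num)

# dropna() removes every row that has a missing value in any column.
df_no_na = df.dropna()
print(df_no_na.shape)

# subset= restricts the check to the listed columns.
df_color_known = df.dropna(subset=["color"])
print(df_color_known.shape)

# Fill gaps in 'color' with its most frequent value; 'size' is left untouched.
df_mode = df.fillna({"color": df["color"].mode()[0]})
print(df_mode["color"].tolist())
```

Dropping rows is simple but discards information from the other columns, while filling with the mode keeps every row at the cost of overstating the most common value.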
  8. Plot the distribution of a variable, using a bar chart of value counts for a categorical column or a histogram with a chosen number of bins for a numeric column:

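    import numpy as np

    # Bar chart of how often each distinct value occurs
    dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')
    # Histogram of log(x + 1) using 50 bins
    np.log(dataframe_blobdata['<column_name>']+1).hist(bins=50)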
    
  9. Look at the relationship between variables using a scatter plot or the built-in correlation function:

    import matplotlib.pyplot as plt

    # relationship between column_a and column_b using a scatter plot
    plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])
    
    # correlation between column_a and column_b
    dataframe_blobdata[['<column_a>', '<column_b>']].corr()
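
The inspection steps in this list can be run end to end on a small made-up DataFrame, which is a quick way to check what each call returns before pointing it at real blob data. The sketch below is illustrative only: the columns `category`, `x`, and `y` and their values are invented, and `np.histogram` stands in for the plotted histogram so that the example needs no display.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for dataframe_blobdata.
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Number of rows and columns.
n_rows, n_cols = df.shape
print("the size of the data is: %d rows and %d columns" % (n_rows, n_cols))

# dtypes lists every column's type at once, without an explicit loop.
print(df.dtypes)

# Frequency of each distinct value, most common first.
counts = df["category"].value_counts()
print(counts)

# Bin counts of a log-transformed column, computed without plotting.
# np.log1p(x) equals log(x + 1).
bin_counts, bin_edges = np.histogram(np.log1p(df["x"]), bins=3)
print(bin_counts)

# Pearson correlation; y is exactly 2 * x here, so the coefficient is 1.
corr_xy = df[["x", "y"]].corr().loc["x", "y"]
print(round(corr_xy, 6))
```

Computing the bin counts numerically is useful in scripts or on headless machines; in a notebook, the `.hist()` and `plt.scatter()` calls from the steps above give the visual equivalent.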