Quickstart: Analyze data with Databricks

In this quickstart, you run an Apache Spark job using Azure Databricks to perform analytics on data stored in a storage account. As part of the Spark job, you analyze radio channel subscription data to gain insights into free/paid usage based on demographics.

Prerequisites

  • An Azure account with an active subscription. Create a trial account.

  • The name of your Azure Data Lake Storage Gen2 storage account. Create an Azure Data Lake Storage Gen2 storage account.

  • The tenant ID, app ID, and password of an Azure service principal with an assigned role of Storage Blob Data Contributor. Create a service principal.

    Important

    Assign the role in the scope of the Data Lake Storage Gen2 storage account. You can assign a role to the parent resource group or subscription, but you'll receive permissions-related errors until those role assignments propagate to the storage account.

Create an Azure Databricks workspace

In this section, you create an Azure Databricks workspace using the Azure portal.

  1. In the Azure portal, select Create a resource, and then create Azure Databricks.

    Databricks on Azure portal

  2. Under Azure Databricks Service, provide the values to create a Databricks workspace.

    Create an Azure Databricks workspace

    Provide the following values:

    Workspace name: Provide a name for your Databricks workspace.
    Subscription: From the drop-down, select your Azure subscription.
    Resource group: Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
    Location: Select China East 2. Feel free to select another public region if you prefer.
    Pricing Tier: Choose between Standard and Premium. For more information on these tiers, see the Databricks pricing page.
  3. Select Pin to dashboard and then select Create.

  4. The account creation takes a few minutes. To monitor the operation status, view the progress bar at the top.

Create a Spark cluster in Databricks

  1. In the Azure portal, go to the Databricks workspace that you created, and then select Launch Workspace.

  2. You are redirected to the Azure Databricks portal. From the portal, select New > Cluster.

    Databricks on Azure

  3. In the New cluster page, provide the values to create a cluster.

    Create Databricks Spark cluster on Azure

    Fill in values for the following fields, and accept the default values for the other fields:

    • Enter a name for the cluster.

    • Make sure you select the Terminate after 120 minutes of inactivity checkbox. Provide a duration (in minutes) after which the cluster is terminated if it isn't being used.

  4. Select Create cluster. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

Create notebook

In this section, you create a notebook in the Azure Databricks workspace and then run code snippets to configure the storage account.

  1. In the Azure portal, go to the Azure Databricks workspace you created, and then select Launch Workspace.

  2. In the left pane, select Workspace. From the Workspace drop-down, select Create > Notebook.

    Create > Notebook menu option in Databricks

  3. In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and then select the Spark cluster that you created earlier.

    Create notebook in Databricks

    Select Create.

  4. Copy and paste the following code block into the first cell, but don't run this code yet.

    spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.chinacloudapi.cn", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.chinacloudapi.cn", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.chinacloudapi.cn", "<appID>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.chinacloudapi.cn", "<password>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.chinacloudapi.cn", "https://login.partner.microsoftonline.cn/<tenant-id>/oauth2/token")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
    
    
  5. In this code block, replace the storage-account-name, appID, password, and tenant-id placeholder values with the values that you collected when you created the service principal. Set the container-name placeholder value to whatever name you want to give the container.

  6. Press the SHIFT + ENTER keys to run the code in this block.
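
Pasting the service principal's password directly into the notebook is fine for a quickstart, but in shared code you'd typically look it up from a Databricks secret scope at run time instead. The following is a minimal sketch, assuming a secret scope named quickstart-scope with a key sp-password has already been created (both names are hypothetical):

    // Hypothetical example: read the service principal secret from a secret
    // scope. The scope "quickstart-scope" and key "sp-password" are assumed
    // to exist; create them beforehand with the Databricks CLI or API.
    val spSecret = dbutils.secrets.get(scope = "quickstart-scope", key = "sp-password")
    spark.conf.set(
      "fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.chinacloudapi.cn",
      spSecret)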

Ingest sample data

Before you begin this section, make sure you've completed the previous sections: the notebook you created must be attached to your running cluster, and the storage account configuration code must have run successfully.

Enter the following code into a notebook cell:

%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json

In the cell, press SHIFT + ENTER to run the code.

Now, in a new cell below this one, enter the following code, and replace the values that appear in brackets with the same values you used earlier:

dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/")

In the cell, press SHIFT + ENTER to run the code.
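
If you want to verify that the copy succeeded, listing the container should show the uploaded file. This is an optional check, using the same placeholders as above:

    // Optional check: small_radio_json.json should appear in the listing.
    display(dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/"))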

Run a Spark SQL job

Perform the following tasks to run a Spark SQL job on the data.

  1. Run a SQL statement to create a table using data from the sample JSON data file, small_radio_json.json. In the following snippet, replace the placeholder values with your container name and storage account name. Using the notebook you created earlier, paste the snippet in a new code cell in the notebook, and then press SHIFT + ENTER.

    %sql
    DROP TABLE IF EXISTS radio_sample_data;
    CREATE TABLE radio_sample_data
    USING json
    OPTIONS (
      path "abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/small_radio_json.json"
    )
    

    Once the command successfully completes, all the data from the JSON file is available as a table in the Databricks cluster.

    The %sql language magic command enables you to run SQL code from the notebook, even if the notebook is of another type. (A Scala DataFrame equivalent of the aggregation built in the steps below is sketched after this list.)

  2. Let's look at a snapshot of the sample JSON data to better understand the query that you run. Paste the following snippet in a code cell and press SHIFT + ENTER.

    %sql
    SELECT * FROM radio_sample_data
    
  3. You see a tabular output like the one shown in the following screenshot (only some columns are shown):

    Sample JSON data

    Among other details, the sample data captures the gender of the audience of a radio channel (column name, gender) and whether their subscription is free or paid (column name, level).

  4. You now create a visual representation of this data to show, for each gender, how many users have free accounts and how many are paid subscribers. From the bottom of the tabular output, click the Bar chart icon, and then click Plot Options.

    Create bar chart

  5. In Customize Plot, drag and drop values as shown in the screenshot.

    Customize Plot screen with the values to drag and drop

    • Set Keys to gender.
    • Set Series groupings to level.
    • Set Values to level.
    • Set Aggregation to COUNT.
  6. Click Apply.

  7. The output shows the visual representation as depicted in the following screenshot:

    Customize bar chart
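
The counts behind the bar chart can also be computed directly in Scala, which is useful if you'd rather read the numbers than a plot. This is a minimal sketch using standard Spark DataFrame operations on the table created earlier:

    // Count rows per (gender, level) pair, mirroring the chart settings:
    // Keys = gender, Series groupings = level, Aggregation = COUNT.
    val radioData = spark.table("radio_sample_data")
    val usageByGender = radioData.groupBy("gender", "level").count()
    display(usageByGender)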

Clean up resources

Once you're finished with this article, you can terminate the cluster. From the Azure Databricks workspace, select Clusters and locate the cluster you want to terminate. Hover your mouse cursor over the ellipsis under the Actions column, and select the Terminate icon.

Stop a Databricks cluster

If you don't manually terminate the cluster, it stops automatically, provided you selected the Terminate after __ minutes of inactivity checkbox when you created the cluster. With that option set, the cluster stops after it has been inactive for the designated amount of time.

Next steps

In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data in a storage account with Data Lake Storage Gen2 enabled.

Advance to the next article to learn how to perform an ETL operation (extract, transform, and load data) using Azure Databricks.