Quickstart: Run a Spark job on Azure Databricks using the Azure portal

In this quickstart, you use the Azure portal to create an Azure Databricks workspace with an Apache Spark cluster. You run a job on the cluster and use custom charts to produce real-time reports from Boston safety data.


  • Azure subscription - create one. This tutorial cannot be completed with an Azure free trial subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure trial account. Then, remove the spending limit and request a quota increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks DBUs for 14 days.

  • Sign in to the Azure portal.

Create a Spark cluster in Databricks


To use a free account to create the Azure Databricks cluster, go to your profile and change your subscription to pay-as-you-go before creating the cluster. For more information, see Azure trial account.

  1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.

  2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster.

    Databricks on Azure

  3. In the New cluster page, provide the values to create a cluster.

    Create Databricks Spark cluster on Azure

    Accept all the default values other than the following:

    • Enter a name for the cluster.

    • For this article, create a cluster with a 5.X, 6.X, or 7.X runtime.

    • Make sure you select the Terminate after __ minutes of inactivity checkbox. Provide a duration (in minutes) after which an idle cluster is terminated.

      Select Create cluster. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

For more information on creating clusters, see Create a Spark cluster in Azure Databricks.
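The same cluster settings can also be submitted programmatically through the Databricks Clusters REST API (POST to /api/2.0/clusters/create). The sketch below only builds the request payload; the workspace URL, runtime version string, and node type are placeholder assumptions, not values taken from this article.

```python
import json

# Hypothetical workspace URL -- replace with your own, and authenticate the
# actual POST with an "Authorization: Bearer <personal-access-token>" header.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"

def build_cluster_payload(name: str, idle_minutes: int = 30) -> dict:
    """Build a Clusters API 2.0 create-cluster payload with auto-termination."""
    return {
        "cluster_name": name,
        "spark_version": "7.3.x-scala2.12",   # a 7.X runtime, as in this article
        "node_type_id": "Standard_DS3_v2",    # placeholder VM size
        "num_workers": 2,
        # Mirrors the "Terminate after __ minutes of inactivity" checkbox.
        "autotermination_minutes": idle_minutes,
    }

payload = build_cluster_payload("quickstart-cluster", idle_minutes=120)
print(json.dumps(payload, indent=2))
# To submit, POST this JSON to f"{DATABRICKS_HOST}/api/2.0/clusters/create".
```

Setting autotermination_minutes in the payload gives you the same idle-shutdown safety net as the checkbox in the portal UI.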

Run a Spark SQL job

Perform the following tasks to create a notebook in Databricks, configure the notebook to read data from Azure Open Datasets, and then run a Spark SQL job on the data.

  1. In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.

    Create notebook in Databricks

  2. In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster that you created earlier.

    Create notebook in Databricks

    Select Create.

  3. In this step, create a Spark DataFrame with Boston Safety Data from Azure Open Datasets, and use SQL to query the data.

    The following command sets the Azure storage access information. Paste this PySpark code into the first cell and use Shift+Enter to run the code.

    # Azure Open Datasets storage account hosting the Boston safety data,
    # plus a read-only (sp=rl) SAS token for the container.
    blob_account_name = "azureopendatastorage"
    blob_container_name = "citydatacontainer"
    blob_relative_path = "Safety/Release/city=Boston"
    blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"
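The SAS token above is just a query string whose parameters spell out the grant: st and se are the start and expiry times, sp=rl means read and list permissions, and sr=c scopes it to the container. A quick, Spark-free way to inspect it with the standard library:

```python
from urllib.parse import parse_qs

# The SAS token from the cell above.
blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"

# Strip the leading '?' and decode the percent-encoded parameters.
params = {k: v[0] for k, v in parse_qs(blob_sas_token.lstrip('?')).items()}
print('permissions:', params['sp'])   # 'rl' -> read + list
print('valid until:', params['se'])   # expiry timestamp
```

Inspecting sp and se this way is a handy sanity check when a storage read fails with an authorization error.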

    The following command allows Spark to read from Blob storage remotely. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    # Build the wasbs:// URL and register the SAS token with Spark for this
    # container (Azure China blob endpoint).
    wasbs_path = 'wasbs://%s@%s.blob.core.chinacloudapi.cn/%s' % (blob_container_name, blob_account_name, blob_relative_path)
    spark.conf.set('fs.azure.sas.%s.%s.blob.core.chinacloudapi.cn' % (blob_container_name, blob_account_name), blob_sas_token)
    print('Remote blob path: ' + wasbs_path)

    The following command creates a DataFrame and registers it as a temporary view named source. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    df = spark.read.parquet(wasbs_path)
    print('Register the DataFrame as a SQL temporary view: source')
    df.createOrReplaceTempView('source')
  4. Run a SQL statement to return the top 10 rows of data from the temporary view called source. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    print('Displaying top 10 rows: ')
    display(spark.sql('SELECT * FROM source LIMIT 10'))
  5. You see a tabular output like the one shown in the following screenshot (only some columns are shown):

    Sample data

  6. You now create a visual representation of this data to show how many safety events are reported using the Citizens Connect App and City Worker App instead of other sources. From the bottom of the tabular output, select the Bar chart icon, and then click Plot Options.

    Create bar chart

  7. In Customize Plot, drag and drop values as shown in the screenshot.

    Customize pie chart

    • Set Keys to source.

    • Set Values to <\id>.

    • Set Aggregation to COUNT.

    • Set Display type to Pie chart.

      Click Apply.
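The pie chart is just a COUNT of rows grouped by the source column. The same aggregation can be run as Spark SQL in a notebook cell; the pure-Python sketch below (on a made-up sample, not the real dataset) shows what that grouping computes:

```python
from collections import Counter

# Made-up sample of the 'source' column, for illustration only.
sample_sources = [
    "Citizens Connect App", "Citizens Connect App", "City Worker App",
    "Constituent Call", "Citizens Connect App", "City Worker App",
]

# Count events per reporting source -- the same tally the pie chart shows.
events_by_source = Counter(sample_sources)
print(events_by_source.most_common())

# In the notebook, the equivalent query against the 'source' view would be:
# display(spark.sql('SELECT source, COUNT(*) AS events FROM source GROUP BY source ORDER BY events DESC'))
```

Running the aggregation in SQL instead of relying on the chart's built-in COUNT is useful when you want the numbers themselves, not just the visualization.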

Clean up resources

After you have finished the article, you can terminate the cluster. To do so, from the left pane of the Azure Databricks workspace, select Clusters. For the cluster you want to terminate, move the cursor over the ellipsis under the Actions column, and select the Terminate icon.

Stop a Databricks cluster

If you do not terminate the cluster manually, it stops automatically, provided you selected the Terminate after __ minutes of inactivity checkbox when you created the cluster: once the cluster has been inactive for the specified time, it shuts itself down.

Next steps

In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data from Azure Open Datasets. You can also look at Spark data sources to learn how to import data from other data sources into Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load data) using Azure Databricks.