教程:从 Azure Databricks 笔记本查询虚拟网络中的 SQL Server Linux Docker 容器Tutorial: Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook

本教程介绍如何将 Azure Databricks 与虚拟网络中的 SQL Server Linux Docker 容器集成。This tutorial teaches you how to integrate Azure Databricks with a SQL Server Linux Docker container in a virtual network.

在本教程中,你将了解如何执行以下操作:In this tutorial, you learn how to:

  • 将 Azure Databricks 工作区部署到虚拟网络Deploy an Azure Databricks workspace to a virtual network
  • 在公用网络中安装 Linux 虚拟机Install a Linux virtual machine in a public network
  • 安装 DockerInstall Docker
  • 安装 Linux 上的 Microsoft SQL Server docker 容器Install Microsoft SQL Server on Linux docker container
  • 使用 Databricks 笔记本中的 JDBC 查询 SQL ServerQuery the SQL Server using JDBC from a Databricks notebook

先决条件Prerequisites

创建 Linux 虚拟机Create a Linux virtual machine

  1. 在 Azure 门户中,选择“虚拟机”的图标。In the Azure portal, select the icon for Virtual Machines. 然后选择“+ 添加”。Then, select + Add.

    添加新的 Azure 虚拟机

  2. 在“基本信息”选项卡上,选择 Ubuntu Server 18.04 LTS 并将 VM 大小更改为 B2。On the Basics tab, Choose Ubuntu Server 18.04 LTS and change the VM size to B2s. 选择管理员用户名和密码。Choose an administrator username and password.

    新虚拟机配置的“基本信息”选项卡

  3. 导航到“网络”选项卡。选择虚拟网络和包含 Azure Databricks 群集的公共子网。Navigate to the Networking tab. Choose the virtual network and the public subnet that includes your Azure Databricks cluster. 选择“查看 + 创建”选项卡,然后选择“创建”以部署虚拟机 。Select Review + create , then Create to deploy the virtual machine.

    新虚拟机配置的“网络”选项卡

  4. 部署完成后,导航到虚拟机。When the deployment is complete, navigate to the virtual machine. 请注意“概述”中的公共 IP 地址和虚拟网络/子网。Notice the Public IP address and Virtual network/subnet in the Overview. 选择“公共 IP 地址”Select the Public IP Address

    虚拟机概述

  5. 将“分配”更改为“静态”并输入“DNS 名称标签” 。Change the Assignment to Static and enter a DNS name label. 选择“保存”并重启虚拟机。Select Save , and restart the virtual machine.

    公共 IP 地址配置

  6. 在“设置”下选择“网络”选项卡。 Select the Networking tab under Settings. 请注意,在 Azure Databricks 部署过程中创建的网络安全组与虚拟机关联。Notice that the network security group that was created during the Azure Databricks deployment is associated with the virtual machine. 选择“添加入站端口规则”。Select Add inbound port rule.

  7. 添加规则,为 SSH 打开端口 22。Add a rule to open port 22 for SSH. 使用以下设置:Use the following settings:

    设置Setting 建议的值Suggested value 说明Description
    Source IP 地址IP Addresses IP 地址指定此规则将允许或拒绝来自特定源 IP 地址的传入流量。IP Addresses specifies that incoming traffic from a specific source IP Address will be allowed or denied by this rule.
    源 IP 地址Source IP addresses <公共 IP><your public ip> 输入公共 IP 地址。Enter the your public IP address. 可通过访问 bing.com 并搜索“我的 IP”来找到公共 IP 地址。You can find your public IP address by visiting bing.com and searching for "my IP".
    源端口范围Source port ranges * 允许来自任何端口的流量。Allow traffic from any port.
    目标Destination IP 地址IP Addresses IP 地址指定此规则将允许或拒绝来自特定源 IP 地址的传出流量。IP Addresses specifies that outgoing traffic for a specific source IP Address will be allowed or denied by this rule.
    目标 IP 地址Destination IP addresses <VM 公共 IP><your vm public ip> 输入虚拟机的公共 IP 地址。Enter your virtual machine's public IP address. 可在虚拟机的“概述”页找到此地址。You can find this on the Overview page of your virtual machine.
    目标端口范围Destination port ranges 2222 为 SSH 打开端口 22。Open port 22 for SSH.
    优先度Priority 290290 为规则指定优先级。Give the rule a priority.
    名称Name ssh-databricks-tutorial-vmssh-databricks-tutorial-vm 为规则指定名称。Give the rule a name.

    为端口 22 添加入站安全规则

  8. 添加一条规则,以便使用以下设置为 SQL 打开端口 1433:Add a rule to open port 1433 for SQL with the following settings:

    设置Setting 建议的值Suggested value 说明Description
    Source 任意Any 源指定此规则将允许或拒绝来自特定源 IP 地址的传入流量。Source specifies that incoming traffic from a specific source IP Address will be allowed or denied by this rule.
    源端口范围Source port ranges * 允许来自任何端口的流量。Allow traffic from any port.
    目标Destination IP 地址IP Addresses IP 地址指定此规则将允许或拒绝来自特定源 IP 地址的传出流量。IP Addresses specifies that outgoing traffic for a specific source IP Address will be allowed or denied by this rule.
    目标 IP 地址Destination IP addresses <VM 公共 IP><your vm public ip> 输入虚拟机的公共 IP 地址。Enter your virtual machine's public IP address. 可在虚拟机的“概述”页找到此地址。You can find this on the Overview page of your virtual machine.
    目标端口范围Destination port ranges 14331433 为 SQL Server 打开端口 22。Open port 22 for SQL Server.
    优先度Priority 300300 为规则指定优先级。Give the rule a priority.
    名称Name sql-databricks-tutorial-vmsql-databricks-tutorial-vm 为规则指定名称。Give the rule a name.

    为端口 1433 添加入站安全规则

在 Docker 容器中运行 SQL ServerRun SQL Server in a Docker container

  1. 打开用于 Windows 的 Ubuntu,或者其他任何允许你通过 SSH 登录虚拟机的工具。Open Ubuntu for Windows, or any other tool that will allow you to SSH into the virtual machine. 导航到 Azure 门户中的虚拟机,然后选择“连接”以获取连接所需的 SSH 命令。Navigate to your virtual machine in the Azure portal and select Connect to get the SSH command you need to connect.

    连接到虚拟机

  2. 在 Ubuntu 终端中输入命令,然后输入在配置虚拟机时创建的管理员密码。Enter the command in your Ubuntu terminal and enter the admin password you created when you configured the virtual machine.

    Ubuntu 终端 SSH 登录

  3. 使用以下命令在虚拟机上安装 Docker。Use the following command to install Docker on the virtual machine.

    sudo apt-get install docker.io
    

    使用以下命令验证 Docker 的安装:Verify the install of Docker with the following command:

    sudo docker --version
    
  4. 安装映像。Install the image.

    sudo docker pull mcr.microsoft.com/mssql/server:2017-latest
    

    检查映像。Check the images.

    sudo docker images
    
  5. 从映像运行容器。Run the container from the image.

    sudo docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password1234' -p 1433:1433 --name sql1  -d mcr.microsoft.com/mssql/server:2017-latest
    

    验证容器是否正在运行。Verify that the container is running.

    sudo docker ps -a
    

创建 SQL 数据库Create a SQL database

  1. 打开 SQL Server Management Studio 并使用服务器名和 SQL 身份验证连接到服务器。Open SQL Server Management Studio and connect to the server using the server name and SQL Authentication. 登录用户名是 SA,密码是 Docker 命令中设置的密码。The sign in username is SA and the password is the password set in the Docker command. 示例命令中的密码为 Password1234The password in the example command is Password1234.

    使用 SQL Server Management Studio 连接到 SQL Server

  2. 成功连接后,选择“新建查询”,然后输入以下代码片段以创建数据库、表,并在表中插入一些记录。Once you've successfully connected, select New Query and enter the following code snippet to create a database, a table, and insert some records in the table.

    CREATE DATABASE MYDB;
    GO
    USE MYDB;
    CREATE TABLE states(Name VARCHAR(20), Capitol VARCHAR(20));
    INSERT INTO states VALUES ('Delaware','Dover');
    INSERT INTO states VALUES ('South Carolina','Columbia');
    INSERT INTO states VALUES ('Texas','Austin');
    SELECT * FROM states
    GO
    

    查询以创建 SQL Server 数据库

从 Azure Databricks 查询 SQL ServerQuery SQL Server from Azure Databricks

  1. 导航到 Azure Databricks 工作区,并验证是否已按照先决条件的要求创建了一个群集。Navigate to your Azure Databricks workspace and verify that you created a cluster as part of the prerequisites. 然后选择“创建笔记本”。Then, select Create a Notebook. 为笔记本指定一个名称,选择 Python 作为语言,然后选择创建的群集。Give the notebook a name, select Python as the language, and select the cluster you created.

    新的 Databricks 笔记本设置

  2. 使用以下命令对 SQL Server 虚拟机的内部 IP 地址执行 ping 操作。Use the following command to ping the internal IP Address of the SQL Server virtual machine. 此 ping 操作应会成功。This ping should be successful. 如果不成功,请验证容器是否正在运行,并查看网络安全组 (NSG) 配置。If not, verify that the container is running, and review the network security group (NSG) configuration.

    %sh
    ping 10.179.64.4
    

    还可以使用 nslookup 命令进行查看。You can also use the nslookup command to review.

    %sh
    nslookup databricks-tutorial-vm.westus2.cloudapp.azure.com
    
  3. 成功对 SQL Server 执行 ping 操作后,即可查询数据库和表。Once you've successfully pinged the SQL Server, you can query the database and tables. 运行以下 Python 代码:Run the following python code:

    jdbcHostname = "10.179.64.4"
    jdbcDatabase = "MYDB"
    userName = 'SA'
    password = 'Password1234'
    jdbcPort = 1433
    jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, userName, password)
    
    df = spark.read.jdbc(url=jdbcUrl, table='states')
    display(df)
    

清理资源Clean up resources

若不再需要资源组、Azure Databricks 工作区以及所有相关资源,请将其删除。When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources. 删除作业可避免产生不必要的计费。Deleting the job avoids unnecessary billing. 若计划将来使用 Azure Databricks 工作区,则可停止群集并在以后重启。If you're planning to use the Azure Databricks workspace in future, you can stop the cluster and restart it later. 如果不打算继续使用此 Azure Databricks 工作区,请使用以下步骤删除在本教程中创建的所有资源:If you are not going to continue to use this Azure Databricks workspace, delete all resources you created in this tutorial by using the following steps:

  1. 在 Azure 门户的左侧菜单中,单击“资源组”,并单击所创建的资源的名称。From the left-hand menu in the Azure portal, click Resource groups and then click the name of the resource group you created.

  2. 在资源组页上选择“删除”,在文本框中键入要删除的资源的名称,然后再次选择“删除” 。On your resource group page, select Delete , type the name of the resource to delete in the text box, and then select Delete again.

后续步骤Next steps

请继续学习下一篇文章,了解如何使用 Azure Databricks 提取、转换和加载数据。Advance to the next article to learn how to extract, transform, and load data using Azure Databricks.