Tutorial: Query a SQL Server Linux Docker container in a virtual network from an Azure Databricks notebook

This tutorial teaches you how to integrate Azure Databricks with a SQL Server Linux Docker container in a virtual network.

In this tutorial, you learn how to:

  • Deploy an Azure Databricks workspace to a virtual network
  • Install a Linux virtual machine in a public network
  • Install Docker
  • Install Microsoft SQL Server on Linux docker container
  • Query the SQL Server using JDBC from a Databricks notebook

Prerequisites

Create a Linux virtual machine

  1. In the Azure portal, select the icon for Virtual Machines. Then, select + Add.

    Add new Azure virtual machine

  2. On the Basics tab, Choose Ubuntu Server 18.04 LTS and change the VM size to B2s. Choose an administrator username and password.

    Basics tab of new virtual machine configuration

  3. Navigate to the Networking tab. Choose the virtual network and the public subnet that includes your Azure Databricks cluster. Select Review + create, then Create to deploy the virtual machine.

    Networking tab of new virtual machine configuration

  4. When the deployment is complete, navigate to the virtual machine. Notice the Public IP address and Virtual network/subnet in the Overview. Select the Public IP Address

    Virtual machine overview

  5. Change the Assignment to Static and enter a DNS name label. Select Save, and restart the virtual machine.

    Public IP Address configuration

  6. Select the Networking tab under Settings. Notice that the network security group that was created during the Azure Databricks deployment is associated with the virtual machine. Select Add inbound port rule.

  7. Add a rule to open port 22 for SSH. Use the following settings:

    Setting Suggested value Description
    Source IP Addresses IP Addresses specifies that incoming traffic from a specific source IP Address will be allowed or denied by this rule.
    Source IP addresses <your public ip> Enter your public IP address. You can find your public IP address by visiting bing.com and searching for "my IP".
    Source port ranges * Allow traffic from any port.
    Destination IP Addresses IP Addresses specifies that outgoing traffic for a specific source IP Address will be allowed or denied by this rule.
    Destination IP addresses <your vm public ip> Enter your virtual machine's public IP address. You can find this on the Overview page of your virtual machine.
    Destination port ranges 22 Open port 22 for SSH.
    Priority 290 Give the rule a priority.
    Name ssh-databricks-tutorial-vm Give the rule a name.

    Add inbound security rule for port 22

  8. Add a rule to open port 1433 for SQL with the following settings:

    Setting Suggested value Description
    Source Any Source specifies that incoming traffic from a specific source IP Address will be allowed or denied by this rule.
    Source port ranges * Allow traffic from any port.
    Destination IP Addresses IP Addresses specifies that outgoing traffic for a specific source IP Address will be allowed or denied by this rule.
    Destination IP addresses <your vm public ip> Enter your virtual machine's public IP address. You can find this on the Overview page of your virtual machine.
    Destination port ranges 1433 Open port 22 for SQL Server.
    Priority 300 Give the rule a priority.
    Name sql-databricks-tutorial-vm Give the rule a name.

    Add inbound security rule for port 1433

Run SQL Server in a Docker container

  1. Open Ubuntu for Windows, or any other tool that will allow you to SSH into the virtual machine. Navigate to your virtual machine in the Azure portal and select Connect to get the SSH command you need to connect.

    Connect to virtual machine

  2. Enter the command in your Ubuntu terminal and enter the admin password you created when you configured the virtual machine.

    Ubuntu terminal SSH sign in

  3. Use the following command to install Docker on the virtual machine.

    sudo apt-get install docker.io
    

    Verify the install of Docker with the following command:

    sudo docker --version
    
  4. Install the image.

    sudo docker pull mcr.microsoft.com/mssql/server:2017-latest
    

    Check the images.

    sudo docker images
    
  5. Run the container from the image.

    sudo docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password1234' -p 1433:1433 --name sql1  -d mcr.microsoft.com/mssql/server:2017-latest
    

    Verify that the container is running.

    sudo docker ps -a
    

Create a SQL database

  1. Open SQL Server Management Studio and connect to the server using the server name and SQL Authentication. The sign in username is SA and the password is the password set in the Docker command. The password in the example command is Password1234.

    Connect to SQL Server using SQL Server Management Studio

  2. Once you've successfully connected, select New Query and enter the following code snippet to create a database, a table, and insert some records in the table.

    CREATE DATABASE MYDB;
    GO
    USE MYDB;
    CREATE TABLE states(Name VARCHAR(20), Capitol VARCHAR(20));
    INSERT INTO states VALUES ('Delaware','Dover');
    INSERT INTO states VALUES ('South Carolina','Columbia');
    INSERT INTO states VALUES ('Texas','Austin');
    SELECT * FROM states
    GO
    

    Query to create a SQL Server database

Query SQL Server from Azure Databricks

  1. Navigate to your Azure Databricks workspace and verify that you created a cluster as part of the prerequisites. Then, select Create a Notebook. Give the notebook a name, select Python as the language, and select the cluster you created.

    New Databricks notebook settings

  2. Use the following command to ping the internal IP Address of the SQL Server virtual machine. This ping should be successful. If not, verify that the container is running, and review the network security group (NSG) configuration.

    %sh
    ping 10.179.64.4
    

    You can also use the nslookup command to review.

    %sh
    nslookup databricks-tutorial-vm.chinaeast2.cloudapp.chinacloudapi.cn
    
  3. Once you've successfully pinged the SQL Server, you can query the database and tables. Run the following python code:

    jdbcHostname = "10.179.64.4"
    jdbcDatabase = "MYDB"
    userName = 'SA'
    password = 'Password1234'
    jdbcPort = 1433
    jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, userName, password)
    
    df = spark.read.jdbc(url=jdbcUrl, table='states')
    display(df)
    

Clean up resources

When no longer needed, delete the resource group, the Azure Databricks workspace, and all related resources. Deleting the job avoids unnecessary billing. If you're planning to use the Azure Databricks workspace in future, you can stop the cluster and restart it later. If you are not going to continue to use this Azure Databricks workspace, delete all resources you created in this tutorial by using the following steps:

  1. From the left-hand menu in the Azure portal, click Resource groups and then click the name of the resource group you created.

  2. On your resource group page, select Delete, type the name of the resource to delete in the text box, and then select Delete again.

Next steps

Advance to the next article to learn how to extract, transform, and load data using Azure Databricks.