Troubleshoot link - Azure SQL Managed Instance

Applies to: Azure SQL Managed Instance

This article teaches you how to monitor and troubleshoot issues with a link between SQL Server and Azure SQL Managed Instance.

You can check the state of the link with Transact-SQL (T-SQL), Azure PowerShell or the Azure CLI. If you encounter issues, you can use the error codes to troubleshoot the problem.

Many issues with creating the link can be resolved by checking the network between the two instances, and validating that the environment has been properly prepared for the link.

If you run into issues with a link, you can use Transact-SQL (T-SQL), Azure PowerShell or the Azure CLI to get information about the current state of the link.

Use T-SQL for a quick look at the link state, and then use Azure PowerShell or the Azure CLI for comprehensive information about the current state of the link.

Use T-SQL to determine the state of the link during the seeding phase, or after data synchronization begins.

Use the following T-SQL query to determine the status of the link during the seeding phase on the SQL Server or SQL Managed Instance that hosts the database seeded through the link:

SELECT
    ag.local_database_name AS 'Local database name',
    ar.current_state AS 'Current state',
    ar.is_source AS 'Is source',
    ag.internal_state_desc AS 'Internal state desc',
    ag.database_size_bytes / 1024 / 1024 AS 'Database size MB',
    ag.transferred_size_bytes / 1024 / 1024 AS 'Transferred MB',
    ag.transfer_rate_bytes_per_second / 1024 / 1024 AS 'Transfer rate MB/s',
    ag.total_disk_io_wait_time_ms / 1000 AS 'Total Disk IO wait (sec)',
    ag.total_network_wait_time_ms / 1000 AS 'Total Network wait (sec)',
    ag.is_compression_enabled AS 'Compression',
    ag.start_time_utc AS 'Start time UTC',
    ag.estimate_time_complete_utc as 'Estimated time complete UTC',
    ar.completion_time AS 'Completion time',
    ar.number_of_attempts AS 'Attempt No'
FROM sys.dm_hadr_physical_seeding_stats AS ag
    INNER JOIN sys.dm_hadr_automatic_seeding AS ar
    ON local_physical_seeding_id = operation_id

-- Estimated seeding completion time
SELECT DISTINCT CONVERT(VARCHAR(8), DATEADD(SECOND, DATEDIFF(SECOND, start_time_utc, estimate_time_complete_utc) ,0), 108) as 'Estimated complete time'
FROM sys.dm_hadr_physical_seeding_stats

If the query returns no results, then the seeding process hasn't started or has already completed.
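
To turn the raw byte counters into a rough progress figure, you can divide the transferred size by the database size. The following is a minimal sketch that uses only columns already referenced in the previous query. The percentage and elapsed-time column aliases are illustrative, and the result is an approximation rather than an official progress metric.

```sql
-- Sketch: approximate seeding progress for each database being seeded.
-- NULLIF avoids a divide-by-zero error if the database size isn't reported yet.
SELECT
    local_database_name AS [Local database name],
    CAST(transferred_size_bytes * 100.0 / NULLIF(database_size_bytes, 0) AS DECIMAL(5, 2)) AS [Percent transferred],
    DATEDIFF(SECOND, start_time_utc, GETUTCDATE()) AS [Elapsed seconds]
FROM sys.dm_hadr_physical_seeding_stats;
```

As with the previous query, an empty result means seeding hasn't started or has already completed.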

Use the following T-SQL query on the primary instance to check the health of the link once data synchronization begins:

DECLARE @link_name varchar(max) = '<DAGname>'
SELECT
   rs.synchronization_health_desc [Link sync health]
FROM
   sys.availability_groups ag 
   join sys.dm_hadr_availability_replica_states rs 
   on ag.group_id = rs.group_id 
WHERE 
   rs.is_local = 0 AND rs.role = 2 AND ag.is_distributed = 1 AND ag.name = @link_name 
GO

The query returns the following possible values:

  • no result: The query was executed on the secondary instance.
  • HEALTHY: The link is healthy, and data is being synchronized between the replicas.
  • NOT_HEALTHY: The link is unhealthy, and data is not synchronizing between the replicas.
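
If the link reports NOT_HEALTHY, the same dynamic management view also exposes connection details that can help narrow down the cause. The following sketch extends the previous query with the connected state and last connection error columns of sys.dm_hadr_availability_replica_states. Whether these columns are populated for your link is an assumption to verify in your environment.

```sql
-- Sketch: run on the primary instance, same filter as the health query above.
DECLARE @link_name sysname = N'<DAGname>';
SELECT
    ag.name AS [Link name],
    rs.synchronization_health_desc AS [Link sync health],
    rs.connected_state_desc AS [Connected state],
    rs.last_connect_error_number AS [Last connect error],
    rs.last_connect_error_description AS [Last connect error description],
    rs.last_connect_error_timestamp AS [Last connect error time]
FROM sys.availability_groups AS ag
    INNER JOIN sys.dm_hadr_availability_replica_states AS rs
    ON ag.group_id = rs.group_id
WHERE rs.is_local = 0 AND rs.role = 2 AND ag.is_distributed = 1 AND ag.name = @link_name;
```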

In the link details returned by Azure PowerShell, the Azure CLI, or the REST API, the replicaState value describes the current state of the link. If the state also includes Error, then an error occurred during the operation listed in the state. For example, LinkCreationError indicates that an error occurred while creating the link.

Some possible replicaState values are:

  • CreatingLink: Initial seeding
  • LinkSynchronizing: Data replication is in progress
  • LinkFailoverInProgress: Failover is in progress

For a complete list of link state properties, review the Distributed Availability Groups - GET REST API command.

There are two distinct categories of errors you can encounter when using the link: errors when you try to initialize the link, and errors when you try to create the link.

The following error can occur when initializing a link (Link state: LinkInitError):

The following error can occur when creating a link (Link state: LinkCreationError):

  • Error 41977: The target database isn't responsive. Check link parameters and try again.

Inconsistent state after forced failover

Following a forced failover, you might encounter a split-brain scenario where both replicas are in the primary role, leaving the link in an inconsistent state. This can happen if you fail over to the secondary replica during a disaster, and then the primary replica comes back online.

First, confirm you're in a split-brain scenario. You can do so by using SQL Server Management Studio (SSMS) or Transact-SQL (T-SQL).

Connect to both SQL Server and SQL managed instance in SSMS, and then in Object Explorer, expand Availability replicas under the Availability group node in Always On High Availability. If two different replicas are listed as (Primary), you're in a split-brain scenario.

Alternatively, you can run the following T-SQL script on both SQL Server and SQL Managed Instance to check the role of the replicas:

-- Execute on SQL Server and SQL Managed Instance 
USE master
DECLARE @link_name varchar(max) = '<DAGName>'
SELECT
   ag.name [Link name], 
   rs.role_desc [Link role] 
FROM
   sys.availability_groups ag 
   JOIN sys.dm_hadr_availability_replica_states rs 
   ON ag.group_id = rs.group_id 
WHERE 
   rs.is_local = 1 AND ag.is_distributed = 1 AND ag.name = @link_name 
GO

If both instances list PRIMARY in the Link role column, you're in a split-brain scenario.

To resolve the split-brain state, first take a backup on whichever replica was the original primary. If the original primary was SQL Server, then take a tail log backup. If the original primary was SQL Managed Instance, then take a copy-only full backup. After the backup completes, set the distributed availability group to the secondary role on the replica that was the original primary and will now be the new secondary.

For example, in the event of a true disaster, assume you forced a failover of your SQL Server workload to Azure SQL Managed Instance and intend to continue running your workload on SQL Managed Instance. Take a tail log backup on SQL Server, and then set the distributed availability group to the secondary role on SQL Server, as in the following example:

--Execute on SQL Server 
USE master
ALTER AVAILABILITY GROUP [<DAGName>] 
SET (ROLE = SECONDARY) 
GO 

Next, execute a planned manual failover from SQL Managed Instance to SQL Server by using the link, as in the following example:

--Execute on SQL Managed Instance 
USE master
ALTER AVAILABILITY GROUP [<DAGName>] FAILOVER 
GO 

Expired certificate

It's possible for the certificate used for the link to expire. If the certificate expires, the link fails. To resolve this issue, rotate the certificate.
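
To check whether an expired certificate is the likely cause before you rotate it, you can list certificate expiry dates in the master database. The following is a minimal sketch, assuming the certificates used for the link are stored in master on SQL Server. It also shows which certificate, if any, is bound to a database mirroring endpoint. The column aliases are illustrative.

```sql
-- Sketch: run on SQL Server. Lists certificates in master, soonest expiry first.
-- A negative value for [Days until expiry] means the certificate has already expired.
USE master;
SELECT
    c.name AS [Certificate name],
    c.subject AS [Subject],
    c.expiry_date AS [Expiry date],
    DATEDIFF(DAY, GETUTCDATE(), c.expiry_date) AS [Days until expiry],
    e.name AS [Used by endpoint]
FROM sys.certificates AS c
    LEFT JOIN sys.database_mirroring_endpoints AS e
    ON e.certificate_id = c.certificate_id
ORDER BY c.expiry_date;
```

The result can include system-generated certificates that aren't related to the link, so look for the certificate names you used when you prepared the environment.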

Test network connectivity

Bidirectional network connectivity between SQL Server and SQL Managed Instance is necessary for the link to work. After you open ports on the SQL Server side and configure an NSG rule on the SQL Managed Instance side, test connectivity by using either SQL Server Management Studio (SSMS) or Transact-SQL.

Test the network by creating a temporary SQL Agent job on both SQL Server and SQL Managed Instance to check the connection between the two instances. When you use Network Checker in SSMS, the job is automatically created for you, and deleted after the test completes. You need to manually delete the SQL Agent job if you test your network by using T-SQL.

Note

Executing PowerShell scripts by the SQL Server Agent on SQL Server on Linux is not currently supported, so it's not currently possible to execute Test-NetConnection from the SQL Server Agent job on SQL Server on Linux.

To use the SQL Agent to test network connectivity, you must meet the following requirements:

  • The user doing the test must have permissions to create a job (either as a member of the sysadmin role or of the SQLAgentOperator role for msdb) on both SQL Server and SQL Managed Instance.
  • The SQL Server Agent service must be running on SQL Server. Since the Agent is on by default on SQL Managed Instance, no additional action is necessary.
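
You can confirm both requirements with a couple of quick checks before you create the test job. The following is a minimal sketch, assuming you run it on SQL Server on Windows and have the VIEW SERVER STATE permission needed to query sys.dm_server_services. The service name filter is an assumption based on the default display name of the Agent service.

```sql
-- Sketch: run on SQL Server.
-- Returns 1 if the current login is a member of the sysadmin server role, 0 if not.
SELECT IS_SRVROLEMEMBER('sysadmin') AS [Is sysadmin];

-- Shows whether the SQL Server Agent service is running.
SELECT servicename, status_desc
FROM sys.dm_server_services
WHERE servicename LIKE N'SQL Server Agent%';
```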

To use T-SQL to test connectivity, you have to check the connection in both directions. First, test the connection from SQL Server to SQL Managed Instance, and then test the connection from SQL Managed Instance to SQL Server.

Test connection from SQL Server to SQL Managed Instance

Use SQL Server Agent on SQL Server to run connectivity tests from SQL Server to SQL Managed Instance.

  1. Connect to SQL Managed Instance, and run the following script to generate parameters you'll need later:

    SELECT 'DECLARE @serverName NVARCHAR(512) = N''' + value + ''''
    FROM sys.dm_hadr_fabric_config_parameters
    WHERE parameter_name = 'DnsRecordName'
    
    UNION
    
    SELECT 'DECLARE @node NVARCHAR(512) = N''' + NodeName + '.' + Cluster + ''''
    FROM (
        SELECT SUBSTRING(replica_address, 0, CHARINDEX('\', replica_address)) AS NodeName,
            RIGHT(service_name, CHARINDEX('/', REVERSE(service_name)) - 1) AppName,
            JoinCol = 1
        FROM sys.dm_hadr_fabric_partitions fp
        INNER JOIN sys.dm_hadr_fabric_replicas fr
            ON fp.partition_id = fr.partition_id
        INNER JOIN sys.dm_hadr_fabric_nodes fn
            ON fr.node_name = fn.node_name
        WHERE service_name LIKE '%ManagedServer%'
            AND replica_role = 2
    ) t1
    LEFT JOIN (
        SELECT value AS Cluster,
            JoinCol = 1
        FROM sys.dm_hadr_fabric_config_parameters
        WHERE parameter_name = 'ClusterName'
        ) t2
        ON (t1.JoinCol = t2.JoinCol)
    INNER JOIN (
        SELECT [value] AS AppName
        FROM sys.dm_hadr_fabric_config_parameters
        WHERE section_name = 'SQL'
            AND parameter_name = 'InstanceName'
        ) t3
        ON (t1.AppName = t3.AppName)
    
    UNION
    
    SELECT 'DECLARE @port NVARCHAR(512) = N''' + value + ''''
    FROM sys.dm_hadr_fabric_config_parameters
    WHERE parameter_name = 'HadrPort';
    

    Results should look like the following sample:

     DECLARE @node NVARCHAR(512) = N'DB123.tr123456.china-east.worker.database.chinacloudapi.cn'
     DECLARE @port NVARCHAR(512) = N'11002'
     DECLARE @serverName NVARCHAR(512) = N'contoso-instance.12345678.database.chinacloudapi.cn'
    

    Save the results to use in the next steps. Since these parameters can change after any failover, be sure to generate them again, if necessary.

  2. Connect to your SQL Server instance.

  3. Open a new query window and paste the following script:

    --START
    -- Parameters section
    DECLARE @node NVARCHAR(512) = N''
    DECLARE @port NVARCHAR(512) = N''
    DECLARE @serverName NVARCHAR(512) = N''
    
    --Script section
    IF EXISTS (
            SELECT job_id
            FROM msdb.dbo.sysjobs_view
            WHERE name = N'TestMILinkConnection'
            )
        EXEC msdb.dbo.sp_delete_job @job_name = N'TestMILinkConnection',
            @delete_unused_schedule = 1
    
    DECLARE @jobId BINARY (16),
        @cmd NVARCHAR(MAX)
    
    EXEC msdb.dbo.sp_add_job @job_name = N'TestMILinkConnection',
        @enabled = 1,
        @job_id = @jobId OUTPUT
    
    SET @cmd = (N'tnc ' + @serverName + N' -port 5022 | select ComputerName, RemoteAddress, TcpTestSucceeded | Format-List')
    
    EXEC msdb.dbo.sp_add_jobstep @job_id = @jobId,
        @step_name = N'Test Port 5022',
        @step_id = 1,
        @cmdexec_success_code = 0,
        @on_success_action = 3,
        @on_fail_action = 3,
        @subsystem = N'PowerShell',
        @command = @cmd,
        @database_name = N'master'
    
    SET @cmd = (N'tnc ' + @node + N' -port ' + @port + ' | select ComputerName, RemoteAddress, TcpTestSucceeded | Format-List')
    
    EXEC msdb.dbo.sp_add_jobstep @job_id = @jobId,
        @step_name = N'Test HADR Port',
        @step_id = 2,
        @cmdexec_success_code = 0,
        @subsystem = N'PowerShell',
        @command = @cmd,
        @database_name = N'master'
    
    EXEC msdb.dbo.sp_add_jobserver @job_id = @jobId,
        @server_name = N'(local)'
    GO
    
    EXEC msdb.dbo.sp_start_job @job_name = N'TestMILinkConnection'
    GO
    
    --Check status every 5 seconds
    DECLARE @RunStatus INT
    
    SET @RunStatus = 10
    
    WHILE (@RunStatus >= 4)
    BEGIN
        SELECT DISTINCT @RunStatus = run_status
        FROM [msdb].[dbo].[sysjobhistory] JH
        INNER JOIN [msdb].[dbo].[sysjobs] J
            ON JH.job_id = J.job_id
        WHERE J.name = N'TestMILinkConnection'
            AND step_id = 0
    
        WAITFOR DELAY '00:00:05';
    END
    
    --Get logs once job completes
    SELECT [step_name],
        SUBSTRING([message], CHARINDEX('TcpTestSucceeded', [message]), CHARINDEX('Process Exit', [message]) - CHARINDEX('TcpTestSucceeded', [message])) AS    TcpTestResult,
        SUBSTRING([message], CHARINDEX('RemoteAddress', [message]), CHARINDEX('TcpTestSucceeded', [message]) - CHARINDEX('RemoteAddress', [message])) AS    RemoteAddressResult,
        [run_status],
        [run_duration],
        [message]
    FROM [msdb].[dbo].[sysjobhistory] JH
    INNER JOIN [msdb].[dbo].[sysjobs] J
        ON JH.job_id = J.job_id
    WHERE J.name = N'TestMILinkConnection'
        AND step_id <> 0
        --END
    
  4. Replace the @node, @port, and @serverName parameters with the values you got from the first step.

  5. Run the script and check the results. You should see results such as the following example:

    Screenshot that shows the output with the test results in SSMS.

  6. Verify the results:

    • The outcome of each test at TcpTestSucceeded should be TcpTestSucceeded : True.
    • The RemoteAddresses should belong to the IP range for the SQL Managed Instance subnet.

    If the response is unsuccessful, verify the following network settings:

    • There are rules in both the network firewall and the SQL Server host OS (Windows/Linux) firewall that allow traffic to the entire subnet IP range of SQL Managed Instance.
    • There's an NSG rule that allows communication on port 5022 for the virtual network that hosts SQL Managed Instance.

Test connection from SQL Managed Instance to SQL Server

To check that SQL Managed Instance can reach SQL Server, first create a test endpoint on SQL Server. Then use the SQL Server Agent on the SQL managed instance to run a PowerShell script that uses the tnc command to probe SQL Server on port 5022.

To create a test endpoint, connect to SQL Server and run the following T-SQL script:

-- Run on SQL Server
-- Create the certificate needed for the test endpoint
USE master
CREATE CERTIFICATE TEST_CERT
WITH SUBJECT = N'Certificate for SQL Server',
EXPIRY_DATE = N'3/30/2051'
GO

-- Create the test endpoint on SQL Server
USE master
CREATE ENDPOINT TEST_ENDPOINT
    STATE=STARTED
    AS TCP (LISTENER_PORT=5022, LISTENER_IP = ALL)
    FOR DATABASE_MIRRORING (
        ROLE=ALL,
        AUTHENTICATION = CERTIFICATE TEST_CERT,
        ENCRYPTION = REQUIRED ALGORITHM AES
    )

To verify that the SQL Server endpoint is receiving connections on port 5022, run the following PowerShell command on the host operating system of your SQL Server instance:

tnc localhost -port 5022

A successful test shows TcpTestSucceeded : True. You can then create a SQL Server Agent job on the SQL managed instance to probe the SQL Server test endpoint on port 5022 from the SQL managed instance.
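
If you also want to confirm the endpoint from T-SQL, a catalog view query can show whether it's started and which port it listens on. The following is a minimal sketch, assuming the test endpoint created earlier is a database mirroring endpoint on SQL Server.

```sql
-- Sketch: run on SQL Server. Lists database mirroring endpoints with their state and port.
-- For the test endpoint, expect state_desc = STARTED and port = 5022.
SELECT name, state_desc, port, ip_address
FROM sys.tcp_endpoints
WHERE type_desc = N'DATABASE_MIRRORING';
```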

Next, create a SQL Server Agent job called NetHelper by running the following T-SQL script on the SQL managed instance. Replace <SQL_SERVER_IP_ADDRESS> with an IP address of SQL Server that can be accessed from the SQL managed instance.

-- Run on SQL managed instance
-- SQL_SERVER_IP_ADDRESS should be an IP address that could be accessed from the SQL Managed Instance host machine.
DECLARE @SQLServerIpAddress NVARCHAR(MAX) = '<SQL_SERVER_IP_ADDRESS>'; -- insert your SQL Server IP address in here
DECLARE @tncCommand NVARCHAR(MAX) = 'tnc ' + @SQLServerIpAddress + ' -port 5022 -InformationLevel Quiet';
DECLARE @jobId BINARY(16);

IF EXISTS (
        SELECT *
        FROM msdb.dbo.sysjobs
        WHERE name = 'NetHelper'
        )
    THROW 70000, 'Agent job NetHelper already exists. Please rename the job, or drop the existing job before creating it again.', 1;

-- To delete NetHelper job run: EXEC msdb.dbo.sp_delete_job @job_name=N'NetHelper'
EXEC msdb.dbo.sp_add_job @job_name = N'NetHelper',
    @enabled = 1,
    @description = N'Test SQL Managed Instance to SQL Server network connectivity on port 5022.',
    @category_name = N'[Uncategorized (Local)]',
    @owner_login_name = N'sa',
    @job_id = @jobId OUTPUT;

EXEC msdb.dbo.sp_add_jobstep @job_id = @jobId,
    @step_name = N'TNC network probe from SQL MI to SQL Server',
    @step_id = 1,
    @os_run_priority = 0,
    @subsystem = N'PowerShell',
    @command = @tncCommand,
    @database_name = N'master',
    @flags = 40;

EXEC msdb.dbo.sp_update_job @job_id = @jobId,
    @start_step_id = 1;

EXEC msdb.dbo.sp_add_jobserver @job_id = @jobId,
    @server_name = N'(local)';

Tip

If you need to modify the IP address of your SQL Server for the connectivity probe from the SQL managed instance, delete the NetHelper job by running EXEC msdb.dbo.sp_delete_job @job_name=N'NetHelper', and then re-create the NetHelper job by using the previous script.

Then, create a stored procedure ExecuteNetHelper that helps run the job, and obtains results from the network probe. Run the following T-SQL script on SQL managed instance:

-- Run on managed instance
IF EXISTS(SELECT * FROM sys.objects WHERE name = 'ExecuteNetHelper')
    THROW 70001, 'Stored procedure ExecuteNetHelper already exists. Rename or drop the existing procedure before creating it again.', 1
GO
CREATE PROCEDURE ExecuteNetHelper AS
-- To delete the procedure run: DROP PROCEDURE ExecuteNetHelper
BEGIN
    -- Start the job.
    DECLARE @NetHelperstartTimeUtc DATETIME = GETUTCDATE();
    DECLARE @stop_exec_date DATETIME = NULL;

    EXEC msdb.dbo.sp_start_job @job_name = N'NetHelper';

    -- Wait for job to complete and then see the outcome.
    WHILE (@stop_exec_date IS NULL)
    BEGIN
        -- Wait and see if the job has completed.
        WAITFOR DELAY '00:00:01'

        SELECT @stop_exec_date = sja.stop_execution_date
        FROM msdb.dbo.sysjobs sj
        INNER JOIN msdb.dbo.sysjobactivity sja
            ON sj.job_id = sja.job_id
        WHERE sj.name = 'NetHelper'

        -- If job has completed, get the outcome of the network test.
        IF (@stop_exec_date IS NOT NULL)
        BEGIN
            SELECT sj.name JobName,
                sjsl.date_modified AS 'Date executed',
                sjs.step_name AS 'Step executed',
                sjsl.log AS 'Connectivity status'
            FROM msdb.dbo.sysjobs sj
            LEFT JOIN msdb.dbo.sysjobsteps sjs
                ON sj.job_id = sjs.job_id
            LEFT JOIN msdb.dbo.sysjobstepslogs sjsl
                ON sjs.step_uid = sjsl.step_uid
            WHERE sj.name = 'NetHelper'
        END

        -- In case of operation timeout (90 seconds), print timeout message.
        IF (datediff(second, @NetHelperstartTimeUtc, getutcdate()) > 90)
        BEGIN
            SELECT 'NetHelper timed out during the network check. Please investigate SQL Agent logs for more information.'

            BREAK;
        END
    END
END;

Run the following query on the SQL managed instance to execute the stored procedure, which runs the NetHelper agent job and shows the resulting log:

-- Run on managed instance
EXEC ExecuteNetHelper;

If the connection was successful, the log shows True. If the connection was unsuccessful, the log shows False.

Screenshot that shows the expected output of the NetHelper SQL Agent job.

If the connection was unsuccessful, verify the following items:

  • The firewall on the host SQL Server instance allows inbound and outbound communication on port 5022.
  • An NSG rule for the virtual network that hosts SQL Managed Instance allows communication on port 5022.
  • If your SQL Server instance is on an Azure VM, an NSG rule allows communication on port 5022 on the virtual network that hosts the VM.
  • SQL Server is running.
  • A test endpoint exists on SQL Server.

After resolving issues, rerun the NetHelper network probe by running EXEC ExecuteNetHelper on the SQL managed instance.

Finally, after the network test is successful, drop the test endpoint and certificate on SQL Server by using the following T-SQL commands:

-- Run on SQL Server
DROP ENDPOINT TEST_ENDPOINT;
GO
DROP CERTIFICATE TEST_CERT;
GO
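
The Agent jobs and the stored procedure created for the tests aren't removed automatically when you test with T-SQL. The following cleanup is a minimal sketch that uses the object names from the scripts in this article. It assumes you run the DROP PROCEDURE statement in the same database where you created ExecuteNetHelper.

```sql
-- Sketch: run on SQL Server to remove the job created for the SQL Server to SQL Managed Instance test.
IF EXISTS (SELECT 1 FROM msdb.dbo.sysjobs WHERE name = N'TestMILinkConnection')
    EXEC msdb.dbo.sp_delete_job @job_name = N'TestMILinkConnection';
GO

-- Sketch: run on the SQL managed instance to remove the NetHelper job and its helper procedure.
IF EXISTS (SELECT 1 FROM msdb.dbo.sysjobs WHERE name = N'NetHelper')
    EXEC msdb.dbo.sp_delete_job @job_name = N'NetHelper';
GO
DROP PROCEDURE IF EXISTS ExecuteNetHelper;
GO
```

Each step checks for existence first, so running a batch on an instance where the object doesn't exist has no effect.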

Caution

Proceed with the next steps only if you've validated network connectivity between your source and target environments. Otherwise, troubleshoot network connectivity issues before proceeding.

For more information on the link feature, review the following resources: