Monitor Site Recovery with Azure Monitor Logs

This article describes how to monitor machines replicated by Azure Site Recovery, using Azure Monitor Logs and Log Analytics.

Azure Monitor Logs provides a log data platform that collects activity and resource logs, along with other monitoring data. Within Azure Monitor Logs, you use Log Analytics to write and test log queries, and to interactively analyze log data. You can visualize and query log results, and configure alerts to take actions based on monitored data.

For Site Recovery, you can use Azure Monitor Logs to do the following:

  • Monitor Site Recovery health and status. For example, you can monitor replication health, test failover status, Site Recovery events, recovery point objectives (RPOs) for protected machines, and disk/data change rates.
  • Set up alerts for Site Recovery. For example, you can configure alerts for machine health, test failover status, or Site Recovery job status.

Using Azure Monitor Logs with Site Recovery is supported for Azure to Azure replication, and for VMware VM/physical server to Azure replication.

Note

To get the churn data logs and upload rate logs for VMware and physical machines, you need to install an Azure monitoring agent on the Process Server. This agent sends the logs of the replicating machines to the workspace. This capability is available only for Mobility agent version 9.30 and later.

Before you start

Here's what you need:

  • At least one machine protected in a Recovery Services vault.
  • A Log Analytics workspace to store Site Recovery logs. Learn about setting up a workspace.
  • A basic understanding of how to write, run, and analyze log queries in Log Analytics. Learn more.

We recommend that you review common monitoring questions before you start.

Configure Site Recovery to send logs

  1. In the vault, click Diagnostic settings > Add diagnostic setting.

    Screenshot showing the Add diagnostic setting option.

  2. In Diagnostic settings, specify a name, and check the box Send to Log Analytics.

  3. Select the Azure Monitor Logs subscription, and the Log Analytics workspace.

  4. Select Azure Diagnostics in the toggle.

  5. From the log list, select all the logs with the prefix AzureSiteRecovery. Then click OK.

    Screenshot of the Diagnostic settings screen.

The Site Recovery logs start to feed into a table (AzureDiagnostics) in the selected workspace.
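
Once the diagnostic setting is saved, one way to confirm that records are arriving is to count them per log category. The following is a minimal sketch, assuming the categories all share the AzureSiteRecovery prefix selected in step 5; the 24-hour window and the column aliases Records and LatestRecord are illustrative choices, not part of the product.

```kusto
// Count recent Site Recovery records per log category.
AzureDiagnostics
| where TimeGenerated >= ago(24h)
| where Category startswith "AzureSiteRecovery"
| summarize Records = count(), LatestRecord = max(TimeGenerated) by Category
| order by Category asc
```

An empty result may simply mean that no data has reached the workspace yet, so it is worth rerunning the query later before changing the configuration.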

Configure Azure monitoring agent on the Process Server to send churn and upload rate logs

You can capture the data churn rate and source data upload rate information for your on-premises VMware/physical machines. To enable this, an Azure monitoring agent must be installed on the Process Server.

  1. Go to the Log Analytics workspace and click Advanced Settings.

  2. Open the Connected Sources page and select Windows Servers.

  3. Download the Windows Agent (64 bit) on the Process Server.

  4. Obtain the workspace ID and key.

  5. Configure the agent to use TLS 1.2.

  6. Complete the agent installation by providing the workspace ID and key that you obtained.

  7. After the installation is complete, go to the Log Analytics workspace and click Advanced Settings. Go to the Data page and click Windows Performance Counters.

  8. Click '+' to add the following two counters with a sample interval of 300 seconds:

    • ASRAnalytics(*)\SourceVmChurnRate
    • ASRAnalytics(*)\SourceVmThrpRate

The churn and upload rate data will start feeding into the workspace.
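
To check that the two counters are being collected, a query along the following lines can help. It is a sketch that assumes the counters land in the Perf table under the object name ASRAnalytics, as in the VMware/physical churn query later in this article; the one-hour window and the aliases Samples and LatestSample are arbitrary.

```kusto
// Summarize recent ASRAnalytics samples per counter.
Perf
| where ObjectName == "ASRAnalytics"
| where TimeGenerated >= ago(1h)
| summarize Samples = count(), LatestSample = max(TimeGenerated) by CounterName
```

With a 300-second sample interval, a counter that is being collected for a single instance should accumulate roughly twelve samples per hour.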

Query the logs - examples

You retrieve data from logs using log queries written in the Kusto query language. This section provides a few examples of common queries you might use for Site Recovery monitoring.

Note

Some of the examples use replicationProviderName_s set to A2A. This retrieves Azure VMs that are replicated to a secondary Azure region using Site Recovery. In these examples, you can replace A2A with InMageAzureV2 if you want to retrieve on-premises VMware VMs or physical servers that are replicated to Azure using Site Recovery.
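
If you want both scenarios in a single result rather than swapping the provider name, the filter can list both values. This is a sketch only: it reuses the columns from the examples in this section, and the alias Machines is an illustrative name.

```kusto
// Latest record per machine for both replication providers,
// then a count per provider and health state.
AzureDiagnostics
| where replicationProviderName_s in ("A2A", "InMageAzureV2")
| where isnotempty(name_s)
| summarize arg_max(TimeGenerated, *) by name_s
| summarize Machines = count() by replicationProviderName_s, replicationHealth_s
```

Because the rows are grouped by name_s alone, this sketch assumes machine names are unique across the two scenarios.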

Query replication health

This query plots a pie chart for the current replication health of all protected Azure VMs, broken down into three states: Normal, Warning, or Critical.

AzureDiagnostics  
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s  
| project name_s , replicationHealth_s  
| summarize count() by replicationHealth_s  
| render piechart   
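
The summarize arg_max(TimeGenerated, *) by name_s step is what restricts the chart to the current state: it keeps only the most recent row for each machine. The sketch below shows the effect on a few hypothetical rows built with datatable; the VM names, timestamps, and health values are made up for illustration and do not come from a real workspace.

```kusto
// Hypothetical sample rows standing in for AzureDiagnostics.
datatable(TimeGenerated:datetime, name_s:string, replicationHealth_s:string)
[
    datetime(2024-01-01 09:00:00), "vm-a", "Critical",
    datetime(2024-01-01 10:00:00), "vm-a", "Normal",
    datetime(2024-01-01 09:30:00), "vm-b", "Warning",
    datetime(2024-01-01 10:00:00), "vm-c", "Normal"
]
// Keep only the most recent row for each VM.
| summarize arg_max(TimeGenerated, *) by name_s
// Count VMs per health state.
| summarize count() by replicationHealth_s
```

With these sample rows the result is two VMs in Normal and one in Warning. The older Critical record for vm-a is discarded because a newer record exists for the same machine.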

Query Mobility service version

This query plots a pie chart for Azure VMs replicated with Site Recovery, broken down by the version of the Mobility agent that they're running.

AzureDiagnostics  
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s  
| project name_s , agentVersion_s  
| summarize count() by agentVersion_s  
| render piechart 

Query RPO time

This query plots a bar chart of Azure VMs replicated with Site Recovery, broken down by recovery point objective (RPO): less than 15 minutes, 15-30 minutes, or more than 30 minutes.

AzureDiagnostics 
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)  
| extend RPO = case(rpoInSeconds_d <= 900, "<15Min",   
rpoInSeconds_d <= 1800, "15-30Min", ">30Min")  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s  
| project name_s , RPO  
| summarize Count = count() by RPO  
| render barchart 

Screenshot showing a bar chart of Azure VMs replicated with Site Recovery.
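
To see how the case() expression in this query assigns buckets, the following sketch applies the same thresholds to hypothetical RPO values supplied through datatable. The VM names and values are invented for illustration.

```kusto
// Hypothetical RPO values in seconds.
datatable(name_s:string, rpoInSeconds_d:real)
[
    "vm-a", 300.0,
    "vm-b", 900.0,
    "vm-c", 1200.0,
    "vm-d", 1800.0,
    "vm-e", 3600.0
]
// Same bucketing as the query above; conditions are evaluated in order.
| extend RPO = case(rpoInSeconds_d <= 900, "<15Min",
                    rpoInSeconds_d <= 1800, "15-30Min",
                    ">30Min")
| summarize Count = count() by RPO
```

Both comparisons are inclusive, so an RPO of exactly 900 seconds is counted under "<15Min" and exactly 1800 seconds under "15-30Min". These sample rows therefore produce counts of 2, 2, and 1 for the three buckets.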

Query Site Recovery jobs

This query retrieves all Site Recovery jobs (for all disaster recovery scenarios) triggered in the last 72 hours, and their completion state.

AzureDiagnostics  
| where Category == "AzureSiteRecoveryJobs"  
| where TimeGenerated >= ago(72h)   
| project JobName = OperationName , VaultName = Resource , TargetName = affectedResourceName_s, State = ResultType  

Query Site Recovery events

This query retrieves all Site Recovery events (for all disaster recovery scenarios) raised in the last 72 hours, along with their severity.

AzureDiagnostics   
| where Category == "AzureSiteRecoveryEvents"   
| where TimeGenerated >= ago(72h)   
| project AffectedObject=affectedResourceName_s , VaultName = Resource, Description_s = healthErrors_s , Severity = Level  

Query test failover state (pie chart)

This query plots a pie chart for the test failover status of Azure VMs replicated with Site Recovery.

AzureDiagnostics  
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)  
| where isnotempty(failoverHealth_s) and isnotnull(failoverHealth_s)  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s  
| project name_s , Resource, failoverHealth_s  
| summarize count() by failoverHealth_s  
| render piechart 

Query test failover state (table)

This query returns a table of the test failover status of Azure VMs replicated with Site Recovery.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)   
| where isnotempty(failoverHealth_s) and isnotnull(failoverHealth_s)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| project VirtualMachine = name_s , VaultName = Resource , TestFailoverStatus = failoverHealth_s 

Query machine RPO

This query plots a trend graph that tracks the RPO of a specific Azure VM (ContosoVM123) for the last 72 hours.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where TimeGenerated > ago(72h)  
| where isnotempty(name_s) and isnotnull(name_s)   
| where name_s == "ContosoVM123"  
| project TimeGenerated, name_s , RPO_in_seconds = rpoInSeconds_d   
| render timechart 

Screenshot showing a trend graph that tracks the RPO of a specific Azure VM.

Query data change rate (churn) and upload rate for an Azure VM

This query plots a trend graph for a specific Azure VM (ContosoVM123) that represents the data change rate (Write Bytes per Second) and the data upload rate.

AzureDiagnostics   
| where Category in ("AzureSiteRecoveryProtectedDiskDataChurn", "AzureSiteRecoveryReplicationDataUploadRate")   
| extend CategoryS = case(Category contains "Churn", "DataChurn",   
Category contains "Upload", "UploadRate", "none")  
| extend InstanceWithType=strcat(CategoryS, "_", InstanceName_s)   
| where TimeGenerated > ago(24h)   
| where InstanceName_s startswith "ContosoVM123"   
| project TimeGenerated , InstanceWithType , Churn_MBps = todouble(Value_s)/1048576   
| render timechart  

Screenshot showing a trend graph for a specific Azure VM.

Query data change rate (churn) and upload rate for a VMware or physical machine

Note

Ensure that you set up the monitoring agent on the Process Server to fetch these logs. See the steps to configure the monitoring agent.

This query plots a trend graph for a specific disk (disk0) of a replicated item (win-9r7sfh9qlru), representing the data change rate (Write Bytes per Second) and the data upload rate. You can find the disk name on the Disks blade of the replicated item in the Recovery Services vault. The instance name to use in the query is the DNS name of the machine, followed by _ and the disk name, as in this example.

Perf
| where ObjectName == "ASRAnalytics"
| where InstanceName contains "win-9r7sfh9qlru_disk0"
| where TimeGenerated >= ago(4h) 
| project TimeGenerated ,CounterName, Churn_MBps = todouble(CounterValue)/5242880 
| render timechart

The Process Server pushes this data to the Log Analytics workspace every 5 minutes. Each data point represents the average computed over 5 minutes.
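
If you are unsure of the exact instance name to filter on, listing the names that the Process Server has reported can help. This is a sketch that assumes the same Perf table and ASRAnalytics object name used in the query above; the four-hour window is arbitrary.

```kusto
// List the machine_disk instance names and counters reported recently.
Perf
| where ObjectName == "ASRAnalytics"
| where TimeGenerated >= ago(4h)
| distinct InstanceName, CounterName
| order by InstanceName asc
```

Each InstanceName value in the result can then be used in the where InstanceName contains filter of the trend query.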

Query disaster recovery summary (Azure to Azure)

This query returns a summary table for Azure VMs replicated to a secondary Azure region. It shows the VM name, replication and protection status, RPO, test failover status, Mobility agent version, any active replication errors, and the source location.

AzureDiagnostics 
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| project VirtualMachine = name_s , Vault = Resource , ReplicationHealth = replicationHealth_s, Status = protectionState_s, RPO_in_seconds = rpoInSeconds_d, TestFailoverStatus = failoverHealth_s, AgentVersion = agentVersion_s, ReplicationError = replicationHealthErrors_s, SourceLocation = primaryFabricName_s 

Query disaster recovery summary (VMware/physical servers)

This query returns a summary table for VMware VMs and physical servers replicated to Azure. It shows the machine name, replication and protection status, RPO, test failover status, Mobility agent version, any active replication errors, and the relevant process server.

AzureDiagnostics  
| where replicationProviderName_s == "InMageAzureV2"   
| where isnotempty(name_s) and isnotnull(name_s)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| project VirtualMachine = name_s , Vault = Resource , ReplicationHealth = replicationHealth_s, Status = protectionState_s, RPO_in_seconds = rpoInSeconds_d, TestFailoverStatus = failoverHealth_s, AgentVersion = agentVersion_s, ReplicationError = replicationHealthErrors_s, ProcessServer = processServerName_g  

Set up alerts - examples

You can set up Site Recovery alerts based on Azure Monitor data. Learn more about setting up log alerts.

Note

Some of the examples use replicationProviderName_s set to A2A. This sets alerts for Azure VMs that are replicated to a secondary Azure region. In these examples, you can replace A2A with InMageAzureV2 if you want to set alerts for on-premises VMware VMs or physical servers replicated to Azure.

Multiple machines in a critical state

Set up an alert if more than 20 replicated Azure VMs go into a Critical state.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where replicationHealth_s == "Critical"  
| where isnotempty(name_s) and isnotnull(name_s)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| summarize count() 

For the alert, set Threshold value to 20.

Single machine in a critical state

Set up an alert if a specific replicated Azure VM goes into a Critical state.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where replicationHealth_s == "Critical"  
| where name_s == "ContosoVM123"  
| where isnotempty(name_s) and isnotnull(name_s)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| summarize count()  

For the alert, set Threshold value to 1.

Multiple machines exceed RPO

Set up an alert if the RPO for more than 20 Azure VMs exceeds 30 minutes.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)   
| where rpoInSeconds_d > 1800  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| project name_s , rpoInSeconds_d   
| summarize count()  

For the alert, set Threshold value to 20.

Single machine exceeds RPO

Set up an alert if the RPO for a specific Azure VM exceeds 30 minutes.

AzureDiagnostics   
| where replicationProviderName_s == "A2A"   
| where isnotempty(name_s) and isnotnull(name_s)   
| where name_s == "ContosoVM123"  
| where rpoInSeconds_d > 1800  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| project name_s , rpoInSeconds_d   
| summarize count()  

For the alert, set Threshold value to 1.

Test failover for multiple machines exceeds 90 days

Set up an alert if the last successful test failover was more than 90 days ago for more than 20 VMs.

AzureDiagnostics  
| where replicationProviderName_s == "A2A"   
| where Category == "AzureSiteRecoveryReplicatedItems"  
| where isnotempty(name_s) and isnotnull(name_s)   
| where lastSuccessfulTestFailoverTime_t <= ago(90d)   
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| summarize count()  

For the alert, set Threshold value to 20.

Test failover for single machine exceeds 90 days

Set up an alert if the last successful test failover for a specific VM was more than 90 days ago.

AzureDiagnostics  
| where replicationProviderName_s == "A2A"   
| where Category == "AzureSiteRecoveryReplicatedItems"  
| where isnotempty(name_s) and isnotnull(name_s)   
| where lastSuccessfulTestFailoverTime_t <= ago(90d)   
| where name_s == "ContosoVM123"  
| summarize hint.strategy=partitioned arg_max(TimeGenerated, *) by name_s   
| summarize count()  

For the alert, set Threshold value to 1.

Site Recovery job fails

Set up an alert if a Site Recovery job (in this case, the Reprotect job) fails for any Site Recovery scenario during the last day.

AzureDiagnostics   
| where Category == "AzureSiteRecoveryJobs"   
| where OperationName == "Reprotect"  
| where ResultType == "Failed"  
| summarize count()  

For the alert, set Threshold value to 1, and Period to 1440 minutes, to check for failures in the last day.
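
The alert query itself has no time filter, because the alert's Period setting supplies the window. To preview interactively in Log Analytics which failures such an alert would pick up, a variant with an explicit one-day window can be used. This is a sketch: the projected columns are taken from the job query earlier in this article, and the aliases are illustrative.

```kusto
// Failed Reprotect jobs in the last day, listed rather than counted.
AzureDiagnostics
| where Category == "AzureSiteRecoveryJobs"
| where TimeGenerated >= ago(1d)   // mirrors the 1440-minute alert period
| where OperationName == "Reprotect"
| where ResultType == "Failed"
| project TimeGenerated, VaultName = Resource, TargetName = affectedResourceName_s
```

Listing the rows instead of applying summarize count() makes it easier to see which vault and target each failure belongs to before you commit to a threshold.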

Next steps

Learn about built-in Site Recovery monitoring.