Data consistency verification in copy activity (Preview)

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

When you move data from a source store to a destination store, the Azure Data Factory copy activity provides an option to perform additional data consistency verification. This ensures that the data is not only copied successfully but also verified to be consistent between the source and destination stores. If inconsistent files are found during the data movement, you can either abort the copy activity or continue copying the rest by enabling the fault tolerance setting to skip inconsistent files. You can get the names of the skipped files by enabling the session log setting in the copy activity.

Important

This feature is currently in preview, with the following limitations that we are actively working on:

  • When you enable the session log setting in the copy activity to log the inconsistent files being skipped, the completeness of the log file cannot be 100% guaranteed if the copy activity fails.
  • The session log contains inconsistent files only; successfully copied files are not logged so far.

Supported data stores and scenarios

  • Data consistency verification is supported by all connectors except FTP, SFTP, and HTTP.
  • Data consistency verification is not supported in the staging copy scenario.
  • When copying binary files, data consistency verification is only available when the 'PreserveHierarchy' behavior is set in the copy activity.
  • When copying multiple binary files in a single copy activity with data consistency verification enabled, you can either abort the copy activity or continue copying the rest by enabling the fault tolerance setting to skip inconsistent files.
  • When copying a table in a single copy activity with data consistency verification enabled, the copy activity fails if the number of rows read from the source differs from the number of rows copied to the destination plus the number of incompatible rows that were skipped.
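
The row-count rule for tabular copies in the last bullet can be written as a one-line check. The following is a minimal sketch for illustration only; the function and parameter names are hypothetical and are not part of any Azure Data Factory API.

```python
def row_counts_consistent(rows_read: int, rows_copied: int, rows_skipped: int) -> bool:
    """Return True when every row read from the source is accounted for.

    Mirrors the rule described above: the copy is consistent when
    rows read == rows copied + incompatible rows skipped.
    All names here are illustrative, not service-defined.
    """
    return rows_read == rows_copied + rows_skipped


# Example: 100 rows read, 98 copied, 2 skipped as incompatible.
print(row_counts_consistent(100, 98, 2))
```

Under this rule, a run that reads 100 rows but can only account for 99 of them (copied plus skipped) would be treated as inconsistent, and the copy activity would fail.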

Configuration

The following example provides a JSON definition to enable data consistency verification in the copy activity:

"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureDataLakeStoreReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureDataLakeStoreWriteSettings"
        }
    },
    "validateDataConsistency": true,
    "skipErrorFile": {
        "dataInconsistency": true
    },
    "logStorageSettings": {
        "linkedServiceName": {
            "referenceName": "ADLSGen2_storage",
            "type": "LinkedServiceReference"
        },
        "path": "/sessionlog/"
    }
}

| Property | Description | Allowed values | Required |
| --- | --- | --- | --- |
| validateDataConsistency | If you set this property to true, then when copying binary files, the copy activity checks the file size, lastModifiedDate, and MD5 checksum of each binary file copied from the source to the destination store to ensure data consistency between the two stores. When copying tabular data, the copy activity checks the total row count after the job completes to ensure that the number of rows read from the source equals the number of rows copied to the destination plus the number of incompatible rows that were skipped. Be aware that enabling this option affects copy performance. | True, False (default) | No |
| dataInconsistency | One of the key-value pairs within the skipErrorFile property bag that determines whether to skip inconsistent files. True: copy the rest by skipping inconsistent files. False: abort the copy activity once an inconsistent file is found. Be aware that this property is only valid when you are copying binary files and have set validateDataConsistency to True. | True, False (default) | No |
| logStorageSettings | A group of properties that can be specified to enable the session log to record skipped files. | | No |
| linkedServiceName | The linked service of Azure Blob Storage or Azure Data Lake Storage Gen2 used to store the session log files. | The name of an AzureBlobStorage or AzureBlobFS type linked service, which refers to the instance that you use to store the log files. | No |
| path | The path of the log files. | Specify the path where you want to store the log files. If you do not provide a path, the service creates a container for you. | No |
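
To show how the properties in the table relate to each other, the following sketch assembles the consistency-related part of `typeProperties` as a Python dictionary and rejects the one combination the table rules out (skipping inconsistent files without enabling verification). The helper name `build_consistency_settings` and its parameters are hypothetical; only the JSON property names come from the table above.

```python
import json
from typing import Optional


def build_consistency_settings(
    validate: bool = True,
    skip_inconsistent: bool = False,
    log_linked_service: Optional[str] = None,
    log_path: Optional[str] = None,
) -> dict:
    """Build the consistency-related copy activity properties (illustrative helper)."""
    # dataInconsistency is only valid when validateDataConsistency is true.
    if skip_inconsistent and not validate:
        raise ValueError("dataInconsistency requires validateDataConsistency to be true")

    settings: dict = {"validateDataConsistency": validate}
    if skip_inconsistent:
        settings["skipErrorFile"] = {"dataInconsistency": True}
    if log_linked_service is not None:
        log_settings: dict = {
            "linkedServiceName": {
                "referenceName": log_linked_service,
                "type": "LinkedServiceReference",
            }
        }
        # path is optional; the table says the service creates a container if omitted.
        if log_path is not None:
            log_settings["path"] = log_path
        settings["logStorageSettings"] = log_settings
    return settings


# Reproduces the consistency settings used in the JSON example above.
print(json.dumps(
    build_consistency_settings(True, True, "ADLSGen2_storage", "/sessionlog/"),
    indent=4,
))
```

The resulting dictionary would still need to be merged with the `source` and `sink` definitions before it forms a complete `typeProperties` section.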

Note

  • When copying binary files from or to Azure Blob or Azure Data Lake Storage Gen2, ADF performs block-level MD5 checksum verification by using the Azure Blob API and the Azure Data Lake Storage Gen2 API. If ContentMD5 exists on files in Azure Blob or Azure Data Lake Storage Gen2 used as the data source, ADF also performs file-level MD5 checksum verification after reading the files. After copying files to Azure Blob or Azure Data Lake Storage Gen2 as the data destination, ADF writes ContentMD5 to Azure Blob or Azure Data Lake Storage Gen2, which downstream applications can use for further data consistency verification.
  • ADF performs file size verification when copying binary files between any data stores.
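
As an illustration of how a downstream application might use ContentMD5, the sketch below recomputes an MD5 digest over a file's bytes and compares it with a stored value. It assumes the stored ContentMD5 is the base64-encoded MD5 digest, as with the HTTP Content-MD5 header; how the property is fetched from storage is out of scope here, and the function names are hypothetical.

```python
import base64
import hashlib


def content_md5_base64(data: bytes) -> str:
    """Return the base64-encoded MD5 digest of the given bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def matches_content_md5(data: bytes, content_md5: str) -> bool:
    """Compare locally computed MD5 with a stored ContentMD5 value.

    Assumption: content_md5 is base64-encoded, which is how the value is
    normally represented; adjust if your client returns raw bytes instead.
    """
    return content_md5_base64(data) == content_md5


payload = b"example file contents"
stored = content_md5_base64(payload)  # stand-in for a value read from storage
print(matches_content_md5(payload, stored))
```

For large files, the digest would normally be computed incrementally with `hashlib.md5().update(...)` over chunks rather than loading the whole file into memory.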

Monitoring

Output from copy activity

After the copy activity run completes, you can see the result of data consistency verification in the output of each copy activity run:

"output": {
    "dataRead": 695,
    "dataWritten": 186,
    "filesRead": 3,
    "filesWritten": 1,
    "filesSkipped": 2,
    "throughput": 297,
    "logPath": "https://myblobstorage.blob.core.chinacloudapi.cn//myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
    "dataConsistencyVerification": {
        "VerificationResult": "Verified",
        "InconsistentData": "Skipped"
    }
}

You can see the details of data consistency verification in the "dataConsistencyVerification" property.

Value of VerificationResult:

  • Verified: Your copied data has been verified to be consistent between the source and destination stores.
  • NotVerified: Your copied data has not been verified to be consistent because you have not enabled validateDataConsistency in the copy activity.
  • Unsupported: Your copied data has not been verified to be consistent because data consistency verification is not supported for this particular copy pair.

Value of InconsistentData:

  • Found: The ADF copy activity has found inconsistent data.
  • Skipped: The ADF copy activity has found and skipped inconsistent data.
  • None: The ADF copy activity has not found any inconsistent data. This can be either because your data has been verified to be consistent between the source and destination stores, or because you disabled validateDataConsistency in the copy activity.
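
The two value lists above can be turned into a small lookup that produces a readable summary from a copy activity output. This is a sketch only: it assumes the output has already been loaded into a Python dictionary shaped like the JSON sample, and the function name and wording of the messages are illustrative.

```python
VERIFICATION_MEANING = {
    "Verified": "copied data was verified to be consistent",
    "NotVerified": "not verified because validateDataConsistency was not enabled",
    "Unsupported": "not verified because this copy pair does not support verification",
}

INCONSISTENT_MEANING = {
    "Found": "inconsistent data was found",
    "Skipped": "inconsistent data was found and skipped",
    "None": "no inconsistent data was found",
}


def describe_consistency(output: dict) -> str:
    """Summarize the dataConsistencyVerification section of a copy activity output."""
    result = output.get("dataConsistencyVerification") or {}
    verification = result.get("VerificationResult")
    inconsistent = result.get("InconsistentData")
    return "{}; {}".format(
        VERIFICATION_MEANING.get(
            verification, "unrecognized VerificationResult: {!r}".format(verification)
        ),
        INCONSISTENT_MEANING.get(
            inconsistent, "unrecognized InconsistentData: {!r}".format(inconsistent)
        ),
    )


sample_output = {
    "filesRead": 3,
    "filesWritten": 1,
    "filesSkipped": 2,
    "dataConsistencyVerification": {
        "VerificationResult": "Verified",
        "InconsistentData": "Skipped",
    },
}
print(describe_consistency(sample_output))
```

Because "None" is also reported when validateDataConsistency is disabled, it is worth reading InconsistentData together with VerificationResult rather than on its own.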

Session log from copy activity

If you configure the copy activity to log inconsistent files, you can find the log file at this path: https://[your-blob-account].blob.core.chinacloudapi.cn/[path-if-configured]/copyactivity-logs/[copy-activity-name]/[copy-activity-run-id]/[auto-generated-GUID].csv. The log files are CSV files.

The schema of a log file is as follows:

| Column | Description |
| --- | --- |
| Timestamp | The timestamp when ADF skips the inconsistent file. |
| Level | The log level of this item. It is 'Warning' for items that record a skipped file. |
| OperationName | The ADF copy activity operation performed on each file. It is 'FileSkip' for a file that is skipped. |
| OperationItem | The name of the file that is skipped. |
| Message | More information explaining why the file was skipped. |

An example of a log file is as follows:

Timestamp, Level, OperationName, OperationItem, Message
2020-02-26 06:22:56.3190846, Warning, FileSkip, "sample1.csv", "File is skipped after read 548000000 bytes: ErrorCode=DataConsistencySourceDataChanged,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Source file 'sample1.csv' is changed by other clients during the copy activity run.,Source=,'." 

From the log file above, you can see that sample1.csv was skipped because it could not be verified to be consistent between the source and destination stores. The message gives the reason: sample1.csv was being changed by another application while the ADF copy activity was copying it.
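
A session log in this format can be read with a standard CSV parser. The sketch below lists the skipped files and their messages; it assumes the log has already been downloaded as text and follows the layout of the sample above (a space after each comma, double-quoted file names and messages). The function name is hypothetical.

```python
import csv
import io
from typing import List, Tuple


def skipped_files(log_text: str) -> List[Tuple[str, str]]:
    """Return (file name, message) pairs for every FileSkip entry in a session log."""
    # skipinitialspace=True handles the space that follows each comma in the log.
    reader = csv.DictReader(io.StringIO(log_text), skipinitialspace=True)
    skipped = []
    for row in reader:
        if (row.get("OperationName") or "").strip() == "FileSkip":
            name = (row.get("OperationItem") or "").strip()
            message = (row.get("Message") or "").strip()
            skipped.append((name, message))
    return skipped


sample_log = (
    "Timestamp, Level, OperationName, OperationItem, Message\n"
    '2020-02-26 06:22:56.3190846, Warning, FileSkip, "sample1.csv", '
    '"File is skipped after read 548000000 bytes: '
    "ErrorCode=DataConsistencySourceDataChanged,"
    "Message=Source file 'sample1.csv' is changed by other clients "
    'during the copy activity run."\n'
)

for name, message in skipped_files(sample_log):
    print(name, "->", message)
```

Keep in mind the preview limitation noted earlier: if the copy activity fails, the log may be incomplete, so an empty result does not by itself prove that no files were skipped.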

Next steps

See the other Copy Activity articles: