Upload large amounts of random data in parallel to Azure storage

This tutorial is part two of a series. It shows you how to deploy an application that uploads large amounts of random data to an Azure storage account.

In part two of the series, you learn how to:

  • Configure the connection string
  • Build the application
  • Run the application
  • Validate the number of connections

Azure Blob storage provides a scalable service for storing your data. To make your application as performant as possible, it helps to understand how Blob storage works. It is also important to know the limits for Azure blobs. To learn more about these limits, see Scalability and performance targets for Blob storage.

Partition naming is another potentially important factor when designing a high-performance application that uses blobs. For block sizes greater than or equal to 4 MiB, High-Throughput block blobs are used, and partition naming does not affect performance. For block sizes less than 4 MiB, Azure storage uses a range-based partitioning scheme to scale and load balance. With this scheme, files with similar naming conventions or prefixes go to the same partition. The partitioning logic also takes into account the name of the container that the files are uploaded to. In this tutorial, you use files that have GUIDs for names and randomly generated content. The files are then uploaded to five different containers with random names.
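As a rough illustration of the naming scheme described above, the following C# sketch generates GUID-based blob names and spreads them round-robin across five randomly named containers. The class and method names (NamingSketch, Plan) are hypothetical and are not taken from the sample application. Only the GUID naming and the five-container layout come from the tutorial.

```csharp
using System;
using System.Collections.Generic;

public static class NamingSketch
{
    // Hypothetical helper: pairs each generated blob name with a container name.
    public static List<(string Container, string Blob)> Plan(int fileCount, int containerCount = 5)
    {
        // Random (GUID) container names avoid a shared prefix.
        var containers = new string[containerCount];
        for (int i = 0; i < containerCount; i++)
        {
            containers[i] = Guid.NewGuid().ToString();
        }

        var plan = new List<(string Container, string Blob)>();
        for (int i = 0; i < fileCount; i++)
        {
            // GUID file names, assigned to containers in round-robin order.
            string blob = Guid.NewGuid().ToString() + ".txt";
            plan.Add((containers[i % containerCount], blob));
        }
        return plan;
    }

    public static void Main()
    {
        foreach (var (container, blob) in Plan(10))
        {
            Console.WriteLine($"{container}/{blob}");
        }
    }
}
```

Because every name starts with random characters, consecutive uploads are unlikely to share a prefix, which is the property that matters for range-based partitioning.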

Prerequisites

To complete this tutorial, you must have completed the previous Storage tutorial: Create a virtual machine and storage account for a scalable application.

Remote into your virtual machine

Use the following command on your local machine to create a remote desktop session with the virtual machine. Replace the IP address with the publicIPAddress of your virtual machine. When prompted, enter the credentials you used when you created the virtual machine.

mstsc /v:<publicIpAddress>

Configure the connection string

In the Azure portal, navigate to your storage account. Select Access keys under Settings in your storage account. Copy the connection string from the primary or secondary key. Log in to the virtual machine you created in the previous tutorial. Open a Command Prompt as an administrator and run the setx command with the /m switch, which saves the value as a machine-level environment variable. The environment variable is not available until you reopen the Command Prompt. Replace <storageConnectionString> in the following sample:

setx storageconnectionstring "<storageConnectionString>" /m
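For illustration, a program could read this variable back as in the following C# sketch. The variable name storageconnectionstring comes from the setx command above. The class and method names are hypothetical, and how the sample application itself reads the value is not shown in this tutorial.

```csharp
using System;

public static class ConnectionStringSketch
{
    // Returns the connection string, or throws if the variable is not set.
    public static string Read(string variableName = "storageconnectionstring")
    {
        string value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set. " +
                "Open a new Command Prompt after running setx.");
        }
        return value;
    }

    public static void Main()
    {
        try
        {
            string connectionString = Read();
            // Print only the length so the account key is never echoed.
            Console.WriteLine($"Connection string found ({connectionString.Length} characters).");
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

If the variable appears to be missing right after running setx, the most likely cause is that the current Command Prompt was opened before the variable was saved.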

When finished, open another Command Prompt, navigate to D:\git\storage-dotnet-perf-scale-app, and type dotnet build to build the application.

Run the application

Navigate to D:\git\storage-dotnet-perf-scale-app.

Type dotnet run to run the application. The first time you run dotnet, it populates your local package cache to improve restore speed and enable offline access. This step takes up to a minute to complete and happens only once.

dotnet run

The application creates five randomly named containers and begins uploading the files in the staging directory to the storage account. The application sets the minimum threads to 100 and the DefaultConnectionLimit to 100 to ensure that a large number of concurrent connections are allowed when running the application.
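The two settings mentioned above can be sketched in C# as follows. This is a minimal sketch rather than the sample application's actual startup code: the class and method names are hypothetical, and whether ServicePointManager.DefaultConnectionLimit affects a particular HTTP client depends on the .NET runtime and storage SDK version in use (recent .NET versions may also flag the API as obsolete).

```csharp
using System;
using System.Net;
using System.Threading;

public static class ConcurrencySketch
{
    // Raises the thread pool minimum and the default connection limit.
    // Returns false if the thread pool rejected the requested minimum.
    public static bool Configure(int minWorkerThreads = 100, int connectionLimit = 100)
    {
        // Keep the current minimum for I/O completion threads unchanged.
        ThreadPool.GetMinThreads(out _, out int minIoThreads);
        bool accepted = ThreadPool.SetMinThreads(minWorkerThreads, minIoThreads);

        // Allow up to connectionLimit concurrent connections per endpoint.
        ServicePointManager.DefaultConnectionLimit = connectionLimit;
        return accepted;
    }

    public static void Main()
    {
        bool accepted = Configure();
        ThreadPool.GetMinThreads(out int workers, out _);
        Console.WriteLine($"Accepted: {accepted}, min worker threads: {workers}, " +
                          $"connection limit: {ServicePointManager.DefaultConnectionLimit}");
    }
}
```

Settings like these are typically applied once at startup, before any requests are issued.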

In addition to the threading and connection limit settings, the BlobRequestOptions passed to the UploadFromFileAsync method are configured to use parallelism and to disable MD5 hash validation. The files are uploaded in 100-MB blocks. This configuration provides better performance, but it can be costly on a poorly performing network, because if a failure occurs, the entire 100-MB block is retried.
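To get a feel for what a 100-MB block size means, the following C# sketch computes how many blocks a file of a given size would need, assuming every file is split into blocks of exactly that size. The class and method names are hypothetical. The block size matches the value the tutorial code assigns to StreamWriteSizeInBytes.

```csharp
using System;

public static class BlockMathSketch
{
    // 100 * 1024 * 1024 bytes, the block size used in the tutorial.
    public const long BlockSize = 100L * 1024 * 1024;

    // Number of blocks needed for a file, rounding up for a partial last block.
    public static long BlockCount(long fileSizeBytes)
    {
        if (fileSizeBytes <= 0)
        {
            return 0;
        }
        return (fileSizeBytes + BlockSize - 1) / BlockSize;
    }

    public static void Main()
    {
        long oneGiB = 1024L * 1024 * 1024;
        // 1 GiB / 100 MiB = 10.24, so 11 blocks are needed.
        Console.WriteLine($"A 1 GiB file needs {BlockCount(oneGiB)} blocks.");
    }
}
```

Fewer, larger blocks mean fewer requests per file, but each failed block costs up to 100 MB of retransmission, which is the trade-off described above.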

Property | Value | Description
ParallelOperationThreadCount | 8 | The setting breaks the blob into blocks when uploading. For highest performance, this value should be eight times the number of cores.
DisableContentMD5Validation | true | This property disables checking the MD5 hash of the content uploaded. Disabling MD5 validation produces a faster transfer, but it does not confirm the validity or integrity of the files being transferred.
StoreBlobContentMD5 | false | This property determines whether an MD5 hash is calculated and stored with the file.
RetryPolicy | 2-second backoff with a maximum of 10 retries | Determines the retry policy of requests. Connection failures are retried. In this example, an ExponentialRetry policy is configured with a 2-second backoff and a maximum retry count of 10. This setting is important when your application gets close to the scalability targets for Blob storage. For more information, see Scalability and performance targets for Blob storage.

The UploadFilesAsync task is shown in the following example:

private static async Task UploadFilesAsync()
{
    // Create five randomly named containers to upload files to.
    CloudBlobContainer[] containers = await GetRandomContainersAsync();
    var currentdir = System.IO.Directory.GetCurrentDirectory();

    // path to the directory to upload
    string uploadPath = currentdir + "\\upload";
    Stopwatch time = Stopwatch.StartNew();
    try
    {
        Console.WriteLine("Iterating in directory: {0}", uploadPath);
        int count = 0;
        int max_outstanding = 100;
        int completed_count = 0;

        // Define the BlobRequestOptions on the upload.
        // Because multiple large files are uploaded in large blocks, failed connections should be retried with an exponential
        // backoff policy (the RetryPolicy setting described above; it is not set in this excerpt). Parallel operations are enabled
        // with a thread count of 8; this value should be a multiple of the number of cores on the machine.
        // MD5 hash validation is disabled for this example, which improves the upload speed.
        BlobRequestOptions options = new BlobRequestOptions
        {
            ParallelOperationThreadCount = 8,
            DisableContentMD5Validation = true,
            StoreBlobContentMD5 = false
        };
        // Create a SemaphoreSlim to cap the number of uploads in flight at max_outstanding.
        SemaphoreSlim sem = new SemaphoreSlim(max_outstanding, max_outstanding);

        List<Task> tasks = new List<Task>();
        Console.WriteLine("Found {0} file(s)", Directory.GetFiles(uploadPath).Count());

        // Iterate through the files
        foreach (string path in Directory.GetFiles(uploadPath))
        {
            // Pick a container in round-robin order and get a blob reference named after the file.
            var container = containers[count % 5];
            string fileName = Path.GetFileName(path);
            Console.WriteLine("Uploading {0} to container {1}.", path, container.Name);
            CloudBlockBlob blockBlob = container.GetBlockBlobReference(fileName);

            // Set block size to 100MB.
            blockBlob.StreamWriteSizeInBytes = 100 * 1024 * 1024;
            await sem.WaitAsync();

            // Create a task for each file that is uploaded. Each task is added to a collection so that all uploads run asynchronously.
            tasks.Add(blockBlob.UploadFromFileAsync(path, null, options, null).ContinueWith((t) =>
            {
                sem.Release();
                Interlocked.Increment(ref completed_count);
            }));
            count++;
        }

        // Creates an asynchronous task that completes when all the uploads complete.
        await Task.WhenAll(tasks);

        time.Stop();

        Console.WriteLine("Upload has been completed in {0} seconds. Press any key to continue", time.Elapsed.TotalSeconds.ToString());

        Console.ReadLine();
    }
    catch (DirectoryNotFoundException ex)
    {
        Console.WriteLine("Error parsing files in the directory: {0}", ex.Message);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}

The following example shows truncated output from the application running on a Windows system.

Created container https://mystorageaccount.blob.core.chinacloudapi.cn/9efa7ecb-2b24-49ff-8e5b-1d25e5481076
Created container https://mystorageaccount.blob.core.chinacloudapi.cn/bbe5f0c8-be9e-4fc3-bcbd-2092433dbf6b
Created container https://mystorageaccount.blob.core.chinacloudapi.cn/9ac2f71c-6b44-40e7-b7be-8519d3ba4e8f
Created container https://mystorageaccount.blob.core.chinacloudapi.cn/47646f1a-c498-40cd-9dae-840f46072180
Created container https://mystorageaccount.blob.core.chinacloudapi.cn/38b2cdab-45fa-4cf9-94e7-d533837365aa
Iterating in directory: D:\git\storage-dotnet-perf-scale-app\upload
Found 50 file(s)
Starting upload of D:\git\storage-dotnet-perf-scale-app\upload\1d596d16-f6de-4c4c-8058-50ebd8141e4d.txt to container 9efa7ecb-2b24-49ff-8e5b-1d25e5481076.
Starting upload of D:\git\storage-dotnet-perf-scale-app\upload\242ff392-78be-41fb-b9d4-aee8152a6279.txt to container bbe5f0c8-be9e-4fc3-bcbd-2092433dbf6b.
Starting upload of D:\git\storage-dotnet-perf-scale-app\upload\38d4d7e2-acb4-4efc-ba39-f9611d0d55ef.txt to container 9ac2f71c-6b44-40e7-b7be-8519d3ba4e8f.
Starting upload of D:\git\storage-dotnet-perf-scale-app\upload\45930d63-b0d0-425f-a766-cda27ff00d32.txt to container 47646f1a-c498-40cd-9dae-840f46072180.
Starting upload of D:\git\storage-dotnet-perf-scale-app\upload\5129b385-5781-43be-8bac-e2fbb7d2bd82.txt to container 38b2cdab-45fa-4cf9-94e7-d533837365aa.
...
Upload has been completed in 142.0429536 seconds. Press any key to continue

Validate the connections

While the files are being uploaded, you can verify the number of concurrent connections to your storage account. Open a Command Prompt and type netstat -a | find /c "blob:https". This command uses netstat to count the connections that are currently open. The following example shows output similar to what you see when running the tutorial yourself. As the example shows, 800 connections were open while the random files were being uploaded to the storage account. This value changes throughout the upload. Uploading in parallel block chunks greatly reduces the time required to transfer the contents.

C:\>netstat -a | find /c "blob:https"
800

C:\>
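As an alternative to netstat, a similar count can be obtained from C# code. The following is a minimal sketch with hypothetical class and method names. Unlike the netstat filter above, it counts every established TCP connection to remote port 443 (HTTPS), not only connections to Blob storage, so the number it reports can be higher.

```csharp
using System;
using System.Net.NetworkInformation;

public static class ConnectionCountSketch
{
    // Counts established TCP connections whose remote port matches remotePort.
    public static int CountEstablished(int remotePort = 443)
    {
        int count = 0;
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        foreach (TcpConnectionInformation connection in connections)
        {
            if (connection.State == TcpState.Established &&
                connection.RemoteEndPoint.Port == remotePort)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine($"Established HTTPS connections: {CountEstablished()}");
    }
}
```

Running it repeatedly during an upload should show the count rising and falling, in the same way the netstat value does.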

Next steps

In part two of the series, you learned about uploading large amounts of random data to a storage account in parallel, including how to:

  • Configure the connection string
  • Build the application
  • Run the application
  • Validate the number of connections

Advance to part three of the series to download large amounts of data from a storage account.