Tutorial: Migrate your data to a Cassandra API account in Azure Cosmos DB

As a developer, you might have existing Cassandra workloads running on-premises or in the cloud that you want to migrate to Azure. You can migrate such workloads to a Cassandra API account in Azure Cosmos DB. This tutorial describes the options available for migrating Apache Cassandra data into a Cassandra API account in Azure Cosmos DB.

This tutorial covers the following tasks:

  • Plan for migration
  • Prerequisites for migration
  • Migrate data using the cqlsh COPY command
  • Migrate data using Spark

If you don't have an Azure subscription, create a trial account before you begin.

Prerequisites for migration

  • Estimate your throughput needs: Before migrating data to the Cassandra API account in Azure Cosmos DB, estimate the throughput needs of your workload. In general, start with the average throughput required by CRUD operations, then add the extra throughput required for Extract Transform Load (ETL) or spiky operations. You need the following details to plan for migration:

    • Existing data size or estimated data size: Defines the minimum database size and throughput requirement. If you are estimating the data size for a new application, you can assume that the data is uniformly distributed across the rows and estimate the total by multiplying the expected number of rows by the average row size.

    • Required throughput: Approximate read (query/get) and write (update/delete/insert) throughput rate. This value, together with the steady-state data size, is needed to compute the required request units (RUs).

    • The schema: Connect to your existing Cassandra cluster through cqlsh and export the schema from Cassandra:

      cqlsh [IP] -e "DESC SCHEMA" > orig_schema.cql
      

    After you identify the requirements of your existing workload, create an Azure Cosmos account, database, and containers according to the gathered throughput requirements.

    • Determine the RU charge for an operation: You can determine the RUs by using any of the SDKs supported by the Cassandra API. This example shows the .NET version of getting RU charges.

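      // Requires: using System; using System.Linq;
      var tableInsertStatement = table.Insert(sampleEntity);
      var insertResult = await tableInsertStatement.ExecuteAsync();

      // The RU charge is returned in the custom payload under the "RequestCharge" key.
      byte[] valueInBytes = insertResult.Info.IncomingPayload["RequestCharge"];
      // The value is a double in big-endian byte order; reverse it before
      // calling BitConverter on a little-endian platform.
      double requestCharge = BitConverter.ToDouble(valueInBytes.Reverse().ToArray(), 0);
      Console.WriteLine($"RequestCharge: {requestCharge}");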
      
  • Allocate the required throughput: Azure Cosmos DB can automatically scale storage and throughput as your requirements grow. You can estimate your throughput needs by using the Azure Cosmos DB request unit calculator.

  • Create tables in the Cassandra API account: Before you start migrating data, pre-create all your tables from the Azure portal or from cqlsh. If you are migrating to an Azure Cosmos account that has database-level throughput, make sure to provide a partition key when creating the Azure Cosmos containers.

  • Increase throughput: The duration of your data migration depends on the amount of throughput you provisioned for the tables in Azure Cosmos DB. Increase the throughput for the duration of the migration. With higher throughput, you can avoid rate limiting and migrate in less time. After you've completed the migration, decrease the throughput to save costs. It's also recommended to have the Azure Cosmos account in the same region as your source database.

  • Enable TLS: Azure Cosmos DB has strict security requirements and standards. Be sure to enable TLS when you interact with your account. When you use CQL with SSH, you have an option to provide TLS information.
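
The sizing inputs listed in these prerequisites reduce to simple arithmetic. The following is a minimal Python sketch, not an official calculator: the `Workload` fields, the per-operation RU charges, and the sample numbers are all hypothetical and should be replaced with values measured from your own workload (for example, the RU charge returned by the SDK).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; every field is an assumption."""
    rows: int              # expected number of rows
    avg_row_bytes: int     # average row size in bytes
    reads_per_sec: float   # steady-state read rate
    writes_per_sec: float  # steady-state write rate
    ru_per_read: float     # measured RU charge of one read
    ru_per_write: float    # measured RU charge of one write


def estimated_data_size_gb(w: Workload) -> float:
    """Data size, assuming rows are uniformly sized: rows x average row size."""
    return w.rows * w.avg_row_bytes / 1024**3


def steady_state_rus(w: Workload) -> float:
    """RU/s needed for the average CRUD load (excludes ETL or spiky traffic)."""
    return w.reads_per_sec * w.ru_per_read + w.writes_per_sec * w.ru_per_write


def migration_hours(w: Workload, provisioned_rus: float) -> float:
    """Lower bound on load time: total write RUs divided by provisioned RU/s."""
    total_rus = w.rows * w.ru_per_write
    return total_rus / provisioned_rus / 3600


if __name__ == "__main__":
    example = Workload(
        rows=100_000_000, avg_row_bytes=1024,
        reads_per_sec=500, writes_per_sec=200,
        ru_per_read=2.0, ru_per_write=10.0,
    )
    print(f"data size: {estimated_data_size_gb(example):.1f} GB")
    print(f"steady-state throughput: {steady_state_rus(example):.0f} RU/s")
    print(f"migration at 50,000 RU/s: {migration_hours(example, 50_000):.1f} h")
```

The migration estimate ignores network latency and rate limiting, so treat it as a lower bound. It does show why temporarily raising the provisioned throughput shortens the migration roughly in proportion.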

Options to migrate data

You can move data from existing Cassandra workloads to Azure Cosmos DB by using the following options:

Migrate data using the cqlsh COPY command

The CQL COPY command copies local data to the Cassandra API account in Azure Cosmos DB. Use the following steps to copy data:

  1. Get your Cassandra API account's connection string information:

    • Sign in to the Azure portal, and navigate to your Azure Cosmos account.

    • Open the Connection String pane, which contains all the information that you need to connect to your Cassandra API account from cqlsh.

  2. Sign in to cqlsh using the connection information from the portal.

  3. Use the CQL COPY command to copy local data to the Cassandra API account.

    COPY exampleks.tablename FROM 'filefolderx/*.csv'
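
Steps 2 and 3 can also be scripted. The following Python sketch only assembles the cqlsh command line and does not run it. The function name, host name, and credentials are hypothetical placeholders; take the real values, including the port, from the Connection String pane. It assumes a cqlsh build that accepts the `-u`, `-p`, `--ssl`, and `-e` options.

```python
import shlex


def build_cqlsh_copy_argv(host, port, username, password,
                          keyspace, table, csv_glob):
    """Return the argv list for a one-shot `cqlsh ... -e "COPY ..."` call."""
    # The file name (or pattern) must be a quoted CQL string literal.
    copy_stmt = f"COPY {keyspace}.{table} FROM '{csv_glob}'"
    return [
        "cqlsh", host, str(port),
        "-u", username,
        "-p", password,
        "--ssl",          # connect over TLS, as the prerequisites require
        "-e", copy_stmt,  # execute the statement and exit
    ]


if __name__ == "__main__":
    argv = build_cqlsh_copy_argv(
        host="<account-host-from-portal>",  # placeholder
        port=10350,                         # assumed port; confirm in the portal
        username="<account-name>",          # placeholder
        password="<primary-password>",      # placeholder
        keyspace="exampleks",
        table="tablename",
        csv_glob="filefolderx/*.csv",
    )
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would execute the copy. Building an argv list rather than a shell string avoids quoting problems with the password and the COPY statement.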
    

Migrate data using Spark

Use the following steps to migrate data to the Cassandra API account with Spark:

Migrating data by using Spark jobs is a recommended option if you have data residing in an existing cluster in Azure virtual machines or any other cloud. This option requires Spark to be set up as an intermediary for one-time or regular ingestion. You can accelerate this migration by using Azure ExpressRoute connectivity between on-premises and Azure.
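
As a rough illustration of Spark acting as the intermediary, the PySpark sketch below reads one table from the source cluster and appends it to the same table in the Cassandra API account. It assumes the Spark Cassandra Connector is on the Spark classpath and that connection settings may be passed as per-read and per-write options; the helper names, hosts, and credentials are hypothetical, and this is not the tutorial's official job definition.

```python
def cassandra_options(host, port, keyspace, table,
                      username=None, password=None, ssl=False):
    """Build Spark Cassandra Connector options for one cluster and table."""
    opts = {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": str(port),
        "spark.cassandra.connection.ssl.enabled": str(ssl).lower(),
        "keyspace": keyspace,
        "table": table,
    }
    if username is not None:
        opts["spark.cassandra.auth.username"] = username
        opts["spark.cassandra.auth.password"] = password or ""
    return opts


def migrate_table(spark, source_opts, target_opts):
    """Copy one table from source to target; the target table must exist."""
    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(**source_opts)
          .load())
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(**target_opts)
       .mode("append")
       .save())


if __name__ == "__main__":
    # pyspark is imported here so the helpers above can be used without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cassandra-migration").getOrCreate()
    source = cassandra_options("<source-host>", 9042, "exampleks", "tablename")
    target = cassandra_options(
        "<account-host-from-portal>", 10350, "exampleks", "tablename",
        username="<account-name>", password="<primary-password>", ssl=True,
    )
    migrate_table(spark, source, target)
```

Append mode relies on the tables having been pre-created in the Cassandra API account, as described in the prerequisites. If the job is rate limited, raising the provisioned throughput or reducing the connector's write concurrency are the usual adjustments.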

Clean up resources

When they're no longer needed, you can delete the resource group, the Azure Cosmos account, and all the related resources. To do so, select the resource group that contains the Azure Cosmos account, select Delete, and then confirm the name of the resource group to delete.

Next steps

In this tutorial, you've learned how to migrate your data to a Cassandra API account in Azure Cosmos DB. You can now proceed to the following article to learn about other Azure Cosmos DB concepts: