Create and distribute tables in Azure Cosmos DB for PostgreSQL

Article
11/22/2023

APPLIES TO: Azure Cosmos DB for PostgreSQL (powered by the Citus database extension to PostgreSQL)

In this example, we'll use Azure Cosmos DB for PostgreSQL distributed tables to store and query events recorded from GitHub open source contributors.

Prerequisites

To follow this quickstart, you first need to:

Create a cluster in the Azure portal.
Connect to the cluster with psql to run SQL commands.

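Create tables

Once you've connected via psql, let's create the tables. Copy and paste the following commands into the psql terminal window, and press Enter to run them: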

CREATE TABLE github_users
(
	user_id bigint,
	url text,
	login text,
	avatar_url text,
	gravatar_id text,
	display_login text
);

CREATE TABLE github_events
(
	event_id bigint,
	event_type text,
	event_public boolean,
	repo_id bigint,
	payload jsonb,
	repo jsonb,
	user_id bigint,
	org jsonb,
	created_at timestamp
);

CREATE INDEX event_type_index ON github_events (event_type);
CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops);

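Notice the GIN index on payload in github_events. The index allows fast querying in the JSONB column. Because Citus is a PostgreSQL extension, Azure Cosmos DB for PostgreSQL supports advanced PostgreSQL features such as the JSONB data type for storing semi-structured data.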
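As a minimal sketch of the kind of query this index can serve: a GIN index built with jsonb_path_ops supports the JSONB containment operator @>. The key and value used below (ref_type, branch) are assumptions about the shape of the payload documents and are not taken from this quickstart; substitute a key that actually appears in your data.

```sql
-- Count events whose payload contains a given key/value pair.
-- The @> (containment) operator is one of the operators that a
-- jsonb_path_ops GIN index can accelerate.
-- NOTE: "ref_type" and "branch" are illustrative assumptions.
SELECT count(*)
FROM github_events
WHERE payload @> '{"ref_type": "branch"}';
```

Prefixing the statement with EXPLAIN shows whether the planner chooses payload_index for it.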

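Distribute tables

create_distributed_table() is the function that Azure Cosmos DB for PostgreSQL provides to distribute tables across multiple machines so that queries can use their combined resources. The function splits a table into shards, which can be spread across nodes to increase storage capacity and compute performance.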

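Note

In real applications, when your workload fits in 64 vCores, 256 GB of RAM, and 2 TB of storage, you can use a single-node cluster. In that case, distributing tables is optional. Later, you can distribute tables as needed by using create_distributed_table_concurrently.

Let's distribute the tables: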

SELECT create_distributed_table('github_users', 'user_id');
SELECT create_distributed_table('github_events', 'user_id');
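To see what the calls above produced, one option is to look at the citus_shards view, which lists each shard and the node that hosts it. This is a sketch that assumes a Citus version recent enough to provide the view; the LIMIT is there only to keep the output short.

```sql
-- List a few shards of github_events and the nodes that host them.
-- Assumes the citus_shards view is available on the cluster.
SELECT shard_name, nodename, nodeport, shard_size
FROM citus_shards
WHERE table_name = 'github_events'::regclass
ORDER BY shard_name
LIMIT 5;
```

On a single-node cluster, every shard reports the same node.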

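Important

You must distribute tables or use schema-based sharding to take advantage of Azure Cosmos DB for PostgreSQL performance features. If you don't distribute tables or schemas, worker nodes can't help run queries that involve their data.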

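Load data into distributed tables

We're ready to fill the tables with sample data. For this quickstart, we'll use a dataset previously captured from the GitHub API.

We'll use the pg_azure_storage extension to load the data directly from a public container in Azure Blob Storage. First, create the extension in the database: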

SELECT * FROM create_extension('azure_storage');

Run the following commands to have the database fetch the example CSV files and load them into the database tables.

-- download users and store in table

COPY github_users FROM 'https://pgquickstart.blob.core.chinacloudapi.cn/github/users.csv.gz';

-- download events and store in table

COPY github_events FROM 'https://pgquickstart.blob.core.chinacloudapi.cn/github/events.csv.gz';

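Notice how the extension recognizes that the URLs provided to the COPY command point to Azure Blob Storage. The files are gzip-compressed, and the extension handles decompression automatically as well.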
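A quick way to confirm that both COPY commands completed is to count the rows in each table. This is only a sanity check; the exact counts depend on the sample files, so none are shown here.

```sql
-- Sanity check after loading: both queries should return a non-zero count.
SELECT count(*) AS user_rows  FROM github_users;
SELECT count(*) AS event_rows FROM github_events;
```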

You can use the citus_tables view to review details of the distributed tables, including their sizes:

SELECT * FROM citus_tables;

  table_name   | citus_table_type | distribution_column | colocation_id | table_size | shard_count | table_owner | access_method 
---------------+------------------+---------------------+---------------+------------+-------------+-------------+---------------
 github_events | distributed      | user_id             |             1 | 388 MB     |          32 | citus       | heap
 github_users  | distributed      | user_id             |             1 | 39 MB      |          32 | citus       | heap
(2 rows)
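Both tables in the output share the same colocation_id, which means rows with the same user_id are placed on the same node, so joins on user_id can be executed locally on each node. A narrower query, sketched here, checks just that property for the two tables in this quickstart:

```sql
-- Confirm that the two tables are co-located: they should report
-- the same colocation_id and the same shard_count.
SELECT table_name, distribution_column, colocation_id, shard_count
FROM citus_tables
WHERE table_name IN ('github_users'::regclass, 'github_events'::regclass)
ORDER BY table_name;
```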

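Next steps

Now we have distributed the tables and loaded them with data. Next, let's try running queries across the distributed tables.

Run distributed queries >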
