Going social with Azure Cosmos DB

Living in a massively interconnected society means that, at some point in life, you become part of a social network. You use social networks to keep in touch with friends, colleagues, and family, or sometimes to share your passion with people who have common interests.

As engineers or developers, you might have wondered how these networks store and interconnect your data. Or you might even have been tasked with creating or architecting a new social network for a specific niche market. That's when the significant question arises: How is all this data stored?

Suppose you're creating a new and shiny social network where your users can post articles with related media, like pictures, videos, or even music. Users can comment on posts and give points for ratings. There will be a feed of posts that users see and interact with on the main website landing page. This doesn't sound complex at first, but for the sake of simplicity, let's stop there. (You could delve into custom user feeds affected by relationships, but that goes beyond the goal of this article.)

So, how do you store this data, and where?

You might have experience with SQL databases or have a notion of relational modeling of data. You may start drawing something like the following:

Diagram that illustrates a relational model

A perfectly normalized and pretty data structure... that doesn't scale.

Don't get me wrong, I've worked with SQL databases all my life. They're great, but like every pattern, practice, and software platform, they aren't perfect for every scenario.

Why isn't SQL the best choice in this scenario? Let's look at the structure of a single post. If I wanted to show a post in a website or application, I'd have to do a query joining eight tables (!) just to show one single post. Now picture a stream of posts that dynamically load and appear on the screen, and you might see where I'm going.

You could use an enormous SQL instance with enough power to solve thousands of queries with many joins to serve your content. But why would you, when a simpler solution exists?

The NoSQL road

This article guides you in modeling your social platform's data cost-effectively with Azure Cosmos DB, Azure's NoSQL database. It also shows how to use other Azure Cosmos DB features, like the Gremlin API. Using a NoSQL approach, storing data in JSON format, and applying denormalization, the previously complicated post can be transformed into a single document:

{
    "id":"ew12-res2-234e-544f",
    "title":"post title",
    "date":"2016-01-01",
    "body":"this is an awesome post stored on NoSQL",
    "createdBy":User,
    "images":["https://myfirstimage.png","https://mysecondimage.png"],
    "videos":[
        {"url":"https://myfirstvideo.mp4", "title":"The first video"},
        {"url":"https://mysecondvideo.mp4", "title":"The second video"}
    ],
    "audios":[
        {"url":"https://myfirstaudio.mp3", "title":"The first audio"},
        {"url":"https://mysecondaudio.mp3", "title":"The second audio"}
    ]
}

And it can be retrieved with a single query, with no joins. This query is much simpler and more straightforward, and, budget-wise, it requires fewer resources to achieve a better result.
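As a minimal sketch of that difference (plain Python, with a hypothetical in-memory dictionary standing in for a Cosmos DB container; no SDK calls), the denormalized post comes back from one point lookup already carrying everything needed to render it:

```python
# A hypothetical in-memory stand-in for a Cosmos DB container:
# each denormalized post document already embeds its media.
posts = {
    "ew12-res2-234e-544f": {
        "id": "ew12-res2-234e-544f",
        "title": "post title",
        "body": "this is an awesome post stored on NoSQL",
        "images": ["https://myfirstimage.png", "https://mysecondimage.png"],
        "videos": [{"url": "https://myfirstvideo.mp4", "title": "The first video"}],
    }
}

def get_post(post_id):
    """One point lookup returns everything needed to render the post."""
    return posts[post_id]

post = get_post("ew12-res2-234e-544f")
```

Contrast this with the relational version, where the same render would require joining the post, media, and user tables before a single pixel can be drawn.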

Azure Cosmos DB makes sure that all properties are indexed with its automatic indexing, which can even be customized. The schema-free approach lets you store documents with different and dynamic structures. Maybe tomorrow you want posts to have a list of categories or hashtags associated with them? Cosmos DB will handle the new documents with the added attributes, with no extra work required by you.
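A sketch of what the schema-free approach buys you (plain Python dictionaries standing in for documents; the attribute name `hashtags` is an illustrative assumption): tomorrow's posts can carry a new attribute alongside yesterday's posts that lack it, and a filter on that attribute simply skips the older documents.

```python
# Documents with different, dynamic structures living side by side.
old_post = {"id": "p1", "title": "first post"}
new_post = {"id": "p2", "title": "second post", "hashtags": ["#nosql", "#azure"]}

container = [old_post, new_post]

# A query can still filter on the new attribute; older documents
# simply don't match, and no migration is required.
tagged = [d for d in container if "#nosql" in d.get("hashtags", [])]
```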

Comments on a post can be treated as just other posts with a parent property. (This practice simplifies your object mapping.)

{
    "id":"1234-asd3-54ts-199a",
    "title":"Awesome post!",
    "date":"2016-01-02",
    "createdBy":User2,
    "parent":"ew12-res2-234e-544f"
}

{
    "id":"asd2-fee4-23gc-jh67",
    "title":"Ditto!",
    "date":"2016-01-03",
    "createdBy":User3,
    "parent":"ew12-res2-234e-544f"
}
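The retrieval that this parent convention enables can be sketched as follows (plain Python; the filter mirrors a single `WHERE p.parent = @postId` query, with no extra table or join for comments):

```python
documents = [
    {"id": "ew12-res2-234e-544f", "title": "post title"},
    {"id": "1234-asd3-54ts-199a", "title": "Awesome post!",
     "parent": "ew12-res2-234e-544f"},
    {"id": "asd2-fee4-23gc-jh67", "title": "Ditto!",
     "parent": "ew12-res2-234e-544f"},
]

def comments_for(post_id):
    """All comments are just posts whose parent points at post_id."""
    return [d for d in documents if d.get("parent") == post_id]

comments = comments_for("ew12-res2-234e-544f")
```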

And all social interactions can be stored on a separate object as counters:

{
    "id":"dfe3-thf5-232s-dse4",
    "post":"ew12-res2-234e-544f",
    "comments":2,
    "likes":10,
    "points":200
}
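Keeping the counters on their own document means an interaction only touches this small object, never the post itself. A minimal sketch (plain Python; `register_like` is a hypothetical helper, and in practice the write would be deferred as described later in this article):

```python
stats = {
    "id": "dfe3-thf5-232s-dse4",
    "post": "ew12-res2-234e-544f",
    "comments": 2,
    "likes": 10,
    "points": 200,
}

def register_like(stats_doc):
    """Bump the counter on the stats document instead of rewriting the post."""
    stats_doc["likes"] += 1
    return stats_doc

register_like(stats)
```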

Creating feeds is just a matter of creating documents that can hold a list of post IDs in a given relevance order:

[
    {"relevance":9, "post":"ew12-res2-234e-544f"},
    {"relevance":8, "post":"fer7-mnb6-fgh9-2344"},
    {"relevance":7, "post":"w34r-qeg6-ref6-8565"}
]

You could have a "latest" stream with posts ordered by creation date, or a "hottest" stream with the posts that got the most likes in the last 24 hours. You could even implement a custom stream for each user based on logic like followers and interests; it would still be a list of posts. It's a matter of how to build these lists, but the reading performance stays unhindered. Once you acquire one of these lists, you issue a single query to Cosmos DB using the IN keyword to get pages of posts at a time.
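A sketch of turning such a feed document into that single query (plain Python string building; the query text follows the Cosmos DB SQL API's IN syntax, and the page size of three is arbitrary):

```python
feed = [
    {"relevance": 9, "post": "ew12-res2-234e-544f"},
    {"relevance": 8, "post": "fer7-mnb6-fgh9-2344"},
    {"relevance": 7, "post": "w34r-qeg6-ref6-8565"},
]

# Sort by relevance, take one page, and build a single IN query for it.
page = sorted(feed, key=lambda e: e["relevance"], reverse=True)[:3]
ids = ", ".join('"{}"'.format(e["post"]) for e in page)
query = "SELECT * FROM posts p WHERE p.id IN ({})".format(ids)
```

However the list was assembled (latest, hottest, or per-user), reading it back stays one round trip.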

The feed streams could be built using Azure App Service's background processes: WebJobs. Once a post is created, background processing can be triggered by using Azure Storage queues and WebJobs triggered through the Azure WebJobs SDK, implementing the post propagation inside streams based on your own custom logic.

Points and likes on a post can be processed in a deferred manner using this same technique to create an eventually consistent environment.

Followers are trickier. Cosmos DB has a document size limit, and reading/writing large documents can impact the scalability of your application. So you may think about storing followers as a document with this structure:

{
    "id":"234d-sd23-rrf2-552d",
    "followersOf": "dse4-qwe2-ert4-aad2",
    "followers":[
        "ewr5-232d-tyrg-iuo2",
        "qejh-2345-sdf1-ytg5",
        //...
        "uie0-4tyg-3456-rwjh"
    ]
}

This structure might work for a user with a few thousand followers. If some celebrity joins the ranks, however, this approach will lead to a large document size, and it might eventually hit the document size cap.

To solve this problem, you can use a mixed approach. As part of a user statistics document, you can store the number of followers:

{
    "id":"234d-sd23-rrf2-552d",
    "user": "dse4-qwe2-ert4-aad2",
    "followers":55230,
    "totalPosts":452,
    "totalPoints":11342
}

You can store the actual graph of followers using the Azure Cosmos DB Gremlin API, creating vertices for each user and edges that maintain the "A-follows-B" relationships. With the Gremlin API, you can get the followers of a certain user and create more complex queries to suggest people in common. If you add to the graph the content categories that people like or enjoy, you can start weaving experiences that include smart content discovery, suggesting content that the people you follow like, or finding people with whom you have much in common.
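As a sketch of the kind of traversals involved (Gremlin query strings assembled in Python; the vertex IDs reuse the user IDs from the documents above, and the `follows` edge label is an assumption of this example, not a fixed API name):

```python
def followers_of(user_id):
    """Gremlin traversal: who has a 'follows' edge pointing at this user?"""
    return "g.V('{}').in('follows')".format(user_id)

def followed_in_common(user_a, user_b):
    """People that user_a follows who are also followed by user_b,
    a simple basis for 'people in common' suggestions."""
    return (
        "g.V('{a}').out('follows')"
        ".where(__.in('follows').hasId('{b}'))"
    ).format(a=user_a, b=user_b)

q = followers_of("dse4-qwe2-ert4-aad2")
```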

The user statistics document can still be used to create cards in the UI or quick profile previews.

The "Ladder" pattern and data duplication

As you might have noticed in the JSON document that references a post, there are many occurrences of a user. And you'd have guessed right: these duplicates mean that the information that describes a user, given this denormalization, might be present in more than one place.

To allow for faster queries, you incur data duplication. The problem with this side effect is that if a user's data changes, you need to find all the activities the user ever did and update them all. Doesn't sound practical, right?

You solve it by identifying the key attributes of a user that you show in your application for each activity. If you visually show a post in your application and display just the creator's name and picture, why store all of the user's data in the "createdBy" attribute? If for each comment you just show the user's picture, you don't really need the rest of the user's information. That's where something I call the "Ladder pattern" comes in.

Let's take user information as an example:

{
    "id":"dse4-qwe2-ert4-aad2",
    "name":"John",
    "surname":"Doe",
    "address":"742 Evergreen Terrace",
    "birthday":"1983-05-07",
    "email":"john@doe.com",
    "twitterHandle":"@john",
    "username":"johndoe",
    "password":"some_encrypted_phrase",
    "totalPoints":100,
    "totalPosts":24
}

By looking at this information, you can quickly detect which information is critical and which isn't, thus creating a "Ladder":

Diagram of the Ladder pattern

The smallest step is called a UserChunk, the minimal piece of information that identifies a user; it's what gets used for data duplication. By reducing the duplicated data size to only the information you'll "show", you reduce the possibility of massive updates.

The middle step is called the user. It's the full data that will be used on most performance-dependent queries on Cosmos DB, the most accessed and critical data. It includes the information represented by a UserChunk.

The largest step is the Extended User. It includes the critical user information plus other data that doesn't need to be read quickly or that has only eventual usage, like the sign-in process. This data can be stored outside of Cosmos DB, in Azure SQL Database or Azure Table storage.

Why would you split the user and even store this information in different places? Because, from a performance point of view, the bigger the documents, the costlier the queries. Keep documents slim, with just the right information to do all your performance-dependent queries for your social network. Store the other, extra information for eventual scenarios like full profile edits, sign-ins, and data mining for usage analytics and Big Data initiatives. You really don't care if the data gathering for data mining is slower, because it's running on Azure SQL Database. What you do care about is that your users get a fast and slim experience. A user stored on Cosmos DB would look like this code:

{
    "id":"dse4-qwe2-ert4-aad2",
    "name":"John",
    "surname":"Doe",
    "username":"johndoe",
    "email":"john@doe.com",
    "twitterHandle":"@john"
}

And a post would look like this:

{
    "id":"1234-asd3-54ts-199a",
    "title":"Awesome post!",
    "date":"2016-01-02",
    "createdBy":{
        "id":"dse4-qwe2-ert4-aad2",
        "username":"johndoe"
    }
}

When an edit arises that affects a chunk attribute, you can easily find the affected documents. Just use queries that point to the indexed attributes, such as SELECT * FROM posts p WHERE p.createdBy.id == "edited_user_id", and then update the chunks.
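That propagation step can be sketched in plain Python (the filter mirrors the indexed query above; `update_chunk` is a hypothetical helper for illustration, not an SDK call):

```python
posts = [
    {"id": "1234-asd3-54ts-199a",
     "createdBy": {"id": "dse4-qwe2-ert4-aad2", "username": "johndoe"}},
    {"id": "9999-zzzz-0000-aaaa",
     "createdBy": {"id": "other-user", "username": "janedoe"}},
]

def update_chunk(documents, user_id, new_chunk):
    """Find every document whose embedded UserChunk references user_id
    and overwrite the chunk with the fresh values."""
    touched = 0
    for doc in documents:
        if doc["createdBy"]["id"] == user_id:
            doc["createdBy"].update(new_chunk)
            touched += 1
    return touched

changed = update_chunk(posts, "dse4-qwe2-ert4-aad2", {"username": "john.doe"})
```

Because the chunk holds only the attributes you actually display, such sweeps stay small and rare.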

The underlying knowledge

After storing all this content that grows and grows every day, you might find yourself thinking: What can I do with all this stream of information from my users?

The answer is straightforward: put it to work and learn from it.

But what can you learn? A few easy examples include sentiment analysis, content recommendations based on a user's preferences, or even an automated content moderator that makes sure the content published by your social network is safe for the family.

Now that I've got you hooked, you'll probably think you need a PhD in math to extract these patterns and information out of simple databases and files, but you'd be wrong.

One available option is to use Azure Cognitive Services to analyze your users' content. Not only can you understand them better (by analyzing what they write with the Text Analytics API), but you can also detect unwanted or mature content and act accordingly with the Computer Vision API. Cognitive Services includes many out-of-the-box solutions that don't require any kind of machine learning knowledge to use.

A multiple-region scale social experience

There is one last, but not least, important topic I must address: scalability. When you design an architecture, each component should scale on its own. You will eventually need to process more data, or you will want to have bigger geographical coverage. Thankfully, achieving both tasks is a turnkey experience with Cosmos DB.

Cosmos DB supports dynamic partitioning out of the box. It automatically creates partitions based on a given partition key, which is defined as an attribute in your documents. Defining the correct partition key must be done at design time. For more information, see Partitioning in Azure Cosmos DB.

For a social experience, you must align your partitioning strategy with the way you query and write. (For example, reads within the same partition are desirable, and you avoid "hot spots" by spreading writes over multiple partitions.) Some options are: partitions based on a temporal key (day/month/week), by content category, by geographical region, or by user. It all really depends on how you'll query the data and show it in your social experience.
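A toy sketch of what partition-aligned routing looks like (plain Python; the hash-by-key routing and the partition count of four are illustrative assumptions, standing in for the hash partitioning Cosmos DB performs for you):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(partition_key):
    """Route a document to a partition by hashing its partition key.
    With the user ID as the key, all of one user's documents land
    in the same partition, while different users spread writes out."""
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_PARTITIONS

# Two documents for the same user resolve to the same partition.
a1 = partition_for("dse4-qwe2-ert4-aad2")
a2 = partition_for("dse4-qwe2-ert4-aad2")
```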

Cosmos DB will run your queries (including aggregates) across all your partitions transparently, so you don't need to add any logic as your data grows.

With time, you'll eventually grow in traffic and your resource consumption (measured in RUs, or Request Units) will increase. You will read and write more frequently as your user base grows and starts creating and reading more content, so the ability to scale your throughput is vital. Increasing your RUs is easy: you can do it with a few clicks on the Azure portal or by issuing commands through the API.

Scaling up and defining a partition key

What happens if things keep getting better? Suppose users from another region, country, or continent notice your platform and start using it. What a great surprise!

But wait! You soon realize their experience with your platform isn't optimal. They're so far away from your operational region that the latency is terrible, and you obviously don't want them to quit. If only there were an easy way of extending your multiple-region reach. There is!

Cosmos DB lets you replicate your data to multiple regions transparently with a couple of clicks, and automatically select among the available regions from your client code. This process also means that you can have multiple failover regions.

When you replicate your data to multiple regions, you need to make sure that your clients can take advantage of it. If you're using a web frontend or accessing APIs from mobile clients, you can deploy Azure Traffic Manager and clone your Azure App Service to all the desired regions, using a performance configuration to support your extended multiple-region coverage. When your clients access your frontend or APIs, they'll be routed to the closest App Service, which in turn will connect to the local Cosmos DB replica.

Adding multiple-region coverage to your social platform

Conclusion

This article sheds some light on the alternatives for creating social networks completely on Azure with low-cost services. It delivers results by encouraging the use of a multi-layered storage solution and a data distribution pattern called the "Ladder".

Diagram of the interaction between Azure services in a social network

The truth is that there's no silver bullet for this kind of scenario. It's the synergy created by the combination of great services that allows you to build great experiences: the speed and freedom of Azure Cosmos DB to provide a great social application; the intelligence behind a first-class search solution like Azure Cognitive Search; the flexibility of Azure App Service to host not only language-agnostic applications but also powerful background processes; the expandable Azure Storage and Azure SQL Database for storing massive amounts of data; and the analytic power of Azure Machine Learning to create knowledge and intelligence that can provide feedback to your processes and help you deliver the right content to the right users.

Next steps

To learn more about use cases for Cosmos DB, see Common Cosmos DB use cases.