编辑
2023-06-06
👨‍🎓 无限进步
00
请注意,本文编写于 744 天前,最后修改于 225 天前,其中某些信息可能已经过时。

目录

CHAPTER 11: DESIGN A NEWS FEED SYSTEM
Step 1 - Understand the problem and establish design scope
Step 2 - Propose high-level design and get buy-in
Newsfeed APIs
Feed publishing API 源发布接口
Newsfeed retrieval API 新闻源检索 API
Feed publishing
Newsfeed building
Step 3 - Design deep dive
Feed publishing deep dive
Web servers
Fanout service
1
2
Newsfeed retrieval deep dive 新闻源检索深入探讨
Cache architecture
Step 4 - Wrap up
Reference materials

CHAPTER 11: DESIGN A NEWS FEED SYSTEM

In this chapter, you are asked to design a news feed system. What is news feed? According to the Facebook help page, “News feed is the constantly updating list of stories in the middle of your home page. News Feed includes status updates, photos, videos, links, app activity, and likes from people, pages, and groups that you follow on Facebook” [1]. This is a popular interview question. Similar questions commonly asked are: design Facebook news feed, Instagram feed, Twitter timeline, etc. 在本章中,要求您设计一个新闻源系统。什么是动态消息?根据 Facebook帮助页面,“新闻提要是中间不断更新的故事列表 您的主页。动态消息包括状态更新、照片、视频、链接、应用活动和 来自您在Facebook上关注的人,页面和小组的喜欢“ [1]。这是一个流行的 面试问题。常见的类似问题是:设计Facebook新闻提要, Instagram feed、Twitter timeline 等。

image.png

Step 1 - Understand the problem and establish design scope

The first set of clarification questions are to understand what the interviewer has in mind when she asks you to design a news feed system. At the very least, you should figure out what features to support. Here is an example of candidate-interviewer interaction: 第一组澄清问题是了解面试官的想法 当她要求你设计一个新闻提要系统时。至少,你应该弄清楚 要支持哪些功能。以下是候选人与面试官互动的示例:
Candidate: Is this a mobile app? Or a web app? Or both? Interviewer: Both Candidate: What are the important features? Interview: A user can publish a post and see her friends’ posts on the news feed page. Candidate: Is the news feed sorted by reverse chronological order or any particular order such as topic scores? For instance, posts from your close friends have higher scores. Interviewer: To keep things simple, let us assume the feed is sorted by reverse chronological order. Candidate: How many friends can a user have? Interviewer: 5000 Candidate: What is the traffic volume? Interviewer: 10 million DAU Candidate: Can feed contain images, videos, or just text? Interviewer: It can contain media files, including both images and videos. 应聘者:这是一个移动应用程序吗?还是网络应用程序?还是两者兼而有之? 采访者:两位 应聘者:有哪些重要特点? 采访:用户可以发布帖子并在动态消息页面上查看好友的帖子。 候选项:新闻提要是按时间倒序还是按任何特定顺序排序 比如主题分数?例如,来自您密友的帖子得分更高。 采访者:为了简单起见,我们假设提要是按时间倒序排序的 次序。 应聘者:一个用户可以有多少个朋友? 面试官:5000 应聘者:流量是多少? 面试官:1000万DAU 应聘者:源可以包含图像、视频或仅包含文本吗? 采访者:它可以包含媒体文件,包括图像和视频。

什么是DAU?

DAU是Daily Active User的缩写,用于衡量网站、互联网应用或网络游戏的日活跃用户数量。它是统计一天内登录或使用某个产品的用户数,去除重复登录的用户后得出的。
Now you have gathered the requirements, we focus on designing the system.

Step 2 - Propose high-level design and get buy-in

The design is divided into two flows: feed publishing and news feed building. • Feed publishing: when a user publishes a post, corresponding data is written into cache and database. A post is populated to her friends’ news feed. • Newsfeed building: for simplicity, let us assume the news feed is built by aggregating friends’ posts in reverse chronological order. 该设计分为两个流程:提要发布和新闻提要构建。 • 提要发布:当用户发布帖子时,相应的数据被写入缓存和数据库。帖子将填充到她朋友的新闻源中。 • 新闻源构建:为简单起见,我们假设新闻源是通过按时间倒序聚合朋友的帖子来构建的。

Newsfeed APIs

The news feed APIs are the primary ways for clients to communicate with servers. Those APIs are HTTP based that allow clients to perform actions, which include posting a status,retrieving news feed, adding friends, etc. We discuss two most important APIs: feed publishing API and news feed retrieval API. 新闻源 API 是客户端与服务器通信的主要方式。这些 API 基于 HTTP,允许客户端执行操作,包括发布状态、检索新闻提要、添加好友等。我们讨论了两个最重要的 API:提要发布 API 和新闻提要检索 API。

Feed publishing API 源发布接口

To publish a post, a HTTP POST request will be sent to the server. The API is shown below: POST /v1/me/feed Params: • content: content is the text of the post. • auth_token: it is used to authenticate API requests. 要发布帖子,将向服务器发送 HTTP POST 请求。该接口如下所示: post /v1/me/feed 参数: • content:内容是帖子的文本。 • auth_token:用于对 API 请求进行身份验证。

Newsfeed retrieval API 新闻源检索 API

The API to retrieve news feed is shown below: GET /v1/me/feed Params: • auth_token: it is used to authenticate API requests. 用于检索新闻源的 API 如下所示: get /v1/me/feed 参数: • auth_token:用于对 API 请求进行身份验证。

Feed publishing

Figure 11-2 shows the high-level design of the feed publishing flow.

image.png

• User: a user can view news feeds on a browser or mobile app. A user makes a post with content “Hello” through API: /v1/me/feed?content=Hello&auth_token={auth_token} • Load balancer: distribute traffic to web servers. • Web servers: web servers redirect traffic to different internal services. • Post service: persist post in the database and cache. • Fanout service: push new content to friends’ news feed. Newsfeed data is stored in the cache for fast retrieval. • Notification service: inform friends that new content is available and send out push notifications. • 用户:用户可以在浏览器或移动应用程序上查看新闻提要。用户发布帖子通过 API 的内容“你好”: /v1/me/feed?content=Hello&auth_token={auth_token} • 负载均衡器:将流量分配到 Web 服务器。 • Web 服务器:Web 服务器将流量重定向到不同的内部服务。 • 发布服务:将 post 保留在数据库和缓存中。 • 扇出服务:将新内容推送到好友的新闻提要。新闻源数据存储在用于快速检索的缓存。 • 通知服务:通知好友有新内容可用并发送推送通知。

Newsfeed building

In this section, we discuss how news feed is built behind the scenes. Figure 11-3 shows the high-level design: 在本节中,我们将讨论如何在幕后构建新闻源。图 11-3 显示了 高级设计:

image.png

User: a user sends a request to retrieve her news feed. The request looks like this: / v1/me/feed. • Load balancer: load balancer redirects traffic to web servers. • Web servers: web servers route requests to newsfeed service. • Newsfeed service: news feed service fetches news feed from the cache. • Newsfeed cache: store news feed IDs needed to render the news feed.

Step 3 - Design deep dive

The high-level design briefly covered two flows: feed publishing and news feed building. Here, we discuss those topics in more depth. 高级设计简要介绍了两个流程:提要发布和新闻提要构建。 在这里,我们将更深入地讨论这些主题。

Feed publishing deep dive

Figure 11-4 outlines the detailed design for feed publishing. We have discussed most of components in high-level design, and we will focus on two components: web servers and fanout service. 图 11-4 概述了 Feed 发布的详细设计。我们已经讨论了高级设计中的大多数组件,我们将重点介绍两个组件:Web 服务器和扇出服务。

image.png

Web servers

Besides communicating with clients, web servers enforce authentication and rate-limiting. Only users signed in with valid auth_token are allowed to make posts. The system limits the number of posts a user can make within a certain period, vital to prevent spam and abusive content. 除了与客户端通信外,Web 服务器还强制执行身份验证和速率限制。 只有使用有效auth_token登录的用户才能发帖。系统限制 用户在一定时间内可以发布的帖子数量,这对于防止垃圾邮件和辱骂行为至关重要 内容。

Fanout service

Fanout is the process of delivering a post to all friends. Two types of fanout models are: fanout on write (also called push model) and fanout on read (also called pull model). Both models have pros and cons. We explain their workflows and explore the best approach to support our system. 扇出是向所有朋友发送帖子的过程。两种类型的扇出模型是:写入时扇出(也称为推送模型)和读取时扇出(也称为拉取模型)。两种模型都有优点和缺点。我们解释他们的工作流程,并探索支持我们系统的最佳方法。

1

Fanout on write. With this approach, news feed is pre-computed during write time. A new post is delivered to friends’ cache immediately after it is published. Pros: • The news feed is generated in real-time and can be pushed to friends immediately. • Fetching news feed is fast because the news feed is pre-computed during write time. Cons: • If a user has many friends, fetching the friend list and generating news feeds for all of them are slow and time consuming. It is called hotkey problem. • For inactive users or those rarely log in, pre-computing news feeds waste computing resources. 写入时扇出。使用此方法,新闻源是在写入时预先计算的。新帖子在发布后会立即发送到朋友的缓存中。 优点: • 动态消息实时生成,可即时推送给好友。 • 获取新闻提要的速度很快,因为新闻提要是在写入时预先计算的。 缺点: • 如果用户有很多朋友,获取朋友列表并为所有这些朋友生成新闻提要既缓慢又耗时。它被称为热键问题。 • 对于不活跃或很少登录的用户,预先计算新闻提要会浪费计算资源。

2

Fanout on read. The news feed is generated during read time. This is an on-demand model.Recent posts are pulled when a user loads her home page. Pros: • For inactive users or those who rarely log in, fanout on read works better because it will not waste computing resources on them. • Data is not pushed to friends so there is no hotkey problem. Cons: • Fetching the news feed is slow as the news feed is not pre-computed. 扇出读取。新闻源在阅读期间生成。这是一个按需模型。当用户加载其主页时,将提取最近的帖子。 优点: • 对于不活跃的用户或很少登录的用户,读取时扇出效果更好,因为它不会浪费计算资源。 •数据不会推送给朋友,因此没有热键问题。 缺点: • 获取新闻源的速度很慢,因为新闻源不是预先计算的。
We adopt a hybrid approach to get benefits of both approaches and avoid pitfalls in them. Since fetching the news feed fast is crucial, we use a push model for the majority of users. For celebrities or users who have many friends/followers, we let followers pull news content on-demand to avoid system overload. Consistent hashing is a useful technique to mitigate the hotkey problem as it helps to distribute requests/data more evenly. Let us take a close look at the fanout service as shown in Figure 11-5. 我们采用混合方法来获得这两种方法的好处并避免其中的陷阱。由于快速获取新闻提要至关重要,因此我们为大多数用户使用推送模型。 对于名人或拥有许多朋友/关注者的用户,我们允许关注者按需拉取新闻内容,以避免系统过载。一致散列是缓解热键问题的有用技术,因为它有助于更均匀地分配请求/数据。让我们仔细看看扇出服务,如图 11-5 所示。

image.png

The fanout service works as follows: 1. Fetch friend IDs from the graph database. Graph databases are suited for managing friend relationship and friend recommendations. Interested readers wishing to learn more about this concept should refer to the reference material [2]. 2. Get friends info from the user cache. The system then filters out friends based on user settings. For example, if you mute someone, her posts will not show up on your news feed even though you are still friends. Another reason why posts may not show is that a user could selectively share information with specific friends or hide it from other people. 3. Send friends list and new post ID to the message queue. 4. Fanout workers fetch data from the message queue and store news feed data in the news feed cache. You can think of the news feed cache as a <post_id, user_id> mapping table.Whenever a new post is made, it will be appended to the news feed table as shown in Figure 11-6. The memory consumption can become very large if we store the entire user and post objects in the cache. Thus, only IDs are stored. To keep the memory size small, we set a configurable limit. The chance of a user scrolling through thousands of posts in news feed is slim. Most users are only interested in the latest content, so the cache miss rate is low. 5. Store <post_id, user_id > in news feed cache. Figure 11-6 shows an example of what the news feed looks like in cache 扇出服务的工作方式如下: 1. 从图形数据库中获取好友 ID。图形数据库适用于管理好友关系和好友推荐。有兴趣的读者希望了解更多关于这个概念的信息,请参考参考资料[2]。 2. 从用户缓存中获取好友信息。然后,系统会根据用户设置过滤掉好友。例如,如果您将某人静音,即使你们仍然是好友,她的帖子也不会显示在您的动态消息中。帖子可能不会显示的另一个原因是用户可以有选择地与特定朋友共享信息或对其他人隐藏信息。 3. 将好友列表和新帖子 ID 发送到消息队列。 4. 扇出工作线程从消息队列中获取数据,并将新闻源数据存储在新闻源缓存中。您可以将新闻源缓存视为<post_id user_id>映射表。每当发布新帖子时,它都会附加到新闻源表中,如图 11-6 所示。如果我们将整个用户和发布对象存储在缓存中,内存消耗可能会变得非常大。因此,仅存储 ID。为了保持较小的内存大小,我们设置了一个可配置的限制。用户在新闻提要中滚动浏览数千个帖子的机会很小。大多数用户只对最新内容感兴趣,因此缓存未命中率较低。 5. 将<post_id、user_id >存储在新闻提要缓存中。图 11-6 显示了新闻源在缓存中的外观示例

image.png

Newsfeed retrieval deep dive 新闻源检索深入探讨

Figure 11-7 illustrates the detailed design for news feed retrieval.

image.png

As shown in Figure 11-7, media content (images, videos, etc.) are stored in CDN for fastretrieval. Let us look at how a client retrieves news feed. 1. A user sends a request to retrieve her news feed. The request looks like this: /v1/me/feed 2. The load balancer redistributes requests to web servers. 3. Web servers call the news feed service to fetch news feeds. 4. News feed service gets a list post IDs from the news feed cache. 5. A user’s news feed is more than just a list of feed IDs. It contains username, profile picture, post content, post image, etc. Thus, the news feed service fetches the complete user and post objects from caches (user cache and post cache) to construct the fully hydrated news feed. 6. The fully hydrated news feed is returned in JSON format back to the client for rendering. 如图 11-7 所示,媒体内容(图像、视频等)存储在 CDN 中以便快速检索。让我们看看客户端如何检索新闻提要。 1. 用户发送请求以检索其新闻提要。请求如下所示:/v1/me/feed 2. 负载均衡器将请求重新分发到 Web 服务器。 3. Web 服务器调用新闻源服务来获取新闻源。 4. 新闻源服务从新闻源缓存中获取列表帖子 ID。 5. 用户的动态消息不仅仅是一个动态 ID 列表。它包含用户名,个人资料图片,帖子内容,帖子图像等。因此,新闻源服务从缓存(用户缓存和发布缓存)中提取完整的用户和发布对象,以构造完全冻结的新闻源。 6. 完全水合的新闻提要以 JSON 格式返回给客户端渲染。

Cache architecture

Cache is extremely important for a news feed system. We divide the cache tier into 5 layers as shown in Figure 11-8. 缓存对于新闻源系统非常重要。我们将缓存层分为 5 层,如图 11-8 所示。

image.png

• News Feed: It stores IDs of news feeds. • Content: It stores every post data. Popular content is stored in hot cache. • Social Graph: It stores user relationship data. • Action: It stores info about whether a user liked a post, replied a post, or took other actions on a post. • Counters: It stores counters for like, reply, follower, following, etc. •新闻提要:它存储新闻提要的ID。 •内容:它存储每个帖子数据。热门内容存储在热缓存中。 •社交图谱:它存储用户关系数据。 •操作:它存储有关用户是否喜欢帖子,回复帖子或对帖子执行其他操作的信息。 •计数器:它存储喜欢,回复,关注者,关注等的计数器。

Step 4 - Wrap up

In this chapter, we designed a news feed system. Our design contains two flows: feed publishing and news feed retrieval. Like any system design interview questions, there is no perfect way to design a system. Every company has its unique constraints, and you must design a system to fit those constraints. Understanding the tradeoffs of your design and technology choices are important. If there are a few minutes left, you can talk about scalability issues. To avoid duplicated discussion, only high-level talking points are listed below. Scaling the database: • Vertical scaling vs Horizontal scaling • SQL vs NoSQL • Master-slave replication • Read replicas • Consistency models • Database sharding Other talking points: • Keep web tier stateless • Cache data as much as you can • Support multiple data centers • Lose couple components with message queues • Monitor key metrics. For instance, QPS during peak hours and latency while users refreshing their news feed are interesting to monitor. Congratulations on getting this far! Now give yourself a pat on the back. Good job! 在本章中,我们设计了一个新闻提要系统。我们的设计包含两个流程:源发布和新闻源检索。 像任何系统设计面试问题一样,没有完美的方法来设计系统。每个公司都有其独特的约束,您必须设计一个系统来适应这些约束。 了解设计和技术选择的权衡非常重要。如果还剩几分钟,您可以讨论可伸缩性问题。为避免重复讨论,下面仅列出高级别的谈话要点。 扩展数据库: • 垂直缩放与水平缩放 • SQL vs NoSQL • 主从复制 • 只读副本 • 一致性模型 • 数据库分片 其他谈话要点: • 保持 Web 层无状态 • 尽可能多地缓存数据 • 支持多个数据中心 • 使用消息队列丢失几个组件 • 监控关键指标。例如,高峰时段的QPS和用户延迟 刷新他们的新闻提要很有趣。 恭喜你走到了这一步!现在拍拍自己的背。干得好!

Reference materials

[1] How News Feed Works: https://www.facebook.com/help/327131014036297/ [2] Friend of Friend recommendations Neo4j and SQL Sever: http://geekswithblogs.net/brendonpage/archive/2015/10/26/friend-of-friendrecommendations-with-neo4j.aspx

`

本文作者:Eric

本文链接:

版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!