Designing a system that supports millions of users is challenging, and it is a journey that requires continuous refinement and endless improvement. In this chapter, we build a system that supports a single user and gradually scale it up to serve millions of users. After reading this chapter, you will master a handful of techniques that will help you crack system design interview questions.
A journey of a thousand miles begins with a single step, and building a complex system is no different. To start with something simple, everything is running on a single server. Figure 1-1 shows the illustration of a single server setup where everything is running on one server: web app, database, cache, etc.
To understand this setup, it is helpful to investigate the request flow and traffic source. Let us first look at the request flow (Figure 1-2).
1. Users access websites through domain names, such as api.mysite.com. Usually, the Domain Name System (DNS) is a paid service provided by third parties and not hosted by our servers.
2. An Internet Protocol (IP) address is returned to the browser or mobile app. In this example, IP address 15.125.23.214 is returned.
3. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) [1] requests are sent directly to your web server.
4. The web server returns HTML pages or a JSON response for rendering.
Next, let us examine the traffic source. The traffic to your web server comes from two sources: web application and mobile application.
• Web application: it uses a combination of server-side languages (Java, Python, etc.) to handle business logic, storage, etc., and client-side languages (HTML and JavaScript) for presentation.
• Mobile application: HTTP is the communication protocol between the mobile app and the web server. JavaScript Object Notation (JSON) is a commonly used API response format to transfer data due to its simplicity. An example of the API response in JSON format is shown below:
GET /users/12 – Retrieve user object for id = 12
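A response body for this call might look like the following sketch (the field names and values are illustrative):

```python
import json

# Hypothetical JSON response body for GET /users/12 (fields are illustrative).
response = {
    "id": 12,
    "firstName": "John",
    "lastName": "Smith",
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": 10021
    },
    "phoneNumbers": ["212 555-1234", "646 555-4567"]
}

print(json.dumps(response, indent=2))
```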
With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database (Figure 1-3). Separating web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.
You can choose between a traditional relational database and a non-relational database. Let us examine their differences. Relational databases are also called a relational database management system (RDBMS) or SQL database. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases represent and store data in tables and rows. You can perform join operations using SQL across different database tables. Non-relational databases are also called NoSQL databases. Popular ones are CouchDB, Neo4j, Cassandra, HBase, Amazon DynamoDB, etc. [2]. These databases are grouped into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases. For most developers, relational databases are the best option because they have been around for over 40 years and, historically, they have worked well. However, if relational databases are not suitable for your specific use cases, it is critical to explore beyond relational databases. Non-relational databases might be the right choice if:
• Your application requires super-low latency.
• Your data are unstructured, or you do not have any relational data.
• You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
• You need to store a massive amount of data.
Vertical scaling, referred to as “scale up”, means the process of adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as “scale-out”, allows you to scale by adding more servers into your pool of resources. When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling is its main advantage. Unfortunately, it comes with serious limitations.
• Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
• Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.
Horizontal scaling is more desirable for large scale applications due to the limitations of vertical scaling. In the previous design, users are connected to the web server directly. Users will be unable to access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches the web server’s load limit, users generally experience slower response or fail to connect to the server. A load balancer is the best technique to address these problems.
A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set. Figure 1-4 shows how a load balancer works.
As shown in Figure 1-4, users connect to the public IP of the load balancer directly. With this setup, web servers can no longer be reached directly by clients. For better security, private IPs are used for communication between servers. A private IP is an IP address reachable only between servers in the same network; however, it is unreachable over the internet. The load balancer communicates with web servers through private IPs. In Figure 1-4, after a load balancer and a second web server are added, we successfully solved the failover issue and improved the availability of the web tier. Details are explained below:
• If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We will also add a new healthy web server to the server pool to balance the load.
• If the website traffic grows rapidly, and two servers are not enough to handle the traffic, the load balancer can handle this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.
Now the web tier looks good, what about the data tier? The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address those problems. Let us take a look.
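The even distribution described above can be sketched as a simple round-robin dispatcher (the private IPs and the itertools-based rotation are illustrative; real load balancers support many more algorithms, such as least-connections):

```python
import itertools

# Private IPs of the web servers in the load-balanced set (illustrative values).
web_servers = ["10.0.0.1", "10.0.0.2"]

# Round-robin: each incoming request is handed to the next server in the cycle.
rotation = itertools.cycle(web_servers)

def route_request(request_id):
    """Pick the next web server for this request."""
    server = next(rotation)
    return f"request {request_id} -> {server}"
```

Adding a third server to `web_servers` is all it takes for it to start receiving traffic, which mirrors how the load balancer absorbs new servers in the pool.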
Quoted from Wikipedia: “Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves)” [3]. A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All the data-modifying commands like insert, delete, or update must be sent to the master database. Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases. Figure 1-5 shows a master database with multiple slave databases.
Advantages of database replication:
• Better performance: In the master-slave model, all writes and updates happen in master nodes, whereas read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
• Reliability: If one of your database servers is destroyed by a natural disaster, such as a typhoon or an earthquake, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
• High availability: By replicating data across different locations, your website remains in operation even if a database is offline, as you can access data stored in another database server.
In the previous section, we discussed how a load balancer helped to improve system availability. We ask the same question here: what if one of the databases goes offline? The architectural design discussed in Figure 1-5 can handle this case:
• If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. In case multiple slave databases are available, read operations are redirected to other healthy slave databases. A new database server will replace the old one.
• If the master database goes offline, a slave database will be promoted to be the new master. All the database operations will be temporarily executed on the new master database. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complicated as the data in a slave database might not be up to date. The missing data needs to be updated by running data recovery scripts. Although some other replication methods like multi-masters and circular replication could help, those setups are more complicated; their discussion is beyond the scope of this book. Interested readers should refer to the listed reference materials [4] [5].
Figure 1-6 shows the system design after adding the load balancer and database replication.
Let us take a look at the design:
• A user gets the IP address of the load balancer from DNS.
• A user connects to the load balancer with this IP address.
• The HTTP request is routed to either Server 1 or Server 2.
• A web server reads user data from a slave database.
• A web server routes any data-modifying operations to the master database. This includes write, update, and delete operations.
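The read/write routing in the last two steps can be sketched as follows (the database names and the prefix-based write detection are simplified assumptions; production systems typically split reads and writes at the ORM or driver level):

```python
import itertools

MASTER = "master-db"
# Reads are spread across slaves; simple round-robin here for illustration.
SLAVES = itertools.cycle(["slave-db-1", "slave-db-2"])

# Data-modifying statements that must go to the master.
WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE")

def pick_database(sql):
    """Route writes to the master database and reads to a slave database."""
    if sql.strip().upper().startswith(WRITE_PREFIXES):
        return MASTER
    return next(SLAVES)
```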
Now that you have a solid understanding of the web and data tiers, it is time to improve the load/response time. This can be done by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to the content delivery network (CDN).
A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly. As illustrated in Figure 1-6, every time a new web page loads, one or more database calls are executed to fetch data. The application performance is greatly affected by calling the database repeatedly. The cache can mitigate this problem.
The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, the ability to reduce database workloads, and the ability to scale the cache tier independently. Figure 1-7 shows a possible setup of a cache server:
After receiving a request, a web server first checks if the cache has the available response. If it has, it sends data back to the client. If not, it queries the database, stores the response in cache, and sends it back to the client. This caching strategy is called a read-through cache. Other caching strategies are available depending on the data type, size, and access patterns. A previous study explains how different caching strategies work [6].
Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows typical Memcached APIs:
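The read-through flow can be sketched with a Memcached-style set/get interface (the in-memory `CacheClient` and the `load_user_from_db` stand-in are illustrative; real code would use a client library such as pymemcache against an actual cache server):

```python
import time

class CacheClient:
    """Minimal stand-in for a Memcached-style client: get/set with a TTL."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() > expires_at:      # expired entry: drop it
            del self._store[key]
            return None
        return value

cache = CacheClient()

def load_user_from_db(user_id):
    # Stand-in for the expensive database query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Read-through: serve from cache if present, else query DB and cache it."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = load_user_from_db(user_id)
        cache.set(key, user, ttl_seconds=3600)
    return user
```

The first call for a user misses the cache and hits the database; subsequent calls within the TTL are served from memory.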
Here are a few considerations for using a cache system:
• Decide when to use cache. Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.
• Expiration policy. It is a good practice to implement an expiration policy. Once cached data is expired, it is removed from the cache. When there is no expiration policy, cached data will be stored in the memory permanently. It is advisable not to make the expiration date too short as this will cause the system to reload data from the database too frequently. Meanwhile, it is advisable not to make the expiration date too long as the data can become stale.
• Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and cache is challenging. For further details, refer to the paper titled “Scaling Memcache at Facebook” published by Facebook [7].
• Mitigating failures: A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working” [8]. As a result, multiple cache servers across different data centers are recommended to avoid SPOF. Another recommended approach is to overprovision the required memory by certain percentages. This provides a buffer as the memory usage increases.
• Eviction Policy: Once the cache is full, any requests to add items to the cache might cause existing items to be removed. This is called cache eviction. Least-recently-used (LRU) is the most popular cache eviction policy. Other eviction policies, such as the Least Frequently Used (LFU) or First in First Out (FIFO), can be adopted to satisfy different use cases.
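The LRU policy can be sketched with an OrderedDict (the capacity of 2 is illustrative; production caches like Memcached implement this internally):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used eviction: when full, drop the item touched longest ago."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)        # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used item
```

With a capacity of 2: after putting `a` and `b`, reading `a` makes `b` the least recently used, so putting `c` evicts `b` rather than `a`.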
A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc. Dynamic content caching is a relatively new concept and beyond the scope of this book. It enables the caching of HTML pages that are based on request path, query strings, cookies, and request headers. Refer to the article mentioned in reference material [9] for more about this. This book focuses on how to use CDN to cache static content. Here is how CDN works at a high level: when a user visits a website, a CDN server closest to the user will deliver static content. Intuitively, the further users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe. Figure 1-9 is a great example that shows how CDN improves load time.
Figure 1-10 demonstrates the CDN workflow.
1. User A tries to get image.png by using an image URL. The URL’s domain is provided by the CDN provider. The following two image URLs are samples used to demonstrate what image URLs look like on Amazon and Akamai CDNs:
• https://mysite.cloudfront.net/logo.jpg
• https://mysite.akamai.com/image-manager/img/logo.jpg
2. If the CDN server does not have image.png in the cache, the CDN server requests the file from the origin, which can be a web server or online storage like Amazon S3.
3. The origin returns image.png to the CDN server, which includes the optional HTTP header Time-to-Live (TTL) describing how long the image is cached.
4. The CDN caches the image and returns it to User A. The image remains cached in the CDN until the TTL expires.
5. User B sends a request to get the same image.
6. The image is returned from the cache as long as the TTL has not expired.
• Cost: CDNs are run by third-party providers, and you are charged for data transfers in and out of the CDN. Caching infrequently used assets provides no significant benefits, so you should consider moving them out of the CDN.
• Setting an appropriate cache expiry: For time-sensitive content, setting a cache expiry time is important. The cache expiry time should neither be too long nor too short. If it is too long, the content might no longer be fresh. If it is too short, it can cause repeat reloading of content from origin servers to the CDN.
• CDN fallback: You should consider how your website/application copes with CDN failure. If there is a temporary CDN outage, clients should be able to detect the problem and request resources from the origin.
• Invalidating files: You can remove a file from the CDN before it expires by performing one of the following operations:
• Invalidate the CDN object using APIs provided by CDN vendors.
• Use object versioning to serve a different version of the object. To version an object, you can add a parameter to the URL, such as a version number. For example, version number 2 is added to the query string: image.png?v=2.
Figure 1-11 shows the design after the CDN and cache are added.
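The object-versioning trick mentioned above can be sketched as a small URL helper (the `v` parameter follows the example in the text; the helper function itself is illustrative):

```python
def versioned_url(url, version):
    """Append a version number to an asset URL so the CDN fetches a new object."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}v={version}"
```

Because the CDN keys its cache on the full URL, bumping the version number effectively invalidates the old cached copy.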
1. Static assets (JS, CSS, images, etc.) are no longer served by web servers. They are fetched from the CDN for better performance.
2. The database load is lightened by caching data.
Now it is time to consider scaling the web tier horizontally. For this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or NoSQL. Each web server in the cluster can access state data from databases. This is called a stateless web tier.
A stateful server and a stateless server have some key differences. A stateful server remembers client data (state) from one request to the next. A stateless server keeps no state information. Figure 1-12 shows an example of a stateful architecture.
In Figure 1-12, user A’s session data and profile image are stored in Server 1. To authenticate User A, HTTP requests must be routed to Server 1. If a request is sent to other servers like Server 2, authentication would fail because Server 2 does not contain User A’s session data. Similarly, all HTTP requests from User B must be routed to Server 2; all requests from User C must be sent to Server 3. The issue is that every request from the same client must be routed to the same server. This can be done with sticky sessions in most load balancers [10]; however, this adds overhead. Adding or removing servers is much more difficult with this approach. It is also challenging to handle server failures.
Figure 1-13 shows the stateless architecture.
In this stateless architecture, HTTP requests from users can be sent to any web servers, which fetch state data from a shared data store. State data is stored in a shared data store and kept out of web servers. A stateless system is simpler, more robust, and scalable. Figure 1-14 shows the updated design with a stateless web tier.
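The shared-store lookup just described can be sketched as follows (the dict-backed `session_store` stands in for a real shared store such as Redis or a NoSQL database; server and session names are illustrative):

```python
# Shared session store: in production this would be Redis, Memcached, or a
# NoSQL database reachable by every web server; a dict stands in here.
session_store = {}

def handle_request(server_name, session_id):
    """Any web server can serve any user, because no state lives on the server."""
    session = session_store.get(session_id)
    if session is None:
        return f"{server_name}: authentication failed"
    return f"{server_name}: hello {session['user']}"

# User A's session is written once to the shared store...
session_store["sess-a"] = {"user": "A"}
```

Because the session lives in the shared store, `handle_request("server-1", "sess-a")` and `handle_request("server-2", "sess-a")` both succeed, which is exactly what makes adding or removing web servers safe.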
In Figure 1-14, we move the session data out of the web tier and store them in the persistent data store. The shared data store could be a relational database, Memcached/Redis, NoSQL, etc. The NoSQL data store is chosen as it is easy to scale. Autoscaling means adding or removing web servers automatically based on the traffic load. After the state data is removed out of web servers, auto-scaling of the web tier is easily achieved by adding or removing servers based on traffic load. Your website grows rapidly and attracts a significant number of users internationally. To improve availability and provide a better user experience across wider geographical areas, supporting multiple data centers is crucial.
Figure 1-15 shows an example setup with two data centers. In normal operation, users are geoDNS-routed, also known as geo-routed, to the closest data center, with a split traffic of x% in US-East and (100 – x)% in US-West. geoDNS is a DNS service that allows domain names to be resolved to IP addresses based on the location of a user.
In the event of any significant data center outage, we direct all traffic to a healthy data center. In Figure 1-16, data center 2 (US-West) is offline, and 100% of the traffic is routed to data center 1 (US-East).
Several technical challenges must be resolved to achieve a multi-data center setup:
• Traffic redirection: Effective tools are needed to direct traffic to the correct data center. GeoDNS can be used to direct traffic to the nearest data center depending on where a user is located.
• Data synchronization: Users from different regions could use different local databases or caches. In failover cases, traffic might be routed to a data center where data is unavailable. A common strategy is to replicate data across multiple data centers. A previous study shows how Netflix implements asynchronous multi-data center replication [11].
• Test and deployment: With a multi-data center setup, it is important to test your website/application at different locations. Automated deployment tools are vital to keep services consistent through all the data centers [11].
To further scale our system, we need to decouple different components of the system so they can be scaled independently. Messaging queue is a key strategy employed by many real-world distributed systems to solve this problem.
A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple. Input services, called producers/publishers, create messages, and publish them to a message queue. Other services or servers, called consumers/subscribers, connect to the queue, and perform actions defined by the messages. The model is shown in Figure 1-17.
Decoupling makes the message queue a preferred architecture for building a scalable and reliable application. With the message queue, the producer can post a message to the queue when the consumer is unavailable to process it. The consumer can read messages from the queue even when the producer is unavailable. Consider the following use case: your application supports photo customization, including cropping, sharpening, blurring, etc. Those customization tasks take time to complete. In Figure 1-18, web servers publish photo processing jobs to the message queue. Photo processing workers pick up jobs from the message queue and asynchronously perform photo customization tasks. The producer and the consumer can be scaled independently. When the size of the queue becomes large, more workers are added to reduce the processing time. However, if the queue is empty most of the time, the number of workers can be reduced.
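The photo-processing flow above can be sketched with Python's standard queue module (the job fields are illustrative; a production system would use a broker such as RabbitMQ or Kafka rather than an in-process queue):

```python
import queue

# The message queue sits between web servers (producers) and workers (consumers).
message_queue = queue.Queue()

def publish_photo_job(photo_id, task):
    """Web server: publish a photo-processing job and return immediately."""
    message_queue.put({"photo_id": photo_id, "task": task})

def process_next_job():
    """Worker: pick up one job from the queue and process it asynchronously."""
    job = message_queue.get()
    result = f"photo {job['photo_id']}: {job['task']} done"
    message_queue.task_done()
    return result

# Web servers enqueue work without waiting for it to finish...
publish_photo_job(1, "crop")
publish_photo_job(2, "blur")
```

Because the two sides only share the queue, the number of workers calling `process_next_job` can grow or shrink independently of the web servers, which is the decoupling the text describes.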
When working with a small website that runs on a few servers, logging, metrics, and automation support are good practices but not a necessity. However, now that your site has grown to serve a large business, investing in those tools is essential.
Logging: Monitoring error logs is important because it helps to identify errors and problems in the system. You can monitor error logs at the per-server level or use tools to aggregate them to a centralized service for easy search and viewing.
Metrics: Collecting different types of metrics helps us to gain business insights and understand the health status of the system. Some of the following metrics are useful:
• Host level metrics: CPU, memory, disk I/O, etc.
• Aggregated level metrics: for example, the performance of the entire database tier, cache tier, etc.
• Key business metrics: daily active users, retention, revenue, etc.
Automation: When a system gets big and complex, we need to build or leverage automation tools to improve productivity. Continuous integration is a good practice, in which each code check-in is verified through automation, allowing teams to detect problems early. Besides, automating your build, test, and deploy processes could improve developer productivity significantly.
Figure 1-19 shows the updated design. Due to the space constraint, only one data center is shown in the figure.
1. The design includes a message queue, which helps to make the system more loosely coupled and failure resilient.
2. Logging, monitoring, metrics, and automation tools are included.
As the data grows every day, your database gets more overloaded. It is time to scale the data tier.
There are two broad approaches for database scaling: vertical scaling and horizontal scaling.
Vertical scaling, also known as scaling up, means scaling by adding more power (CPU, RAM, disk, etc.) to an existing machine. There are some powerful database servers. According to Amazon Relational Database Service (RDS) [12], you can get a database server with 24 TB of RAM. This kind of powerful database server could store and handle lots of data. For example, stackoverflow.com in 2013 had over 10 million monthly unique visitors, but it only had 1 master database [13]. However, vertical scaling comes with some serious drawbacks:
• You can add more CPU, RAM, etc. to your database server, but there are hardware limits. If you have a large user base, a single server is not enough.
• Greater risk of a single point of failure.
• The overall cost of vertical scaling is high. Powerful servers are much more expensive.
Horizontal scaling, also known as sharding, is the practice of adding more servers. Figure 1-20 compares vertical scaling with horizontal scaling.
Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to the shard. Figure 1-21 shows an example of sharded databases. User data is allocated to a database server based on user IDs. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, shard 0 is used to store and fetch data. If the result equals 1, shard 1 is used. The same logic applies to other shards.
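The routing logic above can be sketched in a few lines. This is an illustration of the user_id % 4 hash function from Figure 1-21, not a production router; the function and constant names are made up for the example.

```python
NUM_SHARDS = 4  # shards 0-3, as in Figure 1-21

def shard_for(user_id: int) -> int:
    """Return the index of the shard that stores this user's data."""
    return user_id % NUM_SHARDS

# user_id 4 hashes to 4 % 4 = 0, so shard 0 stores and fetches its data;
# user_id 9 hashes to 9 % 4 = 1, so shard 1 is used.
```

Every read or write for a given user first calls this function, then sends the query to the database server that owns that shard.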
Figure 1-22 shows the user table in sharded databases.
The most important factor to consider when implementing a sharding strategy is the choice of the sharding key. A sharding key (also known as a partition key) consists of one or more columns that determine how data is distributed. As shown in Figure 1-22, “user_id” is the sharding key. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database. When choosing a sharding key, one of the most important criteria is to choose a key that can evenly distribute data.
Sharding is a great technique to scale the database, but it is far from a perfect solution. It introduces complexities and new challenges to the system:
Resharding data: Resharding is needed when 1) a single shard can no longer hold more data due to rapid growth, or 2) certain shards experience shard exhaustion faster than others due to uneven data distribution. When shard exhaustion happens, it requires updating the sharding function and moving data around. Consistent hashing, which will be discussed in Chapter 5, is a commonly used technique to solve this problem.
Celebrity problem: This is also called a hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. For social applications, that shard will be overwhelmed with read operations. To solve this problem, we may need to allocate a shard for each celebrity. Each shard might even require further partition.
Join and de-normalization: Once a database has been sharded across multiple servers, it is hard to perform join operations across database shards. A common workaround is to de-normalize the database so that queries can be performed in a single table.
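To see why resharding with a simple mod-based hash is costly, the sketch below (an illustration, not from the book) counts how many users would change shards when growing from 4 to 5 shards. Consistent hashing, discussed in Chapter 5, is designed to keep this movement small.

```python
def shard_for(user_id: int, num_shards: int) -> int:
    # Same mod-based hash function as in the earlier sharding example.
    return user_id % num_shards

# Count users whose shard assignment changes when we grow from 4 to 5 shards.
user_ids = range(10_000)
moved = sum(1 for uid in user_ids if shard_for(uid, 4) != shard_for(uid, 5))
print(f"{moved}/10000 users must move")  # 8000/10000: most of the data moves
```

Because uid % 4 and uid % 5 rarely agree, 80% of users land on a different shard after the change, which means moving most of the data around. This is exactly the shard-exhaustion pain that a better resharding strategy avoids.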
In Figure 1-23, we shard databases to support rapidly increasing data traffic. At the same time, some of the non-relational functionalities are moved to a NoSQL data store to reduce the database load. Here is an article that covers many use cases of NoSQL [14].
Scaling a system is an iterative process. Iterating on what we have learned in this chapter could get us far. More fine-tuning and new strategies are needed to scale beyond millions of users. For example, you might need to optimize your system and decouple it into even smaller services. All the techniques learned in this chapter should provide a good foundation to tackle new challenges. To conclude this chapter, we provide a summary of how we scale our system to support millions of users:
• Keep web tier stateless
• Build redundancy at every tier
• Cache data as much as you can
• Support multiple data centers
• Host static assets in CDN
• Scale your data tier by sharding
• Split tiers into individual services
• Monitor your system and use automation tools
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
[1] Hypertext Transfer Protocol: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[2] Should you go Beyond Relational Databases?: https://blog.teamtreehouse.com/should-you-go-beyond-relational-databases
[3] Replication: https://en.wikipedia.org/wiki/Replication_(computing)
[4] Multi-master replication: https://en.wikipedia.org/wiki/Multi-master_replication
[5] NDB Cluster Replication: Multi-Master and Circular Replication: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replication-multi-master.html
[6] Caching Strategies and How to Choose the Right One: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/
[7] R. Nishtala, "Facebook, Scaling Memcache at," 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13).
[8] Single point of failure: https://en.wikipedia.org/wiki/Single_point_of_failure
[9] Amazon CloudFront Dynamic Content Delivery: https://aws.amazon.com/cloudfront/dynamic-content/
[10] Configure Sticky Sessions for Your Classic Load Balancer: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html
[11] Active-Active for Multi-Regional Resiliency: https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b
[12] Amazon EC2 High Memory Instances: https://aws.amazon.com/ec2/instance-types/high-memory/
[13] What it takes to run Stack Overflow: http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow
[14] What The Heck Are You Actually Using NoSQL For: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosqlfor.html
Author: Eric
Copyright notice: Unless otherwise stated, all articles on this blog are released under the BY-NC-SA license. Please credit the source when reprinting!