When searching on Google or shopping at Amazon, as you type in the search box, one or more matches for the search term are presented to you. This feature is referred to as autocomplete, typeahead, search-as-you-type, or incremental search. Figure 13-1 presents an example of a Google search showing a list of autocompleted results when “dinner” is typed into the search box. Search autocomplete is an important feature of many products. This leads us to the interview question: design a search autocomplete system, also called “design top k” or “design top k most searched queries”.
The first step to tackle any system design interview question is to ask enough questions to clarify requirements. Here is an example of candidate-interviewer interaction:

Candidate: Is the matching only supported at the beginning of a search query or in the middle as well?
Interviewer: Only at the beginning of a search query.
Candidate: How many autocomplete suggestions should the system return?
Interviewer: 5
Candidate: How does the system know which 5 suggestions to return?
Interviewer: This is determined by popularity, decided by the historical query frequency.
Candidate: Does the system support spell check?
Interviewer: No, spell check or autocorrect is not supported.
Candidate: Are search queries in English?
Interviewer: Yes. If time allows at the end, we can discuss multi-language support.
Candidate: Do we allow capitalization and special characters?
Interviewer: No, we assume all search queries have lowercase alphabetic characters.
Candidate: How many users use the product?
Interviewer: 10 million DAU.
Here is a summary of the requirements:
• Fast response time: As a user types a search query, autocomplete suggestions must show up fast enough. An article about Facebook’s autocomplete system [1] reveals that the system needs to return results within 100 milliseconds. Otherwise it will cause stuttering.
• Relevant: Autocomplete suggestions should be relevant to the search term.
• Sorted: Results returned by the system must be sorted by popularity or other ranking models.
• Scalable: The system can handle high traffic volume.
• Highly available: The system should remain available and accessible when part of the system is offline, slows down, or experiences unexpected network errors.
• Assume 10 million daily active users (DAU).
• An average person performs 10 searches per day.
• 20 bytes of data per query string:
  • Assume we use ASCII character encoding. 1 character = 1 byte.
  • Assume a query contains 4 words, and each word contains 5 characters on average.
  • That is 4 x 5 = 20 bytes per query.
• For every character entered into the search box, a client sends a request to the backend for autocomplete suggestions. On average, 20 requests are sent for each search query. For example, the following 6 requests are sent to the backend by the time you finish typing “dinner”.
search?q=d
search?q=di
search?q=din
search?q=dinn
search?q=dinne
search?q=dinner
• ~24,000 queries per second (QPS) = 10,000,000 users * 10 queries / day * 20 characters / 24 hours / 3600 seconds.
• Peak QPS = QPS * 2 = ~48,000.
• Assume 20% of the daily queries are new. 10 million * 10 queries / day * 20 bytes per query * 20% = 0.4 GB. This means 0.4 GB of new data is added to storage daily (a quick sanity check of these numbers follows).
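As a quick sanity check, the estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; every input is one of the assumptions already listed.

```python
# Back-of-the-envelope estimation; all inputs are the assumptions listed above.
dau = 10_000_000            # daily active users
searches_per_user = 10      # searches per user per day
chars_per_query = 20        # ~20 characters typed per query, one request per character
bytes_per_query = 20        # ASCII, 4 words x 5 characters
new_query_ratio = 0.2       # assume 20% of daily queries are new

requests_per_day = dau * searches_per_user * chars_per_query
qps = requests_per_day / (24 * 3600)      # ~23,000, rounded up to ~24,000
peak_qps = 2 * qps                        # ~48,000
new_data_per_day_gb = dau * searches_per_user * bytes_per_query * new_query_ratio / 1e9

print(f"QPS: ~{qps:,.0f}, peak QPS: ~{peak_qps:,.0f}, new data/day: {new_data_per_day_gb:.1f} GB")
```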
At a high level, the system is broken down into two components:
• Data gathering service: It gathers user input queries and aggregates them in real time. Real-time processing is not practical for large data sets; however, it is a good starting point. We will explore a more realistic solution in the deep dive.
• Query service: Given a search query or prefix, return the 5 most frequently searched terms.
Let us use a simplified example to see how the data gathering service works. Assume we have a frequency table that stores the query string and its frequency as shown in Figure 13-2. In the beginning, the frequency table is empty. Later, users enter the queries “twitch”, “twitter”, “twitter”, and “twillo” sequentially. Figure 13-2 shows how the frequency table is updated.
Assume we have a frequency table as shown in Table 13-1. It has two fields.
• Query: it stores the query string.
• Frequency: it represents the number of times a query has been searched.
When a user types “tw” in the search box, the following top 5 searched queries are displayed (Figure 13-3), assuming the frequency table is based on Table 13-1.
To get the top 5 frequently searched queries, execute the following SQL query:
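The query itself is not reproduced in this excerpt. Assuming the frequency table is a relational table named frequency_table with query and frequency columns, as in Table 13-1, it would look something like the sketch below (the table and column names are assumptions):

```sql
-- Sketch: return the 5 most frequently searched queries matching the prefix "tw".
SELECT query, frequency
FROM frequency_table
WHERE query LIKE 'tw%'
ORDER BY frequency DESC
LIMIT 5;
```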
This is an acceptable solution when the data set is small. When it is large, accessing the database becomes a bottleneck. We will explore optimizations in the deep dive.
In the high-level design, we discussed the data gathering service and the query service. The high-level design is not optimal, but it serves as a good starting point. In this section, we will dive deep into a few components and explore optimizations as follows:
• Trie data structure
• Data gathering service
• Query service
• Scale the storage
• Trie operations
Relational databases are used for storage in the high-level design. However, fetching the top 5 search queries from a relational database is inefficient. The data structure trie (prefix tree) is used to overcome the problem. As the trie data structure is crucial for the system, we will dedicate significant time to designing a customized trie. Please note that some of the ideas are from articles [2] and [3].

Understanding the basic trie data structure is essential for this interview question. However, this is more of a data structure question than a system design question. Besides, many online materials explain this concept. In this chapter, we will only discuss an overview of the trie data structure and focus on how to optimize the basic trie to improve response time.
Trie (pronounced “try”) is a tree-like data structure that can compactly store strings. The name comes from the word retrieval, which indicates it is designed for string retrieval operations. The main idea of a trie consists of the following (a minimal code sketch follows the list):
• A trie is a tree-like data structure.
• The root represents an empty string.
• Each node stores a character and has 26 children, one for each possible character. To save space, we do not draw empty links.
• Each tree node represents a single word or a prefix string.
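For reference, a minimal trie along these lines might look like the following Python sketch (the names and structure are illustrative, not from the original design):

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode; missing keys are the "empty links"
        self.is_query = False  # True if the path from the root forms a complete search query

class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root represents the empty string

    def insert(self, query: str) -> None:
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.is_query = True

# Example: build the trie shown in Figure 13-5.
trie = Trie()
for q in ["tree", "try", "true", "toy", "wish", "win"]:
    trie.insert(q)
```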
Figure 13-5 shows a trie with search queries “tree”, “try”, “true”, “toy”, “wish”, “win”. Search queries are highlighted with a thicker border. 图 13-5 显示了搜索查询“tree”, “try”, “true”, “toy”, “wish”, “win”的尝试。搜索查询以较粗的边框突出显示。
The basic trie data structure stores characters in nodes. To support sorting by frequency, frequency info needs to be included in nodes. Assume we have the following frequency table.
After adding frequency info to nodes, the updated trie data structure is shown in Figure 13-6.
How does autocomplete work with a trie? Before diving into the algorithm, let us define some terms.
• p: length of a prefix
• n: total number of nodes in a trie
• c: number of children of a given node
Steps to get the top k most searched queries are listed below (a code sketch follows the list):
1. Find the prefix. Time complexity: O(p).
2. Traverse the subtree from the prefix node to get all valid children. A child is valid if it can form a valid query string. Time complexity: O(c).
3. Sort the children and get the top k. Time complexity: O(c log c).
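The sketch below walks through the three steps naively in Python. It extends the earlier trie example with a frequency field on complete-query nodes and is illustrative only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.frequency = 0   # > 0 only for nodes that terminate a real search query

def top_k(root: TrieNode, prefix: str, k: int) -> list[tuple[str, int]]:
    # Step 1: find the prefix node. O(p)
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]

    # Step 2: traverse the subtree and collect all valid queries. O(c)
    candidates: list[tuple[str, int]] = []

    def dfs(current: TrieNode, path: str) -> None:
        if current.frequency > 0:
            candidates.append((path, current.frequency))
        for ch, child in current.children.items():
            dfs(child, path + ch)

    dfs(node, prefix)

    # Step 3: sort by frequency and keep the top k. O(c log c)
    return sorted(candidates, key=lambda item: item[1], reverse=True)[:k]
```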
Let us use an example as shown in Figure 13-7 to explain the algorithm. Assume k equals 2 and a user types “tr” in the search box. The algorithm works as follows:
• Step 1: Find the prefix node “tr”.
• Step 2: Traverse the subtree to get all valid children. In this case, nodes [tree: 10], [true: 35], [try: 29] are valid.
• Step 3: Sort the children and get the top 2. [true: 35] and [try: 29] are the top 2 queries with the prefix “tr”.
The time complexity of this algorithm is the sum of the time spent on each step mentioned above: O(p) + O(c) + O(c log c).

The above algorithm is straightforward. However, it is too slow because we need to traverse the entire trie to get the top k results in the worst-case scenario. Below are two optimizations:
1. Limit the max length of a prefix
2. Cache top search queries at each node
Let us look at these optimizations one by one.
Users rarely type a long search query into the search box. Thus, it is safe to say p is a small integer number, say 50. If we limit the length of a prefix, the time complexity for “Find the prefix” can be reduced from O(p) to O(small constant), aka O(1).
To avoid traversing the whole trie, we store the top k most frequently used queries at each node. Since 5 to 10 autocomplete suggestions are enough for users, k is a relatively small number. In our specific case, only the top 5 search queries are cached.

By caching top search queries at every node, we significantly reduce the time complexity to retrieve the top 5 queries. However, this design requires a lot of space to store top queries at every node. Trading space for time is well worth it as fast response time is very important.

Figure 13-8 shows the updated trie data structure. The top 5 queries are stored on each node. For example, the node with prefix “be” stores the following: [best: 35, bet: 29, bee: 20, be: 15, beer: 10].
Let us revisit the time complexity of the algorithm after applying those two optimizations:
1. Find the prefix node. Time complexity: O(1).
2. Return the top k. Since the top k queries are cached, the time complexity for this step is O(1).
As the time complexity for each of the steps is reduced to O(1), our algorithm takes only O(1) to fetch the top k queries.
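With both optimizations applied, the lookup reduces to a sketch like the one below: walk at most a bounded number of characters, then return the cached list. The field names continue the earlier illustrative sketch.

```python
MAX_PREFIX_LENGTH = 50  # optimization 1: limit the max length of a prefix

class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.top_queries = []   # optimization 2: cached top k (query, frequency) pairs

def lookup(root: TrieNode, prefix: str) -> list[tuple[str, int]]:
    # Step 1: find the prefix node. O(1) because the prefix length is bounded.
    node = root
    for ch in prefix[:MAX_PREFIX_LENGTH]:
        if ch not in node.children:
            return []
        node = node.children[ch]
    # Step 2: return the cached top k. O(1).
    return node.top_queries
```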
In our previous design, whenever a user types a search query, data is updated in real time. This approach is not practical for the following two reasons:
• Users may enter billions of queries per day. Updating the trie on every query significantly slows down the query service.
• Top suggestions may not change much once the trie is built. Thus, it is unnecessary to update the trie frequently.
To design a scalable data gathering service, we examine where data comes from and how data is used. Real-time applications like Twitter require up-to-date autocomplete suggestions. However, autocomplete suggestions for many Google keywords might not change much on a daily basis.

Despite the differences in use cases, the underlying foundation for the data gathering service remains the same because data used to build the trie is usually from analytics or logging services.

Figure 13-9 shows the redesigned data gathering service. Each component is examined one by one.
Analytics Logs. It stores raw data about search queries. Logs are append-only and are not indexed. Table 13-3 shows an example of the log file.
Aggregators. The size of analytics logs is usually very large, and data is not in the right format. We need to aggregate data so it can be easily processed by our system.

Depending on the use case, we may aggregate data differently. For real-time applications such as Twitter, we aggregate data in a shorter time interval as real-time results are important. On the other hand, aggregating data less frequently, say once per week, might be good enough for many use cases. During an interview session, verify whether real-time results are important. We assume the trie is rebuilt weekly.
Table 13-4 shows an example of aggregated weekly data. The “time” field represents the start time of a week. The “frequency” field is the sum of the occurrences for the corresponding query in that week.
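A minimal sketch of the weekly aggregation step, assuming each raw log entry is a (query, date) pair as in Table 13-3 and that weeks start on Monday:

```python
from collections import Counter
from datetime import date, timedelta

def aggregate_weekly(log_entries: list[tuple[str, date]]) -> Counter:
    """Sum occurrences per (query, week start) pair, as in Table 13-4."""
    counts: Counter = Counter()
    for query, day in log_entries:
        week_start = day - timedelta(days=day.weekday())  # Monday of that week
        counts[(query, week_start)] += 1
    return counts
```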
Workers. Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.

Trie Cache. Trie Cache is a distributed cache system that keeps the trie in memory for fast read. It takes a weekly snapshot of the DB.

Trie DB. Trie DB is the persistent storage. Two options are available to store the data:
1. Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
2. Key-value store: A trie can be represented in a hash table form [4] by applying the following logic:
  • Every prefix in the trie is mapped to a key in a hash table.
  • Data on each trie node is mapped to a value in a hash table.
Figure 13-10 shows the mapping between the trie and the hash table.
In Figure 13-10, each trie node on the left is mapped to the <key, value> pair on the right. If you are unclear how key-value stores work, refer to Chapter 6: Design a key-value store.
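A sketch of this mapping, flattening the trie into <prefix, serialized node data> pairs that can be written to any key-value store. It reuses the illustrative node layout from earlier; the kv_store client mentioned in the comment is a placeholder, not a real API.

```python
import json

def flatten_trie(node, prefix="", out=None):
    """Map every prefix in the trie to a key, and the node's data to a value."""
    if out is None:
        out = {}
    out[prefix] = json.dumps(node.top_queries)   # value: serialized cached top queries
    for ch, child in node.children.items():
        flatten_trie(child, prefix + ch, out)
    return out

# Usage (kv_store is a placeholder for the real key-value store client):
# for key, value in flatten_trie(trie.root).items():
#     kv_store.put(key, value)
```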
In the high-level design, the query service calls the database directly to fetch the top 5 results. Figure 13-11 shows the improved design, as the previous design is inefficient.
1. A search query is sent to the load balancer.
2. The load balancer routes the request to the API servers.
3. API servers get trie data from Trie Cache and construct autocomplete suggestions for the client.
4. In case the data is not in Trie Cache, we replenish data back to the cache. This way, all subsequent requests for the same prefix are returned from the cache. A cache miss can happen when a cache server is out of memory or offline (see the sketch below).
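In code, steps 3 and 4 follow the standard cache-aside pattern. The sketch below assumes generic cache and DB clients; they are placeholders, not part of the original design.

```python
def get_suggestions(prefix: str, trie_cache, trie_db) -> list:
    # Step 3: serve from the distributed Trie Cache when possible.
    suggestions = trie_cache.get(prefix)
    if suggestions is not None:
        return suggestions

    # Step 4: on a cache miss, read from Trie DB and replenish the cache,
    # so subsequent requests for the same prefix are served from the cache.
    suggestions = trie_db.get_top_queries(prefix)
    trie_cache.set(prefix, suggestions)
    return suggestions
```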
The query service requires lightning-fast speed. We propose the following optimizations:
• AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
• Browser caching. For many applications, autocomplete search suggestions may not change much within a short time. Thus, autocomplete suggestions can be saved in the browser cache to allow subsequent requests to get results from the cache directly. The Google search engine uses the same cache mechanism. Figure 13-12 shows the response header when you type “system design interview” on the Google search engine. As you can see, Google caches the results in the browser for 1 hour. Please note: “private” in cache-control means results are intended for a single user and must not be cached by a shared cache. “max-age=3600” means the cache is valid for 3600 seconds, aka, an hour.
• Data sampling: For a large-scale system, logging every search query requires a lot of processing power and storage. Data sampling is important. For instance, only 1 out of every N requests is logged by the system.
Trie is a core component of the autocomplete system. Let us look at how trie operations (create, update, and delete) work.
Trie is created by workers using aggregated data. The source of data is from Analytics Log/DB.
There are two ways to update the trie.

Option 1: Update the trie weekly. Once a new trie is created, the new trie replaces the old one.

Option 2: Update individual trie nodes directly. We try to avoid this operation because it is slow. However, if the size of the trie is small, it is an acceptable solution. When we update a trie node, its ancestors all the way up to the root must be updated because ancestors store the top queries of their children. Figure 13-13 shows an example of how the update operation works. On the left side, the search query “beer” has the original value 10. On the right side, it is updated to 30. As you can see, the node and its ancestors have the “beer” value updated to 30.
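A sketch of option 2, continuing the earlier illustrative node layout: updating a query's frequency also refreshes the cached top queries on its node and on every ancestor up to the root. It assumes the new frequency keeps (or promotes) the query within the top k, as in the Figure 13-13 example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.frequency = 0
        self.top_queries = []   # cached top k (query, frequency) pairs

def update_frequency(root: TrieNode, query: str, new_frequency: int, k: int = 5) -> None:
    # Walk down to the query's node, remembering the path (its ancestors).
    path = [root]
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
        path.append(node)
    node.frequency = new_frequency

    # Refresh the cached top k on the node and on every ancestor up to the root.
    for ancestor in path:
        cached = [item for item in ancestor.top_queries if item[0] != query]
        cached.append((query, new_frequency))
        cached.sort(key=lambda item: item[1], reverse=True)
        ancestor.top_queries = cached[:k]
```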
We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions. We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions. Having a filter layer gives us the flexibility of removing results based on different filter rules. Unwanted suggestions are removed physically from the database asynchronously so the correct data set will be used to build the trie in the next update cycle.
Now that we have developed a system to bring autocomplete queries to users, it is time to solve the scalability issue when the trie grows too large to fit on one server.

Since English is the only supported language, a naive way to shard is based on the first character. Here are some examples.
• If we need two servers for storage, we can store queries starting with ‘a’ to ‘m’ on the first server, and ‘n’ to ‘z’ on the second server.
• If we need three servers, we can split queries into ‘a’ to ‘i’, ‘j’ to ‘r’, and ‘s’ to ‘z’.
Following this logic, we can split queries across up to 26 servers because there are 26 alphabetic characters in English. Let us define sharding based on the first character as first-level sharding. To store data beyond 26 servers, we can shard on the second or even the third level. For example, data queries that start with ‘a’ can be split into 4 servers: ‘aa-ag’, ‘ah-an’, ‘ao-au’, and ‘av-az’.
At first glance this approach seems reasonable, until you realize that there are a lot more words that start with the letter ‘c’ than ‘x’. This creates uneven distribution.

To mitigate the data imbalance problem, we analyze the historical data distribution pattern and apply smarter sharding logic as shown in Figure 13-15. The shard map manager maintains a lookup database for identifying where rows should be stored. For example, if there is a similar number of historical queries for ‘s’ and for ‘u’, ‘v’, ‘w’, ‘x’, ‘y’ and ‘z’ combined, we can maintain two shards: one for ‘s’ and one for ‘u’ to ‘z’.
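A sketch of the shard map manager's lookup logic. The ranges below are purely illustrative; in practice they would be derived from the historical query distribution and kept in the lookup database.

```python
# Illustrative shard map: (first-character range) -> shard name.
SHARD_MAP = [
    (("a", "r"), "shard-1"),
    (("s", "s"), "shard-2"),   # 's' is busy enough to get its own shard
    (("t", "t"), "shard-3"),
    (("u", "z"), "shard-4"),   # 'u' to 'z' combined are roughly as busy as 's'
]

def find_shard(query: str) -> str:
    first = query[0]
    for (low, high), shard in SHARD_MAP:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard configured for prefix {first!r}")
```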
After you finish the deep dive, your interviewer might ask you some follow-up questions.

Interviewer: How do you extend your design to support multiple languages?
To support other non-English queries, we store Unicode characters in trie nodes. If you are not familiar with Unicode, here is the definition: “an encoding standard covers all the characters for all the writing systems of the world, modern and ancient” [5].

Interviewer: What if top search queries in one country are different from others?
In this case, we might build different tries for different countries. To improve the response time, we can store tries in CDNs.

Interviewer: How can we support the trending (real-time) search queries?
Assuming a news event breaks out, a search query suddenly becomes popular. Our original design will not work because:
• Offline workers are not scheduled to update the trie yet because this is scheduled to run on a weekly basis.
• Even if it is scheduled, it takes too long to build the trie.
Building a real-time search autocomplete is complicated and is beyond the scope of this book, so we will only give a few ideas:
• Reduce the working data set by sharding.
• Change the ranking model and assign more weight to recent search queries.
• Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously. Stream processing requires a different set of systems: Apache Hadoop MapReduce [6], Apache Spark Streaming [7], Apache Storm [8], Apache Kafka [9], etc. Because all those topics require specific domain knowledge, we are not going into detail here.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!
[1] The Life of a Typeahead Query: https://www.facebook.com/notes/facebook-engineering/the-life-of-a-typeahead-query/389105248919/
[2] How We Built Prefixy: A Scalable Prefix Search Service for Powering Autocomplete: https://medium.com/@prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1
[3] Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables: https://people.eecs.berkeley.edu/~sylvia/papers/pht.pdf
[4] MongoDB wikipedia: https://en.wikipedia.org/wiki/MongoDB
[5] Unicode frequently asked questions: https://www.unicode.org/faq/basic_q.html
[6] Apache Hadoop: https://hadoop.apache.org/
[7] Apache Spark Streaming: https://spark.apache.org/streaming/
[8] Apache Storm: https://storm.apache.org/
[9] Apache Kafka: https://kafka.apache.org/documentation/