In this chapter, we will tackle an interesting and classic system design interview question: designing a URL shortening service like tinyurl. 在本章中,我们将解决一个有趣而经典的系统设计面试问题: 设计一个像tinyurl这样的URL缩短服务。
System design interview questions are intentionally left open-ended. To design a well-crafted system, it is critical to ask clarification questions. 系统设计面试问题有意保持开放式。设计精心制作的 系统,提出澄清问题至关重要。
Candidate: Can you give an example of how a URL shortener work? Interviewer: Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL. Your service creates an alias with shorter length: https://tinyurl.com/ y7keocwj. If you click the alias, it redirects you to the original URL. Candidate: What is the traffic volume? Interviewer: 100 million URLs are generated per day. Candidate: How long is the shortened URL? Interviewer: As short as possible. Candidate: What characters are allowed in the shortened URL? Interviewer: Shortened URL can be a combination of numbers (0-9) and characters (a-z, AZ). Candidate: Can shortened URLs be deleted or updated? Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated. 应聘者:你能举个例子来说明URL缩短器是如何工作的吗? 面试官:假设网址 https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long 是原版 网址。您的服务会创建长度较短的别名:https://tinyurl.com/ y7keocwj。如果你 单击别名,它会将您重定向到原始 URL。 应聘者:流量是多少? 采访者:每天生成 1 亿个 URL。 应聘者:缩短的URL有多长? 主持人:尽量短。 应聘者:缩短的 URL 中允许使用哪些字符? 面试官:缩短的URL可以是数字(0-9)和字符(a-z,AZ)的组合。 应聘者:缩短的网址可以删除或更新吗? 采访者:为简单起见,我们假设缩短的URL无法删除或更新。
Here are the basic use cases: 1.URL shortening: given a long URL => return a much shorter URL 2.URL redirecting: given a shorter URL => redirect to the original URL 3.High availability, scalability, and fault tolerance considerations 以下是基本用例: 1.URL缩短:给定一个长URL =>返回一个更短的URL 2.URL重定向:给定较短的URL =>重定向到原始URL 3.高可用性、可扩展性和容错注意事项
• Write operation: 100 million URLs are generated per day. • Write operation per second: 100 million / 24 /3600 = 1160 • Read operation: Assuming ratio of read operation to write operation is 10:1, read operation per second: 1160 * 10 = 11,600 • Assuming the URL shortener service will run for 10 years, this means we must support 100 million * 365 * 10 = 365 billion records. • Assume average URL length is 100. • Storage requirement over 10 years: 365 billion * 100 bytes * 10 years = 365 TB It is important for you to walk through the assumptions and calculations with your interviewer so that both of you are on the same page. • 写入操作:每天生成 1 亿个 URL。 • 每秒写入操作数:1 亿 / 24 /3600 = 1160 • 读操作:假设读写操作比为10:1,读每秒操作数:1160 * 10 = 11,600 • 假设URL缩短器服务将运行10年,这意味着我们必须支持1 亿 * 365 * 10 = 3650 亿条记录。 • 假设平均 URL 长度为 100。 • 10 年以上的存储需求:3650 亿 * 100 字节 * 10 年 = 365 TB 对您来说,与您的假设和计算一起完成假设和计算非常重要 面试官,这样你们俩就在同一页面上。
In this section, we discuss the API endpoints, URL redirecting, and URL shortening flows. 在本节中,我们将讨论 API 端点、URL 重定向和 URL 缩短流程。
API endpoints facilitate the communication between clients and servers. We will design the APIs REST-style. If you are unfamiliar with restful API, you can consult external materials, such as the one in the reference material [1]. A URL shortener primary needs two API endpoints. 1.URL shortening. To create a new short URL, a client sends a POST request, which contains one parameter: the original long URL. The API looks like this: POST api/v1/data/shorten • request parameter: {longUrl: longURLString} • return shortURL 2.URL redirecting. To redirect a short URL to the corresponding long URL, a client sends a GET request. The API looks like this: GET api/v1/shortUrl • Return longURL for HTTP redirection API 端点有助于客户端和服务器之间的通信。我们将设计 API REST 样式。如果你不熟悉 restful API,可以查阅外部资料, 如参考资料[1]中的那个。一个 URL 缩短器主要需要两个 API端点。 1.网址缩短。要创建新的短 URL,客户端会发送一个 POST 请求,其中包含一个参数:原始长 URL。该 API 如下所示: POST api/v1/data/short • 请求参数:{longUrl: longURLString} • 返回短网址 2.网址重定向。要将短 URL 重定向到相应的长 URL,客户端会发送获取请求。该 API 如下所示: GET api/v1/shortUrl • 返回用于 HTTP 重定向的长网址
Figure 8-1 shows what happens when you enter a tinyurl onto the browser. Once the server receives a tinyurl request, it changes the short URL to the long URL with 301 redirect. 图 8-1 显示了在浏览器上输入 tinyurl 时会发生什么。一旦服务器 收到一个 tinyurl 请求,它会将短 URL 更改为带有 301 重定向的长 URL。
The detailed communication between clients and servers is shown in Figure 8-2.
One thing worth discussing here is 301 redirect vs 302 redirect. 301 redirect. A 301 redirect shows that the requested URL is “permanently” moved to the long URL. Since it is permanently redirected, the browser caches the response, and subsequent requests for the same URL will not be sent to the URL shortening service. Instead, requests are redirected to the long URL server directly. 302 redirect. A 302 redirect means that the URL is “temporarily” moved to the long URL, meaning that subsequent requests for the same URL will be sent to the URL shortening service first. Then, they are redirected to the long URL server. Each redirection method has its pros and cons. If the priority is to reduce the server load, using 301 redirect makes sense as only the first request of the same URL is sent to URL shortening servers. However, if analytics is important, 302 redirect is a better choice as it can track click rate and source of the click more easily. 这里值得讨论的一件事是 301 重定向与 302 重定向。 301重定向。301 重定向显示请求的 URL 已“永久”移动到长网址。由于它是永久重定向的,因此浏览器会缓存响应,并且对同一 URL 的后续请求将不会发送到 URL 缩短服务。相反,请求将直接重定向到长 URL 服务器。 302重定向。302 重定向意味着 URL “暂时”移动到长 URL,这意味着对同一 URL 的后续请求将被发送到 URL 缩短服务至上。然后,它们被重定向到长 URL 服务器。 每种重定向方法都有其优点和缺点。如果优先级是减少服务器负载,使用 301 重定向是有意义的,因为只有相同 URL 的第一个请求才会发送到 URL缩短服务器。但是,如果分析很重要,302重定向是更好的选择,因为它可以 更轻松地跟踪点击率和点击来源。
The most intuitive way to implement URL redirecting is to use hash tables. Assuming the hash table stores <shortURL, longURL> pairs, URL redirecting can be implemented by the following: • Get longURL: longURL = hashTable.get(shortURL) • Once you get the longURL, perform the URL redirect. 实现 URL 重定向的最直观方法是使用哈希表。假设 哈希表存储<短URL,长URL>对,URL重定向可以通过 以后: • Get longURL: longURL = hashTable.get(shortURL) • 获取长网址后,执行网址重定向。
Let us assume the short URL looks like this: www.tinyurl.com/{hashValue}. To support the URL shortening use case, we must find a hash function fx that maps a long URL to the hashValue, as shown in Figure 8-3. 让我们假设短URL看起来像这样:www.tinyurl.com/{hashValue}。为了支持 URL 缩短用例,我们必须找到一个哈希函数 fx 将长 URL 映射到 哈希值,如图 8-3 所示。
The hash function must satisfy the following requirements: • Each longURL must be hashed to one hashValue. • Each hashValue can be mapped back to the longURL. Detailed design for the hash function is discussed in deep dive. 哈希函数必须满足以下要求: • 每个长URL必须散列为一个哈希值。 • 每个哈希值都可以映射回长URL。 哈希函数的详细设计将在深入探讨中讨论。
Up until now, we have discussed the high-level design of URL shortening and URL redirecting. In this section, we dive deep into the following: data model, hash function, URL shortening and URL redirecting. 到目前为止,我们已经讨论了 URL 缩短和 URL 的高级设计重定向。在本节中,我们将深入探讨以下内容:数据模型、哈希函数、URL 缩短和 URL 重定向。
In the high-level design, everything is stored in a hash table. This is a good starting point; however, this approach is not feasible for real-world systems as memory resources are limited and expensive. A better option is to store <shortURL, longURL> mapping in a relational database. Figure 8-4 shows a simple database table design. The simplified version of the table contains 3 columns: id, shortURL, longURL. 在高级设计中,所有内容都存储在哈希表中。这是一个很好的起点; 但是,这种方法对于实际系统是不可行的,因为内存资源有限。 而且价格昂贵。更好的选择是在关系中存储<短URL,长URL>映射 数据库。图 8-4 显示了一个简单的数据库表设计。表格的简化版本 包含 3 列:id、shortURL、longURL。
Hash function is used to hash a long URL to a short URL, also known as hashValue. 哈希函数用于将长 URL 哈希为短 URL,也称为哈希值。
The hashValue consists of characters from [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters. To figure out the length of hashValue, find the smallest n such that 62^n ≥ 365 billion. The system must support up to 365 billion URLs based on the back of the envelope estimation. Table 8-1 shows the length of hashValue and the corresponding maximal number of URLs it can support. 哈希值由 [0-9, a-z, A-Z] 中的字符组成,包含 10 + 26 + 26 = 62 可能的字符。要计算哈希值的长度,请找到最小的 n,使得 62^n≥3650亿。系统必须支持多达 3650 亿个基于 包络估计。表 8-1 显示了哈希值的长度和相应的它可以支持的最大 URL 数。
When n = 7, 62 ^ n = ~3.5 trillion, 3.5 trillion is more than enough to hold 365 billion URLs, so the length of hashValue is 7. We will explore two types of hash functions for a URL shortener. The first one is “hash + collision resolution”, and the second one is “base 62 conversion.” Let us look at them one by one 当 n = 7, 62 ^ n = ~3.5 万亿时,3.5 万亿足以容纳 3650 亿个 URL, 所以哈希值的长度是 7。 我们将探索 URL 缩短器的两种类型的哈希函数。第一个是“哈希+ 冲突解决“,第二个是”Base 62 转换”。让我们一一看一下 一
To shorten a long URL, we should implement a hash function that hashes a long URL to a 7- character string. A straightforward solution is to use well-known hash functions like CRC32, MD5, or SHA-1. The following table compares the hash results after applying different hash functions on this URL: https://en.wikipedia.org/wiki/Systems_design. 为了缩短长 URL,我们应该实现一个哈希函数,将长 URL 哈希为 7- 字符串。一个简单的解决方案是使用众所周知的哈希函数,如CRC32, MD5 或 SHA-1。下表比较了应用不同哈希后的哈希结果 此 URL 上的函数:https://en.wikipedia.org/wiki/Systems_design。
As shown in Table 8-2, even the shortest hash value (from CRC32) is too long (more than 7 characters). How can we make it shorter? The first approach is to collect the first 7 characters of a hash value; however, this method can lead to hash collisions. To resolve hash collisions, we can recursively append a new predefined string until no more collision is discovered. This process is explained in Figure 8- 5. 如表 8-2 所示,即使是最短的哈希值(来自 CRC32)也太长(超过 7 字符)。我们怎样才能缩短它? 第一种方法是收集哈希值的前 7 个字符;但是,这种方法 可能导致哈希冲突。为了解决哈希冲突,我们可以递归地附加一个新的 预定义字符串,直到不再发现冲突。这个过程在图 8-5 中进行了说明
This method can eliminate collision; however, it is expensive to query the database to check if a shortURL exists for every request. A technique called bloom filters [2] can improve performance. A bloom filter is a space-efficient probabilistic technique to test if an element is a member of a set. Refer to the reference material [2] for more details. 这种方法可以消除碰撞;但是,查询数据库进行检查的成本很高 如果每个请求都存在短 URL。一种称为布隆过滤器 [2] 的技术可以改善 性能。布隆过滤器是一种节省空间的概率技术,用于测试元素是否 集合的成员。有关更多详细信息,请参阅参考资料 [2]。
Base conversion is another approach commonly used for URL shorteners. Base conversion helps to convert the same number between its different number representation systems. Base 62 conversion is used as there are 62 possible characters for hashValue. Let us use an example to explain how the conversion works: convert 1115710 to base 62 representation (1115710 represents 11157 in a base 10 system). • From its name, base 62 is a way of using 62 characters for encoding. The mappings are: 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z, where ‘a’ stands for 10, ‘Z’ stands for 61, etc. • 1115710 = 2 x 622 + 55 x 621 + 59 x 620 = [2, 55, 59] -> [2, T, X] in base 62 representation. Figure 8-6 shows the conversation process. 基本转换是 URL 缩短器常用的另一种方法。基本转换 有助于在其不同的数字表示系统之间转换相同的数字。基础 使用 62 转换,因为哈希值有 62 个可能的字符。让我们使用 说明转换工作原理的示例:将1115710转换为基数 62 表示形式 (1115710 表示以 10 为基数的系统中的 11157)。 • 顾名思义,base 62 是一种使用 62 个字符进行编码的方式。映射是: 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z, 其中 'a' 代表 10, 'Z' 代表 61, 等。 • 1115710 = 2 x 622 + 55 x 621 + 59 x 620 = [2, 55, 59] -> [2, T, X] 以 62 为底 表示法。图 8-6 显示了对话过程。
• Thus, the short URL is https://tinyurl.com /2TX • 因此,短网址是 https://tinyurl.com /2TX
Table 8-3 shows the differences of the two approaches.
As one of the core pieces of the system, we want the URL shortening flow to be logically simple and functional. Base 62 conversion is used in our design. We build the following diagram (Figure 8-7) to demonstrate the flow. 作为系统的核心部分之一,我们希望 URL 缩短流程在逻辑上是 简单实用。我们的设计中使用了 Base 62 转换。我们构建以下内容 图(图 8-7)来演示流程。
1. longURL is the input. 2. The system checks if the longURL is in the database. 3. If it is, it means the longURL was converted to shortURL before. In this case, fetch the shortURL from the database and return it to the client. 4. If not, the longURL is new. A new unique ID (primary key) Is generated by the unique ID generator. 5. Convert the ID to shortURL with base 62 conversion. 6. Create a new database row with the ID, shortURL, and longURL. 1. 长网址是输入。 2. 系统检查长URL是否在数据库中。 3.如果是,则表示之前已将长URL转换为短URL。在这种情况下,获取数据库中的 shortURL 并将其返回给客户端。 4. 如果不是,则长网址是新的。一个新的唯一 ID(主键)由唯一ID 生成器。 5. 使用 base 62 转换将 ID 转换为短 URL。 6. 创建一个包含 ID、短 URL 和长 URL 的新数据库行。
To make the flow easier to understand, let us look at a concrete example. • Assuming the input longURL is: https://en.wikipedia.org/wiki/Systems_design • Unique ID generator returns ID: 2009215674938. • Convert the ID to shortURL using the base 62 conversion. ID (2009215674938) is converted to “zn9edcu”. • Save ID, shortURL, and longURL to the database as shown in Table 8-4. 为了使流程更易于理解,让我们看一个具体的例子。 • 假设输入的长URL是:https://en.wikipedia.org/wiki/Systems_design • 唯一 ID 生成器返回 ID:2009215674938。 • 使用 base 62 转换将 ID 转换为短 URL。ID (2009215674938) 是转换为“Zn9edcu”。 • 将 ID、短 URL 和长 URL 保存到数据库中,如表 8-4 所示。
The distributed unique ID generator is worth mentioning. Its primary function is to generate globally unique IDs, which are used for creating shortURLs. In a highly distributed environment, implementing a unique ID generator is challenging. Luckily, we have already discussed a few solutions in “Chapter 7: Design A Unique ID Generator in Distributed Systems”. You can refer back to it to refresh your memory. 分布式唯一 ID 生成器值得一提。它的主要功能是生成 全局唯一 ID,用于创建短 URL。在高度分布 环境中,实现唯一的 ID 生成器具有挑战性。幸运的是,我们已经 在“第 7 章:在分布式中设计唯一 ID 生成器”中讨论了一些解决方案 系统”。您可以参考它以刷新您的记忆。
Figure 8-8 shows the detailed design of the URL redirecting. As there are more reads than writes, <shortURL, longURL> mapping is stored in a cache to improve performance. 图 8-8 显示了 URL 重定向的详细设计。由于读取次数多于 writes、<shortURL、longURL>映射存储在缓存中以提高性能。
The flow of URL redirecting is summarized as follows: 1. A user clicks a short URL link: https://tinyurl.com/zn9edcu 2. The load balancer forwards the request to web servers. 3. If a shortURL is already in the cache, return the longURL directly. 4. If a shortURL is not in the cache, fetch the longURL from the database. If it is not in the database, it is likely a user entered an invalid shortURL. 5. The longURL is returned to the user. URL 重定向的流程总结如下: 1. 用户点击短 URL 链接:https://tinyurl.com/zn9edcu 2. 负载均衡器将请求转发到 Web 服务器。 3. 如果缓存中已有短网址,请直接返回长网址。 4. 如果缓存中没有短 URL,请从数据库中获取长 URL。如果它不在 数据库,则可能是用户输入了无效的短URL。 5. 将长 URL 返回给用户。
In this chapter, we talked about the API design, data model, hash function, URL shortening, and URL redirecting. If there is extra time at the end of the interview, here are a few additional talking points. • Rate limiter: A potential security problem we could face is that malicious users send an overwhelmingly large number of URL shortening requests. Rate limiter helps to filter out requests based on IP address or other filtering rules. If you want to refresh your memory about rate limiting, refer to “Chapter 4: Design a rate limiter”. • Web server scaling: Since the web tier is stateless, it is easy to scale the web tier by adding or removing web servers. • Database scaling: Database replication and sharding are common techniques. • Analytics: Data is increasingly important for business success. Integrating an analytics solution to the URL shortener could help to answer important questions like how many people click on a link? When do they click the link? etc. • Availability, consistency, and reliability. These concepts are at the core of any large system’s success. We discussed them in detail in Chapter 1, please refresh your memory on these topics. Congratulations on getting this far! Now give yourself a pat on the back. Good job! 在本章中,我们讨论了 API 设计、数据模型、哈希函数、URL 缩短、 和网址重定向。 如果面试结束时有额外的时间,这里有一些额外的谈话要点。 • 速率限制器:我们可能面临的潜在安全问题是恶意用户发送大量的 URL 缩短请求。速率限制器有助于过滤掉 基于 IP 地址或其他筛选规则的请求。如果您想刷新记忆关于速率限制,请参阅“第 4 章:设计速率限制器”。 • Web 服务器缩放:由于 Web 层是无状态的,因此很容易通过以下方式扩展 Web 层添加或删除 Web 服务器。 • 数据库扩展:数据库复制和分片是常用技术。 • 分析:数据对于业务成功越来越重要。集成分析URL缩短器的解决方案可以帮助回答重要问题,例如有多少 人们点击链接?他们什么时候点击链接?等。 • 可用性、一致性和可靠性。这些概念是任何大型的核心系统的成功。我们在第一章详细讨论过,请刷新记忆 关于这些主题。 恭喜你走到了这一步!现在拍拍自己的背。干得好!
[1] A RESTful Tutorial: https://www.restapitutorial.com/index.html [2] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
本文作者:Eric
本文链接:
版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!