CHAPTER 2: BACK-OF-THE-ENVELOPE ESTIMATION 粗略估计


In a system design interview, sometimes you are asked to estimate system capacity or
performance requirements using a back-of-the-envelope estimation. According to Jeff Dean,
Google Senior Fellow, “back-of-the-envelope calculations are estimates you create using a
combination of thought experiments and common performance numbers to get a good feel for
which designs will meet your requirements” [1].
You need to have a good sense of scalability basics to effectively carry out back-of-theenvelope estimation. The following concepts should be well understood: power of two [2],
latency numbers every programmer should know, and availability numbers.



在系统设计面试中，有时要求您使用粗略的估计来估计系统容量或性能要求。根据谷歌高级研究员杰夫·迪恩（Jeff Dean）的说法，“粗略计算是你使用思想实验和常见性能数字的组合来创建的估计，以便很好地了解哪些设计将满足您的要求” [1]。

您需要对可伸缩性基础知识有很好的了解，才能有效地执行粗略估计。应该很好地理解以下概念：2 的幂 [2]、每个程序员都应该知道的延迟数字和可用性数字

Power of two 数量化


Although data volume can become enormous when dealing with distributed systems,
calculation all boils down to the basics. To obtain correct calculations, it is critical to know
the data volume unit using the power of 2. A byte is a sequence of 8 bits. An ASCII character
uses one byte of memory (8 bits). Below is a table explaining the data volume unit (Table 2-
1).

尽管在处理分布式系统时，数据量可能会变得巨大，但计算都归结为基础知识。为了获得正确的计算，使用2的幂了解数据量单位至关重要。字节是 8 位的序列。ASCII 字符使用一个字节的内存（8 位）。下表解释了数据量单位（表 2- 1）。

approximative value 近似值

Latency numbers every programmer should know 每个程序员都应该知道的延迟数字


Dr. Dean from Google reveals the length of typical computer operations in 2010 [1]. Some
numbers are outdated as computers become faster and more powerful. However, those
numbers should still be able to give us an idea of the fastness and slowness of different
computer operations.


来自谷歌的Dean博士揭示了2010年[1]中典型计算机操作的长度。随着计算机变得更快和更强大，一些数字已经过时了。然而，这些数字仍然能够让我们了解不同计算机操作的速度和缓慢。

Branch mispredict 分支预测错误 Mutex lock/unlock 互斥锁机制 compress 1k bytes with Zippy 使用 zippy 编写 1k 字节 read 1MB sequentially from memory 从内存中按顺序读取1MB round trip within the same datacenter 在同一数据中心内的往返行程 disk seek 磁盘查找


Notes
-----------
ns = nanosecond, µs = microsecond, ms = millisecond
1 ns = 10^-9 seconds
1 µs= 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 µs = 1,000,000 ns
A Google software engineer built a tool to visualize Dr. Dean’s numbers. The tool also takes
the time factor into consideration. Figures 2-1 shows the visualized latency numbers as of
2020 (source of figures: reference material [3])

注释 
----------- 
ns = 纳秒，μs = 微秒，ms = 毫秒 
1 ns = 10^-9 秒 
1 μs = 10^-6 秒 = 1，000 ns 
1 ms = 10^-3 秒 = 1，000 μs = 1，000，000 ns 

一位谷歌软件工程师构建了一个工具来可视化迪恩博士的数字。该工具还考虑了时间因素。图 2-1 显示了截至 2020 年的可视化延迟数字（数据来源：参考资料 [3]）


By analyzing the numbers in Figure 2-1, we get the following conclusions:
• Memory is fast but the disk is slow.
• Avoid disk seeks if possible.
• Simple compression algorithms are fast.
• Compress data before sending it over the internet if possible.
• Data centers are usually in different regions, and it takes time to send data between them.

通过分析图 2-1 中的数字，我们得出以下结论： 
• 内存很快，但磁盘速度较慢。
• 尽可能避免磁盘寻道。
• 简单的压缩算法是快速的。
• 如果可能，在通过互联网发送数据之前压缩数据。
• 数据中心通常位于不同的区域，在它们之间发送数据需要时间。

Availability numbers 可用性数字


High availability is the ability of a system to be continuously operational for a desirably long
period of time. High availability is measured as a percentage, with 100% means a service that
has 0 downtime. Most services fall between 99% and 100%.

A service level agreement (SLA) is a commonly used term for service providers. This is an
agreement between you (the service provider) and your customer, and this agreement
formally defines the level of uptime your service will deliver. Cloud providers Amazon [4],
Google [5] and Microsoft [6] set their SLAs at 99.9% or above. Uptime is traditionally
measured in nines. The more the nines, the better. As shown in Table 2-3, the number of
nines correlate to the expected system downtime.

高可用性是系统在理想的长时间内连续运行的能力。高可用性以百分比表示，100% 表示服务停机时间为 0。大多数服务介于 99% 到 100% 之间。

服务级别协议 （SLA） 是服务提供商的常用术语。这是您（服务提供商）与您的客户之间的协议，本协议正式定义了您的服务将提供的正常运行时间级别。云提供商亚马逊[4]，谷歌[5]和微软[6]将其SLA设置为99.9%或以上。正常运行时间传统上以九为单位。九越多越好。如表 2-3 所示，9 的数量与预期的系统停机时间相关。

Example: Estimate Twitter QPS and storage requirements 示例：估算推特 QPS 和存储需求


Please note the following numbers are for this exercise only as they are not real numbers
from Twitter.
Assumptions:
• 300 million monthly active users.
• 50% of users use Twitter daily.
• Users post 2 tweets per day on average.
• 10% of tweets contain media.
• Data is stored for 5 years.
Estimations:
Query per second (QPS) estimate:
• Daily active users (DAU) = 300 million * 50% = 150 million
• Tweets QPS = 150 million * 2 tweets / 24 hour / 3600 seconds = ~3500
• Peek QPS = 2 * QPS = ~7000
We will only estimate media storage here.
• Average tweet size:
• tweet_id 64 bytes
• text 140 bytes
• media 1 MB
• Media storage: 150 million * 2 * 10% * 1 MB = 30 TB per day
• 5-year media storage: 30 TB * 365 * 5 = ~55 PB


请注意，以下数字仅用于本练习，因为它们不是推特上的真实数字。
假设：
• 活跃用户为每月3亿。
• 50%的用户每天都在使用推特。
• 用户平均每天发布2条推特。
• 有10%的推文包含媒体。
• 数据存储时间5年。

估计：
每秒查询（QPS）估计：
* 每日活跃用户（DAU）=3亿*50%=1.5亿条
* 推文QPS=1.5亿*2条推文/ 24小时/ 3600秒=~3500
* Peek QPS = 2 * QPS = ~7000

我们只估计这里的媒体存储。
• 平均tweet大小：
• tweet_id 64字节
• 文本140字节
• 媒体1 MB
• 媒体存储：1.5亿* 2 * 10% * 1 MB = 30 TB每天
• 5年媒体存储： 30 TB * 365 * 5 = ~55 PB

Tips 技巧


Back-of-the-envelope estimation is all about the process. Solving the problem is more
important than obtaining results. Interviewers may test your problem-solving skills. Here are
a few tips to follow:
• Rounding and Approximation. It is difficult to perform complicated math operations
during the interview. For example, what is the result of “99987 / 9.1”? There is no need to
spend valuable time to solve complicated math problems. Precision is not expected. Use
round numbers and approximation to your advantage. The division question can be
simplified as follows: “100,000 / 10”.
• Write down your assumptions. It is a good idea to write down your assumptions to be
referenced later.
• Label your units. When you write down “5”, does it mean 5 KB or 5 MB? You might
confuse yourself with this. Write down the units because “5 MB” helps to remove
ambiguity.
• Commonly asked back-of-the-envelope estimations: QPS, peak QPS, storage, cache,
number of servers, etc. You can practice these calculations when preparing for an
interview. Practice makes perfect.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!

事后估计的一切都是关于这个过程的。解决这个问题比获得结果更重要。面试官可能会测试你解决问题的能力。下面有一些建议：
* 四舍五入和近似值。在面试过程中，很难执行复杂的数学操作。例如，“99987/9.1”的结果是什么？没有必要花宝贵的时间来解决复杂的数学问题。精度不是预期的。最好使用整数和近似值。划分问题可以简化为：“100000/10”。
• 写下你的假设。写下你的假设以供以后引用是个好主意。
• 标记你的单位。当你写下“5”时，它是指5 KB还是5 MB？你可能会把自己搞混了。写下单位，因为“5MB”有助于消除歧义。
• 通常要求的信封背面估计： QPS、峰值QPS、存储、缓存、服务器数量等。你可以在准备面试时练习这些计算。

熟能生巧。祝贺你能走到这一步！现在拍拍自己的背。干得好！

Reference materials


[1] J. Dean.Google Pro Tip: Use Back-Of-The-Envelope-Calculations To Choose The Best
Design:
http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelopecalculations-to-choo.html
[2] System design primer: https://github.com/donnemartin/system-design-primer
[3] Latency Numbers Every Programmer Should Know:
https://colin-scott.github.io/personal_website/research/interactive_latency.html
[4] Amazon Compute Service Level Agreement:
https://aws.amazon.com/compute/sla/
[5] Compute Engine Service Level Agreement (SLA):
https://cloud.google.com/compute/sla
[6] SLA summary for Azure services: https://azure.microsoft.com/enus/support/legal/sla/summary/

目录

CHAPTER 2: BACK-OF-THE-ENVELOPE ESTIMATION 粗略估计

Power of two 数量化

Latency numbers every programmer should know 每个程序员都应该知道的延迟数字

Availability numbers 可用性数字

Example: Estimate Twitter QPS and storage requirements 示例：估算推特 QPS 和存储需求

Tips 技巧

Reference materials