CHAPTER 15: DESIGN GOOGLE DRIVE


In recent years, cloud storage services such as Google Drive, Dropbox, Microsoft OneDrive, and Apple iCloud have become very popular. In this chapter, you are asked to design Google Drive.

Let us take a moment to understand Google Drive before jumping into the design. Google Drive is a file storage and synchronization service that helps you store documents, photos, videos, and other files in the cloud. You can access your files from any computer, smartphone, or tablet. You can easily share those files with friends, family, and coworkers [1]. Figures 15-1 and 15-2 show what Google Drive looks like in a browser and in the mobile application, respectively.

Step 1 - Understand the problem and establish design scope


Designing Google Drive is a big project, so it is important to ask questions to narrow down the scope.

Candidate: What are the most important features?
Interviewer: Upload and download files, file sync, and notifications.
Candidate: Is this a mobile app, a web app, or both?
Interviewer: Both.
Candidate: What are the supported file formats?
Interviewer: Any file type.
Candidate: Do files need to be encrypted?
Interviewer: Yes, files in the storage must be encrypted.
Candidate: Is there a file size limit?
Interviewer: Yes, files must be 10 GB or smaller.
Candidate: How many users does the product have?
Interviewer: 10M DAU.


In this chapter, we focus on the following features:
• Add files. The easiest way to add a file is to drag and drop a file into Google Drive.
• Download files.
• Sync files across multiple devices. When a file is added to one device, it is automatically synced to other devices.
• See file revisions.
• Share files with your friends, family, and coworkers.
• Send a notification when a file is edited, deleted, or shared with you.

Features not discussed in this chapter include:
• Google Docs editing and collaboration. Google Docs allows multiple people to edit the same document simultaneously. This is out of our design scope.

Beyond the functional requirements, it is important to understand the non-functional requirements:
• Reliability. Reliability is extremely important for a storage system. Data loss is unacceptable.
• Fast sync speed. If file sync takes too much time, users will become impatient and abandon the product.
• Bandwidth usage. If a product takes a lot of unnecessary network bandwidth, users will be unhappy, especially when they are on a mobile data plan. 
• Scalability. The system should be able to handle high volumes of traffic.
• High availability. Users should still be able to use the system when some servers are offline, slowed down, or have unexpected network errors.

Back of the envelope estimation


• Assume the application has 50 million signed-up users and 10 million DAU.
• Users get 10 GB free space.
• Assume users upload 2 files per day. The average file size is 500 KB.
• 1:1 read to write ratio.
• Total space allocated: 50 million * 10 GB = 500 petabytes (PB)
• QPS for upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~ 240
• Peak QPS = QPS * 2 = 480
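
The arithmetic above can be reproduced with a short script. This is only a sketch using the assumptions listed in the bullets; the variable names are illustrative, and decimal units (1 PB = 1,000,000 GB) are assumed. The exact quotient for upload QPS comes out at about 231, which the estimate above rounds up to roughly 240 (and 480 at peak).

```python
# Inputs are the assumptions from the estimation list above.
SIGNED_UP_USERS = 50_000_000
DAILY_ACTIVE_USERS = 10_000_000
FREE_SPACE_GB = 10
UPLOADS_PER_USER_PER_DAY = 2
SECONDS_PER_DAY = 24 * 3600
GB_PER_PB = 1_000_000  # assumption: decimal units

# Space that must be provisioned if every signed-up user fills the quota.
total_space_pb = SIGNED_UP_USERS * FREE_SPACE_GB / GB_PER_PB

# Average and peak request rates for the upload API.
upload_qps = DAILY_ACTIVE_USERS * UPLOADS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_qps = upload_qps * 2

print(f"Total space allocated: {total_space_pb:.0f} PB")
print(f"Upload QPS: {upload_qps:.0f}")
print(f"Peak QPS: {peak_qps:.0f}")
```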

Step 2 - Propose high-level design and get buy-in


Instead of showing the high-level design diagram from the beginning, we will use a slightly different approach. We will start with something simple: build everything in a single server. Then, gradually scale it up to support millions of users. This exercise will refresh your memory about some important topics covered in the book. Let us start with a single server setup as listed below:
• A web server to upload and download files.
• A database to keep track of metadata like user data, login info, files info, etc.
• A storage system to store files. We allocate 1TB of storage space to store files.


We spend a few hours setting up an Apache web server, a MySQL database, and a directory called drive/ as the root directory to store uploaded files. Under the drive/ directory, there is a list of directories, known as namespaces. Each namespace contains all the uploaded files for one user. The filename on the server is kept the same as the original filename. Each file or folder can be uniquely identified by joining the namespace and the relative path.

Figure 15-3 shows an example of what the drive/ directory looks like on the left side and its expanded view on the right side.
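
As a rough illustration of how a namespace and a relative path combine into a unique location, consider the sketch below. The function name `server_path` and the namespace `user1` are hypothetical and chosen only for this example; the check against `..` is an added safety assumption, not something the design above specifies.

```python
from pathlib import PurePosixPath

DRIVE_ROOT = PurePosixPath("drive")  # root directory named in the text


def server_path(namespace: str, relative_path: str) -> PurePosixPath:
    """Join a user's namespace and a relative path into one server-side path."""
    rel = PurePosixPath(relative_path.lstrip("/"))
    # Assumption: reject paths that would escape the user's namespace.
    if ".." in rel.parts:
        raise ValueError("relative path must stay inside the namespace")
    return DRIVE_ROOT / namespace / rel


print(server_path("user1", "recipes/soup/best_soup.txt"))
```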

APIs


What do the APIs look like? We primarily need three APIs: upload a file, download a file, and get file revisions.

1. Upload a file to Google Drive


Two types of uploads are supported:
• Simple upload. Use this upload type when the file size is small.
• Resumable upload. Use this upload type when the file size is large and there is a high chance of network interruption.

Here is an example of resumable upload API:
https://api.example.com/files/upload?uploadType=resumable
Params:
• uploadType=resumable
• data: Local file to be uploaded.
A resumable upload is achieved by the following 3 steps [2]:
• Send the initial request to retrieve the resumable URL.
• Upload the data and monitor upload state.
• If the upload is interrupted, resume the upload.
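
The three steps can be sketched with an in-memory simulation. This is not the Google Drive API: `ResumableUploadSession` is a hypothetical stand-in for the server side of a resumable URL, and the `fail_after` parameter exists only to simulate a network interruption. The point it illustrates is that a resumed upload starts from the offset the server reports instead of from byte zero.

```python
class ResumableUploadSession:
    """Hypothetical in-memory stand-in for the server side of a resumable upload."""

    def __init__(self, total_size: int):
        self.total_size = total_size
        self.received = bytearray()

    def offset(self) -> int:
        # The client asks how many bytes the server already holds.
        return len(self.received)

    def put(self, offset: int, chunk: bytes) -> None:
        if offset != len(self.received):
            raise ValueError("chunk does not start at the expected offset")
        self.received.extend(chunk)

    def complete(self) -> bool:
        return len(self.received) == self.total_size


def upload(session, data: bytes, chunk_size: int, fail_after=None) -> None:
    """Send data in chunks, starting from the offset the server reports."""
    sent_chunks = 0
    offset = session.offset()
    while offset < len(data):
        if fail_after is not None and sent_chunks == fail_after:
            raise ConnectionError("simulated network interruption")
        chunk = data[offset:offset + chunk_size]
        session.put(offset, chunk)
        offset += len(chunk)
        sent_chunks += 1


data = bytes(range(10))
session = ResumableUploadSession(len(data))  # step 1: obtain the session
try:
    upload(session, data, chunk_size=4, fail_after=1)  # step 2: interrupted
except ConnectionError:
    pass
upload(session, data, chunk_size=4)  # step 3: resume from the stored offset
```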

2. Download a file from Google Drive


Example API: https://api.example.com/files/download
Params:
• path: download file path.
Example params:
{
"path": "/recipes/soup/best_soup.txt"
}

3. Get file revisions


Example API: https://api.example.com/files/list_revisions
Params:
• path: The path to the file whose revision history you want to get.
• limit: The maximum number of revisions to return.
Example params:
{
"path": "/recipes/soup/best_soup.txt",
"limit": 20
}


All the APIs require user authentication and use HTTPS. Transport Layer Security (TLS), the successor to Secure Sockets Layer (SSL), protects data transfer between the client and backend servers.

Move away from single server


As more files are uploaded, you eventually get a storage-full alert, as shown in Figure 15-4.


Only 10 MB of storage space is left! This is an emergency as users cannot upload files anymore. The first solution that comes to mind is to shard the data so that it is stored on multiple storage servers. Figure 15-5 shows an example of sharding based on user_id.
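
A minimal sketch of sharding by user_id follows. The modulo scheme, the number of shards, and the server names are assumptions for illustration; the text only states that sharding is based on user_id.

```python
# Hypothetical storage server names; four shards is an assumption.
STORAGE_SERVERS = ["storage-0", "storage-1", "storage-2", "storage-3"]


def shard_for(user_id: int, num_shards: int = len(STORAGE_SERVERS)) -> int:
    """Map a user to a shard index so all of that user's files land together."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return user_id % num_shards


def server_for(user_id: int) -> str:
    return STORAGE_SERVERS[shard_for(user_id)]


print(server_for(10))
```

One consequence of a plain modulo scheme is that changing the number of shards remaps most users, which is a reason to plan the shard count carefully.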

You pull an all-nighter to set up database sharding and monitor it closely. Everything works smoothly again. You have put out the fire, but you are still worried about potential data loss in case of a storage server outage. You ask around, and your backend guru friend Frank tells you that many leading companies like Netflix and Airbnb use Amazon S3 for storage. “Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance” [3]. You decide to do some research to see if it is a good fit.

After a lot of reading, you gain a good understanding of the S3 storage system and decide to store files in S3. Amazon S3 supports same-region and cross-region replication. A region is a geographic area where Amazon Web Services (AWS) has data centers. As shown in Figure 15-6, data can be replicated within the same region (left side) or across regions (right side). Redundant files are stored in multiple regions to guard against data loss and ensure availability. A bucket is like a folder in file systems.

After putting files in S3, you can finally have a good night's sleep without worrying about data loss. To stop similar problems from happening in the future, you decide to do further research on areas you can improve. Here are a few areas you find:
• Load balancer: Add a load balancer to distribute network traffic. A load balancer ensures evenly distributed traffic, and if a web server goes down, it will redistribute the traffic.
• Web servers: After a load balancer is added, more web servers can be added/removed easily, depending on the traffic load.
• Metadata database: Move the database out of the server to avoid a single point of failure. In the meantime, set up data replication and sharding to meet the availability and scalability requirements.
• File storage: Amazon S3 is used for file storage. To ensure availability and durability, files are replicated in two separate geographical regions.

After applying the above improvements, you have successfully decoupled web servers,
metadata database, and file storage from a single server. The updated design is shown in Figure 15-7.

Sync conflicts


For a large storage system like Google Drive, sync conflicts happen from time to time. When two users modify the same file or folder at the same time, a conflict happens. How can we resolve the conflict? Here is our strategy: the first version that gets processed wins, and the version that gets processed later receives a conflict. Figure 15-8 shows an example of a sync conflict.

In Figure 15-8, user 1 and user 2 try to update the same file at the same time, but user 1’s file is processed by our system first. User 1’s update operation goes through, but user 2 gets a sync conflict. How can we resolve the conflict for user 2? Our system presents both copies of the same file: user 2’s local copy and the latest version from the server (Figure 15-9). User 2 has the option to merge both files or override one version with the other.
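
One possible way to implement "the first version that gets processed wins" is a version check on every update. The sketch below assumes each update carries the version it was based on; the class and method names (`FileServer`, `SyncConflict`) are hypothetical and not part of the design described here.

```python
class SyncConflict(Exception):
    """Raised when an update is based on a stale version of the file."""


class FileServer:
    def __init__(self):
        self._files = {}  # path -> (version, content)

    def get(self, path):
        # Unknown files are reported as version 0 with empty content.
        return self._files.get(path, (0, b""))

    def update(self, path, base_version, content):
        current_version, _ = self.get(path)
        # Accept the update only if it was made against the current version.
        if base_version != current_version:
            raise SyncConflict(
                f"{path}: based on v{base_version}, server has v{current_version}"
            )
        self._files[path] = (current_version + 1, content)
        return current_version + 1


server = FileServer()
base, _ = server.get("/notes.txt")  # both users start from the same version
server.update("/notes.txt", base, b"user 1 edit")  # processed first: wins
try:
    server.update("/notes.txt", base, b"user 2 edit")  # processed second
    user2_result = "ok"
except SyncConflict:
    user2_result = "conflict"  # user 2 is shown both copies to resolve
```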

When multiple users are editing the same document at the same time, it is challenging to keep the document synchronized. Interested readers should refer to the reference materials [4] [5].

High-level design


Figure 15-10 illustrates the proposed high-level design. Let us examine each component of the system.


User: A user uses the application either through a browser or a mobile app.

Block servers: Block servers upload blocks to cloud storage. Block storage, also referred to as block-level storage, is a technology for storing data files in cloud-based environments. A file can be split into several blocks, each with a unique hash value that is stored in our metadata database. Each block is treated as an independent object and stored in our storage system (S3). To reconstruct a file, blocks are joined in a particular order. As for the block size, we use Dropbox as a reference: it sets the maximal size of a block to 4 MB [6].

Cloud storage: A file is split into smaller blocks and stored in cloud storage.

Cold storage: Cold storage is a computer system designed for storing inactive data, meaning files that have not been accessed for a long time.

Load balancer: A load balancer evenly distributes requests among API servers.

API servers: These are responsible for almost everything other than the uploading flow. API servers are used for user authentication, managing user profile, updating file metadata, etc.

Metadata database: It stores metadata of users, files, blocks, versions, etc. Please note that files are stored in the cloud and the metadata database only contains metadata.

Metadata cache: Some of the metadata are cached for fast retrieval. 

Notification service: It is a publisher/subscriber system that allows data to be transferred from notification service to clients as certain events happen. In our specific case, notification service notifies relevant clients when a file is added/edited/removed elsewhere so they can pull the latest changes.

Offline backup queue: If a client is offline and cannot pull the latest file changes, the offline backup queue stores the info so changes will be synced when the client is online. 

We have discussed the design of Google Drive at a high level. Some of the components are complicated and worth careful examination; we will discuss these in detail in the deep dive.

Step 3 - Design deep dive


In this section, we will take a close look at the following: block servers, metadata database, upload flow, download flow, notification service, saving storage space, and failure handling.

Block servers


For large files that are updated regularly, sending the whole file on each update consumes a lot of bandwidth. Two optimizations are proposed to minimize the amount of network traffic being transmitted:
• Delta sync. When a file is modified, only the modified blocks are synced, using a sync algorithm [7] [8], instead of the whole file.
• Compression. Applying compression on blocks can significantly reduce the data size. Thus, blocks are compressed using compression algorithms depending on file types. For example, gzip and bzip2 are used to compress text files. Different compression algorithms are needed to compress images and videos.

In our system, block servers do the heavy lifting for uploading files. Block servers process files passed from clients by splitting a file into blocks, compressing each block, and encrypting them. Instead of uploading the whole file to the storage system, only modified blocks are transferred.

Figure 15-11 shows how a block server works when a new file is added.

• A file is split into smaller blocks.
• Each block is compressed using compression algorithms.
• To ensure security, each block is encrypted before it is sent to cloud storage.
• Blocks are uploaded to the cloud storage.
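
The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the actual block server implementation: SHA-256 as the block hash and zlib as the compressor are choices made for this example, and encryption is passed in as a callable because Python's standard library has no symmetric cipher. The demo uses an identity function as a placeholder, which provides no security; a real system would plug in an authenticated cipher from a vetted library.

```python
import hashlib
import zlib
from typing import Callable, Iterator, List, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB maximum block size, following the Dropbox reference [6]


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size blocks; the last block may be shorter."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def process_file(
    data: bytes,
    encrypt: Callable[[bytes], bytes],
    block_size: int = BLOCK_SIZE,
) -> List[Tuple[str, bytes]]:
    """Return (block_hash, payload) pairs in file order, ready for upload."""
    processed = []
    for block in split_blocks(data, block_size):
        block_hash = hashlib.sha256(block).hexdigest()  # hash of the raw block
        payload = encrypt(zlib.compress(block))  # compress first, then encrypt
        processed.append((block_hash, payload))
    return processed


# Placeholder "encryption" for the demo only: identity, NOT secure.
blocks = process_file(b"a" * 10, encrypt=lambda b: b, block_size=4)
print(len(blocks))
```

Compressing before encrypting is deliberate: encrypted data looks random and compresses poorly.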

Figure 15-12 illustrates delta sync, meaning only modified blocks are transferred to cloud storage. Highlighted blocks “block 2” and “block 5” represent changed blocks. Using delta sync, only those two blocks are uploaded to the cloud storage.
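
A simplified way to find the changed blocks is to compare per-block hashes of the old and new versions. The sketch below assumes fixed-size blocks at fixed offsets and SHA-256 hashes, both chosen for illustration. It is not the rsync algorithm [7], which uses rolling checksums so that an insertion near the start of a file does not make every later block look changed.

```python
import hashlib
from typing import List


def block_hashes(data: bytes, block_size: int) -> List[str]:
    """Hash each fixed-size block of the file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old_hashes: List[str], new_hashes: List[str]) -> List[int]:
    """Zero-based indices of blocks in the new version that must be uploaded."""
    return [
        i
        for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


old = b"AAAABBBBCCCCDDDDEEEE"
new = b"AAAAbbbbCCCCDDDDeeee"  # the second and fifth blocks differ
changed = changed_blocks(block_hashes(old, 4), block_hashes(new, 4))
print(changed)  # zero-based, so "block 2" and "block 5" appear as 1 and 4
```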

Block servers allow us to save network traffic by providing delta sync and compression.

High consistency requirement


Our system requires strong consistency by default. It is unacceptable for a file to be shown differently by different clients at the same time. The system needs to provide strong consistency for metadata cache and database layers.

Memory caches adopt an eventual consistency model by default, which means different replicas might have different data. To achieve strong consistency, we must ensure the following:
• Data in cache replicas and the master is consistent.
• Invalidate caches on database write to ensure cache and database hold the same value.
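
The second point, invalidating the cache on every database write, can be sketched as follows. Plain dictionaries stand in for the metadata database and the cache, and the class name `MetadataStore` is hypothetical. The sketch is single-process and ignores concurrent readers and writers, which a real deployment would have to handle.

```python
class MetadataStore:
    def __init__(self):
        self._db = {}     # stand-in for the metadata database
        self._cache = {}  # stand-in for the metadata cache

    def read(self, key):
        # Serve from the cache when possible; otherwise fall back to the database.
        if key in self._cache:
            return self._cache[key]
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value
        return value

    def write(self, key, value):
        # Write to the database, then drop the cached copy so the next
        # read cannot return the stale value.
        self._db[key] = value
        self._cache.pop(key, None)


store = MetadataStore()
store.write("file-1", {"version": 1})
first_read = store.read("file-1")   # populates the cache
store.write("file-1", {"version": 2})
second_read = store.read("file-1")  # cache was invalidated, so this is fresh
```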

Achieving strong consistency in a relational database is easy because it maintains the ACID (Atomicity, Consistency, Isolation, Durability) properties [9]. However, many NoSQL databases do not support ACID properties by default, so ACID properties must be programmatically incorporated in the synchronization logic. In our design, we choose relational databases because ACID is natively supported.

Metadata database


Figure 15-13 shows the database schema design. Please note this is a highly simplified version as it only includes the most important tables and interesting fields.

User: The user table contains basic information about the user such as username, email, profile photo, etc.
Device: Device table stores device info. Push_id is used for sending and receiving mobile push notifications. Please note a user can have multiple devices.
Namespace: A namespace is the root directory of a user.
File: File table stores everything related to the latest file.
File_version: It stores version history of a file. Existing rows are read-only to keep the integrity of the file revision history.
Block: It stores everything related to a file block. A file of any version can be reconstructed by joining all the blocks in the correct order.
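
To make the relationships between these tables concrete, here is a sketch using an in-memory SQLite database. Only the table names and push_id come from the descriptions above; every other column name (for example `block_order` and `block_hash`) is an assumption for illustration, since the figure's exact fields are not reproduced here. The final query shows how the blocks of one file version can be read back in the correct order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (id INTEGER PRIMARY KEY, user_name TEXT, email TEXT);
    CREATE TABLE device (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES user(id), push_id TEXT);
    CREATE TABLE namespace (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES user(id));
    CREATE TABLE file (id INTEGER PRIMARY KEY,
                       namespace_id INTEGER REFERENCES namespace(id),
                       relative_path TEXT, latest_version INTEGER);
    CREATE TABLE file_version (id INTEGER PRIMARY KEY,
                               file_id INTEGER REFERENCES file(id),
                               version_number INTEGER);
    CREATE TABLE block (id INTEGER PRIMARY KEY,
                        file_version_id INTEGER REFERENCES file_version(id),
                        block_order INTEGER, block_hash TEXT);
    """
)

conn.execute("INSERT INTO user (id, user_name, email) VALUES (1, 'alice', 'alice@example.com')")
conn.execute("INSERT INTO namespace (id, user_id) VALUES (1, 1)")
conn.execute(
    "INSERT INTO file (id, namespace_id, relative_path, latest_version) "
    "VALUES (1, 1, 'recipes/soup/best_soup.txt', 1)"
)
conn.execute("INSERT INTO file_version (id, file_id, version_number) VALUES (1, 1, 1)")
# Blocks are inserted out of order on purpose.
conn.executemany(
    "INSERT INTO block (file_version_id, block_order, block_hash) VALUES (?, ?, ?)",
    [(1, 2, "hash-c"), (1, 0, "hash-a"), (1, 1, "hash-b")],
)

# Reconstructing a version means reading its blocks in block_order.
ordered_hashes = [
    row[0]
    for row in conn.execute(
        "SELECT block_hash FROM block WHERE file_version_id = ? ORDER BY block_order",
        (1,),
    )
]
print(ordered_hashes)
```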

Upload flow


Let us discuss what happens when a client uploads a file. To better understand the flow, we draw the sequence diagram as shown in Figure 15-14.

In Figure 15-14, two requests are sent in parallel: add file metadata and upload the file to cloud storage. Both requests originate from client 1.
• Add file metadata. 
    1. Client 1 sends a request to add the metadata of the new file.
    2. Store the new file metadata in metadata DB and change the file upload status to “pending.”
    3. Notify the notification service that a new file is being added.
    4. The notification service notifies relevant clients (client 2) that a file is being uploaded.
    
• Upload files to cloud storage.
    2.1 Client 1 uploads the content of the file to block servers.
    2.2 Block servers chunk the files into blocks, compress and encrypt the blocks, and upload them to cloud storage.
    2.3 Once the file is uploaded, cloud storage triggers an upload completion callback. The request is sent to API servers.
    2.4 The file status is changed to “uploaded” in Metadata DB.
    2.5 Notify the notification service that a file status is changed to “uploaded.”
    2.6 The notification service notifies relevant clients (client 2) that a file is fully uploaded.
    
When a file is edited, the flow is similar, so we will not repeat it.
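
The status changes and notifications in the two flows can be sketched as a small state tracker. The names `UploadTracker` and `UploadStatus` are hypothetical, and the sketch assumes the metadata record exists before the completion callback arrives; it raises an error otherwise.

```python
from enum import Enum


class UploadStatus(Enum):
    PENDING = "pending"
    UPLOADED = "uploaded"


class UploadTracker:
    """Tracks file upload status and notifies other clients on each change."""

    def __init__(self, notify):
        self._status = {}      # stand-in for the status column in metadata DB
        self._notify = notify  # stand-in for the notification service

    def add_file_metadata(self, file_id):
        # Add file metadata flow: store metadata as "pending", then notify.
        self._status[file_id] = UploadStatus.PENDING
        self._notify(file_id, UploadStatus.PENDING)

    def on_upload_complete(self, file_id):
        # Upload flow, steps 2.3 to 2.6: completion callback from cloud storage.
        if self._status.get(file_id) is not UploadStatus.PENDING:
            raise ValueError(f"no pending upload for {file_id}")
        self._status[file_id] = UploadStatus.UPLOADED
        self._notify(file_id, UploadStatus.UPLOADED)

    def status(self, file_id):
        return self._status[file_id]


events = []
tracker = UploadTracker(notify=lambda fid, status: events.append((fid, status.value)))
tracker.add_file_metadata("file-1")
tracker.on_upload_complete("file-1")
print(events)
```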

Download flow


Download flow is triggered when a file is added or edited elsewhere. How does a client know if a file is added or edited by another client? There are two ways a client can know:

    • If client A is online while a file is changed by another client, notification service will inform client A that changes are made somewhere so it needs to pull the latest data.
    • If client A is offline while a file is changed by another client, data will be saved to the cache. When the offline client is online again, it pulls the latest changes.
    
Once a client knows a file is changed, it first requests metadata via API servers, then downloads blocks to construct the file. Figure 15-15 shows the detailed flow. Note that only the most important components are shown in the diagram due to space constraints.

1. Notification service informs client 2 that a file is changed somewhere else.
2. Once client 2 knows that new updates are available, it sends a request to fetch metadata.
3. API servers call metadata DB to fetch metadata of the changes.
4. Metadata is returned to the API servers.
5. Client 2 gets the metadata.
6. Once the client receives the metadata, it sends requests to block servers to download blocks.
7. Block servers first download blocks from cloud storage.
8. Cloud storage returns blocks to the block servers.
9. Client 2 downloads all the new blocks to reconstruct the file.
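
The last step, rebuilding the file from its blocks, can be sketched as follows. The sketch assumes the metadata is a list of (order, block_hash) pairs, that blocks were compressed with zlib, and that the hash is SHA-256 over the uncompressed block; these are illustrative choices, and decryption is omitted for brevity. The function name `reconstruct_file` and the `fetch_block` callback are hypothetical.

```python
import hashlib
import zlib
from typing import Callable, Iterable, Tuple


def reconstruct_file(
    block_metadata: Iterable[Tuple[int, str]],
    fetch_block: Callable[[str], bytes],
) -> bytes:
    """Join blocks in metadata order, verifying each block against its hash."""
    parts = []
    for _, block_hash in sorted(block_metadata):  # sort by block order
        block = zlib.decompress(fetch_block(block_hash))
        if hashlib.sha256(block).hexdigest() != block_hash:
            raise ValueError("block failed integrity check")
        parts.append(block)
    return b"".join(parts)


# Demo with a dictionary standing in for cloud storage.
original_blocks = [b"first ", b"second ", b"third"]
cloud_storage = {
    hashlib.sha256(b).hexdigest(): zlib.compress(b) for b in original_blocks
}
metadata = [(i, hashlib.sha256(b).hexdigest()) for i, b in enumerate(original_blocks)]

# Metadata arrives in arbitrary order; sorting restores the file order.
restored = reconstruct_file(reversed(metadata), cloud_storage.__getitem__)
print(restored)
```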

Notification service


To maintain file consistency, other clients need to be informed of any mutation of a file performed locally, which reduces conflicts. The notification service is built to serve this purpose. At a high level, the notification service allows data to be transferred to clients as events happen. Here are a few options:
• Long polling. Dropbox uses long polling [10].
• WebSocket. WebSocket provides a persistent connection between the client and the server. Communication is bi-directional.

Even though both options work well, we opt for long polling for the following two reasons:
• Communication for notification service is not bi-directional. The server sends information about file changes to the client, but not vice versa.
• WebSocket is suited for real-time bi-directional communication such as a chat app. For Google Drive, notifications are sent infrequently with no burst of data.

With long polling, each client establishes a long poll connection to the notification service. If changes to a file are detected, the client will close the long poll connection. Closing the connection means a client must connect to the metadata server to download the latest changes. After a response is received or connection timeout is reached, a client immediately sends a new request to keep the connection open.
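
The long polling loop can be sketched in-process with a blocking queue standing in for the HTTP connection. `NotificationChannel` and `client_loop` are hypothetical names, and the very short timeout is chosen only so the demo finishes quickly; a real long poll would typically wait much longer.

```python
import queue


class NotificationChannel:
    """Hypothetical per-client channel inside the notification service."""

    def __init__(self):
        self._events = queue.Queue()

    def publish(self, event):
        self._events.put(event)

    def long_poll(self, timeout):
        """Block until an event arrives or the timeout expires (returns None)."""
        try:
            return self._events.get(timeout=timeout)
        except queue.Empty:
            return None


def client_loop(channel, on_change, max_polls, timeout=0.01):
    """After each response or timeout, immediately issue the next long poll."""
    for _ in range(max_polls):
        event = channel.long_poll(timeout)
        if event is not None:
            # A change was reported: the client would now fetch the latest
            # metadata and blocks before polling again.
            on_change(event)


channel = NotificationChannel()
received = []
channel.publish("file changed: /recipes/soup/best_soup.txt")
client_loop(channel, received.append, max_polls=2)  # second poll times out
print(received)
```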

Save storage space


To support file version history and ensure reliability, multiple versions of the same file are stored across multiple data centers. Storage space can be filled up quickly with frequent backups of all file revisions. Three techniques are proposed to reduce storage costs:

• De-duplicate data blocks. Eliminating redundant blocks at the account level is an easy way to save space. Two blocks are treated as identical if they have the same hash value.
• Adopt an intelligent data backup strategy. Two optimization strategies can be applied:
    • Set a limit: We can set a limit for the number of versions to store. If the limit is reached, the oldest version will be replaced with the new version.
    • Keep valuable versions only: Some files might be edited frequently. For example, saving every edited version for a heavily modified document could mean the file is saved over 1000 times within a short period. To avoid unnecessary copies, we could limit the number of saved versions. We give more weight to recent versions. Experimentation is helpful to figure out the optimal number of versions to save.
• Move infrequently used data to cold storage. Cold data is data that has not been active for months or years. Cold storage like Amazon S3 Glacier [11] is much cheaper than S3.
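
Two of these techniques are easy to sketch. The first part below shows block de-duplication by storing each block under its hash; SHA-256 and the class name `DedupBlockStore` are assumptions for illustration. The second part shows a version limit using a bounded deque, where the limit of 3 is an arbitrary example value.

```python
import hashlib
from collections import deque


class DedupBlockStore:
    """Stores each distinct block once, keyed by its content hash."""

    def __init__(self):
        self._blocks = {}

    def put(self, block: bytes) -> str:
        block_hash = hashlib.sha256(block).hexdigest()
        # setdefault keeps the existing copy if this hash is already stored.
        self._blocks.setdefault(block_hash, block)
        return block_hash

    def unique_blocks(self) -> int:
        return len(self._blocks)


store = DedupBlockStore()
hashes = [store.put(b) for b in (b"alpha", b"beta", b"alpha")]
print(store.unique_blocks())  # three puts, two distinct blocks stored

# Version limit: once the limit is reached, appending a new version
# silently drops the oldest one.
MAX_VERSIONS = 3  # example value
versions = deque(maxlen=MAX_VERSIONS)
for version_number in range(1, 6):
    versions.append(version_number)
print(list(versions))
```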

Failure handling


Failures can occur in a large-scale system and we must adopt design strategies to address these failures. Your interviewer might be interested in hearing about how you handle the following system failures:
• Load balancer failure: If a load balancer fails, the secondary becomes active and picks up the traffic. Load balancers usually monitor each other using a heartbeat, a periodic signal sent between load balancers. A load balancer is considered failed if it has not sent a heartbeat for some time.
• Block server failure: If a block server fails, other servers pick up unfinished or pending jobs.
• Cloud storage failure: S3 buckets are replicated multiple times in different regions. If files are not available in one region, they can be fetched from different regions.
• API server failure: It is a stateless service. If an API server fails, the traffic is redirected to other API servers by a load balancer.
• Metadata cache failure: Metadata cache servers are replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We will bring up a new cache server to replace the failed one.
• Metadata DB failure.
    • Master down: If the master is down, promote one of the slaves to act as a new master and bring up a new slave node.
    • Slave down: If a slave is down, you can use another slave for read operations and bring up another database server to replace the failed one.
• Notification service failure: Every online user keeps a long poll connection with the notification server. Thus, each notification server is connected with many users. According to the Dropbox talk in 2012 [6], over 1 million connections are open per machine. If a server goes down, all the long poll connections are lost so clients must reconnect to a different server. Even though one server can keep many open connections, it cannot reconnect all the lost connections at once. Reconnecting with all the lost clients is a relatively slow process. 
• Offline backup queue failure: Queues are replicated multiple times. If one queue fails, consumers of the queue may need to re-subscribe to the backup queue.
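
The heartbeat check mentioned for load balancers can be sketched as follows. The class name `HeartbeatMonitor`, the node names, and the 5-second timeout are illustrative assumptions. Timestamps are passed in explicitly so the logic is deterministic; a real monitor would read a clock.

```python
class HeartbeatMonitor:
    """Marks a node as failed if no heartbeat arrived within the timeout."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node name -> time of the last heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.record_heartbeat("lb-primary", now=0)
monitor.record_heartbeat("lb-secondary", now=8)
failed = monitor.failed_nodes(now=10)  # lb-primary has been silent for 10 s
print(failed)
```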

Step 4 - Wrap up


In this chapter, we proposed a system design to support Google Drive. The combination of strong consistency, low network bandwidth, and fast sync makes the design interesting. Our design contains two flows: manage file metadata and file sync. Notification service is another important component of the system. It uses long polling to keep clients up to date with file changes.

As with any system design interview question, there is no perfect solution. Every company has its unique constraints, and you must design a system to fit those constraints. Knowing the tradeoffs of your design and technology choices is important. If there are a few minutes left, you can talk about different design choices.

For example, we can upload files directly to cloud storage from the client instead of going through block servers. The advantage of this approach is that it makes file upload faster because a file only needs to be transferred once to the cloud storage. In our design, a file is transferred to block servers first, and then to the cloud storage. However, the new approach has a few drawbacks:

• First, the same chunking, compression, and encryption logic must be implemented on different platforms (iOS, Android, Web). It is error-prone and requires a lot of engineering effort. In our design, all that logic is implemented in a centralized place: block servers.
• Second, as a client can easily be hacked or manipulated, implementing encryption logic on the client side is not ideal.

Another interesting evolution of the system is moving online/offline logic to a separate service. Let us call it presence service. By moving presence service out of notification servers, online/offline functionality can easily be integrated by other services.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference materials


[1] Google Drive: https://www.google.com/drive/
[2] Upload file data: https://developers.google.com/drive/api/v2/manage-uploads
[3] Amazon S3: https://aws.amazon.com/s3
[4] Differential Synchronization: https://neil.fraser.name/writing/sync/
[5] Differential Synchronization YouTube talk: https://www.youtube.com/watch?v=S2Hp_1jqpY8
[6] How We’ve Scaled Dropbox: https://youtu.be/PE4gwstWhmc
[7] Tridgell, A., & Mackerras, P. (1996). The rsync algorithm.
[8] Librsync. (n.d.). Retrieved April 18, 2015, from https://github.com/librsync/librsync
[9] ACID: https://en.wikipedia.org/wiki/ACID
[10] Dropbox security white paper: https://www.dropbox.com/static/business/resources/Security_Whitepaper.pdf
[11] Amazon S3 Glacier: https://aws.amazon.com/glacier/faqs/