2023-05-25

Contents

CHAPTER 6: DESIGN A KEY-VALUE STORE
Understand the problem and establish design scope
Single server key-value store
Distributed key-value store
CAP theorem
Ideal situation
Real-world distributed systems
System components
Data partition
Data replication
Consistency
Consistency models
Inconsistency resolution: versioning
Handling failures
Failure detection
Handling temporary failures
Handling permanent failures
Handling data center outage
System architecture diagram
Write path
Read path
Summary
Reference materials

CHAPTER 6: DESIGN A KEY-VALUE STORE

A key-value store, also referred to as a key-value database, is a non-relational database. Each unique identifier is stored as a key with its associated value. This data pairing is known as a “key-value” pair. In a key-value pair, the key must be unique, and the value associated with the key can be accessed through the key. Keys can be plain text or hashed values. For performance reasons, a short key works better. What do keys look like? Here are a few examples:
• Plain text key: “last_logged_in_at”
• Hashed key: 253DDEC4
The value in a key-value pair can be strings, lists, objects, etc. The value is usually treated as an opaque object in key-value stores, such as Amazon Dynamo [1], Memcached [2], Redis [3], etc. Here is a data snippet in a key-value store:

(Figure: example key-value pairs stored in a key-value store)

In this chapter, you are asked to design a key-value store that supports the following operations:
- put(key, value) // insert “value” associated with “key”
- get(key) // get “value” associated with “key”

Understand the problem and establish design scope

There is no perfect design. Each design achieves a specific balance regarding the tradeoffs of read, write, and memory usage. Another tradeoff has to be made between consistency and availability. In this chapter, we design a key-value store with the following characteristics:
• The size of a key-value pair is small: less than 10 KB.
• Ability to store big data.
• High availability: the system responds quickly, even during failures.
• High scalability: the system can be scaled to support a large data set.
• Automatic scaling: the addition/deletion of servers should be automatic, based on traffic.
• Tunable consistency.
• Low latency.

Single server key-value store

Developing a key-value store that resides in a single server is easy. An intuitive approach is to store key-value pairs in a hash table, which keeps everything in memory. Even though memory access is fast, fitting everything in memory may be impossible due to the space constraint. Two optimizations can be done to fit more data in a single server:
• Data compression
• Store only frequently used data in memory and the rest on disk
Even with these optimizations, a single server can reach its capacity very quickly. A distributed key-value store is required to support big data.
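To make the idea concrete, here is a minimal sketch (in Python, with illustrative names that are not from the chapter) of a single-server store backed by a hash table, using zlib compression of values as one example of the first optimization:

```python
import zlib
from typing import Optional

class SingleServerKVStore:
    """In-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._table = {}  # key -> compressed value

    def put(self, key: str, value: bytes) -> None:
        # Optimization 1: compress values to fit more data in memory.
        self._table[key] = zlib.compress(value)

    def get(self, key: str) -> Optional[bytes]:
        blob = self._table.get(key)
        return zlib.decompress(blob) if blob is not None else None

store = SingleServerKVStore()
store.put("last_logged_in_at", b"2023-05-25T10:14:27Z")
print(store.get("last_logged_in_at"))
```

The second optimization, keeping only hot data in memory and the rest on disk, is essentially what the SSTable-based write and read paths later in the chapter formalize.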

Distributed key-value store

A distributed key-value store is also called a distributed hash table, which distributes key-value pairs across many servers. When designing a distributed system, it is important to understand the CAP (Consistency, Availability, Partition Tolerance) theorem.

CAP theorem

The CAP theorem states that it is impossible for a distributed system to simultaneously provide more than two of these three guarantees: consistency, availability, and partition tolerance. Let us establish a few definitions.
Consistency: consistency means all clients see the same data at the same time, no matter which node they connect to.
Availability: availability means any client that requests data gets a response, even if some of the nodes are down.
Partition tolerance: a partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.
The CAP theorem states that one of the three properties must be sacrificed to support the other two, as shown in Figure 6-1.

(Figure 6-1: CAP theorem)

Nowadays, key-value stores are classified based on the two CAP characteristics they support:
CP (consistency and partition tolerance) systems: a CP key-value store supports consistency and partition tolerance while sacrificing availability.
AP (availability and partition tolerance) systems: an AP key-value store supports availability and partition tolerance while sacrificing consistency.
CA (consistency and availability) systems: a CA key-value store supports consistency and availability while sacrificing partition tolerance. Since network failure is unavoidable, a distributed system must tolerate network partitions. Thus, a CA system cannot exist in real-world applications.
What you read above is mostly the definition part. To make it easier to understand, let us take a look at some concrete examples. In distributed systems, data is usually replicated multiple times. Assume data is replicated on three replica nodes, n1, n2, and n3, as shown in Figure 6-2.

Ideal situation

In the ideal world, network partition never occurs. Data written to n1 is automatically replicated to n2 and n3. Both consistency and availability are achieved.

(Figure 6-2: data replicated to n1, n2, and n3 with no network partition)

Real-world distributed systems

In a distributed system, partitions cannot be avoided, and when a partition occurs, we must choose between consistency and availability. In Figure 6-3, n3 goes down and cannot communicate with n1 and n2. If clients write data to n1 or n2, data cannot be propagated to n3. If data is written to n3 but not yet propagated to n1 and n2, n1 and n2 would have stale data.
If we choose consistency over availability (CP system), we must block all write operations to n1 and n2 to avoid data inconsistency among these three servers, which makes the system unavailable. Bank systems usually have extremely high consistency requirements. For example, it is crucial for a bank system to display the most up-to-date balance info. If inconsistency occurs due to a network partition, the bank system returns an error before the inconsistency is resolved.
However, if we choose availability over consistency (AP system), the system keeps accepting reads, even though it might return stale data. For writes, n1 and n2 will keep accepting writes, and data will be synced to n3 when the network partition is resolved. Choosing the right CAP guarantees that fit your use case is an important step in building a distributed key-value store. You can discuss this with your interviewer and design the system accordingly.

System components

In this section, we will discuss the following core components and techniques used to build a key-value store:
• Data partition
• Data replication
• Consistency
• Inconsistency resolution
• Handling failures
• System architecture diagram
• Write path
• Read path
The content below is largely based on three popular key-value store systems: Dynamo [4], Cassandra [5], and BigTable [6].

Data partition

For large applications, it is infeasible to fit the complete data set in a single server. The simplest way to accomplish this is to split the data into smaller partitions and store them on multiple servers. There are two challenges while partitioning the data:
• Distribute data across multiple servers evenly.
• Minimize data movement when nodes are added or removed.
Consistent hashing, discussed in Chapter 5, is a great technique to solve these problems. Let us revisit how consistent hashing works at a high level.
• First, servers are placed on a hash ring. In Figure 6-4, eight servers, represented by s0, s1, …, s7, are placed on the hash ring.
• Next, a key is hashed onto the same ring, and it is stored on the first server encountered while moving in the clockwise direction. For instance, key0 is stored in s1 using this logic.

(Figure 6-4: eight servers placed on a hash ring)
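Here is a small sketch of the ring lookup just described, assuming MD5 as the uniform hash function (the chapter does not prescribe a specific hash); server and key names follow Figure 6-4:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a string to a position on the ring (MD5 is just one uniform choice).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Servers sit on a hash ring; a key is stored on the first server found
    while moving clockwise from the key's position."""

    def __init__(self, servers):
        self._ring = sorted((ring_hash(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        i = bisect.bisect_right(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing([f"s{i}" for i in range(8)])  # s0 ... s7, as in Figure 6-4
print(ring.lookup("key0"))                    # the first server clockwise from key0
```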

Using consistent hashing to partition data has the following advantages:
Automatic scaling: servers can be added and removed automatically depending on the load.
Heterogeneity: the number of virtual nodes for a server is proportional to the server capacity. For example, servers with higher capacity are assigned more virtual nodes.

Data replication

To achieve high availability and reliability, data must be replicated asynchronously over N servers, where N is a configurable parameter. These N servers are chosen using the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers on the ring to store data copies. In Figure 6-5 (N = 3), key0 is replicated at s1, s2, and s3.

(Figure 6-5: key0 replicated at s1, s2, and s3 with N = 3)

With virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this issue, we only choose unique servers while performing the clockwise walk logic. Nodes in the same data center often fail at the same time due to power outages, network issues, natural disasters, etc. For better reliability, replicas are placed in distinct data centers, and data centers are connected through high-speed networks.
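Below is a sketch of the replica-selection walk just described: starting clockwise from the key's position, collect the first N distinct physical servers, skipping additional virtual nodes of a server that has already been chosen. The data layout (a sorted list of (position, server) pairs) is an assumption for illustration:

```python
import bisect

def pick_replicas(ring, key_pos, n):
    """ring: sorted (position, physical_server) pairs, possibly with several
    virtual nodes per physical server. Walk clockwise from key_pos and return
    the first n distinct physical servers (Figure 6-5 uses n = 3)."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_pos) % len(ring)
    replicas = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in replicas:   # skip extra virtual nodes of a chosen server
            replicas.append(server)
            if len(replicas) == n:
                break
    return replicas
```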

Consistency

Since data is replicated at multiple nodes, it must be synchronized across replicas. Quorum consensus can guarantee consistency for both read and write operations. Let us establish a few definitions first.
N = the number of replicas.
W = a write quorum of size W. For a write operation to be considered successful, the write must be acknowledged by W replicas.
R = a read quorum of size R. For a read operation to be considered successful, the read must wait for responses from at least R replicas.
Consider the example shown in Figure 6-6 with N = 3.

(Figure 6-6: quorum consensus with N = 3)

W = 1 does not mean data is written on only one server. For instance, with the configuration in Figure 6-6, data is replicated at s0, s1, and s2. W = 1 means that the coordinator must receive at least one acknowledgment before the write operation is considered successful. For instance, if we get an acknowledgment from s1, we no longer need to wait for acknowledgments from s0 and s2. A coordinator acts as a proxy between the client and the nodes.
The configuration of W, R, and N is a typical tradeoff between latency and consistency. If W = 1 or R = 1, an operation returns quickly because the coordinator only needs to wait for a response from any one of the replicas. If W or R > 1, the system offers better consistency; however, the query will be slower because the coordinator must wait for the response from the slowest replica.
If W + R > N, strong consistency is guaranteed because there must be at least one overlapping node that has the latest data. How do we configure N, W, and R to fit our use cases? Here are some of the possible setups:
If R = 1 and W = N, the system is optimized for fast reads.
If W = 1 and R = N, the system is optimized for fast writes.
If W + R > N, strong consistency is guaranteed (usually N = 3, W = R = 2).
If W + R <= N, strong consistency is not guaranteed.
Depending on the requirement, we can tune the values of W, R, and N to achieve the desired level of consistency.
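The following toy sketch (in-process objects standing in for networked replicas; all names are illustrative) shows how a coordinator applies the W and R thresholds:

```python
class Replica:
    """Toy in-process replica; a stand-in for a real networked node."""
    def __init__(self, name):
        self.name, self.data = name, {}
    def put(self, key, version, value):
        self.data[key] = (version, value)
        return True                        # acknowledgment back to the coordinator
    def get(self, key):
        return self.data.get(key)          # (version, value) or None

def quorum_write(replicas, key, version, value, w):
    # The coordinator forwards the write to all N replicas and declares success
    # once W of them have acknowledged it.
    acks = sum(1 for r in replicas if r.put(key, version, value))
    return acks >= w

def quorum_read(replicas, key, r):
    # Wait for R responses, then return the value carrying the newest version.
    responses = [resp for resp in (rep.get(key) for rep in replicas) if resp][:r]
    return max(responses)[1] if responses else None

nodes = [Replica(f"s{i}") for i in range(3)]   # N = 3
quorum_write(nodes, "name", 1, "john", w=2)    # W = 2
print(quorum_read(nodes, "name", r=2))         # R = 2, so W + R > N
```

With the common setup above (N = 3, W = R = 2), W + R > N, so the read quorum always overlaps the write quorum on at least one replica holding the latest version.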

Consistency models

A consistency model is another important factor to consider when designing a key-value store. A consistency model defines the degree of data consistency, and a wide spectrum of possible consistency models exists:
• Strong consistency: any read operation returns a value corresponding to the result of the most updated write data item. A client never sees out-of-date data.
• Weak consistency: subsequent read operations may not see the most updated value.
• Eventual consistency: this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas are consistent.
Strong consistency is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on the current write. This approach is not ideal for highly available systems because it could block new operations. Dynamo and Cassandra adopt eventual consistency, which is our recommended consistency model for our key-value store. With concurrent writes, eventual consistency allows inconsistent values to enter the system and forces the client to read the values to reconcile them. The next section explains how reconciliation works with versioning.

Inconsistency resolution: versioning

Replication gives high availability but causes inconsistencies among replicas. Versioning and vector clocks are used to solve inconsistency problems. Versioning means treating each data modification as a new immutable version of data. Before we talk about versioning, let us use an example to explain how inconsistency happens: as shown in Figure 6-7, both replica nodes n1 and n2 have the same value. Let us call this value the original value. Server 1 and server 2 get the same value from the get(“name”) operation.

(Figure 6-7: n1 and n2 hold the same original value)

Next, server 1 changes the name to “johnSanFrancisco”, and server 2 changes the name to “johnNewYork”, as shown in Figure 6-8. These two changes are performed simultaneously. Now, we have conflicting values, called versions v1 and v2.

(Figure 6-8: concurrent writes produce conflicting versions v1 and v2)

In this example, the original value can be ignored because the modifications were based on it. However, there is no clear way to resolve the conflict between the last two versions. To resolve this issue, we need a versioning system that can detect conflicts and reconcile them. A vector clock is a common technique to solve this problem. Let us examine how vector clocks work. A vector clock is a [server, version] pair associated with a data item. It can be used to check if one version precedes, succeeds, or is in conflict with others.
Assume a vector clock is represented by D([S1, v1], [S2, v2], …, [Sn, vn]), where D is a data item, v1 is a version counter, and S1 is a server number, etc. If data item D is written to server Si, the system must perform one of the following tasks:
• Increment vi if [Si, vi] exists.
• Otherwise, create a new entry [Si, 1].
The above abstract logic is explained with a concrete example, as shown in Figure 6-9.

(Figure 6-9: vector clock example)

1. A client writes data item D1 to the system, and the write is handled by server Sx, which now has the vector clock D1([Sx, 1]).
2. Another client reads the latest D1, updates it to D2, and writes it back. D2 descends from D1, so it overwrites D1. Assume the write is handled by the same server Sx, which now has vector clock D2([Sx, 2]).
3. Another client reads the latest D2, updates it to D3, and writes it back. Assume the write is handled by server Sy, which now has vector clock D3([Sx, 2], [Sy, 1]).
4. Another client reads the latest D2, updates it to D4, and writes it back. Assume the write is handled by server Sz, which now has D4([Sx, 2], [Sz, 1]).
5. When another client reads D3 and D4, it discovers a conflict, which is caused by data item D2 being modified by both Sy and Sz. The conflict is resolved by the client, and the updated data is sent to the server. Assume the write is handled by Sx, which now has D5([Sx, 3], [Sy, 1], [Sz, 1]). We will explain how to detect conflicts shortly.
Using vector clocks, it is easy to tell that a version X is an ancestor (i.e., no conflict) of version Y if the version counter for each participant in the vector clock of Y is greater than or equal to the one in version X. For example, the vector clock D([s0, 1], [s1, 1]) is an ancestor of D([s0, 1], [s1, 2]). Therefore, no conflict is recorded. Similarly, you can tell that a version X is a sibling (i.e., a conflict exists) of Y if there is any participant in Y's vector clock who has a counter that is less than its corresponding counter in X. For example, the following two vector clocks indicate there is a conflict: D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]).
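Here is a compact sketch of the vector-clock bookkeeping and conflict check described above, representing D([S1, v1], …) as a server-to-counter map (class and function names are illustrative):

```python
class VectorClock:
    """Vector clock as a {server: version_counter} map, i.e. D([S1, v1], ...)."""

    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, server):
        # A write handled by server Si: bump vi, or create the entry [Si, 1].
        self.counters[server] = self.counters.get(server, 0) + 1

    def descends_from(self, other):
        # `other` is an ancestor of `self` if, for every server in `other`,
        # self's counter is greater than or equal to other's counter.
        return all(self.counters.get(s, 0) >= v for s, v in other.counters.items())

def in_conflict(x, y):
    # Siblings: neither version descends from the other.
    return not x.descends_from(y) and not y.descends_from(x)

d1 = VectorClock({"Sx": 1})
d3 = VectorClock({"Sx": 2, "Sy": 1})   # D3 from Figure 6-9
d4 = VectorClock({"Sx": 2, "Sz": 1})   # D4 from Figure 6-9
print(d3.descends_from(d1))            # True: D1 is an ancestor, no conflict
print(in_conflict(d3, d4))             # True: concurrent writes must be reconciled
```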
Even though vector clocks can resolve conflicts, there are two notable downsides. First, vector clocks add complexity to the client because it needs to implement conflict resolution logic. Second, the [server: version] pairs in the vector clock could grow rapidly. To fix this problem, we set a threshold for the length, and if it exceeds the limit, the oldest pairs are removed. This can lead to inefficiencies in reconciliation because the descendant relationship cannot be determined accurately. However, based on the Dynamo paper [4], Amazon has not yet encountered this problem in production; therefore, it is probably an acceptable solution for most companies.

Handling failures

As with any large system at scale, failures are not only inevitable but common. Handling failure scenarios is very important. In this section, we first introduce techniques to detect failures. Then, we go over common failure resolution strategies.

Failure detection

In a distributed system, it is insufficient to believe that a server is down because another server says so. Usually, it requires at least two independent sources of information to mark a server down. As shown in Figure 6-10, all-to-all multicasting is a straightforward solution. However, this is inefficient when many servers are in the system.

(Figure 6-10: all-to-all multicasting)

A better solution is to use decentralized failure detection methods like the gossip protocol. The gossip protocol works as follows:
• Each node maintains a node membership list, which contains member IDs and heartbeat counters.
• Each node periodically increments its heartbeat counter.
• Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.
• Once nodes receive heartbeats, the membership list is updated to the latest info.
• If the heartbeat has not increased for more than a predefined period, the member is considered offline.

(Figure 6-11: gossip protocol and node membership list)

As shown in Figure 6-11:
• Node s0 maintains the node membership list shown on the left side.
• Node s0 notices that node s2’s (member ID = 2) heartbeat counter has not increased for a long time.
• Node s0 sends heartbeats that include s2’s info to a set of random nodes. Once other nodes confirm that s2’s heartbeat counter has not been updated for a long time, node s2 is marked down, and this information is propagated to other nodes.
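A minimal gossip-style failure detector following the steps above; the timeout value, fanout, and class names are illustrative assumptions:

```python
import random
import time

class GossipNode:
    """Each node keeps {member_id: [heartbeat_counter, last_update_time]} and
    periodically merges its list with a few random peers."""

    OFFLINE_AFTER = 10.0   # seconds without a heartbeat increase (illustrative)

    def __init__(self, node_id, member_ids):
        self.node_id = node_id
        self.members = {m: [0, time.time()] for m in member_ids}

    def tick(self):
        # Periodically increment our own heartbeat counter.
        entry = self.members[self.node_id]
        entry[0] += 1
        entry[1] = time.time()

    def gossip_to(self, peers, fanout=2):
        # Send our membership list to a small random set of nodes.
        for peer in random.sample(peers, min(fanout, len(peers))):
            peer.receive(self.members)

    def receive(self, remote_members):
        # Merge: keep whichever entry carries the higher heartbeat counter.
        for member, (heartbeat, _) in remote_members.items():
            entry = self.members.setdefault(member, [0, time.time()])
            if heartbeat > entry[0]:
                entry[0], entry[1] = heartbeat, time.time()

    def offline_members(self):
        now = time.time()
        return [m for m, (_, last_seen) in self.members.items()
                if now - last_seen > self.OFFLINE_AFTER]
```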

Handling temporary failures

After failures have been detected through the gossip protocol, the system needs to deploy certain mechanisms to ensure availability. In the strict quorum approach, read and write operations could be blocked, as illustrated in the quorum consensus section. A technique called “sloppy quorum” [4] is used to improve availability. Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. Offline servers are ignored.
If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server is up, changes will be pushed back to it to achieve data consistency. This process is called hinted handoff. Since s2 is unavailable in Figure 6-12, reads and writes will be handled by s3 temporarily. When s2 comes back online, s3 will hand the data back to s2.

(Figure 6-12: s3 temporarily handles requests for the unavailable s2)
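Here is a rough sketch of sloppy-quorum writes with hinted handoff, under simplifying assumptions: helpers such as is_healthy and the shared hints buffer are illustrative, and in a real system each stand-in server keeps its own hints locally rather than in one shared map:

```python
def sloppy_quorum_write(ring_servers, key, value, w, is_healthy, hints):
    """ring_servers is the clockwise walk from the key's position; the first W
    entries are the 'home' replicas. An offline home is replaced by the next
    healthy server further along the ring, and a hint records the handoff."""
    homes = ring_servers[:w]
    spares = [s for s in ring_servers[w:] if is_healthy(s)]
    for home in homes:
        if is_healthy(home):
            home.put(key, value)
        elif spares:
            stand_in = spares.pop(0)        # next healthy server on the ring
            stand_in.put(key, value)
            # Hint: this data really belongs on `home`; push it back on recovery.
            hints.setdefault(home, []).append((stand_in, key, value))

def handoff_on_recovery(recovered, hints):
    # When the offline server comes back, hand the hinted data back to it.
    for _stand_in, key, value in hints.pop(recovered, []):
        recovered.put(key, value)
```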

Handling permanent failures

Hinted handoff is used to handle temporary failures. What if a replica is permanently unavailable? To handle such a situation, we implement an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing each piece of data on the replicas and updating each replica to the newest version. A Merkle tree is used for inconsistency detection and for minimizing the amount of data transferred. Quoted from Wikipedia [7]: “A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the hash of the labels or values (in case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures”.
Assuming the key space is from 1 to 12, the following steps show how to build a Merkle tree. Highlighted boxes indicate inconsistency.
Step 1: Divide the key space into buckets (4 in our example), as shown in Figure 6-13. A bucket is used as the root-level node to maintain a limited depth of the tree.

(Figure 6-13: key space divided into four buckets)

Step 2: Once the buckets are created, hash each key in a bucket using a uniform hashing method (Figure 6-14).

(Figure 6-14: keys hashed within each bucket)

Step 3: Create a single hash node per bucket (Figure 6-15).

(Figure 6-15: one hash node per bucket)

Step 4: Build the tree upwards to the root by calculating the hashes of children (Figure 6-16).

(Figure 6-16: tree built upwards to the root)

To compare two Merkle trees, start by comparing the root hashes. If the root hashes match, both servers have the same data. If the root hashes disagree, then the left child hashes are compared, followed by the right child hashes. You can traverse the tree to find which buckets are not synchronized and synchronize those buckets only. Using Merkle trees, the amount of data that needs to be synchronized is proportional to the differences between the two replicas, not the amount of data they contain. In real-world systems, the bucket size is quite big. For instance, a possible configuration is one million buckets per one billion keys, so each bucket only contains 1000 keys.
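Below is a simplified Merkle-tree sketch following steps 1 to 4. Two simplifications are worth flagging: keys are assigned to buckets by modulo here (the chapter uses key ranges; any scheme works as long as both replicas bucket identically), and the diff jumps from the root straight to the bucket level instead of descending child by child:

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def bucket_hashes(kv_pairs, num_buckets):
    # Steps 1-3: assign keys to buckets and hash each bucket's contents.
    buckets = [[] for _ in range(num_buckets)]
    for key, value in sorted(kv_pairs.items()):
        buckets[key % num_buckets].append(f"{key}:{value}")
    return [h("".join(bucket)) for bucket in buckets]

def build_merkle(level):
    # Step 4: hash pairs of children upwards until a single root remains.
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else ""))
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                 # tree[0] = bucket hashes, tree[-1][0] = root

def out_of_sync_buckets(tree_a, tree_b):
    if tree_a[-1][0] == tree_b[-1][0]:
        return []               # root hashes match: replicas hold the same data
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica1 = {k: f"v{k}" for k in range(1, 13)}      # key space 1..12, 4 buckets
replica2 = {**replica1, 7: "stale"}                # one divergent key
t1 = build_merkle(bucket_hashes(replica1, 4))
t2 = build_merkle(bucket_hashes(replica2, 4))
print(out_of_sync_buckets(t1, t2))                 # only these buckets get synced
```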

Handling data center outage

A data center outage could happen due to a power outage, network outage, natural disaster, etc. To build a system capable of handling a data center outage, it is important to replicate data across multiple data centers. Even if a data center is completely offline, users can still access data through the other data centers.

System architecture diagram

Now that we have discussed different technical considerations in designing a key-value store, we can shift our focus to the architecture diagram, shown in Figure 6-17.

(Figure 6-17: system architecture diagram)

Main features of the architecture are listed as follows:
• Clients communicate with the key-value store through simple APIs: get(key) and put(key, value).
• A coordinator is a node that acts as a proxy between the client and the key-value store.
• Nodes are distributed on a ring using consistent hashing.
• The system is completely decentralized, so adding and moving nodes can be automatic.
• Data is replicated at multiple nodes.
• There is no single point of failure, as every node has the same set of responsibilities.
As the design is decentralized, each node performs many tasks, as presented in Figure 6-18.

(Figure 6-18: tasks performed by each node)

Write path

Figure 6-19 explains what happens after a write request is directed to a specific node. Please note that the proposed designs for the write/read paths are primarily based on the architecture of Cassandra [8].

(Figure 6-19: write path)

1. The write request is persisted in a commit log file.
2. Data is saved in the memory cache.
3. When the memory cache is full or reaches a predefined threshold, data is flushed to an SSTable [9] on disk.
Note: a sorted-string table (SSTable) is a sorted list of <key, value> pairs. For readers interested in learning more about SSTables, refer to the reference material [9].
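A sketch of the three write-path steps, with in-memory lists standing in for the commit log file and the on-disk SSTables (class name and flush threshold are illustrative):

```python
class WritePath:
    """Write path sketch: append to a commit log, update the memory cache, and
    flush to a sorted SSTable when the cache reaches a threshold."""

    def __init__(self, flush_threshold=4):
        self.commit_log = []          # stands in for an append-only log file
        self.memtable = {}            # in-memory cache of recent writes
        self.sstables = []            # each flush produces one immutable SSTable
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to the commit log
        self.memtable[key] = value             # 2. save in the memory cache
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                      # 3. flush once the threshold is hit

    def _flush(self):
        # An SSTable is a sorted list of <key, value> pairs.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
```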

Read path

After a read request is directed to a specific node, it first checks if the data is in the memory cache. If so, the data is returned to the client, as shown in Figure 6-20.

(Figure 6-20: read path when data is in the memory cache)

If the data is not in memory, it will be retrieved from disk instead. We need an efficient way to find out which SSTable contains the key. A Bloom filter [10] is commonly used to solve this problem. The read path when data is not in memory is shown in Figure 6-21.
1. The system first checks if the data is in memory. If not, go to step 2.
2. If the data is not in memory, the system checks the Bloom filter.
3. The Bloom filter is used to figure out which SSTables might contain the key.
4. The SSTables return the result of the data set.
5. The result of the data set is returned to the client.
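A sketch of this read path, pairing each flushed SSTable with a small Bloom filter so that only candidate tables are searched (the Bloom filter here is a toy, and all names are illustrative):

```python
import bisect
import hashlib

class BloomFilter:
    """Toy Bloom filter: may answer 'maybe present', never misses a real member."""
    def __init__(self, size=1024, num_hashes=3):
        self.bits, self.size, self.num_hashes = bytearray(size), size, num_hashes
    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size
    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1
    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def read(key, memtable, sstables):
    """sstables: list of (bloom_filter, sorted list of (key, value)) pairs,
    newest last, as a flush like the write-path sketch above would produce."""
    if key in memtable:                          # 1. data in memory: return it
        return memtable[key]
    for bloom, table in reversed(sstables):      # check the newest SSTable first
        if not bloom.might_contain(key):         # 2-3. Bloom filter rules it out
            continue
        keys = [k for k, _ in table]
        i = bisect.bisect_left(keys, key)        # 4. binary search the sorted table
        if i < len(table) and table[i][0] == key:
            return table[i][1]                   # 5. return the result to the client
    return None
```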

Summary

This chapter covers many concepts and techniques. To refresh your memory, the following table summarizes the features of a distributed key-value store and the corresponding techniques used.

(Table: summary of features and the corresponding techniques)

Reference materials

[1] Amazon DynamoDB: https://aws.amazon.com/dynamodb/
[2] Memcached: https://memcached.org/
[3] Redis: https://redis.io/
[4] Dynamo: Amazon’s Highly Available Key-value Store: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[5] Cassandra: https://cassandra.apache.org/
[6] Bigtable: A Distributed Storage System for Structured Data: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtableosdi06.pdf
[7] Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree
[8] Cassandra architecture: https://cassandra.apache.org/doc/latest/architecture/
[9] SSTable: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
[10] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
