CHAPTER 14: DESIGN YOUTUBE


In this chapter, you are asked to design YouTube. The solution to this question can be applied to other interview questions like designing a video sharing platform such as Netflix and Hulu. Figure 14-1 shows the YouTube homepage.



YouTube looks simple: content creators upload videos and viewers click play. Is it really that simple? Not really. There are lots of complex technologies underneath the simplicity. Let us look at some impressive statistics, demographics, and fun facts of YouTube in 2020 [1] [2].
    • Total number of monthly active users: 2 billion.
    • Number of videos watched per day: 5 billion.
    • 73% of US adults use YouTube.
    • 50 million creators on YouTube.
    • YouTube’s Ad revenue was $15.1 billion for the full year 2019, up 36% from 2018.
    • YouTube is responsible for 37% of all mobile internet traffic.
    • YouTube is available in 80 different languages.
    


From these statistics, we know YouTube is enormous, global and makes a lot of money.

Step 1 - Understand the problem and establish design scope


As revealed in Figure 14-1, besides watching a video, you can do a lot more on YouTube. For example, comment, share, or like a video, save a video to playlists, subscribe to a channel, etc. It is impossible to design everything within a 45- or 60-minute interview. Thus, it is important to ask questions to narrow down the scope.



Candidate: What features are important?
Interviewer: Ability to upload a video and watch a video.
Candidate: What clients do we need to support?
Interviewer: Mobile apps, web browsers, and smart TV.
Candidate: How many daily active users do we have?
Interviewer: 5 million.
Candidate: What is the average daily time spent on the product?
Interviewer: 30 minutes.
Candidate: Do we need to support international users?
Interviewer: Yes, a large percentage of users are international users.
Candidate: What are the supported video resolutions?
Interviewer: The system accepts most of the video resolutions and formats.
Candidate: Is encryption required?
Interviewer: Yes.
Candidate: Any file size requirement for videos?
Interviewer: Our platform focuses on small and medium-sized videos. The maximum allowed video size is 1GB.
Candidate: Can we leverage some of the existing cloud infrastructures provided by Amazon, Google, or Microsoft?
Interviewer: That is a great question. Building everything from scratch is unrealistic for most companies, so it is recommended to leverage some of the existing cloud services.

In this chapter, we focus on designing a video streaming service with the following features:
    • Ability to upload videos fast
    • Smooth video streaming
    • Ability to change video quality
    • Low infrastructure cost
    • High availability, scalability, and reliability requirements
    • Clients supported: mobile apps, web browser, and smart TV


Back of the envelope estimation


The following estimations are based on many assumptions, so it is important to communicate with the interviewer to make sure she is on the same page.
    • Assume the product has 5 million daily active users (DAU).
    • Users watch 5 videos per day.
    • 10% of users upload 1 video per day.
    • Assume the average video size is 300 MB.
    • Total daily storage space needed: 5 million * 10% * 300 MB = 150TB
    • CDN cost.
        • When cloud CDN serves a video, you are charged for data transferred out of the CDN.
        • Let us use Amazon’s CDN CloudFront for cost estimation (Figure 14-2) [3]. Assume 100% of traffic is served from the United States. The average cost per GB is $0.02. For simplicity, we only calculate the cost of video streaming.
        • 5 million * 5 videos * 0.3GB * $0.02 = $150,000 per day.
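
The arithmetic above can be reproduced with a short script. This is only a sketch of the estimation: the constant and function names are made up for illustration, decimal units are assumed (1 TB = 1,000,000 MB), and every input is one of the assumptions listed above rather than a measured value.

```python
# Back-of-the-envelope inputs, copied from the assumptions above.
DAU = 5_000_000                 # daily active users
VIDEOS_WATCHED_PER_USER = 5     # videos watched per user per day
UPLOADER_PERCENT = 10           # percent of users uploading 1 video per day
AVG_VIDEO_SIZE_MB = 300         # average video size
CDN_COST_PER_GB_USD = 0.02      # assumed average CDN transfer price


def daily_storage_tb():
    """Storage needed per day for newly uploaded videos, in TB (decimal units)."""
    uploads_per_day = DAU * UPLOADER_PERCENT // 100
    return uploads_per_day * AVG_VIDEO_SIZE_MB / 1_000_000


def daily_cdn_cost_usd():
    """CDN transfer cost per day, assuming every view streams a full video."""
    gb_streamed = DAU * VIDEOS_WATCHED_PER_USER * AVG_VIDEO_SIZE_MB / 1000
    return gb_streamed * CDN_COST_PER_GB_USD


print(f"storage per day: {daily_storage_tb():.0f} TB")
print(f"CDN cost per day: ${daily_cdn_cost_usd():,.0f}")
```

Changing a single input, such as the average video size, shows how sensitive the CDN bill is to each assumption.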



Even though cloud providers are willing to lower the CDN costs significantly for big customers, the cost is still substantial. We will discuss ways to reduce CDN costs in the deep dive.


Step 2 - Propose high-level design and get buy-in


As discussed previously, the interviewer recommended leveraging existing cloud services instead of building everything from scratch. CDN and blob storage are the cloud services we will leverage. Some readers might ask: why not build everything ourselves? The reasons are listed below:

• System design interviews are not about building everything from scratch. Within the limited time frame, choosing the right technology to do a job right is more important than explaining how the technology works in detail. For instance, mentioning blob storage for storing source videos is enough for the interview. Talking about the detailed design for blob storage could be overkill.
• Building scalable blob storage or CDN is extremely complex and costly. Even large companies like Netflix or Facebook do not build everything themselves. Netflix leverages Amazon’s cloud services [4], and Facebook uses Akamai’s CDN [5].

At a high level, the system comprises three components (Figure 14-3).



Client: You can watch YouTube on your computer, mobile phone, and smart TV.

CDN: Videos are stored in CDN. When you press play, a video is streamed from the CDN.

API servers: Everything else except video streaming goes through API servers. This includes feed recommendation, generating video upload URL, updating metadata database and cache, user signup, etc. 

In the question/answer session, the interviewer showed interest in two flows:
• Video uploading flow
• Video streaming flow
We will explore the high-level design for each of them.


Video uploading flow


Figure 14-4 shows the high-level design for video uploading.


It consists of the following components:
• User: A user watches YouTube on devices such as a computer, mobile phone, or smart TV.
• Load balancer: A load balancer evenly distributes requests among API servers.
• API servers: All user requests go through API servers except video streaming.
• Metadata DB: Video metadata are stored in Metadata DB. It is sharded and replicated to meet performance and high availability requirements.
• Metadata cache: For better performance, video metadata and user objects are cached.
• Original storage: A blob storage system is used to store original videos. Wikipedia defines a blob as follows: “A Binary Large Object (BLOB) is a collection of binary data stored as a single entity in a database management system” [6].
• Transcoding servers: Video transcoding is also called video encoding. It is the process of converting a video format to other formats (MPEG, HLS, etc.), which provide the best video streams possible for different devices and bandwidth capabilities.
• Transcoded storage: It is a blob storage that stores transcoded video files.
• CDN: Videos are cached in CDN. When you click the play button, a video is streamed from the CDN.
• Completion queue: It is a message queue that stores information about video transcoding completion events.
• Completion handler: This consists of a list of workers that pull event data from the completion queue and update metadata cache and database.




Now that we understand each component individually, let us examine how the video uploading flow works. The flow is broken down into two processes running in parallel.

a. Upload the actual video.
b. Update video metadata. Metadata contains information about video URL, size, resolution, format, user info, etc.


Flow a: upload the actual video


Figure 14-5 shows how to upload the actual video. The explanation is shown below:
1. Videos are uploaded to the original storage.
2. Transcoding servers fetch videos from the original storage and start transcoding.
3. Once transcoding is complete, the following two steps are executed in parallel:
    3a. Transcoded videos are sent to transcoded storage.
    3b. Transcoding completion events are queued in the completion queue.
3a.1. Transcoded videos are distributed to CDN.
3b.1. The completion handler contains a set of workers that continuously pull event data from the queue.
3b.1.a. and 3b.1.b. The completion handler updates the metadata database and cache when video transcoding is complete.
4. API servers inform the client that the video is successfully uploaded and is ready for streaming.


Flow b: update the metadata


While a file is being uploaded to the original storage, the client in parallel sends a request to update the video metadata as shown in Figure 14-6. The request contains video metadata, including file name, size, format, etc. API servers update the metadata cache and database.


Video streaming flow


Whenever you watch a video on YouTube, it usually starts streaming immediately and you do not wait until the whole video is downloaded. Downloading means the whole video is copied to your device, while streaming means your device continuously receives video streams from remote source videos. When you watch streaming videos, your client loads a little bit of data at a time so you can watch videos immediately and continuously.



Before we discuss video streaming flow, let us look at an important concept: streaming protocol. This is a standardized way to control data transfer for video streaming. Popular streaming protocols are:
• MPEG–DASH. MPEG stands for “Moving Picture Experts Group” and DASH stands for “Dynamic Adaptive Streaming over HTTP”.
• Apple HLS. HLS stands for “HTTP Live Streaming”.
• Microsoft Smooth Streaming.
• Adobe HTTP Dynamic Streaming (HDS).



You do not need to fully understand or even remember those streaming protocol names as they are low-level details that require specific domain knowledge. The important thing here is to understand that different streaming protocols support different video encodings and playback players. When we design a video streaming service, we have to choose the right streaming protocol to support our use cases. To learn more about streaming protocols, here is an excellent article [7].


Videos are streamed from CDN directly. The edge server closest to you will deliver the video. Thus, there is very little latency. Figure 14-7 shows the high-level design for video streaming.


Step 3 - Design deep dive


In the high-level design, the entire system is broken down into two parts: video uploading flow and video streaming flow. In this section, we will refine both flows with important optimizations and introduce error handling mechanisms.

Video transcoding


When you record a video, the device (usually a phone or camera) gives the video file a certain format. If you want the video to be played smoothly on other devices, the video must be encoded into compatible bitrates and formats. Bitrate is the rate at which bits are processed over time. A higher bitrate generally means higher video quality. High bitrate streams need more processing power and fast internet speed.



Video transcoding is important for the following reasons:
• Raw video consumes large amounts of storage space. An hour-long high definition video recorded at 60 frames per second can take up a few hundred GB of space.
• Many devices and browsers only support certain types of video formats. Thus, it is important to encode a video to different formats for compatibility reasons.
• To ensure users watch high-quality videos while maintaining smooth playback, it is a good idea to deliver higher resolution video to users who have high network bandwidth and lower resolution video to users who have low bandwidth.
• Network conditions can change, especially on mobile devices. To ensure a video is played continuously, switching video quality automatically or manually based on network conditions is essential for smooth user experience.



Many types of encoding formats are available; however, most of them contain two parts:
• Container: This is like a basket that contains the video file, audio, and metadata. You can tell the container format by the file extension, such as .avi, .mov, or .mp4.
• Codecs: These are compression and decompression algorithms that aim to reduce the video size while preserving the video quality. The most used video codecs are H.264, VP9, and HEVC.



Directed acyclic graph (DAG) model


Transcoding a video is computationally expensive and time-consuming. Besides, different content creators may have different video processing requirements. For instance, some content creators require watermarks on top of their videos, some provide thumbnail images themselves, and some upload high definition videos, whereas others do not.

To support different video processing pipelines and maintain high parallelism, it is important to add some level of abstraction and let client programmers define what tasks to execute. For example, Facebook’s streaming video engine uses a directed acyclic graph (DAG) programming model, which defines tasks in stages so they can be executed sequentially or in parallel [8]. In our design, we adopt a similar DAG model to achieve flexibility and parallelism. Figure 14-8 represents a DAG for video transcoding.



In Figure 14-8, the original video is split into video, audio, and metadata. Here are some of the tasks that can be applied on a video file:
• Inspection: Make sure videos have good quality and are not malformed.
• Video encodings: Videos are converted to support different resolutions, codecs, bitrates, etc. Figure 14-9 shows an example of video encoded files.
• Thumbnail: Thumbnails can either be uploaded by a user or automatically generated by the system.
• Watermark: An image overlay on top of your video that contains identifying information about your video.


Video transcoding architecture


The proposed video transcoding architecture, which leverages cloud services, is shown in Figure 14-10.


The architecture has six main components: preprocessor, DAG scheduler, resource manager, task workers, temporary storage, and encoded video as the output. Let us take a close look at each component.


Preprocessor


The preprocessor has 4 responsibilities:
1. Video splitting. The video stream is split, or further split, into smaller chunks aligned to Groups of Pictures (GOPs). A GOP is a group/chunk of frames arranged in a specific order. Each chunk is an independently playable unit, usually a few seconds in length.
2. Some old mobile devices or browsers might not support video splitting. The preprocessor splits videos by GOP alignment for old clients.
3. DAG generation. The preprocessor generates a DAG based on configuration files that client programmers write. Figure 14-12 is a simplified DAG representation which has 2 nodes and 1 edge:



This DAG representation is generated from the two configuration files below (Figure 14-13):


4. Cache data. The preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system could use persisted data for retry operations.


DAG scheduler


The DAG scheduler splits a DAG into stages of tasks and puts them in the task queue in the resource manager. Figure 14-15 shows an example of how the DAG scheduler works.


As shown in Figure 14-15, the work is split into three stages. In stage 1, the original video is split into video, audio, and metadata. In stage 2, the video file is further split into two tasks, video encoding and thumbnail, and the audio file requires audio encoding.
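
One way to derive stages from a DAG is to group tasks by dependency level, so that every task in a stage depends only on tasks in earlier stages. The sketch below illustrates that idea; the function name, task names, and edge list are hypothetical and are not taken from the figures. The root node (the original video) appears as its own level before the stage 1 outputs described above.

```python
from collections import defaultdict


def split_into_stages(tasks, edges):
    """Group DAG tasks into stages; tasks within a stage do not depend on each other."""
    indegree = {task: 0 for task in tasks}
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start with tasks that have no dependencies, then peel off one level at a time.
    stage = [task for task in tasks if indegree[task] == 0]
    stages = []
    while stage:
        stages.append(stage)
        next_stage = []
        for task in stage:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_stage.append(child)
        stage = next_stage

    if sum(len(s) for s in stages) != len(tasks):
        raise ValueError("the graph contains a cycle, so it is not a DAG")
    return stages


# Hypothetical task names that mirror the description above.
tasks = ["original_video", "video", "audio", "metadata",
         "video_encoding", "thumbnail", "audio_encoding"]
edges = [("original_video", "video"), ("original_video", "audio"),
         ("original_video", "metadata"), ("video", "video_encoding"),
         ("video", "thumbnail"), ("audio", "audio_encoding")]

for level, stage in enumerate(split_into_stages(tasks, edges)):
    print(level, stage)
```

Tasks that land in the same stage can be handed to the task queue together and executed in parallel.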


Resource manager


The resource manager is responsible for managing the efficiency of resource allocation. It contains 3 queues and a task scheduler as shown in Figure 14-17.
• Task queue: It is a priority queue that contains tasks to be executed.
• Worker queue: It is a priority queue that contains worker utilization info.
• Running queue: It contains info about the currently running tasks and workers running the tasks.
• Task scheduler: It picks the optimal task/worker, and instructs the chosen task worker to execute the job.



The resource manager works as follows:
• The task scheduler gets the highest priority task from the task queue.
• The task scheduler gets the optimal task worker to run the task from the worker queue.
• The task scheduler instructs the chosen task worker to run the task.
• The task scheduler binds the task/worker info and puts it in the running queue.
• The task scheduler removes the job from the running queue once the job is done.
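
The steps above can be sketched with two priority queues and a dictionary. This is a minimal illustration under several assumptions: the class and method names are hypothetical, a lower number means a higher task priority, the "optimal" worker is simply the least-utilized one, and returning the (task, worker) pair stands in for instructing the worker to run the task.

```python
import heapq


class ResourceManager:
    def __init__(self):
        self.task_queue = []    # heap of (priority, task); lower number runs first
        self.worker_queue = []  # heap of (utilization, worker); least loaded first
        self.running = {}       # running queue: task -> worker

    def add_task(self, priority, task):
        heapq.heappush(self.task_queue, (priority, task))

    def add_worker(self, utilization, worker):
        heapq.heappush(self.worker_queue, (utilization, worker))

    def schedule_next(self):
        """Bind the highest-priority task to the least-utilized worker."""
        if not self.task_queue or not self.worker_queue:
            return None
        _, task = heapq.heappop(self.task_queue)
        _, worker = heapq.heappop(self.worker_queue)
        self.running[task] = worker
        return task, worker

    def complete(self, task, utilization=0.0):
        """Remove a finished task from the running queue and free its worker."""
        worker = self.running.pop(task)
        self.add_worker(utilization, worker)


manager = ResourceManager()
manager.add_task(2, "thumbnail")
manager.add_task(1, "video_encoding")
manager.add_worker(0.7, "worker-a")
manager.add_worker(0.2, "worker-b")
print(manager.schedule_next())
```

A real scheduler would also consider worker capabilities (for example, GPU availability) when matching tasks to workers; that matching logic is omitted here.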



Task workers


Task workers run the tasks which are defined in the DAG. Different task workers may run different tasks as shown in Figure 14-19.


Temporary storage


Multiple storage systems are used here. The choice of storage system depends on factors like data type, data size, access frequency, data life span, etc. For instance, metadata is frequently accessed by workers, and the data size is usually small. Thus, caching metadata in memory is a good idea. For video or audio data, we put them in blob storage. Data in temporary storage is freed up once the corresponding video processing is complete.


Encoded video


Encoded video is the final output of the encoding pipeline. Here is an example of the output: funny_720p.mp4.

System optimizations


At this point, you should have a good understanding of the video uploading flow, video streaming flow, and video transcoding. Next, we will refine the system with optimizations, including speed, safety, and cost-saving.

Speed optimization: parallelize video uploading


Uploading a video as a whole unit is inefficient. We can split a video into smaller chunks by GOP alignment as shown in Figure 14-22.



This allows fast resumable uploads when a previous upload fails. The job of splitting a video file by GOP can be implemented by the client to improve the upload speed as shown in Figure 14-23.
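
The resumable part of this idea can be sketched as follows. Note the simplification: real GOP-aligned splitting requires parsing the video stream, so this sketch uses fixed-size byte ranges as a stand-in for GOP chunks. The function names and the `send_chunk` callback are hypothetical.

```python
def plan_chunks(total_size, chunk_size):
    """Return (offset, length) pairs that cover a file of total_size bytes."""
    return [(offset, min(chunk_size, total_size - offset))
            for offset in range(0, total_size, chunk_size)]


def resume_upload(chunks, already_uploaded, send_chunk):
    """Upload only the chunks that are still missing; return all uploaded indexes."""
    done = set(already_uploaded)
    for index, (offset, length) in enumerate(chunks):
        if index in done:
            continue  # this chunk survived the previous attempt, so skip it
        send_chunk(index, offset, length)
        done.add(index)
    return done


chunks = plan_chunks(total_size=10, chunk_size=4)
sent = []
uploaded = resume_upload(chunks, already_uploaded={0},
                         send_chunk=lambda index, offset, length: sent.append(index))
print(chunks, sent, uploaded)
```

Because each chunk is tracked independently, a failed upload only resends the missing chunks, and independent chunks could also be sent in parallel.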


Speed optimization: place upload centers close to users


Another way to improve the upload speed is by setting up multiple upload centers across the globe (Figure 14-24). People in the United States can upload videos to the North America upload center, and people in China can upload videos to the Asian upload center. To achieve this, we use CDN as upload centers.


Speed optimization: parallelism everywhere


Achieving low latency requires serious effort. Another optimization is to build a loosely coupled system and enable high parallelism.

Our design needs some modifications to achieve high parallelism. Let us zoom in to the flow of how a video is transferred from original storage to the CDN. The flow is shown in Figure 14-25, revealing that the input of each step depends on the output of the previous step. This dependency makes parallelism difficult.



To make the system more loosely coupled, we introduce message queues as shown in Figure 14-26. Let us use an example to explain how message queues make the system more loosely coupled.
• Before the message queue is introduced, the encoding module must wait for the output of the download module.
• After the message queue is introduced, the encoding module does not need to wait for the output of the download module anymore. If there are events in the message queue, the encoding module can execute those jobs in parallel.
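
The decoupling described above can be illustrated with an in-process queue. In this sketch, Python's `queue.Queue` stands in for a real message queue, the module and function names are hypothetical, and "encoding" is reduced to renaming the video ID. The download module publishes an event per video, and several encoder workers consume events independently.

```python
import queue
import threading


def run_pipeline(video_ids, num_encoders=2):
    download_done = queue.Queue()  # stands in for the message queue between modules
    encoded = []
    lock = threading.Lock()

    def download_module():
        for video_id in video_ids:
            download_done.put(video_id)  # publish a "download finished" event
        for _ in range(num_encoders):
            download_done.put(None)      # one stop signal per encoder worker

    def encoding_module():
        while True:
            video_id = download_done.get()
            if video_id is None:
                break
            with lock:
                encoded.append(f"{video_id}_encoded")

    threads = [threading.Thread(target=download_module)]
    threads += [threading.Thread(target=encoding_module) for _ in range(num_encoders)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(encoded)


print(run_pipeline(["v1", "v2", "v3"]))
```

The encoders never call the download module directly; they only react to events in the queue, which is what lets each module scale and fail independently.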


Safety optimization: pre-signed upload URL


Safety is one of the most important aspects of any product. To ensure only authorized users upload videos to the right location, we introduce pre-signed URLs as shown in Figure 14-27.



The upload flow is updated as follows:
1. The client makes an HTTP request to API servers to fetch the pre-signed URL, which gives the access permission to the object identified in the URL. The term pre-signed URL comes from Amazon S3, where it is used for uploading files. Other cloud service providers might use a different name. For instance, Microsoft Azure blob storage supports the same feature, but calls it “Shared Access Signature” [10].
2. API servers respond with a pre-signed URL.
3. Once the client receives the response, it uploads the video using the pre-signed URL.
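
To show the idea behind a pre-signed URL, here is a minimal sketch that signs an object key and an expiry time with HMAC. It is an illustration only and is not the signing scheme used by Amazon S3 or Azure: the secret key, host name, query parameter names, and function names are all assumptions made for this example.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical key known only to the servers


def _sign(object_key, expires):
    payload = f"PUT\n{object_key}\n{expires}".encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def presign_upload_url(object_key, expires_in, now=None):
    """API server side: issue a URL that allows uploading one object for a limited time."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    query = urlencode({"expires": expires, "signature": _sign(object_key, expires)})
    return f"https://upload.example.com/{object_key}?{query}"


def verify_upload_url(url, now=None):
    """Storage side: accept the upload only if the signature matches and has not expired."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    object_key = parsed.path.lstrip("/")
    expected = _sign(object_key, expires)
    return now <= expires and hmac.compare_digest(signature, expected)


url = presign_upload_url("videos/abc.mp4", expires_in=300)
print(verify_upload_url(url))
```

Because the signature covers the object key and the expiry time, a client cannot reuse the URL for a different object or after it expires.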



Safety optimization: protect your videos


Many content makers are reluctant to post videos online because they fear their original videos will be stolen. To protect copyrighted videos, we can adopt one of the following three safety options:
• Digital rights management (DRM) systems: Three major DRM systems are Apple FairPlay, Google Widevine, and Microsoft PlayReady.
• AES encryption: You can encrypt a video and configure an authorization policy. The encrypted video will be decrypted upon playback. This ensures that only authorized users can watch an encrypted video.
• Visual watermarking: This is an image overlay on top of your video that contains identifying information for your video. It can be your company logo or company name.


Cost-saving optimization


CDN is a crucial component of our system. It ensures fast video delivery on a global scale. However, from the back-of-the-envelope estimation, we know CDN is expensive, especially when the data size is large. How can we reduce the cost?

Previous research shows that YouTube video streams follow long-tail distribution [11] [12]. It means a few popular videos are accessed frequently but many others have few or no viewers. Based on this observation, we implement a few optimizations:
1. Only serve the most popular videos from CDN and serve other videos from our high-capacity storage video servers (Figure 14-28).



2. For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand.
3. Some videos are popular only in certain regions. There is no need to distribute these videos to other regions.
4. Build your own CDN like Netflix and partner with Internet Service Providers (ISPs). Building your own CDN is a giant project; however, this could make sense for large streaming companies. An ISP can be Comcast, AT&T, Verizon, or other internet providers. ISPs are located all around the world and are close to users. By partnering with ISPs, you can improve the viewing experience and reduce the bandwidth charges.

All those optimizations are based on content popularity, user access pattern, video size, etc. It is important to analyze historical viewing patterns before doing any optimization. Here are some of the interesting articles on this topic: [12] [13].
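
As a sketch of the first optimization, the function below picks the smallest set of most-viewed videos that covers a target share of total views; only those videos would be served from the CDN. The function name, the 80% target, and the view counts are assumptions for illustration, not figures from the cited research.

```python
def pick_cdn_videos(view_counts, traffic_share=0.8):
    """Return the most-viewed videos that together cover traffic_share of all views."""
    total = sum(view_counts.values())
    chosen, covered = [], 0
    ranked = sorted(view_counts.items(), key=lambda item: item[1], reverse=True)
    for video_id, views in ranked:
        if covered >= traffic_share * total:
            break
        chosen.append(video_id)
        covered += views
    return chosen


# Hypothetical long-tail view counts: one video dominates the traffic.
views = {"a": 900, "b": 50, "c": 30, "d": 20}
print(pick_cdn_videos(views))
print(pick_cdn_videos(views, traffic_share=0.95))
```

With a long-tail distribution, a small fraction of the catalog covers most of the traffic, so the rest can be served from cheaper storage servers.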


Error handling


For a large-scale system, system errors are unavoidable. To build a highly fault-tolerant system, we must handle errors gracefully and recover from them fast. Two types of errors exist:
• Recoverable error. For recoverable errors, such as a video segment failing to transcode, the general idea is to retry the operation a few times. If the task continues to fail and the system believes it is not recoverable, it returns a proper error code to the client.
• Non-recoverable error. For non-recoverable errors such as malformed video format, the system stops the running tasks associated with the video and returns the proper error code to the client.
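
The two error types can be handled with a small retry wrapper. This is a sketch under assumptions: the exception classes, the function name, the result dictionary, and the limit of three attempts are all hypothetical, and a production system would typically add a delay or backoff between attempts.

```python
class RecoverableError(Exception):
    """A transient failure, for example a video segment that failed to transcode."""


class NonRecoverableError(Exception):
    """A permanent failure, for example a malformed video format."""


def run_with_retry(task, max_attempts=3):
    """Retry recoverable errors a few times; stop immediately on non-recoverable ones."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": task(), "attempts": attempt}
        except NonRecoverableError as exc:
            return {"status": "failed", "error": str(exc), "attempts": attempt}
        except RecoverableError as exc:
            last_error = str(exc)  # remember the error and try again
    return {"status": "failed", "error": last_error, "attempts": max_attempts}


calls = []


def flaky_transcode():
    calls.append(1)
    if len(calls) < 3:
        raise RecoverableError("segment failed to transcode")
    return "encoded"


print(run_with_retry(flaky_transcode))
```

The returned status and error message play the role of the error code that is sent back to the client.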



Typical errors for each system component are covered by the following playbook:
• Upload error: retry a few times.
• Split video error: if older versions of clients cannot split videos by GOP alignment, the entire video is passed to the server. The job of splitting videos is done on the server-side.
• Transcoding error: retry.
• Preprocessor error: regenerate the DAG.
• DAG scheduler error: reschedule a task.
• Resource manager queue down: use a replica.
• Task worker down: retry the task on a new worker.
• API server down: API servers are stateless so requests will be directed to a different API server.
• Metadata cache server down: data is replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We can bring up a new cache server to replace the dead one.
• Metadata DB server down:
    • Master is down. If the master is down, promote one of the slaves to act as the new master.
    • Slave is down. If a slave goes down, you can use another slave for reads and bring up another database server to replace the dead one.


Step 4 - Wrap up


In this chapter, we presented the architecture design for video streaming services like YouTube. If there is extra time at the end of the interview, here are a few additional points:
• Scale the API tier: Because API servers are stateless, it is easy to scale the API tier horizontally.
• Scale the database: You can talk about database replication and sharding.
• Live streaming: This refers to recording and broadcasting a video in real time. Although our system is not designed specifically for live streaming, live streaming and non-live streaming have some similarities: both require uploading, encoding, and streaming. The notable differences are:
    • Live streaming has a stricter latency requirement, so it might need a different streaming protocol.
    • Live streaming has a lower requirement for parallelism because small chunks of data are already processed in real time.
    • Live streaming requires a different set of error-handling mechanisms. Any error handling that takes too much time is not acceptable.
• Video takedowns: Videos that infringe copyright, contain pornography, or involve other illegal acts must be removed. Some can be discovered by the system during the upload process, while others might be discovered through user flagging.
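
As one way to illustrate the database sharding point above, the sketch below maps a video ID to a shard with a stable hash. The function name, the choice of MD5, and the shard count are assumptions for the example; the chapter does not prescribe a sharding scheme.

```python
import hashlib


def shard_for(video_id: str, num_shards: int) -> int:
    """Map a video ID to a shard index in [0, num_shards) using a stable hash."""
    # hashlib gives the same digest on every process and machine, unlike
    # Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same ID always lands on the same shard, so reads know where to look.
print(shard_for("w5WVu624fY8", 16) == shard_for("w5WVu624fY8", 16))  # True
```

Plain modulo hashing moves most keys when the shard count changes, which is one reason interview discussions often go on to consistent hashing.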

Congratulations on getting this far! Now give yourself a pat on the back. Good job!


Reference materials


[1] YouTube by the numbers: https://www.omnicoreagency.com/youtube-statistics/
[2] 2019 YouTube Demographics: https://blog.hubspot.com/marketing/youtube-demographics
[3] Cloudfront Pricing: https://aws.amazon.com/cloudfront/pricing/
[4] Netflix on AWS: https://aws.amazon.com/solutions/case-studies/netflix/
[5] Akamai homepage: https://www.akamai.com/
[6] Binary large object: https://en.wikipedia.org/wiki/Binary_large_object
[7] Here’s What You Need to Know About Streaming Protocols: https://www.dacast.com/blog/streaming-protocols/
[8] SVE: Distributed Video Processing at Facebook Scale: https://www.cs.princeton.edu/~wlloyd/papers/sve-sosp17.pdf
[9] Weibo video processing architecture (in Chinese): https://www.upyun.com/opentalk/399.html
[10] Delegate access with a shared access signature: https://docs.microsoft.com/en-us/rest/api/storageservices/delegate-access-with-shared-access-signature
[11] YouTube scalability talk by early YouTube employee: https://www.youtube.com/watch?v=w5WVu624fY8
[12] Understanding the characteristics of internet short video sharing: A YouTube-based measurement study: https://arxiv.org/pdf/0707.3670.pdf
[13] Content Popularity for Open Connect: https://netflixtechblog.com/content-popularity-for-open-connect-b86d56f613b
