In this chapter, we focus on web crawler design: an interesting and classic system design interview question. A web crawler is also known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc. A web crawler starts by collecting a few web pages and then follows links on those pages to collect new content. Figure 9-1 shows a visual example of the crawl process.
A crawler is used for many purposes:
• Search engine indexing: This is the most common use case. A crawler collects web pages to create a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
• Web archiving: This is the process of collecting information from the web to preserve data for future uses. For instance, many national libraries run crawlers to archive web sites. Notable examples are the US Library of Congress [1] and the EU web archive [2].
• Web mining: The explosive growth of the web presents an unprecedented opportunity for data mining. Web mining helps to discover useful knowledge from the internet. For example, top financial firms use crawlers to download shareholder meetings and annual reports to learn key company initiatives.
• Web monitoring: Crawlers help to monitor copyright and trademark infringements over the Internet. For example, Digimarc [3] utilizes crawlers to discover pirated works and report them.
The complexity of developing a web crawler depends on the scale we intend to support. It could be either a small school project, which takes only a few hours to complete, or a gigantic project that requires continuous improvement from a dedicated engineering team. Thus, we will explore the scale and features to support below.
The basic algorithm of a web crawler is simple:
1. Given a set of URLs, download all the web pages addressed by the URLs.
2. Extract URLs from these web pages.
3. Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
Does a web crawler really work as simply as this basic algorithm? Not exactly. Designing a vastly scalable web crawler is an extremely complex task. It is unlikely for anyone to design a massive web crawler within the interview duration. Before jumping into the design, we must ask questions to understand the requirements and establish the design scope:
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else?
Interviewer: Search engine indexing.
Candidate: How many web pages does the web crawler collect per month?
Interviewer: 1 billion pages.
Candidate: What content types are included? HTML only, or other content types such as PDFs and images as well?
Interviewer: HTML only.
Candidate: Shall we consider newly added or edited web pages?
Interviewer: Yes, we should consider the newly added or edited web pages.
Candidate: Do we need to store HTML pages crawled from the web?
Interviewer: Yes, up to 5 years.
Candidate: How do we handle web pages with duplicate content?
Interviewer: Pages with duplicate content should be ignored.
Above are some of the sample questions that you can ask your interviewer. It is important to understand the requirements and clarify ambiguities. Even if you are asked to design a straightforward product like a web crawler, you and your interviewer might not have the same assumptions.
Besides functionalities to clarify with your interviewer, it is also important to note down the following characteristics of a good web crawler:
• Scalability: The web is very large. There are billions of web pages out there. Web crawling should be extremely efficient using parallelization.
• Robustness: The web is full of traps. Bad HTML, unresponsive servers, crashes, malicious links, etc. are all common. The crawler must handle all those edge cases.
• Politeness: The crawler should not make too many requests to a website within a short time interval.
• Extensibility: The system is flexible so that minimal changes are needed to support new content types. For example, if we want to crawl image files in the future, we should not need to redesign the entire system.
The following estimations are based on many assumptions, and it is important to communicate with the interviewer to be on the same page.
• Assume 1 billion web pages are downloaded every month.
• QPS: 1,000,000,000 / 30 days / 24 hours / 3600 seconds = ~400 pages per second.
• Peak QPS = 2 * QPS = 800
• Assume the average web page size is 500 KB.
• 1 billion pages x 500 KB = 500 TB storage per month. If you are unclear about digital storage units, go through the “Power of 2” section in Chapter 2 again.
• Assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB. 30 PB of storage is needed to store five years of content.
These numbers are reproduced in the short calculation below.
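As a quick sanity check, the back-of-envelope arithmetic can be redone in a few lines of Python; the inputs (1 billion pages per month, 500 KB per page, five-year retention) come from the assumptions above.

pages_per_month = 1_000_000_000
qps = pages_per_month / (30 * 24 * 3600)        # ~385, rounded up to ~400 pages per second
peak_qps = 2 * qps                              # ~800 pages per second
page_size_kb = 500
storage_per_month_tb = pages_per_month * page_size_kb / 10**9   # 500 TB (1 TB = 10^9 KB)
storage_five_years_pb = storage_per_month_tb * 12 * 5 / 1000    # 30 PB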
Once the requirements are clear, we move on to the high-level design. Inspired by previous studies on web crawling [4] [5], we propose a high-level design as shown in Figure 9-2.
First, we explore each design component to understand its functionality. Then, we examine the crawler workflow step-by-step.
A web crawler uses seed URLs as a starting point for the crawl process. For example, to crawl all web pages from a university’s website, an intuitive way to select seed URLs is to use the university’s domain name.
To crawl the entire web, we need to be creative in selecting seed URLs. A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible. The general strategy is to divide the entire URL space into smaller ones. The first proposed approach is based on locality, as different countries may have different popular websites. Another way is to choose seed URLs based on topics; for example, we can divide the URL space into shopping, sports, healthcare, etc. Seed URL selection is an open-ended question. You are not expected to give the perfect answer. Just think out loud.
Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded. The component that stores URLs to be downloaded is called the URL Frontier. You can think of it as a First-in-First-out (FIFO) queue. For detailed information about the URL Frontier, refer to the deep dive. A minimal sketch of a crawl loop driven by such a FIFO frontier follows.
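The sketch below shows the basic 3-step algorithm from earlier in this chapter driven by a FIFO frontier; download_page and extract_urls are hypothetical helpers standing in for the HTML Downloader and URL Extractor components discussed next.

from collections import deque

def crawl(seed_urls, download_page, extract_urls, max_pages=1000):
    # URL Frontier: URLs to be downloaded, consumed in FIFO order.
    frontier = deque(seed_urls)
    # URLs that have already been added to the frontier ("URL Seen?").
    seen = set(seed_urls)
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        html = download_page(url)             # HTML Downloader
        downloaded += 1
        for link in extract_urls(html, url):  # URL Extractor
            if link not in seen:
                seen.add(link)
                frontier.append(link)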
HTML Downloader
The HTML Downloader downloads web pages from the internet. Those URLs are provided by the URL Frontier.
DNS Resolver
To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. For instance, the URL www.wikipedia.org is converted to IP address 198.35.26.96 as of 3/5/2019.
Content Parser
After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server would slow down the crawling process. Thus, the content parser is a separate component.
Content Seen?
Online research [6] reveals that 29% of web pages are duplicate content, which may cause the same content to be stored multiple times. We introduce the “Content Seen?” data structure to eliminate data redundancy and shorten processing time. It helps to detect whether a newly downloaded piece of content is already stored in the system. To compare two HTML documents, we can compare them character by character. However, this method is slow and time-consuming, especially when billions of web pages are involved. An efficient way to accomplish this task is to compare the hash values of the two web pages [7].
Content Storage
It is a storage system for storing HTML content. The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used.
• Most of the content is stored on disk because the data set is too big to fit in memory.
• Popular content is kept in memory to reduce latency.
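Returning to the “Content Seen?” check, here is a minimal sketch of the hash-based comparison, assuming a SHA-256 digest of the raw HTML serves as the fingerprint; a production system would keep the fingerprint set in a shared data store and might use a similarity hash rather than an exact digest.

import hashlib

seen_fingerprints = set()   # in practice stored in a shared data store, not local memory

def is_duplicate_content(html):
    # Hash the page once and compare digests instead of comparing character by character.
    fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False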
URL Extractor
The URL Extractor parses and extracts links from HTML pages. Figure 9-3 shows an example of a link extraction process. Relative paths are converted to absolute URLs by adding the “https://en.wikipedia.org” prefix.
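A minimal sketch of such an extractor using Python’s standard library; the page URL and the relative path in the usage note are illustrative only.

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags and resolves relative paths against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# Illustrative usage: a relative path gains the "https://en.wikipedia.org" prefix.
# urljoin("https://en.wikipedia.org/wiki/Main_Page", "/wiki/Example")
#   -> "https://en.wikipedia.org/wiki/Example"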
URL Filter
The URL filter excludes certain content types, file extensions, error links, and URLs in “blacklisted” sites.
URL Seen?
“URL Seen?” is a data structure that keeps track of URLs that have been visited before or are already in the Frontier. “URL Seen?” helps to avoid adding the same URL multiple times, as this can increase server load and cause potential infinite loops.
Bloom filter and hash table are common techniques to implement the “URL Seen?” component. We will not cover the detailed implementation of the bloom filter and hash table here. For more information, refer to the reference materials [4] [8]. A small Bloom filter sketch follows.
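For illustration only, here is a minimal Bloom filter sketch; the bit-array size and the number of hash functions are placeholder values that a real deployment would tune to the expected number of URLs and the acceptable false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from salted SHA-256 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False means the URL is definitely new; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))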
URL Storage
URL Storage stores already visited URLs.
So far, we have discussed every system component. Next, we put them together to explain the workflow.
Web crawler workflow
To better explain the workflow step-by-step, sequence numbers are added in the design diagram as shown in Figure 9-4.
Step 1: Add seed URLs to the URL Frontier.
Step 2: HTML Downloader fetches a list of URLs from the URL Frontier.
Step 3: HTML Downloader gets IP addresses of URLs from the DNS Resolver and starts downloading.
Step 4: Content Parser parses HTML pages and checks if pages are malformed.
Step 5: After content is parsed and validated, it is passed to the “Content Seen?” component.
Step 6: The “Content Seen?” component checks if an HTML page is already in the storage.
• If it is in the storage, this means the same content under a different URL has already been processed. In this case, the HTML page is discarded.
• If it is not in the storage, the system has not processed the same content before. The content is passed to the Link Extractor.
Step 7: Link Extractor extracts links from HTML pages.
Step 8: Extracted links are passed to the URL Filter.
Step 9: After links are filtered, they are passed to the “URL Seen?” component.
Step 10: The “URL Seen?” component checks if a URL is already in the storage; if yes, it has been processed before, and nothing needs to be done.
Step 11: If a URL has not been processed before, it is added to the URL Frontier.
Up until now, we have discussed the high-level design. Next, we will discuss the most important building components and techniques in depth:
• Depth-first search (DFS) vs Breadth-first search (BFS)
• URL frontier
• HTML Downloader
• Robustness
• Extensibility
• Detect and avoid problematic content
You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS. However, DFS is usually not a good choice because the depth of DFS can be very deep.
BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:
• Most links from the same web page link back to the same host. In Figure 9-5, all the links in wikipedia.com are internal links, making the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered “impolite”.
• Standard BFS does not take the priority of a URL into consideration. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.
The URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness. A few noteworthy papers on the URL frontier are mentioned in the reference materials [5] [9]. The findings from these papers are as follows:
Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered “impolite” or may even be treated as a denial-of-service (DoS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.
The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintaining a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 9-6 shows the design that manages politeness.
• Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host.
• Mapping table: It maps each host to a queue.
• FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
• Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector.
• Worker thread 1 to N: A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks, as in the sketch below.
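A minimal sketch of the mapping table and queue router, assuming one FIFO back queue per hostname and a fixed per-host delay enforced by each worker; thread scheduling and queue assignment are simplified away.

import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class QueueRouter:
    """Mapping table + queue router: each host gets its own FIFO back queue (b1..bn)."""
    def __init__(self):
        self.host_to_queue = defaultdict(deque)

    def enqueue(self, url):
        host = urlsplit(url).netloc
        self.host_to_queue[host].append(url)

def worker(queue, download, delay_seconds=1.0):
    """A worker thread drains a single host's queue, pausing between two download tasks."""
    while queue:
        download(queue.popleft())
        time.sleep(delay_seconds)   # politeness delay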
A random post from a discussion forum about Apple products carries a very different weight than posts on the Apple home page. Even though they both have the “Apple” keyword, it is sensible for a crawler to crawl the Apple home page first.
We prioritize URLs based on usefulness, which can be measured by PageRank [10], website traffic, update frequency, etc. “Prioritizer” is the component that handles URL prioritization. Refer to the reference materials [5] [10] for in-depth information about this concept.
Figure 9-7 shows the design that manages URL priority.
• Prioritizer: It takes URLs as input and computes the priorities.
• Queue f1 to fn: Each queue has an assigned priority. Queues with high priority are selected with higher probability.
• Queue selector: Randomly choose a queue with a bias towards queues with higher priority. A short sketch of this biased selection follows below.
Figure 9-8 presents the URL frontier design, and it contains two modules:
• Front queues: manage prioritization
• Back queues: manage politeness
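The biased selection among the front queues can be sketched with weighted random choice; the queue names and weights below are illustrative placeholders.

import random
from collections import deque

# Front queues f1..fn with illustrative weights: higher priority -> larger weight.
front_queues = {
    "f1_high":   (deque(), 5),
    "f2_medium": (deque(), 3),
    "f3_low":    (deque(), 1),
}

def select_next_url():
    """Pick a non-empty queue with probability proportional to its weight, then dequeue."""
    candidates = [(q, w) for q, w in front_queues.values() if q]
    if not candidates:
        return None
    queues, weights = zip(*candidates)
    chosen = random.choices(queues, weights=weights, k=1)[0]
    return chosen.popleft()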
Web pages are constantly being added, deleted, and edited, so a crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawling all the URLs is time-consuming and resource intensive. A few strategies to optimize freshness are listed as follows:
• Recrawl based on web pages’ update history.
• Prioritize URLs and recrawl important pages first and more frequently.
In a real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions [4]. Putting everything in memory is neither durable nor scalable. Keeping everything on disk is not desirable either, because the disk is slow and can easily become a bottleneck for the crawl.
We adopted a hybrid approach. The majority of URLs are stored on disk, so the storage space is not a problem. To reduce the cost of reading from the disk and writing to the disk, we maintain buffers in memory for enqueue/dequeue operations. Data in the buffer is periodically written to the disk.
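A minimal sketch of the buffering idea, assuming an append-only file stands in for the durable on-disk store; a real frontier would also buffer and batch reads for dequeue operations.

class BufferedFrontierWriter:
    """Collects enqueued URLs in memory and flushes them to disk in batches."""
    def __init__(self, path, buffer_limit=10_000):
        self.path = path
        self.buffer = []
        self.buffer_limit = buffer_limit

    def enqueue(self, url):
        self.buffer.append(url)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Periodic write of the in-memory buffer to the on-disk store.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("\n".join(self.buffer) + "\n")
        self.buffer.clear()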
The HTML Downloader downloads web pages from the internet using the HTTP protocol. Before discussing the HTML Downloader, we look at the Robots Exclusion Protocol first.
Robots.txt, called the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies what pages crawlers are allowed to download. Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.
To avoid repeat downloads of the robots.txt file, we cache the results of the file. The file is downloaded and saved to cache periodically. Here is a piece of the robots.txt file taken from https://www.amazon.com/robots.txt. Some of the directories, like creatorhub, are disallowed for Googlebot.
User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
Besides robots.txt, performance optimization is another important concept we will cover for the HTML downloader.
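Before turning to performance, here is a minimal sketch of the rule check itself using Python’s standard library robotparser; the caching and periodic refresh described above are omitted, and the checked URL is a hypothetical path under the disallowed /creatorhub/ prefix.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.amazon.com/robots.txt")
rp.read()   # downloads and parses the robots.txt file

# A crawler should skip URLs that the rules disallow for its user agent.
allowed = rp.can_fetch("Googlebot", "https://www.amazon.com/creatorhub/some-page")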
Performance optimization
Below is a list of performance optimizations for the HTML Downloader.
To achieve high performance, crawl jobs are distributed into multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs. Figure 9-9 shows an example of a distributed crawl.
DNS Resolver is a bottleneck for crawlers because DNS requests might take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our own DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
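A minimal sketch of such a cache, assuming socket.gethostbyname as the underlying resolver and a simple TTL; in the design above, a cron job rather than lookup-time expiry would refresh the entries.

import socket
import time

class DNSCache:
    """Caches hostname -> IP mappings so crawler threads avoid repeated DNS lookups."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.cache = {}   # hostname -> (ip_address, resolved_at)

    def resolve(self, hostname):
        entry = self.cache.get(hostname)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                       # cache hit
        ip = socket.gethostbyname(hostname)       # cache miss: real DNS lookup
        self.cache[hostname] = (ip, time.time())
        return ip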
Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download time. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.
Some web servers respond slowly or may not respond at all. To avoid a long wait time, a maximal wait time is specified. If a host does not respond within a predefined time, the crawler will stop the job and crawl some other pages.
Besides performance optimization, robustness is also an important consideration. We present a few approaches to improve the system robustness:
• Consistent hashing: This helps to distribute loads among downloaders. A new downloader server can be added or removed using consistent hashing. Refer to Chapter 5: Design consistent hashing for more details.
• Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading saved states and data.
• Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.
• Data validation: This is an important measure to prevent system errors.
As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules. Figure 9-10 shows how to add new modules.
This section discusses the detection and prevention of redundant, meaningless, or harmful content.
1. Redundant content
As discussed previously, nearly 30% of the web pages are duplicates. Hashes or checksums help to detect duplication [11].
2. Spider traps
A spider trap is a web page that causes a crawler to get stuck in an infinite loop. For instance, an infinitely deep directory structure is listed as follows:
www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/
Such spider traps can be avoided by setting a maximal length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to an unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.
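A minimal sketch of such a customized URL filter, combining the maximal URL length with a manually maintained blacklist; the length limit, hosts, and extensions are illustrative.

from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048                                # illustrative cap against infinitely deep paths
BLACKLISTED_HOSTS = {"www.spidertrapexample.com"}    # manually identified spider-trap sites
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf")        # HTML-only crawl per the requirements

def should_crawl(url):
    if len(url) > MAX_URL_LENGTH:
        return False
    parts = urlsplit(url)
    if parts.netloc in BLACKLISTED_HOSTS:
        return False
    if parts.path.lower().endswith(BLOCKED_EXTENSIONS):
        return False
    return True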
3. Data noise
Some content has little or no value, such as advertisements, code snippets, spam URLs, etc. Such content is not useful for crawlers and should be excluded if possible.
In this chapter, we first discussed the characteristics of a good crawler: scalability, politeness, extensibility, and robustness. Then, we proposed a design and discussed key components. Building a scalable web crawler is not a trivial task because the web is enormously large and full of traps. Even though we have covered many topics, we still miss many relevant talking points:
• Server-side rendering: Numerous websites use scripts like JavaScript, AJAX, etc., to generate links on the fly. If we download and parse web pages directly, we will not be able to retrieve dynamically generated links. To solve this problem, we perform server-side rendering (also called dynamic rendering) first before parsing a page [12].
• Filter out unwanted pages: With finite storage capacity and crawl resources, an anti-spam component is beneficial in filtering out low quality and spam pages [13] [14].
• Database replication and sharding: Techniques like replication and sharding are used to improve the data layer availability, scalability, and reliability.
• Horizontal scaling: For a large scale crawl, hundreds or even thousands of servers are needed to perform download tasks. The key is to keep servers stateless.
• Availability, consistency, and reliability: These concepts are at the core of any large system’s success. We discussed these concepts in detail in Chapter 1. Refresh your memory on these topics.
• Analytics: Collecting and analyzing data are important parts of any system because data is a key ingredient for fine-tuning.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
[1] US Library of Congress: https://www.loc.gov/websites/
[2] EU Web Archive: http://data.europa.eu/webarchive
[3] Digimarc: https://www.digimarc.com/products/digimarc-services/piracy-intelligence
[4] Heydon A., Najork M.: Mercator: A scalable, extensible web crawler. World Wide Web, 2 (4) (1999), pp. 219-229.
[5] Christopher Olston, Marc Najork: Web Crawling. http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
[6] 29% Of Sites Face Duplicate Content Issues: https://tinyurl.com/y6tmh55y
[7] Rabin M.O., et al.: Fingerprinting by random polynomials. Center for Research in Computing Techn., Aiken Computation Laboratory, Univ. (1981).
[8] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422-426, 1970.
[9] Donald J. Patterson, Web Crawling: https://www.ics.uci.edu/~lopes/teaching/cs221W12/slides/Lecture05.pdf
[10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” Technical Report, Stanford University, 1998.
[11] Burton Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), pages 422-426, July 1970.
[12] Google Dynamic Rendering: https://developers.google.com/search/docs/guides/dynamic-rendering
[13] T. Urvoy, T. Lavergne, and P. Filoche, “Tracking web spam with hidden style similarity,” in Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web, 2006.
[14] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, “IRLbot: Scaling to 6 billion pages and beyond,” in Proceedings of the 17th International World Wide Web Conference, 2008.