Web crawler
Chapter 9 of System Design Interview – An Insider's Guide
Questions
What
- What is the main purpose of the crawler? (Search engine indexing)
- What content types are included? (HTML only)
- What is the volume of web pages collected? (1 billion pages per month)
- What should be done with duplicate content? (Ignore pages with duplicate content)
How
- Should newly added or edited web pages be considered? (Yes)
- How long should we store the HTML pages crawled from the web? (Up to 5 years)
Overview
- URL frontier: covered in detail below.
- A PNG Downloader module can be plugged in to download PNG files (an example of extensibility).
- A Web Monitor module can be added to monitor the web and prevent copyright and trademark infringements.
- Definition of Web Crawler:
- Also known as a robot or spider.
- Used by search engines to discover new or updated web content (e.g., web pages, images, videos, PDF files).
- Starts with a few seed web pages and follows links to gather more content.
- Purposes of Web Crawlers:
- Search Engine Indexing:
- Collects web pages to create a local index for search engines.
- Example: Googlebot for Google search engine.
- Web Archiving:
- Collects information to preserve data for future use.
- Examples: US Library of Congress, EU web archive.
- Web Mining:
- Discovers useful knowledge from the internet.
- Example: Financial firms use crawlers to download shareholder meetings and annual reports.
- Web Monitoring:
- Monitors copyright and trademark infringements.
- Example: Digimarc uses crawlers to discover and report pirated works.
- Characteristics of a Good Web Crawler:
- Scalability:
- Must efficiently handle billions of web pages using parallelization.
- Robustness:
- Capable of handling bad HTML, unresponsive servers, crashes, malicious links, etc.
- Politeness:
- Should not overload a website with too many requests in a short time.
- Extensibility:
- Flexible enough to support new content types with minimal changes (e.g., adding image file crawling without redesigning the system).
BFS is preferable
- Web as a Directed Graph:
- Web pages are nodes.
- Hyperlinks (URLs) are edges.
- Crawling is traversing this directed graph from one web page to others.
- Graph Traversal Algorithms:
- DFS (Depth-First Search):
- Not ideal due to potentially deep traversal depth.
- BFS (Breadth-First Search):
- Commonly used by web crawlers.
- Implemented with a first-in-first-out (FIFO) queue.
- URLs are dequeued in the order they are enqueued.
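The BFS traversal described above can be sketched with a FIFO queue. Here, `fetch_links` is a hypothetical callback standing in for the real download-and-extract-links step:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=1000):
    """Traverse the web graph breadth-first.

    fetch_links(url) is an assumed callback that downloads a page
    and returns the URLs found on it.
    """
    queue = deque(seed_urls)          # FIFO: URLs are dequeued in enqueue order
    visited = set(seed_urls)          # avoid enqueuing the same URL twice
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return crawled
```

Because the queue is FIFO, pages one link away from the seeds are crawled before pages two links away, and so on.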
- Problems with Standard BFS Implementation:
- Host Overload:
- Most links from a web page link back to the same host.
- Example: Links in wikipedia.com are internal, leading to the crawler processing URLs from the same host.
- Floods the host with requests when downloading pages in parallel, considered “impolite”.
- Lack of Priority:
- Standard BFS does not account for URL priority.
- Not all web pages have the same quality and importance.
- Prioritization based on page ranks, web traffic, update frequency, etc., is necessary.
URL frontier
- URL Frontier:
- A data structure storing URLs to be downloaded.
- Ensures politeness, URL prioritization, and freshness.
- Politeness:
- Avoids sending too many requests to the same server within a short period.
- Prevents overwhelming web servers, which could otherwise be mistaken for a denial-of-service (DoS) attack.
- Downloads one page at a time from the same host with delays between downloads.
- Uses a mapping from hostnames to download threads, each with a separate FIFO queue.
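The hostname-to-queue mapping can be sketched as follows (a minimal illustration of the politeness mechanism; class and method names are illustrative, not from the chapter):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessRouter:
    """Route URLs into per-host FIFO queues so that each host is
    served by a single download worker, one page at a time."""

    def __init__(self):
        self.host_queues = defaultdict(deque)  # hostname -> FIFO queue

    def enqueue(self, url):
        host = urlparse(url).netloc            # group URLs by hostname
        self.host_queues[host].append(url)

    def dequeue(self, host):
        """Next URL for this host, or None if its queue is empty."""
        queue = self.host_queues.get(host)
        return queue.popleft() if queue else None
```

A worker assigned to one host drains only that host's queue, inserting a delay between downloads, so parallel workers never flood a single server.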
- Priority:
- Not all URLs have equal importance.
- Prioritizes URLs based on usefulness, such as PageRank, website traffic, update frequency.
- The “Prioritizer” component handles URL prioritization.
- Example: Prioritizes the Apple homepage over a random forum post about Apple products.
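A prioritizer can be sketched with a max-priority queue; how the score is computed (PageRank, traffic, etc.) is assumed here, not specified:

```python
import heapq

class Prioritizer:
    """Dequeue higher-scored URLs first (scoring scheme is assumed)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: preserves insertion order for equal scores

    def push(self, url, score):
        # heapq is a min-heap, so negate the score for highest-first order
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With this ordering, a high-scoring page such as the Apple homepage is handed to downloaders before a low-scoring forum post.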
- Freshness:
- Web pages are frequently updated, added, or deleted.
- Periodically recrawls downloaded pages to maintain data freshness.
- Strategies for optimizing freshness:
- Recrawl based on the update history of web pages.
- Prioritize and recrawl important pages first and more frequently.
- Storage for URL Frontier:
- Real-world crawls can involve hundreds of millions of URLs.
- Storing everything in memory is not durable or scalable.
- Storing everything on disk is slow and can become a bottleneck.
- A hybrid approach works well: keep the majority of URLs on disk and use in-memory buffers for enqueue/dequeue operations, periodically writing the buffer to disk.
DNS resolver
- DNS Resolver Bottleneck:
- DNS requests are often slow due to their synchronous nature.
- Response time ranges from 10ms to 200ms.
- When a DNS request is made, other crawler threads may be blocked until the request is completed.
- Optimization with DNS Cache:
- Maintains a DNS cache to store domain name to IP address mappings.
- Reduces the frequency of DNS calls, speeding up the crawling process.
- DNS cache is updated periodically using cron jobs.
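A minimal sketch of such a cache, assuming a TTL-based refresh (real deployments refresh entries with periodic cron jobs, as noted above):

```python
import socket
import time

class DnsCache:
    """Hostname -> IP cache that skips the slow synchronous DNS
    round trip (10ms-200ms) on a hit; entries expire after a TTL."""

    def __init__(self, resolve=socket.gethostbyname, ttl_seconds=300):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._cache = {}  # hostname -> (ip, expiry_timestamp)

    def lookup(self, hostname):
        entry = self._cache.get(hostname)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]               # cache hit: no DNS call
        ip = self._resolve(hostname)      # slow synchronous resolution
        self._cache[hostname] = (ip, now + self._ttl)
        return ip
```

Only the first lookup (and lookups after expiry) pay the resolution cost; repeated lookups for the same host are served from memory.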
- Detection and Avoidance of Problematic Content:
- Redundant Content:
- Nearly 30% of web pages are duplicates.
- Use hashes or checksums to detect duplication.
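Checksum-based duplicate detection can be sketched as below; note this catches only exact duplicates, and near-duplicate detection would need fuzzier techniques (e.g., shingling), which are beyond this example:

```python
import hashlib

class DuplicateDetector:
    """Detect exact-duplicate pages by content checksum."""

    def __init__(self):
        self._seen = set()  # checksums of pages crawled so far

    def is_duplicate(self, html: str) -> bool:
        fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Comparing fixed-size checksums instead of full page bodies keeps the check cheap even across billions of pages.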
- Spider Traps:
- A spider trap is a web page causing a crawler to loop infinitely.
- Example: An infinitely deep directory structure (e.g., http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/ …).
- Avoidance Strategies:
- Set a maximum URL length.
- Identify sites with unusually large numbers of discovered web pages.
- Manual verification and identification.
- Exclude problematic websites or apply customized URL filters.
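The cheap automatic defenses (URL length cap, repeated-segment check) can be sketched as follows; the thresholds are illustrative tuning choices, not values from the chapter:

```python
from urllib.parse import urlparse

def passes_trap_filter(url: str, max_len: int = 2048, max_repeats: int = 3) -> bool:
    """Reject URLs that look like spider traps: either too long,
    or with path segments repeating excessively, as in
    /foo/bar/foo/bar/foo/bar/."""
    if len(url) > max_len:
        return False
    segments = [s for s in urlparse(url).path.split("/") if s]
    return all(segments.count(seg) <= max_repeats for seg in set(segments))
```

Filters like this run in the URL filter stage; sites that slip through still need the manual verification and custom exclusion rules listed above.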
- Data Noise:
- Low-value content such as advertisements, code snippets, spam URLs.
- Exclude such content to improve crawler efficiency.