Web crawler
Chapter 9 of System Design Interview – An Insider's Guide
Questions
What
- What is the main purpose of the crawler? (Search engine indexing)
- What content types are included? (HTML only)
- What is the volume of web pages collected? (1 billion pages per month)
- What should be done with duplicate content? (Ignore pages with duplicate content)
How
- Should newly added or edited web pages be considered? (Yes)
- How long should we store the HTML pages crawled from the web? (Up to 5 years)
Overview
- URL frontier: covered in detail below.
- A PNG Downloader module can be plugged in to download PNG files (an example of extensibility).
- A Web Monitor module can be added to monitor the web and prevent copyright and trademark infringements.
- Definition of Web Crawler:
- Also known as a robot or spider.
- Used by search engines to discover new or updated web content (e.g., web pages, images, videos, PDF files).
- Starts with a few seed web pages and follows links to gather more content.
- Purposes of Web Crawlers:
- Search Engine Indexing:
- Collects web pages to create a local index for search engines.
- Example: Googlebot for Google search engine.
- Web Archiving:
- Collects information to preserve data for future use.
- Examples: US Library of Congress, EU web archive.
- Web Mining:
- Discovers useful knowledge from the internet.
- Example: Financial firms use crawlers to download shareholder meetings and annual reports.
- Web Monitoring:
- Monitors copyright and trademark infringements.
- Example: Digimarc uses crawlers to discover and report pirated works.
- Characteristics of a Good Web Crawler:
- Scalability:
- Must efficiently handle billions of web pages using parallelization.
- Robustness:
- Capable of handling bad HTML, unresponsive servers, crashes, malicious links, etc.
- Politeness:
- Should not overload a website with too many requests in a short time.
- Extensibility:
- Flexible enough to support new content types with minimal changes (e.g., adding image file crawling without redesigning the system).
BFS is preferable
- Web as a Directed Graph:
- Web pages are nodes.
- Hyperlinks (URLs) are edges.
- Crawling is traversing this directed graph from one web page to others.
- Graph Traversal Algorithms:
- DFS (Depth-First Search):
- Not ideal due to potentially deep traversal depth.
- BFS (Breadth-First Search):
- Commonly used by web crawlers.
- Implemented with a first-in-first-out (FIFO) queue.
- URLs are dequeued in the order they are enqueued.
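The BFS traversal described above can be sketched with a FIFO queue. Here, `fetch_links` is a hypothetical callback standing in for the real download-and-extract-links step:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=1000):
    """Traverse the web graph breadth-first.

    fetch_links(url) is an assumed callback that downloads a page
    and returns the URLs found on it.
    """
    queue = deque(seed_urls)          # FIFO: URLs are dequeued in enqueue order
    visited = set(seed_urls)          # avoid enqueuing the same URL twice
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return crawled
```

Because the queue is FIFO, pages one link away from the seeds are crawled before pages two links away, and so on.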
- Problems with Standard BFS Implementation:
- Host Overload:
- Most links from a web page link back to the same host.
- Example: Links in wikipedia.com are internal, leading to the crawler processing URLs from the same host.
- Floods the host with requests when downloading pages in parallel, considered “impolite”.
- Lack of Priority:
- Standard BFS does not account for URL priority.
- Not all web pages have the same quality and importance.
- Prioritization based on page ranks, web traffic, update frequency, etc., is necessary.
URL frontier
- URL Frontier:
- A data structure storing URLs to be downloaded.
- Ensures politeness, URL prioritization, and freshness.
- Politeness:
- Avoids sending too many requests to the same server within a short period.
- Prevents overwhelming web servers, which could otherwise be mistaken for a denial-of-service (DoS) attack.
- Downloads one page at a time from the same host with delays between downloads.
- Uses a mapping from hostnames to download threads, each with a separate FIFO queue.
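The hostname-to-queue mapping can be sketched as follows (a minimal illustration of the politeness mechanism; class and method names are illustrative, not from the chapter):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessRouter:
    """Route URLs into per-host FIFO queues so that each host is
    served by a single download worker, one page at a time."""

    def __init__(self):
        self.host_queues = defaultdict(deque)  # hostname -> FIFO queue

    def enqueue(self, url):
        host = urlparse(url).netloc            # group URLs by hostname
        self.host_queues[host].append(url)

    def dequeue(self, host):
        """Next URL for this host, or None if its queue is empty."""
        queue = self.host_queues.get(host)
        return queue.popleft() if queue else None
```

A worker assigned to one host drains only that host's queue, inserting a delay between downloads, so parallel workers never flood a single server.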
- Priority:
- Not all URLs have equal importance.
- Prioritizes URLs based on usefulness, such as PageRank, website traffic, update frequency.
- The “Prioritizer” component handles URL prioritization.
- Example: Prioritizes the Apple homepage over a random forum post about Apple products.
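A prioritizer can be sketched with a max-priority queue; how the score is computed (PageRank, traffic, etc.) is assumed here, not specified:

```python
import heapq

class Prioritizer:
    """Dequeue higher-scored URLs first (scoring scheme is assumed)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: preserves insertion order for equal scores

    def push(self, url, score):
        # heapq is a min-heap, so negate the score for highest-first order
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With this ordering, a high-scoring page such as the Apple homepage is handed to downloaders before a low-scoring forum post.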
- Freshness:
- Web pages are frequently updated, added, or deleted.
- Periodically recrawls downloaded pages to maintain data freshness.
- Strategies for optimizing freshness:
- Recrawl based on the update history of web pages.
- Prioritize and recrawl important pages first and more frequently.
- Storage for URL Frontier:
- Real-world crawls can involve hundreds of millions of URLs.
- Storing everything in memory is not durable or scalable.
- Storing everything on disk is slow and can become a bottleneck.
- A hybrid approach works well: keep the majority of URLs on disk and use in-memory buffers for enqueue/dequeue operations, periodically writing the buffer to disk.
DNS resolver
- DNS Resolver Bottleneck:
- DNS requests are often slow due to their synchronous nature.
- Response time ranges from 10ms to 200ms.
- When a DNS request is made, other crawler threads may be blocked until the request is completed.
- Optimization with DNS Cache:
- Maintains a DNS cache to store domain name to IP address mappings.
- Reduces the frequency of DNS calls, speeding up the crawling process.
- DNS cache is updated periodically using cron jobs.
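A minimal sketch of such a cache, assuming a TTL-based refresh (real deployments refresh entries with periodic cron jobs, as noted above):

```python
import socket
import time

class DnsCache:
    """Hostname -> IP cache that skips the slow synchronous DNS
    round trip (10ms-200ms) on a hit; entries expire after a TTL."""

    def __init__(self, resolve=socket.gethostbyname, ttl_seconds=300):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._cache = {}  # hostname -> (ip, expiry_timestamp)

    def lookup(self, hostname):
        entry = self._cache.get(hostname)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]               # cache hit: no DNS call
        ip = self._resolve(hostname)      # slow synchronous resolution
        self._cache[hostname] = (ip, now + self._ttl)
        return ip
```

Only the first lookup (and lookups after expiry) pay the resolution cost; repeated lookups for the same host are served from memory.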
- Detection and Avoidance of Problematic Content:
- Redundant Content:
- Nearly 30% of web pages are duplicates.
- Use hashes or checksums to detect duplication.
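Checksum-based duplicate detection can be sketched as below; note this catches only exact duplicates, and near-duplicate detection would need fuzzier techniques (e.g., shingling), which are beyond this example:

```python
import hashlib

class DuplicateDetector:
    """Detect exact-duplicate pages by content checksum."""

    def __init__(self):
        self._seen = set()  # checksums of pages crawled so far

    def is_duplicate(self, html: str) -> bool:
        fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Comparing fixed-size checksums instead of full page bodies keeps the check cheap even across billions of pages.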
- Spider Traps:
- A spider trap is a web page causing a crawler to loop infinitely.
- Example: An infinitely deep directory structure (e.g., http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/ …).
- Avoidance Strategies:
- Set a maximum URL length.
- Identify sites with unusually large numbers of discovered web pages.
- Manual verification and identification.
- Exclude problematic websites or apply customized URL filters.
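The cheap automatic defenses (URL length cap, repeated-segment check) can be sketched as follows; the thresholds are illustrative tuning choices, not values from the chapter:

```python
from urllib.parse import urlparse

def passes_trap_filter(url: str, max_len: int = 2048, max_repeats: int = 3) -> bool:
    """Reject URLs that look like spider traps: either too long,
    or with path segments repeating excessively, as in
    /foo/bar/foo/bar/foo/bar/."""
    if len(url) > max_len:
        return False
    segments = [s for s in urlparse(url).path.split("/") if s]
    return all(segments.count(seg) <= max_repeats for seg in set(segments))
```

Filters like this run in the URL filter stage; sites that slip through still need the manual verification and custom exclusion rules listed above.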
- Data Noise:
- Low-value content such as advertisements, code snippets, spam URLs.
- Exclude such content to improve crawler efficiency.