
Web crawler

Chapter 9 of System Design Interview – An Insider's Guide

Questions

What

  • What is the main purpose of the crawler? (Search engine indexing)
  • What content types are included? (HTML only)
  • What is the volume of web pages collected? (1 billion pages per month)
  • What should be done with duplicate content? (Ignore pages with duplicate content)

How

  • How should newly added or edited web pages be handled? (Consider newly added or edited pages)
  • How long should we store the HTML pages crawled from the web? (Up to 5 years)

Overview

  • URL frontier: covered in detail in its own section below.

  • Extensibility examples:

    • A PNG Downloader module can be plugged in to download PNG files.
    • A Web Monitor module can be added to monitor the web and prevent copyright and trademark infringements.

  • Definition of Web Crawler:

    • Also known as a robot or spider.
    • Used by search engines to discover new or updated web content (e.g., web pages, images, videos, PDF files).
    • Starts with a few web pages and follows links to gather more content.
  • Purposes of Web Crawlers:

    • Search Engine Indexing:
      • Collects web pages to create a local index for search engines.
      • Example: Googlebot for Google search engine.
    • Web Archiving:
      • Collects information to preserve data for future use.
      • Examples: US Library of Congress, EU web archive.
    • Web Mining:
      • Discovers useful knowledge from the internet.
      • Example: Financial firms use crawlers to download shareholder meetings and annual reports.
    • Web Monitoring:
      • Monitors copyright and trademark infringements.
      • Example: Digimarc uses crawlers to discover and report pirated works.
  • Characteristics of a Good Web Crawler:

    • Scalability:
      • Must efficiently handle billions of web pages using parallelization.
    • Robustness:
      • Capable of handling bad HTML, unresponsive servers, crashes, malicious links, etc.
    • Politeness:
      • Should not overload a website with too many requests in a short time.
    • Extensibility:
      • Flexible enough to support new content types with minimal changes (e.g., adding image file crawling without redesigning the system).

BFS is preferable

  • Web as a Directed Graph:

    • Web pages are nodes.
    • Hyperlinks (URLs) are edges.
    • Crawling is traversing this directed graph from one web page to others.
  • Graph Traversal Algorithms:

    • DFS (Depth-First Search):
      • Not ideal due to potentially deep traversal depth.
    • BFS (Breadth-First Search):
      • Commonly used by web crawlers.
      • Implemented with a first-in-first-out (FIFO) queue.
      • URLs are dequeued in the order they are enqueued.
  • Problems with Standard BFS Implementation:

    • Host Overload:
      • Most links from a web page link back to the same host.
      • Example: Links in wikipedia.com are internal, leading to the crawler processing URLs from the same host.
      • Floods the host with requests when downloading pages in parallel, considered “impolite”.
    • Lack of Priority:
      • Standard BFS does not account for URL priority.
      • Not all web pages have the same quality and importance.
      • Prioritization based on page ranks, web traffic, update frequency, etc., is necessary.
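
The standard BFS crawl described above can be sketched with a FIFO queue. This is a minimal illustration, not the book's implementation: `get_links` stands in for the download-and-parse step, and the toy link graph (where most links on `a.com` point back to `a.com`, illustrating the host-overload problem) is invented for the example.

```python
from collections import deque

def bfs_crawl(seed_urls, get_links, max_pages=100):
    """Standard BFS crawl: dequeue URLs in FIFO order, skip already-seen ones.

    get_links(url) -> list of outgoing URLs (stands in for download + parse).
    """
    frontier = deque(seed_urls)          # FIFO queue of URLs to visit
    seen = set(seed_urls)                # avoid re-enqueueing the same URL
    order = []                           # visit order, for illustration
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph: most links on a page point back to the same host,
# which is why a naive BFS floods one host with parallel requests.
graph = {
    "a.com/1": ["a.com/2", "a.com/3", "b.com/1"],
    "a.com/2": ["a.com/3"],
    "a.com/3": [],
    "b.com/1": ["a.com/1"],
}
print(bfs_crawl(["a.com/1"], lambda u: graph.get(u, [])))
```

Note how three of the first four URLs crawled belong to `a.com`: this is the politeness problem the URL frontier is designed to solve.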

URL frontier

  • URL Frontier:

    • A data structure storing URLs to be downloaded.
    • Ensures politeness, URL prioritization, and freshness.
  • Politeness:

    • Avoids sending too many requests to the same server within a short period.
    • Prevents overwhelming web servers, which could otherwise be treated as a denial-of-service (DoS) attack.
    • Downloads one page at a time from the same host with delays between downloads.
    • Uses a mapping from hostnames to download threads, each with a separate FIFO queue.
  • Priority:

    • Not all URLs have equal importance.
    • Prioritizes URLs based on usefulness, such as PageRank, website traffic, update frequency.
    • The “Prioritizer” component handles URL prioritization.
    • Example: Prioritizes the Apple homepage over a random forum post about Apple products.
  • Freshness:

    • Web pages are frequently updated, added, or deleted.
    • Periodically recrawls downloaded pages to maintain data freshness.
    • Strategies for optimizing freshness:
      • Recrawl based on the update history of web pages.
      • Prioritize and recrawl important pages first and more frequently.
  • Storage for URL Frontier:

    • Real-world crawls can involve hundreds of millions of URLs.
    • Storing everything in memory is neither durable nor scalable.
    • Storing everything on disk is slow and can become a bottleneck.
    • A hybrid approach works well: keep the majority of URLs on disk, and use in-memory buffers for enqueue and dequeue operations, periodically writing the buffers to disk.
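
The politeness mechanism described above (a mapping from hostnames to separate FIFO queues, with a delay between downloads from the same host) can be sketched as follows. The class name, the delay value, and the lazy cooldown check are all illustrative assumptions, not the book's design:

```python
import time
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    """Minimal sketch of the frontier's politeness mapping: one FIFO
    queue per hostname, so each host serves one download at a time."""

    def __init__(self, delay_seconds=1.0):
        self.queues = {}                 # hostname -> deque of URLs
        self.last_fetch = {}             # hostname -> time of last download
        self.delay = delay_seconds

    def add(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return a URL from some host whose politeness delay has elapsed."""
        now = time.monotonic()
        for host, q in self.queues.items():
            if q and now - self.last_fetch.get(host, float("-inf")) >= self.delay:
                self.last_fetch[host] = now
                return q.popleft()
        return None                      # every host is empty or on cooldown
```

With a long delay, two URLs queued for the same host are not handed out back to back; the second waits while URLs for other hosts (or nothing) are returned instead.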

DNS resolver

  • DNS Resolver Bottleneck:

    • DNS requests are often slow due to their synchronous nature.
    • Response time ranges from 10ms to 200ms.
    • When a DNS request is made, other crawler threads may be blocked until the request is completed.
  • Optimization with DNS Cache:

    • Maintains a DNS cache to store domain name to IP address mappings.
    • Reduces the frequency of DNS calls, speeding up the crawling process.
    • DNS cache is updated periodically using cron jobs.
  • Detection and Avoidance of Problematic Content:

    • Redundant Content:

      • Nearly 30% of web pages are duplicates.
      • Use hashes or checksums to detect duplication.
    • Spider Traps:

      • A spider trap is a web page causing a crawler to loop infinitely.
      • Example: An infinitely deep directory structure (e.g., http://www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/ …).
      • Avoidance Strategies:
        • Set a maximum URL length.
        • Identify sites with unusually large numbers of discovered web pages.
        • Manual verification and identification.
        • Exclude problematic websites or apply customized URL filters.
    • Data Noise:

      • Low-value content such as advertisements, code snippets, spam URLs.
      • Exclude these contents to enhance crawler efficiency.
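
The DNS cache optimization above can be sketched as a hostname-to-IP map with a time-to-live. This is an assumed design: the class name and the lazy TTL-based expiry are illustrative (the notes describe cron jobs refreshing the cache periodically instead).

```python
import socket
import time

class DnsCache:
    """Sketch of a DNS cache: resolve a hostname once, then reuse the
    cached mapping until its TTL expires, avoiding repeated slow lookups."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.cache = {}                  # hostname -> (ip, resolved_at)

    def resolve(self, hostname):
        entry = self.cache.get(hostname)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]              # fresh cached mapping, no DNS call
        ip = socket.gethostbyname(hostname)   # slow synchronous lookup (10ms-200ms)
        self.cache[hostname] = (ip, now)
        return ip
```

Crawler threads that would otherwise block on a synchronous DNS request now hit the in-memory map on every repeat visit to a host.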
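
Two of the avoidance strategies above, content checksums for duplicate detection and a maximum URL length against spider traps, can be combined into one filter. The function name, the 2048-character cap, and the choice of SHA-256 are assumptions for illustration:

```python
import hashlib

MAX_URL_LENGTH = 2048      # assumed cap against infinitely deep spider-trap URLs

seen_checksums = set()

def should_index(url, html):
    """Skip over-long (likely spider-trap) URLs and pages whose content
    checksum has already been seen (roughly 30% of pages are duplicates)."""
    if len(url) > MAX_URL_LENGTH:
        return False
    checksum = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return False                     # duplicate content, ignore
    seen_checksums.add(checksum)
    return True
```

Comparing fixed-length checksums is far cheaper than comparing page bodies character by character, which is why hashes are the standard duplicate-detection tool here.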