• A system for downloading, storing and analysing web pages

  • Use cases

    • Search engine indexing
    • Web archiving
    • Web monitoring for copyright or trademark violation
  • Steps

    Given a list of seed URLs -> visit each URL -> store the web page -> extract URLs from the current page -> append the new URLs to the list of URLs to visit -> repeat (a minimal code sketch of this loop follows the characteristics list below)

  • Characteristics of a good web crawler

    • Should be scalable
    • Should be robust enough to handle poorly formatted HTML, malicious sites, crashes, etc.
    • Should be polite: making too many requests to a website in a very short time is effectively a denial-of-service attack on that site
    • Should be extensible for future changes
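
  • A minimal sketch of the crawl loop described in the Steps above. The helpers fetch(), store() and extract_links() are hypothetical stand-ins for the HTML fetcher, data storage and URL extractor components covered in the overview:

    from collections import deque

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl loop: fetch, store, extract links, enqueue unseen URLs."""
        frontier = deque(seed_urls)   # URL frontier
        visited = set()               # URL storage (already-visited URLs)

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            html = fetch(url)                        # hypothetical HTML fetcher
            if html is None:
                continue
            store(url, html)                         # hypothetical data storage
            visited.add(url)
            for link in extract_links(url, html):    # hypothetical URL extractor
                if link not in visited:
                    frontier.append(link)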

Overview

Seed URLs

  • Initial links that are used as the starting point of the crawling process
  • Choosing the right seed URLs affects how many web pages can ultimately be crawled

URL Frontier

  • A queue data structure that holds the URLs that have to be fetched and analysed
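  • A minimal sketch, assuming a plain FIFO frontier (the class name Frontier and its methods are illustrative; production frontiers add prioritisation and per-host politeness queues):

    from collections import deque

    class Frontier:
        """FIFO queue of URLs waiting to be fetched."""

        def __init__(self, seed_urls):
            self._queue = deque(seed_urls)

        def add(self, url):
            self._queue.append(url)

        def next_url(self):
            return self._queue.popleft() if self._queue else None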

HTML Fetcher

  • Downloads the web page pointed to by the URL handed out by the URL frontier
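  • A minimal sketch using Python's standard urllib (error handling is illustrative; a real fetcher also honours robots.txt and rate limits):

    import urllib.error
    import urllib.request

    def fetch(url, timeout=10):
        """Download the raw HTML for a URL; return None if the request fails."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, ValueError):
            return None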

DNS Resolver

  • Translates the hostname in a URL to the web server’s IP address
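  • A minimal sketch using the standard socket module (real crawlers cache resolved addresses, since repeated DNS lookups are a common bottleneck):

    import socket
    from urllib.parse import urlparse

    def resolve(url):
        """Resolve the URL's hostname to an IPv4 address string."""
        hostname = urlparse(url).hostname
        if hostname is None:
            raise ValueError(f"URL has no hostname: {url}")
        return socket.gethostbyname(hostname)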

HTML Parser

  • Checks the integrity of a web page’s data
  • Checks for poorly formatted HTML and malware

Duplicate Detection

  • Storing duplicate pages wastes storage space and slows down the system
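  • A minimal sketch that detects exact duplicates by hashing page content (the in-memory set is a hypothetical store; near-duplicate detection needs techniques such as SimHash):

    import hashlib

    seen_hashes = set()

    def is_duplicate(html):
        """Return True if an identical page body has already been stored."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False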

Cache

  • Improves the web crawler’s efficiency
  • Stores the most recently crawled URLs
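  • A minimal sketch of a fixed-size LRU cache of recently crawled URLs (the class and its capacity are illustrative):

    from collections import OrderedDict

    class RecentURLCache:
        """Keep the N most recently seen URLs, evicting the least recently used."""

        def __init__(self, capacity=10_000):
            self.capacity = capacity
            self._entries = OrderedDict()

        def __contains__(self, url):
            if url in self._entries:
                self._entries.move_to_end(url)   # mark as most recently used
                return True
            return False

        def add(self, url):
            self._entries[url] = True
            self._entries.move_to_end(url)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict the oldest entry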

Data Storage

  • The web page’s data is persisted in a storage system for later analysis

URL Extractor

  • Extracts URLs from the current HTML page
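  • A minimal sketch using the standard html.parser module; it collects absolute URLs from <a href="..."> tags (production extractors handle many more link sources):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect absolute link targets from anchor tags."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

  • Usage: create LinkExtractor(base_url), call feed(html) on it, then read its links list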

URL Filter

  • Filters out faulty or malicious URLs
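  • A minimal sketch, assuming a hypothetical domain blocklist and a whitelist of URL schemes:

    from urllib.parse import urlparse

    ALLOWED_SCHEMES = {"http", "https"}
    BLOCKED_DOMAINS = {"malware.example"}   # hypothetical blocklist

    def is_acceptable(url):
        """Reject URLs with unsupported schemes, missing hosts, or blocklisted domains."""
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
            return False
        return parsed.hostname not in BLOCKED_DOMAINS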

URL Loader or Detector

  • Filters out URLs that have already been visited
  • Bloom filters are commonly used (Why? they are far more space-efficient than hash tables, at the cost of occasional false positives: a few unvisited URLs may be wrongly reported as already seen and skipped)
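  • A minimal Bloom filter sketch (the bit-array size and the number of hash functions are illustrative; together they control the false-positive rate):

    import hashlib

    class BloomFilter:
        """Set membership with k hash positions per item over a fixed-size bit array."""

        def __init__(self, size_bits=1_000_000, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, url):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, url):
            """False means definitely not seen; True means probably seen (may be a false positive)."""
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))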

URL Storage

  • Keeps track of all the visited URLs
