• A system for downloading, storing and analysing web pages

  • Use cases

    • Search engine indexing
    • Web archiving
    • Web monitoring for copyright or trademark violation
  • Steps

    Given a list of seed URLs -> visit each URL -> store the web page -> extract URLs from the current page -> append the new URLs to the list of URLs to visit -> repeat (a minimal code sketch of this loop follows the characteristics list below)

  • Characteristics of a good web crawler

    • Should be scalable
    • Should be robust enough to handle poorly formatted HTML, malicious sites, crashes, etc.
    • Should be polite: making too many requests to a website in a very short time is effectively a denial-of-service attack on that site
    • Should be extensible for future changes
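
  • A minimal sketch of the crawl loop described in the Steps above. The helpers fetch(), store() and extract_links() are hypothetical stand-ins for the HTML fetcher, data storage and URL extractor components covered in the overview:

    from collections import deque

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl loop: fetch, store, extract links, enqueue unseen URLs."""
        frontier = deque(seed_urls)   # URL frontier
        visited = set()               # URL storage (already-visited URLs)

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            html = fetch(url)                        # hypothetical HTML fetcher
            if html is None:
                continue
            store(url, html)                         # hypothetical data storage
            visited.add(url)
            for link in extract_links(url, html):    # hypothetical URL extractor
                if link not in visited:
                    frontier.append(link)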

Overview

Seed URLs

  • Initial links that are used as the starting point of the crawling process
  • Choosing the right seed URLs affects how many web pages can ultimately be crawled

URL Frontier

  • A queue data structure that holds the URLs that have to be fetched and analysed
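  • A minimal sketch, assuming a plain FIFO frontier (the class name Frontier and its methods are illustrative; production frontiers add prioritisation and per-host politeness queues):

    from collections import deque

    class Frontier:
        """FIFO queue of URLs waiting to be fetched."""

        def __init__(self, seed_urls):
            self._queue = deque(seed_urls)

        def add(self, url):
            self._queue.append(url)

        def next_url(self):
            return self._queue.popleft() if self._queue else None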

HTML Fetcher

  • Downloads the web page pointed to by the URL handed out by the URL frontier
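  • A minimal sketch using Python's standard urllib (error handling is illustrative; a real fetcher also honours robots.txt and rate limits):

    import urllib.error
    import urllib.request

    def fetch(url, timeout=10):
        """Download the raw HTML for a URL; return None if the request fails."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, ValueError):
            return None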

DNS Resolver

  • Translates the hostname in a URL to the web server’s IP address
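  • A minimal sketch using the standard socket module (real crawlers cache resolved addresses, since repeated DNS lookups are a common bottleneck):

    import socket
    from urllib.parse import urlparse

    def resolve(url):
        """Resolve the URL's hostname to an IPv4 address string."""
        hostname = urlparse(url).hostname
        if hostname is None:
            raise ValueError(f"URL has no hostname: {url}")
        return socket.gethostbyname(hostname)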

HTML Parser

  • Checks the integrity of a web page’s data
  • Checks for poorly formatted HTML and malware

Duplicate Detection

  • Storing duplicate pages wastes storage space and slows down the system
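  • A minimal sketch that detects exact duplicates by hashing page content (the in-memory set is a hypothetical store; near-duplicate detection needs techniques such as SimHash):

    import hashlib

    seen_hashes = set()

    def is_duplicate(html):
        """Return True if an identical page body has already been stored."""
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False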

Cache

  • Improves the web crawler’s efficiency
  • Stores the most recently crawled URLs
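  • A minimal sketch of a fixed-size LRU cache of recently crawled URLs (the class and its capacity are illustrative):

    from collections import OrderedDict

    class RecentURLCache:
        """Keep the N most recently seen URLs, evicting the least recently used."""

        def __init__(self, capacity=10_000):
            self.capacity = capacity
            self._entries = OrderedDict()

        def __contains__(self, url):
            if url in self._entries:
                self._entries.move_to_end(url)   # mark as most recently used
                return True
            return False

        def add(self, url):
            self._entries[url] = True
            self._entries.move_to_end(url)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict the oldest entry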

Data Storage

  • The web page’s data is persisted in a storage system for later analysis

URL Extractor

  • Extracts URLs from the current HTML page
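  • A minimal sketch using the standard html.parser module; it collects absolute URLs from <a href="..."> tags (production extractors handle many more link sources):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect absolute link targets from anchor tags."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

  • Usage: create LinkExtractor(base_url), call feed(html) on it, then read its links list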

URL Filter

  • Filters out faulty or malicious URLs
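  • A minimal sketch, assuming a hypothetical domain blocklist and a whitelist of URL schemes:

    from urllib.parse import urlparse

    ALLOWED_SCHEMES = {"http", "https"}
    BLOCKED_DOMAINS = {"malware.example"}   # hypothetical blocklist

    def is_acceptable(url):
        """Reject URLs with unsupported schemes, missing hosts, or blocklisted domains."""
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
            return False
        return parsed.hostname not in BLOCKED_DOMAINS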

URL Loader or Detector

  • Filters out URLs that have already been visited
  • Bloom filters are commonly used (Why? they are far more space-efficient than hash tables, at the cost of occasional false positives: a few unvisited URLs may be wrongly reported as already seen and skipped)
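  • A minimal Bloom filter sketch (the bit-array size and the number of hash functions are illustrative; together they control the false-positive rate):

    import hashlib

    class BloomFilter:
        """Set membership with k hash positions per item over a fixed-size bit array."""

        def __init__(self, size_bits=1_000_000, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, url):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, url):
            """False means definitely not seen; True means probably seen (may be a false positive)."""
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))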

URL Storage

  • Keeps track of all the visited URLs
