System Design – Design a Web Crawler

Posted on 12/20/2023 (updated 12/28/2023) by user

Design a web crawler that fetches every page on en.wikipedia.org exactly once. You have 10,000 servers you can use, and you are not allowed to fetch a URL more than once. If a URL fails to be fetched (because of a timeout or server failure), it can be discarded.

Requirements

Functional

  • Download all URLs (about 6.7 million pages) from en.wikipedia.org
  • Only download each URL once
  • Use 10k 2-core servers
  • The only processing done on the content is extracting URLs; otherwise, persist the content to local storage
  • Don’t crawl images
  • Only crawl English Wikipedia
  • Minimize network traffic between servers.

Non-functional

  • No need to self-rate limit
  • Fetch the content as fast as possible

High-level Analysis

How much storage will we need to store 6,700,000 URLs and their HTML documents?

The average internet URL length is 66 characters. Since we don’t need to track the domain name or HTTPS prefix, we will round down to 60 characters.

60 characters ≈ 60 bytes

60 bytes × 6,700,000 URLs ≈ 402,000,000 bytes, or roughly 400 MB

The average HTML page is about 100 KB.

100 KB × 6,700,000 documents ≈ 670 GB

(from: https://leetcode.com/discuss/interview-question/723588/designing-a-distributed-web-crawler)
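
The same estimate as a quick script, using the page count and averages stated above (a back-of-envelope sketch, not a measurement):

NUM_PAGES = 6_700_000
AVG_URL_BYTES = 60        # ~66-character average URL, rounded down
AVG_PAGE_BYTES = 100_000  # ~100 KB per HTML page

url_storage = NUM_PAGES * AVG_URL_BYTES    # 402,000,000 bytes ≈ 400 MB
page_storage = NUM_PAGES * AVG_PAGE_BYTES  # 670,000,000,000 bytes ≈ 670 GB

print(f"URLs:  {url_storage / 1e6:.0f} MB")
print(f"Pages: {page_storage / 1e9:.0f} GB")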

URL Frontier

A URL Frontier acts as a queue system, effectively deciding the order in which pages are crawled.

Use a centralized database or a distributed system like Apache Cassandra to maintain the list of URLs to be crawled. This database should be designed to handle high read/write throughput. Implement a mechanism to ensure each URL appears only once in the frontier; this can be achieved with a hash table, a set, or a Bloom filter.
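
A minimal sketch of that uniqueness check, here as a toy in-memory Bloom filter (the bit-array size, hash count, and the try_enqueue helper are illustrative; a production frontier would keep this state in a shared store such as Redis or Cassandra):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive several bit positions per URL from independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()

def try_enqueue(url, frontier_queue):
    # May report a false positive (skipping a URL it has never seen),
    # but never a false negative, so no URL is enqueued twice.
    if seen.might_contain(url):
        return False
    seen.add(url)
    frontier_queue.append(url)
    return True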

Here are the key components and considerations in designing a URL Frontier:

  1. Queue Management:
    • Multiple queues can be used to prioritize URLs based on certain criteria (like domain, page importance, update frequency).
    • Priority queue algorithms can ensure that more important URLs are crawled first.
  2. URL Deduplication:
    • Use a distributed cache or a bloom filter to store visited URLs efficiently and check against this before crawling a URL.
    • Can be implemented using hash tables, bloom filters, or other data structures that efficiently check if a URL has already been added.
  3. URL Normalization:
    • Standardizes URLs to a consistent format, ensuring that different forms of the same URL are recognized as identical.
    • Includes converting to lowercase, removing “www,” and handling redirects (a normalization sketch follows this list).
  4. Politeness Policy:
    • Ensures the crawler respects site-specific constraints, like crawl-delay defined in robots.txt.
    • Helps in maintaining a good relationship with web servers and avoiding IP bans.
  5. Robots.txt Compliance:
    • The Frontier must check robots.txt files and filter out URLs disallowed by them.
    • This requires fetching and parsing robots.txt for each domain.
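
A sketch of the normalization step from item 3, using the standard library’s urllib.parse; the specific rules shown (lowercasing, dropping fragments and default ports) are common choices rather than Wikipedia-specific requirements:

from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname comes back lowercased
    # Keep a port only if it is non-default, so host:80 and host compare equal.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Fragments (#section) never change the fetched document, so drop them.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

print(normalize("HTTPS://EN.Wikipedia.org:443/wiki/Web_crawler#History"))
# -> https://en.wikipedia.org/wiki/Web_crawler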

DNS Resolver

A DNS resolver is an important component of a web crawler, especially one that performs frequent DNS lookups. DNS (Domain Name System) resolution translates a domain name (like www.example.com) into the IP address used to make network requests.
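
Because this crawl stays on a single domain, resolving en.wikipedia.org once and reusing the answer avoids repeating the lookup for every request. A minimal caching wrapper around the standard library’s socket.gethostbyname (the fixed TTL is an assumption; real resolvers honor the record’s TTL, and a large crawl would more likely run a local caching DNS server):

import socket
import time

_cache = {}        # hostname -> (ip_address, expiry_timestamp)
TTL_SECONDS = 300  # illustrative TTL

def resolve(hostname: str) -> str:
    ip, expires = _cache.get(hostname, (None, 0))
    if ip and time.time() < expires:
        return ip
    ip = socket.gethostbyname(hostname)
    _cache[hostname] = (ip, time.time() + TTL_SECONDS)
    return ip

print(resolve("en.wikipedia.org"))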

Crawler Engine (Coordinator)

The crawler coordinator works as a scheduler: it fetches URLs from the URL Frontier and creates visit tasks for the worker nodes. It puts all visit tasks onto a distributed message queue so that worker nodes can pull URLs and carry out the visits.

Message Queue

Implement a load balancer or a distributed queue system (like Kafka) to evenly distribute URLs among crawler worker nodes. Each node independently pulls URL requests from the message queue.
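
A sketch of that hand-off, assuming the kafka-python client, a broker at localhost:9092, and a topic named crawl-tasks (all illustrative choices, not part of the original design):

from kafka import KafkaProducer, KafkaConsumer

# Coordinator side: publish a visit task for each URL taken from the frontier.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("crawl-tasks", b"https://en.wikipedia.org/wiki/Web_crawler")
producer.flush()

# Worker side: every worker joins the same consumer group, so the topic's
# partitions (and therefore the tasks) are spread across the worker fleet.
consumer = KafkaConsumer(
    "crawl-tasks",
    bootstrap_servers="localhost:9092",
    group_id="crawler-workers",
    auto_offset_reset="earliest",
)
for message in consumer:
    url = message.value.decode()
    # fetch(url), extract links, persist content ...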

Crawler Instances (Worker Nodes)

Deploy multiple crawler nodes (servers) that can operate independently. Each node is responsible for fetching web pages, extracting links, and passing new URLs back to a central system.

Since the crawl is restricted to “en.wikipedia.org”, configure the crawler to ignore links that lead outside this domain. Each node should use concurrent requests to maximize efficiency but also implement rate limiting to comply with Wikipedia’s policies and avoid overloading their servers. Ensure the crawler respects the directives in Wikipedia’s robots.txt file. Adhere to legal and ethical norms, especially regarding data usage and storage.

If a URL fails to be fetched due to timeout or server failure, mark it as ‘failed’ in the database and do not attempt to refetch. Implement a heartbeat system where each node regularly updates its status. If a node fails, its queue of URLs can be returned to the central database for redistribution.

Each node should parse the HTML content of fetched pages to extract new URLs.
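
A standard-library sketch of that fetch-and-extract step; the User-Agent string and timeout are illustrative:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep only links that stay on en.wikipedia.org.
        if urlsplit(absolute).hostname == "en.wikipedia.org":
            self.links.add(absolute)

def fetch_and_extract(url):
    request = urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return html, parser.links

The fetched HTML is then persisted to storage and the extracted links are handed to the URL filter described below.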

URL Filter

The URL filter receives URLs parsed during content fetching, filters out any URLs that lead outside of “en.wikipedia.org” or duplicate already discovered URLs, and saves the remaining URLs into the URL store.

Content & Metadata Storage

Store the fetched content in a distributed file system or a big data storage solution like Hadoop HDFS or Amazon S3. Store metadata (like URL, fetch status, timestamp) in a distributed database for easy access and management. Check the metadata store before saving content to the content store to avoid duplication (a sketch follows the list below).

  • Employ a distributed database system to handle large volumes of data.
  • Use database sharding to distribute data across multiple servers.
  • Implement efficient indexing strategies for quick retrieval of stored data.
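
A sketch of the metadata check mentioned above; metadata_store and content_store are placeholders for whichever database and object store (for example Cassandra and S3) the design settles on, and their get/put interface is assumed:

import hashlib
import time

def save_page(url, html, metadata_store, content_store):
    key = hashlib.sha256(url.encode()).hexdigest()
    if metadata_store.get(key):      # already fetched: skip the duplicate write
        return False
    content_store.put(key, html)
    metadata_store.put(key, {"url": url, "status": "fetched", "ts": time.time()})
    return True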

Monitoring and Logging

  • Monitoring System: Set up a monitoring system to track the progress, performance metrics, and error rates of the crawl.
  • Logging: Log detailed information about the crawling process for debugging and analysis purposes.

Ethical Considerations

  • Responsible Crawling: Always prioritize the stability and performance of Wikipedia’s servers over the crawler’s data collection speed.
  • Transparency: Be transparent about the crawler’s activities and purpose, especially if the data collected will be used for public or commercial purposes.
  • Compliance with Legal and Ethical Standards: Ensure that the crawling activities comply with legal requirements and ethical standards, respecting user privacy and copyright laws.

robots.txt File

Here is a simple example of what a robots.txt file might look like:

User-agent: *
Disallow: /private/
Disallow: /temp/
Allow: /

Sitemap: http://www.example.com/sitemap.xml
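
A crawler can check URLs against such a file with the standard library’s urllib.robotparser (the user-agent string here is illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# True if this user agent is allowed to fetch the given URL.
print(rp.can_fetch("example-crawler/0.1", "https://en.wikipedia.org/wiki/Web_crawler"))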

Handling Retries for Failed/Errored URLs

  1. Retry Logic:
    • Implement a retry mechanism for transient errors (like network issues or server errors). The number of retries can be limited to avoid endless loops.
  2. Exponential Backoff:
    • Use an exponential backoff strategy for retries, gradually increasing the wait time between retries to reduce load on the server and network (a sketch follows this list).
  3. Error Categorization:
    • Categorize errors to decide which ones are worth retrying. For example, a 404 Not Found might not be retried, but a 503 Service Unavailable might be.
  4. Logging and Monitoring:
    • Log failures and monitor error rates to identify and address systemic issues.
  5. Feedback Loop:
    • Use feedback from the Content Fetcher to update the URL Frontier about the status of each URL, ensuring that the system is aware of which URLs need to be retried.
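
A sketch of items 1–3 combined: retry only errors that are likely transient, with exponentially growing waits. The status-code set, retry limit, and delays are illustrative choices:

import time
import urllib.error
import urllib.request

RETRYABLE = {429, 500, 502, 503, 504}   # e.g. retry a 503, but not a 404

def fetch_with_retries(url, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == max_retries:
                raise                            # permanent error, or out of retries
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_retries:           # network-level failure (timeout, DNS)
                raise
        time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...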
