Distributed Web Crawling over DHTs

Boon Thau Loo, Sailesh Krishnamurthy and Owen Cooper

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-04-1305
2004

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2004/CSD-04-1305.pdf

In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balance on both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.
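
As a rough illustration of how a hash-based scheme can distribute crawl work among nodes, the sketch below assigns each URL to a crawler by hashing a partitioning key. It is not taken from the report: the function names, the choice of SHA-1, and the use of a simple modulo over a fixed node count are all assumptions made for illustration (a real DHT such as Bamboo maps keys onto a dynamic identifier space rather than a fixed list of nodes). Two plausible keys are shown, the full URL and the hostname alone.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a string key to a node index (hypothetical stand-in for a DHT lookup)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def assign_by_url(url: str, num_nodes: int) -> int:
    """Partition on the full URL: pages of one site spread across many crawlers."""
    return node_for_key(url, num_nodes)


def assign_by_host(url: str, num_nodes: int) -> int:
    """Partition on the hostname: every page of one site goes to the same crawler."""
    host = urlsplit(url).hostname or ""
    return node_for_key(host, num_nodes)


if __name__ == "__main__":
    urls = [
        "http://example.com/a.html",
        "http://example.com/b.html",
        "http://example.org/index.html",
    ]
    for u in urls:
        print(u, assign_by_url(u, 8), assign_by_host(u, 8))
```

Under these assumptions, keying on the full URL tends to spread work more evenly across crawlers, while keying on the hostname keeps all requests to one site on a single crawler, which makes per-site rate limiting easier to enforce. This is the kind of trade-off between crawler load and crawl-target load that the abstract refers to.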


BibTeX citation:

@techreport{Loo:CSD-04-1305,
    Author = {Loo, Boon Thau and Krishnamurthy, Sailesh and Cooper, Owen},
    Title = {Distributed Web Crawling over DHTs},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2004},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2004/5370.html},
    Number = {UCB/CSD-04-1305},
    Abstract = {In this paper, we present the design and implementation of a distributed web crawler. We begin by motivating the need for such a crawler as a basic building block for decentralized web search applications. The distributed crawler harnesses the excess bandwidth and computing resources of clients to crawl the web. Nodes participating in the crawl use a Distributed Hash Table (DHT) to coordinate and distribute work. We study different crawl distribution strategies and investigate the trade-offs among communication overhead, crawl throughput, load balance on both the crawlers and the crawl targets, and the ability to exploit network proximity. We present an implementation of the distributed crawler using PIER, a relational query processor that runs over the Bamboo DHT, and compare different crawl strategies on PlanetLab, querying live web sources.}
}

EndNote citation:

%0 Report
%A Loo, Boon Thau
%A Krishnamurthy, Sailesh
%A Cooper, Owen
%T Distributed Web Crawling over DHTs
%I EECS Department, University of California, Berkeley
%D 2004
%@ UCB/CSD-04-1305
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2004/5370.html
%F Loo:CSD-04-1305