Web Crawling

Christopher Olston; Marc Najork

doi:10.1561/1500000017

Foundations and Trends® in Information Retrieval, Vol. 4, Issue 3

By Christopher Olston, Yahoo! Research, USA, olston@yahoo-inc.com | Marc Najork, Microsoft Research, USA, najork@microsoft.com

Suggested Citation

Christopher Olston and Marc Najork (2010), "Web Crawling", Foundations and Trends® in Information Retrieval: Vol. 4: No. 3, pp 175-246. http://dx.doi.org/10.1561/1500000017

Publication Date: 12 Feb 2010

Subjects

Databases on the Web

Abstract

This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.

Book details

ISBN: 978-1-60198-323-7

80 pp. $100.00

Table of contents:

1: Introduction

2: Crawler Architecture

3: Crawl Ordering Problem

4: Batch Crawl Ordering

5: Incremental Crawl Ordering

6: Avoiding Problematic and Undesirable Content

7: Deep Web Crawling

8: Future Directions

References

Web Crawling

The magic of search engines starts with crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. Web Crawling outlines the key scientific and practical challenges, describes the state-of-the-art models and solutions, and highlights avenues for future work. It is intended for anyone who wishes to understand or develop crawler software, or to conduct research related to crawling.
