Foundations and Trends® in Databases > Vol 2 > Issue 4

Approximate String Processing

By Marios Hadjieleftheriou, AT&T Labs - Research, USA, marioh@research.att.com | Divesh Srivastava, AT&T Labs - Research, USA, divesh@research.att.com

 
Suggested Citation
Marios Hadjieleftheriou and Divesh Srivastava (2011), "Approximate String Processing", Foundations and Trends® in Databases: Vol. 2: No. 4, pp 267-402. http://dx.doi.org/10.1561/1900000010

Publication Date: 22 Feb 2011
© 2011 M. Hadjieleftheriou and D. Srivastava
 
Subjects
Approximate and Interactive Query Processing
 

Free Preview:

Download extract

Share

Download article
In this article:
1. Introduction 
2. String Similarity Functions 
3. String Tokenization 
4. Query Types 
5. Index Structures 
6. Algorithms for Set Based Similarity Using Inverted Indexes 
7. Algorithms for Set Based Similarity Using Filtering Techniques 
8. Algorithms for Edit Based Similarity 
9. Conclusion 
Acknowledgments 
References 

Abstract

One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string υ, the goal is to find all strings s ε S that have a user specified degree of similarity to υ. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.

DOI:10.1561/1900000010
ISBN: 978-1-60198-418-0
144 pp. $95.00
Buy book (pb)
 
ISBN: 978-1-60198-419-7
144 pp. $100.00
Buy E-book (.pdf)
Table of contents:
1. Introduction
2. String Similarity Functions
3. String Tokenization
4. Query Types
5. Index Structures
6. Algorithms for Set Based Similarity Using Inverted Indexes
7. Algorithms for Set Based Similarity Using Filtering Techniques
8. Algorithms for Edit Based Similarity
9. Conclusion
Acknowledgements
References

Approximate String Processing

One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text.

Approximate String Processing focuses specifically on the problem of approximate string matching and surveys indexing techniques and algorithms specifically designed for this purpose. It concentrates on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. The focus is on all-match and top-k flavors of selection and join queries, and it discusses the applicability, advantages and disadvantages of each technique for every query type.

Approximate String Processing is organized into nine chapters. Sandwiched between the Introduction and Conclusion, Chapters 2 to 5 discuss in detail the fundamental primitives that characterize any approximate string matching indexing technique. The next three chapters, 6 to 9, are dedicated to specialized indexing techniques and algorithms for approximate string matching.

 
DBS-010