
Trends in Cleaning Relational Data: Consistency and Deduplication

By Ihab F. Ilyas, University of Waterloo, Canada, ilyas@uwaterloo.ca | Xu Chu, University of Waterloo, Canada, x4chu@uwaterloo.ca

 
Suggested Citation
Ihab F. Ilyas and Xu Chu (2015), "Trends in Cleaning Relational Data: Consistency and Deduplication", Foundations and Trends® in Databases: Vol. 5: No. 4, pp 281-393. http://dx.doi.org/10.1561/1900000045

Publication Date: 30 Oct 2015
© 2015 I. F. Ilyas and X. Chu
 
Subjects
Data Cleaning and Information Extraction,  Data Integration and Exchange
 

In this article:
1. Introduction
2. Taxonomy of Anomaly Detection Techniques
3. Taxonomy of Data Repairing Techniques
4. Big Data Cleaning
5. Conclusion
References

Abstract

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012. To detect data errors, data quality rules or integrity constraints (ICs) have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, which is also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms with these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms. In this paper, we discuss the main facets and directions in designing error detection and repairing techniques. We propose a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. We also propose a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. We conclude by highlighting current trends in “big data” cleaning.

DOI:10.1561/1900000045
ISBN: 978-1-68083-022-4
124 pp. $65.00
ISBN: 978-1-68083-023-1
124 pp. $125.00

Trends in Cleaning Relational Data

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. According to a report by InsightSquared in 2012, poor data across businesses and the government costs the United States economy $3.1 trillion a year.

To detect data errors, data quality rules or integrity constraints (ICs) have been proposed as a declarative way to describe legal or correct data instances. Any subset of data that does not conform to the defined rules is considered erroneous, which is also referred to as a violation. Various kinds of data repairing techniques with different objectives have been introduced, where algorithms are used to detect subsets of the data that violate the declared integrity constraints, and even to suggest updates to the database such that the new database instance conforms with these constraints. While some of these algorithms aim to minimally change the database, others involve human experts or knowledge bases to verify the repairs suggested by the automatic repairing algorithms.
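As a hypothetical illustration of the detect-then-repair workflow described above (the data, column names, and repair heuristic here are invented, not taken from the monograph), consider a functional dependency ZIP → City, one common class of integrity constraint. Tuples sharing a ZIP value but disagreeing on City form a violation, and a minimal-change heuristic can suggest updating the minority values to the most frequent one in each group:

```python
# Hypothetical sketch: detect violations of the functional dependency
# ZIP -> City, then suggest minimal-change repairs by majority vote.
# Data and column names are invented for illustration.
from collections import Counter, defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    return [g for g in groups.values() if len({r[rhs] for r in g}) > 1]

def suggest_repairs(violations, rhs):
    """Minimal-change heuristic: in each violating group, propose setting
    rhs to its most frequent value for every row that deviates."""
    repairs = []
    for group in violations:
        target = Counter(r[rhs] for r in group).most_common(1)[0][0]
        repairs.extend((r, target) for r in group if r[rhs] != target)
    return repairs

records = [
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madiso"},   # likely a typo: violates ZIP -> City
    {"zip": "60601", "city": "Chicago"},
]
bad = fd_violations(records, "zip", "city")     # one group of three rows
fixes = suggest_repairs(bad, "city")            # propose "Madiso" -> "Madison"
```

Note that the majority-vote repair is only one possible update model; as the survey discusses, other approaches defer to human experts or external knowledge bases rather than trusting frequency alone.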

Trends in Cleaning Relational Data: Consistency and Deduplication discusses the main facets and directions in designing error detection and repairing techniques. It proposes a taxonomy of current anomaly detection techniques, including error types, the automation of the detection process, and error propagation. It also sets out a taxonomy of current data repairing techniques, including the repair target, the automation of the repair process, and the update model. It concludes by highlighting current trends in “big data” cleaning.

 
DBS-045