Foundations and Trends® in Databases > Vol 8 > Issue 4

Algorithmic Aspects of Parallel Data Processing

By Paraschos Koutris, University of Wisconsin-Madison, USA, paris@cs.wisc.edu | Semih Salihoglu, University of Waterloo, Canada, semih.salihoglu@uwaterloo.ca | Dan Suciu, University of Washington, USA, suciu@cs.washington.edu

 
Suggested Citation
Paraschos Koutris, Semih Salihoglu and Dan Suciu (2018), "Algorithmic Aspects of Parallel Data Processing", Foundations and TrendsĀ® in Databases: Vol. 8: No. 4, pp 239-370. http://dx.doi.org/10.1561/1900000055

Publication Date: 22 Feb 2018
© 2018 P. Koutris, S. Salihoglu and D. Suciu
 
Subjects
Parallel and Distributed Database Systems,  Query Processing and Optimization
 

Free Preview:

Download extract

Share

Download article
In this article:
1. Introduction
2. Models of Parallel Computation
3. Two-way Join
4. Multiway Joins
5. Sorting
6. Matrix Multiplication
7. Conclusion
References

Abstract

In the last decade or so we have witnessed a growing interest in processing large data sets on large distributed clusters. The idea was pioneered by the MapReduce framework, and has been widely adopted by several other systems, including PigLatin, Hive, Scope, U-SQL, Dremmel, Spark and Myria. A large part of the complex data analysis performed by these systems consists of a sequence of relatively simple query operations, such as joining two or more tables. This survey discusses recent algorithmic developments for distributed data processing. It uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, which is a simplification of the BSP model where the only cost is given by the amount of communication and the number of communication rounds. The survey studies several algorithms for multi-join queries, for sorting, and for matrix multiplication, and discusses their relationships and common techniques applied across the different data processing tasks.

DOI:10.1561/1900000055
ISBN: 978-1-68083-406-2
144 pp. $95.00
Buy book (pb)
 
ISBN: 978-1-68083-407-9
144 pp. $140.00
Buy E-book (.pdf)
Table of contents:
1. Introduction
2. Models of Parallel Computation
3. Two-way Join
4. Multiway Joins
5. Sorting
6. Matrix Multiplication
7. Conclusion
References

Algorithmic Aspects of Parallel Data Processing

The last decade has seen a huge and growing interest in processing large data sets on large distributed clusters. This trend began with the MapReduce framework, and has been widely adopted by several other systems, including PigLatin, Hive, Scope, Dremmel, Spark and Myria to name a few. While the applications of such systems are diverse (for example, machine learning, data analytics), most involve relatively standard data processing tasks like identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. This has generated great interest in the study of algorithms for data processing on large distributed clusters.

Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing. It uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, which is a simplification of the BSP model where the only cost is given by the amount of communication and the number of communication rounds. The survey studies several algorithms for multi-join queries, sorting, and matrix multiplication. It discusses their relationships and common techniques applied across the different data processing tasks.

 
DBS-055