By Bas Ketsman, Vrije Universiteit Brussel, Belgium, bas.ketsman@vub.be | Paraschos Koutris, University of Wisconsin-Madison, USA, paris@cs.wisc.edu
Recent years have seen a resurgence of interest in Datalog from both industry and the research community. Datalog is a declarative query language that extends relational algebra with recursion. It has been used to express a wide spectrum of modern data management tasks, such as data integration, declarative networking, graph analysis, business analytics, and program analysis. The result of this long line of research is a plethora of Datalog engines, which support different variants of Datalog and have different technical specifications and capabilities. In this monograph, we provide an overview of the architecture and technical characteristics of these Datalog engines. We identify common architectural decisions and evaluation methods, as well as the data structures and layouts used to speed up query execution. We also discuss how Datalog engines differ when they specialize for workloads with different characteristics (for example, data analytics vs. program analysis vs. graph analysis). One particular focus is how modern Datalog engines scale to massively parallel environments.
Recent years have seen a resurgence of interest in Datalog from both industry and the research community. Datalog is a declarative query language that extends relational algebra with recursion. It is used to express a wide spectrum of modern data management tasks, such as data integration, declarative networking, graph analysis, business analytics, and program analysis. The result of this long line of research is a plethora of Datalog engines that support different variants of Datalog and have different technical specifications and capabilities.
In this monograph, the authors provide an overview of the architecture and technical characteristics of these Datalog engines. They identify common architectural decisions and evaluation methods, as well as the data structures and layouts used to speed up query execution. They also discuss how Datalog engines differ when they specialize for workloads with different characteristics. A particular focus of the monograph is how modern Datalog engines scale to massively parallel environments, which is necessary for processing very large datasets. The authors conclude with directions for future research and possible new applications for Datalog engines.