Foundations and Trends® in Databases > Vol 15 > Issue 2

Analytical Queries for Unstructured Data

By Daniel Kang, University of Illinois Urbana-Champaign, USA, ddkang@illinois.edu

 
Suggested Citation
Daniel Kang (2025), "Analytical Queries for Unstructured Data", Foundations and Trends® in Databases: Vol. 15: No. 2, pp 115-196. http://dx.doi.org/10.1561/1900000087

Publication Date: 27 Oct 2025
© 2025 D. Kang
 
Subjects
Approximate and interactive query processing, Data models and query languages, Database design and tuning, Storage, access methods, and indexing, Applications and case studies
 

In this article:
1. Introduction
2. Background
3. Architecture
4. Expressing Queries
5. General Query Optimization
6. Approximate Queries with ML
7. Proxies, Indexes, and Storage
8. Efficient Query Execution
9. Video Queries
10. Text and Semi-structured Queries
11. Open Challenges
References

Abstract

Unstructured data, in the form of text, images, video, and audio, is being produced at exponentially increasing rates. In tandem, machine learning (ML) methods have become increasingly powerful at analyzing unstructured data. Modern ML methods can now detect objects in images, understand actions in videos, and even classify complex legal texts based on legal intent. Combined, these trends make it increasingly feasible for analysts and researchers to automatically understand the “real world.” However, there are major challenges in deploying these techniques: 1) executing queries efficiently given the expense of ML methods, 2) expressing queries over bespoke forms of data, and 3) handling errors in ML methods.

In this monograph, we discuss challenges and advances in data management systems for unstructured data using ML, with a particular focus on video analytics. Using ML to answer queries introduces new challenges. First, even turning user intent into queries can be challenging: it is not obvious how to express a query of the form “select instances of cars turning left.” Second, running ML models can be orders of magnitude more expensive than processing traditional structured data. Third, ML models and the methods to accelerate analytics with ML models can be error-prone.

Recent work in the data management community has aimed to address all of these challenges. Users can now express queries via user-defined functions, opaquely through standard structured schemas, and even by providing examples. Given a query, recent work focuses on optimizing queries by approximating expensive “gold” methods with varying levels of guarantees. Finally, to handle errors in ML models, recent work has focused on applying outlier and drift detection to data analytics with ML.

DOI:10.1561/1900000087
ISBN (paperback): 978-1-63828-643-1, 94 pp., $70.00
ISBN (e-book, .pdf): 978-1-63828-642-4, 94 pp., $160.00

DBS-087