
The Foundations for Provenance on the Web

By Luc Moreau, University of Southampton, UK, L.Moreau@ecs.soton.ac.uk

 
Suggested Citation
Luc Moreau (2010), "The Foundations for Provenance on the Web", Foundations and Trends® in Web Science: Vol. 2: No. 2–3, pp 99-241. http://dx.doi.org/10.1561/1800000010

Publication Date: 29 Oct 2010
© 2010 L. Moreau
 
Subjects
Trust and provenance,  Privacy,  Identity
 

In this article:
1 Introduction 
2 Analysis of the Provenance Literature 
3 Definition of Provenance 
4 Provenance in Workflows and Databases 
5 The Open Provenance Vision 
6 Provenance, the Web and the Semantic Web 
7 Accountability 
8 Conclusion 
Acknowledgments 
Provenance Bibliography 
References 

Abstract

Provenance, i.e., the origin or source of something, is becoming an important concern, since it offers the means to verify data products, to infer their quality, to analyse the processes that led to them, and to decide whether they can be trusted. For instance, provenance enables the reproducibility of scientific results; it is necessary to track attribution and credit in curated databases; and it is essential for reasoners to make trust judgements about the information they use over the Semantic Web.

As the Web allows information sharing, discovery, aggregation, filtering and flow in an unprecedented manner, it also becomes very difficult to identify, reliably, the original source that produced an information item on the Web. Since the emerging use of provenance in niche applications is undoubtedly demonstrating the benefits of provenance, this monograph contends that provenance can and should reliably be tracked and exploited on the Web, and investigates the necessary foundations to achieve such a vision.

Multiple data sources have been used to compile the largest bibliographical database on provenance so far. This large corpus permits the analysis of emerging trends in the research community. Specifically, the CiteSpace tool identifies clusters of papers that constitute research fronts, from which characteristics are extracted to structure a foundational framework for provenance on the Web. Such an endeavour calls for a multi-disciplinary approach, since it requires contributions from many computer science sub-disciplines and also from non-technical fields, given the human challenge that is anticipated.

To develop such a vision, it is necessary to provide a definition of provenance that applies to the Web context. A conceptual definition of provenance is expressed in terms of processes, and is shown to generalise various definitions of provenance commonly encountered. Furthermore, by introducing realistic distributed-systems assumptions, this definition is refined as a query over assertions made by applications.

Given that the majority of work on provenance has been undertaken by the database, workflow and e-science communities, some of their work is reviewed, contrasting approaches and focusing on topics believed to be crucial for bringing provenance to the Web, such as abstraction, collections, storage, queries, workflow evolution, semantics and activities involving human interactions.

However, provenance approaches developed in the context of databases and workflows essentially deal with closed systems. By that, it is meant that workflow or database management systems are in full control of the data they manage, and track their provenance within their own scope, but not beyond. In the context of the Web, a broader approach is required by which chunks of provenance representation can be brought together to describe the provenance of information flowing across multiple systems. For this purpose, this monograph puts forward the Open Provenance Vision, which is an approach that consists of controlled vocabulary, serialisation formats and interfaces to allow the provenance of individual systems to be expressed, connected in a coherent fashion, and queried seamlessly. In this context, the Open Provenance Model is an emerging community-driven representation of provenance, which has been actively used by some 20 teams to exchange provenance information, in line with the Open Provenance Vision.
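
As a rough illustration of that idea, the sketch below merges provenance assertions made by two separate systems and queries the result as a single graph. The edge names are the Open Provenance Model's dependency kinds; the function names `merge` and `provenance`, the tuple layout and the sample identifiers are hypothetical and are not taken from the monograph.

```python
from collections import defaultdict

# Dependency kinds of the Open Provenance Model, each read as effect -> cause.
OPM_EDGES = {
    "used",
    "wasGeneratedBy",
    "wasDerivedFrom",
    "wasTriggeredBy",
    "wasControlledBy",
}


def merge(*assertion_sets):
    """Combine (effect, kind, cause) assertions from several systems.

    Hypothetical helper: the shared vocabulary is what lets independently
    produced assertions be connected into one graph.
    """
    graph = defaultdict(set)
    for assertions in assertion_sets:
        for effect, kind, cause in assertions:
            if kind not in OPM_EDGES:
                raise ValueError(f"unknown edge kind: {kind}")
            graph[effect].add((kind, cause))
    return graph


def provenance(graph, item):
    """Return every node that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for _kind, cause in graph.get(node, ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


# Assertions recorded by a (hypothetical) workflow system ...
workflow = [
    ("report.pdf", "wasGeneratedBy", "render"),
    ("render", "used", "table.csv"),
]
# ... and by a (hypothetical) database system that produced its input.
database = [
    ("table.csv", "wasGeneratedBy", "export"),
    ("export", "used", "db:row42"),
]

graph = merge(workflow, database)
print(sorted(provenance(graph, "report.pdf")))
```

Neither system alone can trace `report.pdf` back to `db:row42`; the trace only becomes possible once both sets of assertions use the same vocabulary and refer to `table.csv` by the same identifier.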

After identifying an open approach and a model for provenance, techniques to expose provenance over the Web are investigated. In particular, Semantic Web technologies are discussed since they have been successfully exploited to express, query and reason over provenance. Symmetrically, Semantic Web technologies such as RDF, underpinning the Linked Data effort, are analysed since they pose their own difficulties with respect to provenance.

A powerful argument for provenance is that it can help make systems transparent, so that it becomes possible to determine whether a particular use of information is appropriate under a set of rules. Such capability helps make systems and information accountable. To offer accountability, provenance itself must be authentic, and rely on security approaches, which are described in the monograph. This is then followed by systems where provenance is the basis of an auditing mechanism to check past processes against rules or regulations. In practice, not all users want to check and audit provenance; instead, they may rely on measures of quality or trust. Hence, emerging provenance-based approaches to compute trust and quality of data are reviewed.

DOI: 10.1561/1800000010
Paperback: ISBN 978-1-60198-386-2, 160 pp., $99.00
E-book (.pdf): ISBN 978-1-60198-387-9, 160 pp., $150.00

The Foundations for Provenance on the Web

As the Web allows information sharing, discovery, aggregation, filtering and flow in an unprecedented manner, it also becomes very difficult to identify, reliably, the original source that produced an information item on the Web. Hence, provenance, i.e., the origin or source of something, is becoming an important concern, since it offers the means to verify data products, to infer their quality, to analyse the processes that led to them, and to decide whether they can be trusted. For instance, provenance enables the reproducibility of scientific results; it is necessary to track attribution and credit in curated databases; and it is essential for reasoners to make trust judgements about the information they use over the Semantic Web. Since the emerging use of provenance in niche applications is undoubtedly demonstrating benefits, this survey contends that provenance can and should reliably be tracked and exploited on the Web. The Foundations for Provenance on the Web is aimed at anyone who discovers or publishes information on the Web, and who cares about its origin and its quality. Based on an analysis of literature, this survey puts forward the Open Provenance Vision, a visionary but pragmatic, integrative conceptual framework allowing the provenance of information to be expressed, tracked, and queried seamlessly, as it crosses information systems across the Web. Some foundational work has already resulted in significant advances in semantics, data models and systems, which can underpin this vision. However, some shortcomings inevitably exist and are discussed. For this vision to succeed, a multi-disciplinary approach is needed, since it requires contributions from many computer science sub-disciplines and also from non-technical fields, given the human challenge that is anticipated.

 

Provenance Bibliography | 1800000010_Provenance_Survey.bib

The provenance bibliographical database was compiled using multiple sources: the author's own original database; the ACM, IEEE and Springer digital libraries; the DBLP computer science bibliography; and the programmes of some provenance-specific events, such as the International Provenance and Annotation Workshops (IPAW'06, IPAW'08) and the Workshop on Theory and Practice of Provenance (TAPP'09). Each publication is recorded together with its abstract and the explicit list of publications it cites. A transitive closure of citations is then applied to ensure that every cited paper whose title contains the word "provenance" or "lineage" is included in the database (provided it is a Computer Science paper).
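
The closure step described above can be sketched as follows. This is only an illustration of the idea, not the procedure actually used to build the database: the function name `citation_closure`, the dictionary layout, the `is_computer_science` predicate (which by default accepts every paper) and the sample entries are all assumptions.

```python
KEYWORDS = ("provenance", "lineage")


def citation_closure(seeds, cites, titles, is_computer_science=lambda key: True):
    """Grow a set of papers by following citations transitively.

    seeds  -- keys of the papers already in the database
    cites  -- mapping from a paper key to the keys of the papers it cites
    titles -- mapping from a paper key to its title
    A cited paper is added only if its title contains one of KEYWORDS
    and the (assumed) predicate says it is a Computer Science paper.
    """
    included = set(seeds)
    frontier = list(seeds)
    while frontier:
        paper = frontier.pop()
        for cited in cites.get(paper, ()):
            if cited in included:
                continue
            title = titles.get(cited, "").lower()
            if any(word in title for word in KEYWORDS) and is_computer_science(cited):
                included.add(cited)
                frontier.append(cited)  # its own citations are followed too
    return included


# Made-up entries, for illustration only.
titles = {
    "a": "A survey of e-science systems",
    "b": "Data provenance in scientific workflows",
    "c": "Workflow scheduling on grids",
    "d": "Lineage tracing for data warehouses",
    "e": "Provenance for scheduling decisions",
}
cites = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}

print(sorted(citation_closure({"a"}, cites, titles)))
```

In this sketch a paper is reached only through papers already in the set, so `e` is left out even though its title matches: the only paper citing it, `c`, was never included.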