Foundations and Trends® in Communications and Information Theory > Vol 19 > Issue 1

Information-Theoretic Foundations of DNA Data Storage

By Ilan Shomorony, University of Illinois at Urbana-Champaign, USA, ilans@illinois.edu | Reinhard Heckel, Technical University of Munich, Germany, and Rice University, USA, reinhard.heckel@tum.de

 
Suggested Citation
Ilan Shomorony and Reinhard Heckel (2022), "Information-Theoretic Foundations of DNA Data Storage", Foundations and Trends® in Communications and Information Theory: Vol. 19: No. 1, pp 1-106. http://dx.doi.org/10.1561/0100000117

Publication Date: 24 Feb 2022
© 2022 I. Shomorony and R. Heckel
 
Subjects
Coding theory and practice,  Information theory and computer science,  Shannon theory,  Storage and recording codes
 

In this article:
1. Introduction
2. Channel Model
3. Shuffling Channels
4. Noisy Shuffling Channels
5. Multi-draw Channels: Clustering Output Sequences
6. Coding and Computational Aspects
7. Extensions and Open Problems
References

Abstract

Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Natural DNA more than 700,000 years old has been recovered, and about 5 grams of DNA can in principle hold a zettabyte of digital information, orders of magnitude more than what is achieved on conventional storage media. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need for data storage.

While DNA molecules in living organisms can consist of millions of nucleotides, technological constraints mean that, in practice, data is stored on many short DNA molecules, which are preserved in a DNA pool and cannot be spatially ordered. Moreover, imperfections in sequencing, synthesis, and handling, as well as DNA decay during storage, introduce random noise into the system, making the task of reliably storing and retrieving information in DNA challenging.

This unique setup raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences? The goal of this monograph is to address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, we propose a probabilistic channel model that captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise; and (3) the data is read by randomly sampling from the DNA pool. Our goal is to investigate the impact of each of these key aspects on the capacity of the DNA storage system. Rather than focusing on coding-theoretic considerations and computationally efficient encoding and decoding, we aim to build an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments.
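The three aspects listed above can be illustrated with a toy simulation. This is only a minimal sketch under assumptions that are not taken from the monograph: noise is modeled as independent substitutions with probability sub_prob per nucleotide, reads are drawn uniformly with replacement, and the function name dna_storage_channel and its parameters are hypothetical. The channel models analyzed in the monograph may differ in all of these respects.

```python
import random

ALPHABET = "ACGT"


def dna_storage_channel(molecules, num_reads, sub_prob, rng):
    """Toy channel: sample, corrupt, and return reads in no particular order.

    Assumptions (illustrative only): uniform sampling with replacement and
    independent substitution noise with per-base probability sub_prob.
    """
    reads = []
    for _ in range(num_reads):
        # (3) Reading: draw one stored molecule at random from the pool.
        source = rng.choice(molecules)
        noisy = []
        for base in source:
            # (2) Noise: replace the base by a different one with prob. sub_prob.
            if rng.random() < sub_prob:
                noisy.append(rng.choice([b for b in ALPHABET if b != base]))
            else:
                noisy.append(base)
        reads.append("".join(noisy))
    # (1) Unordered storage: the output carries no information about order.
    rng.shuffle(reads)
    return reads


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    pool = ["ACGTAC", "TTGACA", "GGCATC"]
    for read in dna_storage_channel(pool, num_reads=5, sub_prob=0.1, rng=rng):
        print(read)
```

Because sampling is random, some stored molecules may be read several times and others not at all, which is one reason the decoder cannot simply assume that every molecule appears in the output.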

DOI: 10.1561/0100000117
Paperback: ISBN 978-1-68083-956-2, 120 pp., $85.00
E-book (PDF): ISBN 978-1-68083-957-9, 120 pp., $145.00

Information-Theoretic Foundations of DNA Data Storage

Due to its longevity and enormous information density, DNA is an attractive medium for archival data storage. Natural DNA more than 700,000 years old has been recovered, and about 5 grams of DNA can in principle hold a zettabyte of digital information, orders of magnitude more than what is achieved on conventional storage media. Thanks to rapid technological advances, DNA storage is becoming practically feasible, as demonstrated by a number of experimental storage systems, making it a promising solution for our society's increasing need for data storage.

Nevertheless, all of these systems are subject to random noise, which makes the task of reliably storing and retrieving information in DNA challenging. This raises a natural information-theoretic question: how much information can be reliably stored on and reconstructed from millions of short noisy sequences?

In this book, the authors address this question by discussing the fundamental limits of storing information on DNA. Motivated by current technological constraints on DNA synthesis and sequencing, they propose a probabilistic channel model that captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules that are stored in an unordered fashion; (2) the molecules are corrupted by noise; and (3) the data is read by randomly sampling from the DNA pool.

In building an information-theoretic foundation for the analysis of these channels, developing tools for achievability and converse arguments as they go, the authors introduce the reader to the fascinating and promising field of DNA storage.

This book provides a concise yet in-depth starting point for students, researchers, and practitioners. It covers the history of DNA storage development, discusses the various systems available to date, and focuses on the challenges posed by the current state of research in the field.

 
CIT-117