By Vinayak Ramkumar, Indian Institute of Science, Bengaluru, India, vinram93@gmail.com | S. B. Balaji, Qualcomm, Bengaluru, India, balaji.profess@gmail.com | Birenjith Sasidharan, Govt. Engineering College, Barton Hill, Trivandrum, India, birenjith@gmail.com | Myna Vajha, Qualcomm, Bengaluru, India, mynaramana@gmail.com | M. Nikhil Krishnan, International Institute of Information Technology Bangalore, India, nikhilkrishnan.m@gmail.com | P. Vijay Kumar, Indian Institute of Science, Bengaluru, India, pvk1729@gmail.com
In distributed data storage, information pertaining to a given data file is stored across multiple storage units or nodes in redundant fashion to protect against the principal concern, namely, the possibility of data loss arising from the failure of individual nodes. The simplest form of such protection is replication. The explosive growth in the amount of data generated on a daily basis brought up a second major concern, namely minimization of the overhead associated with such redundant storage. This concern led to the adoption by the storage industry of erasure-recovery codes such as Reed- Solomon (RS) codes and more generally, maximum distance separable codes, as these codes offer the lowest-possible storage overhead for a given level of reliability.
In the setting of a large data center, where the amount of stored data can run into several exabytes, a third concern arises, namely the need for efficient recovery from a commonplace occurrence, the failure of a single storage unit. One measure of efficiency in node repair is how small one can make the amount of data download needed to repair a failed unit, termed the repair bandwidth. This was the subject of the seminal paper by Dimakis et al. [50] in which an entirely new class of codes called regenerating codes was introduced, that within a certain repair framework, had the minimum-possible repair bandwidth. A second measure relates to the number of helper nodes contacted for node repair, termed the repair degree. A low repair degree is desirable as this means that a smaller number of nodes are impacted by the failure of a given node. The landmark paper by Gopalan et al. [72] focuses on this second measure, leading to the development of the theory of locally recoverable codes. The two events also led to the creation of a third class of codes known as locally regenerating codes, where the aim is to simultaneously achieve reduced repair bandwidth and low repair degree. Research in a different direction led researchers to take a fresh look at the challenge of efficient RS-code repair, and led to the identification of improved repair schemes for RS codes that have significantly reduced repair bandwidth.
This monograph introduces the reader to these different approaches towards efficient node repair and presents many of the fundamental bounds and code constructions that have since emerged. Several open problems are identified, and many of the sections have a notes subsection at the end that provides additional background.
Data storage has grown such that distributed storage over a number of systems is now commonplace. This has given rise to an increase in the complexity of ensuring data loss does not occur, particularly where failure is due to the failure of individual nodes within the storage system. Redundancy was the main tool to combat this, but with huge increases in data, minimization of the overhead associated with this technique caused major concern. In a large data center, a third concern arose, namely the need for efficient recovery from the failure of a single storage unit.
In this monograph, the authors give a comprehensive overview of the role of differing types of codes in addressing the issues in large distributed storage systems. They introduce the reader to regenerative codes, locally recoverable codes and locally regenerative codes; the three main classes of codes used in such systems. They give an exhaustive overview of how these codes were created, their uses and the developments and improvements of the codes in the last decade.
This in-depth review gives the reader an accessible and complete overview of the modern codes used in distributed storage systems today. It is a one-stop source for students, researchers and practitioners working on any such system.