By Udi Wieder, VMware Research, USA, udi.wieder@gmail.com
Many tasks in computer systems could be abstracted as distributing items into buckets, so that the allocation of items across buckets is as balanced as possible, and furthermore, given an item’s identifier it is possible to determine quickly to which bucket it was assigned. A canonical example is a dictionary data structure, where ‘items’ stands for key-value pairs and ‘buckets’ for memory locations. Another example is a distributed key-value store, where the buckets represent locations in disk or even whole servers. A third example may be a distributed execution engine where items represent processes and buckets compute devices, and so on. A common technique in this domain is the use of a hash-function that maps an item into a relatively short fixed length string. The hash function is then used in some way to associate the item to its bucket. The use of a hash function is typically the first step in the solution and additional algorithmic ideas are required to deal with collisions and the imbalance of hash values. In this monograph we survey some of these techniques. We focus on multiple choice schemes where items are placed into buckets via the use of several independent hash functions, and typically an item is placed at the least loaded bucket at the time of placement. We analyze the distributions obtained in detail, and show how these ideas could be used to design basic data structures. With respect to data structures we focus on dictionaries, presenting linear probing, cuckoo hashing and many of their variants.
Many tasks in computer systems could be abstracted as distributing items into buckets, so that the allocation of items across buckets is as balanced as possible, and, furthermore, given an item’s identifier it is possible to determine quickly to which bucket it was assigned. A canonical example is a dictionary data structure, where ‘items’ stands for key-value pairs and ‘buckets’ for memory locations. Another example is a distributed key-value store, where the buckets represent locations in disk or even whole servers. A third example may be a distributed execution engine where items represent processes and buckets compute devices, and so on. A common technique in this domain is the use of a hash-function that maps an item into a relatively short fixed length string. The hash function is then used in some way to associate the item to its bucket. The use of a hash function is typically the first step in the solution and additional algorithmic ideas are required to deal with collisions and the imbalance of hash values.
Hashing, Load Balancing and Multiple Choice presents some of the basic algorithmic ideas that underpin many of the practical and theoretically interesting approaches for this problem. It focuses on multiple choice schemes where items are placed into buckets via the use of several independent hash functions, and typically an item is placed at the least loaded bucket at the time of placement. It analyses the distributions obtained, and shows how these ideas could be used to design basic data structures. With respect to data structures it focuses on dictionaries, presenting linear probing, cuckoo hashing and many of their variants.