By Akshay Agrawal, Stanford University, USA, akshaykagrawal7@gmail.com | Alnur Ali, Stanford University, USA, alnurali@stanford.edu | Stephen Boyd, Stanford University, USA, boyd@stanford.edu
We consider the vector embedding problem. We are given a finite set of items, with the goal of assigning a representative vector to each one, possibly under some constraints (such as the collection of vectors being standardized, i.e., having zero mean and unit covariance). We are given data indicating that some pairs of items are similar, and optionally, some other pairs are dissimilar. For pairs of similar items, we want the corresponding vectors to be near each other, and for dissimilar pairs, we want the vectors to not be near each other, measured in Euclidean distance. We formalize this by introducing distortion functions, defined for some pairs of items. Our goal is to choose an embedding that minimizes the total distortion, subject to the constraints. We call this the minimum-distortion embedding (MDE) problem.
The MDE framework is simple but general. It includes a wide variety of specific embedding methods, such as spectral embedding, principal component analysis, multidimensional scaling, Euclidean distance problems, dimensionality reduction methods (like Isomap and UMAP), semi-supervised learning, sphere packing, force-directed layout, and others. It also includes new embeddings, and provides principled ways of validating or sanity-checking historical and new embeddings alike.
In a few special cases, MDE problems can be solved exactly. For others, we develop a projected quasi-Newton method that approximately minimizes the distortion and scales to very large data sets, while placing few assumptions on the distortion functions and constraints. This monograph is accompanied by an open-source Python package, PyMDE, for approximately solving MDE problems. Users can select from a library of distortion functions and constraints or specify custom ones, making it easy to rapidly experiment with new embeddings. Because our algorithm is scalable, and because PyMDE can exploit GPUs, our software scales to problems with millions of items and tens of millions of distortion functions. Additionally, PyMDE is competitive in runtime with specialized implementations of specific embedding methods. To demonstrate our method, we compute embeddings for several real-world data sets, including images, an academic co-author network, US county demographic data, and single-cell mRNA transcriptomes.
Embeddings provide concrete numerical representations of otherwise abstract items, for use in downstream tasks. For example, a biologist might look for subfamilies of related cells by clustering embedding vectors associated with individual cells, while a machine learning practitioner might use vector representations of words as features for a classification task.
In this monograph the authors present a general framework for faithful embedding called minimum-distortion embedding (MDE) that generalizes the common cases in which similarities between items are described by weights or distances. The MDE framework is simple but general. It includes a wide variety of specific embedding methods, including spectral embedding, principal component analysis, multidimensional scaling, Euclidean distance problems, etc.
The authors provide a detailed description of minimum-distortion embedding problem and describe the theory behind creating solutions to all aspects. They also give describe in detail algorithms for computing minimum-distortion embeddings. Finally, they provide examples on how to approximately solve many MDE problems involving real datasets, including images, co-authorship networks, United States county demographics, population genetics, and single-cell mRNA transcriptomes.
An accompanying open-source software package, PyMDE, makes it easy for practitioners to experiment with different embeddings via different choices of distortion functions and constraint sets.
The theory and techniques described and illustrated in this book will be of interest to researchers and practitioners working on modern-day systems that look to adopt cutting-edge artificial intelligence.