By Shao-Lun Huang, Tsinghua-Berkeley Shenzhen Institute, China, shaolun.huang@sz.tsinghua.edu.cn | Anuran Makur, Purdue University, USA, amakur@purdue.edu | Gregory W. Wornell, Massachusetts Institute of Technology, USA, gww@mit.edu | Lizhong Zheng, Massachusetts Institute of Technology, USA, lizhong@mit.edu
This monograph develops unifying perspectives on the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, natural notions of universality are introduced, and a local equivalence among them is established. The analysis is naturally expressed via information geometry, which provides both conceptual and computational insights. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-Rényi maximal correlation, the canonical correlation and principal component analyses of Hotelling and Pearson, Tishby’s information bottleneck, the Wyner and Gács-Körner common information, Ky Fan k-norms, and Breiman and Friedman’s alternating conditional expectations algorithm. Among other uses, the framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.
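As a concrete illustration of one of these connections (a minimal sketch, not code from the monograph itself): for finite alphabets, the Hirschfeld-Gebelein-Rényi maximal correlation of a joint distribution equals the second-largest singular value of the matrix with entries P(x,y)/√(P(x)P(y)), whose leading singular vectors in turn yield the associated low-dimensional features. The function name and example distribution below are hypothetical, chosen only to demonstrate the computation.

```python
# Illustrative sketch: HGR maximal correlation via the SVD, for a finite-
# alphabet joint pmf. The matrix B has entries
#     B[x, y] = P(x, y) / sqrt(P(x) * P(y)),
# whose top singular value is always 1 (achieved by the sqrt-marginal
# vectors); the second singular value is the maximal correlation.
import numpy as np

def hgr_maximal_correlation(pxy: np.ndarray) -> float:
    """pxy: joint pmf of (X, Y), shape (|X|, |Y|), entries summing to 1."""
    px = pxy.sum(axis=1)  # marginal of X
    py = pxy.sum(axis=0)  # marginal of Y
    B = pxy / np.sqrt(np.outer(px, py))
    s = np.linalg.svd(B, compute_uv=False)  # singular values, descending
    return s[1]

# Hypothetical example: a joint pmf with mild dependence between X and Y.
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
print(hgr_maximal_correlation(pxy))  # strictly between 0 and 1 here
```

The alternating conditional expectations (ACE) algorithm mentioned above computes the same quantity iteratively, without forming B explicitly, which matters when the alphabets are large.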
In many contemporary and emerging applications of machine learning and statistical inference, the phenomena of interest are characterized by variables defined over large alphabets. The increasing size of the data, the growing number of inference tasks, and the limited availability of training data create a need to understand which inference tasks can be carried out most effectively and, in turn, which features of the data are most relevant to them.
In this monograph, the authors develop the idea of extracting “universally good” features, and establish that diverse notions of such universality lead to precisely the same features. Their information-theoretic approach yields a local information geometric analysis that facilitates the computation of these features in a host of applications.
The authors provide a comprehensive treatment that guides the reader from basic principles to advanced techniques, including many new results. They emphasize a development from first principles, with common, unifying terminology and notation, and provide pointers to the rich surrounding literature, both historical and contemporary.
Written for students and researchers, this monograph is a complete treatise on the information-theoretic treatment of a recognized and timely problem in machine learning and statistical inference.