Articles for MALhttps://nowpublishers.com/feed/MALArticles for MALhttp://nowpublishers.com/article/Details/MAL-068Introduction to Multi-Armed Bandits<h3>Abstract</h3>Multi-armed bandits a simple but very powerful framework
for algorithms that make decisions over time under uncertainty.
An enormous body of work has accumulated over
the years, covered in several books and surveys. This book
provides a more introductory, textbook-like treatment of
the subject. Each chapter tackles a particular line of work,
providing a self-contained, teachable technical introduction
and a brief review of the further developments.<h3>Suggested Citation</h3>Aleksandrs Slivkins (2019), "Introduction to Multi-Armed Bandits", Foundations and Trends® in Machine Learning: Vol. 12: No. 1-2, pp 1-286. http://dx.doi.org/10.1561/2200000068Fri, 08 Nov 2019 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-073Computational Optimal Transport<h3>Abstract</h3>Optimal transport (OT) theory can be informally described
using the words of the French mathematician Gaspard
Monge (1746–1818): A worker with a shovel in hand has to
move a large pile of sand lying on a construction site. The
goal of the worker is to erect with all that sand a target
pile with a prescribed shape (for example, that of a giant
sand castle). Naturally, the worker wishes to minimize her
total effort, quantified for instance as the total distance
or time spent carrying shovelfuls of sand. Mathematicians
interested in OT cast that problem as that of comparing two
probability distributions—two different piles of sand of the
same volume. They consider all of the many possible ways
to morph, transport or reshape the first pile into the second,
and associate a “global” cost to every such transport, using
the “local” consideration of how much it costs to move a
grain of sand from one place to another. Mathematicians are
interested in the properties of that least costly transport,
as well as in its efficient computation. That smallest cost
not only defines a distance between distributions, but it also
entails a rich geometric structure on the space of probability
distributions. That structure is canonical in the sense that it
borrows key geometric properties of the underlying “ground”
space on which these distributions are defined. For instance,
when the underlying space is Euclidean, key concepts such
as interpolation, barycenters, convexity or gradients of functions
extend naturally to the space of distributions endowed
with an OT geometry.
OT has been (re)discovered in many settings and under different
forms, giving it a rich history. While Monge’s seminal
work was motivated by an engineering problem, Tolstoi in
the 1920s and Hitchcock, Kantorovich and Koopmans in
the 1940s established its significance to logistics and economics.
Dantzig solved it numerically in 1949 within the
framework of linear programming, giving OT a firm footing
in optimization. OT was later revisited by analysts in the
1990s, notably Brenier, while also gaining fame in computer
vision under the name of earth mover’s distances. Recent
years have witnessed yet another revolution in the spread of
OT, thanks to the emergence of approximate solvers that
can scale to large problem dimensions. As a consequence,
OT is being increasingly used to unlock various problems
in imaging sciences (such as color or texture processing),
graphics (for shape manipulation) or machine learning (for
regression, classification and generative modeling).
This paper reviews OT with a bias toward numerical methods,
and covers the theoretical properties of OT that can
guide the design of new algorithms.We focus in particular on
the recent wave of efficient algorithms that have helped OT
find relevance in data sciences. We give a prominent place to
the many generalizations of OT that have been proposed in
but a few years, and connect them with related approaches
originating from statistical inference, kernel methods and information
theory. All of the figures can be reproduced using
code made available on a companion website. This website
hosts the book project Computational Optimal Transport.
You will also find slides and computational resources.<h3>Suggested Citation</h3>Gabriel Peyré and Marco Cuturi (2019), "Computational Optimal Transport", Foundations and Trends® in Machine Learning: Vol. 11: No. 5-6, pp 355-607. http://dx.doi.org/10.1561/2200000073Tue, 12 Feb 2019 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-071An Introduction to Deep Reinforcement Learning<h3>Abstract</h3>Deep reinforcement learning is the combination of reinforcement
learning (RL) and deep learning. This field of research
has been able to solve a wide range of complex decisionmaking
tasks that were previously out of reach for a machine.
Thus, deep RL opens up many new applications in domains
such as healthcare, robotics, smart grids, finance, and many
more. This manuscript provides an introduction to deep
reinforcement learning models, algorithms and techniques.
Particular focus is on the aspects related to generalization
and how deep RL can be used for practical applications. We
assume the reader is familiar with basic machine learning
concepts.<h3>Suggested Citation</h3>Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare and Joelle Pineau (2018), "An Introduction to Deep Reinforcement Learning", Foundations and Trends® in Machine Learning: Vol. 11: No. 3-4, pp 219-354. http://dx.doi.org/10.1561/2200000071Thu, 20 Dec 2018 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-072An Introduction to Wishart Matrix Moments<h3>Abstract</h3>These lecture notes provide a comprehensive, self-contained
introduction to the analysis of Wishart matrix moments.
This study may act as an introduction to some particular
aspects of random matrix theory, or as a self-contained
exposition of Wishart matrix moments.
Random matrix theory plays a central role in statistical
physics, computational mathematics and engineering sciences,
including data assimilation, signal processing, combinatorial
optimization, compressed sensing, econometrics and
mathematical finance, among numerous others. The mathematical
foundations of the theory of random matrices lies at
the intersection of combinatorics, non-commutative algebra,
geometry, multivariate functional and spectral analysis, and
of course statistics and probability theory. As a result, most
of the classical topics in random matrix theory are technical,
and mathematically difficult to penetrate for non-experts
and regular users and practitioners.
The technical aim of these notes is to review and extend some
important results in random matrix theory in the specific
context of real random Wishart matrices. This special class
of Gaussian-type sample covariance matrix plays an important
role in multivariate analysis and in statistical theory.We
derive non-asymptotic formulae for the full matrix moments
of real valued Wishart random matrices. As a corollary, we
derive and extend a number of spectral and trace-type results
for the case of non-isotropic Wishart random matrices.
We also derive the full matrix moment analogues of some
classic spectral and trace-type moment results. For example,
we derive semi-circle and Marchencko–Pastur-type laws in
the non-isotropic and full matrix cases. Laplace matrix transforms
and matrix moment estimates are also studied, along
with new spectral and trace concentration-type inequalities.<h3>Suggested Citation</h3>Adrian N. Bishop, Pierre Del Moral and Angèle Niclas (2018), "An Introduction to Wishart Matrix Moments", Foundations and Trends® in Machine Learning: Vol. 11: No. 2, pp 97-218. http://dx.doi.org/10.1561/2200000072Fri, 07 Dec 2018 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-070A Tutorial on Thompson Sampling<h3>Abstract</h3>Thompson sampling is an algorithm for online decision problems
where actions are taken sequentially in a manner that
must balance between exploiting what is known to maximize
immediate performance and investing to accumulate
new information that may improve future performance. The
algorithm addresses a broad range of problems in a computationally
efficient manner and is therefore enjoying wide
use. This tutorial covers the algorithm and its application,
illustrating concepts through a range of examples, including
Bernoulli bandit problems, shortest path problems, product
recommendation, assortment, active learning with neural
networks, and reinforcement learning in Markov decision
processes. Most of these problems involve complex information
structures, where information revealed by taking an
action informs beliefs about other actions. We will also discuss
when and why Thompson sampling is or is not effective
and relations to alternative algorithms.<h3>Suggested Citation</h3>Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband and Zheng Wen (2018), "A Tutorial on Thompson Sampling", Foundations and Trends® in Machine Learning: Vol. 11: No. 1, pp 1-96. http://dx.doi.org/10.1561/2200000070Thu, 12 Jul 2018 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-064Explaining the Success of Nearest
Neighbor Methods in Prediction<h3>Abstract</h3>Many modern methods for prediction leverage nearest neighbor
search to find past training examples most similar to
a test example, an idea that dates back in text to at least
the 11th century and has stood the test of time. This monograph
aims to explain the success of these methods, both in
theory, for which we cover foundational nonasymptotic statistical
guarantees on nearest-neighbor-based regression and
classification, and in practice, for which we gather prominent
methods for approximate nearest neighbor search that
have been essential to scaling prediction systems reliant on
nearest neighbor analysis to handle massive datasets. Furthermore,
we discuss connections to learning distances for
use with nearest neighbor methods, including how random
decision trees and ensemble methods learn nearest neighbor
structure, as well as recent developments in crowdsourcing
and graphons.
In terms of theory, our focus is on nonasymptotic statistical
guarantees, which we state in the form of how many training
data and what algorithm parameters ensure that a nearest
neighbor prediction method achieves a user-specified error
tolerance. We begin with the most general of such results
for nearest neighbor and related kernel regression and classification
in general metric spaces. In such settings in which
we assume very little structure, what enables successful prediction
is smoothness in the function being estimated for
regression, and a low probability of landing near the decision
boundary for classification. In practice, these conditions
could be difficult to verify empirically for a real dataset. We
then cover recent theoretical guarantees on nearest neighbor
prediction in the three case studies of time series forecasting,
recommending products to people over time, and delineating
human organs in medical images by looking at image
patches. In these case studies, clustering structure, which
is easier to verify in data and more readily interpretable by
practitioners, enables successful prediction.<h3>Suggested Citation</h3>George H. Chen and Devavrat Shah (2018), "Explaining the Success of Nearest
Neighbor Methods in Prediction", Foundations and Trends® in Machine Learning: Vol. 10: No. 5-6, pp 337-588. http://dx.doi.org/10.1561/2200000064Thu, 31 May 2018 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-058Non-convex Optimization for Machine Learning<h3>Abstract</h3>A vast majority of machine learning algorithms train their models and
perform inference by solving optimization problems. In order to capture
the learning and prediction problems accurately, structural constraints
such as sparsity or low rank are frequently imposed or else the objective
itself is designed to be a non-convex function. This is especially true
of algorithms that operate in high-dimensional spaces or that train
non-linear models such as tensor models and deep networks.
The freedom to express the learning problem as a non-convex optimization
problem gives immense modeling power to the algorithm
designer, but often such problems are NP-hard to solve. A popular
workaround to this has been to relax non-convex problems to convex
ones and use traditional methods to solve the (convex) relaxed optimization
problems. However this approach may be lossy and nevertheless
presents significant challenges for large scale optimization.
On the other hand, direct approaches to non-convex optimization
have met with resounding success in several domains and remain the
methods of choice for the practitioner, as they frequently outperform
relaxation-based techniques – popular heuristics include projected gradient
descent and alternating minimization. However, these are often
poorly understood in terms of their convergence and other properties.
This monograph presents a selection of recent advances that bridge
a long-standing gap in our understanding of these heuristics. We hope
that an insight into the inner workings of these methods will allow the
reader to appreciate the unique marriage of task structure and generative
models that allow these heuristic techniques to (provably) succeed.
The monograph will lead the reader through several widely used nonconvex
optimization techniques, as well as applications thereof. The
goal of this monograph is to both, introduce the rich literature in this
area, as well as equip the reader with the tools and techniques needed
to analyze these simple procedures for non-convex problems.<h3>Suggested Citation</h3>Prateek Jain and Purushottam Kar (2017), "Non-convex Optimization for Machine Learning", Foundations and Trends® in Machine Learning: Vol. 10: No. 3-4, pp 142-363. http://dx.doi.org/10.1561/2200000058Mon, 04 Dec 2017 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-060Kernel Mean Embedding of Distributions: A Review and Beyond<h3>Abstract</h3>A Hilbert space embedding of a distribution—in short, a kernel mean
embedding—has recently emerged as a powerful tool for machine learning
and statistical inference. The basic idea behind this framework is
to map distributions into a reproducing kernel Hilbert space (RKHS)
in which the whole arsenal of kernel methods can be extended to probability
measures. It can be viewed as a generalization of the original
“feature map” common to support vector machines (SVMs) and other
kernel methods. In addition to the classical applications of kernel methods,
the kernel mean embedding has found novel applications in fields
ranging from probabilistic modeling to statistical inference, causal discovery,
and deep learning.
This survey aims to give a comprehensive review of existing work
and recent advances in this research area, and to discuss challenging
issues and open problems that could potentially lead to new research
directions. The survey begins with a brief introduction to the RKHS
and positive definite kernels which forms the backbone of this survey,
followed by a thorough discussion of the Hilbert space embedding of
marginal distributions, theoretical guarantees, and a review of its applications.
The embedding of distributions enables us to apply RKHS
methods to probability measures which prompts a wide range of applications
such as kernel two-sample testing, independent testing, and
learning on distributional data. Next, we discuss the Hilbert space embedding
for conditional distributions, give theoretical insights, and review
some applications. The conditional mean embedding enables us
to perform sum, product, and Bayes’ rules—which are ubiquitous in
graphical model, probabilistic inference, and reinforcement learning—
in a non-parametric way using this new representation of distributions.
We then discuss relationships between this framework and other related
areas. Lastly, we give some suggestions on future research directions.<h3>Suggested Citation</h3>Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur and Bernhard Schölkopf (2017), "Kernel Mean Embedding of Distributions: A Review and Beyond", Foundations and Trends® in Machine Learning: Vol. 10: No. 1-2, pp 1-141. http://dx.doi.org/10.1561/2200000060Wed, 28 Jun 2017 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-067Tensor Networks for Dimensionality Reduction and Large-scale Optimization:
Part 2 Applications and Future Perspectives<h3>Abstract</h3>Part 2 of this monograph builds on the introduction to tensor networks
and their operations presented in Part 1. It focuses on tensor
network models for super-compressed higher-order representation of
data/parameters and related cost functions, while providing an outline
of their applications in machine learning and data analytics.
A particular emphasis is on the tensor train (TT) and Hierarchical
Tucker (HT) decompositions, and their physically meaningful interpretations
which reflect the scalability of the tensor network approach.
Through a graphical approach, we also elucidate how, by virtue of
the underlying low-rank tensor approximations and sophisticated contractions
of core tensors, tensor networks have the ability to perform
distributed computations on otherwise prohibitively large volumes of
data/parameters, thereby alleviating or even eliminating the curse of
dimensionality.
The usefulness of this concept is illustrated over a number of applied
areas, including generalized regression and classification (support tensor
machines, canonical correlation analysis, higher order partial least
squares), generalized eigenvalue decomposition, Riemannian optimization,
and in the optimization of deep neural networks.
Part 1 and Part 2 of this work can be used either as stand-alone
separate texts, or indeed as a conjoint comprehensive review of the
exciting field of low-rank tensor networks and tensor decompositions.<h3>Suggested Citation</h3>Andrzej Cichocki, Anh-Huy Phan, Qibin Zhao, Namgil Lee, Ivan Oseledets, Masashi Sugiyama and Danilo P. Mandic (2017), "Tensor Networks for Dimensionality Reduction and Large-scale Optimization:
Part 2 Applications and Future Perspectives", Foundations and Trends® in Machine Learning: Vol. 9: No. 6, pp 431-673. http://dx.doi.org/10.1561/2200000067Tue, 30 May 2017 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-059Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions<h3>Abstract</h3>Modern applications in engineering and data science are increasingly
based on multidimensional data of exceedingly high volume, variety,
and structural richness. However, standard machine learning algorithms
typically scale exponentially with data volume and complexity
of cross-modal couplings - the so called curse of dimensionality -
which is prohibitive to the analysis of large-scale, multi-modal and
multi-relational datasets. Given that such data are often efficiently
represented as multiway arrays or tensors, it is therefore timely and
valuable for the multidisciplinary machine learning and data analytic
communities to review low-rank tensor decompositions and tensor networks
as emerging tools for dimensionality reduction and large scale
optimization problems. Our particular emphasis is on elucidating that,
by virtue of the underlying low-rank approximations, tensor networks
have the ability to alleviate the curse of dimensionality in a number
of applied areas. In Part 1 of this monograph we provide innovative
solutions to low-rank tensor network decompositions and easy to interpret
graphical representations of the mathematical operations on
tensor networks. Such a conceptual insight allows for seamless migration
of ideas from the flat-view matrices to tensor network operations
and vice versa, and provides a platform for further developments, practical
applications, and non-Euclidean extensions. It also permits the
introduction of various tensor network operations without an explicit
notion of mathematical expressions, which may be beneficial for many
research communities that do not directly rely on multilinear algebra.
Our focus is on the Tucker and tensor train (TT) decompositions and
their extensions, and on demonstrating the ability of tensor networks
to provide linearly or even super-linearly (e.g., logarithmically) scalable
solutions, as illustrated in detail in Part 2 of this monograph.<h3>Suggested Citation</h3>Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao and Danilo P. Mandic (2016), "Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions", Foundations and Trends® in Machine Learning: Vol. 9: No. 4-5, pp 249-429. http://dx.doi.org/10.1561/2200000059Mon, 19 Dec 2016 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-052Patterns of Scalable Bayesian Inference<h3>Abstract</h3>Datasets are growing not just in size but in complexity, creating a demand for rich models and quantification of uncertainty. Bayesian methods are an excellent fit for this demand, but scaling Bayesian inference is a challenge. In response to this challenge, there has been considerable recent work based on varying assumptions about model structure, underlying computational resources, and the importance of asymptotic correctness. As a result, there is a zoo of ideas with a wide range of assumptions and applicability.
In this paper, we seek to identify unifying principles, patterns, and intuitions for scaling Bayesian inference. We review existing work on utilizing modern computing resources with both MCMC and variational approximation techniques. From this taxonomy of ideas, we characterize the general principles that have proven successful for designing scalable inference procedures and comment on the path forward.
<h3>Suggested Citation</h3>Elaine Angelino, Matthew James Johnson and Ryan P. Adams (2016), "Patterns of Scalable Bayesian Inference", Foundations and Trends® in Machine Learning: Vol. 9: No. 2-3, pp 119-247. http://dx.doi.org/10.1561/2200000052Thu, 17 Nov 2016 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-055Generalized Low Rank Models<h3>Abstract</h3>Principal components analysis (PCA) is a well-known technique for approximating a tabular data set by a low rank matrix. Here, we extend the idea of PCA to handle arbitrary data sets consisting of numerical, Boolean, categorical, ordinal, and other data types. This framework encompasses many well-known techniques in data analysis, such as nonnegative matrix factorization, matrix completion, sparse and robust PCA, k-means, k-SVD, and maximum margin matrix factorization. The method handles heterogeneous data sets, and leads to coherent schemes for compressing, denoising, and imputing missing entries across all data types simultaneously. It also admits a number of interesting interpretations of the low rank factors, which allow clustering of examples or of features. We propose several parallel algorithms for fitting generalized low rank models, and describe implementations and numerical
results.<h3>Suggested Citation</h3>Madeleine Udell, Corinne Horn, Reza Zadeh and Stephen Boyd (2016), "Generalized Low Rank Models", Foundations and Trends® in Machine Learning: Vol. 9: No. 1, pp 1-118. http://dx.doi.org/10.1561/2200000055Thu, 23 Jun 2016 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-049Bayesian Reinforcement Learning: A Survey<h3>Abstract</h3>Bayesian methods for machine learning have been widely investigated,
yielding principled methods for incorporating prior information into
inference algorithms. In this survey, we provide an in-depth review
of the role of Bayesian methods for the reinforcement learning (RL)
paradigm. The major incentives for incorporating Bayesian reasoning
in RL are: 1) it provides an elegant approach to action-selection (exploration/
exploitation) as a function of the uncertainty in learning; and
2) it provides a machinery to incorporate prior knowledge into the algorithms.
We first discuss models and methods for Bayesian inference
in the simple single-step Bandit model. We then review the extensive
recent literature on Bayesian methods for model-based RL, where prior
information can be expressed on the parameters of the Markov model.
We also present Bayesian methods for model-free RL, where priors are
expressed over the value function or policy class. The objective of the
paper is to provide a comprehensive survey on Bayesian RL algorithms
and their theoretical and empirical properties.<h3>Suggested Citation</h3>Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau and Aviv Tamar (2015), "Bayesian Reinforcement Learning: A Survey", Foundations and Trends® in Machine Learning: Vol. 8: No. 5-6, pp 359-483. http://dx.doi.org/10.1561/2200000049Thu, 26 Nov 2015 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-050Convex Optimization: Algorithms and Complexity<h3>Abstract</h3>This monograph presents the main complexity theorems in convex optimization
and their corresponding algorithms. Starting from the fundamental
theory of black-box optimization, the material progresses towards
recent advances in structural optimization and stochastic optimization.
Our presentation of black-box optimization, strongly influenced
by Nesterov’s seminal book and Nemirovski’s lecture notes,
includes the analysis of cutting plane methods, as well as (accelerated)
gradient descent schemes. We also pay special attention to non-
Euclidean settings (relevant algorithms include Frank-Wolfe, mirror
descent, and dual averaging) and discuss their relevance in machine
learning. We provide a gentle introduction to structural optimization
with FISTA (to optimize a sum of a smooth and a simple non-smooth
term), saddle-point mirror prox (Nemirovski’s alternative to Nesterov’s
smoothing), and a concise description of interior point methods. In
stochastic optimization we discuss stochastic gradient descent, minibatches,
random coordinate descent, and sublinear algorithms. We also
briefly touch upon convex relaxation of combinatorial problems and the
use of randomness to round solutions, as well as random walks based
methods.<h3>Suggested Citation</h3>Sébastien Bubeck (2015), "Convex Optimization: Algorithms and Complexity", Foundations and Trends® in Machine Learning: Vol. 8: No. 3-4, pp 231-357. http://dx.doi.org/10.1561/2200000050Thu, 12 Nov 2015 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-048An Introduction to Matrix Concentration Inequalities<h3>Abstract</h3>Random matrices now play a role in many areas of theoretical, applied,
and computational mathematics. Therefore, it is desirable to have tools
for studying random matrices that are flexible, easy to use, and powerful.
Over the last fifteen years, researchers have developed a remarkable
family of results, called matrix concentration inequalities, that achieve
all of these goals.<p>This monograph offers an invitation to the field of matrix concentration
inequalities. It begins with some history of random matrix theory;
it describes a flexible model for random matrices that is suitable
for many problems; and it discusses the most important matrix concentration
results. To demonstrate the value of these techniques, the
presentation includes examples drawn from statistics, machine learning,
optimization, combinatorics, algorithms, scientific computing, and
beyond.</p><h3>Suggested Citation</h3>Joel A. Tropp (2015), "An Introduction to Matrix Concentration Inequalities", Foundations and Trends® in Machine Learning: Vol. 8: No. 1-2, pp 1-230. http://dx.doi.org/10.1561/2200000048Wed, 27 May 2015 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-054Explicit-Duration Markov Switching Models<h3>Abstract</h3>Markov switching models (MSMs) are probabilistic models that employ
multiple sets of parameters to describe different dynamic regimes
that a time series may exhibit at different periods of time. The
switching mechanism between regimes is controlled by unobserved random
variables that form a first-order Markov chain. Explicit-duration
MSMs contain additional variables that explicitly model the distribution
of time spent in each regime. This allows to define duration distributions
of any form, but also to impose complex dependence between
the observations and to reset the dynamics to initial conditions. Models
that focus on the first two properties are most commonly known as hidden
semi-Markov models or segment models, whilst models that focus
on the third property are most commonly known as changepoint models
or reset models. In this monograph, we provide a description of explicit-duration
modelling by categorizing the different approaches into three
groups, which differ in encoding in the explicit-duration variables different
information about regime change/reset boundaries. The approaches
are described using the formalism of graphical models, which allows to
graphically represent and assess statistical dependence and therefore
to easily describe the structure of complex models and derive inference
routines. The presentation is intended to be pedagogical, focusing
on providing a characterization of the three groups in terms of model
structure constraints and inference properties. The monograph is supplemented
with a software package that contains most of the models
and examples described. The material presented should be useful to
both researchers wishing to learn about these models and researchers
wishing to develop them further.<p>
<strong>The supplementary data for this issue will be available soon.</strong>
</p><h3>Suggested Citation</h3>Silvia Chiappa (2014), "Explicit-Duration Markov Switching Models", Foundations and Trends® in Machine Learning: Vol. 7: No. 6, pp 803-886. http://dx.doi.org/10.1561/2200000054Tue, 23 Dec 2014 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-051Adaptation, Learning, and Optimization over Networks<h3>Abstract</h3>This work deals with the topic of information processing over graphs. The presentation is largely self-contained and covers results that relate to the
analysis and design of multi-agent networks for the distributed solution of optimization, adaptation, and learning problems from streaming data through
localized interactions among agents. The results derived in this work are useful in comparing network topologies against each other, and in comparing networked
solutions against centralized or batch implementations. There are many good reasons for the peaked interest in distributed implementations, especially in this
day and age when the word “network” has become commonplace whether one is referring to social networks, power networks, transportation networks, biological
networks, or other types of networks. Some of these reasons have to do with the benefits of cooperation in terms of improved performance and improved resilience
to failure. Other reasons deal with privacy and secrecy considerations where agents may not be comfortable sharing their data with remote fusion centers. In other
situations, the data may already be available in dispersed locations, as happens with cloud computing. One may also be interested in learning through data mining
from big data sets. Motivated by these considerations, this work examines the limits of performance of distributed solutions and discusses procedures that help
bring forth their potential more fully. The presentation adopts a useful statistical framework and derives performance results that elucidate the mean-square
stability, convergence, and steady-state behavior of the learning networks. At the same time, the work illustrates how distributed processing over graphs
gives rise to some revealing phenomena due to the coupling effect among the agents. These phenomena are discussed in the context of adaptive networks, along
with examples from a variety of areas including distributed sensing, intrusion detection, distributed estimation, online adaptation, network system theory,
and machine learning.<h3>Suggested Citation</h3>Ali H. Sayed (2014), "Adaptation, Learning, and Optimization over Networks", Foundations and Trends® in Machine Learning: Vol. 7: No. 4-5, pp 311-801. http://dx.doi.org/10.1561/2200000051Fri, 25 Jul 2014 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-037Theory of Disagreement-Based Active Learning<h3>Abstract</h3>Active learning is a protocol for supervised machine learning, in which a learning algorithm sequentially requests the labels of selected
data points from a large pool of unlabeled data. This contrasts with passive learning, where the labeled data are taken at random. The
objective in active learning is to produce a highly-accurate classifier, ideally using fewer labels than the number of random labeled data
sufficient for passive learning to achieve the same. This article describes recent advances in our understanding of the theoretical benefits
of active learning, and implications for the design of effective active learning algorithms. Much of the article focuses on a particular
technique, namely disagreement-based active learning, which by now has amassed a mature and coherent literature. It also briefly surveys
several alternative approaches from the literature. The emphasis is on theorems regarding the performance of a few general algorithms,
including rigorous proofs where appropriate. However, the presentation is intended to be pedagogical, focusing on results that illustrate
fundamental ideas, rather than obtaining the strongest or most general known theorems. The intended audience includes researchers and
advanced graduate students in machine learning and statistics, interested in gaining a deeper understanding of the recent and ongoing
developments in the theory of active learning.<h3>Suggested Citation</h3>Steve Hanneke (2014), "Theory of Disagreement-Based Active Learning", Foundations and Trends® in Machine Learning: Vol. 7: No. 2-3, pp 131-309. http://dx.doi.org/10.1561/2200000037Thu, 12 Jun 2014 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-038From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning<h3>Abstract</h3>This work covers several aspects of the <em>optimism in the face of uncertainty</em> principle applied to large scale optimization
problems under finite numerical budget. The initial motivation for the research reported here originated from the empirical success of the
so-called <em>Monte-Carlo Tree Search</em> method popularized in Computer Go and further extended to many other games as well as optimization and
planning problems. Our objective is to contribute to the development of theoretical foundations of the field by characterizing the complexity
of the underlying optimization problems and designing efficient algorithms with performance guarantees.<p>The main idea presented here is that it is possible to decompose a complex decision making problem (such as an optimization problem in a
large search space) into a sequence of elementary decisions, where each decision of the sequence is solved using a <em>(stochastic) multi-armed
bandit</em> (simple mathematical model for decision making in stochastic environments). This so-called <em>hierarchical bandit</em> approach
(where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level) possesses the nice
feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area,
at different scales, according to the evaluations observed so far, until eventually performing a local search around the global optima of
the function. The performance of the method is assessed in terms of the optimality of the returned solution as a function of the number of
function evaluations.</p><p>Our main contribution to the field of function optimization is a class of hierarchical optimistic algorithms designed for general
search spaces (such as metric spaces, trees, graphs, Euclidean spaces) with different algorithmic instantiations depending on whether
the evaluations are noisy or noiseless and whether some measure of the “smoothness” of the function is known or unknown. The performance
of the algorithms depends on the “local” behavior of the function around its global optima expressed in terms of the quantity of near-optimal
states measured with some metric. If this local smoothness of the function is known then one can design very efficient optimization
algorithms (with convergence rate independent of the space dimension). When this information is unknown, one can build adaptive techniques
which, in some cases, perform almost as well as when it is known.</p><p>In order to be self-contained, we start with a brief introduction to the stochastic multi-armed bandit problem in Chapter 1 and describe
the UCB (Upper Confidence Bound) strategy and several extensions. In Chapter 2 we present the Monte-Carlo Tree Search method applied to
Computer Go and show the limitations of previous algorithms such as UCT (UCB applied to Trees). This provides motivation for designing
theoretically well-founded optimistic optimization algorithms. The main contributions on hierarchical optimistic optimization are described
in Chapters 3 and 4 where the general setting of a semi-metric space is introduced and algorithms designed for optimizing a function assumed
to be locally smooth (around its maxima) with respect to a semi-metric are presented and analyzed. Chapter 3 considers the case when the
semi-metric is known and can be used by the algorithm, whereas Chapter 4 considers the case when it is not known and describes an adaptive
technique that does almost as well as when it is known. Finally in Chapter 5 we describe optimistic strategies for a specific structured
problem, namely the planning problem in Markov decision processes with infinite horizon discounted rewards.</p><h3>Suggested Citation</h3>Rémi Munos (2014), "From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning", Foundations and Trends® in Machine Learning: Vol. 7: No. 1, pp 1-129. http://dx.doi.org/10.1561/2200000038Mon, 20 Jan 2014 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-042A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning<h3>Abstract</h3>A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent
years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with
well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describes approximate dynamic
programming methods such as trajectory based value iteration, and finally moves to reinforcement learning methods such as Q-Learning, SARSA,
and least-squares policy iteration. We describe algorithms in a unified framework, giving pseudocode together with memory and iteration
complexity analysis for each. Empirical evaluations of these techniques with four representations across four domains, provide insight into
how these algorithms perform with various feature sets in terms of running time and performance.<h3>Suggested Citation</h3>Alborz Geramifard, Thomas J. Walsh, Stefanie Tellex, Girish Chowdhary, Nicholas Roy and Jonathan P. How (2013), "A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning", Foundations and Trends® in Machine Learning: Vol. 6: No. 4, pp 375-451. http://dx.doi.org/10.1561/2200000042Thu, 19 Dec 2013 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-039Learning with Submodular Functions: A Convex Optimization Perspective<h3>Abstract</h3>Submodular functions are relevant to machine learning for at least two reasons: (1) some problems may be expressed
directly as the optimization of submodular functions and (2) the Lovász extension of submodular functions provides a
useful set of regularization functions for supervised and unsupervised learning. In this monograph, we present the
theory of submodular functions from a convex analysis perspective, presenting tight links between certain polyhedra,
combinatorial optimization and convex optimization problems. In particular, we show how submodular function
minimization is equivalent to solving a wide variety of convex optimization problems. This allows the derivation of new
efficient algorithms for approximate and exact submodular function minimization with theoretical guarantees and good
practical performance. By listing many examples of submodular functions, we review various applications to machine
learning, such as clustering, experimental design, sensor placement, graphical model structure learning or subset
selection, as well as a family of structured sparsity-inducing norms that can be derived and used from submodular
functions.<h3>Suggested Citation</h3>Francis Bach (2013), "Learning with Submodular Functions: A Convex Optimization Perspective", Foundations and Trends® in Machine Learning: Vol. 6: No. 2-3, pp 145-373. http://dx.doi.org/10.1561/2200000039Wed, 04 Dec 2013 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-045Backward Simulation Methods for Monte Carlo Statistical Inference<h3>Abstract</h3>Monte Carlo methods, in particular those based on Markov chains and on interacting particle systems, are by now tools that are routinely used in machine learning. These methods have had a profound impact on statistical inference in a wide range of application areas where probabilistic models are used. Moreover, there are many algorithms in machine learning which are based on the idea of processing the data sequentially, first in the forward direction and then in the backward direction. In this tutorial, we will review a branch of Monte Carlo methods based on the forward–backward idea, referred to as backward simulators. These methods are useful for learning and inference in probabilistic models containing latent stochastic processes. The theory and practice of backward simulation algorithms have undergone a significant development in recent years and the algorithms keep finding new applications. The foundation for these methods is sequential Monte Carlo (SMC). SMC-based backward simulators are capable of addressing smoothing problems in sequential latent variable models, such as general, nonlinear/non-Gaussian state-space models (SSMs). However, we will also clearly show that the underlying backward simulation idea is by no means restricted to SSMs. Furthermore, backward simulation plays an important role in recent developments of Markov chain Monte Carlo (MCMC) methods. Particle MCMC is a systematic way of using SMC within MCMC. In this framework, backward simulation gives us a way to significantly improve the performance of the samplers. We review and discuss several related backward-simulation-based methods for state inference as well as learning of static parameters, both using a frequentistic and a Bayesian approach.<h3>Suggested Citation</h3>Fredrik Lindsten and Thomas B. Schön (2013), "Backward Simulation Methods for Monte Carlo Statistical Inference", Foundations and Trends® in Machine Learning: Vol. 6: No. 1, pp 1-143. http://dx.doi.org/10.1561/2200000045Fri, 30 Aug 2013 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-019Metric Learning: A Survey<h3>Abstract</h3>The <em>metric learning</em> problem is concerned with learning a distance function tuned to a particular task, and has been shown to be useful when used in conjunction with nearest-neighbor methods and other techniques that rely on distances or similarities. This survey presents an overview of existing research in metric learning, including recent progress on scaling to high-dimensional feature spaces and to data sets with an extremely large number of data points. A goal of the survey is to present as unified as possible a framework under which existing research on metric learning can be cast. The first part of the survey focuses on linear metric learning approaches, mainly concentrating on the class of Mahalanobis distance learning methods. We then discuss nonlinear metric learning approaches, focusing on the connections between the nonlinear and linear approaches. Finally, we discuss extensions of metric learning, as well as applications to a variety of problems in computer vision, text analysis, program analysis, and multimedia.<h3>Suggested Citation</h3>Brian Kulis (2013), "Metric Learning: A Survey", Foundations and Trends® in Machine Learning: Vol. 5: No. 4, pp 287-364. http://dx.doi.org/10.1561/2200000019Wed, 31 Jul 2013 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-044Determinantal Point Processes for Machine Learning<h3>Abstract</h3>Determinantal point processes (DPPs) are elegant probabilistic models of repulsion that arise in quantum physics and random matrix theory. In contrast to traditional structured models like Markov random fields, which become intractable and hard to approximate in the presence of negative correlations, DPPs offer efficient and exact algorithms for sampling, marginalization, conditioning, and other inference tasks. We provide a gentle introduction to DPPs, focusing on the intuitions, algorithms, and extensions that are most relevant to the machine learning community, and show how DPPs can be applied to real-world applications like finding diverse sets of high-quality search results, building informative summaries by selecting diverse sentences from documents, modeling nonoverlapping human poses in images or video, and automatically building timelines of important news stories.<h3>Suggested Citation</h3>Alex Kulesza and Ben Taskar (2012), "Determinantal Point Processes for Machine Learning", Foundations and Trends® in Machine Learning: Vol. 5: No. 2–3, pp 123-286. http://dx.doi.org/10.1561/2200000044Tue, 18 Dec 2012 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-024Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems<h3>Abstract</h3>Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration–exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this monograph, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.<h3>Suggested Citation</h3>Sébastien Bubeck and Nicolò Cesa-Bianchi (2012), "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Foundations and Trends® in Machine Learning: Vol. 5: No. 1, pp 1-122. http://dx.doi.org/10.1561/2200000024Wed, 12 Dec 2012 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-013An Introduction to Conditional Random Fields<h3>Abstract</h3>Many tasks involve predicting a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling. They combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This survey describes <em>conditional random fields</em>, a popular probabilistic method for structured prediction. CRFs have seen wide application in many areas, including natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large-scale CRFs. We do not assume previous knowledge of graphical modeling, so this survey is intended to be useful to practitioners in a wide variety of fields.<h3>Suggested Citation</h3>Charles Sutton and Andrew McCallum (2012), "An Introduction to Conditional Random Fields", Foundations and Trends® in Machine Learning: Vol. 4: No. 4, pp 267-373. http://dx.doi.org/10.1561/2200000013Thu, 23 Aug 2012 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-036Kernels for Vector-Valued Functions: A Review<h3>Abstract</h3>Kernel methods are among the most popular techniques in machine learning. From a regularization perspective they play a central role in regularization theory as they provide a natural choice for the hypotheses space and the regularization functional through the notion of reproducing kernel Hilbert spaces. From a probabilistic perspective they are the key in the context of Gaussian processes, where the kernel function is known as the covariance function. Traditionally, kernel methods have been used in supervised learning problems with scalar outputs and indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently there has been an increasing interest in methods that deal with multiple outputs, motivated partially by frameworks like multitask learning. In this monograph, we review different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.<h3>Suggested Citation</h3>Mauricio A. Álvarez, Lorenzo Rosasco and Neil D. Lawrence (2012), "Kernels for Vector-Valued Functions: A Review", Foundations and Trends® in Machine Learning: Vol. 4: No. 3, pp 195-266. http://dx.doi.org/10.1561/2200000036Tue, 19 Jun 2012 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-018Online Learning and Online Convex Optimization<h3>Abstract</h3>Online learning is a well established learning paradigm which has both theoretical and practical appeals. The goal of online learning is to make a sequence of accurate predictions given knowledge of the correct answer to previous prediction tasks and possibly additional available information. Online learning has been studied in several research fields including game theory, information theory, and machine learning. It also became of great interest to practitioners due the recent emergence of large scale applications such as online advertisement placement and online web ranking. In this survey we provide a modern overview of online learning. Our goal is to give the reader a sense of some of the interesting ideas and in particular to underscore the centrality of convexity in deriving efficient online learning algorithms. We do not mean to be comprehensive but rather to give a high-level, rigorous yet easy to follow, survey.<h3>Suggested Citation</h3>Shai Shalev-Shwartz (2012), "Online Learning and Online Convex Optimization", Foundations and Trends® in Machine Learning: Vol. 4: No. 2, pp 107-194. http://dx.doi.org/10.1561/2200000018Thu, 29 Mar 2012 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-015Optimization with Sparsity-Inducing Penalties<h3>Abstract</h3>Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate nonsmooth norms. The goal of this monograph is to present from a general perspective optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted ℓ<sub>2</sub>-penalized techniques, workingset and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view.<h3>Suggested Citation</h3>Francis Bach, Rodolphe Jenatton, Julien Mairal and Guillaume Obozinski (2012), "Optimization with Sparsity-Inducing Penalties", Foundations and Trends® in Machine Learning: Vol. 4: No. 1, pp 1-106. http://dx.doi.org/10.1561/2200000015Wed, 04 Jan 2012 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-026On the Concentration Properties of Interacting Particle Processes<h3>Abstract</h3>This monograph presents some new concentration inequalities for Feynman-Kac particle processes. We analyze different types of stochastic particle models, including particle profile occupation measures, genealogical tree based evolution models, particle free energies, as well as backward Markov chain particle models. We illustrate these results with a series of topics related to computational physics and biology, stochastic optimization, signal processing and Bayesian statistics, and many other probabilistic machine learning algorithms. Special emphasis is given to the stochastic modeling, and to the quantitative performance analysis of a series of advanced Monte Carlo methods, including particle filters, genetic type island models, Markov bridge models, and interacting particle Markov chain Monte Carlo methodologies.<h3>Suggested Citation</h3>Pierre Del Moral, Peng Hu and Liming Wu (2012), "On the Concentration Properties of Interacting Particle Processes", Foundations and Trends® in Machine Learning: Vol. 3: No. 3–4, pp 225-389. http://dx.doi.org/10.1561/2200000026Thu, 26 Jan 2012 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-035Randomized Algorithms for Matrices and Data<h3>Abstract</h3>Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, largely since matrices are popular structures with which to model data drawn from a wide range of application domains, and this work was performed by individuals from many different research communities. While the most obvious benefit of randomization is that it can lead to faster algorithms, either in worst-case asymptotic theory and/or numerical implementation, there are numerous other benefits that are at least as important. For example, the use of randomization can lead to simpler algorithms that are easier to analyze or reason about when applied in counterintuitive settings; it can lead to algorithms with more interpretable output, which is of interest in applications where analyst time rather than just computational time is of interest; it can lead implicitly to regularization and more robust output; and randomized algorithms can often be organized to exploit modern computational architectures better than classical numerical methods.<h3>Suggested Citation</h3>Michael W. Mahoney (2011), "Randomized Algorithms for Matrices and Data", Foundations and Trends® in Machine Learning: Vol. 3: No. 2, pp 123-224. http://dx.doi.org/10.1561/2200000035Tue, 22 Nov 2011 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-016Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers<h3>Abstract</h3>Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the <em>alternating direction method of multipliers</em> is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ<sub>1</sub> problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop Map Reduce implementations.<h3>Suggested Citation</h3>Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato and Jonathan Eckstein (2011), "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers", Foundations and Trends® in Machine Learning: Vol. 3: No. 1, pp 1-122. http://dx.doi.org/10.1561/2200000016Tue, 26 Jul 2011 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-002Dimension Reduction: A Guided Tour<h3>Abstract</h3>We give a tutorial overview of several foundational methods for dimension reduction. We divide the methods into projective methods and methods that model the manifold on which the data lies. For projective methods, we review projection pursuit, principal component analysis (PCA), kernel PCA, probabilistic PCA, canonical correlation analysis (CCA), kernel CCA, Fisher discriminant analysis, oriented PCA, and several techniques for sufficient dimension reduction. For the manifold methods, we review multidimensional scaling (MDS), landmark MDS, Isomap, locally linear embedding, Laplacian eigenmaps, and spectral clustering. Although this monograph focuses on foundations, we also provide pointers to some more modern techniques. We also describe the correlation dimension as one method for estimating the intrinsic dimension, and we point out that the notion of dimension can be a scale-dependent quantity. The Nyström method, which links several of the manifold algorithms, is also reviewed. We use a publicly available data set to illustrate some of the methods. The goal is to provide a self-contained overview of key concepts underlying many of these algorithms, and to give pointers for further reading.<h3>Suggested Citation</h3>Christopher J. C. Burges (2010), "Dimension Reduction: A Guided Tour", Foundations and Trends® in Machine Learning: Vol. 2: No. 4, pp 275-365. http://dx.doi.org/10.1561/2200000002Wed, 18 Aug 2010 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-008Clustering Stability: An Overview<h3>Abstract</h3>A popular method for selecting the number of clusters is based on stability arguments: one chooses the number of clusters such that the corresponding clustering results are "most stable". In recent years, a series of papers has analyzed the behavior of this method from a theoretical point of view. However, the results are very technical and difficult to interpret for non-experts. In this monograph we give a high-level overview about the existing literature on clustering stability. In addition to presenting the results in a slightly informal but accessible way, we relate them to each other and discuss their different implications.<h3>Suggested Citation</h3>Ulrike von Luxburg (2010), "Clustering Stability: An Overview", Foundations and Trends® in Machine Learning: Vol. 2: No. 3, pp 235-274. http://dx.doi.org/10.1561/2200000008Wed, 21 Apr 2010 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-005A Survey of Statistical Network Models<h3>Abstract</h3>Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active "network community" and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online "networking communities" such as <em>Facebook, MySpace</em>, and <em>LinkedIn</em>, and a host of more specialized professional network communities has intensified interest in the study of networks and network data.<p>Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.</p><h3>Suggested Citation</h3>Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg and Edoardo M. Airoldi (2010), "A Survey of Statistical Network Models", Foundations and Trends® in Machine Learning: Vol. 2: No. 2, pp 129-233. http://dx.doi.org/10.1561/2200000005Tue, 16 Feb 2010 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-006Learning Deep Architectures for AI<h3>Abstract</h3>Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), one may need <em>deep architectures</em>. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This monograph discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.<h3>Suggested Citation</h3>Yoshua Bengio (2009), "Learning Deep Architectures for AI", Foundations and Trends® in Machine Learning: Vol. 2: No. 1, pp 1-127. http://dx.doi.org/10.1561/2200000006Sun, 15 Nov 2009 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-003Learning Representation and Control in Markov Decision Processes: New Frontiers<h3>Abstract</h3>This paper describes a novel machine learning framework for solving sequential decision problems called Markov decision processes (MDPs) by iteratively computing low-dimensional representations and approximately optimal policies. A unified mathematical framework for learning representation and optimal control in MDPs is presented based on a class of singular operators called Laplacians, whose matrix representations have nonpositive off-diagonal elements and zero row sums. Exact solutions of discounted and average-reward MDPs are expressed in terms of a generalized spectral inverse of the Laplacian called the <em>Drazin inverse</em>. A generic algorithm called <em>representation policy iteration</em> (RPI) is presented which interleaves computing low-dimensional representations and approximately optimal policies. Two approaches for dimensionality reduction of MDPs are described based on geometric and reward-sensitive regularization, whereby low-dimensional representations are formed by <em>diagonalization</em> or <em>dilation</em> of Laplacian operators. Model-based and model-free variants of the RPI algorithm are presented; they are also compared experimentally on discrete and continuous MDPs. Some directions for future work are finally outlined.<h3>Suggested Citation</h3>Sridhar Mahadevan (2009), "Learning Representation and Control in Markov Decision Processes: New Frontiers", Foundations and Trends® in Machine Learning: Vol. 1: No. 4, pp 403-565. http://dx.doi.org/10.1561/2200000003Tue, 02 Jun 2009 00:00:00 +0200http://nowpublishers.com/article/Details/MAL-004Property Testing: A Learning Theory Perspective<h3>Abstract</h3>Property testing deals with tasks where the goal is to distinguish between the case that an object (e.g., function or graph) has a prespecified property (e.g., the function is linear or the graph is bipartite) and the case that it differs significantly from any such object. The task should be performed by observing only a very small part of the object, in particular by querying the object, and the algorithm is allowed a small failure probability.<p>One view of property testing is as a relaxation of learning the object (obtaining an approximate representation of the object). Thus property testing algorithms can serve as a preliminary step to learning. That is, they can be applied in order to select, very efficiently, what hypothesis class to use for learning. This survey takes the learning-theory point of view and focuses on results for testing properties of functions that are of interest to the learning theory community. In particular, we cover results for testing algebraic properties of functions such as linearity, testing properties defined by concise representations, such as having a small DNF representation, and more.</p><h3>Suggested Citation</h3>Dana Ron (2008), "Property Testing: A Learning Theory Perspective", Foundations and Trends® in Machine Learning: Vol. 1: No. 3, pp 307-402. http://dx.doi.org/10.1561/2200000004Thu, 27 Nov 2008 00:00:00 +0100http://nowpublishers.com/article/Details/MAL-001Graphical Models, Exponential Families, and Variational Inference<h3>Abstract</h3>The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.<h3>Suggested Citation</h3>Martin J. Wainwright and Michael I. Jordan (2008), "Graphical Models, Exponential Families, and Variational Inference", Foundations and Trends® in Machine Learning: Vol. 1: No. 1–2, pp 1-305. http://dx.doi.org/10.1561/2200000001Tue, 18 Nov 2008 00:00:00 +0100