Foundations and Trends® in Machine Learning >
Vol 7 > Issue 1

By
**Rémi Munos**, INRIA Lille – Nord Europe, France, remi.munos@inria.fr

Rémi Munos (2014), "From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning", Foundations and Trends® in Machine Learning: Vol. 7: No. 1, pp 1-129. http://dx.doi.org/10.1561/2200000038

© 2014 R. Munos

Online learning, Optimization, Game theoretic learning, Algorithmic game theory, Operations Research, Markov Decision Processes, Stochastic Optimization

Download article
**In this article:**

This work covers several aspects of the *optimism in the face of uncertainty* principle applied to large scale optimization
problems under finite numerical budget. The initial motivation for the research reported here originated from the empirical success of the
so-called *Monte-Carlo Tree Search* method popularized in Computer Go and further extended to many other games as well as optimization and
planning problems. Our objective is to contribute to the development of theoretical foundations of the field by characterizing the complexity
of the underlying optimization problems and designing efficient algorithms with performance guarantees.

The main idea presented here is that it is possible to decompose a complex decision making problem (such as an optimization problem in a
large search space) into a sequence of elementary decisions, where each decision of the sequence is solved using a *(stochastic) multi-armed
bandit* (simple mathematical model for decision making in stochastic environments). This so-called *hierarchical bandit* approach
(where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level) possesses the nice
feature of starting the exploration by a quasi-uniform sampling of the space and then focusing progressively on the most promising area,
at different scales, according to the evaluations observed so far, until eventually performing a local search around the global optima of
the function. The performance of the method is assessed in terms of the optimality of the returned solution as a function of the number of
function evaluations.

Our main contribution to the field of function optimization is a class of hierarchical optimistic algorithms designed for general search spaces (such as metric spaces, trees, graphs, Euclidean spaces) with different algorithmic instantiations depending on whether the evaluations are noisy or noiseless and whether some measure of the “smoothness” of the function is known or unknown. The performance of the algorithms depends on the “local” behavior of the function around its global optima expressed in terms of the quantity of near-optimal states measured with some metric. If this local smoothness of the function is known then one can design very efficient optimization algorithms (with convergence rate independent of the space dimension). When this information is unknown, one can build adaptive techniques which, in some cases, perform almost as well as when it is known.

In order to be self-contained, we start with a brief introduction to the stochastic multi-armed bandit problem in Chapter 1 and describe the UCB (Upper Confidence Bound) strategy and several extensions. In Chapter 2 we present the Monte-Carlo Tree Search method applied to Computer Go and show the limitations of previous algorithms such as UCT (UCB applied to Trees). This provides motivation for designing theoretically well-founded optimistic optimization algorithms. The main contributions on hierarchical optimistic optimization are described in Chapters 3 and 4 where the general setting of a semi-metric space is introduced and algorithms designed for optimizing a function assumed to be locally smooth (around its maxima) with respect to a semi-metric are presented and analyzed. Chapter 3 considers the case when the semi-metric is known and can be used by the algorithm, whereas Chapter 4 considers the case when it is not known and describes an adaptive technique that does almost as well as when it is known. Finally in Chapter 5 we describe optimistic strategies for a specific structured problem, namely the planning problem in Markov decision processes with infinite horizon discounted rewards.

142 pp. $95.00

Buy book (pb)
142 pp. $120.00

Buy E-book (.pdf)
1. The Stochastic Multi-armed Bandit Problem

2. Monte-Carlo Tree Search

3. Optimistic Optimization with Known Smoothness

4. Optimistic Optimization with Unknown Smoothness

5. Optimistic Planning

6. Conclusion

Acknowledgments

References

*From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning* covers several
aspects of the "optimism in the face of uncertainty" principle for large scale optimization problems under finite numerical budget.

The monograph’s initial motivation came from the empirical success of the so-called "Monte-Carlo Tree Search" method popularized in Computer Go and further extended to many other games as well as optimization and planning problems. It lays out the theoretical foundations of the field by characterizing the complexity of the optimization problems and designing efficient algorithms with performance guarantees.

The main direction followed in this monograph consists in decomposing a complex decision making problem (such as an optimization problem in a large search space) into a sequence of elementary decisions, where each decision of the sequence is solved using a stochastic "multi-armed bandit" (mathematical model for decision making in stochastic environments). This defines a hierarchical search which possesses the nice feature of starting the exploration by a quasi-uniform sampling of the space and then focusing, at different scales, on the most promising areas (using the optimistic principle) until eventually performing a local search around the global optima of the function.

This monograph considers the problem of function optimization in general search spaces (such as metric spaces, structured spaces, trees, and graphs) as well as the problem of planning in Markov decision processes. Its main contribution is a class of hierarchical optimistic algorithms with different algorithmic instantiations depending on whether the evaluations are noisy or noiseless and whether some measure of the local "smoothness" of the function around the global maximum is known or unknown.