Authors
Michael Kearns, Satinder Singh
Publication date
2002/11
Journal
Machine Learning
Volume
49
Pages
209-232
Publisher
Kluwer Academic Publishers
Description
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is bounded below by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
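To illustrate the "explicit explore or exploit" idea described above, here is a minimal, hedged sketch. The toy random MDP, the visit threshold m_known, the horizon T, and the crude exploit-vs-explore decision rule are illustrative assumptions and not the paper's actual constructions or constants; the sketch only shows the general shape of maintaining a set of "known" states, doing balanced wandering in unknown states, and planning in an empirical model to either exploit or seek out unknown states.

```python
import numpy as np

# Hedged sketch of an "explore or exploit" loop on a toy tabular MDP.
# The MDP, m_known, T, and the decision rule below are illustrative only.

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2

# Illustrative random MDP: P[s, a, s'] transition probabilities, R[s, a] rewards in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

T = 15         # planning horizon, standing in for the mixing/horizon time
m_known = 20   # visits per (state, action) before a state counts as "known"

counts = np.zeros((n_states, n_actions), dtype=int)
trans = np.zeros((n_states, n_actions, n_states), dtype=int)
rew_sum = np.zeros((n_states, n_actions))

def is_known(s):
    return bool(np.all(counts[s] >= m_known))

def greedy_policy(P_hat, R_hat):
    """T-step value iteration on an empirical model; returns (policy, values)."""
    V = np.zeros(n_states)
    for _ in range(T):
        Q = R_hat + np.einsum('sap,p->sa', P_hat, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

def empirical_models():
    """Empirical MDP over known states; unknown states become self-loops that
    pay 0 under exploitation and 1 under exploration (a crude stand-in for an
    absorbing 'unknown' state)."""
    P_hat = np.zeros_like(P)
    R_exploit = np.zeros_like(R)
    R_explore = np.zeros_like(R)
    for s in range(n_states):
        for a in range(n_actions):
            if is_known(s):
                P_hat[s, a] = trans[s, a] / counts[s, a]
                R_exploit[s, a] = rew_sum[s, a] / counts[s, a]
            else:
                P_hat[s, a, s] = 1.0
                R_explore[s, a] = 1.0
    return P_hat, R_exploit, R_explore

s = 0
total_reward = 0.0
n_steps = 20000
for step in range(n_steps):
    if not is_known(s):
        a = int(counts[s].argmin())          # balanced wandering: try least-tried action
    else:
        P_hat, R_exploit, R_explore = empirical_models()
        pi_exploit, V_exploit = greedy_policy(P_hat, R_exploit)
        pi_explore, V_explore = greedy_policy(P_hat, R_explore)
        # Simplified decision: exploit when the estimated T-step return is already
        # decent, otherwise follow the policy that reaches unknown states fastest.
        a = int(pi_exploit[s]) if V_exploit[s] >= 0.5 * T else int(pi_explore[s])
    # Execute the action in the true MDP and update the empirical model.
    s_next = int(rng.choice(n_states, p=P[s, a]))
    r = R[s, a]
    counts[s, a] += 1
    trans[s, a, s_next] += 1
    rew_sum[s, a] += r
    total_reward += r
    s = s_next

print(f"average per-step reward: {total_reward / n_steps:.3f}")
```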
Total citations
Yearly citation counts, 2001-2024 (Google Scholar citation histogram)