Minimal Exploration in Episodic Reinforcement Learning

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: The exploration-exploitation trade-off is a fundamental dilemma faced by reinforcement learning algorithms. This dilemma is also central to the design of several state-of-the-art bandit algorithms. We take inspiration from these algorithms to design reinforcement learning algorithms for the episodic setting. In this work, we develop two algorithms based on the principle of optimism in the face of uncertainty to minimize exploration. The idea is that the agent follows the optimal policy of a surrogate model, called the optimistic model, which stays close to the true model but yields a higher long-term reward. We show through extensive experiments on synthetic toy MDPs that the performance of our algorithms is in line with (and even better when the reward dynamics are known) that of algorithms based on a Bayesian treatment of the problem and of other algorithms built on the optimism-in-the-face-of-uncertainty principle. The algorithms proposed in this thesis outperform the Bayesian algorithms in terms of the variance of the regret achieved over multiple runs. Another contribution is the derivation of several regret lower bounds, including problem-specific (both asymptotic and non-asymptotic) and minimax lower bounds, valid for any uniformly good algorithm in the episodic setting.
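
To make the optimism-in-the-face-of-uncertainty idea described in the abstract concrete, below is a minimal sketch of optimistic planning for an episodic, tabular MDP. It is not the thesis's exact algorithm; it follows the common UCB-VI-style recipe in which the empirical model is inflated by confidence bonuses and the agent then acts greedily with respect to the resulting optimistic Q-values. The bonus form and the confidence parameter `delta` are illustrative assumptions.

```python
# Sketch of optimistic planning for an episodic tabular MDP (UCB-VI-style),
# assuming visit counts, summed rewards, and transition counts are maintained
# by the agent between episodes. Not the thesis's algorithm.
import numpy as np

def optimistic_value_iteration(counts, rewards_sum, trans_counts, H, delta=0.05):
    """One round of optimistic backward induction from empirical estimates.

    counts[s, a]          -- number of visits to (s, a)
    rewards_sum[s, a]     -- summed observed rewards at (s, a)
    trans_counts[s, a, t] -- observed transitions (s, a) -> t
    H                     -- episode horizon
    Returns an optimistic Q-table of shape (H, S, A).
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                      # avoid division by zero
    r_hat = rewards_sum / n                        # empirical mean rewards
    p_hat = trans_counts / n[:, :, None]           # empirical transition kernel
    # Hoeffding-style exploration bonus (one simple, common choice).
    bonus = H * np.sqrt(np.log(S * A * H / delta) / (2 * n))

    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        # Optimistic Bellman backup, clipped at the maximum achievable return H.
        Q[h] = np.minimum(r_hat + bonus + p_hat @ V[h + 1], H)
        V[h] = Q[h].max(axis=1)
    return Q
```

In each episode the agent would act greedily with respect to `Q[h]` at step `h`, then update the counts with the observed transitions and rewards before replanning.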
