Reinforcement Learning – Intelligent Weighting of Monte Carlo and Temporal Differences
Abstract: In reinforcement learning, the way the value function is updated determines how information spreads across the state/state-action space, which in turn shapes the value-based control policy. It is important that information propagates across the value domain effectively. Two common ways to update the value function are Monte Carlo updating and temporal-difference updating; they are two opposite extremes. Monte Carlo updates episodically: fully played-out episodes are used to collect the environment responses and rewards, and the value function is updated at the end of every episode. A downside is that Monte Carlo updating needs a large number of episodes and time steps to converge to an accurate result. The upside is that it yields an unbiased approximation of the value function. In settings such as simulations and small real-world problems it can be applied successfully, but for larger problems it becomes costly in learning time and computing power. Temporal-difference updating, on the other hand, can in some cases spread information across the value domain more effectively. In contrast to the Monte Carlo update, it performs an incremental update at every time step, combining the newest information with an approximation of the expected total discounted accumulated reward for the rest of the episode. In this way the agent learns at every time step, which leads to a more effective updating of the Q-value function. The downside is that the approximation introduces bias; another drawback is that the basic algorithm only passes information one time step backward in time. By combining Monte Carlo and temporal-difference updates, the best of the two can be exploited. A popular way to do so is to weight the importance of the two. The method is called TD(λ), where λ is a tuning parameter that sets how much to trust the long-term update versus
the step-wise update. TD(λ = 0) takes one step in the environment, bootstraps the rest, and updates. TD(λ = 1) updates with the received rewards and hence makes no use of any approximation. A value of λ in between weights the importance of the two. The optimal choice of λ depends on the specific situation and on many factors of both the environment and the control problem itself. This thesis proposes a way to choose a proper value of λ dynamically and intelligently, together with the values of other hyperparameters used in the reinforcement learning strategy. The main idea is to use a dropout technique as an inferential prediction of the uncertainty in the system. High inferential uncertainty reflects a less trustworthy Q-value function, and the tuning parameters can be chosen accordingly. In situations where information has propagated throughout the network and bounds the inferential uncertainty, for example, a lower value of λ and of ε (the exploit-versus-explore parameter) can hopefully be used advantageously.
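The two update rules contrasted above can be sketched on a toy random-walk chain. This is an illustrative assumption, not the thesis's own experiment; the chain, function names, and step sizes are chosen only to show the structure of each rule:

```python
import random

def run_episode(n_states=5, p_right=0.7):
    """Biased random walk on states 0..n_states; reward 1 is given
    only on reaching the right end, 0 otherwise (a standard toy chain).
    Returns the trajectory as a list of (state, reward, next_state)."""
    s, traj = n_states // 2, []
    while 0 < s < n_states:
        s2 = s + 1 if random.random() < p_right else s - 1
        r = 1.0 if s2 == n_states else 0.0
        traj.append((s, r, s2))
        s = s2
    return traj

def mc_update(V, traj, alpha=0.1, gamma=1.0):
    """Monte Carlo: wait until the episode ends, then move every
    visited state toward its actual return (unbiased, high variance)."""
    G = 0.0
    for s, r, _ in reversed(traj):
        G = r + gamma * G                  # accumulate the true return
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s2, alpha=0.1, gamma=1.0):
    """TD(0): update immediately after one step, bootstrapping the
    remaining return with the current estimate V[s2] (biased)."""
    V[s] += alpha * (r + gamma * V[s2] - V[s])
```

Running both on many episodes makes the trade-off in the abstract visible: `td0_update` changes the estimate at every time step but only passes information one step backward, while `mc_update` touches the whole trajectory but only after the episode has finished.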
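The λ-weighting described above is commonly implemented with eligibility traces. A minimal sketch (the function name and the tabular setting are illustrative assumptions): λ = 0 reduces to the one-step TD update, and with γ = 1 and λ = 1 the episode's accumulated corrections reproduce the Monte Carlo return.

```python
def td_lambda_update(V, traj, lam, alpha=0.1, gamma=1.0):
    """Tabular TD(λ) with accumulating eligibility traces, applied to
    one episode given as a list of (state, reward, next_state)."""
    e = [0.0] * len(V)                      # eligibility trace per state
    for s, r, s2 in traj:
        delta = r + gamma * V[s2] - V[s]    # one-step TD error
        e[s] += 1.0                         # mark s as recently visited
        for i in range(len(V)):
            V[i] += alpha * delta * e[i]    # credit earlier states too
            e[i] *= gamma * lam             # decay credit backward in time
```

With λ = 0 the trace dies immediately and only the current state is updated; with λ = 1 every earlier state keeps full credit, so the TD errors telescope into the Monte Carlo return. Intermediate λ weights the two, which is exactly the knob the thesis proposes to tune dynamically.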