Safe learning for control: Combining disturbance estimation, reachability analysis and reinforcement learning with systematic exploration

University essay from KTH/Reglerteknik

Author: Caroline Heidenreich (2017)


Abstract: Learning to control an uncertain system is a problem with a plethora of applications in various engineering fields. In the majority of practical scenarios, one wishes that the learning process terminates quickly and does not violate safety limits on key variables. It is particularly appealing to learn the control policy directly from experiments, since this eliminates the need to first derive an accurate physical model of the system. The main challenge when using such an approach is to ensure that safety constraints are satisfied during the learning process.

This thesis investigates an approach to safe learning that relies on a partly known state-space model of the system and regards the unknown dynamics as an additive bounded disturbance. Based on an initial conservative disturbance estimate, a safe set and the corresponding safe control are calculated using a Hamilton-Jacobi-Isaacs reachability analysis. Within the computed safe set, a variant of the celebrated Q-learning algorithm, which systematically explores the uncertain areas of the state space, is employed to learn a control policy. Whenever the system state hits the boundary of the safe set, a safety-preserving control is applied to bring the system back to safety. The initial disturbance range is updated on-line using Gaussian process regression based on the measured data, and this less conservative disturbance estimate is used to increase the size of the safe set. To the best of our knowledge, this thesis provides the first attempt at combining these theoretical tools from reinforcement learning and reachability analysis for safe learning.

We evaluate our approach on an inverted pendulum system. The proposed algorithm manages to learn a policy that does not violate the pre-specified safety constraints. We observe that performance is significantly improved when we incorporate systematic exploration to make sure that an optimal policy is learned everywhere in the safe set. Finally, we outline some promising directions for future research beyond the scope of this thesis.
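To make the pipeline concrete, the following minimal Python sketch mirrors the loop the abstract describes: Q-learning with an exploration bonus inside a safe set, a safety-preserving override at the boundary, and on-line tightening of the disturbance bound. The toy double-integrator dynamics, the analytic braking-based stand-in for the Hamilton-Jacobi-Isaacs safe set, and the running-maximum disturbance estimate (in place of Gaussian process regression) are all illustrative assumptions, not the thesis's actual implementation.

    import numpy as np

    DT, U_MAX, X_LIM = 0.05, 2.0, 1.0          # time step, input bound, safety limit |pos| <= X_LIM
    ACTIONS = np.array([-U_MAX, 0.0, U_MAX])   # discrete action set for Q-learning
    rng = np.random.default_rng(0)

    def dynamics(x, u):
        # Toy double integrator standing in for the pendulum; d is the additive
        # bounded disturbance the learner must estimate from data.
        d = rng.uniform(-0.3, 0.3)
        pos, vel = x
        return np.array([pos + vel * DT, vel + (u + d) * DT]), d

    def in_safe_set(x, d_bound):
        # Analytic stand-in for the HJI reachability result: states from which
        # full braking keeps |pos| <= X_LIM under the worst-case disturbance.
        pos, vel = x
        brake = max(U_MAX - d_bound, 1e-6)     # guaranteed deceleration authority
        return abs(pos) + vel ** 2 / (2.0 * brake) <= X_LIM

    def safe_control(x):
        # Safety-preserving action applied when the safe-set boundary is hit.
        return -U_MAX if x[1] > 0 else U_MAX

    def discretize(x):
        return tuple(int(np.clip((v + X_LIM) / (2 * X_LIM) * 10, 0, 10)) for v in x)

    Q, visits = {}, {}                         # tabular Q-values and visit counts
    d_bound, d_seen = 1.0, 0.0                 # conservative initial disturbance bound
    x = np.array([0.0, 0.0])

    for t in range(5000):
        s = discretize(x)
        if in_safe_set(x, d_bound):
            # Systematic exploration: an optimism bonus favours rarely tried
            # state-action pairs, so a policy is learned everywhere in the safe set.
            scores = [Q.get((s, a), 0.0) + 1.0 / (1 + visits.get((s, a), 0))
                      for a in range(len(ACTIONS))]
            a = int(np.argmax(scores))
            visits[(s, a)] = visits.get((s, a), 0) + 1
            u = ACTIONS[a]
        else:
            a, u = None, safe_control(x)       # boundary hit: override the learner
        x_next, d = dynamics(x, u)
        if a is not None:                      # ordinary Q-learning update
            s2 = discretize(x_next)
            best = max(Q.get((s2, b), 0.0) for b in range(len(ACTIONS)))
            Q[(s, a)] += 0.1 * (-x_next[0] ** 2 + 0.95 * best - Q[(s, a)]) if (s, a) in Q \
                else 0.1 * (-x_next[0] ** 2 + 0.95 * best)
        d_seen = max(d_seen, abs(d))
        if (t + 1) % 500 == 0:                 # periodic re-estimation (GP regression in the thesis)
            d_bound = min(d_bound, 1.5 * d_seen)   # a tighter bound enlarges the safe set
        x = x_next

The key design point the sketch preserves is the separation of concerns: the learner is free to explore only while the reachability-based safe set certifies that a recovery control exists, and the safe set itself grows as the measured data justify a less conservative disturbance bound.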
