Exploration-Exploitation Trade-off Approaches in Multi-Armed Bandit

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Duc Huy Le; [2023]

Abstract: The multi-armed bandit, a popular framework for sequential decision-making problems, has recently gained significant attention due to its numerous applications. In a multi-armed bandit, an agent faces the central challenge of choosing between exploiting its current beliefs in the hope of a high reward and exploring to improve its knowledge of the environment, and any good strategy has to balance the two efficiently. Focusing on the Bernoulli reward signal, in this project we studied two canonical algorithms, ε-greedy and Thompson Sampling, and proposed the Surprise-based Exploration policy, a novel method inspired by intrinsic motivation in psychology that can be incorporated into many existing strategies to promote exploration of lesser-known actions. Our experimental results show that Thompson Sampling and our proposed policy are highly efficient with a small set of arms, but their performance gradually drops as the number of arms increases. We also tackled the challenge of exploration with a large number of arms and introduced a solution that allocates a portion of the time horizon to reducing the arm set with our proposed Successive-Reject Best Arms Identification algorithm. The empirical studies show that our method significantly improves the performance of Thompson Sampling; however, it fails to improve ε-greedy.
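For readers unfamiliar with the two canonical baselines named above, the following is a minimal sketch of ε-greedy and Thompson Sampling on a Bernoulli bandit. It is not the thesis's implementation: the arm means, the ε value of 0.1, the Beta(1, 1) priors, and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sampling(true_means, horizon):
    """Thompson Sampling for Bernoulli arms with Beta(1, 1) priors."""
    k = len(true_means)
    successes = np.zeros(k)   # observed 1-rewards per arm
    failures = np.zeros(k)    # observed 0-rewards per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a mean estimate from each arm's Beta posterior,
        # then play the arm whose sample is largest.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = rng.binomial(1, true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

def epsilon_greedy(true_means, horizon, eps=0.1):
    """ε-greedy: explore uniformly with probability eps, else exploit."""
    k = len(true_means)
    counts = np.zeros(k)
    means = np.zeros(k)       # running empirical mean per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < eps:
            arm = int(rng.integers(k))      # explore a random arm
        else:
            arm = int(np.argmax(means))     # exploit current belief
        reward = rng.binomial(1, true_means[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward
    return total_reward

if __name__ == "__main__":
    arms = [0.3, 0.5, 0.7]  # hypothetical Bernoulli success probabilities
    print("Thompson Sampling reward:", thompson_sampling(arms, 10_000))
    print("ε-greedy reward:         ", epsilon_greedy(arms, 10_000))
```

The contrast the sketch makes visible matches the abstract's framing: ε-greedy explores blindly at a fixed rate, while Thompson Sampling directs exploration toward arms whose posteriors are still uncertain, which is why its advantage is most pronounced when the arm set is small relative to the horizon.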