Fine-tuning Bot Play Styles From Demonstration

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Felicia Fredriksson; [2023]


Abstract: In recent years, Reinforcement Learning (RL) has successfully been used to train agents for games. Nonetheless, the game industry still needs bots that not only succeed in their environments but also act human-like while playing. Additionally, there is great value in changing the play style of bots to create different characters. One way of addressing this problem is to fine-tune the bots using human demonstrations to change their behaviour and play style. This thesis explores the Learning from Demonstration algorithms Generative Adversarial Imitation Learning (GAIL), Wasserstein Generative Adversarial Imitation Learning (WGAIL), Wasserstein Generative Adversarial Imitation Learning with Gradient Penalty (WGAIL-GP), Wasserstein Adversarial Imitation Learning (WAIL) and Behavioural Cloning (BC) on the task of fine-tuning a pre-trained policy's play style while maintaining high performance in the environment.

The empirical study consisted of three stages. First, GAIL and its newer variants were tested in two simpler RL environments, CartPole-v1 and MountainCar-v0. Second, the behaviour of a pre-trained policy was fine-tuned in the simpler environment, CartPole-v1. The final stage compared the performance of BC and the best-performing GAIL variant when fine-tuning a pre-trained policy to change its play style in the complex game environment Racket Club. For Racket Club, personas were introduced as the demonstrating experts to enable the use of game statistics for play-style evaluation. From the game statistics, a play-style evaluation method was developed using the cosine similarity metric and visualizations in order to quantify and identify the changes in play style after fine-tuning the policy.

All GAIL variants solved CartPole-v1 when trained ab initio. Only GAIL and WAIL were successful when training a policy in MountainCar-v0, returning rewards of −146.06 ± 8.95 and −140.57 ± 12.71 respectively; therefore only GAIL and WAIL were used in the subsequent stages. WAIL was able to fine-tune the desired behaviour while maintaining the maximum reward in CartPole-v1, but it showed noticeable training instabilities that led to large variations in the experimental outcomes. GAIL was also able to fine-tune a pre-trained policy in CartPole-v1 and showed no such instabilities. When fine-tuning the play style of a pre-trained bot in Racket Club, both GAIL and WAIL suffered from catastrophic forgetting. BC incurred a performance loss but still produced bots that had picked up some of the characteristics of the expert play styles. A plausible way to prevent the performance loss in BC is to add regularization to counter overfitting. For future work, pre-training the discriminator when fine-tuning with GAIL might prevent the generator from suffering catastrophic forgetting.

The conclusions drawn from this study are that GAIL and WAIL are both capable of fine-tuning a pre-trained policy's behaviour in a lower-complexity environment. In the more complex game environment, however, both algorithms suffered from catastrophic forgetting and BC outperformed them on this task.
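The abstract mentions a play-style evaluation method built on the cosine similarity between game-statistic vectors. The thesis itself is not quoted here, so the following is only a minimal sketch of that general idea: the statistic names and numbers are hypothetical placeholders, not data from the study.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two game-statistics vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical per-match statistics (e.g. shot rate, movement, net play, rally length),
# aggregated over evaluation episodes. Values are illustrative only.
expert_persona = np.array([0.72, 0.15, 0.40, 0.88])
finetuned_bot  = np.array([0.65, 0.20, 0.35, 0.80])
baseline_bot   = np.array([0.30, 0.55, 0.10, 0.45])

print("fine-tuned vs expert:", cosine_similarity(finetuned_bot, expert_persona))
print("baseline   vs expert:", cosine_similarity(baseline_bot, expert_persona))
```

A higher similarity between the fine-tuned bot and the demonstrating persona, relative to the baseline, would indicate that fine-tuning shifted the bot's play style toward the expert's.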
