Mixing Music Using Deep Reinforcement Learning

University essay from KTH/School of Electrical Engineering and Computer Science (EECS)

Author: Viktor Kronvall; [2019]


Abstract: Deep Reinforcement Learning has recently seen good results in tasks such as board games, computer games, and the control of autonomous vehicles. State-of-the-art autonomous DJ systems that generate mixed audio hard-code the mixing strategy, commonly as a cross-fade transition. This research investigates whether Deep Reinforcement Learning is an appropriate method for learning a mixing strategy that can yield more expressive and varied mixes than the hard-coded mixing strategies by adapting the strategy to the songs played.

To investigate this, a system named DeepFADE was constructed. DeepFADE was designed as a three-tier system of hierarchical Deep Reinforcement Learning models. The first tier selects an initial song and limits the song collection to a smaller subset. The second tier selects when to transition to the next song by loading the next song at pre-selected cue points. The third tier is responsible for generating a transition between the two loaded songs according to the mixing strategy. Two Deep Reinforcement Learning algorithms were evaluated: A3C and Dueling DQN. Convolutional and residual neural networks were used to train the reinforcement learning policies. Reward functions were designed as combinations of heuristic functions that evaluate the mixing strategy according to several important aspects of a DJ mix, such as alignment of beats, stability in output volume, tonal consonance, and time between song transitions.

The trained models yield policies that are either unable to create transitions between songs or produce strategies that are similar regardless of the songs being played. Thus, the learned mixing strategies were not more expressive than hard-coded cross-fade mixing strategies. The training suffers from reward hacking, which was argued to be caused by the agent's tendency to focus on optimizing only some of the heuristics. The reward hacking was mitigated somewhat by the design of more elaborate rewards that guide the policy to a larger extent.

A survey was conducted with a sample size of n = 11. The small sample size means that no statistically significant conclusions can be drawn. However, the mixes generated by the trained policy were rated as more enjoyable than those produced by a randomized mixing strategy. The convergence rate of the training is slow, and training time is limited not only by the optimization of the neural networks but also by the generation of audio used during training. Due to the limited computational resources available, it is not possible to draw any clear conclusions about whether the proposed method is appropriate for constructing the mixing strategy.
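The abstract describes the reward as a combination of heuristic scores but gives no implementation details. The Python sketch below is a minimal illustration of how such a combined reward might be composed; the function names, score definitions, and weights (beat_alignment, volume_stability, tonal_consonance, transition_timing, and the 0.4/0.2/0.2/0.2 weighting) are illustrative assumptions, not code from the thesis.

    import numpy as np

    def beat_alignment(phase_error_ms: float) -> float:
        # Hypothetical score in [0, 1]: decays as the beat grids drift apart.
        return float(np.exp(-abs(phase_error_ms) / 20.0))

    def volume_stability(rms_window: np.ndarray) -> float:
        # Hypothetical score in [0, 1]: penalizes variance in output loudness.
        return float(1.0 / (1.0 + np.var(rms_window)))

    def tonal_consonance(chroma_a: np.ndarray, chroma_b: np.ndarray) -> float:
        # Hypothetical score in [0, 1]: cosine similarity of the songs' chroma vectors.
        denom = np.linalg.norm(chroma_a) * np.linalg.norm(chroma_b) + 1e-9
        return float(np.dot(chroma_a, chroma_b) / denom)

    def transition_timing(seconds_since_transition: float, target: float = 60.0) -> float:
        # Hypothetical score in [0, 1]: discourages transitions that come too soon.
        return float(min(seconds_since_transition / target, 1.0))

    def combined_reward(phase_error_ms, rms_window, chroma_a, chroma_b,
                        seconds_since_transition,
                        weights=(0.4, 0.2, 0.2, 0.2)) -> float:
        # Weighted sum of the heuristic scores, in the spirit of the abstract.
        w_beat, w_vol, w_tone, w_time = weights
        return (w_beat * beat_alignment(phase_error_ms)
                + w_vol * volume_stability(rms_window)
                + w_tone * tonal_consonance(chroma_a, chroma_b)
                + w_time * transition_timing(seconds_since_transition))

A weighted sum like this makes the trade-off between heuristics explicit, which also illustrates the reward-hacking risk the abstract mentions: an agent can raise the total by optimizing only the most easily exploited term.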
