Evaluation of multi-view input using dynamic images for action recognition at a vending fridge

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Niklas Lindqvist; [2020]


Abstract: Human Action Recognition (HAR) is increasingly common in society today. It can be found in self-driving cars, surveillance systems and cashier-free stores such as Amazon Go. Classifying and predicting human actions is difficult, mainly because it relies heavily on video data, which has a temporal aspect and contains noise in the form of unrelated information about the surroundings. One method for handling these issues is a two-stream Convolutional Neural Network architecture that determines the spatial aspect of the action from a single image and the temporal aspect from stacks of Optical Flow (OF) frames. A drawback of OF is its computation time, which limits the number of Frames Per Second (FPS) that can be supported. To combat the low FPS, a Dynamic Images (DI) network can be used, which utilizes Approximate Rank Pooling to create images representing motion from video data more quickly. The increased FPS supported by DI networks makes multi-view real-time HAR feasible. In this study, a data set is gathered at a self-service vending fridge with a multi-view camera setup. A DI network is used together with different fusion models to investigate the effect of the multi-view camera setup. It is concluded that fusing the DI networks in a multi-view setup provides statistical evidence of an increased mean accuracy over stand-alone single-view DI networks in one specific case: support vector classifier fusion.
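
As an illustration of how Approximate Rank Pooling can collapse a short clip into a single motion image, the following is a minimal sketch and not the implementation used in the essay. It assumes the commonly cited simplified weighting, in which frame t of T receives the coefficient 2t - T - 1, so that later frames are weighted positively and earlier frames negatively. The function names arp_weights and dynamic_image are hypothetical, and the essay may use a different variant of the weights.

```python
import numpy as np


def arp_weights(num_frames: int) -> np.ndarray:
    """Simplified approximate rank pooling weights (assumed form: 2t - T - 1)."""
    t = np.arange(1, num_frames + 1, dtype=np.float64)
    return 2.0 * t - num_frames - 1.0


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a clip of shape (T, H, W) or (T, H, W, C) into one image.

    The result is the weighted sum of the frames along the time axis.
    It is left unnormalized; rescaling to a displayable range is a
    separate step that is not shown here.
    """
    frames = np.asarray(frames, dtype=np.float64)
    weights = arp_weights(frames.shape[0])
    # Contract the time axis of `frames` against the weight vector.
    return np.tensordot(weights, frames, axes=(0, 0))


if __name__ == "__main__":
    # Hypothetical clip: 10 random grayscale frames of 4 x 4 pixels.
    rng = np.random.default_rng(0)
    clip = rng.random((10, 4, 4))
    print(dynamic_image(clip).shape)
```

Because the whole operation is a single weighted sum over the frames, it is cheap compared with computing optical flow between every pair of consecutive frames, which is consistent with the higher FPS the abstract attributes to DI networks.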
