Deep Neural Networks for Dynamic Visual Data

University essay from Lunds universitet/Matematik LTH

Abstract: Given monocular video of people performing daily tasks, our objective is to estimate the 3D positions of 32 given joints of the human skeleton. Motivated by the success of deep convolutional networks in image classification, image segmentation, and activity recognition, we propose to estimate 3D joint positions from video using deep convolutional networks. The models are implemented in the Caffe deep learning framework. We use the architecture and the pre-trained weights of the convolutional layers of VGG-16, a network developed by the Oxford Visual Geometry Group. We studied the effect of different feature extraction architectures on the model's accuracy by varying the number of pooling layers; decreasing the number of pooling layers did not improve accuracy. We also studied the effect of the output dimension by varying the number of joints estimated simultaneously; our findings indicate that increasing the number of estimated joint positions does not change model accuracy. Finally, we studied the effect of incorporating temporal dependencies by means of Long Short-Term Memory (LSTM) units.
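
To give a feel for the two quantities varied in the abstract, the following is a minimal sketch of the arithmetic involved, not the essay's actual configuration. It assumes a 224x224 input (the standard VGG-16 input size, not stated in the abstract), that every 2x2 stride-2 pooling layer halves the spatial side, that all convolutional layers are kept so the final feature map has 512 channels, and that each joint is regressed as three coordinates. The function names are hypothetical.

```python
def feature_map_side(input_side: int, num_pools: int) -> int:
    """Spatial side of the final feature map after `num_pools` 2x2, stride-2 poolings.

    VGG-16 convolutions are 3x3 with padding 1, so only pooling changes the size.
    """
    side = input_side
    for _ in range(num_pools):
        side //= 2
    return side


def output_dim(num_joints: int, coords_per_joint: int = 3) -> int:
    """Length of the regression target when joints are estimated simultaneously."""
    return num_joints * coords_per_joint


if __name__ == "__main__":
    # Assumption: 224x224 input and 512 channels in the last conv block.
    for pools in (5, 4, 3):
        side = feature_map_side(224, pools)
        print(f"{pools} pooling layers: {side}x{side}x512 = {side * side * 512} features")

    # Assumption: (x, y, z) per joint; 32 joints as stated in the abstract.
    print("output dimension for 32 joints:", output_dim(32))
```

Under these assumptions, removing a pooling layer quadruples the number of features passed to the regression layers, while the output grows linearly with the number of joints estimated at once.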
