Towards Speaker Detection using FaceAPI Facial Movements in Human-Machine Multiparty Dialogue

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Author: Fasih Haider; [2013]


Abstract: In a multiparty multimodal dialogue setting, where a robot interacts with several people at once, a core requirement is that the robot recognize which user is speaking to it. This allows the robot to attend visually to the person it is listening to (for example, by directing its gaze and head pose towards the speaker) and to organize the dialogue structure with multiple people. Identifying the speaker among the persons in the robot's field of view is a research problem usually addressed by analyzing facial dynamics: the person who is moving their lips and looking towards the robot is probably the one speaking to it. This thesis investigates the use of lip and head movements for speaker and speech/silence detection in the context of human-machine multiparty dialogue. Speaker and voice activity detection systems help the machine determine who is speaking, and when, among the persons in the camera's field of view. First, a video of four speakers (S1, S2, S3 and S4) in a task-free dialogue with a fifth speaker (S5) over video conferencing is audio-visually recorded. Each speaker in the video is then annotated with segments of speech, silence, smile and laughter. Next, the real-time FaceAPI commercial face-tracking software is applied to each of the four speakers to track facial markers such as head and lip movements. Finally, three classification techniques, namely the Mahalanobis distance, the naïve Bayes classifier and a neural network classifier, are applied to the facial data (lip and head movements) to detect speech/silence and the speaker. Three training methods are used to estimate the speech/silence models for every speaker. The first is the speaker-dependent method, in which the training model contains the facial data of the test person. The second is the speaker-independent method, in which the training model does not contain the facial data of the test person; for example, if the test person is S1, the training model may contain the facial data of S2, S3 or S4. The third is the hybrid method, in which the training model is estimated from the facial data of all speakers and testing is performed on one of them. The results of the speaker-dependent and hybrid methods show that the neural network classifier performs best: in the speaker-dependent method, its accuracies for speaker and speech/silence detection are 97.43% and 98.73% respectively, and in the hybrid method its accuracy for speech/silence detection is 96.22%. In the speaker-independent method, the naïve Bayes classifier performs best, with an optimal accuracy of 67.57% for speech/silence detection.
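
Below is a minimal sketch of the Mahalanobis-distance classification step described in the abstract, written in Python with NumPy. It is not the author's implementation: the feature layout (a few lip/head movement measurements per video frame) and the synthetic training data are assumptions for illustration only. A speech model and a silence model are each estimated from annotated segments, and a test frame is assigned to the class with the smaller distance.

    import numpy as np

    def fit_class_model(features):
        # features: (n_frames, n_features) array of lip/head movement values
        # for one class (speech or silence); returns its mean and inverse covariance.
        mean = features.mean(axis=0)
        cov = np.cov(features, rowvar=False)
        inv_cov = np.linalg.pinv(cov)  # pseudo-inverse guards against singular covariance
        return mean, inv_cov

    def mahalanobis(x, mean, inv_cov):
        d = x - mean
        return float(np.sqrt(d @ inv_cov @ d))

    def classify_frame(x, speech_model, silence_model):
        # Label one frame as speech or silence by the smaller Mahalanobis distance.
        d_speech = mahalanobis(x, *speech_model)
        d_silence = mahalanobis(x, *silence_model)
        return "speech" if d_speech < d_silence else "silence"

    # Synthetic data standing in for annotated FaceAPI training segments (hypothetical).
    rng = np.random.default_rng(0)
    speech_train = rng.normal(1.0, 0.5, size=(200, 6))   # e.g. lip opening, head rotation, ...
    silence_train = rng.normal(0.0, 0.2, size=(200, 6))
    speech_model = fit_class_model(speech_train)
    silence_model = fit_class_model(silence_train)
    print(classify_frame(rng.normal(1.0, 0.5, size=6), speech_model, silence_model))

The same per-frame feature vectors could be fed to the naïve Bayes or neural network classifiers mentioned in the abstract; only the model estimation and decision rule change, and the speaker-dependent, speaker-independent and hybrid regimes differ solely in which speakers' frames enter the training set.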
