The Effect of Data Quantity on Dialog System Input Classification Models

University essay from KTH/Hälsoinformatik och logistik

Abstract: This paper investigates how the amount of training data affects different word vector models used to classify dialog system user input. The first hypothesis is that dense vector models need a threshold amount of data to reach the state-of-the-art performance shown in recent research, and that character-level n-gram word-vector classifiers are especially well suited to Swedish, because of compounding and the ability of character-level n-gram models to vectorize out-of-vocabulary words. The second hypothesis is that models trained on single statements are more suitable for chat user input classification than models trained on full conversations. The results support neither hypothesis, but show that sparse vector models perform very well on the binary classification tasks used. Further, the results show that 799,544 words of data is insufficient for training dense vector models, but that training on full conversations is sufficient for single-statement classification, as the single-statement-trained models show no improvement in classifying single statements.
