The impact of corpus choice in domain specific knowledge representation

University essay from KTH/Skolan för industriell teknik och management (ITM)

Author: Adam Lewenhaupt; Emil Brismar; [2017]

Keywords: word2vec;

Abstract: Recent advents in the machine learning community, driven by larger datasets and novel algorithmic approaches to deep reinforcement learning, reward the use of large datasets. In this thesis, we examine whether dataset size has a signicant impact on the recall quality in a very specic knowledge domain. We compare a large corpus extracted from Wikipedia to smaller ones from Stackoverow and evaluate their representational quality of niche computer science knowledge. We show that a smaller dataset with high-quality data points greatly outperform a larger one, even though the smaller is a subset of the latter. This implicates that corpus choice is highly relevant for NLP-applications aimed toward complex and specic knowledge representations.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)