Descriptive Labeling of Document Clusters

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Labeling is the process of giving a set of data a descriptive name. This thesis dealt with documents that carry no additional information and aimed to cluster them using topic modeling and to label the clusters using Wikipedia as a secondary source. Labeling documents is a new field with many potential solutions, and this thesis examined one method in a practical setting. Unstructured data was preprocessed and clustered using a topic model. Frequent words from each cluster were used to generate a search query sent to Wikipedia, where titles and categories from the most relevant pages were stored as candidate labels. Each candidate label was scored based on the frequency of common cluster words among the candidate labels, weighted in proportion to the relevance of the Wikipedia article it came from, where relevance was determined by the article's position in the search results. The five labels with the highest scores were chosen to describe the cluster. The clustered documents consisted of exam questions that students use to practice before a course exam. Each question in a cluster was assessed by someone experienced in the relevant topic, who judged whether one of the five labels correctly described its content. The method proved unreliable, with only one course receiving labels considered descriptive for most of its questions. A significant problem was that the data was closely related, with all documents belonging to one overarching category rather than covering independent topics. However, for one dataset, 80 % of the documents received a descriptive label, indicating that labeling using secondary sources has potential but needs to be investigated further.
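The candidate scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the `1 / (rank + 1)` weighting, the simple tokenizer, and the hard-coded search results are all assumptions made for illustration. The abstract only states that the frequency was weighted by relevance and that relevance followed the order of the search results, so this is one possible reading of that description. The actual Wikipedia query is replaced here by a hypothetical ranked list of pages.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequent_words(documents, stopwords, top_n=5):
    """Return the top_n most frequent non-stopword tokens in a cluster."""
    counts = Counter(
        tok for doc in documents for tok in tokenize(doc) if tok not in stopwords
    )
    return [word for word, _ in counts.most_common(top_n)]


def score_labels(cluster_words, ranked_pages, top_k=5):
    """Score candidate labels (page titles and categories).

    Assumption: a label earns one hit per token that is a frequent
    cluster word, and hits are weighted by 1 / (rank + 1), so pages
    that appear earlier in the search results count more.
    """
    words = set(cluster_words)
    scores = Counter()
    for rank, page in enumerate(ranked_pages):
        weight = 1.0 / (rank + 1)
        for label in [page["title"], *page["categories"]]:
            hits = sum(1 for tok in tokenize(label) if tok in words)
            scores[label] += weight * hits
    # Keep only labels that matched at least one cluster word.
    return [label for label, score in scores.most_common(top_k) if score > 0]


# Hypothetical stand-in for ranked Wikipedia search results.
example_pages = [
    {"title": "Merge sort", "categories": ["Sorting algorithms", "Comparison sorts"]},
    {"title": "Quicksort", "categories": ["Sorting algorithms"]},
    {"title": "Sort (Unix)", "categories": ["Unix text processing utilities"]},
]

example_questions = [
    "Sort the array using merge sort",
    "What is the complexity of quick sort?",
    "Prove that merge sort is stable",
]
example_stopwords = {"the", "using", "what", "is", "of", "that"}

query_words = frequent_words(example_questions, example_stopwords, top_n=2)
labels = score_labels(query_words, example_pages)
print(query_words, labels)
```

Because the tokenizer does no stemming, a category such as "Sorting algorithms" does not match the cluster word "sort" in this sketch. A real pipeline would likely normalize word forms during preprocessing.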
