Unsupervised Extraction and Clustering of Key Phrases from Scientific Publications

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Author: Xiajing Li; [2020]


Abstract: Mapping a research domain can be of great significance for understanding and structuring the state of the art of a research area. Standard techniques for systematically reviewing scientific literature entail extensive selection and intensive reading of manuscripts, a laborious and time-consuming process performed by human experts. Researchers have made efforts to automate one or more sub-tasks of the reviewing process. The main challenge of this work lies in the gap in semantic understanding of text and background domain knowledge. In this thesis we investigate the possibility of extracting keywords from scientific abstracts in an automated way. We intend to use the categories of these keywords as the basis of a classification scheme in the context of systematic mapping studies. We propose a framework that joins unsupervised keyphrase extraction with semantic keyphrase clustering. Specifically, we (1) explore the effect of domain relevance and phrase quality measures in keyphrase extraction; (2) explore the effect of knowledge-graph-based word embeddings in representing phrase semantics; (3) explore the effect of clustering for grouping semantically related keyphrases. Experiments are conducted on a dataset of publications pertaining to the domain of "Explainable Artificial Intelligence (XAI)". We further test the performance of clustering using terms and labels from publicly available academic taxonomies and keyword databases. Experimental results show that: (1) The extended ranking score does improve keyphrase extraction performance. Adapting the pre-processing and candidate selection method to the target document type would be more important. (2) Semantic-network-based word embeddings (ConceptNet) perform fairly well, with less computational complexity. (3) Term-level semantic keyphrase clustering does not generate ideal categories for terms; however, it is shown that clustering can group semantically similar terms together.
Finally, we conclude that it is particularly challenging to find terms that are semantically related but not morphologically similar.
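
To make the idea of semantic keyphrase clustering concrete, the following is a minimal sketch, not the thesis's actual implementation. It assumes each phrase is represented by the average of its word vectors and that phrases are grouped greedily by cosine similarity. The vectors in TOY_VECTORS are invented for illustration (they are not ConceptNet embeddings), and the names phrase_vector, cosine, cluster_phrases and the 0.9 threshold are all hypothetical.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# A real system would load pretrained embeddings instead.
TOY_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "deep": [0.85, 0.15, 0.05],
    "learning": [0.7, 0.3, 0.1],
    "decision": [0.1, 0.9, 0.1],
    "tree": [0.0, 0.95, 0.2],
    "random": [0.15, 0.8, 0.15],
    "forest": [0.05, 0.9, 0.25],
}


def phrase_vector(phrase):
    """Represent a phrase as the mean of the vectors of its known words."""
    words = [w for w in phrase.lower().split() if w in TOY_VECTORS]
    if not words:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    dim = len(TOY_VECTORS[words[0]])
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_phrases(phrases, threshold=0.9):
    """Greedy single-pass clustering.

    Each phrase joins the first cluster whose seed phrase (the first
    phrase placed in it) has cosine similarity >= threshold; otherwise
    it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_phrases)
    for phrase in phrases:
        vec = phrase_vector(phrase)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(phrase)
                break
        else:
            clusters.append((vec, [phrase]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    phrases = ["neural network", "deep learning", "decision tree", "random forest"]
    for group in cluster_phrases(phrases):
        print(group)
```

In this toy setup, "neural network" and "deep learning" land in the same group even though they share no words, which is the kind of semantically related but morphologically dissimilar pairing the conclusion describes as hard to find in practice. The outcome here depends entirely on the hand-picked vectors and threshold.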
