Domain-specific knowledge graph construction from Swedish and English news articles

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract: New textual information emerges constantly, which makes processing and structuring it a challenge. Moreover, although this information is expressed in many different languages, the discourse tends to be dominated by English, which may lead to overlooking important, specific knowledge in less well-resourced languages. Knowledge graphs have been proposed as a way of structuring unstructured data, making it machine-readable and available for further processing. Researchers have also emphasized the potential mutual benefits of combining knowledge in low- and well-resourced languages. In this thesis, I combine the two goals of structuring textual data with the help of knowledge graphs and including multilingual information, in an effort to achieve a more accurate knowledge representation. The purpose of the project is to investigate whether Swedish and English data sources carry the same information about three Swedish companies known worldwide (H&M, Spotify, and Ikea) and how combining the two sources can be beneficial. Following a natural language processing (NLP) pipeline consisting of tasks such as coreference resolution, entity linking, and relation extraction, a knowledge graph is constructed from Swedish and English news articles about the companies. Refinement techniques are applied to improve the graph. The constructed knowledge graph is analyzed with respect to the overlap of extracted entities and the complementarity of information. Different variants of the graph are further evaluated by human raters. A number of queries illustrate the capabilities of the constructed knowledge graph. The evaluation of the graph shows that the topics covered in the two information sources differ substantially. Only a small number of entities occur in both languages. Combining the two sources can, therefore, contribute to a richer and more connected knowledge graph. The adopted refinement techniques increase the connectedness of the graph. Human evaluators consistently chose the Swedish side of the data as more relevant for the considered questions, which underscores the importance of not limiting research to the more easily available and more easily processed English data.
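
The kind of analysis described in the abstract (entity overlap between the two language sources, and the connectedness gained by merging them) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the triples, the entity and relation names, and the helper functions `entities`, `jaccard_overlap`, and `count_components` are hypothetical placeholders, and Jaccard overlap and connected-component counts are assumed here as simple stand-ins for whatever measures the thesis actually uses.

```python
from collections import defaultdict


def entities(triples):
    """Collect every subject and object appearing in a list of triples."""
    return {node for s, _, o in triples for node in (s, o)}


def jaccard_overlap(a, b):
    """Share of entities common to both sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_components(triples):
    """Count connected components, treating each triple as an undirected edge."""
    adjacency = defaultdict(set)
    for s, _, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)
    seen, components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return components


# Hypothetical toy triples (subject, relation, object); not data from the thesis.
sv_triples = [
    ("Company_A", "founded_by", "Person_X"),
    ("Company_B", "located_in", "City_Y"),
]
en_triples = [
    ("Company_A", "competitor_of", "Company_B"),
    ("Company_B", "listed_on", "Exchange_Z"),
]

sv_entities = entities(sv_triples)
en_entities = entities(en_triples)
shared = sv_entities & en_entities
overlap = jaccard_overlap(sv_entities, en_entities)

print("Shared entities:", sorted(shared))
print("Jaccard overlap:", overlap)
print("Components (Swedish only):", count_components(sv_triples))
print("Components (merged):", count_components(sv_triples + en_triples))
```

In this toy setting the Swedish-only graph falls into separate fragments, while the merged graph forms a single component because the English triples link entities that the Swedish triples leave unconnected. That is the sense in which combining sources can yield a more connected graph.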
