HERD - Hajen Entity Recognition and Disambiguation

University essay from Lunds universitet/Institutionen för datavetenskap

Author: Anton Södergren; [2016]

Keywords: Technology and Engineering

Abstract: This thesis describes the process of building an entity recognizer and disambiguator named HERD. The goal of the system is to find mentions of entities in text and link each mention to a unique identifier. The system is designed to be multilingual and has versions for English, French, and Swedish. I use Wikipedia as a knowledge source for both names and concepts, and Wikidata, a language-agnostic, structured knowledge source, for unique identifiers. The system collects the links in Wikipedia articles, then counts and analyzes them. Each link is treated as a mention consisting of a label and an address, which the system uses as a name and an identifier, respectively. The address is translated into a Wikidata Q-number. When the system parses a new document, each recognized name is linked to a unique identifier. I have explored logistic regression, PageRank, and feature vectors based on Wikipedia categories to improve the name recognition and to select the best candidate for each name. The system is evaluated with the same method as used in the ERD’14 competition and reached an F1-score of 0.701, which would have placed it 6th out of 17 competitors, 6 percentage points below the highest-scoring participant.
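
The link-collection step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the regular expression, the function names `collect_mentions` and `most_frequent_entity`, the sample pages, and the `title_to_qid` table are all assumptions made for illustration, and the Q-numbers in the table are placeholders rather than verified Wikidata identifiers. Candidate selection here is a plain most-frequent-target baseline, standing in for the logistic regression, PageRank, and category features the thesis explores.

```python
import re
from collections import Counter, defaultdict

# Simplified pattern for [[Target]] and [[Target|label]] wiki links.
# Assumption: no nested markup inside the link.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def collect_mentions(wikitext_pages):
    """Count how often each label (name) links to each target address."""
    counts = defaultdict(Counter)
    for text in wikitext_pages:
        for target, label in LINK_RE.findall(text):
            label = label or target  # [[Target]] uses the target as its label
            counts[label.strip()][target.strip()] += 1
    return counts


def most_frequent_entity(label, counts, title_to_qid):
    """Return the Q-number of the most frequently linked target, or None."""
    candidates = counts.get(label)
    if not candidates:
        return None
    for target, _ in candidates.most_common():
        qid = title_to_qid.get(target)
        if qid is not None:
            return qid
    return None


# Hypothetical sample data; the Q-numbers are illustrative placeholders.
pages = [
    "[[Paris]] is the capital of [[France]].",
    "She moved to [[Paris]] in 2010.",
    "[[Paris, Texas|Paris]] is a city in Texas.",
]
title_to_qid = {"Paris": "Q90", "France": "Q142", "Paris, Texas": "Q0000001"}

counts = collect_mentions(pages)
best = most_frequent_entity("Paris", counts, title_to_qid)
```

In this sketch the label "Paris" has two candidate addresses, and the baseline picks the one linked most often. A learned ranker would replace `most_frequent_entity` while reusing the same counts.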
