Implementation of an abstract module for entity resolution to combine data sources with the same domain information

University essay from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Author: Ziaul Islam Chowdhury; [2021]

Abstract: Increasing digitalization creates a large amount of data every day. Sometimes the same real-world entity is stored in multiple data sources without a common reference. This poses a significant challenge for the integration of data sources and may cause duplicates and inconsistencies if not resolved correctly. The core idea of this thesis is to implement an abstract module for entity resolution that combines multiple data sources with similar domain information. The CRISP-DM process was used as the methodology, starting with an understanding of the business and the data. Two open datasets containing product details from e-commerce sites (Abt-Buy and Amazon-Google) are used to conduct the research. The datasets have similar structures and contain the product name, description, manufacturer’s name, and price. Both datasets contain gold-standard data for evaluating the performance of the model. In the data exploration phase, various aspects of the datasets are explored, such as word clouds of important words in the product name and description, bigrams and trigrams of the product name, histograms, and the standard deviation, mean, minimum, and maximum length of the product name. The data preparation phase contains an NLP-based preprocessing pipeline consisting of case normalization, removal of special characters and stop-words, tokenization, and lemmatization. In the modeling phase of the CRISP-DM process, various similarity and distance measures are applied to the product name and/or description, and the weighted scores are summed to form the total score of the fuzzy matching. A set of threshold values is applied to the total score, and the performance of the model is evaluated against the ground truth. The implemented model achieved an F1-score of more than 60% on both datasets. Moreover, the abstract model can be applied to various datasets with similar domain information. The model has not been deployed to a production environment, which is left as future work. Blocking or indexing techniques, together with big data technologies, could also be applied in the future to reduce the quadratic nature of the entity resolution problem.
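
The pipeline described in the abstract (preprocessing, weighted similarity scores, thresholding, evaluation against gold-standard pairs) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the two measures (token Jaccard and difflib's sequence ratio), the equal weights, the stop-word list, the threshold values, and the sample records are all assumptions chosen for illustration, and lemmatization is omitted to keep the example dependency-free.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative stop-word list (an assumption; the thesis does not list its own).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "with", "in", "on"}


def preprocess(text):
    """Lower-case, strip special characters, tokenize, and drop stop-words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def jaccard(tokens_a, tokens_b):
    """Token-set overlap: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def sequence_ratio(tokens_a, tokens_b):
    """Character-level similarity of the normalized strings (difflib)."""
    return SequenceMatcher(None, " ".join(tokens_a), " ".join(tokens_b)).ratio()


def total_score(name_a, name_b, weights=(0.5, 0.5)):
    """Weighted sum of the individual similarity scores, in [0, 1]."""
    ta, tb = preprocess(name_a), preprocess(name_b)
    return weights[0] * jaccard(ta, tb) + weights[1] * sequence_ratio(ta, tb)


def match(source_a, source_b, threshold):
    """Compare every pair (quadratic) and keep those at or above the threshold."""
    return {
        (id_a, id_b)
        for id_a, name_a in source_a.items()
        for id_b, name_b in source_b.items()
        if total_score(name_a, name_b) >= threshold
    }


def f1_score(predicted, truth):
    """F1 of a predicted set of matching pairs against gold-standard pairs."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(truth)
    return 2 * precision * recall / (precision + recall)


# Made-up records for illustration only (not taken from Abt-Buy or Amazon-Google).
source_a = {"a1": "Sony Bravia 40-inch LCD TV", "a2": "Canon PowerShot Digital Camera"}
source_b = {"b1": "sony bravia 40 inch lcd tv", "b2": "Apple iPod Nano 8GB"}
gold = {("a1", "b1")}

# Sweep a set of thresholds and report the F1-score for each one.
for threshold in (0.5, 0.7, 0.9):
    print(threshold, f1_score(match(source_a, source_b, threshold), gold))
```

The nested loop in `match` compares every record of one source with every record of the other, which is the quadratic cost that blocking or indexing would reduce by restricting comparisons to candidate pairs.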
