EVALUATING RECORD LINKAGE METHODS FOR MANIFOLD IDENTITY DETECTION
Abstract: Record linkage is the process of identifying two or more records in a database that refer to the same real-world entity. Because these records do not share a common identifier, they can only be linked based on similarities in their data, which makes the task difficult. The data can also contain errors such as misspellings or missing fields, which further increases the difficulty. This thesis presents common methods for comparing records and finding duplicates, along with methods for improving performance and reducing the computing power required, to show how record linkage can be applied to large amounts of data. Building on this knowledge, several experiments comparing these methods were conducted using data from two benchmark data sets: the Freely Extensible Biomedical Record Linkage (FEBRL) data set and the North Carolina Voter Registration (NCVR) data set. The results show that different types of similarity measures can perform similarly, and that supervised methods provide better prediction rates than unsupervised methods. Finally, suggestions for future work and improvements are given.