A difference analysis method for detecting differences between similar documents

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Abstract: Similarity analysis of documents is a well-studied field. This project focuses instead on the opposite concept: how can differences between documents be defined and distinguished? It investigates whether such differences can be detected and quantified based on their semantic qualities. We propose a method for quantifying differences that combines tf-idf based models with lemmatization and synonym extraction, together with utility ranking algorithms. The method is implemented and tested. The results show that the method has potential, but that further studies are required to fully evaluate to what extent it could be of practical use. Such a method could nevertheless bring significant benefits in several fields where automatic difference detection could replace error-prone manual labor in document management, and it could serve other purposes such as providing automatically generated difference summaries.
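To make the tf-idf idea concrete, the following is a minimal illustrative sketch and not the implementation from the essay. It weights terms with a smoothed tf-idf formula (an assumption; the abstract does not state which variant is used) and ranks terms by how much their weight differs between two documents. The function names `tokenize`, `tfidf`, and `rank_differences` and the sample sentences are hypothetical. Plain lowercasing stands in for lemmatization, and synonym extraction and utility ranking are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs; a crude stand-in for lemmatization."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, assumed variant)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def rank_differences(w_a, w_b, top=5):
    """Rank terms by the absolute gap between their weights in two documents."""
    terms = set(w_a) | set(w_b)
    gaps = {t: abs(w_a.get(t, 0.0) - w_b.get(t, 0.0)) for t in terms}
    # Largest gap first; ties are broken alphabetically for stable output.
    return sorted(gaps.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical sample corpus: two near-identical clauses and one unrelated one.
docs = [
    "The supplier shall deliver the goods within thirty days.",
    "The supplier shall deliver the goods within ninety days.",
    "Payment is due on receipt of the invoice.",
]
weights = tfidf(docs)
ranked = rank_differences(weights[0], weights[1])
```

In this sample, the terms unique to one of the two similar clauses receive the largest gaps, while shared terms receive a gap of zero. A fuller treatment would normalize word forms and merge synonyms before weighting, as the abstract describes.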
