Finding delta difference in large data sets

University essay from Luleå tekniska universitet/Datavetenskap

Abstract: To find out what differs between two versions of a file can be done with several different techniques and programs. These techniques and programs are often focusd on finding differences in text files, in documents, or in class files for programming. An example of a program is the popular git tool which focuses on displaying the difference between versions of files in a project. A common way to find these differences is to utilize an algorithm called Longest common subsequence, which focuses on finding the longest common subsequence in each file to find similarity between the files. By excluding all similarities in a file, all remaining text will be the differences between the files. The Longest Common Subsequence is often used to find the differences in an acceptable time. When two lines in a file is compared to see if they differ from each other hashing is used. The hash values for each correspondent line in both files will be compared. Hashing a line will give the content on that line a unique value. If as little as one character on a line is different between the version, the hash values for those lines will be different as well. These techniques are very useful when comparing two versions of a file with text content. With data from a database some, but not all, of these techniques can be useful. A big difference between data in a database and text in a file will be that content is not just added and delete but also updated. This thesis studies the problem on how to make use of these techniques when finding differences between large datasets, and doing this in a reasonable time, instead of finding differences in documents and files.  Three different methods are going to be studied in theory. These results will be provided in both time and space complexities. Finally, a selected one of these methods is further studied with implementation and testing. The reason only one of these three is implemented is because of time constraint. The one that got chosen had easy maintainability, an easy implementation, and maintains a good execution time.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)