Automatic Identification of Duplicates in Literature in Multiple Languages

University essay from Linköpings universitet/Statistik och maskininlärning

Abstract: As the the amount of books available online the sizes of each these collections are at the same pace growing larger and more commonly in multiple languages. Many of these cor- pora contain duplicates in form of various editions or translations of books. The task of finding these duplicates is usually done manually but with the growing sizes making it time consuming and demanding. The thesis set out to find a method in the field of Text Mining and Natural Language Processing that can automatize the process of manually identifying these duplicates in a corpora mainly consisting of fiction in multiple languages provided by Storytel. The problem was approached using three different methods to compute distance measures between books. The first approach was comparing titles of the books using the Levenstein- distance. The second approach used extracting entities from each book using Named En- tity Recognition and represented them using tf-idf and cosine dissimilarity to compute distances. The third approach was using a Polylingual Topic Model to estimate the books distribution of topics and compare them using Jensen Shannon Distance. In order to es- timate the parameters of the Polylingual Topic Model 8000 books were translated from Swedish to English using Apache Joshua a statistical machine translation system. For each method every book written by an author was pairwise tested using a hypothesis test where the null hypothesis was that the two books compared is not an edition or translation of the others. Since there is no known distribution to assume as the null distribution for each book a null distribution was estimated using distance measures of books not written by the author. The methods were evaluated on two different sets of manually labeled data made by the author of the thesis. One randomly sampled using one-stage cluster sampling and one consisting of books from authors that the corpus provider prior to the thesis be considered more difficult to label using automated techniques. Of the three methods the Title Matching was the method that performed best in terms of accuracy and precision based of the sampled data. The entity matching approach was the method with the lowest accuracy and precision but with a almost constant recall at around 50 %. It was concluded that there seems to be a set of duplicates that are clearly distin- guished from the estimated null-distributions, with a higher significance level a better pre- cision and accuracy could have been made with a similar recall for the specific method. For topic matching the result was worse than the title matching and when studied the es- timated model was not able to create quality topics the cause of multiple factors. It was concluded that further research is needed for the topic matching approach. None of the three methods were deemed be complete solutions to automatize detection of book duplicates.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)