An Object-Oriented Data Analysis approach for text population
Abstract: With more and more digital text-valued data available, the need to be able to cluster, classify and study them arises. We develop in this thesis statistical tools to perform null hypothesis testing and clustering or classification on text-valued data in the framework of Object-Oriented Data Analysis. The project includes research on semantic methods to represent texts, comparisons between representations, distances for such representations and performance of permutation tests. Main methods compared are Vector Space Model and topic model. More precisely, this thesis will provide an algorithm to compute permutation tests at document or sentence level to study the equality in terms of distribution of two texts for different representations and distances. Lastly, we describe the study of texts regarding a syntactic point of view and its structure with a tree representation.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)