Information Extraction and Document Similarity: Bag-of-Concepts based approach

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Shubhomoy Biswas; [2022]

Abstract: People in many organizations create rich-text files, such as Microsoft Word and Microsoft PowerPoint documents, whose textual content spans a variety of domains, from product presentations to confidential paperwork. This thesis examines information extraction methods, proposes a concept-based strategy for representing documents computationally, and determines the degree of similarity between documents based on the information they contain. Finally, it examines the future scope of the proposed document representation and how it might be applied to various text and data mining approaches. The thesis was carried out at an organization (Ericsson AB), where the proposed approach was tested on a genuine set of documents.
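
To make the idea of a concept-based representation concrete, the following is a minimal sketch, not the thesis's actual method. It assumes that each known term maps to a single concept through a hand-written dictionary (the hypothetical CONCEPT_OF table) and that similarity is measured as the cosine between concept-count vectors. The abstract does not specify either choice, and all names here are illustrative.

```python
import math
from collections import Counter

# Hypothetical term-to-concept mapping (an assumption for illustration;
# a real system would derive concepts from the documents or a knowledge base).
CONCEPT_OF = {
    "router": "networking",
    "switch": "networking",
    "antenna": "radio",
    "5g": "radio",
    "invoice": "finance",
    "budget": "finance",
}


def bag_of_concepts(text):
    """Count how often each concept occurs in the text; unknown terms are ignored."""
    tokens = text.lower().split()
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)


def cosine_similarity(a, b):
    """Cosine similarity between two concept-count vectors (0.0 if either is empty)."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_concepts("router switch antenna")
doc_b = bag_of_concepts("switch 5g")
doc_c = bag_of_concepts("invoice budget")

print(cosine_similarity(doc_a, doc_b))  # shared concepts, so the score is above zero
print(cosine_similarity(doc_a, doc_c))  # no shared concepts
```

Under these assumptions, two documents that use different words for the same concepts can still score as similar, which is the main motivation for counting concepts rather than raw terms.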
