Applying Natural Language Processing to document classification

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: In today's digital world, we produce and use more electronic documents than ever before, and this trend is far from slowing down. In particular, more and more companies now need to process a considerable number of documents to handle their clients' requests. Scaling this process often requires building an automatic document processing pipeline. Since the way a document is processed depends on its content, such pipelines rely heavily on an automatic document classifier to route the received documents correctly. Such a classifier should be able to receive a document of any type and output its class based on the document's text content. In this thesis, we designed and implemented a machine learning pipeline for the automated classification of insurance claim documents. To find the best pipeline, we built several combinations of classifiers (logistic regression and random forest) and embedding models (Fasttext and Doc2vec). We then compared the performance of all the pipelines using the precision and accuracy metrics. We found that a pipeline combining a Fasttext embedding model with a logistic regression classifier performed best, yielding a precision of 85% and an accuracy of 86% on our dataset.
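The comparison described in the abstract can be illustrated with a minimal sketch in plain Python. It enumerates the four embedding/classifier combinations and computes accuracy and precision for one set of predictions. The component names, the document classes ("invoice", "report", "letter"), and the example labels are hypothetical and are not taken from the thesis. The abstract does not say how precision was averaged across classes, so macro averaging is an assumption here.

```python
from itertools import product

# Hypothetical component names; the thesis compares two embedding models
# and two classifiers, which gives four candidate pipelines.
EMBEDDINGS = ["fasttext", "doc2vec"]
CLASSIFIERS = ["logistic_regression", "random_forest"]
pipelines = list(product(EMBEDDINGS, CLASSIFIERS))


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision TP / (TP + FP).

    Assumption: macro averaging. A class that is never predicted
    contributes a precision of 0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        predicted_c = sum(p == c for p in y_pred)
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        per_class.append(true_pos / predicted_c if predicted_c else 0.0)
    return sum(per_class) / len(per_class)


# Illustrative labels only; these are not results from the thesis.
y_true = ["invoice", "invoice", "report", "report", "letter", "letter"]
y_pred = ["invoice", "report", "report", "report", "letter", "invoice"]

acc = accuracy(y_true, y_pred)
prec = macro_precision(y_true, y_pred)
```

In a full experiment, each entry of `pipelines` would be trained and evaluated on the same held-out documents, and the two metrics would be reported side by side for every combination.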
