Classification of invoices using a 2D NLP approach: A comparison between methods for invoice information extraction for the purpose of classification

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Many companies handle a large number of invoices every year, and categorizing them manually takes considerable time and resources. For a model to categorize invoices automatically, the documents must first be read and processed correctly. While traditional Natural Language Processing (NLP) may be suitable for processing structured documents, unstructured documents such as invoices often require the layout to be considered in order for the document to be read correctly. Techniques that take visual information into account when processing a document are referred to as 2D NLP. One such model, considered state-of-the-art today, is LayoutLMv3. This project compares invoice-information extraction using LayoutLMv3 with plain Optical Character Recognition (OCR) for the purpose of invoice classification. LayoutLMv3 was fine-tuned for key-field extraction on 180 annotated invoices. The extracted key fields were then used to form three different configurations of structured text strings for each document. The structured texts were used to train a classification model with three categories: A (physical product), B (service), and C (unknown). The results were compared with a baseline classification model trained on unstructured text obtained through OCR. The results show that all of the models achieved equal performance on the classification task. However, several inconsistencies in the annotations of the dataset were found. The project concluded that the raw OCR text was useful for classification despite being unstructured, and that similar classification results could be obtained by considering only a few key-information fields. Obtaining a structured input through LayoutLMv3 proved especially useful for controlling the input to the classification model, for example by omitting undesirable information. A drawback, however, is that some important information may be excluded in some cases.
