Computer Vision for Document Image Analysis and Text Extraction

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Automatic document processing has been a subject of interest in the industry for the past few years, especially with the recent technological advances in Machine Learning and Computer Vision. This project investigates in-depth a major component used in Document Image Processing known as Optical Character Recognition (OCR). First, an improvement upon existing shallow CNN+LSTM is proposed, using domain-specific data synthesis. We demonstrate that this model can achieve an accuracy of up to 97% on non-handwritten text, with an accuracy improvement of 24% when using synthetic data. Furthermore, we deal with handwritten text that presents more challenges including the variance of writing style, slanting, and character ambiguity. A CNN+Transformer architecture is validated to recognize handwriting extracted from real-world insurance statements data. This model achieves a maximal accuracy of 92% on real-world data. Moreover, we demonstrate how a data pipeline relying on synthetic data can be a scalable and affordable solution for modern OCR needs.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)