CArDIS: A Swedish Historical Handwritten Character and Word Dataset for OCR

University essay from Blekinge Tekniska Högskola/Institutionen för datavetenskap

Abstract: Background: Digitizing handwritten documents is crucial for preserving valuable sources and cultural heritage. Optical Character Recognition (OCR) systems were introduced for this purpose and have become the most widely used tools for recognizing digitized characters. For ancient or historical characters, automatic transcription is more challenging because of the lack of data and the high complexity and low quality of the source material. To address these problems, multiple image-based handwritten datasets have been collected from historical and modern document images, but these datasets also have limitations. To overcome them, we were inspired to create a new image-based historical handwritten character and word dataset and to evaluate its performance using machine learning algorithms. Objectives: The main objective of this thesis is to create the first Swedish historical handwritten character and word dataset, named CArDIS (Character Arkiv Digital Sweden), which will be publicly available for further research. In addition, we verify the correctness of the dataset and perform a quantitative analysis using different machine learning methods. Methods: We first searched for existing character datasets to understand how modern character datasets differ from historical handwritten ones, and performed a literature review to identify the datasets most commonly used for OCR. We also studied different machine learning algorithms and their applications. Finally, we trained six machine learning methods, namely Support Vector Machine, k-Nearest Neighbor, Convolutional Neural Network, Recurrent Neural Network, Random Forest, and SVM-HOG, on existing datasets and on the newly created dataset to evaluate their performance and efficiency in recognizing ancient handwritten characters. Results: The evaluation results show that the machine learning classifiers struggle to recognize the ancient handwritten characters and achieve low recognition accuracy. Among them, the CNN achieves the highest recognition accuracy. Conclusions: This thesis introduces CArDIS, the first historical handwritten character and word dataset in Swedish. The character dataset contains 101,500 Latin and Swedish character images belonging to 29 classes, while the word dataset contains 10,000 word images of ten popular Swedish names, belonging to 10 classes, in RGB color space. The performance of six machine learning classifiers on CArDIS and on existing datasets is also reported. The thesis concludes that classifiers trained on existing datasets and tested on the CArDIS dataset show low recognition accuracy, showing that the CArDIS dataset has unique characteristics and features compared with existing handwritten datasets. Finally, this research provides the first Swedish character and word dataset, which is robust with a proven accuracy and is publicly available for further research.
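
The cross-dataset evaluation described in the abstract (train on an existing dataset, test on CArDIS) can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses a plain k-Nearest Neighbor classifier written from scratch, and the names `modern_train`, `modern_test`, and `historical` and their 2-D feature vectors are hypothetical toy stand-ins for real image features, deliberately constructed so that the two domains disagree.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # `train` is a list of (feature_vector, label) pairs.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


# Hypothetical toy data: 2-D vectors standing in for image features.
modern_train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
                ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
modern_test = [((0.05, 0.1), "a"), ((1.0, 0.95), "b")]

# Contrived "historical" set whose classes sit in different regions of
# feature space than in the modern set (an assumption for illustration).
historical = [((0.8, 0.9), "a"), ((0.9, 0.8), "a"),
              ((0.1, 0.1), "b"), ((0.2, 0.0), "b")]

in_domain = accuracy(modern_train, modern_test)    # same-domain evaluation
cross_domain = accuracy(modern_train, historical)  # train modern, test historical
print(f"in-domain: {in_domain:.2f}, cross-domain: {cross_domain:.2f}")
```

A large gap between the two numbers is the kind of evidence the thesis uses to argue that historical handwriting has characteristics not covered by existing datasets; the toy data here only demonstrates the evaluation procedure, not any reported result.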
