Towards word alignment and dataset creation for shorthand documents and transcripts

University essay from Uppsala universitet/Institutionen för informationsteknologi

Abstract: Analysing handwritten texts and creating labelled data sets can facilitate novel research on languages and advanced computerized analysis of authors' works. However, few handwritten works have word-wise labelling or data sets associated with them. More often a transcription of the text is available, but without any exact coupling between the words in the transcript and the word representations in the document images. Can an algorithm be created that takes only an image of handwritten text and a corresponding transcript and returns a partial alignment and a data set? This thesis develops such an algorithm, which uses a convolutional neural network trained on English handwritten text to align some of the words on a page and create a data set from a handwritten page image and its transcript. The algorithm is tested on handwritten English text and on Swedish shorthand, which was the inspiration for its development. In tests on several pages of handwritten English text, the algorithm reaches an overall average of 68% of words classified on a single page, with 0% misclassification of those words. On a sequence of 10 pages averaging 70.6 words per page, the algorithm reaches 84% correctly classified words and produces a data set of 551 correctly labelled word images, again with 0% misclassification.
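To make the idea of a partial alignment concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes that the page has already been segmented into word images in reading order and that some classifier has produced a label and a confidence for each image; the function name `partial_align`, the threshold `min_conf`, and the example predictions are all hypothetical. Low-confidence predictions are discarded, and the remaining labels are matched in order against the transcript using Python's standard `difflib.SequenceMatcher`, so that only words that agree with the transcript are kept.

```python
from difflib import SequenceMatcher


def partial_align(predictions, transcript, min_conf=0.9):
    """Align confident word predictions to transcript positions.

    predictions: list of (label, confidence), one per word image, in reading order
                 (hypothetical classifier output, not the thesis's format).
    transcript:  list of transcript words in reading order.
    Returns a list of (image_index, transcript_index, word) for aligned words.
    """
    # Keep only confident predictions, remembering which image each came from.
    kept = [(i, label) for i, (label, conf) in enumerate(predictions)
            if conf >= min_conf]
    labels = [label for _, label in kept]

    # Find in-order runs where predicted labels equal transcript words.
    matcher = SequenceMatcher(a=labels, b=transcript, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            image_index = kept[block.a + k][0]
            transcript_index = block.b + k
            aligned.append((image_index, transcript_index,
                            transcript[transcript_index]))
    return aligned


# Made-up example: the third word image is recognised poorly and stays unaligned.
example_predictions = [("the", 0.99), ("quick", 0.95), ("brwn", 0.40),
                       ("fox", 0.97), ("jumps", 0.92)]
example_transcript = ["the", "quick", "brown", "fox", "jumps"]
pairs = partial_align(example_predictions, example_transcript)
coverage = len(pairs) / len(example_transcript)
```

Each aligned pair is a labelled word image that could be added to a data set, and the unaligned words are simply left out, which is what makes the alignment partial. A confident but wrong prediction that happens to equal another transcript word could still be aligned incorrectly, so a sketch like this does not by itself guarantee a 0% misclassification rate.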
