News article segmentation using multimodal input : Using Mask R-CNN and sentence transformers

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: In this century and the last, serious efforts have been made to digitize the content housed by libraries across the world. In order to open up these volumes to content-based information retrieval, independent elements such as headlines, body text, bylines, images and captions ideally need to be connected semantically as article-level units. To query on facets such as author, section, content type or other metadata, further processing of these documents is required. Even though humans have shown exceptional ability to segment different types of elements into related components, even in languages foreign to them, this task has proven difficult for computers. The challenge of semantic segmentation in newspapers lies in the diversity of the medium: Newspapers have vastly different layouts, covering diverse content, from news articles to ads to weather reports. State-of-the-art object detection and segmentation models have been trained to detect and segment real-world objects. It is not clear whether these architectures can perform equally well when applied to scanned images of printed text. In the domain of newspapers, in addition to the images themselves, we have access to textual information through Optical Character Recognition. The recent progress made in the field of instance segmentation of real-world objects using deep learning techniques begs the question: Can the same methodology be applied in the domain of newspaper articles? In this thesis we investigate one possible approach to encode the textual signal into the image in an attempt to improve performance. Based on newspapers from the National Library of Sweden, we investigate the predictive power of visual and textual features and their capacity to generalize across different typographic designs. Results show impressive mean Average Precision scores (>0:9) for test sets sampled from the same newspaper designs as the training data when using only the image modality. 

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)