Multimodal Relation Extraction of Product Categories in Market Research Data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Philip Bergström; [2019]


Abstract: Large amounts of unstructured data are constantly being generated and made available through websites and documents. Relation extraction, the task of automatically extracting semantic relationships between entities from such data, is therefore considered to have high commercial value. However, many websites and documents are richly formatted, i.e., they communicate information through non-textual expressions such as tabular or visual elements. This thesis proposes a framework for relation extraction from such data, in particular from documents in the market research area. The framework performs relation extraction by applying supervised learning to both textual and visual features from PDF documents. Moreover, it allows the user to train a model without any manually labeled data by implementing labeling functions. We evaluate our framework by extracting relations from a corpus of market research documents on consumer goods. The extracted relations associate categories with products of different brands. We find that our framework outperforms a simple baseline model, although we are unable to show the effectiveness of incorporating visual features on our test set. We conclude that our framework can serve as a prototype for relation extraction from richly formatted data, although more advanced techniques are necessary to make use of non-textual features.
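
The abstract does not say what the labeling functions look like, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a candidate is a (product, category) pair stored as a dictionary; the field names (context, product_row, category_row, page_distance), the three example functions, and the majority-vote combination are all hypothetical choices made for illustration. Weak-supervision systems often fit a model over the votes instead of taking a simple majority.

```python
from collections import Counter

# Vote values a labeling function may return (illustrative convention).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_same_table_row(candidate):
    """Tabular cue: product and category mentions sit in the same table row."""
    if candidate["product_row"] is None or candidate["category_row"] is None:
        return ABSTAIN
    same_row = candidate["product_row"] == candidate["category_row"]
    return POSITIVE if same_row else NEGATIVE


def lf_category_in_context(candidate):
    """Textual cue: the category word occurs in the text around the product."""
    words = candidate["context"].lower().split()
    return POSITIVE if candidate["category"].lower() in words else ABSTAIN


def lf_different_pages(candidate):
    """Layout cue: mentions on different pages are unlikely to be related."""
    return NEGATIVE if candidate["page_distance"] > 0 else ABSTAIN


# Hypothetical set of labeling functions; a real project would define its own.
LABELING_FUNCTIONS = [lf_same_table_row, lf_category_in_context, lf_different_pages]


def majority_vote(candidate):
    """Combine the non-abstaining votes into one noisy training label."""
    votes = [lf(candidate) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[POSITIVE] == counts[NEGATIVE]:
        return ABSTAIN  # no votes at all, or a tie
    return POSITIVE if counts[POSITIVE] > counts[NEGATIVE] else NEGATIVE


# Made-up candidates, used only to exercise the functions above.
related = {
    "product": "BrandX Cola", "category": "beverages",
    "context": "BrandX Cola leads the beverages segment",
    "product_row": 3, "category_row": 3, "page_distance": 0,
}
unrelated = {
    "product": "BrandX Cola", "category": "detergents",
    "context": "BrandX Cola leads the beverages segment",
    "product_row": 3, "category_row": 7, "page_distance": 2,
}
no_evidence = {
    "product": "BrandX Cola", "category": "snacks",
    "context": "BrandX Cola leads the beverages segment",
    "product_row": None, "category_row": None, "page_distance": 0,
}

noisy_labels = [majority_vote(c) for c in (related, unrelated, no_evidence)]
```

Candidates that receive a non-abstaining label could then serve as training data for a supervised classifier in place of manually labeled examples, which is the role the abstract assigns to labeling functions.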
