Interactionwise Semantic Awareness in Visual Relationship Detection

University essay from Göteborgs universitet/Institutionen för data- och informationsteknik

Abstract: Visual Relationship Detection (VRD) is a relatively young research area whose goal is to develop prediction models for detecting the relationships between objects depicted in an image. A relationship is modeled as a subject-predicate-object triplet, where the predicate (e.g. an action or a spatial relation, such as “eat”, “chase” or “next to”) describes how the subject and the object interact in the given image. VRD can be formulated as a classification problem, but it suffers from the effects of a combinatorial output space; among the major issues to overcome are a long-tail class distribution, class overlap and intra-class variance. Machine learning models have been found effective for the task and, more specifically, many works have shown that combining visual, spatial and semantic features from the detected objects is key to achieving good predictions. This work investigates the use of distributional embeddings, which are often used to discover and encode semantic information, to improve the results of an existing neural network-based architecture for VRD. Experiments are performed to make the model aware of the semantics of the classification output domain, namely the predicate classes. Additionally, different word embedding models are trained from scratch to better account for multi-word objects and predicates, and are then fine-tuned on VRD-related text corpora. We evaluate our methods on two datasets. Ultimately, we show that, for some sets of predicate classes, semantic knowledge of the predicates exported from trained-from-scratch distributional embeddings can be leveraged to greatly improve prediction, and that it is especially effective for zero-shot learning.
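To make the triplet formulation and the use of distributional embeddings concrete, the following is a minimal Python sketch (not code from the essay): it represents a relationship as a subject-predicate-object triplet and derives a semantic vector for a predicate from a small Word2Vec model trained from scratch with gensim. The corpus, hyperparameters, and the choice of averaging token vectors for multi-word predicates are illustrative assumptions, not the thesis's actual method.

```python
# Illustrative sketch only: VRD triplet structure plus a predicate embedding
# obtained from a toy word-embedding model trained from scratch.
from dataclasses import dataclass
import numpy as np
from gensim.models import Word2Vec

@dataclass
class Relationship:
    subject: str    # e.g. "person"
    predicate: str  # e.g. "next to" (may be multi-word)
    object: str     # e.g. "bicycle"

# Tiny toy corpus; in practice this would be a VRD-related text corpus.
corpus = [
    ["person", "next", "to", "bicycle"],
    ["dog", "chase", "cat"],
    ["person", "eat", "pizza"],
]
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=20)

def phrase_vector(phrase: str) -> np.ndarray:
    """Average token vectors so a multi-word predicate like 'next to' gets one vector
    (one simple option; the essay trains embeddings specifically for multi-word phrases)."""
    tokens = [t for t in phrase.split() if t in model.wv]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[t] for t in tokens], axis=0)

rel = Relationship("person", "next to", "bicycle")
pred_vec = phrase_vector(rel.predicate)  # semantic feature for the predicate class
print(pred_vec.shape)  # (50,)
```

Such a predicate vector is one way semantic knowledge of the output classes could be fed to a VRD classifier alongside visual and spatial features.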
