Contrastive Visual and Language Learning for Visual Relationship Detection

Author: Thanh Tran

Tran, Thanh, 2023, Contrastive Visual and Language Learning for Visual Relationship Detection, Flinders University, College of Science and Engineering

Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third party copyright material and/or you believe that any material has been made available without permission of the copyright owner please contact copyright@flinders.edu.au with the details.

Abstract

The visual world is highly structured: real-world scenes can be decomposed into multiple objects, and parts of objects, that interact with one another. While deep learning models have demonstrated the ability to detect visual objects in images, they remain unable to detect the higher-level visual relationships that exist between pairs of objects. In this work, we focus on building visual relationship detection (VRD) systems that can recognize visual relationships between objects in images. We use the scene graph representations from the Visual Genome dataset, which contains objects and relationships grounded to image regions in the form of (subject, predicate, object) triples. Unlike existing work, which builds VRD systems with supervised classification techniques, we interpret VRD as a representational learning task and apply visual-language contrastive learning, in conjunction with knowledge graph representational learning techniques, to build joint visual and language embedding spaces for VRD. The results show that contrastive visual and language learning improves the model’s performance on the Recall@n metric, at the cost of its ability to generalize to rare visual relationship classes. We also show that translational knowledge graph embedding techniques can be applied to preserve the first-order hierarchical structure of the relationships without degrading the model’s overall VRD performance. From these results, we argue that deep learning models have spare capacity to learn visual relationship concepts and structures through additional contrastive loss constraints, and that further categorization of visual relationship labels can improve the final representational spaces.
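The abstract names two loss components: a contrastive visual-language objective and a translational knowledge graph constraint. Below is a minimal sketch of both, assuming PyTorch; the function names, the in-batch negative sampling, and all hyperparameters are illustrative assumptions for exposition, not the thesis's actual implementation.

```python
# Illustrative sketch only: a CLIP-style contrastive loss plus a
# TransE-style translational constraint over (subject, predicate, object)
# triples. Dimensions, margins, and negatives are assumed, not from the thesis.
import torch
import torch.nn.functional as F

def contrastive_vl_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between a batch of visual relationship
    embeddings and the embeddings of their matching
    (subject, predicate, object) phrases."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # pairwise cosine similarities
    targets = torch.arange(len(v))           # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def translational_loss(subj_emb, pred_emb, obj_emb, margin=1.0):
    """TransE-style constraint, subject + predicate ≈ object: one way to
    impose first-order structure on the joint embedding space."""
    pos = (subj_emb + pred_emb - obj_emb).norm(dim=-1)
    # Negatives from in-batch shuffled objects (a common, assumed choice).
    neg = (subj_emb + pred_emb - obj_emb.roll(1, dims=0)).norm(dim=-1)
    return F.relu(margin + pos - neg).mean()

if __name__ == "__main__":
    B, D = 8, 256                            # batch size, embedding dim
    v, t = torch.randn(B, D), torch.randn(B, D)
    s, p, o = (torch.randn(B, D) for _ in range(3))
    loss = contrastive_vl_loss(v, t) + translational_loss(s, p, o)
    print(loss.item())
```

In a setup like this, the contrastive term pulls matched visual and language embeddings together while pushing apart mismatched pairs in the batch, and the translational term is added on top as an extra constraint, which is consistent with the abstract's claim that the model has spare capacity for such additional losses.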

Keywords: scene graph, deep learning, contextualized representational learning, contrastive learning, transfer learning

Subject: Science, Technology and Enterprise thesis

Thesis type: Masters
Completed: 2023
School: College of Science and Engineering
Supervisor: Paulo Santos