Author: Maelic Neau
Neau, Maelic, 2025, Real-Time and Efficient Scene Graph Generation for Real-World Applications: An End-to-End Investigation, Flinders University, College of Science and Engineering
Terms of Use: This electronic version is (or will be) made publicly available by Flinders University in accordance with its open access policy for student theses. Copyright in this thesis remains with the author. You may use this material for uses permitted under the Copyright Act 1968. If you are the owner of any included third-party copyright material and/or you believe that any material has been made available without permission of the copyright owner, please contact copyright@flinders.edu.au with the details.
Scene Graphs are powerful representations that abstract the content of images or videos as relation triplets grounded to visual regions. Generating Scene Graphs through the task of Scene Graph Generation (SGG) appears especially promising for robotics applications such as Human-Robot Collaboration (HRC) in a domestic context, where Scene Graphs can model the environment and the interactions between the robot and the human. However, several years after the task's inception, the use of Scene Graphs in real-world applications remains limited due to the poor performance of SGG models on out-of-distribution samples. In this thesis, we propose to bridge the gap between theoretical SGG methods and their practical implementation in real-world settings, contributing to the wider adoption of Scene Graphs.

We first describe a new method for the semi-automatic extraction of clean, high-quality annotations to create in-context Scene Graph datasets from noisy data. This results in our first contribution, the IndoorVG dataset, a high-quality Scene Graph dataset targeting scene understanding applications in a domestic context.

In complex scenes, the number of relation triplets in SGG can grow quadratically with the number of objects, degrading the performance of downstream tasks when many of the predicted relations are non-informative. To address this issue, we propose a new inference process that selects a subset of highly informative relations from the biased and noisy predictions of an SGG model. This approach can substantially increase the performance of downstream tasks by improving the quality of the generated relations. Our results on three different tasks (Visual Question Answering, Image Synthesis, and Image Captioning) demonstrate the importance of the informativeness of relations in Scene Graphs and the benefit of trading off accuracy for informativeness.

To foster the use of SGG in real-world applications and to facilitate the deployment of models on embedded devices, we propose a new method for real-time SGG based on state-of-the-art single-stage object detectors. Our method, named Real-Time SGG, generates Scene Graphs in real time on a single GPU without loss of accuracy, outperforming current state-of-the-art methods in speed and resource efficiency.

Finally, we extend the traditional static formulation of SGG to the time domain, introducing a Continuous SGG (C-SGG) architecture that aggregates relations from consecutive frames into a consistent representation. We apply C-SGG to real-time, fine-grained activity understanding in a domestic context and demonstrate the advantage of our approach for modeling long-term, complex activities in a Human-Robot Collaboration scenario.
Keywords: Scene Graph Generation, Visual Relationships Detection, Activity Understanding, Knowledge Representation, Service Robotics
Subject: Computer Science thesis
Thesis type: Doctor of Philosophy
Completed: 2025
School: College of Science and Engineering
Supervisor: Karl Sammut