Dissertation Details
Visual relationship understanding
Hung, Zih-Siou ; Lazebnik, Svetlana
Keywords: Visual Relationship Detection; Action Recognition
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/108027/HUNG-THESIS-2020.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
PDF
【 Abstract 】

This thesis addresses two visual understanding tasks: visual relationship detection (VRD) and video action recognition. The majority of the thesis, and our main contribution, focuses on VRD.

Relations among entities play a central role in image and video understanding. In the first three chapters, we discuss visual relationship detection, whose goal is to recognize all (subject, predicate, object) tuples in a given image. Because modeling (subject, predicate, object) relation triplets is complex, it is crucial to develop a method that can not only recognize seen relations but also generalize to unseen cases. Inspired by the previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. VTransE maps entities and predicates into a low-dimensional embedding vector space in which the predicate is interpreted as a translation vector between the embedded features of the subject and object bounding-box regions. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings under the constraint predicate = union(subject, object) - subject - object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets and from common to previously unseen relations. It also achieves promising results on the recently introduced task of scene graph generation.

In the final part of the thesis, we consider action understanding in videos. In many scenarios we observe moving objects rather than still images, so it is also important to capture motion information and recognize the action being performed.
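The context-augmented translation constraint, predicate = union(subject, object) - subject - object, can be illustrated with a small sketch. This is not the thesis's implementation; the embedding dimension, random features, and squared-error loss are illustrative assumptions standing in for learned projections and the actual training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (the actual learned space is an assumption here)

# Hypothetical low-dimensional embeddings for one relation instance.
subj = rng.standard_normal(d)   # embedded features of the subject box
obj = rng.standard_normal(d)    # embedded features of the object box
union = rng.standard_normal(d)  # embedded features of the union (context) box

# Context-augmented translation constraint:
#   predicate = union(subject, object) - subject - object
pred_from_context = union - subj - obj

# Training would pull a learned predicate embedding toward this vector;
# a squared-error loss is a minimal stand-in for the real objective.
pred_embedding = rng.standard_normal(d)
loss = float(np.sum((pred_embedding - pred_from_context) ** 2))
print(pred_from_context.shape)  # (8,)
```

By contrast, the original VTransE constraint would be predicate = object - subject; the union-box term injects the surrounding context into the translation.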
Recent work either applies 3D convolution operators to extract motion implicitly or adds an additional optical flow path to leverage temporal features. In our work, we propose a novel correlation operator that establishes a matching between consecutive frames; this matching encodes the movement of objects through time. Combined with the classical appearance stream, the proposed method learns appearance and motion representations in parallel. On the challenging Something-Something dataset [2], we empirically demonstrate that our network achieves performance comparable to the state-of-the-art method.
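A correlation operator of this general kind compares each spatial location in one frame's feature map against a small neighborhood in the next frame's map. The sketch below is a minimal NumPy version under assumed shapes and a toy displacement range; the thesis's operator and its integration into the network may differ.

```python
import numpy as np

def correlation(feat_a, feat_b, max_disp=1):
    """Toy correlation: for each location in feat_a, take the channel-wise
    dot product with every location of feat_b inside a
    (2*max_disp+1) x (2*max_disp+1) neighborhood.
    feat_a, feat_b: (C, H, W) feature maps from consecutive frames.
    Returns a ((2*max_disp+1)**2, H, W) matching volume."""
    c, h, w = feat_a.shape
    k = 2 * max_disp + 1
    out = np.zeros((k * k, h, w))
    # Zero-pad frame t+1 spatially so shifted windows stay in bounds.
    padded = np.pad(feat_b, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    idx = 0
    for dy in range(k):
        for dx in range(k):
            shifted = padded[:, dy:dy + h, dx:dx + w]
            out[idx] = (feat_a * shifted).sum(axis=0) / c  # normalized dot product
            idx += 1
    return out

# Feature maps of two consecutive frames (random toy data).
rng = np.random.default_rng(0)
f_t = rng.standard_normal((4, 5, 5))
f_t1 = rng.standard_normal((4, 5, 5))
vol = correlation(f_t, f_t1)
print(vol.shape)  # (9, 5, 5)
```

The resulting matching volume can be fed to further convolutional layers as a motion representation, learned in parallel with the appearance stream.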

【 Preview 】
Attachments
Files Size Format View
Visual relationship understanding 1609KB PDF download
Document metrics
Downloads: 5  Views: 3