The world around us is highly structured. A single object usually consists of multiple components organized in a particular structure (e.g., a person has different body parts), and multiple objects typically coexist in a scene and interact with each other in predictable ways (e.g., a man playing basketball). This structure manifests itself in the visual data that captures the world around us and in the text describing it, and can thus provide a strong inductive bias for various vision tasks. In this thesis, we focus on exploiting the structure in visual data to improve visual understanding, generation and reasoning. Specifically, for visual understanding, we model structure at different levels to improve image classification, scene graph generation and representation learning. For visual generation, we exploit the foreground-background structure of images to generate images in a layer-wise manner, reducing blending artifacts between foreground and background. Finally, we use structured visual representations as an intermediate interface to bridge visual perception and reasoning, addressing different vision and language tasks, including image captioning and visual question generation. Through extensive experiments, we demonstrate that leveraging structure in visual data not only improves model performance, but also makes vision and language models more grounded and interpretable.