Dissertation details
Representations from vision and language
Gupta, Tanmay
Keywords: Vision; Language; Word Embeddings; Representation Learning; Contrastive Learning; Phrase Grounding; Semantic Scene Generation; Human-Object Interaction Detection; Deep Learning; Transfer Learning; Multitask Learning
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/107978/GUPTA-DISSERTATION-2020.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

Replicating a human-level understanding of the physical world in computers is a monumental task. Achieving this requires building representations of concepts that manifest themselves visually, linguistically, or through other senses. Furthermore, concepts do not exist in isolation but are related to each other. In this work, we show how to build representations of concepts from visual and textual data, link visual manifestations of concepts to references in text descriptions (a problem known as word or phrase grounding) without strong supervision, and model the interactions between concepts. Specifically, we address the following three challenges faced by existing vision-language models.

The first challenge is building generalizable and accurate representations of images and words. For generalization across tasks, we build aligned image-word representations that can be shared across multiple tasks, such as visual recognition and visual question answering, and that enhance inductive transfer between them. We also augment text-only word embeddings with word embeddings learned from visual co-occurrences to provide more accurate representations of visual concepts.

The second challenge is linking references to visual concepts in textual descriptions to the corresponding regions in the image without requiring strong supervision in the form of word-region grounding annotations. We show that maximizing a lower bound on the mutual information between image regions and captions leads to state-of-the-art phrase grounding performance.

The third challenge is extending vision-language systems to model interactions between visual entities. We build systems that demonstrate this ability in both generation and detection settings. We show how to generate a plausible layout and appearance of entities given a text description of entity actions and interactions. We also develop a state-of-the-art factored model and training techniques for detecting human-object interactions using pretrained object and pose detectors.
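The second challenge above refers to maximizing a lower bound on the mutual information between image regions and captions. As a rough illustration only, not the dissertation's exact formulation, the sketch below shows an InfoNCE-style contrastive bound in PyTorch, assuming region and caption embeddings have already been produced by some encoders; the function name, the max-over-regions aggregation, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_region_caption_loss(region_feats, caption_feats, temperature=0.07):
    """Illustrative InfoNCE-style lower bound on mutual information
    between image regions and captions (hypothetical sketch).

    region_feats:  (B, R, D) region embeddings for B images, R regions each
    caption_feats: (B, D)    caption embeddings, paired by batch index
    Mismatched image-caption pairs within the batch act as negatives.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    caption_feats = F.normalize(caption_feats, dim=-1)

    # Similarity of every caption to every region of every image: (B_img, R, B_cap)
    sims = torch.einsum('brd,cd->brc', region_feats, caption_feats)

    # Aggregate region scores into an image-caption score by taking the max
    # over regions, which implicitly grounds each caption in its best region.
    scores = sims.max(dim=1).values / temperature  # (B_img, B_cap)

    targets = torch.arange(scores.size(0), device=scores.device)
    # Symmetric contrastive loss: matched pairs vs. in-batch negatives.
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))
```

Under this kind of objective, the per-region similarities (`sims`) could be reused at inference time to associate a phrase or caption with its highest-scoring region, which is one way weak image-caption supervision can yield phrase grounding without word-region annotations.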

【 Preview 】
Attachments
File: Representations from vision and language (PDF, 33,578 KB)
Document metrics
Downloads: 36    Views: 58