Dissertation Details
Toward grounded spatio-temporal reasoning
Author: Ma, Chih-Yao
Advisor: AlRegib, Ghassan (Electrical and Computer Engineering)
Committee: Kira, Zsolt; Vela, Patricio; Parikh, Devi; Rohrbach, Marcus
University: Georgia Institute of Technology
Department: Electrical and Computer Engineering
Keywords: Machine learning; Deep learning; Computer vision; Natural language processing; Vision and language; Human action recognition; Video understanding; Visual captioning; Relationship reasoning; Temporal reasoning; Vision-and-language navigation; Visual grounding
Others: https://smartech.gatech.edu/bitstream/1853/62776/1/MA-DISSERTATION-2020.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

To understand the world around us, an Artificial Intelligence (AI) system needs to interpret and reason jointly about what we see and the language we speak. In recent years, research at the intersection of vision, temporal reasoning, and language has attracted considerable attention. One of the major challenges is how to ensure proper grounding and perform reasoning across multiple modalities, given the heterogeneity of the data, when supervision is weak or absent. For example: (1) in Vision-and-Language Navigation, how can the navigation agent identify which parts of the instruction have been completed, which are ongoing, and which are likely needed for selecting the next action, and how can it decide which direction to go by matching the relevant part of the instruction to the observed images? (2) In visual understanding, how can object-level features be leveraged efficiently for downstream tasks such as action recognition and visual captioning, and how can interactions and relationships be detected when the only supervision available is weak (classification labels or ground-truth image/video descriptions), or when there is none at all? The goal of this thesis is to leverage spatial, temporal, and language inputs for both visual and textual understanding. I showed (1) how to equip a sequence-to-sequence model with self-monitoring in order to build a visual-textual co-grounded navigation agent that can follow human commands given in natural language, (2) how to add a rollback capability to this sequence-to-sequence navigation agent by building on the proposed self-monitoring mechanism, (3) how to achieve efficient, object-level, fine-grained video understanding for both human action recognition and video captioning, and (4) how to make visual captioning models generate grounded descriptions via a novel cyclical training regimen, without ground-truth grounding annotations and without adding extra computation during inference.
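To make contribution (1) concrete, below is a minimal PyTorch sketch of one decoding step of a self-monitoring, visual-textual co-grounded agent: the recurrent state attends over instruction-word features (textual grounding) and over navigable-direction features (visual grounding), and an auxiliary progress head estimates how much of the instruction has been completed. All module names, feature sizes, and the exact wiring here are illustrative assumptions, not the thesis's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoGroundedAgentSketch(nn.Module):
    """Sketch of one decoding step of a self-monitoring, visual-textual
    co-grounded navigation agent. Module names, dimensions, and wiring
    are illustrative assumptions, not the thesis's implementation."""

    def __init__(self, word_dim=256, img_dim=512, hidden_dim=512):
        super().__init__()
        self.text_query = nn.Linear(hidden_dim, word_dim)   # textual grounding
        self.img_query = nn.Linear(hidden_dim, img_dim)     # visual grounding
        self.rnn = nn.LSTMCell(word_dim + img_dim, hidden_dim)
        self.action_query = nn.Linear(hidden_dim, img_dim)  # scores directions
        # Progress monitor: estimates how much of the instruction is done;
        # trained with an auxiliary regression loss (self-monitoring).
        self.progress = nn.Linear(hidden_dim + word_dim, 1)

    def step(self, words, views, h, c):
        # words: (L, word_dim) instruction token features
        # views: (K, img_dim) features of K navigable directions
        # h, c:  (1, hidden_dim) LSTM state
        # Ground into the instruction: which words matter at this step?
        text_attn = F.softmax(words @ self.text_query(h).squeeze(0), dim=0)
        text_ctx = text_attn @ words
        # Ground into the observation: which view matches those words?
        img_attn = F.softmax(views @ self.img_query(h).squeeze(0), dim=0)
        img_ctx = img_attn @ views
        h, c = self.rnn(torch.cat([text_ctx, img_ctx]).unsqueeze(0), (h, c))
        # Next-action scores over the K navigable directions.
        logits = views @ self.action_query(h).squeeze(0)
        # Estimated completion in [0, 1].
        progress = torch.sigmoid(
            self.progress(torch.cat([h.squeeze(0), text_ctx])))
        return logits, progress, (h, c)


if __name__ == "__main__":
    agent = CoGroundedAgentSketch()
    words = torch.randn(12, 256)           # 12 instruction tokens
    views = torch.randn(6, 512)            # 6 navigable directions
    h = torch.zeros(1, 512); c = torch.zeros(1, 512)
    logits, progress, (h, c) = agent.step(words, views, h, c)
    print(logits.shape, progress.item())   # torch.Size([6]) and a scalar
```

The progress estimate is also what makes the rollback idea in contribution (2) plausible: a drop in monitored progress gives the agent a signal that its last move contradicted the instruction and that backing up may be warranted.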

【 Preview 】
Attachments
File | Size | Format
Toward grounded spatio-temporal reasoning | 38381 KB | PDF
Metrics
Downloads: 30 | Views: 27