Dissertation details
Visual question answering and beyond
Keywords: Visual question answering; Deep learning; Computer vision; Natural language processing; Machine learning
Agrawal, Aishwarya (author); Batra, Dhruv (advisor); Parikh, Devi; Hays, James; Zitnick, C. Lawrence; Vinyals, Oriol (committee members)
University: Georgia Institute of Technology
Department: Interactive Computing
Others: https://smartech.gatech.edu/bitstream/1853/62277/1/AGRAWAL-DISSERTATION-2019.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

In this dissertation, I propose and study a multi-modal Artificial Intelligence (AI) task called Visual Question Answering (VQA): given an image and a natural language question about the image (e.g., "What kind of store is this?", "Is it safe to cross the street?"), the machine's task is to automatically produce an accurate natural language answer ("bakery", "yes"). Applications of VQA include aiding visually impaired users in understanding their surroundings, aiding analysts in examining large quantities of surveillance data, teaching children through interactive demos, interacting with personal AI assistants, and making visual social media content more accessible. Specifically, I study 1) how to create a large-scale dataset and define evaluation metrics for free-form and open-ended VQA, 2) how to develop techniques for characterizing the behavior of VQA models, and 3) how to build VQA models that are less driven by language biases in training data and are more visually grounded, by proposing a) a new evaluation protocol, b) a new model architecture, and c) a novel objective function. Most of my past work has been towards building agents that can "see" and "talk". However, many practical applications (e.g., physical agents navigating inside our houses and executing natural language commands) need agents that can not only "see" and "talk" but also take actions. In Chapter 6, I present future directions towards generalizing vision and language agents to be able to take actions.

【 Preview 】
Attachments
Files Size Format View
Visual question answering and beyond 26821KB PDF download
Document metrics
Downloads: 16; Views: 28