Learning and adapting visual models for multiple specialized tasks

A key requirement for any agent that wishes to interact with the visual world is the ability to understand the behavior of objects in a scene, primarily through visual means. Humans, through our cognitive system, can localize other people and objects in scenes, understand their relationship to the surrounding environment, and reason not only about their actions and attributes but also about concepts that require knowledge beyond what the pixels of the visual input afford, such as possible future states, motion, and a person's motivations. This thesis outlines work that takes small steps toward the daunting task of replicating the human visual cognitive system.

This dissertation presents methods for predicting actions, interactions with objects, and increasingly structured scenarios from single images. We devise simple methods that make use of a variety of cues by taking into account the structure inherent in the tasks we aim to solve. We show that by solving these tasks as an intermediate step and using their outputs as features, we can develop methods that operate on visual and language inputs to improve performance on tasks that require high-level image information, such as answering questions about images and producing captions for images.

One issue that accompanies learning multiple tasks with separate deep networks, as in the work described above, is the need to store a separate model per task, which increases storage requirements and limits scalability. We formulate and present two novel methods, inspired by network pruning and weight quantization, that reuse parts of an existing network to learn new tasks with minimal additional overhead, without hurting performance on tasks that were learned earlier.
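As a concrete illustration of the reuse idea in the last paragraph, the sketch below shows one way a frozen, pretrained layer can be adapted to a new task by training only a per-weight binary mask while the original weights stay untouched. This is a minimal sketch under assumed design choices in a PyTorch-style setup; the class MaskedLinear, the parameter mask_scores, and the threshold tau are illustrative and are not necessarily the exact formulation used in the thesis.

    # Illustrative sketch (assumed PyTorch setup): adapt a frozen, pretrained
    # layer to a new task by training only a binary mask over its weights.
    # The names MaskedLinear, mask_scores, and tau are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class MaskedLinear(nn.Module):
        """Keeps pretrained weights frozen; learns a real-valued score per weight
        and binarizes it with a straight-through estimator at forward time."""

        def __init__(self, linear: nn.Linear, tau: float = 0.005):
            super().__init__()
            self.weight = nn.Parameter(linear.weight.data.clone(), requires_grad=False)
            self.bias = nn.Parameter(linear.bias.data.clone(), requires_grad=False)
            self.mask_scores = nn.Parameter(0.01 * torch.ones_like(self.weight))
            self.tau = tau

        def forward(self, x):
            hard = (self.mask_scores > self.tau).float()
            # Straight-through estimator: forward uses the hard 0/1 mask,
            # backward routes gradients to the real-valued scores.
            mask = hard + self.mask_scores - self.mask_scores.detach()
            return F.linear(x, self.weight * mask, self.bias)


    # Per new task, store only the binary mask (and a task-specific head);
    # the shared backbone weights are never modified.
    backbone_layer = nn.Linear(512, 512)          # stands in for a pretrained layer
    adapted = MaskedLinear(backbone_layer)
    trainable = [p for p in adapted.parameters() if p.requires_grad]  # mask_scores only
    optimizer = torch.optim.Adam(trainable, lr=1e-4)

Because the per-task mask can be stored with roughly one bit per weight, schemes of this kind add little overhead for each new task and leave performance on earlier tasks intact.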