Grounding natural language phrases in images and video

Grounding language in images has been shown to improve performance on many image-language tasks. To spur research on this topic, this dissertation introduces a new dataset providing ground-truth annotations of the locations of noun-phrase chunks from image captions. I begin with a constituent task termed phrase localization, where the goal is to localize an entity known to exist in an image given a natural language query. To address this task, I introduce a model that learns a set of sub-models, each capturing a different concept useful for localization. These concepts can be predefined, such as attributes gleaned from adjectives, or learned automatically within a single end-to-end neural network. I also address the more challenging detection-style task, where the goal is not only to localize a phrase but also to determine whether it is associated with the image at all. Multiple applications of the models presented in this work demonstrate their value beyond the phrase localization task.
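As a concrete illustration of the mixture-of-concepts idea described above, the minimal sketch below scores candidate image regions against a phrase by combining several concept-specific similarity models under phrase-conditioned weights, then picks the highest-scoring region. All names, dimensions, and the random stand-ins for learned parameters are illustrative assumptions, not the dissertation's actual architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (all hypothetical): D-dim joint embedding space, K concept
# branches, R candidate regions in the image.
D, K, R = 64, 4, 10

# Stand-ins for learned quantities: a phrase embedding, region embeddings,
# one projection per concept branch, and the gating logits that a learned
# network would normally predict from the phrase itself.
phrase = rng.standard_normal(D)
regions = rng.standard_normal((R, D))
concept_proj = rng.standard_normal((K, D, D))
gate_logits = rng.standard_normal(K)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Phrase-conditioned mixture weights over the K concept models.
weights = softmax(gate_logits)

# Each concept branch scores every region against the phrase; the final
# score is the weighted mixture of the per-concept scores.
scores = np.zeros(R)
for k in range(K):
    projected = regions @ concept_proj[k]        # (R, D)
    scores += weights[k] * (projected @ phrase)  # dot-product similarity

# Phrase localization: return the highest-scoring candidate region.
best_region = int(np.argmax(scores))
print("predicted region index:", best_region)
```

In a trained system the projections and gating would be learned end to end, so that, for example, color-attribute phrases route most of their weight to a color-sensitive branch; the detection-style variant would additionally threshold the best score to decide whether the phrase appears in the image at all.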