Dissertation Details
Learning joint latent representations for images and language
Wang, Liwei
Keywords: deep learning; computer vision
Full text: https://www.ideals.illinois.edu/bitstream/handle/2142/101544/WANG-DISSERTATION-2018.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, in particular in the form of natural language. Learning joint latent representations for images and language is vital to solving many image-text tasks, including image-sentence retrieval, visual grounding, and image captioning. In this thesis, we first propose two-branch neural networks for learning the similarity between these two data modalities. Two network structures are proposed to produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. The second structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with a regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on the Flickr30K and COCO datasets.

Then, we explore the image captioning problem using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space with K components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). The first model uses a Gaussian mixture model (GMM) prior, while the second defines a novel Additive Gaussian (AG) prior that linearly combines component means. Experiments show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a “vanilla” CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.

To further improve the caption decoder inherited from the AG-CVAE model, we train it by optimizing caption evaluation metrics (e.g., BLEU scores) using policy gradient methods from reinforcement learning. The loss function contains two terms: a maximum likelihood estimation (MLE) loss and a reinforcement term based on a sum of non-differentiable rewards. Experiments show that training the decoder with this combined loss helps generate more accurate captions. We also study the problem of ranking generated sentences conditioned on the image input and explore several variants of deep rankers built on top of the two-branch networks proposed earlier.
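
As a rough illustration of the embedding-network variant described in the abstract, the sketch below maps image and sentence features into a shared space and trains with a bi-directional max-margin ranking loss over in-batch negatives. It is a minimal sketch under assumed settings: the feature dimensions, layer sizes, and margin are illustrative, and the thesis's neighborhood constraints are omitted.

    # Minimal two-branch embedding network (PyTorch); dimensions and margin
    # are illustrative assumptions, not the dissertation's settings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmbeddingNetwork(nn.Module):
        """Maps image and sentence features into a shared latent space."""
        def __init__(self, img_dim=4096, txt_dim=6000, embed_dim=512):
            super().__init__()
            self.img_branch = nn.Sequential(
                nn.Linear(img_dim, 2048), nn.ReLU(), nn.Linear(2048, embed_dim))
            self.txt_branch = nn.Sequential(
                nn.Linear(txt_dim, 2048), nn.ReLU(), nn.Linear(2048, embed_dim))

        def forward(self, img_feat, txt_feat):
            # L2-normalize so similarity reduces to a dot product.
            x = F.normalize(self.img_branch(img_feat), dim=-1)
            y = F.normalize(self.txt_branch(txt_feat), dim=-1)
            return x, y

    def ranking_loss(x, y, margin=0.1):
        # Bi-directional hinge loss; matched pairs lie on the diagonal.
        scores = x @ y.t()                   # pairwise similarity matrix
        pos = scores.diag().unsqueeze(1)     # scores of matched pairs
        cost_s = (margin + scores - pos).clamp(min=0)      # image -> sentence
        cost_i = (margin + scores - pos.t()).clamp(min=0)  # sentence -> image
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        return cost_s.masked_fill(mask, 0).mean() + cost_i.masked_fill(mask, 0).mean()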
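
The similarity-network variant can be sketched in the same vein: the two branches are fused by element-wise product and a small regression head predicts the similarity score directly. Again, all layer sizes here are assumptions for illustration.

    # Minimal similarity network: element-wise product fusion plus a
    # regression head; dimensions are illustrative assumptions.
    import torch.nn as nn

    class SimilarityNetwork(nn.Module):
        def __init__(self, img_dim=4096, txt_dim=6000, embed_dim=512):
            super().__init__()
            self.img_branch = nn.Sequential(nn.Linear(img_dim, embed_dim), nn.ReLU())
            self.txt_branch = nn.Sequential(nn.Linear(txt_dim, embed_dim), nn.ReLU())
            self.head = nn.Sequential(
                nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, img_feat, txt_feat):
            fused = self.img_branch(img_feat) * self.txt_branch(txt_feat)
            return self.head(fused).squeeze(-1)   # scalar similarity score

    # Trained with a regression loss against match labels, e.g.:
    #   loss = nn.MSELoss()(model(img, txt), labels.float())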
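
The Additive Gaussian (AG) prior can be illustrated as follows: the prior mean for an image is a weighted linear combination of the K component means for the content types present in that image. The component means, weights, and fixed variance below are illustrative assumptions.

    # Sketch of an Additive Gaussian prior over the CVAE latent space.
    import torch

    def ag_prior(component_means, weights, sigma=0.1):
        # component_means: (K, d) means for K content components
        # weights: (K,) nonnegative weights of components present in the image
        mean = weights @ component_means       # linear combination of means
        std = torch.full_like(mean, sigma)     # fixed isotropic std (assumed)
        return mean, std

    # Example: an image containing two content types, equally weighted.
    K, d = 5, 64
    means = torch.randn(K, d)
    w = torch.tensor([0.5, 0.5, 0.0, 0.0, 0.0])
    mu, std = ag_prior(means, w)
    z = mu + std * torch.randn(d)   # latent code used to condition the decoder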
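
Finally, the combined decoder objective (an MLE term plus a reinforcement term driven by a non-differentiable caption metric such as BLEU) admits a standard REINFORCE-style surrogate. The reward, baseline, and mixing weight below are illustrative assumptions, not the thesis's exact formulation.

    # Sketch of a combined MLE + policy-gradient loss for the caption decoder.
    import torch.nn.functional as F

    def combined_loss(logits, targets, sampled_logprobs, reward, baseline, lam=0.5):
        # logits: (T, V) decoder scores for the ground-truth caption
        # targets: (T,) ground-truth token ids
        # sampled_logprobs: (T,) log-probs of a caption sampled from the decoder
        # reward: metric score (e.g., BLEU) of the sampled caption
        # baseline: reward baseline to reduce gradient variance
        mle = F.cross_entropy(logits, targets)              # MLE term
        rl = -(reward - baseline) * sampled_logprobs.sum()  # REINFORCE surrogate
        return mle + lam * rl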

【 Preview 】
Attachments
Files | Size | Format | View
Learning joint latent representations for images and language | 24521 KB | PDF | download
Document metrics
Downloads: 2  Views: 15