Journal article

【Abstract】

In image captioning, a model must learn which image regions to attend to so that it can focus adaptively and precisely on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing convolution directly on the 2D feature maps. The proposed attention mechanism has two components, convolutional spatial attention and cross-channel attention, which determine the regions to describe along the spatial and channel dimensions, respectively. Both attentions are computed at each decoding step. To preserve the spatial structure, both components are computed directly on the entire feature maps with convolution operations rather than on a vector representation of each image grid cell. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the strong performance of the proposed method.
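
The idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the per-channel scalar `gate` standing in for the decoder state, the depthwise kernels in the channel branch, and the softmax normalizations are all assumptions made for illustration. It only shows how attention weights over positions and over channels can be computed with convolutions on the C x H x W feature maps without flattening each grid cell into a vector first.

```python
import math
import random


def conv2d_same(maps, kernel):
    """Zero-padded 'same' convolution (cross-correlation, as in deep learning).

    maps: C x H x W nested lists; kernel: C x k x k with k odd.
    Returns one H x W score map, so the 2D grid layout is preserved.
    """
    C, H, W = len(maps), len(maps[0]), len(maps[0][0])
    k = len(kernel[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for c in range(C):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - r, j + dj - r
                        if 0 <= y < H and 0 <= x < W:
                            s += maps[c][y][x] * kernel[c][di][dj]
            out[i][j] = s
    return out


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(maps, gate, kernel):
    """H x W attention map; softmax is taken over all grid positions."""
    # Assumption: the decoder state enters as one scalar gate per channel.
    gated = [[[v * g for v in row] for row in ch] for ch, g in zip(maps, gate)]
    scores = conv2d_same(gated, kernel)
    width = len(scores[0])
    flat = softmax([s for row in scores for s in row])
    return [flat[i * width:(i + 1) * width] for i in range(len(scores))]


def channel_attention(maps, alpha, gate, kernel):
    """Per-channel weights from depthwise-convolved, spatially weighted maps."""
    scores = []
    for ch, g, kc in zip(maps, gate, kernel):
        resp = conv2d_same([ch], [kc])  # convolve this channel on its own
        pooled = sum(r * a for rr, ar in zip(resp, alpha) for r, a in zip(rr, ar))
        scores.append(g * pooled)
    return softmax(scores)


def attend(maps, gate, kernel):
    """Return (spatial map, channel weights, context vector of length C)."""
    alpha = spatial_attention(maps, gate, kernel)
    beta = channel_attention(maps, alpha, gate, kernel)
    context = [
        b * sum(v * a for row, ar in zip(ch, alpha) for v, a in zip(row, ar))
        for ch, b in zip(maps, beta)
    ]
    return alpha, beta, context


# Toy inputs: 3 channels on a 4 x 5 grid, 3 x 3 kernels (all values made up).
random.seed(0)
C, H, W, k = 3, 4, 5, 3
maps = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
kernel = [[[random.uniform(-1, 1) for _ in range(k)] for _ in range(k)] for _ in range(C)]
gate = [0.5, 1.0, 1.5]
alpha, beta, context = attend(maps, gate, kernel)
```

In a captioning decoder, `attend` would be called once per decoding step with a gate derived from the current hidden state, which matches the abstract's statement that both attentions are computed at each step.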

【License】

Unknown

Applied Sciences
Structure Preserving Convolutional Attention for Image Captioning

Fei Zheng¹ Shichen Lu² Ruimin Hu² Jing Liu³ Longteng Guo³
[1] China General Technology Research Institute, Beijing 100190, China
[2] National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan 430072, China
[3] National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Keywords: image captioning; attention; spatial structure; deep learning; computer vision
DOI: 10.3390/app9142888
Source: DOAJ

