| Applied Sciences | |
| Structure Preserving Convolutional Attention for Image Captioning | |
| Fei Zheng [1], Shichen Lu [2], Ruimin Hu [2], Jing Liu [3], Longteng Guo [3] | |
| [1] China General Technology Research Institute, Beijing 100190, China; [2] National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan 430072, China; [3] National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China | |
| Keywords: image captioning; attention; spatial structure; deep learning; computer vision | |
| DOI: 10.3390/app9142888 | |
| Source: DOAJ | |
【 Abstract 】
In image captioning, the model must learn which image regions to attend to so that it can adaptively and precisely focus on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the regions to describe along the spatial and channel dimensions, respectively. Both attentions are computed at each decoding step. To preserve the spatial structure, both components are computed with convolution operations directly on the entire feature maps, rather than on the vector representation of each image grid. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the strong performance of the proposed method.
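The two attention components described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' formulation, which the abstract does not specify; it only shows one plausible way to score grid cells with a convolution over the 2D feature maps and to weight channels at a single decoding step. The function names, the per-channel `gates` derived from the decoder state, the zero padding, and the use of global average pooling for channel attention are all assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(feats, gates, kernel):
    """Score every grid cell with a zero-padded k x k convolution (C -> 1 channel).

    feats:  C x H x W feature maps (nested lists)
    gates:  C per-channel gates derived from the decoder state (assumed)
    kernel: C x k x k convolution weights, k odd (hypothetical parameters)
    Returns an H x W map of weights that sums to 1.
    """
    C, H, W = len(feats), len(feats[0]), len(feats[0][0])
    k = len(kernel[0])
    r = k // 2
    scores = []
    for y in range(H):
        for x in range(W):
            s = 0.0
            for c in range(C):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - r, x + dx - r
                        if 0 <= yy < H and 0 <= xx < W:
                            s += kernel[c][dy][dx] * gates[c] * feats[c][yy][xx]
            scores.append(s)
    alpha = softmax(scores)
    return [alpha[y * W:(y + 1) * W] for y in range(H)]


def channel_attention(feats, gates):
    """Weight channels by a gated global average of each map (a simplification)."""
    H, W = len(feats[0]), len(feats[0][0])
    pooled = [sum(sum(row) for row in fmap) / (H * W) for fmap in feats]
    return softmax([g * p for g, p in zip(gates, pooled)])


def attend(feats, hidden, w_h, kernel):
    """Return (context vector, spatial map, channel weights) for one decoding step."""
    # Project the decoder hidden state to one gate per channel (assumed design).
    gates = [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in w_h]
    alpha = spatial_attention(feats, gates, kernel)
    beta = channel_attention(feats, gates)
    context = [
        beta[c] * sum(alpha[y][x] * feats[c][y][x]
                      for y in range(len(alpha)) for x in range(len(alpha[0])))
        for c in range(len(feats))
    ]
    return context, alpha, beta


# Toy inputs: random features and weights stand in for a CNN and learned parameters.
random.seed(0)
C, H, W, D, K = 4, 5, 5, 3, 3
feats = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
hidden = [random.uniform(-1, 1) for _ in range(D)]
w_h = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
kernel = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(C)]

context, alpha, beta = attend(feats, hidden, w_h, kernel)
```

Because the spatial scores come from a k x k window over neighbouring cells rather than from each grid vector in isolation, the weight at one location depends on its surroundings, which is the sense in which such a scheme keeps the 2D layout. In a real captioning model the weights would be learned and `attend` would be called once per decoded word with the current decoder state.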
【 License 】
Unknown