Journal Article Details
Journal of Big Data
Image captioning model using attention and object features to mimic human image understanding
Nada Ghneim [1], Muhammad Abdelhadie Al-Malla [2], Assef Jafar [2]
[1] Arab International University, Daraa, Syria; [2] Higher Institute for Applied Sciences and Technology, Damascus, Syria
Keywords: Image captioning; Object features; Convolutional neural network; Deep learning
DOI: 10.1186/s40537-022-00571-w
Source: Springer
【 Abstract 】

Image captioning spans the fields of computer vision and natural language processing. The image captioning task can be seen as a generalization of object detection, in which the description is reduced to a single word. Recently, most research on image captioning has focused on deep learning techniques, especially Encoder-Decoder models with Convolutional Neural Network (CNN) feature extraction. However, few works have tried using object detection features to increase the quality of the generated captions. This paper presents an attention-based Encoder-Decoder deep architecture that makes use of convolutional features extracted from a CNN model pre-trained on ImageNet (Xception), together with object features extracted from the YOLOv4 model, pre-trained on MS COCO. The paper also introduces a new positional encoding scheme for object features, the "importance factor". Our model was tested on the MS COCO and Flickr30k datasets, and its performance is compared to that of similar works. Our new feature extraction scheme raises the CIDEr score by 15.04%. The code is available at: https://github.com/abdelhadie-almalla/image_captioning
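To illustrate the idea behind the abstract, the sketch below fuses a global CNN feature vector with per-object detection features before they would be passed to an encoder-decoder captioner. This is a minimal, hypothetical sketch: the function name, the use of detection confidences as the weighting, and the weighted-pooling formula are illustrative assumptions, not the paper's actual "importance factor" definition (see the linked repository for the real implementation).

```python
import numpy as np

def fuse_features(cnn_feats, object_feats, confidences):
    """Fuse global image features with object-detection features.

    cnn_feats:    (d,)   global image vector (e.g. from Xception).
    object_feats: (n, d) one vector per detected object (e.g. from YOLOv4).
    confidences:  (n,)   detection scores, used here as an illustrative
                         importance weighting over the objects.
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                                      # normalize weights
    weighted = (object_feats * w[:, None]).sum(axis=0)   # weighted pooling
    return np.concatenate([cnn_feats, weighted])         # fused encoder input

# Toy example: a 4-dim global vector and two detected objects.
img = np.ones(4)
objs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0, 0.0]])
fused = fuse_features(img, objs, confidences=[0.9, 0.3])
print(fused.shape)  # (8,)
```

In a real pipeline the fused vector (or the per-object vectors themselves) would feed the attention mechanism of the decoder rather than being pooled away; the sketch only shows the feature-combination step.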

【 License 】

CC BY

【 Preview 】
Attachments:
RO202202188423383ZK.pdf (1905 KB, PDF)