Journal Article Details
International Journal on Informatics Visualization: JOIV
Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model
article
Rifqi Mulyawan [1], Andi Sunyoto [1], Alva Hendi Muhammad [1]
[1] Universitas Amikom Yogyakarta
Keywords: Deep Neural Network; Convolutional Neural Network; Indonesian Image Captioning; Transformer; Attention Mechanism
DOI: 10.30630/joiv.7.2.1387
Source: Politeknik Negeri Padang
【 Abstract 】

Classification and object recognition in image processing have significantly improved computer vision tasks. These methods are often applied to visual problems, especially image classification with a Convolutional Neural Network (CNN). In the popular state-of-the-art (SOTA) task of generating a caption for an image, a CNN is often used as the encoder to extract image features. Instead of performing direct classification, the extracted features are passed from the encoder to the decoder to generate the sequence, so the CNN layers related to the classification task are not required. This study aims to determine which pre-trained CNN architecture performs best at extracting image features when a state-of-the-art Transformer model is used as the decoder. Unlike the original Transformer architecture, we implemented the model in a vector-to-sequence rather than a sequence-to-sequence manner. The Indonesian Flickr8k and Flickr30k datasets were used in this research. Evaluations were carried out using several pre-trained architectures, including ResNet18, ResNet34, ResNet50, ResNet101, VGG16, Efficientnet_b0, Efficientnet_b1, and Googlenet. Both the qualitative model inference results and the quantitative evaluation scores were analyzed. The test results show that the ResNet50 architecture can produce stable sequence generation with the highest accuracy value. With some experimentation, fine-tuning the encoder can significantly increase the model evaluation score. For future work, further exploration with larger datasets such as Flickr30k, MS COCO 14, MS COCO 17, and other Indonesian image captioning datasets, together with new Transformer-based methods, could yield a better Indonesian automatic image captioning model.
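
The encoder-decoder flow described in the abstract can be illustrated with a minimal, dependency-free sketch. It is a toy under stated assumptions, not the authors' implementation: the layer names (`conv`, `pool`, `fc`), the helper functions, the vocabulary, and the scoring rule are all hypothetical stand-ins for a real pre-trained CNN and a trained Transformer decoder. It shows only the two ideas named above: dropping the classification head so the CNN emits a feature vector, and generating a caption token by token from that single vector (vector-to-sequence).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[str, Callable[[Vector], Vector]]

# Hypothetical stand-in for a pre-trained classifier: feature layers
# followed by a classification head ("fc").
TOY_CNN: List[Layer] = [
    ("conv", lambda x: [max(v, 0.0) for v in x]),    # ReLU-like feature map
    ("pool", lambda x: [sum(x) / len(x), max(x)]),   # pooled feature vector
    ("fc", lambda x: [sum(x)]),                      # classification head
]

VOCAB = ["<bos>", "<eos>", "seekor", "anjing", "kucing", "berlari"]


def strip_classifier(layers: Sequence[Layer],
                     head_names: Sequence[str] = ("fc",)) -> List[Layer]:
    """Drop the classification head so the network outputs features."""
    return [(name, fn) for name, fn in layers if name not in head_names]


def encode(layers: Sequence[Layer], pixels: Sequence[float]) -> Vector:
    """Run the input through the remaining layers to get a feature vector."""
    x = list(pixels)
    for _, fn in layers:
        x = fn(x)
    return x


def toy_score_next(features: Vector, tokens: List[str]) -> Dict[str, float]:
    """Hypothetical decoder step: score each vocabulary word given the image
    features and the tokens generated so far. A trained decoder would compute
    these scores with attention; here they follow a fixed rule."""
    subject = "anjing" if features[1] > 0.5 else "kucing"
    caption = ["seekor", subject, "berlari", "<eos>"]
    target = caption[min(len(tokens) - 1, len(caption) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in VOCAB}


def greedy_decode(features: Vector,
                  score_next: Callable[[Vector, List[str]], Dict[str, float]],
                  bos: str = "<bos>", eos: str = "<eos>",
                  max_len: int = 10) -> List[str]:
    """Vector-to-sequence generation: start from one feature vector and
    append the highest-scoring token until <eos> or max_len is reached."""
    tokens = [bos]
    for _ in range(max_len):
        scores = score_next(features, tokens)
        next_token = max(scores, key=scores.get)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


def caption_image(pixels: Sequence[float]) -> List[str]:
    """Encode with the head-stripped CNN, then decode a caption."""
    encoder = strip_classifier(TOY_CNN)
    features = encode(encoder, pixels)
    return greedy_decode(features, toy_score_next)
```

Calling `caption_image([0.2, -0.4, 0.9, 0.1])` runs the input through the two remaining feature layers and then decodes greedily. In a real system the encoder would be one of the pre-trained architectures listed above with its classification layers removed, and the scoring step would be a trained Transformer decoder.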

【 License 】

Unknown   
