Journal Article Details
PATTERN RECOGNITION, Vol. 84
Learning visual and textual representations for multimodal matching and classification
Article
Liu, Yu [1];  Liu, Li [2,3];  Guo, Yanming [2];  Lew, Michael S. [1]
[1] Leiden Univ, Dept Comp Sci, NL-2333 CA Leiden, Netherlands
[2] Natl Univ Def Technol, Coll Syst Engn, Changsha 410073, Hunan, Peoples R China
[3] Univ Oulu, Ctr Machine Vis & Signal Anal, Oulu 8000, Finland
Keywords: Vision and language;  Multimodal matching;  Multimodal classification;  Deep learning
DOI: 10.1016/j.patcog.2018.07.001
Source: Elsevier
【 Abstract 】

Multimodal learning, which aims to bridge the modality gap between heterogeneous representations such as vision and language, has been an important and challenging problem for decades. Unlike many current approaches that focus on either multimodal matching or classification alone, we propose a unified network that jointly learns multimodal matching and classification (MMC-Net) between images and texts. The proposed MMC-Net model seamlessly integrates the matching and classification components: it first learns visual and textual embedding features in the matching component, and then generates discriminative multimodal representations in the classification component. Combining the two components in a unified model helps improve the performance of both. Moreover, we present a multi-stage training algorithm that minimizes both the matching and classification loss functions. Experimental results on four well-known multimodal benchmarks demonstrate the effectiveness and efficiency of the proposed approach, which achieves competitive performance for multimodal matching and classification compared to state-of-the-art approaches. (C) 2018 Published by Elsevier Ltd.
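The abstract describes a two-component design: a matching component that embeds images and texts into a shared space, and a classification component that classifies the fused multimodal representation, with both a matching loss and a classification loss minimized during training. The PyTorch sketch below illustrates that idea under stated assumptions: the names (MMCNetSketch, joint_loss), the feature dimensions, and the choice of a bidirectional hinge ranking loss over in-batch negatives are illustrative assumptions, not the paper's exact architecture or losses.

# A minimal sketch of a joint matching + classification objective,
# assuming precomputed image features (e.g., CNN pooling output) and
# text features (e.g., averaged word embeddings). All dimensions and
# loss choices are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMCNetSketch(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512, num_classes=20):
        super().__init__()
        # Matching component: project each modality into a shared embedding space.
        self.img_embed = nn.Linear(img_dim, embed_dim)
        self.txt_embed = nn.Linear(txt_dim, embed_dim)
        # Classification component: fuse both embeddings, then classify.
        self.classifier = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_embed(img_feat), dim=-1)  # visual embedding
        t = F.normalize(self.txt_embed(txt_feat), dim=-1)  # textual embedding
        logits = self.classifier(torch.cat([v, t], dim=-1))
        return v, t, logits

def joint_loss(v, t, logits, labels, margin=0.2, alpha=1.0):
    # Matching loss: bidirectional hinge ranking over in-batch negatives.
    sim = v @ t.t()                    # cosine similarities (embeddings are normalized)
    pos = sim.diag().unsqueeze(1)      # similarity of each matched image-text pair
    cost_i2t = (margin + sim - pos).clamp(min=0)       # image-to-text ranking cost
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)   # text-to-image ranking cost
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    match = cost_i2t.masked_fill(mask, 0).mean() + cost_t2i.masked_fill(mask, 0).mean()
    # Classification loss on the fused multimodal representation.
    cls = F.cross_entropy(logits, labels)
    return match + alpha * cls

# Usage example on random inputs:
model = MMCNetSketch()
v, t, logits = model(torch.randn(8, 2048), torch.randn(8, 300))
loss = joint_loss(v, t, logits, torch.randint(0, 20, (8,)))
loss.backward()

One plausible reading of the multi-stage training mentioned in the abstract (again, an assumption, since the schedule is not detailed there) is to first optimize the matching loss alone to align the two embedding spaces, then minimize the joint objective above.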

【 License 】

Free   

【 Preview 】
Attachments
File                              Size     Format
10_1016_j_patcog_2018_07_001.pdf  4363 KB  PDF