Journal Article Details
NEUROCOMPUTING, Volume 397
Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks
Article
Athanasiadis, Christos [1]; Hortal, Enrique [1]; Asteriadis, Stylianos [1]
[1] Maastricht Univ, Dept Data Sci & Knowledge Engn, Sint Servaasklooster 39, NL-6211 TE Maastricht, Netherlands
Keywords: Domain adaptation; Conformal prediction; Generative adversarial networks
DOI  :  10.1016/j.neucom.2019.09.106
Source: Elsevier
【 Abstract 】

Accessing large, manually annotated audio databases in order to build robust models for emotion recognition is a notoriously difficult task, hampered by annotation costs and label ambiguities. In contrast, owing to the prevailing role of computer vision in deep learning research, there are plenty of publicly available emotion recognition datasets based on facial expressivity. In the current work, we therefore performed a study on cross-modal knowledge transfer between the audio and facial modalities within the emotional context. More concretely, we investigated whether facial information from videos can be used to improve the recognition and prediction of emotions in audio signals. Our approach was based on a simple hypothesis: the emotional content of a person's oral expression correlates with the corresponding facial expressions. Research in cognitive psychology supports this hypothesis, suggesting that humans fuse emotion-related visual information with the auditory signal in a cross-modal integration scheme to better understand emotions. In this regard, a method called dacssGAN (Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks) is introduced in this work to bridge these two inherently different domains. Given as input the source domain (visual data) and conditional information based on inductive conformal prediction, the proposed architecture generates data distributions that are as close as possible to the target domain (audio data). Experiments show that classification performance on a dataset of real audio augmented with dacssGAN-generated samples (50.29% and 48.65%) outperforms that obtained using real audio samples alone (49.34% and 46.90%) on two publicly available audio-visual emotion datasets. (C) 2019 The Authors. Published by Elsevier B.V.
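To make the adversarial setup described above concrete, the following is a minimal sketch of a conditional GAN that maps visual feature vectors, concatenated with conditioning information, to audio-like feature vectors. This is an illustrative reconstruction under stated assumptions, not the paper's actual dacssGAN: the feature dimensions (VIS_DIM, AUD_DIM, COND_DIM), the layer sizes, and the softmax vector standing in for the inductive-conformal-prediction conditioning are all hypothetical.

# Minimal conditional-GAN sketch for visual-to-audio feature generation.
# All dimensions and architectures are assumptions for illustration only;
# the published dacssGAN architecture and its conformal-prediction
# conditioning are more elaborate than this stand-in.
import torch
import torch.nn as nn

VIS_DIM, AUD_DIM, COND_DIM = 512, 128, 7  # hypothetical feature/label sizes

class Generator(nn.Module):
    """Maps visual features plus conditional information (here a softmax
    stand-in for conformal-prediction-derived confidences) to an
    audio-like feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(VIS_DIM + COND_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, AUD_DIM),
        )

    def forward(self, visual, cond):
        return self.net(torch.cat([visual, cond], dim=1))

class Discriminator(nn.Module):
    """Scores whether an audio feature vector is real or generated,
    conditioned on the same label information."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUD_DIM + COND_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, audio, cond):
        return self.net(torch.cat([audio, cond], dim=1))

# One adversarial training step on random stand-in data.
G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

visual = torch.randn(32, VIS_DIM)   # source-domain (visual) features
audio = torch.randn(32, AUD_DIM)    # target-domain (audio) features
cond = torch.softmax(torch.randn(32, COND_DIM), dim=1)  # stand-in conditioning

# Discriminator step: real audio -> 1, generated audio -> 0.
fake = G(visual, cond)
loss_d = bce(D(audio, cond), torch.ones(32, 1)) + \
         bce(D(fake.detach(), cond), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: push the discriminator to score fakes as real.
loss_g = bce(D(fake, cond), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()

In this reading, the generated audio-like features would then be pooled with real audio samples to expand the training set, which is the mechanism the abstract credits for the reported accuracy gains.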

【 License 】

Free   

【 Preview 】
Attachment list
File                               Size     Format
10_1016_j_neucom_2019_09_106.pdf   4244 KB  PDF