Journal Article Details
EURASIP Journal on Advances in Signal Processing
Text-independent speaker recognition based on adaptive course learning loss and deep residual network
Han Zhang [1], Ruining Dai [1], Yongsheng Zhu [1], Qinghua Zhong [2], Guofu Zhou [3]
[1] School of Physics and Telecommunication Engineering, South China Normal University, 510006, Guangzhou, China; South China Academy of Advanced Optoelectronics, South China Normal University, 510006, Guangzhou, China
Keywords: Text-independent; Speaker recognition; Adaptive curriculum learning loss; Deep residual network; Convolutional attention statistics pooling
DOI: 10.1186/s13634-021-00762-2
Source: Springer
【 Abstract 】

Text-independent speaker recognition is widely used in identity recognition and has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. To improve the recognition ability of log filter bank feature vectors, this paper proposes a text-independent speaker recognition method based on a deep residual network model. The deep residual network is composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer; the CASP layer aggregates frame-level features from the ResNet into utterance-level features. Extracting speaker features with deep residual networks is a promising direction, and a straightforward solution is to train the discriminative feature extraction network with a margin-based loss function. However, margin-based loss functions have limitations, such as the margins between different categories being identical and fixed. We therefore use an adaptive curriculum learning loss (ACLL) to address this problem and pair it with two different margin-based losses, AM-Softmax and AAM-Softmax. The proposed method was evaluated on the large-scale VoxCeleb2 dataset in extensive text-independent speaker recognition experiments, achieving an average equal error rate (EER) of 1.76% on the VoxCeleb1 test set, 1.91% on the VoxCeleb1-E test set, and 3.24% on the VoxCeleb1-H test set. Compared with related speaker recognition methods, EER was improved by 1.11% on VoxCeleb1, 1.04% on VoxCeleb1-E, and 1.69% on VoxCeleb1-H.
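To make the pooling step concrete, below is a minimal sketch of attentive statistics pooling, the general mechanism the CASP layer builds on: a learned per-frame attention weight yields a weighted mean and standard deviation over frames, which are concatenated into one utterance-level vector. This is an illustrative PyTorch-style sketch, not the paper's exact CASP architecture; the class name, bottleneck width, and tensor shapes are assumptions.

    import torch
    import torch.nn as nn

    class AttentiveStatsPooling(nn.Module):
        """Aggregate frame-level features (batch, channels, frames) into an
        utterance-level vector: attention-weighted mean and std, concatenated."""
        def __init__(self, channels: int, bottleneck: int = 128):
            super().__init__()
            # 1-D convolutions score the importance of each frame
            self.attention = nn.Sequential(
                nn.Conv1d(channels, bottleneck, kernel_size=1),
                nn.Tanh(),
                nn.Conv1d(bottleneck, channels, kernel_size=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = torch.softmax(self.attention(x), dim=2)   # attention over frames
            mean = torch.sum(w * x, dim=2)                # weighted mean
            var = torch.sum(w * x * x, dim=2) - mean ** 2
            std = torch.sqrt(var.clamp(min=1e-8))         # weighted std
            return torch.cat([mean, std], dim=1)          # (batch, 2 * channels)

    # Usage: pool ResNet frame-level output into a single utterance embedding
    pool = AttentiveStatsPooling(channels=256)
    frames = torch.randn(4, 256, 300)    # 4 utterances, 256 channels, 300 frames
    utterance = pool(frames)             # -> shape (4, 512)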
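The two margin-based losses named in the abstract share a fixed-margin scoring rule: AM-Softmax subtracts a margin m from the target-class cosine, while AAM-Softmax adds m to the target-class angle, and both scale the logits by s before standard cross-entropy. The sketch below shows only these fixed-margin baselines; the paper's ACLL adaptively modulates the margin during training, and its exact rule is not reproduced here. The function name and default s, m values are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def margin_softmax_loss(cosine, labels, s=30.0, m=0.2, additive_angular=False):
        """Fixed-margin softmax loss.
        cosine: (batch, classes) cosines between normalized embeddings
        and normalized class weights; labels: (batch,) class indices."""
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        if additive_angular:
            # AAM-Softmax: margin applied to the angle, cos(theta + m)
            theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
            target = torch.cos(theta + m)
        else:
            # AM-Softmax: margin applied to the cosine, cos(theta) - m
            target = cosine - m
        logits = s * (one_hot * target + (1 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)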

【 License 】

CC BY   
