Journal article

【Abstract】

Personalized voice triggering is a key technology in voice assistants and serves as the first step for users to activate the voice assistant. It involves keyword spotting (KWS) and speaker verification (SV). Conventional approaches develop the KWS and SV systems separately. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while exploiting information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance and generalization under challenging conditions such as noisy environments and short-duration speech. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that integrates the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the middle layer while preventing vanishing gradients. On the well-off test set, the proposed model improves by 0.2%, 0.023, and 2.28% in equal error rate (EER), minimum detection cost function (minDCF), and accuracy (Acc), respectively, over the well-known SV model emphasized channel attention, propagation, and aggregation in time delay neural network (ECAPA-TDNN) and the advanced KWS model ConvMixer.
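
As an illustration of the first metric reported in the abstract, the equal error rate is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is not the paper's evaluation code: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are assumptions made for illustration only.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for verification scores.

    labels: 1 = target-speaker trial, 0 = impostor trial.
    A trial is accepted when its score is >= threshold.
    Hypothetical helper for illustration, not the authors' code.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    impostors = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not impostors:
        raise ValueError("need at least one target and one impostor trial")

    best = None  # (|FAR - FRR|, mean of FAR and FRR, threshold)
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in impostors) / len(impostors)  # false acceptance rate
        frr = sum(s < thr for s in targets) / len(targets)       # false rejection rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores (invented): higher means "more likely the enrolled speaker".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, thr = equal_error_rate(scores, labels)
print(f"EER = {eer:.2%} at threshold {thr}")
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits typically interpolate between neighbouring thresholds; the sweep above simply takes the closest observed point.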

【License】

CC BY
© The Author(s) 2023

EURASIP Journal on Audio, Speech, and Music Processing
Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting
Empirical Research
Xingwei Liang¹, Ruifeng Xu², Zehua Zhang³
[1] Konka Group Co., Ltd, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China
Keywords: Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention
DOI: 10.1186/s13636-023-00293-8
Received 2023-05-03, accepted 2023-06-19, published 2023
Source: Springer