EURASIP Journal on Audio, Speech, and Music Processing
Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting
Empirical Research
Xingwei Liang [1], Ruifeng Xu [2], Zehua Zhang [3]
[1] Konka Group Co., Ltd, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China
Keywords: Speaker verification; Keyword spotting; Personalized voice trigger; Flow attention
DOI: 10.1186/s13636-023-00293-8
Received 2023-05-03, accepted 2023-06-19, published 2023
Source: Springer
【 Abstract 】
Personalized voice triggering is a key technology in voice assistants and is the first step a user takes to activate one. It combines keyword spotting (KWS) and speaker verification (SV). Conventional approaches develop the KWS and SV systems separately. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while exploiting information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance on noisy and short-duration speech and to improve model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that integrates the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the middle layers while preventing vanishing gradients. On the well-off test set, the proposed model improves equal error rate (EER), minimum detection cost function (minDCF), and accuracy (Acc) by 0.2%, 0.023, and 2.28%, respectively, over the well-known SV model ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time delay neural network) and the advanced KWS model Convmixer.
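For readers unfamiliar with the SV metric quoted above, the following is a minimal sketch of how an equal error rate can be estimated from trial scores. It is not taken from the paper: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all illustrative assumptions, and evaluation toolkits typically interpolate the error curves rather than picking the closest observed threshold.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER of a verification system (hypothetical helper).

    Assumes higher scores mean "same speaker". Sweeps every observed
    score as a decision threshold and returns the operating point where
    the false-acceptance and false-rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer, best_thr = None, None, None
    for thr in thresholds:
        # False rejection: a genuine (target) trial scores below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: an impostor (non-target) trial scores at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the mean of the two rates at the closest crossing point.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


# Toy scores (made up for illustration, not results from the paper).
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {thr}")
```

On this toy data one genuine trial and one impostor trial are misclassified at the chosen threshold, so both error rates are one in four. A lower EER means the two score distributions overlap less.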
【 License 】
CC BY
© The Author(s) 2023