Acoustical Science and Technology
Multi-modal modeling for device-directed speech detection using acoustic and linguistic cues
Article
Hiroshi Sato [1], Yusuke Shinohara [2], Atsunori Ogawa [1]
[1] NTT Corporation; [2] Yahoo Japan Corporation
Keywords: Device-directed speech detection; Multi-modal; Utterance classification; Attention
DOI: 10.1250/ast.44.40
Subject classification: Acoustics and Ultrasonics
Source: Acoustical Society of Japan
【 Abstract 】
Advances in speech recognition technology have enabled voice-controlled user interfaces. Smart speakers, such as Amazon Echo and Google Home, and smartphones equipped with voice agent services give users hands-free ways to communicate with their smart devices. Hereinafter, we refer to such voice-controlled devices as voice agents. Because voice agents operate in real environments, the observed signals contain noise such as background speech or speech directed at other people. It is therefore indispensable for voice agents to distinguish users' voice queries directed at the system (directed speech) from non-directed speech, and to respond only to the directed speech. Keyword spotting is a common way to deal with this problem: users 'wake up' the system by uttering a predefined keyword or key phrase (such as 'Okay, computer') before providing a query. The system accepts a query spoken directly after the keyword as a device-directed query. Although keyword spotting technology can distinguish keywords with fairly high accuracy, detecting device-directed queries based only on keywords sometimes results in incorrect responses.
【 License 】
Unknown
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| RO202302200000551ZK.pdf | 188KB | PDF | download |