Journal Article Details
Acoustical science and technology
Multi-modal modeling for device-directed speech detection using acoustic and linguistic cues
article
Hiroshi Sato1  Yusuke Shinohara2  Atsunori Ogawa1 
[1] NTT Corporation; [2] Yahoo Japan Corporation
Keywords: Device-directed speech detection; Multi-modal; Utterance classification; Attention
DOI  :  10.1250/ast.44.40
Subject classification: Acoustics and Ultrasonics
Source: Acoustical Society of Japan
【 Abstract 】

Advances in speech recognition technology have enabled voice-controlled user interfaces. Smart speakers, such as Amazon Echo and Google Home, and smartphones equipped with voice agent services give users hands-free ways to communicate with their smart devices. Hereinafter, we refer to such voice-controlled devices as voice agents. Because voice agents operate in real environments, the observed signals contain noise such as background speech or speech directed at other people. It is therefore indispensable for voice agents to distinguish users' voice queries directed at the system (directed speech) from non-directed speech, and to respond only to the directed speech. Keyword spotting is a common way to deal with this problem: users 'wake up' the system by uttering a predefined keyword or key phrase (such as 'Okay, computer') before providing a query. The system accepts a query spoken directly after the keyword as a device-directed query. Although keyword spotting technology can detect keywords with fairly high accuracy, detecting device-directed queries based only on keywords sometimes results in incorrect responses.
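The title and keywords indicate that the paper combines acoustic and linguistic cues for device-directedness classification. As a minimal sketch of that general idea (not the authors' attention-based model), the following toy example fuses a hypothetical acoustic confidence score with a crude linguistic heuristic; every feature, word list, weight, and threshold here is an illustrative assumption.

```python
# Illustrative sketch only: late fusion of an acoustic score and a linguistic
# score for device-directed speech detection. This is NOT the model proposed
# in the paper; all names, weights, and thresholds are hypothetical.

def linguistic_score(transcript, query_words={"play", "set", "what", "turn"}):
    """Toy linguistic cue: fraction of words that look like device-query words."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(w in query_words for w in words) / len(words)

def fuse(acoustic_score, ling_score, w_acoustic=0.6, w_linguistic=0.4):
    """Weighted late fusion of the two modality scores (weights are assumed)."""
    return w_acoustic * acoustic_score + w_linguistic * ling_score

def is_device_directed(acoustic_score, transcript, threshold=0.5):
    """Classify an utterance as device-directed if the fused score passes the threshold."""
    return fuse(acoustic_score, linguistic_score(transcript)) >= threshold

# A confident acoustic score plus query-like wording yields a directed decision.
print(is_device_directed(0.9, "turn on the lights"))    # → True
# Low acoustic confidence and conversational wording yields non-directed.
print(is_device_directed(0.2, "I told him yesterday"))  # → False
```

A real system would replace both heuristics with learned models (e.g. neural acoustic embeddings and ASR-based linguistic features), but the late-fusion structure above shows why combining modalities can reject utterances that a keyword detector alone would accept.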

【 License 】

Unknown   

【 Preview 】
Attachment list
Files Size Format View
RO202302200000551ZK.pdf 188KB PDF download