Thesis Details
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition
Makkook, Mustapha
University of Waterloo
Keywords: Multimodal fusion; Visual speech recognition; Electrical and Computer Engineering
Others: https://uwspace.uwaterloo.ca/bitstream/10012/3065/1/Mustapha_Makkook_Thesis.pdf
Canada | English
Source: UWSPACE Waterloo Institutional Repository
PDF
【 Abstract 】

A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips.

Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture the higher-order statistics that are needed to understand the motion of the mouth. This is due to the fact that lip movement is complex in nature, as it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth), and a great deal of non-rigidity.

Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used.

The addressed issues are challenging problems and are substantial for developing an audio-visual speech recognition framework that can maximize the information gathered about the words uttered and minimize the impact of noise.
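To make the two ideas in the abstract concrete, here is a minimal sketch (not the thesis code) of the visual front end: dense optical flow is computed between consecutive mouth-region frames and ICA is fit to the stacked flow fields, so the per-frame ICA coefficients serve as visual features. The function names and the `lip_frames` input are hypothetical; OpenCV's Farneback flow and scikit-learn's FastICA stand in for whatever implementations the thesis actually used.

```python
import numpy as np
import cv2
from sklearn.decomposition import FastICA

def flow_fields(lip_frames):
    """Stack dense optical-flow fields for consecutive grayscale mouth frames."""
    flows = []
    for prev, curr in zip(lip_frames[:-1], lip_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow.reshape(-1))        # flatten (H, W, 2) into a vector
    return np.asarray(flows)                  # shape: (n_frames - 1, H * W * 2)

def ica_features(flows, n_components=20):
    """Fit ICA to the flow fields; the per-frame coefficients are the features."""
    ica = FastICA(n_components=n_components, random_state=0)
    coeffs = ica.fit_transform(flows)         # ICA coefficients per frame pair
    basis = ica.mixing_.T                     # basis flow fields (one per component)
    return coeffs, basis
```

The fusion step can be sketched in the same spirit: reliability is taken here as the dispersion of each stream's N-best hypothesis scores, and the mapping from reliability to stream weight is a simple normalization rather than the genetic-algorithm optimization described in the thesis. All names below are illustrative.

```python
import numpy as np

def dispersion(nbest_loglikes):
    """Mean gap between the best hypothesis and the remaining N-best scores."""
    scores = np.sort(np.asarray(nbest_loglikes))[::-1]
    return float(np.mean(scores[0] - scores[1:]))

def stream_weights(audio_nbest, visual_nbest):
    """Map the two reliability measures to stream weights that sum to one."""
    ra, rv = dispersion(audio_nbest), dispersion(visual_nbest)
    return ra / (ra + rv), rv / (ra + rv)

def fused_score(audio_loglike, visual_loglike, wa, wv):
    """Weighted combination of the per-class audio and visual log-likelihoods."""
    return wa * audio_loglike + wv * visual_loglike

# Example: noisy audio gives a small N-best spread, so the visual stream dominates.
wa, wv = stream_weights([-10.0, -10.2, -10.3], [-5.0, -9.0, -12.0])
print(wa, wv)   # visual weight is much larger than the audio weight
```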

【 Preview 】
List of Files
Files Size Format View
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition 1107KB PDF download
  Document Metrics  
  Downloads: 8    Views: 15