The performance of an automatic speech recognition system degrades rapidly in the presence of a mismatch between training and test acoustic conditions. Usually, the mismatch affects different regions of the feature space differently. Significantly lower word error rates can be obtained when regions with a high signal-to-noise ratio are given high weight while other regions are deemphasized. This is the fundamental motivation for multi-stream systems. In this thesis, a framework for practical multi-stream architectures is presented. Past multi-stream systems employ a two-stage training procedure: stream-specific networks are trained in the first stage, and a large number of fusion networks are trained in the second stage. Typically, the number of fusion networks required is exponential in the number of streams; for example, a 5-stream system requires 31 ($2^5-1$) networks. In this thesis, we first present a training technique that eliminates the need for the two-stage procedure. The proposed technique also replaces the multiple fusion networks with a single network. This not only reduces the computational cost of training but also results in a more robust system. Performance comparisons on various noise-robust tasks yielded relative word-error-rate reductions of 5 -- 10 \% over baseline models. At test time, the stream combination that produces the lowest error rate must be determined, both accurately and efficiently. We propose two performance-monitoring techniques to accurately determine the streams that are least affected by the mismatch. For efficient testing, a fast algorithm to search for the best stream combination is introduced, reducing test-time latency by a factor of 20.
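The exponential cost of per-subset fusion networks, and the idea of covering all stream subsets with one network, can be illustrated with a minimal sketch. This is an illustrative assumption, not the thesis's actual architecture: names and the zero-masking scheme are hypothetical, showing only why a single fixed-input network can stand in for the $2^N-1$ subset-specific models.

```python
# Hypothetical sketch (names are illustrative, not from the thesis):
# one fusion network per non-empty stream subset needs 2^N - 1 models,
# whereas a single network can cover every subset if absent streams
# are zero-masked at its fixed-size input.
from itertools import combinations
import random

N_STREAMS = 5
STREAM_DIM = 4  # per-stream feature dimension (illustrative)

# All non-empty stream subsets: 2^N - 1 of them.
subsets = [c for r in range(1, N_STREAMS + 1)
           for c in combinations(range(N_STREAMS), r)]
assert len(subsets) == 2**N_STREAMS - 1  # 31 for a 5-stream system

def masked_input(stream_feats, active):
    """Concatenate all streams, zeroing those not in `active`,
    so one network sees the same input size for every subset."""
    out = []
    for i, feat in enumerate(stream_feats):
        out.extend(feat if i in active else [0.0] * len(feat))
    return out

feats = [[random.random() for _ in range(STREAM_DIM)]
         for _ in range(N_STREAMS)]
x = masked_input(feats, active={0, 2})  # only streams 0 and 2 present
print(len(subsets), len(x))             # 31 subsets, input length 20
```

At test time, a search over these masks (rather than over separate networks) is what the proposed fast combination-search algorithm would prune.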
A Practical and Efficient Multistream Framework for Noise Robust Speech Recognition