Enormous amounts of recorded human speech are essential for building reliable statistical models for many speech applications, such as automatic speech recognition and automatic prosody detection. However, most of these speech data go unused because they lack transcriptions. The goal of this thesis is to use untranscribed (unlabeled) data to improve the performance of models trained using only transcribed (labeled) data. We propose a unified semi-supervised learning framework for the problems of phone classification, phone recognition, and prosody detection. The proposed approach is particularly useful when recognition performance is limited by the amount of transcribed data.

In the first part of the thesis, we investigate semi-supervised training of Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), the common probabilistic models of acoustic features in a state-of-the-art continuous-density HMM-based speech recognition system. Specifically, we propose a family of semi-supervised training criteria that reflect reasonable assumptions about labeled and unlabeled data. Both generative and discriminative training criteria are explored, and one important proposal of this thesis is to preserve the power of discriminative training by using measures computed on unlabeled data as regularizers of the supervised training objective. We describe methods for optimizing these criteria, and phone classification experiments show that they reliably improve over their supervised counterparts that use only labeled data. We then extend the proposed semi-supervised training criteria to the phone recognition problem. This problem is novel in the area of semi-supervised learning because there has been little research on the use of unlabeled data in sequence labeling problems.
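The idea of regularizing a supervised discriminative objective with a measure on unlabeled data can be illustrated with a minimal sketch. The snippet below (not the thesis's actual implementation; the linear softmax model and all names are illustrative) combines the conditional log-likelihood of labeled examples with a negative-conditional-entropy term on unlabeled examples, so that maximizing the objective rewards parameters that both fit the labels and make confident predictions on unlabeled data:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(weights, x):
    """Class scores of a simple linear model: one weight vector per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def semi_supervised_objective(weights, labeled, unlabeled, alpha=0.5):
    """Supervised conditional log-likelihood plus a negative-conditional-
    entropy regularizer computed on unlabeled data.  `alpha` trades off
    fitting the labeled data against prediction confidence on the
    unlabeled data (alpha=0 recovers the purely supervised criterion)."""
    log_lik = 0.0
    for x, y in labeled:
        p = softmax(predict(weights, x))
        log_lik += math.log(p[y])
    entropy = 0.0
    for x in unlabeled:
        p = softmax(predict(weights, x))
        entropy -= sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return log_lik - alpha * entropy
```

Because the entropy term is nonnegative, the regularized objective is never larger than its supervised counterpart; an optimizer maximizing it is pushed toward decision boundaries that avoid dense regions of unlabeled data.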
We develop lattice-based approaches to model optimization that involve both transcribed and untranscribed speech utterances. Phone recognition experiments show that a maximum mutual information criterion regularized by the negative conditional entropy measured on unlabeled data reliably outperforms other semi-supervised training methods.

In the second part of the thesis, we propose to exploit unlabeled data for the task of automatic prosodic event detection. Prosody annotation is even harder to obtain than orthographic transcription; it usually requires expert knowledge of phonetics and linguistics. We therefore aim to reduce the annotation effort needed to build an automatic prosodic event detector. We show that a mixture model can discover classes when labeled data are available from only one of the two classes, and we develop a learning algorithm for unsupervised prosodic boundary detection.
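The class-discovery property of mixture models with labels from only one class can be sketched in one dimension. In the toy example below (an illustrative EM sketch, not the thesis's prosody detector; all names and data are hypothetical), labeled examples fix one Gaussian component, and EM on unlabeled data estimates the mixing weight and the parameters of the second, never-labeled component:

```python
import math

def gauss_pdf(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_one_class_mixture(labeled_pos, unlabeled, n_iter=50):
    """Two-component 1-D Gaussian mixture where only class 0 has labels.
    Component 0 is fixed from the labeled data; EM on the unlabeled data
    estimates the mixing weight and the unknown component 1."""
    mu0 = sum(labeled_pos) / len(labeled_pos)
    var0 = sum((x - mu0) ** 2 for x in labeled_pos) / len(labeled_pos) or 1.0
    # Crude initialization of the unknown component from unlabeled stats.
    mu1 = sum(unlabeled) / len(unlabeled)
    var1 = sum((x - mu1) ** 2 for x in unlabeled) / len(unlabeled) or 1.0
    w = 0.5  # prior probability of the unknown component
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 per unlabeled point.
        r = []
        for x in unlabeled:
            p0 = (1 - w) * gauss_pdf(x, mu0, var0)
            p1 = w * gauss_pdf(x, mu1, var1)
            r.append(p1 / (p0 + p1))
        # M-step: re-estimate component 1 and the mixing weight only.
        n1 = sum(r)
        w = n1 / len(unlabeled)
        mu1 = sum(ri * x for ri, x in zip(r, unlabeled)) / n1
        var1 = sum(ri * (x - mu1) ** 2 for ri, x in zip(r, unlabeled)) / n1 + 1e-6
    return (mu0, var0), (mu1, var1), w

def classify(x, comp0, comp1, w):
    """Assign a point to the component with the higher weighted likelihood."""
    p0 = (1 - w) * gauss_pdf(x, *comp0)
    p1 = w * gauss_pdf(x, *comp1)
    return 1 if p1 > p0 else 0
```

Even though no example of class 1 is ever labeled, the re-estimated component migrates toward the unlabeled mass that the labeled component explains poorly, which is the essence of class discovery with one-class supervision.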
Semi-supervised learning for acoustic and prosodic modeling in speech applications