Generative probabilistic and neural models of the speech signal are shown to be effective in speech synthesis and speech enhancement, where generating natural and clean speech is the goal. This thesis develops two probabilistic signal processing algorithms based on the source-filter model of speech production, and two based on neural generative models of the speech signal. They are a model-based speech enhancement algorithm with ad-hoc microphone array, called GRAB; a probabilistic generative model of speech called PAT; a neural generative F0 model called TEReTA; and a Bayesian enhancement network, call BaWN, that incorporates a neural generative model of speech, called WaveNet. PAT and TEReTA aim to develop better generative models for speech synthesis. BaWN and GRAB aim to improve the naturalness and noise robustness of speech enhancement algorithms.Probabilistic Acoustic Tube (PAT) is a probabilistic generative model for speech, whose basis is the source-filter model. The highlights of the model are threefold. First, it is among the very first works to build a complete probabilistic model for speech. Second, it has a well-designed model for the phase spectrum of speech, which has been hard to model and often neglected. Third, it models the AM-FM effects in speech, which are perceptually significant but often ignored in frame-based speech processing algorithms. Experiments show that the proposed model has good potential for a number of speech processing tasks.TEReTA generates pitch contours by incorporating a theoretical model of pitch planning, the piece-wise linear target approximation (TA) model, as the output layer of a deep recurrent neural network. It aims to model semantic variations in the F0 contour, which is challenging for existing network. By combining the TA model, TEReTA is able to memorize semantic context and capture the semantic variations. Experiments on contrastive focus verify TEReTA's ability in semantics modeling.BaWN is a neural network based algorithm for single-channel enhancement. The biggest challenges of the neural network based speech enhancement algorithm are the poor generalizability to unseen noises and unnaturalness of the output speech. By incorporating a neural generative model, WaveNet, in the Bayesian framework, where WaveNet predicts the prior for speech, and where a separate enhancement network incorporates the likelihood function, BaWN is able to achieve satisfactory generalizability and a good intelligibility score of its output, even when the noisy training set is small.GRAB is a beamforming algorithm for ad-hoc microphone arrays. The task of enhancing speech with ad-hoc microphone array is challenging because of the inaccuracy in position and interference calibration. Inspired by the source-filter model, GRAB does not rely on any position or interference calibration. Instead, it incorporates a source-filter speech model and minimizes the energy that cannot be accounted for by the model. Objective and subjective evaluations on both simulated and real-world data show that GRAB is able to suppress noise effectively while keeping the speech natural and dry.Final chapters discuss the implications of this work for future research in speech processing.
【 预 览 】
附件列表
Files
Size
Format
View
Application of generative models in speech processing tasks