The Journal of Engineering
Single-signal entity approach for sung word recognition with artificial neural network and time–frequency audio features
Peerapol Khunarsa
Keywords: statistical learning method; cross-language music data recognition; feature vector; time–frequency audio features; image recognition; music audio signals; singing voice recognition; background music; spectrogram feature; background instrumental accompaniments; polyphonic audio signal; sung word recognition; single-signal entity approach; singing voice region classification; feed-forward neural network classifier; artificial neural network; vocal audio signal; noise sources; music information retrieval
DOI: 10.1049/joe.2017.0210
Subject classification: Engineering and Technology (General)
Source: IET
【Abstract】
Singing voice recognition differs substantially from automatic speech recognition because speaking and singing voices have distinct characteristics. The problem is further complicated because the background instrumental accompaniment in a music audio signal acts as a noise source that degrades the performance of the recognition system. This study proposes a statistical learning method to recognise words in a vocal audio signal with background music and to classify the region of a singing voice in a polyphonic audio signal. The goal is to recognise words from sung input without using any method to separate the singing voice from the instrumental background. The study also borrows a concept from image recognition by treating a spectrogram feature as an image. An audio signal with accompanying music was analysed and transformed into a spectrogram feature. For recognition, the entire spectrogram feature was sliced to form a feature vector for a feed-forward neural network classifier. Several classification functions were compared, including K-Nearest Neighbour, Fisher Linear Classifier, Linear Bayes Normal Classifier, Naive Bayes Classifier, Parzen Classifier and Decision Tree. The results show that a feed-forward neural network can recognise sung words with an accuracy of more than 93.0%. In particular, the system can recognise cross-language music data.
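The pipeline described in the abstract (audio signal, spectrogram, sliced feature vectors, feed-forward network) can be illustrated with a minimal NumPy sketch. This is one plausible reading, not the authors' implementation: the frame length, hop size, slice width, layer sizes, the synthetic two-tone test signal, and the function names `spectrogram`, `slice_features`, and `forward` are all assumptions introduced here, and the network weights are random and untrained, so the outputs only demonstrate the data flow and array shapes.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    Returns an array of shape (n_bins, n_frames), n_bins = frame_len // 2 + 1.
    Frame length and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def slice_features(spec, slice_width=8):
    """Cut the spectrogram into non-overlapping time slices, one vector each."""
    n_bins, n_frames = spec.shape
    n_slices = n_frames // slice_width
    trimmed = spec[:, :n_slices * slice_width]  # drop the incomplete tail
    # (n_bins, n_slices, slice_width) -> (n_slices, n_bins * slice_width)
    blocks = trimmed.reshape(n_bins, n_slices, slice_width).transpose(1, 0, 2)
    return blocks.reshape(n_slices, -1)


def forward(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer feed-forward net with softmax output."""
    hidden = np.tanh(x @ w1 + b1)
    logits = hidden @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Synthetic stand-in for a recording: one second of two mixed tones at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 110 * t)

spec = spectrogram(signal)
features = slice_features(spec)

# Untrained, randomly initialised weights (hypothetical sizes: 16 hidden
# units, 3 word classes); a real system would fit these on labelled data.
rng = np.random.default_rng(0)
n_hidden, n_classes = 16, 3
w1 = rng.normal(scale=0.01, size=(features.shape[1], n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

probs = forward(features, w1, b1, w2, b2)
print(spec.shape, features.shape, probs.shape)
```

Because no source separation step appears anywhere in the sketch, the accompaniment and the voice would both contribute to every feature vector, which mirrors the single-signal idea stated in the abstract.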
【License】
CC BY