Journal article details
The Journal of Engineering
Single-signal entity approach for sung word recognition with artificial neural network and time–frequency audio features
Peerapol Khunarsa
Keywords: statistical learning method; cross-language music data recognition; feature vector; time–frequency audio features; image recognition; music audio signals; singing voice recognition; background music; spectrogram feature; background instrumental accompaniments; polyphonic audio signal; sung word recognition; single-signal entity approach; singing voice region classification; feed-forward neural network classifier; artificial neural network; vocal audio signal; noise sources; music information retrieval
DOI: 10.1049/joe.2017.0210
Subject classification: Engineering and Technology (General)
Source: IET
【 Abstract 】

Singing voice recognition differs markedly from automatic speech recognition because speaking and singing voices have distinct characteristics. The problem is further complicated because the background instrumental accompaniment in a music audio signal acts as a noise source that degrades the performance of the recognition system. This study proposes a statistical learning method to recognise words in a vocal audio signal with background music and to classify the region of a singing voice in a polyphonic audio signal. The goal is to recognise words from sung input without using any method to separate the singing voice from the instrumental background. The study also applies a concept from image recognition by treating the spectrogram feature as an image. An audio signal with accompanying music was analysed and transformed into a spectrogram feature. For recognition, the entire spectrogram feature was sliced to form a feature vector for a feed-forward neural network classifier. Several classification functions were compared, including K-Nearest Neighbour, Fisher Linear Classifier, Linear Bayes Normal Classifier, Naive Bayes Classifier, Parzen Classifier and Decision Tree. The results show that a feed-forward neural network can recognise sung words with an accuracy of more than 93.0%. Notably, the system can also recognise cross-language music data.
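
The pipeline described in the abstract (audio to spectrogram, spectrogram sliced into feature vectors, feature vectors fed to a feed-forward network) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the frame length, hop size, slice width, hidden-layer size and number of word classes are all hypothetical, the input is a synthetic tone rather than sung audio, and the network weights are random and untrained.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape (freq_bins, n_frames)

def slice_features(spec, slice_width=8):
    """Cut the spectrogram into non-overlapping time slices, one flat vector each."""
    n_slices = spec.shape[1] // slice_width
    trimmed = spec[:, : n_slices * slice_width]  # drop leftover frames
    slices = trimmed.reshape(spec.shape[0], n_slices, slice_width)
    return slices.transpose(1, 0, 2).reshape(n_slices, -1)

def forward(x, w1, b1, w2, b2):
    """Feed-forward pass: one tanh hidden layer, softmax over word classes."""
    hidden = np.tanh(x @ w1 + b1)
    logits = hidden @ w2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Synthetic stand-in for one second of audio (assumed 8 kHz sample rate).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 220 * t)

spec = spectrogram(audio)
features = slice_features(spec)

# Random, untrained weights; a real system would fit these to labelled data.
rng = np.random.default_rng(0)
n_hidden, n_classes = 32, 5  # assumed sizes
w1 = rng.normal(scale=0.01, size=(features.shape[1], n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

probs = forward(features, w1, b1, w2, b2)  # one class distribution per slice
```

Because the accompaniment is never separated out, each slice carries the mixed voice-plus-instrument energy, which is the sense in which the signal is handled as a single entity. How the paper actually sizes the slices and trains the classifier is not stated in the abstract.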

【 License 】

CC BY   
