EURASIP Journal on Audio, Speech, and Music Processing
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
Diego de Benito-Gorron1, Joaquin Gonzalez-Rodriguez1, Alicia Lozano-Diez1, Doroteo T. Toledano1
[1] AUDIAS (Audio, Data Intelligence and Speech) - Universidad Autonoma de Madrid
Keywords: Acoustic event detection; Speech activity detection; Music activity detection; Neural networks; Convolutional networks; LSTM
DOI: 10.1186/s13636-019-0152-1
Source: DOAJ
Abstract
Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first is to train two different neural networks, one for speech detection and another for music detection. The second consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional, and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We would like to highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful to the performance of the models, showing some difficult scenarios for the detection of music and speech.
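The abstract states that each 10-second segment is represented as a mel-spectrogram before being fed to the networks, but the paper's metadata does not include the feature-extraction settings. As a rough, self-contained sketch of how such a front end can be computed (NumPy only; the sample rate of 16 kHz, 512-point FFT, 256-sample hop, and 64 mel bands are illustrative assumptions, not the authors' configuration):

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping (O'Shaughnessy formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            if center > left:
                fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    fb = mel_filterbank(sr, n_fft, n_mels)
    # Log compression; result shape: (n_mels, n_frames).
    return np.log(power @ fb.T + 1e-10).T
```

With these assumed parameters, a 10-second clip at 16 kHz (160,000 samples) yields a 64 x 624 log-mel image, the kind of time-frequency representation that convolutional layers consume directly and that an LSTM stage can then read frame by frame along the time axis.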
License
Unknown