{"title":"Multitaper Spectrogram for Classification of Speech and Music With Pretrained Audio Neural Networks","authors":"G.B Rakshith, K. Narendra, Sanjeev Gurugopinath","doi":"10.1109/DISCOVER52564.2021.9663695","DOIUrl":null,"url":null,"abstract":"In this paper, we demonstrate the viability of multitaper (MT) features for classification of s peech and music with pretrained audio neural networks (PANN). Among several well-known features for audio tagging, log-mel is widely-used. Therefore, log-mel has been used to train and establish a near-perfect accurate PANN for audio tagging. For the classification problem at hand, we study the performance of MT numerator group delay (MT-NGD) and MT magnitude (MT-Mag) spectral features and compare it with the log-mel feature. Our experimental results on the MARSYAS speech and music database shows that the accuracy of the PANN converges faster as opposed to other features, when trained with MT-NGD spectrogram. Further, the multitaper representations are observed to be robust to the presence of noise in both speech and music.","PeriodicalId":413789,"journal":{"name":"2021 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DISCOVER52564.2021.9663695","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we demonstrate the viability of multitaper (MT) features for the classification of speech and music with pretrained audio neural networks (PANN). Among the well-known features for audio tagging, the log-mel spectrogram is the most widely used; accordingly, it has been employed to train a PANN that achieves near-perfect accuracy on audio tagging. For the classification problem at hand, we study the performance of the MT numerator group delay (MT-NGD) and MT magnitude (MT-Mag) spectral features and compare them with the log-mel feature. Our experimental results on the MARSYAS speech and music database show that the accuracy of the PANN converges faster when trained with the MT-NGD spectrogram than with the other features. Further, the multitaper representations are observed to be robust to the presence of noise in both speech and music.
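As a rough illustration of the MT-Mag idea described above, the sketch below averages per-frame magnitude spectra obtained with several orthogonal DPSS (Slepian) tapers, which lowers estimator variance relative to a single-window spectrogram. This is a minimal sketch, not the authors' exact pipeline: the frame length, hop size, time-bandwidth product NW, and taper count K are illustrative assumptions, and MT-NGD and the log-mel baseline are not shown.

```python
# Minimal multitaper magnitude spectrogram (MT-Mag) sketch.
# Assumptions (not from the paper): n_fft=512, hop=160, NW=3.0, K=5 tapers.
import numpy as np
from scipy.signal.windows import dpss

def multitaper_magnitude_spectrogram(x, n_fft=512, hop=160, NW=3.0, K=5):
    """Average the magnitude spectra obtained with K DPSS tapers per frame."""
    tapers = dpss(n_fft, NW, Kmax=K)           # shape (K, n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                            # shape (n_frames, n_fft)
    spec = np.zeros((n_frames, n_fft // 2 + 1))
    for k in range(K):
        # Taper each frame with the k-th DPSS window, FFT, accumulate magnitudes.
        spec += np.abs(np.fft.rfft(frames * tapers[k], n=n_fft, axis=1))
    return (spec / K).T                        # (frequency bins, frames)

# Usage on a dummy tone; a real experiment would load MARSYAS audio instead.
if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 440 * t)            # 1 s test tone
    S = multitaper_magnitude_spectrogram(x)
    print(S.shape)                             # (257, 97) with the defaults above
```

In practice, such a feature matrix (or its log) would replace the log-mel spectrogram at the input of the PANN; the averaging over tapers is what gives the multitaper representation its smoother, lower-variance spectral estimate.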