{"title":"Multitaper Spectrogram for Classification of Speech and Music With Pretrained Audio Neural Networks","authors":"G.B Rakshith, K. Narendra, Sanjeev Gurugopinath","doi":"10.1109/DISCOVER52564.2021.9663695","DOIUrl":null,"url":null,"abstract":"In this paper, we demonstrate the viability of multitaper (MT) features for classification of s peech and music with pretrained audio neural networks (PANN). Among several well-known features for audio tagging, log-mel is widely-used. Therefore, log-mel has been used to train and establish a near-perfect accurate PANN for audio tagging. For the classification problem at hand, we study the performance of MT numerator group delay (MT-NGD) and MT magnitude (MT-Mag) spectral features and compare it with the log-mel feature. Our experimental results on the MARSYAS speech and music database shows that the accuracy of the PANN converges faster as opposed to other features, when trained with MT-NGD spectrogram. Further, the multitaper representations are observed to be robust to the presence of noise in both speech and music.","PeriodicalId":413789,"journal":{"name":"2021 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DISCOVER52564.2021.9663695","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we demonstrate the viability of multitaper (MT) features for the classification of speech and music with pretrained audio neural networks (PANN). Among the well-known features for audio tagging, the log-mel spectrogram is the most widely used; accordingly, it has been employed to train a PANN that achieves near-perfect accuracy on audio tagging. For the classification problem at hand, we study the performance of the MT numerator group delay (MT-NGD) and MT magnitude (MT-Mag) spectral features and compare them with the log-mel feature. Our experimental results on the MARSYAS speech and music database show that the accuracy of the PANN converges faster when trained with the MT-NGD spectrogram than with the other features. Further, the multitaper representations are observed to be robust to the presence of noise in both speech and music.
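As a rough illustration of the MT-Mag idea described above, the sketch below averages per-frame magnitude spectra obtained with several orthogonal DPSS (Slepian) tapers, which lowers estimator variance relative to a single-window spectrogram. This is a minimal sketch, not the authors' exact pipeline: the frame length, hop size, time-bandwidth product NW, and taper count K are illustrative assumptions, and MT-NGD and the log-mel baseline are not shown.

```python
# Minimal multitaper magnitude spectrogram (MT-Mag) sketch.
# Assumptions (not from the paper): n_fft=512, hop=160, NW=3.0, K=5 tapers.
import numpy as np
from scipy.signal.windows import dpss

def multitaper_magnitude_spectrogram(x, n_fft=512, hop=160, NW=3.0, K=5):
    """Average the magnitude spectra obtained with K DPSS tapers per frame."""
    tapers = dpss(n_fft, NW, Kmax=K)           # shape (K, n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                            # shape (n_frames, n_fft)
    spec = np.zeros((n_frames, n_fft // 2 + 1))
    for k in range(K):
        # Taper each frame with the k-th DPSS window, FFT, accumulate magnitudes.
        spec += np.abs(np.fft.rfft(frames * tapers[k], n=n_fft, axis=1))
    return (spec / K).T                        # (frequency bins, frames)

# Usage on a dummy tone; a real experiment would load MARSYAS audio instead.
if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 440 * t)            # 1 s test tone
    S = multitaper_magnitude_spectrogram(x)
    print(S.shape)                             # (257, 97) with the defaults above
```

In practice, such a feature matrix (or its log) would replace the log-mel spectrogram at the input of the PANN; the averaging over tapers is what gives the multitaper representation its smoother, lower-variance spectral estimate.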