Mohit Yadav, A. Sao, A. D. Dileep, Padmanabhan Rajan
Title: Group delay functions for speaker diarization
Venue: 2016 Twenty Second National Conference on Communication (NCC)
Publication date: 2016-03-04
DOI: https://doi.org/10.1109/NCC.2016.7561127
Citation count: 1
Abstract
Speaker diarization is the task of determining “who spoke when” in a speech recording of unknown duration containing an unknown number of speakers. The unsupervised nature of this task makes it challenging and demands a feature representation that is highly discriminative across speakers. Commonly used features based on the short-time Fourier transform are usually derived from the magnitude spectrum. The short-time phase information is usually discarded because of the complications involved in processing it. However, the information embedded in the phase has been shown to be beneficial for many speech tasks. In this paper, we explore it for speaker diarization. We explore two approaches for utilizing phase information through group delay functions. We present our experiments and results on the publicly available AMI meeting corpus. Our experiments demonstrate that features derived from group delay functions provide comparable or improved diarization accuracy, both when compared with and when fused with the popularly used mel-frequency cepstral coefficient (MFCC) features.
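For readers unfamiliar with the term, the group delay function is the negative derivative of the Fourier phase with respect to frequency. It can be computed without phase unwrapping through the standard identity τ(ω) = (X_R·Y_R + X_I·Y_I) / |X(ω)|², where X is the transform of the frame x[n] and Y is the transform of n·x[n]. The sketch below illustrates only this textbook definition; it is not the feature extraction used in the paper, whose two approaches are not detailed in the abstract. The function names `dft` and `group_delay`, the FFT size, and the `eps` floor are illustrative assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath


def dft(x, n_fft):
    """Naive DFT of x, zero-padded to n_fft points; returns bins 0 .. n_fft // 2."""
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * n / n_fft) for n, v in enumerate(x))
        for k in range(n_fft // 2 + 1)
    ]


def group_delay(frame, n_fft=64, eps=1e-12):
    """Group delay per frequency bin, in samples, for one frame.

    Uses tau = (X_R * Y_R + X_I * Y_I) / |X|^2, where Y is the DFT of n * x[n].
    eps (an assumed floor) guards against division by zero at spectral nulls.
    """
    X = dft(frame, n_fft)
    Y = dft([n * v for n, v in enumerate(frame)], n_fft)
    return [
        (a.real * b.real + a.imag * b.imag) / max(abs(a) ** 2, eps)
        for a, b in zip(X, Y)
    ]


# Sanity check: a unit impulse delayed by 3 samples has a pure linear phase,
# so its group delay is 3 samples at every frequency bin.
impulse = [0.0] * 16
impulse[3] = 1.0
tau = group_delay(impulse)
```

On real speech the denominator |X(ω)|² becomes very small near spectral nulls, which makes the raw function spiky; this is one of the processing complications that typically motivates smoothed or modified variants.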