{"title":"Robust Speaker Diarization for News Broadcast","authors":"M. Karthik, Mirishkar Sai Ganesh, B. Patnaik","doi":"10.1109/WISPNET.2018.8538527","DOIUrl":null,"url":null,"abstract":"This contribution presents an efficient speaker diarization method that applies the Bayesian information criterion (BIC) to speaker embeddings. In contrast to traditional approaches, speaker segmentation is performed using hand-crafted spectral features. The proposed method can segment broadcast news recordings using $i$-vectors as well as a GMM speaker model, and it uses conventional GMM-based agglomerative clustering to group the data. An unsupervised voice activity detector (VAD) has been developed to distinguish speech frames from non-speech frames so that the non-speech frames can be discarded. The proposed method significantly outperformed the benchmark methods and reduced the diarization error by 14%.","PeriodicalId":6858,"journal":{"name":"2018 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)","volume":"11 1","pages":"1-4"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WISPNET.2018.8538527","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This contribution presents an efficient speaker diarization method that applies the Bayesian information criterion (BIC) to speaker embeddings. In contrast to traditional approaches, speaker segmentation is performed using hand-crafted spectral features. The proposed method can segment broadcast news recordings using $i$-vectors as well as a GMM speaker model, and it uses conventional GMM-based agglomerative clustering to group the data. An unsupervised voice activity detector (VAD) has been developed to distinguish speech frames from non-speech frames so that the non-speech frames can be discarded. The proposed method significantly outperformed the benchmark methods and reduced the diarization error by 14%.
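
The abstract does not give the exact form of the BIC test, so the following is only a minimal sketch of the textbook ΔBIC speaker-change criterion, not the authors' implementation. It assumes each candidate segment is modelled by a single full-covariance Gaussian over frame-level feature vectors; the function names `delta_bic` and `best_change_point`, the penalty weight `lam`, and the `margin` parameter are hypothetical choices made for illustration.

```python
import numpy as np


def _logdet_cov(x):
    # Log-determinant of the full sample covariance; a small ridge keeps it well defined.
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(x, t, lam=1.0):
    """Delta-BIC for splitting the frames x (shape: n_frames x n_dims) at index t.

    A positive value favours two separate Gaussians over one, i.e. it
    suggests a speaker change at frame t. lam is an assumed penalty weight.
    """
    n, d = x.shape
    x1, x2 = x[:t], x[t:]
    # Likelihood gain from modelling the two halves separately.
    gain = 0.5 * (
        n * _logdet_cov(x)
        - len(x1) * _logdet_cov(x1)
        - len(x2) * _logdet_cov(x2)
    )
    # Penalty for the extra mean vector and covariance matrix.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty


def best_change_point(x, margin=20, lam=1.0):
    """Return (t, score) for the best split, or (None, score) if no split has score > 0.

    margin (assumed) keeps enough frames on each side for a covariance estimate.
    """
    scores = [(delta_bic(x, t, lam), t) for t in range(margin, len(x) - margin)]
    score, t = max(scores)
    return (t, score) if score > 0 else (None, score)
```

In a diarization pipeline a test of this kind would typically be run over a sliding window of speech-only frames (after VAD), and the same ΔBIC score can serve as a merge criterion during agglomerative clustering; both uses are common practice rather than details confirmed by the abstract.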