Multi-Speaker Diarization using Long-Short Term Memory Network
Nayyer Aafaq, Usama Qamar, Sohaib Ali Khan, Z. Khan
2023 3rd International Conference on Artificial Intelligence (ICAI), 22 February 2023. DOI: 10.1109/ICAI58407.2023.10136670
The task of multi-speaker diarization involves detecting the number of speakers in a recording and segregating the audio segments corresponding to each speaker. Despite the tremendous advancements in deep learning, multi-speaker diarization is still far from achieving acceptable performance. In this work, we address the problem by first obtaining speech timestamps using voice activity detection and sliding-window techniques. We then extract Mel-spectrograms / Mel-frequency cepstral coefficients (MFCCs) and train a Long Short-Term Memory (LSTM) network to produce audio embeddings known as d-vectors. Subsequently, we employ K-Means and spectral clustering to segment all the speakers in a given audio file. We evaluate the proposed framework on the publicly available VoxConverse dataset and compare against similar benchmarks in the existing literature. The proposed model performs better than, or on par with, existing techniques despite its simpler framework.
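The front end of such a pipeline (speech timestamps plus per-window features) can be pictured roughly as follows. This is an illustrative Python sketch, not the authors' code: librosa's energy-based effects.split stands in for a dedicated voice-activity detector, and the window length, hop length, and MFCC dimension are assumed values.

```python
# Sketch of the front end: rough VAD timestamps, then sliding-window MFCCs.
import librosa

def extract_frames(wav_path, sr=16000, win_s=1.0, hop_s=0.5, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=sr)
    # Rough VAD: keep intervals whose energy is within 30 dB of the peak.
    speech_intervals = librosa.effects.split(y, top_db=30)
    windows = []
    size, step = int(win_s * sr), int(hop_s * sr)
    for start, end in speech_intervals:
        # Slide a fixed-length window over each detected speech region.
        pos = start
        while pos + size <= end:
            chunk = y[pos:pos + size]
            # MFCC matrix for this window; shape (n_mfcc, frames).
            mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc)
            windows.append((pos / sr, (pos + size) / sr, mfcc))
            pos += step
    return windows  # list of (start_time, end_time, features)
```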
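The embedding stage trains an LSTM whose per-window output, after projection and L2 normalisation, serves as the d-vector. Below is a minimal PyTorch sketch of such an encoder; the layer sizes are chosen for illustration and are not the paper's settings.

```python
# Minimal LSTM d-vector encoder sketch (stacked LSTM over MFCC frames,
# final state projected and L2-normalised).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVectorLSTM(nn.Module):
    def __init__(self, n_mfcc=40, hidden=256, emb_dim=128, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        # x: (batch, time, n_mfcc) MFCC frames for one sliding window.
        out, _ = self.lstm(x)
        # Summarise the window with the last time step of the top layer.
        emb = self.proj(out[:, -1, :])
        # d-vectors are conventionally unit length so cosine similarity
        # (and hence the downstream clustering) is well behaved.
        return F.normalize(emb, p=2, dim=1)
```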
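Finally, the per-window d-vectors are grouped into speakers. A hedged scikit-learn sketch of this clustering stage, assuming the speaker count is supplied (the paper detects it; eigengap analysis of the spectral affinity matrix is one common way to estimate it):

```python
# Clustering stage sketch: assign each sliding window a speaker label
# via K-Means or spectral clustering over the d-vectors.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def diarize(d_vectors, n_speakers, method="spectral"):
    X = np.asarray(d_vectors)  # (n_windows, emb_dim), L2-normalised
    if method == "kmeans":
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(X)
    else:
        # Nearest-neighbour affinity keeps the similarity graph sparse.
        labels = SpectralClustering(
            n_clusters=n_speakers, affinity="nearest_neighbors"
        ).fit_predict(X)
    return labels  # one speaker label per sliding window
```

Mapping each label back to its window's (start, end) timestamps then yields the final diarization segments.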