Multi-Speaker Diarization using Long-Short Term Memory Network
Nayyer Aafaq, Usama Qamar, Sohaib Ali Khan, Z. Khan
2023 3rd International Conference on Artificial Intelligence (ICAI), 22 February 2023. DOI: 10.1109/ICAI58407.2023.10136670
The task of multi-speaker diarization involves detecting the number of speakers in a recording and segregating the audio segments corresponding to each speaker. Despite the tremendous advancements in deep learning, multi-speaker diarization is still far from achieving acceptable performance. In this work, we address the problem by first obtaining speech timestamps using voice activity detection and sliding-window techniques. We then extract Mel-spectrograms / Mel-frequency cepstral coefficients (MFCCs) and train a Long Short-Term Memory (LSTM) network to produce audio embeddings known as d-vectors. Subsequently, we employ K-Means and spectral clustering to segment all the speakers in a given audio file. We evaluate the proposed framework on the publicly available VoxConverse dataset and compare against similar benchmarks in the existing literature. The proposed model performs better than, or on par with, existing techniques despite its simpler framework.
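The front end of such a pipeline (speech timestamps plus per-window features) can be pictured roughly as follows. This is an illustrative Python sketch, not the authors' code: librosa's energy-based effects.split stands in for a dedicated voice-activity detector, and the window length, hop length, and MFCC dimension are assumed values.

```python
# Sketch of the front end: rough VAD timestamps, then sliding-window MFCCs.
import librosa

def extract_frames(wav_path, sr=16000, win_s=1.0, hop_s=0.5, n_mfcc=40):
    y, sr = librosa.load(wav_path, sr=sr)
    # Rough VAD: keep intervals whose energy is within 30 dB of the peak.
    speech_intervals = librosa.effects.split(y, top_db=30)
    windows = []
    size, step = int(win_s * sr), int(hop_s * sr)
    for start, end in speech_intervals:
        # Slide a fixed-length window over each detected speech region.
        pos = start
        while pos + size <= end:
            chunk = y[pos:pos + size]
            # MFCC matrix for this window; shape (n_mfcc, frames).
            mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=n_mfcc)
            windows.append((pos / sr, (pos + size) / sr, mfcc))
            pos += step
    return windows  # list of (start_time, end_time, features)
```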
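The embedding stage trains an LSTM whose per-window output, after projection and L2 normalisation, serves as the d-vector. Below is a minimal PyTorch sketch of such an encoder; the layer sizes are chosen for illustration and are not the paper's settings.

```python
# Minimal LSTM d-vector encoder sketch (stacked LSTM over MFCC frames,
# final state projected and L2-normalised).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVectorLSTM(nn.Module):
    def __init__(self, n_mfcc=40, hidden=256, emb_dim=128, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        # x: (batch, time, n_mfcc) MFCC frames for one sliding window.
        out, _ = self.lstm(x)
        # Summarise the window with the last time step of the top layer.
        emb = self.proj(out[:, -1, :])
        # d-vectors are conventionally unit length so cosine similarity
        # (and hence the downstream clustering) is well behaved.
        return F.normalize(emb, p=2, dim=1)
```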
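Finally, the per-window d-vectors are grouped into speakers. A hedged scikit-learn sketch of this clustering stage, assuming the speaker count is supplied (the paper detects it; eigengap analysis of the spectral affinity matrix is one common way to estimate it):

```python
# Clustering stage sketch: assign each sliding window a speaker label
# via K-Means or spectral clustering over the d-vectors.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def diarize(d_vectors, n_speakers, method="spectral"):
    X = np.asarray(d_vectors)  # (n_windows, emb_dim), L2-normalised
    if method == "kmeans":
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(X)
    else:
        # Nearest-neighbour affinity keeps the similarity graph sparse.
        labels = SpectralClustering(
            n_clusters=n_speakers, affinity="nearest_neighbors"
        ).fit_predict(X)
    return labels  # one speaker label per sliding window
```

Mapping each label back to its window's (start, end) timestamps then yields the final diarization segments.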