Speaker Diarization For Vietnamese Conversations Using Deep Neural Network Embeddings

Tung Lam Nguyen, Bao Thang Ta, Van Hai Do, T. Tran, Nhat Minh Le
Published in: 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), 2022-07-27
DOI: 10.1109/ICCE55644.2022.9852042 (https://doi.org/10.1109/ICCE55644.2022.9852042)
Citations: 1

Abstract

Speaker diarization, often described as finding “who spoke when”, is the task of dividing a conversation into segments, each spoken by a single speaker. While speaker diarization has numerous applications, there are few, if any, reports on its use in Vietnamese speech processing systems. The key to performing this task accurately is learning discriminative speaker representations, or speaker embeddings. Recently, X-Vectors and ECAPA-TDNN, both based on deep neural networks, have emerged as state-of-the-art speaker embedding networks for English corpora. In this work, we build a speaker diarization system for Vietnamese telephone conversations and explore the capabilities of X-Vectors and ECAPA-TDNN within it. We also evaluate the discriminative characteristics of these speaker embedding networks on a bare-bones speaker verification system. The data used include proprietary datasets (IPCC-110000, IPCC-2000, VTR-1350) and a public dataset (ZALO-400). While these datasets can be used directly to train and test the speaker verification task, for the speaker diarization task we have to simulate multi-way conversations. Our experiments show that the ECAPA-TDNN system outperforms the X-Vectors system on both the speaker verification and speaker diarization tasks.
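The abstract does not say how segment embeddings are grouped into speakers, so the following is only a minimal sketch of the general idea of embedding-based diarization, not the authors' pipeline. It assumes each speech segment already has an embedding (for example from an X-Vectors or ECAPA-TDNN network) and uses a greedy cosine-similarity threshold as an illustrative stand-in for whatever clustering the paper applies. The names `cosine` and `diarize`, the `threshold` value, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diarize(segments, threshold=0.7):
    """Assign a speaker index to each (start, end, embedding) segment.

    Assumption: greedy threshold clustering, chosen for illustration only.
    Returns (start, end, speaker) turns, with back-to-back segments of the
    same speaker merged into one turn.
    """
    centroids = []  # running sum of embeddings per discovered speaker
    turns = []
    for start, end, emb in segments:
        scores = [cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            # No existing speaker is similar enough: open a new one.
            centroids.append(list(emb))
            speaker = len(centroids) - 1
        else:
            # Fold the segment into the closest speaker's running sum.
            centroids[best] = [c + e for c, e in zip(centroids[best], emb)]
            speaker = best
        if turns and turns[-1][2] == speaker and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, speaker)
        else:
            turns.append((start, end, speaker))
    return turns


# Toy input: hypothetical 2-dimensional embeddings for four segments.
example_segments = [
    (0.0, 1.5, [1.0, 0.1]),
    (1.5, 3.0, [0.9, 0.2]),
    (3.0, 4.0, [0.1, 1.0]),
    (4.0, 5.5, [1.0, 0.0]),
]
print(diarize(example_segments))
```

On the toy input, the first two segments merge into one turn for speaker 0, the third opens speaker 1, and the fourth returns to speaker 0. Real systems typically use higher-dimensional embeddings and a more robust clustering step; the threshold here is arbitrary.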