Speaker Diarization For Vietnamese Conversations Using Deep Neural Network Embeddings

2022 IEEE Ninth International Conference on Communications and Electronics (ICCE). Pub Date: 2022-07-27. DOI: 10.1109/ICCE55644.2022.9852042

Tung Lam Nguyen, Bao Thang Ta, Van Hai Do, T. Tran, Nhat Minh Le


Citations: 1

Abstract

Speaker diarization, often described as finding “who spoke when”, is the task of dividing a conversation into segments, each spoken by a single speaker. While speaker diarization has numerous applications, there are few to no reports on its application in Vietnamese speech processing systems. The key to performing this task accurately is learning discriminative speaker representations, or speaker embeddings. Recently, X-Vectors and ECAPA-TDNN, both based on deep neural networks, have emerged as state-of-the-art speaker embedding networks for English corpora. In this work, we build a speaker diarization system for Vietnamese telephone conversations and explore the capabilities of X-Vectors and ECAPA-TDNN in that system. We also evaluate the discriminative characteristics of these speaker embedding networks on a bare-bones speaker verification system. The data used include proprietary datasets (IPCC-110000, IPCC-2000, VTR-1350) and a public dataset (ZALO-400). While these datasets can be used directly for training and testing on the speaker verification task, for the speaker diarization task we have to simulate multi-way conversations. Our experiments show that the ECAPA-TDNN system outperforms the X-Vectors system on both the speaker verification and speaker diarization tasks.
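
To make the roles of speaker embeddings in verification and diarization concrete, the following is a minimal illustrative sketch, not the authors' system. It assumes each speech segment has already been mapped to a fixed-length embedding (for example by an X-Vectors or ECAPA-TDNN network), scores pairs with cosine similarity, and groups segments with a simple greedy threshold clustering. The abstract does not state which scoring or clustering method was used, so the functions `cosine`, `same_speaker`, and `diarize`, the threshold of 0.7, and the toy 3-dimensional embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, Sequence[float]]  # (start_sec, end_sec, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = 0.7) -> bool:
    """Bare-bones verification: accept if similarity reaches the threshold.

    The threshold value is an illustrative assumption, not from the paper.
    """
    return cosine(a, b) >= threshold


def diarize(segments: List[Segment],
            threshold: float = 0.7) -> List[Tuple[float, float, int]]:
    """Greedy clustering of segment embeddings into speaker turns.

    Each segment joins the most similar existing speaker centroid if the
    similarity reaches the threshold; otherwise it starts a new speaker.
    Back-to-back segments with the same label are merged into one turn.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    turns: List[Tuple[float, float, int]] = []
    for start, end, emb in segments:
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No existing speaker is close enough: open a new cluster.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            # Update the running mean of the matched speaker's centroid.
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        if turns and turns[-1][2] == best and turns[-1][1] == start:
            turns[-1] = (turns[-1][0], end, best)
        else:
            turns.append((start, end, best))
    return turns


# Toy embeddings standing in for real network outputs (hypothetical data).
segments: List[Segment] = [
    (0.0, 1.5, [1.0, 0.1, 0.0]),
    (1.5, 3.0, [0.9, 0.2, 0.0]),
    (3.0, 4.5, [0.0, 0.1, 1.0]),
    (4.5, 6.0, [1.0, 0.0, 0.1]),
]

for start, end, speaker in diarize(segments):
    print(f"{start:.1f}-{end:.1f}s: speaker {speaker}")
```

In this sketch the quality of the result depends entirely on how well the embeddings separate speakers, which is why the abstract treats discriminative embeddings as the key ingredient. A production system would typically replace the greedy pass with a more robust clustering method and tune the threshold on held-out data.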
