Speaker Diarization For Vietnamese Conversations Using Deep Neural Network Embeddings
Tung Lam Nguyen, Bao Thang Ta, Van Hai Do, T. Tran, Nhat Minh Le
2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), published 2022-07-27
DOI: 10.1109/ICCE55644.2022.9852042 (https://doi.org/10.1109/ICCE55644.2022.9852042)
Citations: 1
Abstract
Speaker diarization, the task of determining "who spoke when," divides a conversation into segments spoken by the same speaker. While speaker diarization has numerous applications, there are few to no reports on its use in Vietnamese speech processing systems. The key to performing this task accurately is learning discriminative speaker representations, or speaker embeddings. Recently, X-Vectors and ECAPA-TDNN, both based on deep neural networks, have emerged as state-of-the-art speaker embedding networks for English corpora. In this work, we build a speaker diarization system for Vietnamese telephone conversations and explore the capabilities of X-Vectors and ECAPA-TDNN within it. We also evaluate the discriminative characteristics of these speaker embedding networks on a bare-bones speaker verification system. The data used include proprietary datasets (IPCC-110000, IPCC-2000, VTR-1350) and a public dataset (ZALO-400). While these datasets can be used directly to train and test the speaker verification task, for the speaker diarization task we have to simulate multi-way conversations. Our experiments show that the ECAPA-TDNN system outperforms the X-Vectors system on both the speaker verification and speaker diarization tasks.
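To make the role of speaker embeddings concrete, the following is a minimal sketch of how fixed-length embeddings could be used for both tasks. It is not the authors' pipeline: the abstract does not state the scoring or clustering method, so cosine similarity, the greedy threshold clustering, the threshold value 0.5, and the function names `cosine_score`, `same_speaker`, and `assign_speakers` are all assumptions made for illustration. The toy 2-D vectors stand in for real X-Vector or ECAPA-TDNN outputs.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def same_speaker(a, b, threshold=0.5):
    """Verification decision (assumed rule): accept if the score reaches the threshold."""
    return cosine_score(a, b) >= threshold


def assign_speakers(segment_embeddings, threshold=0.5):
    """Greedy clustering sketch for diarization (an assumption, not the paper's method).

    Each segment joins the most similar existing speaker cluster if the
    cosine score reaches the threshold; otherwise it starts a new cluster.
    Returns one integer speaker label per segment.
    """
    centroids = []  # running sum of embeddings per speaker cluster
    labels = []
    for emb in segment_embeddings:
        emb = np.asarray(emb, dtype=float)
        scores = [cosine_score(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            k = int(np.argmax(scores))
            centroids[k] = centroids[k] + emb
        else:
            k = len(centroids)
            centroids.append(emb.copy())
        labels.append(k)
    return labels


if __name__ == "__main__":
    # Toy embeddings for five consecutive segments of a two-speaker call.
    segments = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]
    print(assign_speakers(segments))
```

In this sketch the same similarity function serves both tasks: verification compares two embeddings directly, while diarization compares each segment embedding against the clusters found so far. A more discriminative embedding network separates speakers more cleanly under either use.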