Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation

Applied Mathematics and Sciences An International Journal (MathSJ), Pub Date: 2023-06-26, DOI: 10.5121/mathsj.2023.10201

Jagat Chaitanya Prabhala, Venkatnareshbabu K, R. Ravi


Citations: 0

Abstract



Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording containing an unknown amount of speech from an unknown number of unknown speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Both supervised and unsupervised algorithms are used to address speaker diarization, but exhaustive labeling of the training dataset can become costly in supervised learning, while accuracy can be compromised with unsupervised approaches. This paper presents a novel approach to speaker diarization that defines loosely labeled data and employs x-vector embeddings together with a formalized threshold search over a given abstract similarity metric to cluster temporal segments into unique user segments. The proposed algorithm uses concepts from graph theory, matrix algebra, and genetic algorithms to formulate and solve the optimization problem. The algorithm is applied to English, Spanish, and Chinese audio, and its performance is evaluated using well-known similarity metrics. The results demonstrate the robustness of the proposed approach. The findings have significant implications for speech processing and speaker identification, including for languages with tonal differences. The proposed method offers a practical and efficient solution for speaker diarization in real-world scenarios with labeling time and cost constraints.
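To make the idea of threshold-based clustering concrete, the following is a minimal sketch and not the authors' algorithm. It assumes cosine similarity as the abstract similarity metric, uses tiny 2-D vectors as stand-ins for x-vector embeddings, treats segments as graph nodes joined by an edge when their similarity reaches the threshold, and reads speakers off as connected components. It also assumes, hypothetically, that the only "loose" label available is the number of speakers, and it swaps in a plain grid search where the paper uses a genetic algorithm. All function names are illustrative.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def cluster_by_threshold(similarity, threshold):
    """Label segments by connected components of the graph whose edges
    join pairs of segments with similarity >= threshold."""
    n = similarity.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def search_threshold(similarity, expected_speakers, candidates):
    """Grid search (a stand-in for the genetic algorithm): return the first
    candidate whose cluster count is closest to `expected_speakers`."""
    best_gap, best_threshold = None, None
    for t in candidates:
        count = len(set(cluster_by_threshold(similarity, t)))
        gap = abs(count - expected_speakers)
        if best_gap is None or gap < best_gap:
            best_gap, best_threshold = gap, float(t)
    return best_threshold


# Toy stand-ins for segment embeddings: two segments per hypothetical speaker.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
similarity = cosine_similarity_matrix(embeddings)
threshold = search_threshold(similarity, expected_speakers=2,
                             candidates=np.linspace(0.0, 1.0, 11))
labels = cluster_by_threshold(similarity, threshold)
print(threshold, labels)
```

Connected components give a single-linkage style grouping: a low threshold merges everything into one speaker, while a threshold near 1 isolates every segment, which is why the threshold itself becomes the quantity to optimize.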
