Advanced approaches to speaker diarization of audio documents

2009 Joint Conferences on Pervasive Computing (JCPC). Pub Date: 2009-12-01. DOI: 10.1109/JCPC.2009.5420194

K. Markov


Citations: 2

Abstract

Speaker diarization is the process of annotating an audio document with the speaker identity of each speech segment, along with its start and end times. Assuming that the audio input consists of speech only, or that non-speech segments have already been identified by another method, the task of speaker diarization is to find “who spoke when”. Since there is no prior information about the number of speakers, the main approach is to apply segment clustering. According to the clustering algorithm used, speaker diarization systems can be divided into two groups: 1) those based on agglomerative clustering, and 2) those based on on-line clustering. Agglomerative clustering is an off-line approach and is used in most current systems because it gives accurate results and can be fine-tuned by performing several processing passes over the data. This, however, comes at the cost of a high computational load, which increases exponentially with the number of segments, and of the requirement that the whole audio document be available in advance. In contrast, systems based on on-line clustering have an almost constant computational load and run in real time with low latency, but are generally less accurate than off-line systems. As we show in this paper, with advanced on-line learning methods and an original design, on-line systems can make fewer errors than off-line systems and can even run faster than real time with very low latency.
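As a rough illustration of the on-line clustering idea described above, the following minimal Python sketch labels segments one at a time as they arrive. It is not the system presented in the paper. The class name `OnlineSpeakerClusterer`, the Euclidean distance, the running-mean speaker model, and the `new_speaker_threshold` parameter are all assumptions chosen for brevity. Each segment is assumed to be already reduced to a fixed-length feature vector.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OnlineSpeakerClusterer:
    """Toy on-line clusterer (illustrative only, not the paper's method).

    Keeps one running-mean centroid per hypothesised speaker.
    """

    def __init__(self, new_speaker_threshold):
        # Hypothetical tuning parameter: maximum distance at which a
        # segment is still attributed to an existing speaker.
        self.threshold = new_speaker_threshold
        self.centroids = []  # one mean feature vector per speaker so far
        self.counts = []     # number of segments assigned to each speaker

    def assign(self, features):
        """Label one incoming segment and update the speaker models."""
        if self.centroids:
            distances = [euclidean(features, c) for c in self.centroids]
            best = min(range(len(distances)), key=distances.__getitem__)
            if distances[best] <= self.threshold:
                # Incremental mean update: no past segments are revisited.
                self.counts[best] += 1
                n = self.counts[best]
                self.centroids[best] = [
                    c + (x - c) / n
                    for c, x in zip(self.centroids[best], features)
                ]
                return best
        # No existing speaker is close enough: open a new cluster.
        self.centroids.append(list(features))
        self.counts.append(1)
        return len(self.centroids) - 1


def diarize(segments, threshold):
    """Turn (start, end, features) segments into (start, end, speaker_id)."""
    clusterer = OnlineSpeakerClusterer(threshold)
    return [(start, end, clusterer.assign(feats))
            for start, end, feats in segments]
```

In this sketch the work per segment depends on the number of speakers found so far, not on the number of segments already processed, which is the property that keeps the load of on-line systems nearly constant. An agglomerative system would instead compare clusters pairwise over the whole document and merge them over repeated passes.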

