Never-ending learning system for on-line speaker diarization

K. Markov, Satoshi Nakamura
Published in: 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007-12-01
DOI: 10.1109/ASRU.2007.4430197 (https://doi.org/10.1109/ASRU.2007.4430197)
Citations: 40

Abstract

In this paper, we describe a new high-performance on-line speaker diarization system that runs faster than real time and has very low latency. It consists of several modules: voice activity detection, novel speaker detection, speaker gender classification, and speaker identity classification. All modules share a set of Gaussian mixture models (GMMs) representing pause, male speakers, female speakers, and each individual speaker. Initially, there are only three GMMs, one for pause and one for each of the two speaker genders, trained in advance from some data. During the diarization process, the system decides for each speech segment whether it comes from a new speaker or from an already known speaker. In the case of a new speaker, the speaker's gender is identified, and a new GMM is spawned by copying the parameters of the corresponding gender GMM. This GMM is then learned on-line using the speech segment data, and from this point on it represents the new speaker. All individual speaker models are produced in this way. In the case of a known speaker, the speaker is identified and the corresponding GMM is again learned on-line. To prevent unlimited growth in the number of speaker models, models that have not been selected as winners for a long period of time are deleted from the system. This allows the system to perform its task indefinitely, in addition to being capable of self-organization (i.e., unsupervised adaptive learning) and preservation of the learned knowledge (i.e., the speakers). Such functionalities are attributed to so-called Never-Ending Learning systems. For evaluation, we used part of the TC-STAR database consisting of European Parliament Plenary speeches. The results show that the system achieves a speaker diarization error rate of 4.6% with a latency of at most 3 seconds.
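
The control flow described in the abstract (pause check, gender classification, novelty detection, spawning a speaker model from the gender model, on-line adaptation, and deletion of long-idle models) can be illustrated with a minimal sketch. It is not the authors' implementation. Several simplifying assumptions are made here: each model is a single diagonal Gaussian rather than a GMM, on-line learning is a running-mean update with fixed variance, a segment is treated as a new speaker when no existing speaker model scores at least as well as the winning gender model, and idle time is counted in processed segments. The names `Model`, `OnlineDiarizer`, `max_idle`, and the `spkN` labels are hypothetical, and the toy feature values are invented for illustration only.

```python
import copy
import math
from dataclasses import dataclass


@dataclass
class Model:
    """Single diagonal Gaussian; a simplified stand-in for a GMM (assumption)."""
    mean: list
    var: list
    count: float = 1.0   # pseudo-count of frames behind the current mean
    last_win: int = 0    # index of the last segment this model won
    gender: str = ""

    def score(self, frames):
        """Average per-frame log-likelihood of the segment."""
        total = 0.0
        for x in frames:
            for xi, m, v in zip(x, self.mean, self.var):
                total -= 0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        return total / len(frames)

    def adapt(self, frames):
        """On-line update: running mean only, variance kept fixed (assumption)."""
        for x in frames:
            self.count += 1.0
            self.mean = [m + (xi - m) / self.count for xi, m in zip(x, self.mean)]


class OnlineDiarizer:
    def __init__(self, pause, male, female, max_idle=2):
        self.pause = pause
        self.genders = {"male": male, "female": female}
        self.speakers = {}        # speaker label -> Model
        self.max_idle = max_idle  # segments a model may go without winning
        self.step = 0
        self.spawned = 0

    def process(self, frames):
        self.step += 1
        # Gender classification: pick the better-scoring gender model.
        gender = max(self.genders, key=lambda g: self.genders[g].score(frames))
        gender_score = self.genders[gender].score(frames)
        # Voice activity detection: the pause model wins over speech.
        if self.pause.score(frames) > gender_score:
            return "pause"
        # Speaker identification among the known speakers, if any.
        best = max(self.speakers,
                   key=lambda s: self.speakers[s].score(frames), default=None)
        # Novelty detection (assumed rule): no speaker model beats the gender model.
        if best is None or self.speakers[best].score(frames) < gender_score:
            best = f"spk{self.spawned}"
            self.spawned += 1
            # Spawn the new speaker model by copying the gender model's parameters.
            self.speakers[best] = copy.deepcopy(self.genders[gender])
            self.speakers[best].gender = gender
        winner = self.speakers[best]
        winner.adapt(frames)
        winner.last_win = self.step
        # Delete models that have not won for more than max_idle segments.
        idle = [s for s, m in self.speakers.items()
                if self.step - m.last_win > self.max_idle]
        for sid in idle:
            del self.speakers[sid]
        return best


# Toy 2-D features, invented for illustration.
d = OnlineDiarizer(pause=Model([10.0, 10.0], [1.0, 1.0]),
                   male=Model([0.0, 0.0], [1.0, 1.0]),
                   female=Model([5.0, 5.0], [1.0, 1.0]))
seg_a = [[-1.0, -1.0]] * 3   # segments from a first speaker
seg_b = [[1.0, 1.0]] * 3     # one segment from a second speaker
labels = [d.process(s) for s in (seg_a, seg_b, seg_a, seg_a, seg_a)]
print(labels, sorted(d.speakers))
```

In this toy run the second speaker is heard only once, so its model goes unselected for more than `max_idle` segments and is removed, which mirrors the mechanism the abstract gives for keeping the number of speaker models bounded.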