Advancing speaker embedding learning: Wespeaker toolkit for research and production

IF 2.4 3区 计算机科学 Q2 ACOUSTICS
Shuai Wang , Zhengyang Chen , Bing Han , Hongji Wang , Chengdong Liang , Binbin Zhang , Xu Xiang , Wen Ding , Johan Rohdin , Anna Silnova , Yanmin Qian , Haizhou Li
{"title":"Advancing speaker embedding learning: Wespeaker toolkit for research and production","authors":"Shuai Wang ,&nbsp;Zhengyang Chen ,&nbsp;Bing Han ,&nbsp;Hongji Wang ,&nbsp;Chengdong Liang ,&nbsp;Binbin Zhang ,&nbsp;Xu Xiang ,&nbsp;Wen Ding ,&nbsp;Johan Rohdin ,&nbsp;Anna Silnova ,&nbsp;Yanmin Qian ,&nbsp;Haizhou Li","doi":"10.1016/j.specom.2024.103104","DOIUrl":null,"url":null,"abstract":"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103104"},"PeriodicalIF":2.4000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000761","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

Abstract

Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at https://github.com/wenet-e2e/wespeaker.

推进演讲者嵌入式学习:用于研究和制作的 Wespeaker 工具包
扬声器建模在各种任务中起着至关重要的作用,而固定维度的向量表示(即扬声器嵌入)是最主要的建模方法。这些嵌入通常是在扬声器验证的框架内进行评估的,但它们的用途也扩展到了扬声器日记化、语音合成、语音转换和目标扬声器提取等广泛的相关任务中。本文介绍的 Wespeaker 是一款用户友好型工具包,既可用于研究,也可用于生产,专门用于学习扬声器嵌入。Wespeaker 提供可扩展的数据管理、最先进的扬声器嵌入模型和自监督学习训练方案,具有利用大规模无标注真实世界数据的潜力。该工具包采用了结构化配方,这些配方已成功应用于各种扬声器验证挑战的成功系统中,确保了极具竞争力的结果。对于面向生产的开发,Wespeaker集成了CPU和GPU兼容的部署和运行代码,支持Windows、Linux、Mac等主流平台和horizon X3'PI等设备上芯片。Wespeaker 还通过提供各种预训练模型提供现成的高质量扬声器嵌入,这些模型可以毫不费力地应用于需要扬声器建模的不同任务。该工具包可通过 https://github.com/wenet-e2e/wespeaker 公开获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信