Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
{"title":"Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens","authors":"Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg","doi":"arxiv-2409.06656","DOIUrl":null,"url":null,"abstract":"We propose Sortformer, a novel neural model for speaker diarization, trained\nwith unconventional objectives compared to existing end-to-end diarization\nmodels. The permutation problem in speaker diarization has long been regarded\nas a critical challenge. Most prior end-to-end diarization systems employ\npermutation invariant loss (PIL), which optimizes for the permutation that\nyields the lowest error. In contrast, we introduce Sort Loss, which enables a\ndiarization model to autonomously resolve permutation, with or without PIL. We\ndemonstrate that combining Sort Loss and PIL achieves performance competitive\nwith state-of-the-art end-to-end diarization models trained exclusively with\nPIL. Crucially, we present a streamlined multispeaker ASR architecture that\nleverages Sortformer as a speaker supervision model, embedding speaker label\nestimation within the ASR encoder state using a sinusoidal kernel function.\nThis approach resolves the speaker permutation problem through sorted\nobjectives, effectively bridging speaker-label timestamps and speaker tokens.\nIn our experiments, we show that the proposed multispeaker ASR architecture,\nenhanced with speaker supervision, improves performance via adapter techniques.\nCode and trained models will be made publicly available via the NVIDIA NeMo\nframework","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06656","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
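
The abstract contrasts Sort Loss with permutation-invariant loss: rather than searching over all speaker permutations for the one with the lowest error, the targets are arranged in a canonical order so the model can learn fixed output channels. Below is a minimal sketch of that idea as described in the abstract, assuming the canonical order is each speaker's arrival time (first active frame). Function names, tensor shapes, and the use of frame-level binary cross-entropy are illustrative assumptions, not the released NeMo implementation.

```python
# Sketch of the Sort Loss idea: sort target speaker-activity channels by
# arrival time, then apply an ordinary BCE loss -- no permutation search.
# Names and shapes here are hypothetical, for illustration only.
import torch
import torch.nn.functional as F


def sort_targets_by_arrival(targets: torch.Tensor) -> torch.Tensor:
    """Sort the speaker channels of a (T, S) binary activity matrix
    so that the earliest-arriving speaker occupies channel 0."""
    T, S = targets.shape
    first_active = torch.full((S,), float(T))  # inactive speakers sort last
    for s in range(S):
        active = torch.nonzero(targets[:, s], as_tuple=False)
        if active.numel() > 0:
            first_active[s] = active[0].item()
    order = torch.argsort(first_active)
    return targets[:, order]


def sort_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """BCE against arrival-time-sorted targets, in contrast to PIL,
    which would minimize over all S! permutations of the channels."""
    sorted_targets = sort_targets_by_arrival(targets)
    return F.binary_cross_entropy_with_logits(logits, sorted_targets)


if __name__ == "__main__":
    T, S = 100, 4                     # 100 frames, up to 4 speakers
    logits = torch.randn(T, S)        # diarization model outputs (pre-sigmoid)
    targets = torch.zeros(T, S)
    targets[40:90, 0] = 1.0           # this speaker arrives later...
    targets[5:60, 1] = 1.0            # ...than this one, so channels get swapped
    print(sort_loss(logits, targets).item())
```

Because the target order is deterministic, the model's output channels acquire a consistent meaning (first-arriving speaker, second-arriving speaker, and so on), which is what lets the downstream multispeaker ASR system map speaker-label timestamps to speaker tokens without resolving a permutation at inference time.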