Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
arXiv - EE - Audio and Speech Processing, published 2024-09-10. DOI: arxiv-2409.06656 (https://doi.org/arxiv-2409.06656)
Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens
We propose Sortformer, a novel neural model for speaker diarization, trained
with unconventional objectives compared to existing end-to-end diarization
models. The permutation problem in speaker diarization has long been regarded
as a critical challenge. Most prior end-to-end diarization systems employ
permutation invariant loss (PIL), which optimizes for the permutation that
yields the lowest error. In contrast, we introduce Sort Loss, which enables a
diarization model to autonomously resolve permutation, with or without PIL. We
demonstrate that combining Sort Loss and PIL achieves performance competitive
with state-of-the-art end-to-end diarization models trained exclusively with
PIL. Crucially, we present a streamlined multispeaker ASR architecture that
leverages Sortformer as a speaker supervision model, embedding speaker label
estimation within the ASR encoder state using a sinusoidal kernel function.
This approach resolves the speaker permutation problem through sorted
objectives, effectively bridging speaker-label timestamps and speaker tokens.
In our experiments, we show that the proposed multispeaker ASR architecture,
enhanced with speaker supervision, improves performance via adapter techniques.
Code and trained models will be made publicly available via the NVIDIA NeMo
framework.
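
To make the contrast between PIL and Sort Loss concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes frame-level binary cross-entropy as the base error and assumes that the "sorted objectives" order speakers by the frame in which they first become active; the abstract does not spell out either detail. The function names and the weight `alpha` in `hybrid_loss` are hypothetical, chosen only for illustration.

```python
import itertools
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a frames x speakers grid."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def permute_columns(rows, perm):
    """Re-order the speaker columns of a frames x speakers grid."""
    return [[row[k] for k in perm] for row in rows]


def pil_loss(pred, target):
    """PIL: the lowest BCE over every speaker ordering of the target."""
    n_spk = len(target[0])
    return min(
        bce(pred, permute_columns(target, perm))
        for perm in itertools.permutations(range(n_spk))
    )


def arrival_order(target):
    """Speaker indices sorted by first active frame (assumed sort key)."""
    n_frames, n_spk = len(target), len(target[0])

    def onset(k):
        # Speakers that never speak get onset n_frames and sort last.
        return next((t for t in range(n_frames) if target[t][k] > 0.5), n_frames)

    return sorted(range(n_spk), key=onset)


def sort_loss(pred, target):
    """BCE against the target with columns re-ordered by arrival time."""
    return bce(pred, permute_columns(target, arrival_order(target)))


def hybrid_loss(pred, target, alpha=0.5):
    """Weighted sum of Sort Loss and PIL; alpha is a hypothetical weight."""
    return alpha * sort_loss(pred, target) + (1 - alpha) * pil_loss(pred, target)


# Toy example: 4 frames, 2 speakers. In the reference, column 1 speaks first.
target = [[0, 1], [0, 1], [1, 1], [1, 0]]
# A model that places the first-arriving speaker in output column 0.
pred = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.9], [0.1, 0.9]]

print(arrival_order(target))
print(sort_loss(pred, target), pil_loss(pred, target))
```

In this toy case the model already emits speakers in arrival order, so the single sorted target used by `sort_loss` coincides with the best permutation found by `pil_loss`. The practical difference is that `pil_loss` searches all speaker orderings, whose count grows factorially with the number of speakers, while `sort_loss` fixes one ordering in advance.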