Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg
{"title":"Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens","authors":"Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg","doi":"arxiv-2409.06656","DOIUrl":null,"url":null,"abstract":"We propose Sortformer, a novel neural model for speaker diarization, trained\nwith unconventional objectives compared to existing end-to-end diarization\nmodels. The permutation problem in speaker diarization has long been regarded\nas a critical challenge. Most prior end-to-end diarization systems employ\npermutation invariant loss (PIL), which optimizes for the permutation that\nyields the lowest error. In contrast, we introduce Sort Loss, which enables a\ndiarization model to autonomously resolve permutation, with or without PIL. We\ndemonstrate that combining Sort Loss and PIL achieves performance competitive\nwith state-of-the-art end-to-end diarization models trained exclusively with\nPIL. Crucially, we present a streamlined multispeaker ASR architecture that\nleverages Sortformer as a speaker supervision model, embedding speaker label\nestimation within the ASR encoder state using a sinusoidal kernel function.\nThis approach resolves the speaker permutation problem through sorted\nobjectives, effectively bridging speaker-label timestamps and speaker tokens.\nIn our experiments, we show that the proposed multispeaker ASR architecture,\nenhanced with speaker supervision, improves performance via adapter techniques.\nCode and trained models will be made publicly available via the NVIDIA NeMo\nframework","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06656","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
We propose Sortformer, a novel neural model for speaker diarization, trained
with unconventional objectives compared to existing end-to-end diarization
models. The permutation problem in speaker diarization has long been regarded
as a critical challenge. Most prior end-to-end diarization systems employ
permutation invariant loss (PIL), which optimizes for the permutation that
yields the lowest error. In contrast, we introduce Sort Loss, which enables a
diarization model to autonomously resolve permutation, with or without PIL. We
demonstrate that combining Sort Loss and PIL achieves performance competitive
with state-of-the-art end-to-end diarization models trained exclusively with
PIL. Crucially, we present a streamlined multispeaker ASR architecture that
leverages Sortformer as a speaker supervision model, embedding speaker label
estimation within the ASR encoder state using a sinusoidal kernel function.
This approach resolves the speaker permutation problem through sorted
objectives, effectively bridging speaker-label timestamps and speaker tokens.
In our experiments, we show that the proposed multispeaker ASR architecture,
enhanced with speaker supervision, improves performance via adapter techniques.
Code and trained models will be made publicly available via the NVIDIA NeMo
framework