Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
arXiv - CS - Sound, arXiv:2407.10048, published 2024-07-14
Abstract
Trained on 680,000 hours of speech data, Whisper is a multitasking, multilingual speech foundation model that demonstrates superior performance in automatic speech recognition, translation, and language identification. However, its applicability to speaker verification (SV) remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose Whisper-SV, a lightweight adaptor framework that boosts SV with Whisper. Given that Whisper is not specifically optimized for SV, we introduce a representation selection module that quantifies the speaker-specific characteristics contained in each layer of Whisper and selects the top-k layers with the most discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k layers, we design a multi-layer aggregation module that integrates the multi-layer representations into a single, compact representation for SV. In this module, we employ convolutional layers with shortcut connections across layers to refine the speaker characteristics derived from Whisper's multi-layer representations, and an attention aggregation layer to reduce non-speaker interference and amplify speaker-specific cues. Finally, a simple classification module performs speaker classification. Experiments on the VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
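
The abstract describes three components: layer selection by speaker discriminability, multi-layer aggregation with convolutional shortcuts and attention, and a classification head. The PyTorch sketch below illustrates that pipeline under stated assumptions; it is not the authors' implementation. The names `select_top_k_layers` and `MultiLayerAggregator`, the scoring inputs, and all dimensions are hypothetical, and the layer scores are assumed to come from some external probe of speaker discriminability.

```python
# Minimal sketch of a Whisper-SV-style adaptor, assuming Whisper encoder
# hidden states of shape (batch, frames, dim) have already been extracted
# for the selected layers. Names and dimensions are illustrative.
import torch
import torch.nn as nn


def select_top_k_layers(layer_scores, k=3):
    """Pick the k Whisper layers whose representations score highest on a
    speaker-discriminability metric (e.g. accuracy of a probe classifier).

    layer_scores: one scalar per encoder layer; higher = more discriminative.
    """
    scores = torch.as_tensor(layer_scores, dtype=torch.float32)
    return torch.topk(scores, k).indices.tolist()


class MultiLayerAggregator(nn.Module):
    """Fuse k layer-wise representations into one compact speaker embedding:
    per-layer 1-D convolutions with shortcut connections, attentive pooling
    over time, then a linear speaker classifier."""

    def __init__(self, dim, k, num_speakers, emb_dim=192):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(k)
        )
        # Frame-level attention weights, normalized over the time axis.
        self.attn = nn.Sequential(
            nn.Conv1d(dim, 128, 1), nn.ReLU(),
            nn.Conv1d(128, dim, 1), nn.Softmax(dim=-1),
        )
        self.embed = nn.Linear(dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, layer_feats):
        # layer_feats: list of k tensors, each (batch, frames, dim).
        fused = 0
        for conv, feat in zip(self.convs, layer_feats):
            x = feat.transpose(1, 2)        # (batch, dim, frames)
            fused = fused + x + conv(x)     # shortcut: raw + refined features
        weights = self.attn(fused)          # attention over frames
        pooled = (fused * weights).sum(-1)  # (batch, dim)
        emb = self.embed(pooled)            # compact speaker embedding
        return emb, self.classifier(emb)
```

In this reading, the shortcut connections let each selected layer contribute its raw representation alongside a convolutionally refined version, while the attention pooling downweights frames that carry little speaker information before the classifier is applied.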