Whisper-SV: Adapting Whisper for low-data-resource speaker verification

IF 2.4 · CAS Zone 3 (Computer Science) · JCR Q2 (Acoustics)

Speech Communication · Pub Date: 2024-07-14 · DOI: 10.1016/j.specom.2024.103103

Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie

Speech Communication, Volume 163, Article 103103. Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S016763932400075X

Citations: 0

Abstract

Trained on 680,000 hours of speech data, Whisper is a multitask, multilingual speech foundation model that performs strongly in automatic speech recognition, translation, and language identification. However, its applicability to speaker verification (SV) remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose Whisper-SV, a lightweight adaptor framework that boosts SV with Whisper. Because Whisper is not specifically optimized for SV, we introduce a representation selection module that quantifies the speaker-specific characteristics contained in each Whisper layer and selects the top-k layers with the most discriminative speaker features. To aggregate pivotal speaker-related features while suppressing non-speaker redundancy across the selected top-k layers, we design a multi-layer aggregation module that integrates the multi-layer representations into a single, compact representation for SV. Within this module, convolutional layers with shortcut connections among different layers refine the speaker characteristics derived from Whisper's multi-layer representations, and an attention aggregation layer reduces non-speaker interference and amplifies speaker-specific cues. Finally, a simple classification module performs speaker classification. Experiments on the VoxCeleb1, FFSVC, and IMSV datasets show that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, demonstrating superior performance in low-data-resource SV scenarios.
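The EER figures quoted above are equal error rates: the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). As a rough illustration only, and not the authors' evaluation code, the sketch below estimates EER by sweeping a threshold over verification scores. The function name `equal_error_rate` and the toy scores are assumptions made for this example; real evaluations usually interpolate between thresholds rather than picking the nearest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    Hypothetical helper for illustration. Returns (eer, threshold), where the
    threshold is the first one minimising |FAR - FRR| and eer is the mean of
    FAR and FRR at that threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine (target) trial scores below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor (non-target) trial scores at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy similarity scores (made up): higher means "more likely the same speaker".
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {thr}")  # prints "EER = 25.00% at threshold 0.6"
```

A lower EER means fewer errors of both kinds at the balanced operating point. minDCF, the second figure in each reported pair, instead weights the two error types by assumed costs and a target prior, which this sketch does not cover.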




Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Articles published

Review time

19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.