Few-Shot Keyword Spotting from Mixed Speech

arXiv - CS - Sound | Pub Date: 2024-07-05 | DOI: arxiv-2407.06078

Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla

Citations: 0

Abstract

Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting, that is, simultaneously detecting multiple keywords blended in a single utterance, a capability that is crucial in real-world applications. Previous research proposed a Mix-Training (MT) approach to solve this problem, but it has never been tested in the few-shot scenario. In this paper, we investigate whether MT and other relevant methods can address the two practical challenges together: few-shot learning and mixed speech. Experiments on the LibriSpeech and Google Speech Commands corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining SSL-based large-scale pre-training (HuBERT) with MT fine-tuning yields very strong results in all test conditions.
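To make the idea of "multiple keywords blended in an utterance" concrete, the following is a minimal sketch of how a mixed training example with a multi-hot target might be constructed. It is an illustration only, not the authors' Mix-Training recipe: the function name `mix_utterances`, the `gain_b` parameter, the zero-padding, and the peak rescaling are all assumptions made for this example, and the abstract does not specify how the mixtures are actually built.

```python
from typing import List, Sequence, Tuple


def mix_utterances(
    wave_a: Sequence[float],
    wave_b: Sequence[float],
    label_a: int,
    label_b: int,
    num_keywords: int,
    gain_b: float = 1.0,  # hypothetical relative gain for the second utterance
) -> Tuple[List[float], List[float]]:
    """Overlay two keyword waveforms and return (mixture, multi-hot target)."""
    length = max(len(wave_a), len(wave_b))
    # Zero-pad the shorter waveform so both have the same number of samples.
    padded_a = list(wave_a) + [0.0] * (length - len(wave_a))
    padded_b = list(wave_b) + [0.0] * (length - len(wave_b))
    # Sample-wise sum: both keywords are now present in one signal.
    mixed = [a + gain_b * b for a, b in zip(padded_a, padded_b)]
    # Rescale only if the sum would fall outside [-1, 1] (assumed convention).
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    # Multi-hot target: every keyword present in the mixture is marked 1.0,
    # so a classifier can be trained to detect all of them simultaneously.
    target = [0.0] * num_keywords
    target[label_a] = 1.0
    target[label_b] = 1.0
    return mixed, target
```

For example, calling `mix_utterances(wave_yes, wave_stop, label_a=0, label_b=2, num_keywords=4)` with two hypothetical waveforms would return their overlay together with the target `[1.0, 0.0, 1.0, 0.0]`. A multi-hot target of this kind is typically paired with a per-keyword sigmoid output rather than a single softmax, since more than one keyword can be active at once.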

