Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla
arXiv - CS - Sound, 2024-07-05
https://doi.org/arxiv-2407.06078
Abstract
Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited
training samples. A commonly used approach is the pre-training and fine-tuning
framework. While effective in clean conditions, this approach struggles with
mixed keyword spotting -- simultaneously detecting multiple keywords blended in
an utterance, which is crucial in real-world applications. Previous research
has proposed a Mix-Training (MT) approach to solve this problem; however, it has
never been tested in the few-shot scenario. In this paper, we investigate the
possibility of using MT and other relevant methods to solve the two practical
challenges together: few-shot and mixed speech. Experiments conducted on the
LibriSpeech and Google Speech Commands corpora demonstrate that MT is highly
effective on this task when employed in either the pre-training phase or the
fine-tuning phase. Moreover, combining large-scale self-supervised (SSL)
pre-training (HuBERT) with MT fine-tuning yields very strong results in all
test conditions.
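
The abstract does not spell out how mixed training samples are built. The sketch below shows one plausible construction, assuming a mixed sample is made by overlaying two keyword utterances and marking both keywords as present in a multi-hot target. The function name `mix_utterances`, the `gain_b` parameter, and the peak rescaling step are illustrative assumptions, not details taken from the paper.

```python
def mix_utterances(wave_a, wave_b, label_a, label_b, num_keywords, gain_b=1.0):
    """Overlay two waveforms sample by sample and build a multi-hot target.

    Hypothetical helper for illustration; the paper's actual mixing
    recipe (gains, alignment, normalization) may differ.
    """
    length = max(len(wave_a), len(wave_b))
    # Zero-pad the shorter waveform so both cover the same time span.
    padded_a = list(wave_a) + [0.0] * (length - len(wave_a))
    padded_b = list(wave_b) + [0.0] * (length - len(wave_b))
    mixed = [a + gain_b * b for a, b in zip(padded_a, padded_b)]
    # Rescale only if the sum would fall outside [-1, 1].
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    # Multi-hot target: every keyword present in the mixture is set to 1.
    target = [0.0] * num_keywords
    target[label_a] = 1.0
    target[label_b] = 1.0
    return mixed, target


# Toy waveforms standing in for two keyword utterances.
wave_a = [0.5, -0.5, 0.25]
wave_b = [0.5, 0.5]
mixed, target = mix_utterances(wave_a, wave_b, label_a=0, label_b=2, num_keywords=4)
```

With a multi-hot target, a model would typically be trained with a per-keyword binary loss rather than a single softmax, so that more than one keyword can be detected in the same utterance.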