{"title":"Improves Neural Acoustic Word Embeddings Query by Example Spoken Term Detection with Wav2vec Pretraining and Circle Loss","authors":"Zhaoqi Li, Long Wu, Ta Li, Yonghong Yan","doi":"10.1109/ISCSLP49672.2021.9362065","DOIUrl":null,"url":null,"abstract":"Query by example spoken term detection (QbE-STD) is a popular keyword detection method in the absence of speech resources. It can build a keyword query system with decent performance when there are few labeled speeches and a lack of pronunciation dictionaries. In recent years, neural acoustic word embeddings (NAWEs) has become a commonly used QbE-STD method. To make the embedded features extracted by the neural network contain more accurate context information, we use wav2vec pre-training to improve the performance of the network. Compared with the Mel-frequency cepstral coefficients(MFCC) system, the average precision (AP) is relatively improved by 11.1%. We also find that the AP of the wav2vec and MFCC splicing system is better, demonstrating that wav2vec cannot contain all spectrum information. To accelerate the convergence speed of the splicing system, we use circle loss to replace the triplet loss, making the convergence about 40% epochs earlier on average. The circle loss also relatively increases AP by more than 4.9%. The AP of our best-performing system is 7.7% better than the wav2vec baseline system and 19.7% better than the MFCC baseline system.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISCSLP49672.2021.9362065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Query by example spoken term detection (QbE-STD) is a popular keyword detection method when speech resources are scarce. It can build a keyword query system with decent performance when only a small amount of labeled speech is available and pronunciation dictionaries are lacking. In recent years, neural acoustic word embeddings (NAWEs) have become a commonly used approach to QbE-STD. To make the embedded features extracted by the neural network carry more accurate context information, we use wav2vec pre-training to improve the performance of the network. Compared with the Mel-frequency cepstral coefficient (MFCC) system, this yields a relative improvement of 11.1% in average precision (AP). We also find that a system splicing wav2vec and MFCC features achieves an even higher AP, suggesting that wav2vec features do not capture all of the spectral information. To accelerate convergence of the spliced system, we replace the triplet loss with the circle loss, which makes training converge in about 40% fewer epochs on average. The circle loss also yields a relative AP gain of more than 4.9%. The AP of our best-performing system is 7.7% higher than that of the wav2vec baseline system and 19.7% higher than that of the MFCC baseline system.
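For the training objective, the paper swaps the triplet loss for the circle loss of Sun et al. (CVPR 2020). As a rough illustration of that objective, here is a minimal PyTorch sketch of the pairwise circle loss over cosine similarities between acoustic word embeddings; the defaults for the margin m and scale gamma follow the original circle-loss paper, not values reported in this work.

```python
import torch
import torch.nn.functional as F

def circle_loss(sp: torch.Tensor, sn: torch.Tensor,
                m: float = 0.25, gamma: float = 256.0) -> torch.Tensor:
    """Pairwise circle loss (Sun et al., 2020).

    sp: cosine similarities of positive pairs, shape (P,)
    sn: cosine similarities of negative pairs, shape (N,)
    """
    # Optimum points and margins from the circle-loss paper:
    # O_p = 1 + m, O_n = -m, Delta_p = 1 - m, Delta_n = m.
    ap = torch.relu(1.0 + m - sp.detach())  # adaptive weight for positives
    an = torch.relu(sn.detach() + m)        # adaptive weight for negatives
    logit_p = -gamma * ap * (sp - (1.0 - m))
    logit_n = gamma * an * (sn - m)
    # loss = log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    return F.softplus(torch.logsumexp(logit_n, dim=0) +
                      torch.logsumexp(logit_p, dim=0))
```

Unlike the triplet loss, each similarity gets its own adaptive weight (ap, an), so pairs far from their optimum are penalized more strongly; this re-weighting is what the circle-loss paper credits for faster and more stable convergence, consistent with the roughly 40% fewer training epochs reported in the abstract.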