Latest Interspeech Publications

Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-402
Parvaneh Janbakhshi, I. Kodrasi
Abstract: Speech representations which are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique to learn speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker identity-invariant representations exploiting a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, it is shown that such representations result in a similar dysarthric speech classification performance as the representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.
Pages: 2138-2142
Citations: 0
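The feature-separation framework above hinges on driving the mutual information between the learned representation and speaker identity toward zero; the paper does this with a neural estimator inside its training loop. As a hedged, self-contained illustration of the quantity being minimized (all variable names, sizes, and distributions below are invented for the example, not taken from the paper), a histogram-based estimate shows why a speaker-entangled feature scores high MI while an invariant one scores near zero:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate I(X; Y) in nats from samples of two (quantized) variables."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)
    mask = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
speaker = rng.integers(0, 4, size=5000)            # speaker identity labels
entangled = speaker + rng.normal(0, 0.3, 5000)     # feature leaking speaker info
invariant = rng.normal(0, 1.0, 5000)               # speaker-independent feature

# An entangled feature has high MI with speaker identity; an invariant
# one is near zero -- the latter is what the minimization drives toward.
print(mutual_information(speaker, entangled))
print(mutual_information(speaker, invariant))
```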
Knowledge distillation for In-memory keyword spotting model
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-633
Zeyang Song, Qi Liu, Qu Yang, Haizhou Li
Abstract: We study a light-weight implementation of keyword spotting (KWS) for voice command and control, that can be implemented on an in-memory computing (IMC) unit with the same accuracy at a lower computational cost than the state-of-the-art methods. KWS is expected to be always-on for mobile devices with limited resources. IMC represents one of the solutions. However, it only supports multiplication-accumulation and Boolean operations. We note that common feature extraction methods, such as MFCC and SincConv, are not supported by IMC as they depend on expensive logarithm computing. On the other hand, some neural network solutions to KWS involve a large number of parameters that are not feasible for mobile devices. In this work, we propose a knowledge distillation technique to replace the complex speech frontend like MFCC or SincConv with a light-weight encoder without performance loss. Experiments show that the proposed model outperforms the KWS model with MFCC and SincConv front-end in terms of accuracy and computational cost.
Pages: 4128-4132
Citations: 1
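The frontend-replacement idea rests on standard knowledge distillation: a lightweight student is trained to match the softened outputs of a teacher built on the expensive frontend. A minimal NumPy sketch of the classic temperature-scaled distillation loss (Hinton-style; the logits and temperature here are illustrative, not values from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in classic knowledge distillation."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1)
    return float(temperature ** 2 * kl.mean())

teacher = np.array([[5.0, 1.0, 0.5, 0.2]])   # e.g. MFCC-frontend teacher logits
matched = np.array([[4.8, 1.1, 0.4, 0.3]])   # student mimicking the teacher
off     = np.array([[0.2, 5.0, 1.0, 0.5]])   # poorly matched student

# The loss is small when the student reproduces the teacher's soft targets.
print(distillation_loss(matched, teacher))
print(distillation_loss(off, teacher))
```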
Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-865
Min-Kyung Kim, Joon‐Hyuk Chang
Abstract: This study presents a method for improving the performance of the text-to-speech (TTS) model by using three global speech-style representations: language, speaker, and prosody. Synthesizing different languages and prosody in the speaker's voice, regardless of the speaker's own language and prosody, is possible. To construct the embedding of each representation conditioned in the TTS model such that it is independent of the other representations, we propose an adversarial training method for the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning to train complex and small amounts of data without forgetting previously learned information. The experimental results show that the proposed method can generate good-quality speech and yield high similarity for speakers and prosody, even for combinations of representations that a given speaker's data does not contain.
Pages: 4556-4560
Citations: 1
Phonetic Analysis of Self-supervised Representations of English Speech
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10884
Dan Wells, Hao Tang, Korin Richmond
Abstract: We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when considering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.
Pages: 3583-3587
Citations: 8
W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10339
Dong-Hyun Kim, Jaehwan Lee, J. Mo, Joon‐Hyuk Chang
Abstract: Wav2vec 2.0 (W2V2) has shown remarkable speech recognition performance by pre-training only with unlabeled data and fine-tuning with a small amount of labeled data. However, the practical application of W2V2 is hindered by hardware memory limitations, as it contains 317 million parameters. To address this issue, we propose W2V2-Light, a lightweight version of W2V2. We introduce two simple sharing methods to reduce the memory consumption as well as the computational costs of W2V2. Compared to W2V2, our model has 91% fewer parameters and a speedup of 1.31 times with minor degradation in downstream task performance. Moreover, by quantifying the stability of representations, we provide an empirical insight into why our model is capable of maintaining competitive performance despite the significant reduction in memory.
Pages: 3038-3042
Citations: 4
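The two sharing methods are the paper's own; as a generic illustration of why cross-layer weight sharing shrinks a transformer stack, the back-of-the-envelope accounting below reuses k unique layers across a 24-layer encoder (layer sizes are illustrative defaults, not the actual W2V2-Light configuration):

```python
# Rough parameter accounting for a transformer encoder stack, showing how
# cross-layer weight sharing reduces model size. All sizes are illustrative.

def encoder_layer_params(d_model=1024, d_ff=4096):
    attn = 4 * (d_model * d_model + d_model)                  # Q, K, V, out projections
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model    # two FFN matrices + biases
    norms = 2 * 2 * d_model                                   # two LayerNorms (scale, bias)
    return attn + ffn + norms

def stack_params(num_layers=24, shared_groups=None):
    """With shared_groups=k, only k unique layers exist and are reused
    across the stack; None means every layer has its own weights."""
    unique = num_layers if shared_groups is None else shared_groups
    return unique * encoder_layer_params()

full = stack_params(24)                       # independent weights per layer
shared = stack_params(24, shared_groups=6)    # 6 unique layers, each reused 4x
print(f"full: {full/1e6:.1f}M params, shared: {shared/1e6:.1f}M params")
print(f"reduction: {1 - shared/full:.0%}")
```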
Autoencoder-Based Tongue Shape Estimation During Continuous Speech
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10272
Vinicius Ribeiro, Y. Laprie
Abstract: Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods lack adequacy to many physical constraints related to speech production. This study proposes an alternative approach to the task to solve specific issues faced in the previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data's encoding and serves as an auxiliary network for the principal one, which maps phonemes to the shapes. Instead of predicting the exact points in the target curve, the neural network learns how to predict the curve's main components, i.e., the autoencoder's representation. We show how this approach allows imposing critical articulators' constraints, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.
Pages: 86-90
Citations: 2
Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-477
Yifan Sun, Qinlong Huang, Xihong Wu
Abstract: Acoustic and articulatory variability across speakers has always limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider the direct matching in the articulatory space. Unsupervised AAI methods have the potential of better generalization ability but typically use a fixed morphological setting of a physical articulatory synthesizer even for different speakers, which may cause nonnegligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where estimated vocal tract anatomy is used to set the configuration of a physical articulatory synthesizer, which in turn is driven by estimated articulation movements to imitate a given speech. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory benefits. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speakers, indicating successful decoupling of speaker characteristics and linguistic content.
Pages: 4656-4660
Citations: 2
Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10835
Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux
Pages: 4890-4894
Citations: 0
Streaming model for Acoustic to Articulatory Inversion with transformer networks
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10159
Sathvik Udupa, Aravind Illa, P. Ghosh
Abstract: Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-the-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chunks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experimental results revealed that using the random start mask attention with the context from previous predictions of the transformer decoder performs better than the baseline results.
Pages: 625-629
Citations: 2
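A generic way to realize the chunk-wise attention the abstract describes is a boolean mask that restricts each query frame to its own chunk plus a limited amount of left context, so inference can proceed chunk by chunk without the full utterance. The sketch below is a common streaming-mask construction, not necessarily the paper's exact "random start" mask (the sequence and chunk sizes are invented for the example):

```python
import numpy as np

def chunked_attention_mask(seq_len, chunk_size, left_context_chunks=1):
    """Boolean mask where True means the key position may be attended to.
    Each frame sees its own chunk (full attention within the chunk) plus
    a limited number of past chunks, and never any future chunk."""
    idx = np.arange(seq_len) // chunk_size   # chunk index of each frame
    q = idx[:, None]                         # query's chunk
    k = idx[None, :]                         # key's chunk
    return (k <= q) & (k >= q - left_context_chunks)

mask = chunked_attention_mask(seq_len=8, chunk_size=2, left_context_chunks=1)
print(mask.astype(int))   # rows: query frames, cols: attendable key frames
```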
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
Interspeech Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-11371
Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad
Abstract: This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises the spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi together with spontaneity of speech, natural background and transcriptions with variable accuracy due to crowdsourcing make it a unique corpus for ASR on spontaneous telephonic speech. Around 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as a part of the challenge. The efficacy of both training and test sets are validated on different ASR systems in both traditional time-delay neural network-hidden Markov model (TDNN-HMM) frameworks and fully-neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively. While, in E2E setup, WER and CER on eval set for a conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
Pages: 3548-3552
Citations: 4
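The WER and CER figures quoted above come from the standard edit-distance metric: the minimum number of substitutions, insertions, and deletions to turn the hypothesis into the reference, divided by the reference length (over words for WER, over characters for CER). A minimal sketch in plain Python, not the challenge's actual scoring script; the example sentences are invented:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    dp = list(range(len(hyp) + 1))          # dp[j] = distance for empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i              # prev holds dp[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(ref_words, hyp_words):
    """Word error rate: edits needed, normalized by reference length."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)

ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()     # one substitution, one deletion
print(f"WER: {wer(ref, hyp):.1%}")     # CER works the same way on characters
```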