Interspeech最新文献_第5页

Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification 用于构音障碍语音自动分类的对抗性自由说话人身份不变表示学习

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-402

Parvaneh Janbakhshi, I. Kodrasi

引用次数: 0

Knowledge distillation for In-memory keyword spotting model 内存关键字识别模型的知识精馏

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-633

Zeyang Song, Qi Liu, Qu Yang, Haizhou Li

引用次数: 1

Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS 跨语言韵律迁移TTS的对抗性和顺序性训练

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-865

Min-Kyung Kim, Joon‐Hyuk Chang

引用次数: 1

Phonetic Analysis of Self-supervised Representations of English Speech 英语语音自监督表征的语音分析

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10884

Dan Wells, Hao Tang, Korin Richmond

引用次数: 8

W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition W2V2 Light：用于自动语音识别的Wav2vec 2.0的轻量级版本

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10339

Dong-Hyun Kim, Jaehwan Lee, J. Mo, Joon‐Hyuk Chang

引用次数: 4

Autoencoder-Based Tongue Shape Estimation During Continuous Speech 基于自编码器的连续语音舌形估计

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10272

Vinicius Ribeiro, Y. Laprie

引用次数: 2

Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy 无监督声学-发音倒置与可变声道解剖

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-477

Yifan Sun, Qinlong Huang, Xihong Wu

{"title":"Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy","authors":"Yifan Sun, Qinlong Huang, Xihong Wu","doi":"10.21437/interspeech.2022-477","DOIUrl":"https://doi.org/10.21437/interspeech.2022-477","url":null,"abstract":"Acoustic and articulatory variability across speakers has al-ways limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider the direct matching in the articulatory space. Unsupervised AAI methods have the potential of better generalization ability but typically use a ﬁxed mor-phological setting of a physical articulatory synthesizer even for different speakers, which may cause nonnegligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where estimated vocal tract anatomy is used to set the conﬁguration of a physical articulatory synthesizer, which in turn is driven by estimated articulation movements to imitate a given speech. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory beneﬁts. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speakers, indicating successful decoupling of speaker characteristics and linguistic content.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4656-4660"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44404742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR 结合简单但新颖的数据增强方法改进保形ASR

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10835

Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux

引用次数: 0

Streaming model for Acoustic to Articulatory Inversion with transformer networks 基于变压器网络的声-铰接反演流模型

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10159

Sathvik Udupa, Aravind Illa, P. Ghosh

引用次数: 2

Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi Gram Vaani ASR挑战印地语地区变体的自发电话语音记录

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11371

Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad

{"title":"Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi","authors":"Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad","doi":"10.21437/interspeech.2022-11371","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11371","url":null,"abstract":"This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises the spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani . The regional variations of Hindi together with spontaneity of speech, natural background and transcriptions with variable accuracy due to crowdsourcing make it a unique corpus for ASR on spontaneous telephonic speech. Around, 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as a part of the challenge. The efficacy of both training and test sets are validated on different ASR systems in both traditional time-delay neural network-hidden Markov model (TDNN-HMM) frameworks and fully-neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on eval set for a TDNN model trained on 100 hours of labelled data are 29 . 7% and 15 . 1% , respectively. While, in E2E setup, WER and CER on eval set for a conformer model trained on 100 hours of data are 32 . 9% and 19 . 0% , respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3548-3552"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43519978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4