Latest Interspeech Publications

Syllable sequence of /a/+/ta/ can be heard as /atta/ in Japanese with visual or tactile cues
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10099
T. Arai, Miho Yamada, Megumi Okusawa
Pages: 3083-3087
Abstract: In our previous work, we reported that the word /atta/ with a geminate consonant differs from the syllable sequence /a/+pause+/ta/ in Japanese; specifically, there are formant transitions at the end of the first syllable in /atta/ but not in /a/+pause+/ta/. We also showed that native Japanese speakers perceived /atta/ when a facial video of /atta/ was played synchronously with an audio signal of /a/+pause+/ta/. In that study, we used two video clips of the two utterances in which the speaker was asked to control only the timing of the articulatory closing, so there was no guarantee that the videos were exactly the same except for that timing. In the current study, we therefore use a physical model of the human vocal tract with a miniature robot hand unit to produce the articulatory movements that serve as visual cues. We also provide tactile cues to the listener's finger to test, within the same framework, whether cues from another modality affect this perception. Our findings show that when either visual or tactile cues were presented with the audio stimulus, listeners more frequently responded that they heard /atta/ than with audio-only presentations.
Citations: 0
Cross-Lingual Transfer Learning Approach to Phoneme Error Detection via Latent Phonetic Representation
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10228
Jovan M. Dalhouse, K. Itou
Pages: 3133-3137
Abstract: Extensive research has been conducted on CALL systems for pronunciation error detection to automate language improvement through self-evaluation. However, many previous approaches have relied on HMM or neural-network hybrid models which, although effective, often require phonetically labelled L2 speech data that is expensive and often scarce. This paper discusses a "zero-shot" transfer learning approach to detecting phonetic errors in L2 English speech by native Japanese speakers, using solely unaligned, phonetically labelled native-language speech. The proposed method introduces a simple base architecture that utilizes the XLSR-Wav2Vec2.0 model pre-trained on unlabelled multilingual speech. Phoneme mapping for each language is determined based on differences in the articulation of similar phonemes. The method achieved a phonetic error rate of 0.214 on erroneous L2 speech after fine-tuning on 70 hours of speech with low-resource automated phonetic labelling, and additionally proved to model phonemes of the L2 speaker's native language effectively without the need for L2 speech fine-tuning.
Citations: 1
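The entry above builds on the XLSR-Wav2Vec2.0 model for phoneme-level recognition. As a rough illustration of that building block (not the authors' pipeline), the sketch below decodes phonemes from an audio file with a publicly available multilingual phoneme-CTC checkpoint via Hugging Face Transformers and scores a hypothesis against a reference with a simple phoneme error rate; the checkpoint name, the file path, and the error-rate helper are illustrative assumptions.

```python
# Hedged sketch: phoneme decoding with a pre-trained XLSR-Wav2Vec2.0 CTC model.
# The checkpoint, audio path, and PER helper are illustrative; the paper's own
# phoneme mapping, fine-tuning data, and scoring are not reproduced here.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # assumed multilingual phoneme-CTC checkpoint
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def recognize_phonemes(wav_path: str) -> list[str]:
    """Return the CTC-decoded phoneme sequence for a 16 kHz mono recording."""
    audio, sr = sf.read(wav_path)
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].split()

def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance between phoneme sequences, normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```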
Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10240
M. Markitantov, E. Ryumina, D. Ryumin, A. Karpov
Pages: 1756-1760
Abstract: In this paper, we present a new multimodal corpus called Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS), designed for analyzing the voice and facial characteristics of persons wearing various masks and for developing automatic systems for bimodal speaker verification and identification. In particular, we tackle the multimodal mask type recognition task (6 classes). Audio, visual, and multimodal systems were developed, showing UARs of 54.83%, 72.02%, and 82.01%, respectively, on the test set. These results serve as the baseline for the BRAVE-MASKS corpus against which follow-up approaches can be compared.
Citations: 3
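The baselines above are reported in UAR (unweighted average recall), i.e. recall averaged over the 6 mask-type classes without weighting by class frequency. A minimal sketch of the metric, using placeholder labels rather than corpus data:

```python
# Hedged sketch: UAR is the macro-averaged recall over classes.
# The label arrays are placeholders, not BRAVE-MASKS data.
from sklearn.metrics import recall_score

y_true = [0, 1, 2, 3, 4, 5, 0, 1]   # ground-truth mask-type labels (illustrative)
y_pred = [0, 1, 2, 3, 5, 5, 0, 2]   # system predictions (illustrative)

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.2%}")
```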
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-443
Tuan Vu Ho, Q. Nguyen, M. Akagi, M. Unoki
Pages: 176-180
Abstract: Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded nature of the cIRM makes it difficult to train such a network effectively. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter. A noise-robust vector-quantized variational autoencoder estimates the magnitude of the Wiener filter using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated by a convolutional recurrent network trained with a scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to allow direct comparison with other speech-enhancement methods, and it achieved a Perceptual Evaluation of Speech Quality score of 2.85 and a Short-Time Objective Intelligibility score of 0.94, better than the state-of-the-art method based on cIRM estimation from the 2020 Deep Noise Challenge.
Citations: 1
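Two of the training objectives named in this abstract can be written compactly. The sketch below gives stand-alone PyTorch versions of the Itakura-Saito divergence (used for the magnitude branch) and a negative SI-SNR loss (used for the phase branch); the argument conventions are assumptions, and the paper's exact weighting and network architectures (the VQ-VAE and the convolutional recurrent phase estimator) are not reproduced.

```python
# Hedged sketch: loss functions only, under assumed argument conventions.
import torch

def itakura_saito(p_est: torch.Tensor, p_ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """IS divergence between power spectrograms, averaged over time-frequency bins."""
    ratio = (p_ref + eps) / (p_est + eps)
    return torch.mean(ratio - torch.log(ratio) - 1.0)

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR between estimated and reference waveforms of shape (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    si_snr = 10 * torch.log10(torch.sum(proj ** 2, dim=-1) /
                              (torch.sum(noise ** 2, dim=-1) + eps))
    return -si_snr.mean()
```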
Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10772
Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, T. Quatieri
Pages: 2258-2262
Abstract: Parkinson's disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require substantial clinician involvement, so there is a need for new tools that determine cognitive impairment quickly and accurately and thus allow appropriate, timely interventions. In this paper, individuals with PD, designated as having either no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features derived from correlations across acoustic features representative of the speech production subsystems. The correlation-based features are used in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single- and dual-task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single- and dual-task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
Citations: 0
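The classification step described above, Gaussian mixture models over correlation-based features evaluated with AUC, can be sketched as follows. The random features stand in for the paper's coordination features, and the per-class-GMM log-likelihood-ratio scoring is one plausible reading of the setup rather than the authors' exact configuration.

```python
# Hedged sketch: two-class GMM scoring with ROC AUC on placeholder features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_nci = rng.normal(0.0, 1.0, size=(60, 12))   # placeholder feature vectors, NCI group
X_mci = rng.normal(0.4, 1.0, size=(60, 12))   # placeholder feature vectors, MCI group

gmm_nci = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X_nci[:40])
gmm_mci = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X_mci[:40])

X_test = np.vstack([X_nci[40:], X_mci[40:]])
y_test = np.array([0] * 20 + [1] * 20)

# Log-likelihood ratio: positive scores favor the MCI model.
llr = gmm_mci.score_samples(X_test) - gmm_nci.score_samples(X_test)
print("AUC:", roc_auc_score(y_test, llr))
```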
Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-11347
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang
Pages: 2748-2752
Abstract: Slot filling is an essential component of spoken language understanding. In contrast to conventional pipeline approaches, which extract slots from the ASR output, end-to-end approaches obtain slots directly from speech within a classification or generation framework. However, classification relies on predefined categories, which is not scalable, and generative models decode in an open-domain space and suffer from the blurred boundaries of slots in speech. To address the shortcomings of these two formulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, which leverages a limited generation method with boundary detection. We also release a large-scale Chinese spoken slot filling dataset, the Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with a 6.65% accuracy improvement. We make our code publicly available for researchers to replicate and build on our work.
Citations: 1
Cross-Modal Decision Regularization for Simultaneous Speech Translation
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10617
Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Pages: 116-120
Abstract: Simultaneous translation systems start producing output while processing the partial source sentence in the incoming input stream. These systems must decide when to read more input and when to write output. The decisions taken by the model depend on the structure of the source/target language and on the information contained in the partial input sequence; hence, the read/write decision policy remains the same across input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input to improve simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of the SimulMT task in improving SimulST performance. Overall, we achieve a 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes on the MuST-C English-German (EnDe) SimulST task.
Citations: 3
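To make the read/write framing in this abstract concrete, the sketch below implements the standard wait-k decision policy, a common simultaneous-translation baseline. It is not the proposed CMDR method, which instead regularizes a learned decision policy using the SimulMT task; the function names and the toy "translator" are illustrative assumptions.

```python
# Hedged sketch: wait-k read/write policy as a baseline illustration only.
from typing import Callable, Iterator

def wait_k_policy(source: Iterator[str],
                  translate_prefix: Callable[[list[str], int], str],
                  k: int = 3) -> Iterator[str]:
    """Read k source tokens before the first write, then alternate read/write."""
    prefix: list[str] = []
    n_written = 0
    for token in source:
        prefix.append(token)                            # READ action
        if len(prefix) >= k + n_written:
            yield translate_prefix(prefix, n_written)   # WRITE action
            n_written += 1
    while n_written < len(prefix):                      # source exhausted: flush remaining writes
        yield translate_prefix(prefix, n_written)
        n_written += 1

# Toy usage with an identity "translator" that copies the n-th source token.
dummy = lambda prefix, n: prefix[n].upper()
print(list(wait_k_policy(iter("wie geht es dir heute".split()), dummy, k=2)))
```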
Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10376
Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, A. H. Poorjam, Deepak Mittal, M. Singh
Pages: 2863-2867
Abstract: In this paper, we describe an approach to representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine-modulated Gaussian functions, a choice of kernel that allows the filterbank to be interpreted as smooth band-pass filters. The filtered outputs are pooled, log-compressed, and used in a self-attention based relevance weighting mechanism that emphasizes the key regions of the time-frequency decomposition important for the downstream task. The subsequent layers of the model consist of a recurrent architecture, and the models are trained for the COVID-19 detection task. In experiments on the Coswara dataset, we show that the proposed model achieves significant performance improvements over the baseline system as well as over other representation learning approaches. Further, the proposed approach is shown to be uniformly applicable to speech and breathing signals and to transfer learning from a larger dataset.
Citations: 1
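The front end described above, 1-D convolution kernels parameterized as cosine-modulated Gaussians, can be sketched as a small PyTorch module. The kernel length, stride, initialization ranges, and parameter names below are illustrative assumptions; the paper's pooling, relevance weighting, and recurrent back end are omitted.

```python
# Hedged sketch: learnable cosine-modulated Gaussian filterbank front end.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosGaussFilterbank(nn.Module):
    def __init__(self, n_filters: int = 40, kernel_size: int = 401):
        super().__init__()
        # Learnable normalized center frequencies (cycles per sample) and log-bandwidths.
        self.center = nn.Parameter(torch.linspace(0.01, 0.45, n_filters))
        self.log_sigma = nn.Parameter(torch.full((n_filters,), math.log(kernel_size / 10)))
        t = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        self.register_buffer("t", t)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        """wav: (batch, samples) -> filtered: (batch, n_filters, frames)."""
        sigma = self.log_sigma.exp().unsqueeze(1)                  # (F, 1)
        envelope = torch.exp(-0.5 * (self.t / sigma) ** 2)         # Gaussian window per filter
        carrier = torch.cos(2 * math.pi * self.center.unsqueeze(1) * self.t)
        kernels = (envelope * carrier).unsqueeze(1)                # (F, 1, K)
        return F.conv1d(wav.unsqueeze(1), kernels, stride=160)     # ~10 ms hop at 16 kHz

# Toy usage on one second of random 16 kHz audio.
fb = CosGaussFilterbank()
out = fb(torch.randn(2, 16000))
print(out.shape)  # torch.Size([2, 40, frames])
```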
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-327
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Pages: 2688-2692
Abstract: Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP by training on large corpora of written text, spoken language containing serious ASR errors that change its meaning is difficult to understand. We propose a method for pre-training Japanese LMs that are robust against ASR errors without using ASR. With the proposed method, which uses written texts, sentences containing pseudo-ASR errors are generated with a pseudo-error dictionary constructed using grapheme-to-phoneme and phoneme-to-grapheme models based on neural networks. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed an LM pre-trained with standard masked language modeling by 3.17 ROUGE-L points when fine-tuned on dialogues including ASR errors.
Citations: 0
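The core data-augmentation idea above, corrupting written text with pseudo-ASR errors before pre-training, is illustrated below with a toy English example. The paper builds its pseudo-error dictionary for Japanese from neural grapheme-to-phoneme and phoneme-to-grapheme models; the hand-written dictionary and substitution rate here are placeholders showing only the corruption step.

```python
# Hedged sketch: substitute words with pseudo-ASR confusions from a toy dictionary.
import random

PSEUDO_ERROR_DICT = {                 # word -> plausible ASR confusions (illustrative only)
    "recognition": ["wreck ignition"],
    "speech": ["beach"],
}

def corrupt(sentence: str, p: float = 0.15, seed: int = 0) -> str:
    """Replace dictionary words with a pseudo error with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in PSEUDO_ERROR_DICT and rng.random() < p:
            out.append(rng.choice(PSEUDO_ERROR_DICT[word]))
        else:
            out.append(word)
    return " ".join(out)

print(corrupt("speech recognition errors change meaning", p=1.0))
```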
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages
Interspeech Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-11288
Andreas Liesenfeld, Mark Dingemanse
Pages: 1126-1130
Abstract: Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, and (iii) most languages appear to make available a broad distinction between a minimal nasal format ('mm') and a fuller 'yeah'-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.
Citations: 3
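The clustering step above, mapping response tokens directly from the speech signal with UMAP, might look roughly like the sketch below, using MFCC summary vectors as a stand-in acoustic representation. The directory of token clips, the feature choice, and the UMAP settings are assumptions, not the paper's pipeline.

```python
# Hedged sketch: 2-D UMAP map of candidate response tokens from MFCC summaries.
import glob
import numpy as np
import librosa
import umap

def token_embedding(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Fixed-length summary of one token: mean and std of MFCCs over time."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder directory of short token clips; assumes a few hundred files so
# that n_neighbors stays below the number of samples.
token_files = sorted(glob.glob("tokens/*.wav"))
X = np.stack([token_embedding(p) for p in token_files])

# Nearby points in the 2-D map are acoustically similar token formats.
coords = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X)
print(coords.shape)
```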