Latest Publications: 2022 IEEE Spoken Language Technology Workshop (SLT)

Improving Semi-Supervised End-To-End Automatic Speech Recognition Using CycleGAN and Inter-Domain Losses
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-20 DOI: 10.1109/SLT54892.2023.10022448
C. Li, Ngoc Thang Vu
{"title":"Improving Semi-Supervised End-To-End Automatic Speech Recognition Using Cyclegan and Inter-Domain Losses","authors":"C. Li, Ngoc Thang Vu","doi":"10.1109/SLT54892.2023.10022448","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022448","url":null,"abstract":"We propose a novel method that combines CycleGAN and inter-domain losses for semi-supervised end-to-end automatic speech recognition. Inter-domain loss targets the extraction of an intermediate shared representation of speech and text inputs using a shared network. CycleGAN uses cycle-consistent loss and the identity mapping loss to preserve relevant characteristics of the input feature after converting from one domain to another. As such, both approaches are suitable to train end-to-end models on unpaired speech-text inputs. In this paper, we exploit the advantages from both inter-domain loss and CycleGAN to achieve better shared representation of unpaired speech and text inputs and thus improve the speech-to-text mapping. Our experimental results on the WSJ eval92 and Voxforge (non English) show $8sim 8.5%$ character error rate reduction over the baseline, and the results on LibriSpeech test_clean also show noticeable improvement.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128390238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
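A minimal sketch of the cycle-consistency and identity-mapping losses described in the abstract above. The generator definitions, feature shapes, and loss weights are illustrative assumptions, not the authors' implementation; the inter-domain loss over the shared network is omitted.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cyclegan_losses(G_s2t, G_t2s, speech_repr, text_repr, lambda_cyc=10.0, lambda_id=5.0):
    """CycleGAN-style losses over speech/text feature representations.

    G_s2t: generator mapping speech-domain features to text-domain features.
    G_t2s: generator mapping text-domain features back to the speech domain.
    """
    # Cycle-consistency: speech -> text -> speech should reconstruct the input.
    cyc_s = l1(G_t2s(G_s2t(speech_repr)), speech_repr)
    cyc_t = l1(G_s2t(G_t2s(text_repr)), text_repr)
    # Identity mapping: feeding a generator input already in its target domain
    # should leave it (nearly) unchanged.
    id_s = l1(G_t2s(speech_repr), speech_repr)
    id_t = l1(G_s2t(text_repr), text_repr)
    return lambda_cyc * (cyc_s + cyc_t) + lambda_id * (id_s + id_t)

# Toy usage: one shared linear map as both generators, random 8-frame features.
if __name__ == "__main__":
    G = nn.Linear(256, 256)
    speech = torch.randn(8, 256)
    text = torch.randn(8, 256)
    print(cyclegan_losses(G, G, speech, text))
```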
A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-19 DOI: 10.1109/SLT54892.2023.10022437
Anupama Chingacham, Vera Demberg, D. Klakow
{"title":"A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification","authors":"Anupama Chingacham, Vera Demberg, D. Klakow","doi":"10.1109/SLT54892.2023.10022437","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022437","url":null,"abstract":"In noisy environments, speech can be hard to understand for humans. Spoken dialog systems can help to enhance the intelligibility of their output, either by modifying the speech synthesis (e.g., imitate Lombard speech) or by optimizing the language generation. We here focus on the second type of approach, by which an intended message is realized with words that are more intelligible in a specific noisy environment. By conducting a speech perception experiment, we created a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing. We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB. Our analysis of the data shows that the intelligibility differences between paraphrases are mainly driven by noise-robust acoustic cues. Furthermore, we propose an intelligibility-aware paraphrase ranking model, which outperforms baseline models with a relative improvement of 31.37% at SNR -5 dB.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115114440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
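A minimal sketch of the ranking step: given candidate paraphrases of a message, pick the one predicted to be most intelligible in the target noise. The scoring function below is a toy stand-in; the paper trains its ranking model on human perception data, which is not reproduced here.

```python
def rank_paraphrases(paraphrases, score_fn):
    """Return paraphrases sorted from most to least intelligible in noise."""
    return sorted(paraphrases, key=score_fn, reverse=True)

# Hypothetical scorer, purely for illustration: rewards vowel density as a
# crude proxy for acoustic cues that survive babble noise.
def toy_intelligibility_score(sentence):
    vowels = sum(ch in "aeiou" for ch in sentence.lower())
    return vowels / max(len(sentence), 1)

candidates = ["The meeting starts at noon.", "Our gathering commences at midday."]
print(rank_paraphrases(candidates, toy_intelligibility_score))
```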
End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-19 DOI: 10.1109/SLT54892.2023.10023199
Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono
{"title":"End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation","authors":"Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono","doi":"10.1109/SLT54892.2023.10023199","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023199","url":null,"abstract":"Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130503313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
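A sketch of the processing chain the paper integrates into one differentiable network: dereverberation, then beamforming, then the SSL feature extractor, then the ASR encoder. All modules below are placeholders (the paper uses a WPD beamformer and a WavLM-based SSLR, neither implemented here); the point is that an ASR loss at the end back-propagates into the front-end.

```python
import torch
import torch.nn as nn

class Pipeline(nn.Module):
    def __init__(self, dereverb, beamformer, sslr, asr):
        super().__init__()
        self.dereverb, self.beamformer, self.sslr, self.asr = dereverb, beamformer, sslr, asr

    def forward(self, multichannel_wav):          # (batch, channels, samples)
        x = self.dereverb(multichannel_wav)       # still multi-channel
        x = self.beamformer(x)                    # -> single channel (batch, samples)
        feats = self.sslr(x)                      # -> (batch, frames, dim)
        return self.asr(feats)                    # -> token logits per frame

if __name__ == "__main__":
    toy = Pipeline(
        dereverb=nn.Identity(),                   # stand-in for WPE/WPD dereverberation
        beamformer=lambda x: x.mean(dim=1),       # naive channel average as a stand-in
        sslr=nn.Sequential(nn.Unflatten(1, (100, 160)), nn.Linear(160, 256)),
        asr=nn.Linear(256, 32),
    )
    wav = torch.randn(2, 6, 16000)                # 2 utterances, 6 mics, 1 s at 16 kHz
    print(toy(wav).shape)                         # torch.Size([2, 100, 32])
```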
G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-19 DOI: 10.1109/SLT54892.2023.10022748
Gary Wang, Ekin D. Cubuk, A. Rosenberg, Shuyang Cheng, Ron J. Weiss, B. Ramabhadran, P. Moreno, Quoc V. Le, Daniel S. Park
{"title":"G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR","authors":"Gary Wang, Ekin D.Cubuk, A. Rosenberg, Shuyang Cheng, Ron J. Weiss, B. Ramabhadran, P. Moreno, Quoc V. Le, Daniel S. Park","doi":"10.1109/SLT54892.2023.10022748","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022748","url":null,"abstract":"Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more “end-to-end,” the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present G(raph)-Augment, a technique to define the augmentation space as directed acyclic graphs (DAGs) and search over this space to optimize the augmentation policy itself. We show that given the same computational budget, policies produced by G-Augment are able to perform better than SpecAugment policies obtained by random search on fine-tuning tasks on CHiME-6 and AMI. G-Augment is also able to establish a new state-of-the-art ASR performance on the CHiME-6 evaluation set (30.7% WER). We further demonstrate that G- Augment policies show better transfer properties across warm-start to cold-start training and model size compared to random-searched SpecAugment policies.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121114419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
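A sketch of the core idea: represent an augmentation policy as a small DAG of operations rather than a fixed sequence like SpecAugment, so the search can optimize the graph topology as well as the operations. The node ops, merge rule (averaging parents), and graph encoding below are illustrative assumptions.

```python
import numpy as np

def time_mask(spec, width=10):
    spec = spec.copy()
    t0 = np.random.randint(0, max(spec.shape[1] - width, 1))
    spec[:, t0:t0 + width] = 0.0    # zero out a band of frames
    return spec

def freq_mask(spec, width=8):
    spec = spec.copy()
    f0 = np.random.randint(0, max(spec.shape[0] - width, 1))
    spec[f0:f0 + width, :] = 0.0    # zero out a band of mel bins
    return spec

def apply_dag(spec, nodes, edges):
    """Apply a DAG policy: each node's op runs on the average of its parents.
    `nodes` must be listed in topological order; node 0 is the raw input."""
    outputs = {0: spec}
    for node_id, op in nodes:
        parents = [outputs[p] for p, c in edges if c == node_id]
        merged = np.mean(parents, axis=0) if parents else spec
        outputs[node_id] = op(merged)
    return outputs[max(outputs)]

spec = np.random.rand(80, 100)                     # 80 mel bins x 100 frames
nodes = [(1, time_mask), (2, freq_mask), (3, lambda s: s)]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]           # two branches merged at node 3
print(apply_dag(spec, nodes, edges).shape)
```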
Two-Stage Training Method for Japanese Electrolaryngeal Speech Enhancement Based on Sequence-to-Sequence Voice Conversion
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-19 DOI: 10.1109/SLT54892.2023.10023033
D. Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, T. Toda
{"title":"Two-Stage Training Method for Japanese Electrolaryngeal Speech Enhancement Based on Sequence-to-Sequence Voice Conversion","authors":"D. Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, T. Toda","doi":"10.1109/SLT54892.2023.10023033","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023033","url":null,"abstract":"Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential in converting electrolaryngeal (EL) speech to normal speech (EL2SP) compared to conventional VC models. However, EL2SP based on seq2seq VC requires a sufficiently large amount of parallel data for the model training and it suffers from significant performance degradation when the amount of training data is insufficient. To address this issue, we suggest a novel, two-stage strategy to optimize the performance on EL2SP based on seq2seq VC when a small amount of the parallel dataset is available. In contrast to utilizing high-quality data augmentations in previous studies, we first combine a large amount of imperfect synthetic parallel data of EL and normal speech, with the original dataset into VC training. Then, a second stage training is conducted with the original parallel dataset only. The results show that the proposed method progressively improves the performance of EL2SP based on seq2seq VC.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121475731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
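A minimal sketch of the two-stage recipe: train first on the large pool of imperfect synthetic parallel pairs mixed with the small real set, then fine-tune on the real parallel set alone. The model, loss, data, and the lower stage-2 learning rate are all placeholder assumptions.

```python
import torch
import torch.nn as nn

def train(model, pairs, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for el_feats, target_feats in pairs:
            opt.zero_grad()
            loss_fn(model(el_feats), target_feats).backward()
            opt.step()

model = nn.Linear(80, 80)   # stand-in for a seq2seq VC model over 80-dim mels
synthetic = [(torch.randn(4, 80), torch.randn(4, 80)) for _ in range(8)]
real = [(torch.randn(4, 80), torch.randn(4, 80)) for _ in range(2)]

train(model, synthetic + real, epochs=2, lr=1e-3)  # stage 1: mixed data
train(model, real, epochs=2, lr=1e-4)              # stage 2: real data only
```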
N-Best Hypotheses Reranking for Text-to-SQL Systems
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-19 DOI: 10.1109/SLT54892.2023.10023434
Lu Zeng, S. Parthasarathi, Dilek Z. Hakkani-Tür
{"title":"N-Best Hypotheses Reranking for Text-to-SQL Systems","authors":"Lu Zeng, S. Parthasarathi, Dilek Z. Hakkani-Tür","doi":"10.1109/SLT54892.2023.10023434","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023434","url":null,"abstract":"Text-to-SQL task maps natural language utterances to structured queries that can be issued to a database. State-of-the-art (SOTA) systems rely on finetuning large, pre-trained language models in conjunction with constrained decoding applying a SQL parser. On the well established Spider dataset, we begin with Oracle studies: specifically, choosing an Oracle hypothesis from a SOTA model's 10-best list, yields a 7.7% absolute improvement in both exact match (EM) and execution (EX) accuracy, showing significant potential improvements with reranking. Identifying coherence and correctness as reranking approaches, we design a model generating a query plan and propose a heuristic schema linking algorithm. Combining both approaches, with T5-Large, we obtain a consistent 1% improvement in EM accuracy, and a 2.5% improvement in EX, establishing a new SOTA for this task. Our comprehensive error studies on DEV data show the underlying difficulty in making progress on this task.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128608502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
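A minimal sketch of N-best reranking: combine the base model's log-probability with auxiliary scores and pick the best hypothesis. The two scorers and the interpolation weights are stand-ins; the paper's rerankers are a query-plan model (coherence) and a schema-linking heuristic (correctness).

```python
def rerank(nbest, coherence_fn, correctness_fn, alpha=0.5, beta=0.5):
    """nbest: list of (sql, log_prob) pairs; returns the top hypothesis."""
    def score(hyp):
        sql, logp = hyp
        return logp + alpha * coherence_fn(sql) + beta * correctness_fn(sql)
    return max(nbest, key=score)

nbest = [("SELECT name FROM singer", -1.2), ("SELECT * FROM singer", -1.0)]
best = rerank(
    nbest,
    coherence_fn=lambda sql: 1.0 if "name" in sql else 0.0,  # toy scorer
    correctness_fn=lambda sql: 0.0,                          # toy scorer
)
print(best)
```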
Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-18 DOI: 10.1109/SLT54892.2023.10022791
Zhehuai Chen, Ankur Bapna, A. Rosenberg, Yu Zhang, B. Ramabhadran, P. Moreno, Nanxin Chen
{"title":"Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR","authors":"Zhehuai Chen, Ankur Bapna, A. Rosenberg, Yu Zhang, B. Ramabhadran, P. Moreno, Nanxin Chen","doi":"10.1109/SLT54892.2023.10022791","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022791","url":null,"abstract":"Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130994234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
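A tiny sketch of the two ingredients the abstract highlights, byte-level text representation and an added language embedding, which together make the text branch script-agnostic. Dimensions and the fusion rule (summation) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

byte_embed = nn.Embedding(256, 128)   # one embedding per byte value, any script
lang_embed = nn.Embedding(102, 128)   # one embedding per FLEURS language

def encode_text(text, lang_id):
    byte_ids = torch.tensor(list(text.encode("utf-8")))
    # Sum byte embeddings with the language embedding (broadcast over time).
    return byte_embed(byte_ids) + lang_embed(torch.tensor(lang_id))

print(encode_text("नमस्ते", lang_id=37).shape)  # works regardless of script
```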
HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-18 DOI: 10.1109/SLT54892.2023.10022967
Tina Raissi, Wei Zhou, S. Berger, R. Schlüter, H. Ney
{"title":"HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch","authors":"Tina Raissi, Wei Zhou, S. Berger, R. Schluter, H. Ney","doi":"10.1109/SLT54892.2023.10022967","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022967","url":null,"abstract":"In this work, we compare from-scratch sequence-level cross-entropy (full-sum) training of Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) topologies for automatic speech recognition (ASR). Besides accuracy, we further analyze their capability for generating high-quality time alignment between the speech signal and the transcription, which can be crucial for many subsequent applications. Moreover, we propose several methods to improve convergence of from-scratch full-sum training by addressing the alignment modeling issue. Systematic comparison is conducted on both Switchboard and LibriSpeech corpora across CTC, posterior HMM with and w/o transition probabilities, and standard hybrid HMM. We also provide a detailed analysis of both Viterbi forced-alignment and Baum-Welch full-sum occupation probabilities.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124517325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
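A sketch of what "full-sum" means for the CTC topology: the forward algorithm sums over all alignments of the blank-interleaved label sequence to the frames. An HMM topology differs mainly in its allowed transitions (and, for the posterior HMM variant, in whether transition probabilities are used); that variation is not shown here.

```python
import numpy as np

def ctc_full_sum_logprob(log_probs, labels, blank=0):
    """log P(labels | frames) under CTC; log_probs: (T, V) log-softmax outputs."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]                  # interleave blanks: ^ a ^ b ^
    T, S = log_probs.shape[0], len(ext)
    alpha = np.full((T, S), -np.inf)       # forward scores in log space
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                      # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])          # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])          # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # End in the final label or the final blank.
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

T, V = 6, 5
log_probs = np.log(np.random.dirichlet(np.ones(V), size=T))
print(ctc_full_sum_logprob(log_probs, labels=[2, 3]))
```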
SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-18 DOI: 10.1109/SLT54892.2023.10023124
Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao
{"title":"SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning","authors":"Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao","doi":"10.1109/SLT54892.2023.10023124","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023124","url":null,"abstract":"Estimating age from a single speech is a classic and challenging topic. Although Label Distribution Learning (LDL) can represent adjacent indistinguishable ages well, the uncertainty of the age estimate for each utterance varies from person to person, i.e., the variance of the age distribution is different. To address this issue, we propose selective variance label distribution learning (SVLDL) method to adapt the variance of different age distributions. Furthermore, the model uses WavLM as the speech feature extractor and adds the auxiliary task of gender recognition to further improve the performance. Two tricks are applied on the loss function to enhance the robustness of the age estimation and improve the quality of the fitted age distribution. Extensive experiments show that the model achieves state-of-the-art performance on all aspects of the NIST SRE08-10 and a real-world datasets.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124066694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
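A minimal sketch of label distribution learning for age: the training target is a Gaussian over discrete ages centered at the true age, fit with a KL loss. SVLDL's contribution is that the variance is selected per utterance rather than fixed; here sigma is simply a parameter, and the age support is an assumption.

```python
import torch
import torch.nn.functional as F

AGES = torch.arange(10, 91, dtype=torch.float32)   # assumed support: ages 10..90

def gaussian_label_distribution(true_age, sigma):
    logits = -((AGES - true_age) ** 2) / (2 * sigma ** 2)
    return F.softmax(logits, dim=0)                # discrete Gaussian over ages

def ldl_loss(pred_log_probs, true_age, sigma):
    target = gaussian_label_distribution(true_age, sigma)
    return F.kl_div(pred_log_probs, target, reduction="sum")

pred = F.log_softmax(torch.randn(len(AGES)), dim=0)
# A larger sigma encodes more uncertainty about this speaker's age.
print(ldl_loss(pred, true_age=34.0, sigma=3.0))
```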
Sub-8-Bit Quantization for On-Device Speech Recognition: A Regularization-Free Approach
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-17 DOI: 10.1109/SLT54892.2023.10022821
Kai Zhen, Martin H. Radfar, H. Nguyen, Grant P. Strimel, Nathan Susanj, A. Mouchtaris
{"title":"Sub-8-Bit Quantization for On-Device Speech Recognition: A Regularization-Free Approach","authors":"Kai Zhen, Martin H. Radfar, H. Nguyen, Grant P. Strimel, Nathan Susanj, A. Mouchtaris","doi":"10.1109/SLT54892.2023.10022821","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022821","url":null,"abstract":"For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, “soft-to-hard” compression mechanism with self-adjustable centroids in a $mu$ -Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"62 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121225496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
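A sketch of quantizing weights in a μ-law-companded space with adjustable centroids, the idea behind GQ's "soft-to-hard" scheme. The temperature-annealed softmax assignment below is an illustrative assumption; the paper's exact mechanism may differ.

```python
import torch

def mu_law(x, mu=255.0):
    # Compand values so small magnitudes get finer quantization resolution.
    return torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))

def soft_quantize(weights, centroids, temperature):
    """Soft assignment of companded weights to centroids; approaches a hard
    (nearest-centroid) assignment as temperature -> 0."""
    w = mu_law(weights).unsqueeze(-1)                     # (..., 1)
    dist = (w - centroids) ** 2                           # (..., K)
    assign = torch.softmax(-dist / temperature, dim=-1)   # soft one-hot
    return assign @ centroids                             # quantized values

# Centroids are trainable, so they self-adjust during QAT.
centroids = torch.nn.Parameter(torch.linspace(-1, 1, 16))  # 16 levels = 4-bit
w = torch.randn(8) * 0.1
print(soft_quantize(w, centroids, temperature=0.01))
```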