2022 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10023432
Kou Tanaka, H. Kameoka, Takuhiro Kaneko, Shogo Seki
Abstract: This paper describes a method for distilling a recurrent sequence-to-sequence (S2S) voice conversion (VC) model. Although recent VC systems achieve increasingly high quality, streaming conversion remains a challenge in practical applications. To achieve streaming VC, the conversion model needs a streamable structure: causal layers rather than non-causal layers. Motivated by this constraint and by recent advances in S2S learning, we apply the teacher-student framework to recurrent S2S VC models. A major challenge is minimizing the degradation caused by causal layers, which mask future input information. Experimental evaluations show that, except for male-to-female speaker conversion, our approach maintains the teacher model's performance in subjective evaluations despite the streamable student model structure. Audio samples can be accessed at http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/dists2svc.
Citations: 2
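The streamability constraint above hinges on causal layers that never look at future frames. A minimal sketch of that property (plain Python, illustrative only; the paper's models are recurrent S2S networks, not this toy convolution):

```python
def causal_conv1d(x, kernel):
    """1-D convolution that sees only current and past samples: left-pad
    the input by (len(kernel) - 1) zeros so output[t] depends only on
    x[:t+1] -- the streamable property a causal layer provides."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(padded[t + i] * kernel[i] for i in range(k))
            for t in range(len(x))]

# Perturbing a future frame must not change earlier outputs.
kernel = [0.5, 0.25, 0.25]
y1 = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
y2 = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)  # only last frame differs
assert y1[:3] == y2[:3]
```

A non-causal layer, by contrast, would mix future frames into every output, which is exactly what forces the teacher-student distillation in the paper: the student must reproduce the teacher's behavior without that lookahead.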
Streaming Bilingual End-to-End ASR Model Using Attention Over Multiple Softmax
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10022475
Aditya Patil, Vikas Joshi, Purvi Agrawal, Rupeshkumar Mehta
Abstract: Even with several advancements in multilingual modeling, recognizing multiple languages with a single neural model, without knowing the input language, remains challenging; most multilingual models assume the input language is available. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach in which a single neural model recognizes both languages and supports switching between them, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. Because the language-specific posteriors are combined into a single posterior probability over all output symbols, the model enables a single beam-search decoding and allows dynamic switching between languages. The proposed approach outperforms the conventional bilingual baseline with 13.3%, 8.23%, and 1.3% relative word error rate reductions on Hindi, English, and code-mixed test sets, respectively.
Citations: 0
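The combination step described above can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the logits are made up, and the attention scores are given directly rather than computed by the model's self-attention over the joint-network outputs.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def combine_posteriors(logits_per_lang, attn_scores):
    """Weight each language-specific softmax output by attention weights
    over the languages, yielding one posterior over the shared symbol
    set -- so a single beam search can decode and the model can switch
    languages frame by frame."""
    weights = softmax(attn_scores)               # attention over languages
    posteriors = [softmax(l) for l in logits_per_lang]
    vocab_size = len(posteriors[0])
    return [sum(w * p[i] for w, p in zip(weights, posteriors))
            for i in range(vocab_size)]

hindi_logits = [2.0, 0.1, 0.1]
english_logits = [0.1, 2.0, 0.1]
p = combine_posteriors([hindi_logits, english_logits], attn_scores=[1.5, 0.2])
assert abs(sum(p) - 1.0) < 1e-9   # a convex combination is still a distribution
```

Because the combination is convex, the result is a single valid distribution over all output symbols, which is what makes one shared beam search possible.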
Code-Switched Language Modelling Using a Code Predictive Lstm in Under-Resourced South African Languages
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10022517
Joshua Jansen van Vüren, T. Niesler
Abstract: We present a new LSTM language model architecture for code-switched speech incorporating a neural structure that explicitly models language switches. Experimental evaluation of this code predictive model for four under-resourced South African languages shows consistent improvements in perplexity, as well as in perplexity specifically over code-switches, compared to an LSTM baseline. Substantial reductions in absolute speech recognition word error rates (0.5%-1.2%), as well as in errors specifically at code-switches (0.6%-2.3%), are also achieved during n-best rescoring. When used for both data augmentation and n-best rescoring, our code predictive model reduces word error rate by a further 0.8%-2.6% absolute and consistently outperforms a baseline LSTM. The similar and consistent trends observed across all four language pairs allow us to conclude that explicit modelling of language switches by a dedicated language model component is a suitable strategy for code-switched speech recognition.
Citations: 3
NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10023323
Tsendsuren Munkhdalai, Zelin Wu, G. Pundak, K. Sim, Jiayang Li, Pat Rondon, Tara N. Sainath
Abstract: Attention-based biasing techniques for end-to-end ASR systems are able to achieve large accuracy gains without requiring the inference algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale up attention-based biasing to realistic numbers of biased phrases; maintain in-domain WER gains while minimizing out-of-domain losses; and run in real time. We present NAM+, an attention-based biasing approach which achieves a 16X inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.
Citations: 7
Cover Page
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/slt54892.2023.10022896
Citations: 0
Hackathon
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/slt54892.2023.10023077
DO Objetivo
Abstract: The theme of the "Hackathon DAF e PCTec/UnB" will be "UnB na palma da sua mão" ("UnB in the palm of your hand"). The idea is to approach innovation in a way that generates direct benefits for the Universidade de Brasília, by building an app through which students, professors, staff, and the academic community at large can monitor the services provided by the companies responsible for the various contracts currently in force with the university.
Citations: 2
Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10023458
Jin Sakuma, S. Fujie, Tetsunori Kobayashi
Abstract: Appropriate response timing is very important for achieving smooth dialog progression. Conventionally, prosodic, temporal, and linguistic features have been used to determine timing. In addition to these conventional features, we propose to utilize the syntactic completeness after a certain time, which represents whether the other party is about to finish speaking. We generate the next token sequence from intermediate speech recognition results using a language model and obtain the probability of the end of utterance appearing K tokens ahead, where K varies from 1 to M. This yields an M-dimensional vector, which we denote the estimates of syntactic completeness (ESC). We evaluated this method on a simulated dialog database of a restaurant information center. The results confirmed that considering ESC improves the performance of response timing estimation, especially the accuracy of quick responses, compared with a method using only conventional features.
Citations: 6
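The ESC vector described above can be sketched with a toy language model. Everything here is an assumption for illustration: the three-word vocabulary, the hand-set probabilities, and the greedy continuation (the paper generates token sequences from a trained LM over intermediate ASR results).

```python
VOCAB = ["food", "please", "<eos>"]

def toy_lm(prefix):
    """Stand-in next-token distribution over VOCAB; the numbers are
    illustrative only, not from any trained model."""
    last = prefix[-1] if prefix else ""
    if last == "want":
        return [0.85, 0.10, 0.05]   # "food" likely next
    if last == "food":
        return [0.10, 0.75, 0.15]   # "please" likely next
    if last == "please":
        return [0.05, 0.05, 0.90]   # utterance likely complete
    return [0.60, 0.35, 0.05]

def esc_vector(prefix, M=3):
    """Estimates of syntactic completeness: for K = 1..M, the probability
    of <eos> appearing K tokens ahead along the LM's greedy continuation
    of the partial recognition result (a simplification of the paper's
    token-sequence generation)."""
    esc, ctx = [], list(prefix)
    for _ in range(M):
        p = toy_lm(ctx)
        esc.append(p[VOCAB.index("<eos>")])
        ctx.append(VOCAB[p.index(max(p))])   # greedy next token
    return esc

esc = esc_vector(["i", "want"])
assert esc[2] > esc[0]   # completeness rises as the utterance nears its end
```

The resulting M-dimensional vector rises toward 1 as the utterance approaches a syntactically complete point, which is the signal the timing estimator consumes alongside prosodic and temporal features.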
Peppanet: Effective Mispronunciation Detection and Diagnosis Leveraging Phonetic, Phonological, and Acoustic Cues
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10022472
Bi-Cheng Yan, Hsin-Wei Wang, Berlin Chen
Abstract: Mispronunciation detection and diagnosis (MDD) aims to detect erroneous pronunciation segments in an L2 learner's articulation and subsequently provide informative diagnostic feedback. Most existing neural methods follow a dictation-based modeling paradigm that finds pronunciation errors and returns diagnostic feedback at the same time by aligning the phone sequence recognized from the L2 learner's utterance to the corresponding canonical phone sequence of a given text prompt. The main downside of these methods, however, is that the dictation process and the alignment process are mostly made independent of each other. In view of this, we present a novel end-to-end neural method, dubbed PeppaNet, built on a unified structure that jointly models the dictation process and the alignment process. The model learns to directly predict the pronunciation correctness of each canonical phone of the text prompt and in turn provides the corresponding diagnostic feedback. In contrast to conventional dictation-based methods that rely mainly on a free-phone recognition process, PeppaNet employs an effective selective gating mechanism to simultaneously incorporate phonetic, phonological, and acoustic cues, generating corrections that are more proper and more phonetically related to the canonical pronunciations. Extensive sets of experiments conducted on the L2-ARCTIC benchmark dataset show the merits of our proposed method in comparison to some recent top-of-the-line methods.
Citations: 5
Welcome Page
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/slt54892.2023.10022398
D. Glaser
Abstract: Professor Donald A. Glaser was a master of experimental science throughout his career. Born in Cleveland and educated at Case Institute of Technology, he earned a doctorate at Caltech and taught at the University of Michigan before accepting a post at UC Berkeley in 1959. Early in his career, Dr. Glaser experimented with ways to make the workings of sub-atomic particles visible. For his subsequent invention of the bubble chamber, he was awarded the Nobel Prize in Physics 1960. He then began exploring the new field of molecular biology, improving techniques for working with bacterial phages, bacteria, and mammalian cells. By designing equipment to automate his experiments and scale them up, he could run thousands of experiments simultaneously, generating enough data to move the science forward. Recognizing the implications for medicine, Dr. Glaser and two friends created the pioneering biotech company Cetus Corporation in 1971, thus launching the genetic engineering industry.
Citations: 0
Flickering Reduction with Partial Hypothesis Reranking for Streaming ASR
2022 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2023-01-09, DOI: 10.1109/SLT54892.2023.10023016
A. Bruguier, David Qiu, Trevor Strohman, Yanzhang He
Abstract: Incremental speech recognizers start displaying results while the user is still speaking. These partial results benefit users, who like the responsiveness of the system. However, as new partial results come in, words that were previously displayed can change or disappear. The results appear unstable, and this unwanted phenomenon is called flickering. Typical remediation approaches can increase latency and reduce the quality of the partial results, but little work has been done to measure these effects. We first introduce two new metrics that allow us to measure the quality and latency of the partials. We then propose a new, lightweight approach that reranks the partial results in favor of a more stable prefix without changing the beam search. This allows us to reduce flickering without impacting the final result. We show that we can roughly halve the amount of flickering with negligible impact on the quality and latency of the partial results.
Citations: 1
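The reranking idea above can be sketched as a score adjustment over the n-best partial hypotheses. The stability bonus, its weight `alpha`, and the toy scores are assumptions for illustration, not the paper's exact formulation; as in the paper, the beam search itself is left untouched.

```python
def common_prefix_len(a, b):
    """Number of leading words two hypotheses share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def rerank_partials(hypotheses, displayed, alpha=0.1):
    """Stability-aware reranking sketch: add a bonus proportional to the
    length of the prefix a hypothesis shares with the currently displayed
    partial, then pick the top-scoring hypothesis.  Favoring stable
    prefixes reduces flickering without altering the beam search."""
    def adjusted(hyp):
        words, score = hyp
        return score + alpha * common_prefix_len(words, displayed)
    return max(hypotheses, key=adjusted)

displayed = ["play", "some", "music"]
nbest = [(["play", "sum", "music"], -1.00),    # slightly better raw score
         (["play", "some", "music"], -1.05)]   # matches what is on screen
best = rerank_partials(nbest, displayed, alpha=0.1)
assert best[0] == ["play", "some", "music"]    # stable prefix wins
```

Because only the displayed partial is reranked, the final result (emitted when the beam search completes) is unaffected, which matches the paper's claim of reducing flickering without impacting the final hypothesis.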