Title: Distilling Sequence-to-Sequence Voice Conversion Models for Streaming Conversion Applications
Authors: Kou Tanaka, H. Kameoka, Takuhiro Kaneko, Shogo Seki
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10023432
Abstract: This paper describes a method for distilling a recurrent sequence-to-sequence (S2S) voice conversion (VC) model. Although recent VC models achieve increasingly high quality, streaming conversion remains a challenge for practical applications. To achieve streaming VC, the conversion model needs a streamable structure, i.e., causal rather than non-causal layers. Motivated by this constraint and by recent advances in S2S learning, we apply the teacher-student framework to recurrent S2S VC models. A major challenge is minimizing the degradation caused by causal layers, which mask future input information. Experimental evaluations show that, except for male-to-female speaker conversion, our approach maintains the teacher model's performance in subjective evaluations despite the streamable student model structure. Audio samples are available at http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/dists2svc.

Title: Streaming Bilingual End-to-End ASR Model Using Attention Over Multiple Softmax
Authors: Aditya Patil, Vikas Joshi, Purvi Agrawal, Rupeshkumar Mehta
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10022475
Abstract: Even with several advances in multilingual modeling, it is challenging to recognize multiple languages with a single neural model when the input language is unknown, and most multilingual models assume the input language is given. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach in which a single neural model can recognize both languages and also support switching between them, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. Because the language-specific posteriors are combined into a single posterior probability over all output symbols, the model enables a single beam-search decoding and allows dynamic switching between the languages. The proposed approach outperforms the conventional bilingual baseline with 13.3%, 8.23%, and 1.3% relative word error rate reductions on Hindi, English, and code-mixed test sets, respectively.
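The abstract above describes merging language-specific posteriors into one distribution via attention weights. A minimal sketch of that combination step, with made-up logits and fixed attention weights (the paper's actual networks and weighting are not reproduced here):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Hypothetical logits from two language-specific joint networks
# over a shared output symbol set (vocabulary size 5).
logits_hi = np.array([2.0, 0.1, 0.3, 0.2, 0.1])
logits_en = np.array([0.2, 1.8, 0.4, 0.1, 0.2])

# Scalar attention weights over the two branches; in the paper these
# come from a self-attention mechanism, here they are fixed constants.
attn = softmax(np.array([0.7, 0.3]))

# The weighted sum of the two posteriors is itself a valid posterior,
# so a single beam search can decode over all symbols and switch
# languages dynamically from frame to frame.
posterior = attn[0] * softmax(logits_hi) + attn[1] * softmax(logits_en)
```

Because the combined vector still sums to one, no decoder changes are needed beyond reading this single distribution.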

Title: Code-Switched Language Modelling Using a Code Predictive LSTM in Under-Resourced South African Languages
Authors: Joshua Jansen van Vüren, T. Niesler
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10022517
Abstract: We present a new LSTM language model architecture for code-switched speech incorporating a neural structure that explicitly models language switches. Experimental evaluation of this code predictive model for four under-resourced South African languages shows consistent improvements in perplexity, as well as in perplexity specifically over code-switches, compared to an LSTM baseline. Substantial reductions in absolute speech recognition word error rates (0.5%-1.2%), as well as in errors specifically at code-switches (0.6%-2.3%), are also achieved during n-best rescoring. When used for both data augmentation and n-best rescoring, our code predictive model reduces word error rate by a further 0.8%-2.6% absolute and consistently outperforms a baseline LSTM. The similar and consistent trends observed across all four language pairs allow us to conclude that explicit modelling of language switches by a dedicated language model component is a suitable strategy for code-switched speech recognition.

Title: NAM+: Towards Scalable End-to-End Contextual Biasing for Adaptive ASR
Authors: Tsendsuren Munkhdalai, Zelin Wu, G. Pundak, K. Sim, Jiayang Li, Pat Rondon, Tara N. Sainath
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10023323
Abstract: Attention-based biasing techniques for end-to-end ASR systems are able to achieve large accuracy gains without requiring the inference algorithm adjustments and parameter tuning common to fusion approaches. However, it is challenging to simultaneously scale up attention-based biasing to realistic numbers of biased phrases; maintain in-domain WER gains while minimizing out-of-domain losses; and run in real time. We present NAM+, an attention-based biasing approach which achieves a 16x inference speedup per acoustic frame over prior work when run with 3,000 biasing entities, as measured on a typical mobile CPU. NAM+ achieves these run-time gains through a combination of Two-Pass Hierarchical Attention and Dilated Context Update. Compared to the adapted baseline, NAM+ further decreases the in-domain WER by up to 12.6% relative, while incurring an out-of-domain WER regression of 20% relative. Compared to the non-adapted baseline, the out-of-domain WER regression is 7.1% relative.

Title: Hackathon
Authors: DO Objetivo
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/slt54892.2023.10023077
Abstract (translated from Portuguese): The theme of the "DAF and PCTec/UnB Hackathon" will be "UnB in the palm of your hand". The idea is to approach innovation in a way that generates direct benefits for the University of Brasília, through the construction of an app with which students, professors, staff, and the academic community in general can monitor the services provided by the companies responsible for the various contracts currently in force with the university.

Title: Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction
Authors: Jin Sakuma, S. Fujie, Tetsunori Kobayashi
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10023458
Abstract: Appropriate response timing is very important for achieving smooth dialog progression. Conventionally, prosodic, temporal, and linguistic features have been used to determine timing. In addition to these conventional features, we propose to utilize the syntactic completeness after a certain time, which represents whether the other party is about to finish speaking. We generate the next token sequence from intermediate speech recognition results using a language model and obtain the probability of the end of utterance appearing K tokens ahead, where K varies from 1 to M. We thus obtain an M-dimensional vector, which we denote as estimates of syntactic completeness (ESC). We evaluated this method on a simulated dialog database of a restaurant information center. The results confirmed that considering ESC improves the performance of response timing estimation, especially the accuracy of quick responses, compared with the method using only conventional features.
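The ESC vector described above can be sketched as follows, using a toy rule-based stand-in for the language model (the paper uses a trained LM over ASR partial hypotheses, and its exact continuation strategy may differ from this greedy one):

```python
# Toy next-token distribution given a token prefix; a crude,
# hypothetical stand-in for a trained language model.
def next_token_probs(prefix):
    if prefix and prefix[-1] == "pasta":
        return {"<eos>": 0.8, "i": 0.1, "want": 0.05, "pasta": 0.05}
    if prefix and prefix[-1] == "want":
        return {"pasta": 0.7, "<eos>": 0.1, "i": 0.1, "want": 0.1}
    return {"i": 0.4, "want": 0.4, "pasta": 0.1, "<eos>": 0.1}

def esc_vector(prefix, M=3):
    """Estimate P(end-of-utterance k tokens ahead) for k = 1..M.

    The prefix is extended greedily with the most likely non-eos
    token at each step; a beam or full marginalization would be
    more faithful but is omitted for brevity.
    """
    esc = []
    ctx = list(prefix)
    for _ in range(M):
        probs = next_token_probs(ctx)
        esc.append(probs["<eos>"])
        ctx.append(max((t for t in probs if t != "<eos>"), key=probs.get))
    return esc

# After "i want", the toy LM expects the utterance to end two
# tokens ahead (after "pasta"), so the second entry is largest.
print(esc_vector(["i", "want"]))  # [0.1, 0.8, 0.1]
```

A timing model would consume this M-dimensional vector alongside prosodic and temporal features to decide whether to respond now or wait.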

Title: PeppaNet: Effective Mispronunciation Detection and Diagnosis Leveraging Phonetic, Phonological, and Acoustic Cues
Authors: Bi-Cheng Yan, Hsin-Wei Wang, Berlin Chen
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10022472
Abstract: Mispronunciation detection and diagnosis (MDD) aims to detect erroneous pronunciation segments in an L2 learner's articulation and subsequently provide informative diagnostic feedback. Most existing neural methods follow a dictation-based modeling paradigm that finds pronunciation errors and returns diagnostic feedback at the same time by aligning the recognized phone sequence uttered by an L2 learner to the corresponding canonical phone sequence of a given text prompt. The main downside of these methods is that the dictation process and the alignment process are mostly made independent of each other. In view of this, we present a novel end-to-end neural method, dubbed PeppaNet, built on a unified structure that can jointly model the dictation and alignment processes. The model learns to directly predict the pronunciation correctness of each canonical phone of the text prompt and in turn provides the corresponding diagnostic feedback. In contrast to conventional dictation-based methods that rely mainly on a free-phone recognition process, PeppaNet makes good use of an effective selective gating mechanism to simultaneously incorporate phonetic, phonological, and acoustic cues, generating corrections that are more proper and more phonetically related to the canonical pronunciations. Extensive experiments conducted on the L2-ARCTIC benchmark dataset show the merits of our proposed method in comparison with recent top-of-the-line methods.

Title: Welcome Page
Authors: D. Glaser
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/slt54892.2023.10022398
Abstract: Professor Donald A. Glaser was a master of experimental science throughout his career. Born in Cleveland and educated at Case Institute of Technology, he earned a doctorate at Caltech and taught at the University of Michigan before accepting a post at UC Berkeley in 1959. Early in his career, Dr. Glaser experimented with ways to make the workings of sub-atomic particles visible. For his subsequent invention of the bubble chamber, he was awarded the 1960 Nobel Prize in Physics. He then began exploring the new field of molecular biology, improving techniques for working with bacterial phages, bacteria, and mammalian cells. By designing equipment to automate his experiments and scale them up, he could run thousands of experiments simultaneously, generating enough data to move the science forward. Recognizing the implications for medicine, Dr. Glaser and two friends created the pioneering biotech company Cetus Corporation in 1971, thus launching the genetic engineering industry.

Title: Flickering Reduction with Partial Hypothesis Reranking for Streaming ASR
Authors: A. Bruguier, David Qiu, Trevor Strohman, Yanzhang He
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023. DOI: https://doi.org/10.1109/SLT54892.2023.10023016
Abstract: Incremental speech recognizers start displaying results while the users are still speaking. These partial results are beneficial to users, who like the responsiveness of the system. However, as new partial results come in, words that were previously displayed can change or disappear. The results appear unstable, and this unwanted phenomenon is called flickering. Typical remediation approaches can increase latency and reduce the quality of the partial results, but little work has been done to measure these effects. We first introduce two new metrics that allow us to measure the quality and latency of the partials. We then propose a new, lightweight approach of reranking the partial results in favor of a more stable prefix without changing the beam search. This allows us to reduce flickering without impacting the final result. We show that we can roughly halve the amount of flickering with negligible impact on the quality and latency of the partial results.
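
One way to picture the reranking idea above is as a trade-off between model score and prefix stability. This is a hypothetical scoring rule for illustration only; the paper's actual reranking criterion and weighting are not reproduced here:

```python
def rerank_partials(hypotheses, scores, prev_partial, alpha=0.5):
    """Choose which partial hypothesis to display.

    Trades the model score against how much of the previously
    displayed partial the hypothesis preserves, so the on-screen
    text flickers less. `alpha` is a made-up stability weight.
    """
    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    best = max(
        range(len(hypotheses)),
        key=lambda i: scores[i]
        + alpha * common_prefix_len(hypotheses[i], prev_partial),
    )
    return hypotheses[best]

hyps = [["turn", "on", "the"], ["turn", "off", "the"]]
# The slightly lower-scored hypothesis wins because it keeps the
# previously displayed prefix ("turn off") stable on screen.
print(rerank_partials(hyps, [-1.0, -1.1], ["turn", "off"]))
```

Since only the displayed partial changes, the beam search and the final recognition result are untouched, matching the paper's claim of no impact on the final result.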