2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Evaluating on-device ASR on Field Recordings from an Interactive Reading Companion
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639603
Anastassia Loukina, Nitin Madnani, Beata Beigman Klebanov, A. Misra, Georgi Angelov, O. Todic
Abstract: Many applications designed to assess and improve oral reading fluency use automated speech recognition (ASR) to provide feedback to students, teachers, and parents. Most such applications rely on a distributed architecture with the speech recognition component located in the cloud. For interactive applications, this approach requires a reliable Internet connection that may not always be available. We investigate whether on-device ASR can be used for a virtual reading companion, using recordings obtained from children both in a controlled environment and in the field. Our limited evaluation makes us cautiously optimistic about the feasibility of using on-device ASR for our application.
Citations: 4
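The abstract above describes an evaluation without naming its metric; evaluations of children's read speech recognition typically report word error rate (WER). The following is only an illustrative, self-contained WER computation in Python, not the paper's scoring pipeline.

# Minimal WER via word-level edit distance (illustrative sketch).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167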
An Icelandic Pronunciation Dictionary for TTS
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639590
Anna Björk Nikulásdóttir, Jón Guðnason, Eiríkur Rögnvaldsson
Abstract: This paper describes an Icelandic pronunciation dictionary for speech applications and its processing for use in a text-to-speech system for Icelandic. Cleaning and correction procedures were implemented to create a consistent training set for grapheme-to-phoneme (g2p) conversion modeling, needed for the automatic extension of the dictionary. Experiments using the original version of the dictionary and the cleaned version described in this paper as training sets for a joint-sequence g2p algorithm show a clear benefit of using clean data for training, both in terms of PER and in terms of the categories of errors made by the g2p algorithm. The results of the dictionary processing were also used to create an initial version of an open-source database for Icelandic speech applications.
Citations: 10
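As a rough illustration of the kind of cleaning pass described above, the sketch below filters a tab-separated word/transcription lexicon against a known phone inventory and drops duplicates before g2p training. The file format, the phone set, and the specific checks are assumptions for illustration, not the paper's actual procedure.

# Hypothetical dictionary cleaning: keep only entries whose transcriptions
# use phones from a known inventory, and drop exact duplicates.
PHONE_INVENTORY = {"a", "i:", "t", "s", "k", "r", "n"}   # placeholder phone set

def clean_dictionary(lines):
    seen, cleaned = set(), []
    for line in lines:
        word, _, transcription = line.strip().partition("\t")
        phones = transcription.split()
        if not word or not phones:
            continue                       # malformed entry
        if any(p not in PHONE_INVENTORY for p in phones):
            continue                       # inconsistent phone symbol
        if (word, transcription) in seen:
            continue                       # duplicate entry
        seen.add((word, transcription))
        cleaned.append((word, phones))
    return cleaned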
Toward Multi-Features Emphasis Speech Translation: Assessment of Human Emphasis Production and Perception with Speech and Text Clues
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639641
Quoc Truong Do, S. Sakti, Satoshi Nakamura
Abstract: Emphasis is an important factor of human speech that helps convey emotion and the focused information of utterances. Recently, studies have been conducted on speech-to-speech translation to preserve the emphasis information from the source language to the target language. However, since different cultures have various ways of expressing emphasis, considering only acoustic-to-acoustic feature emphasis translation may not always reflect the experiences of users. On the other hand, emphasis can be expressed at various levels in both text and speech. It remains unclear how we communicate emphasis in different forms (acoustic/linguistic) at different levels, and whether we can perceive the difference between different levels of emphasis or observe the similarity of the same emphasis levels in both text and speech forms. In this paper, we conducted analyses of human perception of emphasis with both speech and text clues through crowd-sourced evaluations. The results indicate that although participants can distinguish among emphasis levels and perceive the same emphasis level between speech and text, many ambiguities still exist at certain emphasis levels. Our result thus provides insight into what needs to be handled during the emphasis translation process.
Citations: 2
An Evaluation of Deep Spectral Mappings and WaveNet Vocoder for Voice Conversion
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639608
Patrick Lumban Tobing, Tomoki Hayashi, Yi-Chiao Wu, Kazuhiro Kobayashi, T. Toda
Abstract: This paper presents an evaluation of deep spectral mapping and WaveNet vocoder in voice conversion (VC). In our VC framework, spectral features of an input speaker are converted into those of a target speaker using the deep spectral mapping, and then, together with the excitation features, the converted waveform is generated using the WaveNet vocoder. In this work, we compare three different deep spectral mapping networks: a deep single density network (DSDN), a deep mixture density network (DMDN), and a long short-term memory recurrent neural network with an autoregressive output layer (LSTM-AR). Moreover, we also investigate several methods for reducing mismatches of the spectral features used in the WaveNet vocoder between training and conversion, such as methods to alleviate oversmoothing effects of the converted spectral features, and a method to refine WaveNet using the converted spectral features. The experimental results demonstrate that the LSTM-AR yields better spectral mapping accuracy than the other networks, and that the proposed WaveNet refinement method significantly improves the naturalness of the converted waveform.
Citations: 10
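For readers unfamiliar with the LSTM-AR variant compared above, the sketch below shows the general idea of an LSTM spectral mapper whose output layer is autoregressive, i.e. it also consumes the previously generated frame. The feature dimensions, the free-running loop (rather than teacher forcing), and the loss are assumptions for illustration, not the paper's configuration.

import torch
import torch.nn as nn

class LSTMARMapper(nn.Module):
    """Sketch of an LSTM spectral mapping network with an autoregressive
    output layer; dimensions are placeholders."""
    def __init__(self, in_dim=40, hidden_dim=256, out_dim=40):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim + out_dim, out_dim)
        self.out_dim = out_dim

    def forward(self, src):                       # src: (batch, time, in_dim)
        h, _ = self.lstm(src)                     # (batch, time, hidden_dim)
        prev = src.new_zeros(src.size(0), self.out_dim)
        frames = []
        for t in range(src.size(1)):
            # current frame depends on the LSTM state and the previous output
            prev = self.proj(torch.cat([h[:, t], prev], dim=-1))
            frames.append(prev)
        return torch.stack(frames, dim=1)          # (batch, time, out_dim)

In training, the previous target frame would typically replace the model's own output (teacher forcing); the loop above shows the conversion-time behaviour.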
High-Degree Feature for Deep Neural Network Based Acoustic Model
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639524
Hoon Chung, Sung Joo Lee, J. Park
Abstract: In this paper, we propose to use high-degree features to improve the discrimination performance of Deep Neural Network (DNN) based acoustic models. Thanks to the successful posterior probability estimation of DNNs for high-dimensional features, high-dimensional acoustic features are commonly considered in DNN-based acoustic models. Even though it is not clear how DNN-based acoustic models estimate the posterior probability robustly, the use of high-dimensional features is motivated by the result that higher dimensionality helps the separability of patterns. It is also well known that high-degree features increase the linear separability of nonlinear input features. However, there is little work that exploits high-degree features explicitly in a DNN-based acoustic model. Therefore, in this work, we investigate high-degree features to improve performance further. The proposed approach was evaluated on a Wall Street Journal (WSJ) speech recognition domain. The proposed method achieved up to a 21.8% error reduction rate on the Eval92 test set, reducing the word error rate from 4.82% to 3.77% when using degree-2 polynomial expansion.
Citations: 0
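The core idea above, expanding acoustic feature vectors with degree-2 polynomial terms before feeding them to the DNN, can be illustrated with scikit-learn's PolynomialFeatures. The feature dimensionality and the use of scikit-learn are illustrative assumptions, not the paper's pipeline.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy batch of 40-dimensional acoustic feature vectors (placeholder values).
X = np.random.randn(8, 40)

# Degree-2 expansion: original features plus all pairwise products x_i * x_j.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_deg2 = poly.fit_transform(X)
print(X.shape, "->", X_deg2.shape)   # (8, 40) -> (8, 860): 40 + 40*41/2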
Multichannel ASR with Knowledge Distillation and Generalized Cross Correlation Feature
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639600
Wenjie Li, Yu Zhang, Pengyuan Zhang, Fengpei Ge
Abstract: Multi-channel signal processing techniques have played an important role in far-field automatic speech recognition (ASR) as a separate front-end enhancement component. However, they often suffer from a mismatch problem. In this paper, we propose a novel acoustic model architecture in which the multi-channel speech is used directly, without preprocessing. In addition, a knowledge distillation strategy and generalized cross correlation (GCC) adaptation are employed. We use knowledge distillation to transfer knowledge from a well-trained close-talking model to the distant-talking scenario for every frame of the multi-channel distant speech. Moreover, the GCC between microphones, which contains spatial information, is supplied as an auxiliary input to the neural network. We observe that the two techniques complement each other well. Evaluated on the AMI and ICSI meeting corpora, the proposed methods achieve relative WER improvements of 7.7% and 7.5% over a model trained directly on the concatenated multi-channel speech.
Citations: 2
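Two building blocks referenced above, frame-level knowledge distillation from a close-talking teacher and a GCC feature between microphone pairs, can be sketched generically as follows. The temperature-scaled KL loss and the PHAT-weighted GCC are common formulations assumed here; the paper's exact variants may differ.

import numpy as np
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Frame-level KL divergence between teacher and student posteriors."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    q = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2

def gcc_phat(x, y, n_fft=512):
    """Generalized cross correlation with phase transform for one mic pair."""
    X, Y = np.fft.rfft(x, n=n_fft), np.fft.rfft(y, n=n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    return np.fft.irfft(cross, n=n_fft)     # correlation over time lags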
A K-Nearest Neighbours Approach To Unsupervised Spoken Term Discovery
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639515
Alexis Thual, Corentin Dancette, Julien Karadayi, Juan Benjumea, Emmanuel Dupoux
Abstract: Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.
Citations: 11
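The first step described above, replacing DTW-based pattern matching with a KNN search over fixed-dimensional segment embeddings, can be sketched with scikit-learn as a stand-in for the paper's optimized search. The random embeddings and the choice of k are placeholders.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder: fixed-dimensional embeddings of speech segments,
# e.g. produced by a fixed embedding function over acoustic frames.
segment_embeddings = np.random.randn(10000, 64).astype(np.float32)

knn = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(segment_embeddings)
distances, indices = knn.kneighbors(segment_embeddings)

# Candidate pairs of recurring acoustic patterns (skip self-matches in column 0),
# to be handed to graph clustering in the second step.
pairs = [(i, int(j)) for i, row in enumerate(indices) for j in row[1:]]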
Improved Auto-Marking Confidence for Spoken Language Assessment
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639634
Marco Del Vecchio, A. Malinin, M. Gales
Abstract: Automatic assessment of spoken language proficiency is a sought-after technology. These systems often need to handle operating scenarios where candidates have a skill level or first language that was not encountered during training. For high-stakes tests, such systems must grade well when the candidate comes from the same population as the training set, and they should know when they are likely to perform badly because the candidate does not come from that population. This paper focuses on using Deep Density Networks to yield auto-marking confidence. Firstly, we explore the benefits of parametrising either a predictive distribution or a posterior distribution over the parameters of the model likelihood and obtaining the predictive distribution via marginalisation. Secondly, we investigate how to act on the parametrised density in order to explicitly teach the model to have low confidence in areas of the observation space where there is no training data, by assigning confidence scores to artificially generated data. Lastly, we compare the capabilities of Factor Analysis, Variational Auto-Encoders, and Wasserstein Generative Adversarial Networks for generating artificial data.
Citations: 2
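One of the options discussed above, directly parametrising a predictive distribution over grades, can be sketched as a network with mean and variance heads trained with the Gaussian negative log-likelihood. The architecture, loss, and dimensions here are assumptions for illustration, not the paper's Deep Density Network.

import torch
import torch.nn as nn

class PredictiveDensityNet(nn.Module):
    """Toy grader that outputs a Gaussian predictive distribution per input."""
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.logvar_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), averaged over the batch
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

The predicted variance can then be read as an (inverse) auto-marking confidence, which is the quantity the paper aims to improve.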
Hierarchical RNNs for Waveform-Level Speech Synthesis
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639588
Qingyun Dou, Moquan Wan, G. Degottex, Zhiyi Ma, M. Gales
Abstract: Speech synthesis technology has a wide range of applications such as voice assistants. In recent years waveform-level synthesis systems have achieved state-of-the-art performance, as they overcome the limitations of vocoder-based synthesis systems. A range of waveform-level synthesis systems have been proposed; this paper investigates the performance of hierarchical Recurrent Neural Networks (RNNs) for speech synthesis. First, the form of network conditioning is discussed, comparing linguistic features and vocoder features from a vocoder-based synthesis system. It is found that, compared with linguistic features, conditioning on vocoder features requires less data and modeling power, and yields better performance when there is limited data. By conditioning the hierarchical RNN on vocoder features, this paper develops a neural vocoder, which is capable of high-quality synthesis when there is sufficient data. Furthermore, this neural vocoder is flexible, as conceptually it can map any sequence of vocoder features to speech, enabling efficient synthesizer porting to a target speaker. Subjective listening tests demonstrate that the neural vocoder outperforms a high-quality baseline, and that it can change its voice to a very different speaker, given less than 15 minutes of data for fine tuning.
Citations: 2
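As a rough illustration of the conditioning choice discussed above, the sketch below upsamples frame-level vocoder features to the sample rate and feeds them, together with the previous quantised sample, into a sample-level GRU. This is a simplified, non-hierarchical stand-in: all dimensions, the quantisation scheme, and the single-tier structure are assumptions, and the paper's hierarchical RNN is more elaborate.

import torch
import torch.nn as nn

class CondSampleRNN(nn.Module):
    """Toy sample-level RNN conditioned on frame-level vocoder features."""
    def __init__(self, cond_dim=63, hidden=256, n_quant=256, samples_per_frame=80):
        super().__init__()
        self.spf = samples_per_frame
        self.embed = nn.Embedding(n_quant, 64)         # previous quantised sample
        self.rnn = nn.GRU(64 + cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)          # logits for the next sample

    def forward(self, prev_samples, cond):
        # prev_samples: (B, T) int64; cond: (B, T // samples_per_frame, cond_dim)
        c = cond.repeat_interleave(self.spf, dim=1)    # upsample to sample rate
        x = torch.cat([self.embed(prev_samples), c], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)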
LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639551
Yasufumi Moriya, G. Jones
Abstract: Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information to improve ASR for this data has recently drawn attention from researchers. Our investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model with a long short-term memory (LSTM) network. Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent reduction in perplexity of 5–10 is observed on both datasets. When the non-adapted model is combined with the image-adaptation and video-title-adaptation models for n-best ASR hypothesis re-ranking, the word error rate (WER) is additionally decreased by around 0.5% on both datasets. By analysing the output word probabilities of the model, it is found that both image adaptation and video title adaptation give the model more confidence in the choice of contextually correct informative words.
Citations: 14
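The re-ranking step described above, combining the first-pass ASR score with the adapted LM score for each n-best hypothesis, reduces to a simple weighted selection. The interpolation weight, score conventions, and example hypotheses below are purely illustrative.

def rerank_nbest(nbest, lm_scale=0.5):
    """Pick the hypothesis maximising asr_score + lm_scale * adapted LM log-prob.

    nbest: list of (hypothesis_text, asr_score, adapted_lm_logprob) tuples.
    """
    return max(nbest, key=lambda h: h[1] + lm_scale * h[2])[0]

# Hypothetical 3-best list for one utterance
nbest = [
    ("the lecture covers neural networks", -12.3, -20.1),
    ("the lecture covers new role networks", -11.9, -27.5),
    ("a lecture covers neural networks", -13.0, -21.4),
]
print(rerank_nbest(nbest))   # -> "the lecture covers neural networks"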