2018 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Far-Field ASR Using Low-Rank and Sparse Soft Targets from Parallel Data
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639579
Pranay Dighe, Afsaneh Asaei, H. Bourlard
Abstract: Far-field automatic speech recognition (ASR) of conversational speech is often considered a very challenging task due to the poor quality of the alignments available for training the DNN acoustic models. A common way to alleviate this problem is to use clean alignments obtained from parallelly recorded close-talk speech data. In this work, we advance the parallel-data approach by obtaining enhanced low-rank and sparse soft targets from a close-talk ASR system and using them to train more accurate far-field acoustic models. Specifically, we (i) exploit eigenposteriors and compressive-sensing dictionaries to learn low-dimensional senone subspaces in the DNN posterior space, and (ii) enhance close-talk DNN posteriors to obtain high-quality soft targets for training far-field DNN acoustic models. We show that the enhanced soft targets encode the structural and temporal interrelationships among senone classes, which are easily accessible in the DNN posterior space of close-talk speech but not in its noisy far-field counterpart. We exploit the enhanced soft targets to improve the mapping of far-field acoustics to close-talk senone classes. The experiments are performed on the AMI meeting corpus, where our approach improves DNN-based acoustic modeling by a 4.4% absolute (~8% relative) reduction in WER compared to a system that does not use parallel data. Finally, the approach is also validated on state-of-the-art recurrent and time-delay neural network architectures.
Citations: 2
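The enhancement step described above can be pictured as projecting the frame-level senone posterior matrix onto a low-dimensional subspace and renormalizing. Below is a minimal numpy sketch using a plain truncated SVD as a stand-in for the paper's eigenposterior and compressive-sensing dictionary machinery; the matrix shapes and rank are illustrative assumptions:

```python
import numpy as np

def low_rank_soft_targets(posteriors, rank=40, eps=1e-8):
    """Project a (frames x senones) posterior matrix onto its top
    `rank` singular directions and renormalize rows to sum to 1.

    A rough stand-in for the eigenposterior-based enhancement in the
    paper; the actual method learns per-senone subspaces and also
    retains a sparse component over a learned dictionary.
    """
    U, s, Vt = np.linalg.svd(posteriors, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    low_rank = np.clip(low_rank, 0.0, None)   # posteriors are non-negative
    low_rank /= low_rank.sum(axis=1, keepdims=True) + eps
    return low_rank

# Toy usage: 500 frames of close-talk DNN posteriors over 2000 senones.
post = np.random.dirichlet(np.ones(2000) * 0.1, size=500)
targets = low_rank_soft_targets(post, rank=40)
```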
Comparing Prosodic Frameworks: Investigating the Acoustic-Symbolic Relationship in ToBI and RaP
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639539
Raul Fernandez, A. Rosenberg
Abstract: ToBI is the dominant tool for symbolically describing prosodic content in American English speech material. This is due to its descriptive power and theoretical grounding, but also to the amount of available annotated data. Recently, a modest amount of material annotated with the Rhythm and Pitch (RaP) framework was released publicly. In this paper, we investigate the acoustic-symbolic relationship under these two systems. We present experiments looking at this relationship in both directions. From acoustic to symbolic, we compare the automatic prediction of prosodic prominence as defined under the two systems. From symbolic to acoustic, we examine the utility of these annotation standards for correctly prescribing the acoustics of a given utterance from its symbolic sequences. We find RaP promising: given a comparable amount of data, it shows a somewhat stronger acoustic-symbolic relationship than ToBI for some aspects of these tasks. While ToBI results are stronger with more annotated data, it remains to be shown whether RaP performance can scale up.
Citations: 0
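The acoustic-to-symbolic direction of such a comparison reduces to training a classifier to predict a word's prominence label from local acoustic measurements and comparing accuracies under the two annotation schemes. A toy sketch with scikit-learn, where the features, labels, and model choice are placeholders rather than the paper's setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder per-word acoustic features: [mean F0, F0 range,
# duration, RMS energy]; labels are binary prominence marks under
# one annotation framework (ToBI or RaP).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("prominence prediction accuracy: %.3f +/- %.3f"
      % (scores.mean(), scores.std()))
```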
Word Segmentation From Phoneme Sequences Based On Pitman-Yor Semi-Markov Model Exploiting Subword Information
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639607
Ryu Takeda, Kazunori Komatani, Alexander I. Rudnicky
Abstract: Word segmentation from phoneme sequences is essential for identifying unknown (out-of-vocabulary; OOV) words in spoken dialogues. The Pitman-Yor semi-Markov model (PYSMM) is used for word segmentation because it handles dynamically growing vocabularies. The obtained vocabularies, however, still include meaningless entries due to the insufficient cues available in phoneme sequences. We focus here on using subword information to capture patterns as "words." We propose 1) a model based on subword N-grams and subword estimation using a vocabulary set, and 2) posterior fusion of the results of a PYSMM and our model to take advantage of both. Our experiments showed 1) the potential of using subword information for OOV acquisition, and 2) that our method outperformed the PYSMM by 1.53 and 1.07 points in F-measure on the obtained OOV sets for English and Japanese corpora, respectively.
Citations: 1
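The abstract does not spell out the fusion formula, but posterior fusion of two segmenters can be illustrated as a weighted log-linear combination of the word-boundary probabilities each model assigns at every phoneme position. A toy sketch under that assumption, with the weight and probabilities invented for illustration:

```python
import math

def fuse_boundary_posteriors(p_pysmm, p_subword, weight=0.5, eps=1e-12):
    """Log-linear fusion of per-position word-boundary posteriors
    from a PYSMM and a subword N-gram model (both lists of
    probabilities in [0, 1], one per phoneme position)."""
    fused = []
    for p1, p2 in zip(p_pysmm, p_subword):
        log_p = weight * math.log(p1 + eps) + (1 - weight) * math.log(p2 + eps)
        fused.append(math.exp(log_p))
    return fused

# Toy posteriors over 6 phoneme positions.
pysmm   = [0.9, 0.1, 0.2, 0.8, 0.3, 0.7]
subword = [0.8, 0.2, 0.1, 0.9, 0.4, 0.6]
print(fuse_boundary_posteriors(pysmm, subword))
```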
Improving FFTNet Vocoder with Noise Shaping and Subband Approaches
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639687
T. Okamoto, T. Toda, Y. Shiga, H. Kawai
Abstract: Although FFTNet neural vocoders can synthesize speech waveforms in real time, the synthesized speech quality is worse than that of WaveNet vocoders. To improve the synthesized speech quality of FFTNet while ensuring real-time synthesis, residual connections are introduced to enhance the prediction accuracy. Additionally, time-invariant noise shaping and subband approaches, which significantly improve the synthesized speech quality of WaveNet vocoders, are applied. A subband FFTNet vocoder with multiband input is also proposed to directly compensate for the phase shift between subbands. The proposed approaches are evaluated through experiments using a Japanese male corpus with a sampling frequency of 16 kHz. The results are compared with those synthesized by the STRAIGHT vocoder without mel-cepstral compression and those from conventional FFTNet and WaveNet vocoders. The proposed approaches are shown to successfully improve the synthesized speech quality of the FFTNet vocoder. In particular, the use of noise shaping enables FFTNet to significantly outperform the STRAIGHT vocoder.
Citations: 14
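The subband idea can be illustrated with a complementary two-band FIR split: the highpass branch is a delayed unit impulse minus the lowpass filter, so the two band signals sum back to a delayed copy of the input. A minimal scipy sketch; the filter length and cutoff are arbitrary choices, and a real subband vocoder would also decimate and model each band:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def two_band_split(x, numtaps=101, cutoff=0.5):
    """Split x into complementary low/high bands. cutoff is in
    normalized frequency (1.0 = Nyquist)."""
    lp = firwin(numtaps, cutoff)        # linear-phase lowpass
    hp = -lp
    hp[numtaps // 2] += 1.0             # delta - lowpass = highpass
    return lfilter(lp, [1.0], x), lfilter(hp, [1.0], x)

# 16 kHz toy signal: the two bands sum to the input delayed by
# (numtaps - 1) / 2 = 50 samples.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)
low, high = two_band_split(x)
recon = low + high
print(np.allclose(recon[50:], x[:-50], atol=1e-6))  # True up to the delay
```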
JSpeech: A Multi-Lingual Conversational Speech Corpus
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639658
A. J. Choobbasti, Mohammad Erfan Gholamian, Amir Vaheb, Saeid Safavi
Abstract: Speech processing and automatic speech and speaker recognition are major areas of interest in the field of computational linguistics. Research and development in human-computer interaction, forensic technologies, and dialogue systems have been the motivating factors behind this interest. In this paper, JSpeech, a multi-lingual corpus, is introduced. This corpus contains 1332 hours of conversational speech from 47 different languages. Created from 106 public chat groups, the corpus can be used in a variety of studies, such as the effect of language variability on the performance of speaker recognition systems and automatic language detection. To this end, we include speaker verification results obtained for this corpus using a state-of-the-art method based on a 3D convolutional neural network.
Citations: 1
Detection and Calibration of Whisper for Speaker Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639595
Finnian Kelly, J. Hansen
Abstract: Whisper is a commonly encountered form of speech that differs significantly from modal speech. As speaker recognition technology becomes more ubiquitous, it is important to assess the abilities and limitations of systems in the presence of variability such as whisper. In this paper, a comparative evaluation of whispered speaker recognition performance across two independent datasets is presented. Whisper-neutral speech comparisons are observed to consistently degrade performance relative to both neutral-neutral and whisper-whisper comparisons. An i-vector-based approach to whisper detection is introduced, and is shown to perform accurately across datasets even at short durations. The output of the whisper detector is subsequently used to select score calibration parameters for whispered speech comparisons, leading to a reduction in global calibration and discrimination error.
Citations: 4
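The detector-driven calibration can be sketched as: classify each side of a trial as whispered or neutral, then apply the linear score calibration trained for that condition pair. A toy sketch in which the nearest-mean detector and the calibration constants are invented stand-ins, not values or models from the paper:

```python
import numpy as np

def detect_whisper(ivec, whisper_mean, neutral_mean):
    """Toy stand-in for the paper's i-vector whisper detector:
    assign the class whose mean i-vector is closer in cosine."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(ivec, whisper_mean) > cos(ivec, neutral_mean)

# Invented per-condition linear calibrations (scale, offset), e.g.
# trained by logistic regression on held-out trials per condition.
CALIBRATION = {
    ("neutral", "neutral"): (1.00, 0.0),
    ("neutral", "whisper"): (0.55, -1.2),
    ("whisper", "whisper"): (0.80, -0.4),
}

def calibrated_llr(raw_score, enroll_ivec, test_ivec, w_mean, n_mean):
    """Map a raw recognition score to a calibrated LLR using the
    calibration trained for the detected condition pair."""
    cond = tuple(sorted(
        "whisper" if detect_whisper(v, w_mean, n_mean) else "neutral"
        for v in (enroll_ivec, test_ivec)))
    a, b = CALIBRATION[cond]
    return a * raw_score + b
```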
Phase-Based Feature Representations for Improving Recognition of Dysarthric Speech
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639031
S. Sehgal, S. Cunningham, P. Green
Abstract: Dysarthria is a neurological speech impairment, which usually results in the loss of motor speech control due to muscular atrophy and incoordination of the articulators. As a result, the speech becomes less intelligible and difficult to model with machine learning algorithms due to inconsistencies in the acoustic signal and data sparseness. This paper presents phase-based feature representations for dysarthric speech derived from the group delay spectrum. Such representations are found to be better suited to characterising the resonances of the vocal tract, exhibit better phone discrimination capabilities in dysarthric signals, and consequently improve ASR performance. All the experiments were conducted using the UASPEECH corpus, and significant ASR gains are reported for phase-based cepstral features in comparison to standard MFCCs, irrespective of the severity of the condition.
Citations: 4
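The group delay spectrum behind such features is the negative derivative of the phase spectrum, which can be computed without phase unwrapping from the DFTs of x[n] and n·x[n]. A minimal sketch; the frame length, FFT size, and denominator floor are illustrative, and the paper's features would further involve modified group delay processing and cepstral conversion:

```python
import numpy as np

def group_delay_spectrum(frame, n_fft=512, eps=1e-10):
    """Group delay via the n*x[n] identity, avoiding explicit
    phase unwrapping:
        tau(w) = (Re X * Re Y + Im X * Im Y) / |X|^2,
    with X = DFT(x) and Y = DFT(n * x[n])."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

# Toy 25 ms frame at 16 kHz; cepstral features would then be derived
# from a smoothed, modified version of this spectrum.
x = np.hamming(400) * np.random.randn(400)
tau = group_delay_spectrum(x)
```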
Adaptive Wavenet Vocoder for Residual Compensation in GAN-Based Voice Conversion
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639507
Berrak Sisman, Mingyang Zhang, S. Sakti, Haizhou Li, Satoshi Nakamura
Abstract: In this paper, we propose to use generative adversarial networks (GANs) together with a WaveNet vocoder to address the over-smoothing problem arising from deep learning approaches to voice conversion, and to improve the vocoding quality over traditional vocoders. As the GAN aims to minimize the divergence between the natural and converted speech parameters, it effectively alleviates the over-smoothing problem in the converted speech. On the other hand, the WaveNet vocoder allows us to leverage human speech from a large speaker population, thus improving the naturalness of the synthetic voice. Furthermore, for the first time, we study how to use the WaveNet vocoder for residual compensation to improve voice conversion performance. The experiments show that the proposed voice conversion framework consistently outperforms the baselines.
Citations: 37
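The adversarial part of such a framework can be pictured as a discriminator trained to separate natural from converted spectral features, with the conversion network receiving an extra loss term for fooling it, which counteracts over-smoothing. A minimal PyTorch sketch where the layer sizes, losses, and weighting are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

feat_dim = 40  # e.g. mel-cepstral coefficients per frame

G = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                  nn.Linear(256, feat_dim))          # source -> target features
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                 # natural-vs-converted critic
bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(src, tgt, adv_weight=0.1):
    """Regression to the target plus an adversarial term pushing the
    converted features toward the natural distribution."""
    conv = G(src)
    adv = bce(D(conv), torch.ones(conv.size(0), 1))
    return mse(conv, tgt) + adv_weight * adv

def discriminator_loss(src, tgt):
    """Train D to label natural frames 1 and converted frames 0."""
    conv = G(src).detach()
    return (bce(D(tgt), torch.ones(tgt.size(0), 1)) +
            bce(D(conv), torch.zeros(conv.size(0), 1)))
```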
Role Annotated Speech Recognition for Conversational Interactions
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639611
Nikolaos Flemotomos, Zhuohao Chen, David C. Atkins, Shrikanth S. Narayanan
Abstract: Speaker Role Recognition (SRR) assigns a specific speaker role to each speaker-homogeneous speech segment in a conversation. Typically, those segments first have to be identified through a diarization step. Additionally, since SRR is usually based on the different linguistic patterns observed between the roles to be recognized, an Automatic Speech Recognition (ASR) system is also indispensable for the task at hand to convert speech to text. In this work we introduce a Role Annotated Speech Recognition (RASR) system which, given a speech signal, outputs a sequence of words annotated with the corresponding speaker roles. Thus, the need for different component modules connected in a way that may lead to error propagation is eliminated. We present, analyze, and test our system for the case of two speaker roles to showcase an end-to-end approach for automatic rich transcription, with application to clinical dyadic interactions.
Citations: 2
Context-Aware Attention Mechanism for Speech Emotion Recognition
2018 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2018-12-01 | DOI: 10.1109/SLT.2018.8639633
Gaetan Ramet, Philip N. Garner, Michael Baeriswyl, Alexandros Lazaridis
Abstract: In this work, we study the use of attention mechanisms to enhance the performance of the state-of-the-art deep learning model in Speech Emotion Recognition (SER). We introduce a new Long Short-Term Memory (LSTM)-based neural network attention model which is able to take into account the temporal information in speech during the computation of the attention vector. The proposed LSTM-based model is evaluated on the IEMOCAP dataset using a 5-fold cross-validation scheme and achieves 68.8% weighted accuracy on 4 classes, which outperforms the state-of-the-art models.
Citations: 37
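Attention pooling over LSTM outputs can be sketched as a learned per-frame score followed by a softmax-weighted average of the hidden states. A minimal PyTorch sketch of the generic idea; the dimensions and additive scoring form are assumptions, not the paper's exact context-aware model:

```python
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    """BLSTM encoder + attention pooling for utterance-level emotion
    classification (a generic sketch, not the paper's architecture)."""
    def __init__(self, feat_dim=40, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)   # per-frame attention energy
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)                      # (batch, time, 2*hidden)
        alpha = torch.softmax(self.score(h), dim=1)   # weights over time
        context = (alpha * h).sum(dim=1)         # attention-weighted average
        return self.out(context)

logits = AttentiveLSTM()(torch.randn(8, 300, 40))  # 8 utterances, 300 frames
```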