{"title":"Far-Field ASR Using Low-Rank and Sparse Soft Targets from Parallel Data","authors":"Pranay Dighe, Afsaneh Asaei, H. Bourlard","doi":"10.1109/SLT.2018.8639579","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639579","url":null,"abstract":"Far-field automatic speech recognition (ASR) of conversational speech is often considered to be a very challenging task due to the poor quality of alignments available for training the DNN acoustic models. A common way to alleviate this problem is to use clean alignments obtained from parallelly recorded close-talk speech data. In this work, we advance the parallel data approach by obtaining enhanced low-rank and sparse soft targets from a close-talk ASR system and using them for training more accurate far-field acoustic models. Specifically, we (i) exploit eigenposteriors and Compressive Sensing dictionaries to learn low-dimensional senone subspaces in DNN posterior space, and (ii) enhance close-talk DNN posteriors to achieve high quality soft targets for training far-field DNN acoustic models. We show that the enhanced soft targets encode the structural and temporal interrelationships among senone classes which are easily accessible in the DNN posterior space of close-talk speech but not in its noisy far-field counterpart. We exploit enhanced soft targets to improve the mapping of far-field acoustics to close-talk senone classes. The experiments are performed on AMI meeting corpus where our approach improves DNN based acoustic modeling by 4.4% absolute (~8% rel.) reduction in WER as compared to a system which doesn’t use parallel data. Finally, the approach is also validated on state-of-the-art recurrent and time delay neural network architectures.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115392007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing Prosodic Frameworks: Investigating the Acoustic-Symbolic Relationship in ToBI and RaP","authors":"Raul Fernandez, A. Rosenberg","doi":"10.1109/SLT.2018.8639539","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639539","url":null,"abstract":"ToBI is the dominant tool for symbolically describing prosodic content in American English speech material. This is due to its descriptive power and its theoretical grounding, but also to the amount of available annotated data. Recently, a modest amount of material annotated with the Rhythm and Pitch (RaP) framework was released publicly. In this paper, we investigate the acoustic-symbolic relationship under these two systems. We present experiments looking at this relationship in both directions. From acoustic to symbolic, we compare the automatic prediction of prosodic prominence as defined under the two systems. From symbolic to acoustic, we examine the utility of these annotation standards to correctly prescribe the acoustics of a given utterance from their symbolic sequences. We find RaP to be promising, showing a somewhat stronger acoustic-symbolic relationship than ToBI given a comparable amount of data for some aspects of these tasks. While with more annotated data ToBI results are stronger, it remains to be shown whether RaP performance can scale up.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127324449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word Segmentation From Phoneme Sequences Based On Pitman-Yor Semi-Markov Model Exploiting Subword Information","authors":"Ryu Takeda, Kazunori Komatani, Alexander I. Rudnicky","doi":"10.1109/SLT.2018.8639607","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639607","url":null,"abstract":"Word segmentation from phoneme sequences is essential to identify unknown words -of-vocabulary; OOV) in spoken dialogues. The Pitman-Yor semi-Markov model (PYSMM) is used for word segmentation that handles dynamic increase in vocabularies. The obtained vocabularies, however, still include meaningless entries due to insufficient cues for phoneme sequences. We focus here on using subword information to capture patterns as “words.” We propose 1) a model based on subword N-gram and subword estimation using a vocabulary set, and 2) posterior fusion of the results of a PYSMM and our model to take advantage of both. Our experiments showed 1) the potential of using subword information for OOV acquisition, and 2) that our method outperformed the PYSMM by 1.53 and 1.07 in terms of the F-measure of the obtained OOV set for English and Japanese corpora, respectively.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123662405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving FFTNet Vocoder with Noise Shaping and Subband Approaches","authors":"T. Okamoto, T. Toda, Y. Shiga, H. Kawai","doi":"10.1109/SLT.2018.8639687","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639687","url":null,"abstract":"Although FFTNet neural vocoders can synthesize speech waveforms in real time, the synthesized speech quality is worse than that of WaveNet vocoders. To improve the synthesized speech quality of FFTNet while ensuring real-time synthesis, residual connections are introduced to enhance the prediction accuracy. Additionally, time-invariant noise shaping and subband approaches, which significantly improve the synthesized speech quality of WaveNet vocoders, are applied. A subband FFTNet vocoder with multiband input is also proposed to directly compensate the phase shift between subbands. The proposed approaches are evaluated through experiments using a Japanese male corpus with a sampling frequency of 16 kHz. The results are compared with those synthesized by the STRAIGHT vocoder without mel-cepstral compression and those from conventional FFTNet and WaveNet vocoders. The proposed approaches are shown to successfully improve the synthesized speech quality of the FFTNet vocoder. In particular, the use of noise shaping enables FFTNet to significantly outperform the STRAIGHT vocoder.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122411762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JSpeech: A Multi-Lingual Conversational Speech Corpus","authors":"A. J. Choobbasti, Mohammad Erfan Gholamian, Amir Vaheb, Saeid Safavi","doi":"10.1109/SLT.2018.8639658","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639658","url":null,"abstract":"Speech processing, automatic speech and speaker recognition are the major area of interests in the field of computational linguistics. Research and development of computer and human interaction, forensic technologies and dialogue systems have been the motivating factor behind this interest.In this paper, JSpeech is introduced, a multi-lingual corpus. This corpus contains 1332 hours of conversational speech from 47 different languages. This corpus can be used in a variety of studies, created from 106 public chat group the effect of language variability on the performance of speaker recognition systems and automatic language detection. To this end, we include speaker verification results obtained for this corpus using a state of the art method based on 3D convolutional neural network.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123029504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection and Calibration of Whisper for Speaker Recognition","authors":"Finnian Kelly, J. Hansen","doi":"10.1109/SLT.2018.8639595","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639595","url":null,"abstract":"Whisper is a commonly encountered form of speech that differs significantly from modal speech. As speaker recognition technology becomes more ubiquitous, it is important to assess the abilities and limitations of systems in the presence of variability such as whisper. In this paper, a comparative evaluation of whispered speaker recognition performance across two independent datasets is presented. Whisper-neutral speech comparisons are observed to consistently degrade performance relative to both neutral-neutral and whisper-whisper comparisons. An i-vector-based approach to whisper detection is introduced, and is shown to perform accurately across datasets even at short durations. The output of the whisper detector is subsequently used to select score calibration parameters for whispered speech comparisons, leading to a reduction in global calibration and discrimination error.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114492182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phase-Based Feature Representations for Improving Recognition of Dysarthric Speech","authors":"S. Sehgal, S. Cunningham, P. Green","doi":"10.1109/SLT.2018.8639031","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639031","url":null,"abstract":"Dysarthria is a neurological speech impairment, which usually results in the loss of motor speech control due to muscular atrophy and incoordination of the articulators. As a result the speech becomes less intelligible and difficult to model by machine learning algorithms due to inconsistencies in the acoustic signal and data sparseness. This paper presents phase-based feature representations for dysarthric speech that are exploited in the group delay spectrum. Such representations are found to be better suited to characterising the resonances of the vocal tract, exhibit better phone discrimination capabilities in dysarthric signals and consequently improve ASR performance. All the experiments were conducted using the UASPEECH corpus and significant ASR gains are reported using phase-based cepstral features in comparison to the standard MFCCs irrespective of the severity of the condition.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128768927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Wavenet Vocoder for Residual Compensation in GAN-Based Voice Conversion","authors":"Berrak Sisman, Mingyang Zhang, S. Sakti, Haizhou Li, Satoshi Nakamura","doi":"10.1109/SLT.2018.8639507","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639507","url":null,"abstract":"In this paper, we propose to use generative adversarial networks (GAN) together with a WaveNet vocoder to address the over-smoothing problem arising from the deep learning approaches to voice conversion, and to improve the vocoding quality over the traditional vocoders. As GAN aims to minimize the divergence between the natural and converted speech parameters, it effectively alleviates the over-smoothing problem in the converted speech. On the other hand, WaveNet vocoder allows us to leverage from the human speech of a large speaker population, thus improving the naturalness of the synthetic voice. Furthermore, for the first time, we study how to use WaveNet vocoder for residual compensation to improve the voice conversion performance. The experiments show that the proposed voice conversion framework consistently outperforms the baselines.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129034091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Role Annotated Speech Recognition for Conversational Interactions","authors":"Nikolaos Flemotomos, Zhuohao Chen, David C. Atkins, Shrikanth S. Narayanan","doi":"10.1109/SLT.2018.8639611","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639611","url":null,"abstract":"Speaker Role Recognition (SRR) assigns a specific speaker role to each speaker-homogeneous speech segment in a conversation. Typically, those segments have to be identified first through a diarization step. Additionally, since SRR is usually based on the different linguistic patterns observed between the roles to be recognized, an Automatic Speech Recognition (ASR) system is also indispensable for the task in hand to convert speech to text. In this work we introduce a Role Annotated Speech Recognition (RASR) system which, given a speech signal, outputs a sequence of words annotated with the corresponding speaker roles. Thus, the need of different component modules which are connected in a way that may lead to error propagation is eliminated. We present, analyze, and test our system for the case of two speaker roles to show-case an end-to-end approach for automatic rich transcription with application to clinical dyadic interactions.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134130767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-Aware Attention Mechanism for Speech Emotion Recognition","authors":"Gaetan Ramet, Philip N. Garner, Michael Baeriswyl, Alexandros Lazaridis","doi":"10.1109/SLT.2018.8639633","DOIUrl":"https://doi.org/10.1109/SLT.2018.8639633","url":null,"abstract":"In this work, we study the use of attention mechanisms to enhance the performance of the state-of-the-art deep learning model in Speech Emotion Recognition (SER). We introduce a new Long Short-Term Memory (LSTM)-based neural network attention model which is able to take into account the temporal information in speech during the computation of the attention vector. The proposed LSTM-based model is evaluated on the IEMOCAP dataset using a 5-fold cross-validation scheme and achieved 68.8% weighted accuracy on 4 classes, which outperforms the state-of-the-art models.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131833954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}