{"title":"Visually Grounded Cross-Lingual Keyword Spotting in Speech","authors":"H. Kamper, Michael Roth","doi":"10.21437/SLTU.2018-53","DOIUrl":"https://doi.org/10.21437/SLTU.2018-53","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122428080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosodic Analysis of Non-Native South Indian English Speech","authors":"Radha Krishna Guntur, R. Krishnan, V. K. Mittal","doi":"10.21437/SLTU.2018-15","DOIUrl":"https://doi.org/10.21437/SLTU.2018-15","url":null,"abstract":"Investigations on linguistic prosody related to non-native English speech by South Indians were carried out using a database specifically meant for this study. Prosodic differences between native and non-native speech samples of regional language groups: Kannada, Tamil, and Telugu were evaluated and compared. This information is useful in applications such as Native language identification. It is observed that the mean value of pitch, and the general variation of pitch contour is higher in the case of non-native English speech by all the three groups of speakers, indicating accommodation of speaking manner. This study finds that dynamic variation of pitch is the least for English speech by native Kannada language speakers. The increase in standard deviation of pitch contour for non-native English speech by Kannada speakers is much less at about 3.7% on an average. In the case of Tamil and Telugu native speakers, it is 9.5%, and 27% respectively.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127091112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Post-Processing Using Speech Enhancement Techniques for Unit Selection and Hidden Markov Model Based Low Resource Language Marathi Text-to-Speech System","authors":"Sangramsing Kayte, Monica R. Mundada","doi":"10.21437/SLTU.2018-20","DOIUrl":"https://doi.org/10.21437/SLTU.2018-20","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121324401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IIITH-ILSC Speech Database for Indain Language Identification","authors":"R. Vuddagiri, K. Gurugubelli, P. Jain, Hari Krishna Vydana, A. Vuppala","doi":"10.21437/SLTU.2018-12","DOIUrl":"https://doi.org/10.21437/SLTU.2018-12","url":null,"abstract":"This work focuses on the development of speech data comprising 23 Indian languages for developing language identification (LID) systems. Large data is a pre-requisite for developing state-of-the-art LID systems. With this motivation, the task of developing multilingual speech corpus for Indian languages has been initiated. This paper describes the composition of the data and the performances of various LID systems developed using this data. In this paper, Mel frequency cepstral feature representation is used for language identification. In this work, various state-of-the-art LID systems are developed using i-vectors, deep neural network (DNN) and deep neural network with attention (DNN-WA) models. The performance of the LID system is observed in terms of the equal error rate for i-vector, DNN and DNN-WA is 17.77%, 17.95%, and 15.18% respec-tively. Deep neural network with attention model shows a better performance over i-vector and DNN models.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"86 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116421005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Human Quality Text to Speech System for Sinhala","authors":"L. Nanayakkara, Chamila Liyanage, Pubudu Tharaka Viswakula, Thilini Nagungodage, Randil Pushpananda, R. Weerasinghe","doi":"10.21437/SLTU.2018-33","DOIUrl":"https://doi.org/10.21437/SLTU.2018-33","url":null,"abstract":"This paper proposes an approach on implementing a Text to Speech system for Sinhala language using MaryTTS framework. In this project, a set of rules for mapping text to sound were identified and proceeded with Unit selection mechanism. The datasets used for this study were gathered from newspaper articles and the corresponding sentences were recorded by a professional speaker. User level evaluation was conducted with 20 candidates, where the intelligibility and the naturalness of the developed Sinhala TTS system received an approximate score of 70%. And the overall speech quality is an approximately to 60%.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123686298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting the Features of World Atlas of Language Structures from Speech","authors":"Alexander Gutkin, Tatiana Merkulova, Martin Jansche","doi":"10.21437/SLTU.2018-52","DOIUrl":"https://doi.org/10.21437/SLTU.2018-52","url":null,"abstract":"Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126389098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-resource Tibetan Dialect Acoustic Modeling Based on Transfer Learning","authors":"Jinghao Yan, Zhiqiang Lv, Shen Huang, Hongzhi Yu","doi":"10.21437/SLTU.2018-2","DOIUrl":"https://doi.org/10.21437/SLTU.2018-2","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125047378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System","authors":"Hari Krishna Vydana, Sivanand Achanta, A. Vuppala","doi":"10.21437/sltu.2018-36","DOIUrl":"https://doi.org/10.21437/sltu.2018-36","url":null,"abstract":"Speaker normalization is one of the crucial aspects of an Automatic speech recognition system (ASR). Speaker normalization is employed to reduce the performance drop in ASR due to speaker variabilities. Traditional speaker normalization methods are mostly linear transforms over the input data estimated per speaker, such transforms would be efficient with sufficient data. In practical scenarios, only a single utterance from the test speaker is accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that could efficiently be performed even when single utterance from the unseen speaker is available. In this work, it is hypothesized that by suitably providing information about the speaker’s identity while training an end-to-end neural network, the capability to normalize the speaker variability could be in-corporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. In this work, the identity of the training speaker is represented in two different ways viz. i) by using a one-hot speaker code, ii) a weighted combination of all the training speakers identities. The unseen speakers from the test set are represented using a weighted combination of training speakers representations. Both the approaches have reduced the word error rate (WER) by 0.6, 1.3% WSJ corpus.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126785093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A small Griko-Italian speech translation corpus","authors":"Marcely Zanon Boito, Antonios Anastasopoulos, M. Lekakou, A. Villavicencio, L. Besacier","doi":"10.21437/SLTU.2018-8","DOIUrl":"https://doi.org/10.21437/SLTU.2018-8","url":null,"abstract":"This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 2 hours of speech) which have been transcribed and translated in Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morpho syntactic tags and word-level glosses. Applying an automatic unit discovery method, pseudo-phones were also generated. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting some baseline results for the task of speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129372206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Speech Recognition for Humanitarian Applications in Somali","authors":"Raghav Menon, A. Biswas, A. Saeb, John Quinn, T. Niesler","doi":"10.21437/SLTU.2018-5","DOIUrl":"https://doi.org/10.21437/SLTU.2018-5","url":null,"abstract":"We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hrs of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNN) and long short-term memory neural networks (LSTMs) as well as the perturbation of acoustic data are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs) and bi-directional long short term memory (BLSTMs) to achieve a word error rate of 53.75%.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130841081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}