{"title":"A Unified Phonological Representation of South Asian Languages for Multilingual Text-to-Speech","authors":"Isin Demirsahin, Martin Jansche, Alexander Gutkin","doi":"10.21437/SLTU.2018-17","DOIUrl":"https://doi.org/10.21437/SLTU.2018-17","url":null,"abstract":"We present a multilingual phoneme inventory and inclusion mappings from the native inventories of several major South Asian languages for multilingual parametric text-to-speech synthesis (TTS). Our goal is to reduce the need for training data when building new TTS voices by leveraging available data for similar languages within a common feature design. For West Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Tel-ugu, and Urdu we compare TTS voices trained only on monolingual data with voices trained on multilingual data from 12 languages. In subjective evaluations multilingually trained voices outperform (or in a few cases are statistically tied with) the corresponding monolingual voices. The multilingual setup can further be used to synthesize speech for languages not seen in the training data; preliminary evaluations lean towards good. Our results indicate that pooling data from different languages in a single acoustic model can be beneficial, opening up new uses and research questions.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127128056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Networks-based Automatic Speech Recognition for Agricultural Commodity in Gujarati Language","authors":"Hardik B. Sailor, H. Patil","doi":"10.21437/SLTU.2018-34","DOIUrl":"https://doi.org/10.21437/SLTU.2018-34","url":null,"abstract":"In this paper, we present a development of Automatic Speech Recognition (ASR) system as a part of a speech-based access for an agricultural commodity in the Gujarati (a low resource) language. We proposed to use neural networks for language modeling, acoustic modeling, and feature learning from the raw speech signals. The speech database of agricultural commodities was collected from the farmers belonging to various villages of Gujarat state (India). The database has various dialectal variations and real noisy acoustic environments. Acoustic modeling is performed using Time Delay Neural Networks (TDNN). The auditory feature representation is learned using Convolutional Restricted Boltzmann Machine (ConvRBM) and Teager Energy Operator (TEO). The language model (LM) rescoring is performed using Recurrent Neural Networks (RNN). RNNLM rescoring provides an absolute reduction of 0.69-1.18 in % WER for all the feature sets compared to the bi-gram LM. The system combination of ConvRBM and Mel filterbank further improved the performance of ASR compared to the baseline TDNN with Mel filterbank features (5.4 % relative reduction in WER). The statistical significance of proposed approach is justified using a bootstrap-based % Probability of Improvement (POI) measure.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114635431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SVM Based Language Diarization for Code-Switched Bilingual Indian Speech Using Bottleneck Features","authors":"V. Spoorthy, Veena Thenkanidiyoor, Dileep Aroor Dinesh","doi":"10.21437/SLTU.2018-28","DOIUrl":"https://doi.org/10.21437/SLTU.2018-28","url":null,"abstract":"This paper proposes an SVM-based language diarizer for code-switched bilingual Indian speech. Code-switching corresponds to usage of more than one language within a single utterance. Language diarization involves identifying code-switch points in an utterance and segmenting it into homogeneous language segments. This is very important for Indian context because every Indian is at least bilingual and code-switching is inevitable. For building an effective language diarizer, it is helpful to consider phonotactic features. In this work, we propose to consider bottleneck features for language diarization. Bottleneck features correspond to output of a narrow hidden layer of a multilayer neural network trained to perform phone state classification. The studies conducted using the standard datasets have shown the effectiveness of the proposed approach.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128935215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali","authors":"Oddur Kjartansson, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, Linne Ha","doi":"10.21437/SLTU.2018-11","DOIUrl":"https://doi.org/10.21437/SLTU.2018-11","url":null,"abstract":"We present speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. Each corpus consists of an average of approximately 200k recorded utterances that were provided by native-speaker volunteers in the respective region. Recordings were made using portable consumer electronics in reasonably quiet environments. For each recorded utterance the textual prompt and an anonymized hexadecimal identifier of the speaker are available. Biographical information of the speakers is unavailable. In particular, the speakers come from an unspeci-fied mix of genders. The recordings are suitable for research on acoustic modeling for speech recognition, for example. To validate the integrity of the corpora and their suitability for speech recognition research, we provide simple recipes that illustrate how they can be used with the open-source Kaldi speech recognition toolkit. The corpora are being made available under a Creative Commons license in the hope that they will stimulate further research on these languages.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114493229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interspeech 2018 Low Resource Automatic Speech Recognition Challenge for Indian Languages","authors":"B. M. L. Srivastava, Sunayana Sitaram, R. Mehta, K. Mohan, Pallavi Matani, Sandeepkumar Satpal, Kalika Bali, Radhakrishnan Srikanth, N. Nayak","doi":"10.21437/SLTU.2018-3","DOIUrl":"https://doi.org/10.21437/SLTU.2018-3","url":null,"abstract":"India has more than 1500 languages, with 30 of them spoken by more than one million native speakers. Most of them are low-resource and could greatly benefit from speech and language technologies. Building speech recognition support for these low-resource languages requires innovation in handling constraints on data size, while also exploiting the unique properties and similarities among Indian languages. With this goal, we organized a low-resource Automatic Speech Recognition challenge for Indian languages as part of Interspeech 2018. We released 50 hours of speech data with transcriptions for Tamil, Telugu and Gujarati, amounting to a total of 150 hours. Participants were required to only use the data we released for the challenge to preserve the low-resource setting, however, they were not restricted to work on any particular aspect of the speech recognizer. We received 109 submissions from 18 research groups and evaluated the systems in terms of Word Error Rate on a blind test set. In this paper we summarize the data, approaches and results of the challenge.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115924976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building an ASR System for Mboshi Using A Cross-Language Definition of Acoustic Units Approach","authors":"O. Scharenborg, Patrick Ebel, M. Hasegawa-Johnson, N. Dehak","doi":"10.21437/SLTU.2018-35","DOIUrl":"https://doi.org/10.21437/SLTU.2018-35","url":null,"abstract":"For many languages in the world, not enough (annotated) speech data is available to train an ASR system. Recently, we proposed a cross-language method for training an ASR system using linguistic knowledge and semi-supervised training. Here, we apply this approach to the low-resource language Mboshi. Using an ASR system trained on Dutch, Mboshi acoustic units were first created using cross-language initialization of the phoneme vectors in the output layer. Subsequently, this adapted system was retrained using Mboshi self-labels. Two training methods were investigated: retraining of only the output layer and retraining the full deep neural network (DNN). The resulting Mboshi system was analyzed by investigating per phoneme accuracies, phoneme confusions, and by visualizing the hidden layers of the DNNs prior to and following retraining with the self-labels. Results showed a fairly similar performance for the two training methods but a better phoneme representation for the fully retrained DNN.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122069798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing an IVR Based Framework for Telephony Speech Data Collection and Transcription in Under-Resourced Languages","authors":"Joyanta Basu, Soma Khan, M. S. Bepari, Rajib Roy, Madhab Pal, Sushmita Nandi","doi":"10.21437/SLTU.2018-10","DOIUrl":"https://doi.org/10.21437/SLTU.2018-10","url":null,"abstract":"Scarcity of digitally available language resources restricts development of large scale speech applications in Indian scenario. This paper describes a unique design framework for telephony speech data collection in under-resourced languages using interactive voice response (IVR) technology. IVR systems provide a fast, reliable, automated and relatively low cost medium for simultaneous multilingual audio resource collection from remote users and help in structured storage of resources for further usage. The framework needs IVR hardware & API, related software tools and text resources as its necessary components. Detailed functional design and development process of such a running IVR system are stepwise elaborated. Sample IVR call-flow design templates and offline audio transcription procedure is also presented for ease of understanding. Entire methodology is language independent and is adaptable to similar tasks in other languages and specially beneficial to accelerate resource creation process in under-resourced languages, minimizing manual efforts of data collection and transcription.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122990665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving ASR for Code-Switched Speech in Under-Resourced Languages Using Out-of-Domain Data","authors":"A. Biswas, E. V. D. Westhuizen, T. Niesler, F. D. Wet","doi":"10.21437/SLTU.2018-26","DOIUrl":"https://doi.org/10.21437/SLTU.2018-26","url":null,"abstract":"We explore the use of out-of-domain monolingual data for the improvement of automatic speech recognition (ASR) of code-switched speech. This is relevant because annotated code-switched speech data is both scarce and very hard to produce, especially when the languages concerned are under-resourced, while monolingual corpora are generally better-resourced. We perform experiments using a recently-introduced small five-language corpus of code-switched South African soap opera speech. We consider specifically whether ASR of English– isiZulu code-switched speech can be improved by incorporating monolingual data from unrelated but larger corpora. TDNN-BLSTM acoustic models are trained using various configura-tions of training data. The utility of artificially-generated bilingual English–isiZulu text to augment language model training data is also explored. We find that English-isiZulu speech recognition accuracy can be improved by incorporating mono-lingual out-of-domain data despite the differences between the soap-opera and monolingual speech.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128047631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparative Study of SMT and NMT: Case Study of English-Nepali Language Pair","authors":"P. Acharya, B. Bal","doi":"10.21437/SLTU.2018-19","DOIUrl":"https://doi.org/10.21437/SLTU.2018-19","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116086742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hindi Speech Vowel Recognition Using Hidden Markov Model","authors":"Shobha Bhatt, A. Dev, Anurag Jain","doi":"10.21437/SLTU.2018-42","DOIUrl":"https://doi.org/10.21437/SLTU.2018-42","url":null,"abstract":"Machine learning has revolutionised speech technologies for major world languages, but these technologies have generally not been available for the roughly 4,000 languages with populations of fewer than 10,000 speakers. This paper describes the development of Elpis, a pipeline which language documentation workers with minimal computational experience can use to build their own speech recognition models, resulting in models being built for 16 languages from the Asia-Pacific region. Elpis puts machine learning speech technologies within reach of people working with languages with scarce data, in a scalable way. This is impactful since it enables language communities to cross the digital divide, and speeds up language documentation. Complete automation of the process is not feasible for languages with small quantities of data and potentially large vocabularies. Hence our goal is not full automation, but rather to make a practical and effective workflow that integrates machine learning technologies.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115273459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}