{"title":"Multi-Modal Multi-Task Deep Learning For Speaker And Emotion Recognition Of TV-Series Data","authors":"Sashi Novitasari, Quoc Truong Do, S. Sakti, D. Lestari, Satoshi Nakamura","doi":"10.1109/ICSDA.2018.8693020","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693020","url":null,"abstract":"Since paralinguistic aspects must be considered to understand speech, we construct a deep learning framework that utilizes multi-modal features to simultaneously recognize both speakers and emotions. There are three kinds of feature modalities: acoustic, lexical, and facial. To fuse the features from multiple modalities, we experimented on three methods: majority voting, concatenation, and hierarchical fusion. The recognition was done from TV-series dataset that simulate actual conversations.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114521407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Japanese-English Code-Switching Speech Data Construction","authors":"Sahoko Nakayama, Takatomo Kano, Quoc Truong Do, S. Sakti, Satoshi Nakamura","doi":"10.1109/ICSDA.2018.8693044","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693044","url":null,"abstract":"As the number of Japanese-English bilingual speakers continues to increase, code-switching phenomena also happen more frequently. The units and locations of switches may vary widely from single word switches to whole phrases (beyond the length of the loanword units). Therefore, speech recognition systems must be developed that can handle not only Japanese or English but also Japanese-English code-switching. Consequently, a large-scale code-switching speech database is required for model training. But collecting natural conversation dialogues of Japanese-English data is both time-consuming and expensive. This paper presents the construction of Japanese-English code-switching speech data by utilizing a Japanese and English text-to-speech system from a bilingual speaker. Various switching units are also investigated including units of words and phrases. As a result, we successfully constructed over 280-k speech utterances of Japanese-English code-switching.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132897888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech Corpora of Under Resourced Languages of North-East India","authors":"Barsha Deka, Joyshree Chakraborty, Abhishek Dey, Shikhamoni Nath, Priyankoo Sarmah, S. Nirmala, K. Samudravijaya","doi":"10.1109/ICSDA.2018.8693038","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693038","url":null,"abstract":"In this paper, we present an account of an ongoing effort in creation of speech corpora of under-resourced languages of North-East India, namely, Assamese, Bengali and Nepali. The speech corpora are being created for development of Automatic Speech Recognition system in Assamese as well as for Language Identification system. The text corpus of Assamese language comprises of 1000 sentences collected from different sources such as story books, novels, proverbs. Speech data are recorded over telephone channel using an interactive voice response system. Speakers were asked to read one or more sets of sentences, each set containing 20 sentences. Speech was simultaneously recorded using a hand-held audio recorder. While significant amount of speech data has been collected for Assamese language, the task has begun for Bengali, Nepali and English spoken by native speakers of these 3 languages. Currently, the Assamese speech database contains more than 5000 utterances by 27 native speakers. Information about the speakers such as dialect, gender, age-group were also collected. 
We discuss the methodology used in collecting speech samples, and present a descriptive statistics of the speech corpora.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131116529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic Comparison of Vowel Articulation When Combined with Different Tone Categories in Mandarin","authors":"Chong Cao, Yanlu Xie, Jinsong Zhang","doi":"10.1109/ICSDA.2018.8693015","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693015","url":null,"abstract":"It was found that there existed an interaction between the source (i.e., fundamental frequencies) and the vocal tract filter (i.e., formant frequencies). Previous studies investigated such interaction from a perspective of perception with evidence from Mandarin which uses four tones to distinguish lexical meanings. While few studies examined such interaction from a perspective of production. This study explored differences of formant frequencies in vowel articulation when combined with different fundamental frequency patterns (i.e., tones). We calculated frequencies of the first two formants (i.e., F1, F2) and their distance (i.e., F2-F1) of different vowels with four lexical tones. Results showed that both F1 and F2 values were significantly different when combined with different tones. Moreover, such interaction varied with vowels: high vowels usually presented a contrary correlation pattern compared with other vowels. The finding about the co-variation between formants and fundamental frequencies may help to improve the naturalness of speech synthesis.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129507882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AWA Long-Term Recorded Speech Corpus And Robust Speaker Recognition Method For Session Variability","authors":"S. Tsuge, S. Kuroiwa, Tomoko Ohsuga, Y. Ishimoto","doi":"10.1109/ICSDA.2018.8693004","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693004","url":null,"abstract":"Session variability is one of the most important issues in the speaker recognition technology. On the other hand, our scientific interest lies in how individual voice changes as time progresses and where the limit of the changes. From these motivations, we have been constructing “AWA Long-Term Recorded speech corpus (AWA-LTR)” that contains one's same content speech recorded at morning, noon, and evening once a week for over 10 years using the same microphone in a soundproof chamber. AWA-LTR first version has been released by Speech Resources Consortium, National Institute of Informatics (NII-SRC), Japan in 2012. In addition, we will release AWA-LTR second version in 2018. Hence, in this paper, we describe the details of AWA-LTR and the data release schedule of this corpus. As an effective application example using the corpus, we propose a robust speaker recognition method for session variability and evaluate the proposed method by the speaker identification experiment in this paper.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115206424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"URDU Speech Corpora for Banking Sector in Pakistan","authors":"B. Mumtaz, Sahar Rauf, Hafsa Qadir, J. Khalid, T. Habib, S. Hussain, Rukhsana Barkat, E. Haq","doi":"10.1109/ICSDA.2018.8693010","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693010","url":null,"abstract":"This research describes an effort to build Urdu speech corpora for the banking sector in Pakistan. We have designed speech corpora to develop debit card activation ASR and these corpora are comprised of eight types of corpora mainly debit card number corpus, expiry date corpus, last four digit corpus, months' name, date of birth corpus, account type and Urdu-counting corpus. These corpora contain telephone speech in read style obtained from more than 400 speakers specifically in Punjabi accent in both outdoor and indoor environments, including offices, homes, banks, and universities. The speech is automatically annotated and manually verified at sentence tier and reports 98% inter-annotator accuracy. In this paper, we report the design, recording and annotation process of speech corpora that serve as a data development step for ASR, and will be integrated in debit card activation service in banking sector of Pakistan.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133829483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of Text and Speech Corpus for Designing the Multilingual Recognition System","authors":"S. Bansal, S. Agrawal","doi":"10.1109/ICSDA.2018.8693013","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693013","url":null,"abstract":"To create the multilingual speech and text corpus manually is very difficult and time-consuming task. This paper presents the overall methodology and experiences of text and speech data collection for three under resourced languages i.e., Hindi, Manipuri and Urdu. The text data collection is done through web crawling in 3 domains i.e., general, news and travel to capture the versatility of database among these languages. The main objective of this project is to collect text and speech database which can be used for training the multilingual spoken language identification systems. In total we collected a text corpus of three million words and audio corpus of 150 speakers (50 native speakers) of each language. Each speaker recorded 300 phonetically rich sentences created through text analysis. The speech utterances were recorded at the rate of 16 kHz through microphone using GOLDWAVE software tool in a sound treated room. The collected speech data sets were annotated manually at phonemic level for each language and made available for development of multilingual recognition system.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129370501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Dependency Corpus Annotation for Myanmar Language","authors":"Hnin Thu Zar Aye, Win Pa Pa, Ye Kyaw Thu","doi":"10.1109/ICSDA.2018.8693009","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693009","url":null,"abstract":"Dependency parsing can provide the connection of linguistic unit (words) by a directed links. This paper presents annotating a general domain corpus by using unsupervised approach by applying Universal part-of-speech (U-POS) to build Treebank for unsupervised dependency parsing of Myanmar Language. Up to now it is still hard task to obtain complete syntactic structures for Myanmar Language. Dependency structures of words in Myanmar sentences are also presented of general words and phrases orders and the relations of basic sentence structures. To annotate by using U-POS, UDPipe is used. Moreover, the preliminary results of annotated trees and parsing experiment are presented. Parsing experiments are evaluated by UDPipe in terms of unlabeled and labeled attachment scores: (UAS) and (LAS), which are 93.20%, and 91.21% in test experiment respectively.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124457396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic Features Of Mandarin Diphthongs By Uyghur Learners At Primary Level","authors":"Yultuz Rapkat, Gulnur Arkin, Mijit Ablimit, A. Hamdulla","doi":"10.1109/ICSDA.2018.8693014","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693014","url":null,"abstract":"From the perspective of experimental phonetics, this paper makes an acoustic comparison analysis of the diphthongs Uyghur and Chinese college speakers, and examines the situation of primary-level Uyghur learners’ acquisition of Chinese Mandarin diphthongs. A total of 132 samples (including 9 diphthongs) are extracted from the recorded corpus, and the formants of the vowel are statistically analyzed. The characteristics and the distributions of the formants are analyzed to investigate the acoustic characteristics. Finally, combined with the experimental results, the Uyghur learners’ at primary level acquisition of diphthongs will be further discussed and analyzed. The purpose of this paper is to understand the Uyghur college learners’ acquisition of Chinese Mandarin diphthongs tracks and to provide the correct reference data for the Computer Assisted Language Learning System of Uyghur Learning Chinese Mandarin.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"35 23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133715545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phonetic Realization Of Information Structures In Chinese English Learners’ Reading Texts","authors":"Xinyi Wen, Yuan Jia, Ai-jun Li","doi":"10.1109/ICSDA.2018.8693006","DOIUrl":"https://doi.org/10.1109/ICSDA.2018.8693006","url":null,"abstract":"The present study aims to investigate the phonetic realization of information structure in L2, by comparing the productions of English discourse from Beijing English learners and from native English speakers. Phonetic and statistical analyses are conducted on English reading texts selected from Asian English Speech cOrpus Project (AESOP). The main findings include: Beijing English learners do not distinguish the given and new information with pitch range as native English speakers do, which is the main difference between the two speaker groups; the slight differences found on duration and mean pitch value might result from other factors rather than phonetic strategies utilized in information packaging. Besides, the difference between Beijing English learners' performance in lexical and referential levels mainly lies in the duration of accessible information.","PeriodicalId":303819,"journal":{"name":"2018 Oriental COCOSDA - International Conference on Speech Database and Assessments","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133898229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}