Development of Text and Speech Corpus for Designing the Multilingual Recognition System
S. Bansal, S. Agrawal
2018 Oriental COCOSDA - International Conference on Speech Database and Assessments, 2018-05-01
DOI: 10.1109/ICSDA.2018.8693013
Manually creating a multilingual speech and text corpus is a difficult and time-consuming task. This paper presents the overall methodology for, and the experience gained from, collecting text and speech data for three under-resourced languages: Hindi, Manipuri and Urdu. The text data were collected by web crawling in three domains (general, news and travel) to capture the diversity of the database across these languages. The main objective of the project is to collect a text and speech database that can be used to train multilingual spoken language identification systems. In total, we collected a text corpus of three million words and an audio corpus of 150 speakers (50 native speakers of each language). Each speaker recorded 300 phonetically rich sentences created through text analysis. The speech utterances were recorded through a microphone at a sampling rate of 16 kHz using the GOLDWAVE software tool in a sound-treated room. The collected speech data sets were manually annotated at the phonemic level for each language and made available for the development of multilingual recognition systems.
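The abstract does not say how the phonetically rich sentences were derived from the crawled text. As an illustration only, one common approach to this kind of selection is a greedy coverage search: repeatedly pick the candidate sentence that adds the most not-yet-covered units. The sketch below assumes character bigrams as a crude stand-in for phonetic units; the function names `units` and `select_rich_sentences` are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from typing import Iterable


def units(sentence: str) -> set[str]:
    """Return the character bigrams of a sentence, ignoring whitespace.

    Assumption: bigrams stand in for phonetic units; a real pipeline would
    first convert text to phonemes with a language-specific tool.
    """
    text = "".join(sentence.split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_rich_sentences(candidates: Iterable[str], target: int) -> list[str]:
    """Greedily choose up to `target` sentences that maximise unit coverage."""
    pool = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    covered: set[str] = set()
    chosen: list[str] = []
    while pool and len(chosen) < target:
        # Pick the sentence contributing the most units not yet covered.
        best = max(pool, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= units(best)
        pool.remove(best)
    return chosen
```

Under these assumptions, calling `select_rich_sentences(crawled_sentences, 300)` would return at most 300 sentences per language, stopping early if the remaining candidates add no new units.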