A. Chotimongkol, K. Saykhum, P. Chootrakool, N. Thatphithakkul, C. Wutiwiwatchai
2009 Oriental COCOSDA International Conference on Speech Database and Assessments. Published 2009-10-02. DOI: 10.1109/ICSDA.2009.5278377 (https://doi.org/10.1109/ICSDA.2009.5278377)
LOTUS-BN: A Thai broadcast news corpus and its research applications
This paper describes the design and construction of the LOTUS-BN corpus, a Thai television broadcast news corpus. In addition to audio recordings and their transcriptions, the corpus includes detailed annotation of several characteristics of broadcast news data, such as acoustic condition, overlapping speech, news topic, and named entities. LOTUS-BN is an ongoing project with the goal of collecting 100 hours of speech. We report initial statistics from 60 hours of speech, which show that the corpus has a rich vocabulary of approximately 26,000 words, one third of which are named entities. The corpus is therefore a good resource for developing a large-vocabulary continuous speech recognition (LVCSR) system and for investigating named entity detection and recognition, in addition to other broadcast news applications. Research applications on these topics are also discussed.