{"title":"带上下文信息的泰语文本-音素评价的异音语料库","authors":"C. Hansakunbuntheung, Sumonmas Thatphithakkul","doi":"10.1109/ICSDA.2017.8384421","DOIUrl":null,"url":null,"abstract":"Heteronyms, which are texts with multiple pronunciations depending on their contexts, is a crucial problem in text-to- phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not enough to evaluate the heteronym issue. Furthermore, in no-word- break languages e.g. Thai, the issue of orthographic groups with multiple possible word segmentation is another major cause of ambiguous pronunciations. Thus, this paper proposes \"ConPro\" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversions. The keys of the corpus design include 1) multiple-word orthographic group as the basic unit, 2) pragmatic and compact contextual texts as evaluating texts, 3) Categorial Matrix tags for representing orthographic types and usage domains of orthographic groups, and, investigating problem categories in text-to-phoneme conversions, and, 4) pronunciation-and- meaning-prioritized heteronym collecting for extending the coverage of heteronyms and contexts.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"373-375 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ConPro: Heteronym pronunciation corpus with context information for text-to-phoneme evaluation in Thai\",\"authors\":\"C. 
Hansakunbuntheung, Sumonmas Thatphithakkul\",\"doi\":\"10.1109/ICSDA.2017.8384421\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Heteronyms, which are texts with multiple pronunciations depending on their contexts, is a crucial problem in text-to- phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not enough to evaluate the heteronym issue. Furthermore, in no-word- break languages e.g. Thai, the issue of orthographic groups with multiple possible word segmentation is another major cause of ambiguous pronunciations. Thus, this paper proposes \\\"ConPro\\\" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversions. The keys of the corpus design include 1) multiple-word orthographic group as the basic unit, 2) pragmatic and compact contextual texts as evaluating texts, 3) Categorial Matrix tags for representing orthographic types and usage domains of orthographic groups, and, investigating problem categories in text-to-phoneme conversions, and, 4) pronunciation-and- meaning-prioritized heteronym collecting for extending the coverage of heteronyms and contexts.\",\"PeriodicalId\":255147,\"journal\":{\"name\":\"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)\",\"volume\":\"373-375 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment 
(O-COCOSDA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSDA.2017.8384421\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSDA.2017.8384421","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ConPro: Heteronym pronunciation corpus with context information for text-to-phoneme evaluation in Thai
Heteronyms, texts with multiple pronunciations depending on their context, are a crucial problem in text-to-phoneme conversion. Conventional pronunciation corpora that collect only grapheme-phoneme pairs are not sufficient for evaluating the heteronym issue. Furthermore, in languages written without word breaks, e.g. Thai, orthographic groups with multiple possible word segmentations are another major cause of ambiguous pronunciations. Thus, this paper proposes the "ConPro" corpus, a context-dependent pronunciation corpus of Thai heteronyms with systematic collection and context information for evaluating the accuracy of text-to-phoneme conversion. The key elements of the corpus design are 1) multiple-word orthographic groups as the basic unit, 2) pragmatic and compact contextual texts as evaluation texts, 3) Categorial Matrix tags for representing the orthographic types and usage domains of orthographic groups and for investigating problem categories in text-to-phoneme conversion, and 4) pronunciation-and-meaning-prioritized heteronym collection for extending the coverage of heteronyms and contexts.
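As a rough illustration of how a context-dependent corpus of this kind could be used for evaluation, the Python sketch below scores a text-to-phoneme function against entries that pair an orthographic group with a contextual text and a reference pronunciation. Everything in it is an assumption made for illustration: the `Entry` fields, the tag values, the informal romanized pronunciations, and the two sample entries are not taken from ConPro, whose actual format and notation are not described in the abstract. The samples use the well-known Thai ambiguity ตากลม, which can be segmented as ตา-กลม or ตาก-ลม.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Entry:
    """One hypothetical corpus entry (field names are assumptions)."""
    group: str       # orthographic group, possibly spanning several words
    context: str     # short contextual text that contains the group
    phonemes: str    # reference pronunciation of the group in this context
    ortho_type: str  # stand-in for an orthographic-type tag
    domain: str      # stand-in for a usage-domain tag


# A text-to-phoneme function takes (context, group) and returns phonemes.
T2P = Callable[[str, str], str]


def accuracy(entries: Iterable[Entry], t2p: T2P) -> float:
    """Fraction of entries whose predicted phonemes match the reference."""
    items = list(entries)
    if not items:
        raise ValueError("no entries to evaluate")
    correct = sum(1 for e in items if t2p(e.context, e.group) == e.phonemes)
    return correct / len(items)


# Illustrative entries only; pronunciations are informal romanizations.
ENTRIES = [
    Entry("ตากลม", "เธอตากลม", "taa klom", "multi-segmentation", "general"),
    Entry("ตากลม", "ออกไปตากลม", "taak lom", "multi-segmentation", "general"),
]


def context_free_t2p(context: str, group: str) -> str:
    """Baseline that ignores context and always returns one reading."""
    return "taa klom"


print(accuracy(ENTRIES, context_free_t2p))  # 0.5
```

A converter that ignores context can get at most one reading of each heteronym right, which is why a corpus of bare grapheme-phoneme pairs cannot expose this class of error.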