Authors: N. Trung, Q. Nguyen, N. Phuong
Venue: 2012 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future
Published: 2012-03-15
DOI: https://doi.org/10.1109/rivf.2012.6169816
Vietnamese Diacritics Restoration as Sequential Tagging
Diacritics restoration is the process of recovering the original text from diacritic-free text by inserting the correct diacritics. In this paper, the problem is cast as a sequential tagging task in which each term is tagged with its own accents. We carefully evaluated two methods, conditional random fields (CRFs) and support vector machines (SVMs), on three domains of Vietnamese: written language, spoken language, and literature, and achieved promising results. We also investigated two lexical levels: learning from letters and learning from syllables. Although the former performs worse than the latter, it gives stable results across all three language domains. The letter-level approach is therefore more useful for handling unknown words, or sentences in which words are reordered and repeated for stylistic and artistic effect.
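To make the tagging formulation concrete, the following is a minimal Python sketch of how diacritized text could be turned into (input, tag) pairs at the letter level and at the syllable level. It is an illustration only, not the authors' encoding or feature set: the function names `letter_tags`, `syllable_tags`, and `restore`, the `STROKE` tag, and the choice to use the removed combining marks as the letter tag are all assumptions made here. A CRF or SVM tagger would then be trained to predict the tag from the diacritic-free input and its context.

```python
import unicodedata

# Hypothetical tag for the stroke in "đ"/"Đ", which Unicode does not
# decompose into a base letter plus a combining mark.
STROKE = "STROKE"


def letter_tags(text):
    """Letter level: return one (base_letter, tag) pair per character.

    The tag is the string of combining marks removed from the letter
    (empty when the letter carries no diacritic), or STROKE for đ/Đ.
    """
    pairs = []
    for ch in unicodedata.normalize("NFC", text):
        if ch == "đ":
            pairs.append(("d", STROKE))
        elif ch == "Đ":
            pairs.append(("D", STROKE))
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            marks = "".join(c for c in decomposed if unicodedata.combining(c))
            pairs.append((base, marks))
    return pairs


def syllable_tags(text):
    """Syllable level: return one (stripped_syllable, tag) pair per token.

    Here the tag is simply the fully diacritized syllable.
    """
    pairs = []
    for syllable in text.split():
        stripped = "".join(base for base, _ in letter_tags(syllable))
        pairs.append((stripped, unicodedata.normalize("NFC", syllable)))
    return pairs


def restore(pairs):
    """Rebuild diacritized text from letter-level (base, tag) pairs."""
    out = []
    for base, tag in pairs:
        if tag == STROKE:
            out.append("đ" if base == "d" else "Đ")
        else:
            out.append(unicodedata.normalize("NFC", base + tag))
    return "".join(out)
```

Under this encoding the letter-level tag set is small and closed, so a tagger can still label a syllable it never saw in training, whereas the syllable-level tag set grows with the vocabulary. That is one plausible reading of why the abstract reports the letter level as less accurate but more stable across domains.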