Sukanya Dutta, Tista Saha, Somnath Banerjee, S. Naskar
2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), published 2015-07-09. DOI: 10.1109/ReTIS.2015.7232908 (https://doi.org/10.1109/ReTIS.2015.7232908)
Text normalization in code-mixed social media text
This paper addresses text normalization in code-mixed social media text, a problem often overlooked in natural language processing. The objective is to correct English spelling errors in code-mixed social media text that contains English words alongside Romanized transliterations of words from another language, in this case Bangla. This in turn requires solving a second problem: word-level language identification in code-mixed social media text. For word-level language identification, we employ a CRF-based machine learning approach followed by post-processing heuristics. For spelling correction, we use the noisy channel model. In addition, the spell checker presented here handles wordplay, contracted words and phonetic variations. Overall, word-level language identification achieved 90.5% accuracy, and the spell checker achieved 69.43% accuracy on the detected English words.
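To make the noisy channel idea concrete, the following is a minimal sketch and not the authors' implementation. Given an observed token t, it picks the dictionary word w that maximizes P(w) · P(t | w). Everything specific here is an assumption for illustration: the toy unigram counts in `WORD_COUNTS`, the fixed per-edit-distance channel probabilities in `CHANNEL`, and the helper names `edits1`, `candidates` and `correct`. It does not model the wordplay, contraction or phonetic handling mentioned in the abstract, and it assumes the token has already been identified as English.

```python
from collections import Counter

# Toy unigram counts standing in for a real English corpus (hypothetical data).
WORD_COUNTS = Counter({
    "the": 1000, "good": 120, "great": 90,
    "night": 80, "morning": 70, "tomorrow": 50,
})
TOTAL = sum(WORD_COUNTS.values())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Assumed channel model: P(typo | word) depends only on the edit distance.
CHANNEL = {0: 0.95, 1: 0.04, 2: 0.01}


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts


def candidates(typo):
    """Map each dictionary word within two edits of typo to its smallest distance."""
    found = {}
    if typo in WORD_COUNTS:
        found[typo] = 0
    one_edit = edits1(typo)
    for w in one_edit:
        if w in WORD_COUNTS:
            found.setdefault(w, 1)  # keeps distance 0 if already present
    for e1 in one_edit:
        for w in edits1(e1):
            if w in WORD_COUNTS:
                found.setdefault(w, 2)  # keeps any smaller distance
    return found


def correct(typo):
    """Return argmax over candidates of P(w) * P(typo | w); unchanged if none."""
    cands = candidates(typo)
    if not cands:
        return typo
    return max(cands, key=lambda w: (WORD_COUNTS[w] / TOTAL) * CHANNEL[cands[w]])


if __name__ == "__main__":
    for token in ["tomorow", "gud", "grat", "good"]:
        print(token, "->", correct(token))
```

In a realistic setting the language model would be estimated from a large corpus and the channel model from observed error statistics, rather than from the fixed constants assumed here.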