Lexical normalization model for noisy SMS text

2014 First International Conference on Computational Systems and Communications (ICCSC). Pub Date: 2014-12-01. DOI: 10.1109/COMPSC.2014.7032621

Greety Jose, Nisha S. Raj


Citations: 3

Abstract


In day-to-day life, digitally mediated interaction and communication are an important constituent. The rapid growth of electronic communication such as e-mail, microblogs, SMS, and chat has produced extensively noisy forms of text, predominantly among young urbanites. This growth of noise in text is due to a variety of factors, such as the small number of characters allowed per message (160 characters per SMS and 140 characters per tweet), the invention of new abbreviations, the use of non-standard orthographic forms, and phonetic substitution. In this paper we introduce a lexical normalization model for cleaning noisy text. The normalization is based on a channelized database. The model captures user interaction to improve its accuracy. Preliminary evaluation shows that the channel model normalizes noisy words to their standard counterparts with 97.5% accuracy.
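The abstract does not describe how the channelized database is organized or how user interaction is fed back into it. As a rough illustration only, and not the authors' method, the general idea of lookup-based normalization that improves with user feedback can be sketched as follows. The class name `LookupNormalizer`, its methods, and the sample entries (`u`, `2moro`, `gr8`) are all hypothetical and chosen for this example.

```python
from collections import defaultdict


class LookupNormalizer:
    """Toy normalizer: maps a noisy token to its most often confirmed standard form.

    Illustrative assumption only; the paper's channelized database is not specified.
    """

    def __init__(self, seed=None):
        # counts[noisy][standard] = how many times this mapping was confirmed
        self.counts = defaultdict(lambda: defaultdict(int))
        for noisy, standard in (seed or {}).items():
            self.counts[noisy.lower()][standard] += 1

    def normalize_token(self, token):
        candidates = self.counts.get(token.lower())
        if not candidates:
            return token  # unknown tokens pass through unchanged
        # choose the standard form with the highest confirmation count
        return max(candidates, key=candidates.get)

    def normalize(self, text):
        # whitespace tokenization keeps the sketch simple
        return " ".join(self.normalize_token(t) for t in text.split())

    def record_feedback(self, noisy, standard):
        # a user correction adds or reinforces a noisy -> standard mapping
        self.counts[noisy.lower()][standard] += 1


normalizer = LookupNormalizer({"u": "you", "2moro": "tomorrow", "gr8": "great"})
print(normalizer.normalize("c u 2moro"))   # c you tomorrow
normalizer.record_feedback("c", "see")     # simulated user correction
print(normalizer.normalize("c u 2moro"))   # see you tomorrow
```

In this sketch a word the table has never seen is left untouched until a user supplies a correction, which is one simple way user interaction could raise accuracy over time.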

