Multi character frequency based encoding for efficient text messaging in Indian Languages

2016 Twenty Second National Conference on Communication (NCC). Pub Date: 2016-03-01. DOI: 10.1109/NCC.2016.7561128

Manu Seth, Sourya Basu, Shivam Chaturvedi, R. Hegde


Citations: 0

Abstract

Short Message Service (SMS) over cell phones is a widely used mode of data communication. Current encoding schemes allow 160 characters per SMS in English. This drops to 70 characters per SMS when any Indian language, including Hindi, is used, because of the Unicode format used for such messages. Schemes proposed to improve the encoding efficiency of short text messages generally encode one character at a time, and table-splitting schemes that reduce the average number of bits per character are commonly used in this context. This paper proposes a novel multi-character frequency-based encoding scheme for efficient transmission of short text messages in four Indian languages. Both uni-gram and bi-gram modelling based schemes are proposed. The efficiency of the proposed schemes is evaluated through experiments on a large multilingual database of short text messages collected from Twitter, using a dictionary learning approach. Performance evaluation shows that these encoding schemes allow around 190 characters per SMS in English and more than 165 characters per SMS for the four Indian languages. Encoding efficiency is significantly improved compared with existing state-of-the-art table marker algorithms, which makes the schemes practical candidates for transmitting short text messages in Indian languages.
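The capacity figures in the abstract follow from the 140-byte (1120-bit) user-data payload of a single SMS: 7 bits per character gives 160 characters, and 16 bits per character gives 70. On that basis, 190 and 165 characters per SMS would correspond to averages of roughly 5.9 and 6.8 bits per character. The Python sketch below illustrates the general idea of a multi-character frequency table with greedy longest-match encoding. It is a toy illustration under stated assumptions, not the scheme from the paper: the function names, the table construction, the fixed-width codes, and the three-message corpus are all hypothetical.

```python
from collections import Counter

SMS_PAYLOAD_BITS = 140 * 8  # user-data payload of a single SMS, in bits


def chars_per_sms(bits_per_char: float) -> int:
    """Characters that fit in one SMS at a given average cost per character."""
    return int(SMS_PAYLOAD_BITS // bits_per_char)


def build_table(corpus: list[str], size: int) -> dict[str, int]:
    """Assign codes to all single characters, then to the most frequent bi-grams.

    Assumption for this toy: every character seen in the corpus gets a code,
    so corpus messages always stay encodable. The table can therefore exceed
    `size` if the corpus has more distinct characters than `size`.
    "Character" here means a Unicode code point, not a grapheme cluster.
    """
    singles = Counter(ch for msg in corpus for ch in msg)
    pairs = Counter(msg[i:i + 2] for msg in corpus for i in range(len(msg) - 1))
    symbols = sorted(singles)
    room = max(size - len(symbols), 0)
    symbols += [pair for pair, _ in pairs.most_common(room)]
    return {sym: code for code, sym in enumerate(symbols)}


def encode(msg: str, table: dict[str, int]) -> list[int]:
    """Greedy longest match: try a two-character entry first, then one character."""
    codes, i = [], 0
    while i < len(msg):
        pair = msg[i:i + 2]
        if len(pair) == 2 and pair in table:
            codes.append(table[pair])
            i += 2
        else:
            codes.append(table[msg[i]])  # raises KeyError for unseen characters
            i += 1
    return codes


def decode(codes: list[int], table: dict[str, int]) -> str:
    """Invert the table and concatenate the decoded symbols."""
    inverse = {code: sym for sym, code in table.items()}
    return "".join(inverse[code] for code in codes)


# Hypothetical three-message corpus, used only to exercise the functions.
corpus = ["नमस्ते दोस्त", "नमस्ते भारत", "दोस्त कहाँ हो"]
table = build_table(corpus, size=64)
message = corpus[1]
codes = encode(message, table)

# With a 64-entry table, each code would cost 6 bits if codes are fixed-width.
avg_bits = 6 * len(codes) / len(message)
print(len(message), "characters ->", len(codes), "codes")
print("average bits per character:", round(avg_bits, 2))
print("characters per SMS at that rate:", chars_per_sms(avg_bits))
```

Fixed-width codes keep the sketch short. A frequency-based scheme could instead give shorter codes to more frequent entries, which this toy does not attempt.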
