Roman to Gurmukhi Social Media Text Normalization

Int. J. Intell. Comput. Cybern. Pub Date: 2020-10-30 DOI: 10.1108/ijicc-08-2020-0096

J. Kaur, J. Singh


Citations: 2

Abstract



Purpose: Normalization is an important step in any natural language processing application that handles social media text. Social media text poses problems that are not present in regular text. A considerable amount of recent work addresses these problems, but mostly for English. People who are not native English speakers code-mix their text with their native language and post it on social media using the Roman script, which further aggravates the normalization problem. This paper discusses the concept of normalization with respect to code-mixed social media text and proposes a model to normalize such text.

Design/methodology/approach: The system is divided into two phases: candidate generation and most probable sentence selection. Candidate generation is treated as a machine translation task in which Roman text is the source language and Gurmukhi text is the target language. A character-based translation system is proposed to generate candidate tokens. Once the candidates are generated, the second phase uses beam search to select the most probable sentence based on a hidden Markov model.

Findings: Character error rate (CER) and bilingual evaluation understudy (BLEU) scores are reported. The proposed system is compared with the Akhar software and the RB_R2G system, both of which can also transliterate Roman text to Gurmukhi. The proposed system outperforms the Akhar software. For ill-formed text, the CER and BLEU scores are 0.268121 and 0.6807939, respectively.

Research limitations/implications: The system was observed to produce dialectal variants of a word, or the word with minor errors such as a missing diacritic. A spell checker could improve the output by correcting these minor errors. Extensive experimentation is needed to optimize the language identifier, which would further improve the output. The language model also needs further exploration. Inclusion of wider context, particularly from social media text, is an important area that deserves further investigation.

Practical implications: The practical implications of this study are: (1) development of a parallel dataset containing Roman and Gurmukhi text; (2) development of a dataset annotated with language tags; (3) development of the normalizing system, which is the first of its kind and proposes a translation-based solution for normalizing noisy social media text from Roman to Gurmukhi, and which can be extended to any pair of scripts; (4) use of the proposed system for better analysis of social media text. Theoretically, the study helps in better understanding text normalization in the social media context and opens the door to further research in multilingual social media text normalization.

Originality/value: Existing research focuses on normalizing monolingual text. This study contributes towards the development of a normalization system for multilingual text.
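
The abstract gives no implementation details, so the following Python sketch only illustrates the general shape of the second phase and of the CER metric. It assumes that each input token comes with a list of candidate words and log-probabilities (standing in for the output of the character-based translation system), and that the hidden Markov model is approximated by bigram transition scores. The names `beam_search`, `bigram_logp`, `edit_distance`, and `cer`, as well as the toy probabilities, are hypothetical and are not taken from the paper.

```python
import math


def beam_search(candidates, bigram_logp, beam_width=3):
    """Select a high-scoring word sequence from per-token candidate lists.

    candidates: one list per input token of (word, emission_log_prob) pairs.
    bigram_logp: function (previous_word, word) -> transition log-probability.
    Returns (best_sequence, best_log_score).
    """
    beams = [(0.0, ["<s>"])]  # each hypothesis is (log score, words so far)
    for options in candidates:
        expanded = []
        for score, seq in beams:
            for word, emit_lp in options:
                new_score = score + emit_lp + bigram_logp(seq[-1], word)
                expanded.append((new_score, seq + [word]))
        # Keep only the beam_width best partial sentences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    best_score, best_seq = beams[0]
    return best_seq[1:], best_score  # drop the "<s>" start marker


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Characters are Unicode code points here; counting grapheme clusters
    instead is a design choice this sketch does not make.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy data with placeholder words and made-up probabilities.
toy_candidates = [
    [("a1", math.log(0.6)), ("a2", math.log(0.4))],
    [("b1", math.log(0.5)), ("b2", math.log(0.5))],
]
toy_bigrams = {
    ("<s>", "a1"): 0.5, ("<s>", "a2"): 0.5,
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.9,
    ("a2", "b1"): 0.5, ("a2", "b2"): 0.5,
}


def toy_bigram_logp(prev, word):
    # Unseen bigrams get a small floor probability instead of zero.
    return math.log(toy_bigrams.get((prev, word), 1e-6))


best, log_score = beam_search(toy_candidates, toy_bigram_logp, beam_width=2)
print(best, math.exp(log_score))
print(cer("kitten", "sitting"))
```

In this toy run the transition scores decide between the two equally likely candidates for the second token, which is the role the abstract assigns to the sentence-selection phase. A wider beam trades speed for a lower risk of pruning the best sentence early.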
