Detection of wordplay generated by reproduction of letters in social media texts

The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE), Pub Date: 2013-05-29, DOI: 10.1109/JCSSE.2013.6567310

P. Hirankan, A. Suchato, P. Punyabukkana


Citations: 4


Detection of wordplay generated by reproduction of letters in social media texts

Wordplay created by repeating letters of an original word is common in social network texts. Such wordplay items are usually ambiguous to machines in language processing tasks such as Text-to-Speech. This paper presents statistics on the number of letters in 102,586 real social network text items and proposes a set of classification features, together with a few classification frameworks, for detecting repeated-letter wordplay tokens in Thai social network texts tokenized by CRF-based Thai word segmentation. Evaluation on 48,949 text items shows that the proposed method achieves a detection accuracy of 98.45%, an improvement over simple rule-based methods and some previously proposed methods.
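To make the task concrete, here is a minimal sketch of the kind of simple rule-based baseline the abstract compares against. It is not the authors' method: the function names, the run-length threshold of 3, and the sample tokens are all assumptions chosen for illustration, and the abstract does not specify the paper's actual rules or classification features.

```python
import re


def has_repeated_run(token: str, min_run: int = 3) -> bool:
    """Return True if any character occurs min_run or more times in a row.

    Assumption: min_run=3 is an illustrative threshold, not a value
    taken from the paper.
    """
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return re.search(pattern, token) is not None


def collapse_runs(token: str, min_run: int = 3) -> str:
    """Collapse each run of min_run or more identical characters to one."""
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return re.sub(pattern, r"\1", token)


# Hypothetical pre-tokenized input; the paper uses CRF-based Thai word
# segmentation, which is not reproduced here.
tokens = ["มากกกกก", "มาก", "soooo", "book"]
flags = [has_repeated_run(t) for t in tokens]
```

A rule like this only looks at run length, so it cannot tell deliberate wordplay from other repetition and it misses wordplay with short runs. That limitation is presumably why a feature-based classifier is proposed instead.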
