社交媒体文本中字母复制产生的文字游戏检测

P. Hirankan, A. Suchato, P. Punyabukkana
{"title":"社交媒体文本中字母复制产生的文字游戏检测","authors":"P. Hirankan, A. Suchato, P. Punyabukkana","doi":"10.1109/JCSSE.2013.6567310","DOIUrl":null,"url":null,"abstract":"Wordplay generated by letters of its original word being repeated is commonly found in social network texts. Most of the time, wordplay items of this type are ambiguous to machines in language processing tasks such as Text-to-Speech. This paper shows some statistics on the number of letters from 102,586 real social network text items and proposes a set of classification features together with a few classification frameworks to detect repeated-letter wordplay tokens from Thai social network texts, which were tokenized by CRF-based Thai word segmentation. Evaluation on 48,949 text items shows that the proposed method achieves the detection accuracy of 98.45% which is an improvement over simple rule-based and some previously proposed methods.","PeriodicalId":199516,"journal":{"name":"The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Detection of wordplay generated by reproduction of letters in social media texts\",\"authors\":\"P. Hirankan, A. Suchato, P. Punyabukkana\",\"doi\":\"10.1109/JCSSE.2013.6567310\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Wordplay generated by letters of its original word being repeated is commonly found in social network texts. Most of the time, wordplay items of this type are ambiguous to machines in language processing tasks such as Text-to-Speech. This paper shows some statistics on the number of letters from 102,586 real social network text items and proposes a set of classification features together with a few classification frameworks to detect repeated-letter wordplay tokens from Thai social network texts, which were tokenized by CRF-based Thai word segmentation. Evaluation on 48,949 text items shows that the proposed method achieves the detection accuracy of 98.45% which is an improvement over simple rule-based and some previously proposed methods.\",\"PeriodicalId\":199516,\"journal\":{\"name\":\"The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/JCSSE.2013.6567310\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCSSE.2013.6567310","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

摘要

通过重复原词的字母而产生的文字游戏在社交网络文本中很常见。大多数时候,这种类型的文字游戏项目对于语言处理任务(如文本到语音)中的机器来说是模糊的。本文对102,586个真实社交网络文本项目的字母数量进行了统计,并提出了一组分类特征和一些分类框架来检测来自泰国社交网络文本的重复字母文字游戏标记,这些标记通过基于crf的泰语分词进行了标记。对48,949个文本条目的评估表明,该方法的检测准确率达到98.45%,比简单的基于规则的方法和之前提出的一些方法有了提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Detection of wordplay generated by reproduction of letters in social media texts
Wordplay generated by letters of its original word being repeated is commonly found in social network texts. Most of the time, wordplay items of this type are ambiguous to machines in language processing tasks such as Text-to-Speech. This paper shows some statistics on the number of letters from 102,586 real social network text items and proposes a set of classification features together with a few classification frameworks to detect repeated-letter wordplay tokens from Thai social network texts, which were tokenized by CRF-based Thai word segmentation. Evaluation on 48,949 text items shows that the proposed method achieves the detection accuracy of 98.45% which is an improvement over simple rule-based and some previously proposed methods.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信