Detection of wordplay generated by reproduction of letters in social media texts

The 2013 10th International Joint Conference on Computer Science and Software Engineering (JCSSE). Pub Date: 2013-05-29. DOI: 10.1109/JCSSE.2013.6567310

P. Hirankan, A. Suchato, P. Punyabukkana

Citations: 4

Abstract

Wordplay formed by repeating letters of the original word is common in social network texts. Such wordplay items are usually ambiguous to machines in language processing tasks such as Text-to-Speech. This paper reports statistics on the number of letters in 102,586 real social network text items and proposes a set of classification features, together with a few classification frameworks, for detecting repeated-letter wordplay tokens in Thai social network texts tokenized by CRF-based Thai word segmentation. Evaluation on 48,949 text items shows that the proposed method achieves a detection accuracy of 98.45%, an improvement over a simple rule-based method and some previously proposed methods.
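To make the task concrete, the following is a minimal sketch of the kind of simple rule-based check that a feature-based classifier would be compared against. It is not the authors' method: the function names and the threshold `MIN_RUN = 3` are assumptions chosen for illustration, and the abstract does not specify the rules or features actually used. The sketch flags a token as a candidate when any character appears in a run of at least `MIN_RUN` consecutive copies.

```python
# Hypothetical baseline, not taken from the paper: flag tokens that contain
# a long run of one repeated character.

MIN_RUN = 3  # assumed threshold; the paper's actual rule is not stated


def longest_run(token: str) -> int:
    """Return the length of the longest run of identical consecutive characters."""
    best = 0
    current = 0
    previous = None
    for ch in token:
        # Extend the current run if the character repeats, otherwise restart at 1.
        current = current + 1 if ch == previous else 1
        previous = ch
        best = max(best, current)
    return best


def is_repeated_letter_candidate(token: str, min_run: int = MIN_RUN) -> bool:
    """Rule-based guess: True if some character repeats at least min_run times in a row."""
    return longest_run(token) >= min_run


if __name__ == "__main__":
    # Tokens are assumed to come from an upstream word segmenter.
    for token in ["มาก", "มากกกก", "good", "soooo"]:
        print(token, is_repeated_letter_candidate(token))
```

A rule like this cannot tell an intentional lengthening from a legitimately doubled letter below the threshold, or from a long run that is part of a normal word, which is one reason a classifier over richer features may do better.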


