A system for popular Thai slang extraction from social media content with n-gram based tokenization

Rachsuda Jiamthapthaksin, Pisal Setthawong, Nitipan Ratanasawetwad
2016 8th International Conference on Knowledge and Smart Technology (KST), published 2016-02-01
DOI: 10.1109/KST.2016.7440478 (https://doi.org/10.1109/KST.2016.7440478)
Citations: 3

Abstract

With the increasing penetration of smart devices and internet connectivity, many Thais engage more readily in social media, online forums, and chat groups. As consumption of social media content grows, consumption shifts away from traditional media such as broadcast and print, in which formal language is used regularly. Social media posts reflect this trend: posts, usually made by younger generations, often involve slang and informal language that is not typically found in formal dictionaries. Because the Thai population likes to follow trends, many Thai social media users follow the latest popular trends in slang and word usage. As slang changes and evolves over time, an online mining tool that can capture emerging and popular slang is useful. This paper proposes an approach that extracts popular Thai slang by comparing social media posts, using tokenization and a dictionary-based approach to extract unknown words, and then expanding those words with an n-gram approach to determine which slang words are currently trending and popular.
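
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' system. It assumes each post has already been segmented into tokens (Thai is written without spaces between words, so a real pipeline would need a word segmenter first), treats any token missing from a dictionary as an unknown word, and counts the n-grams that contain at least one unknown token. The function names, the `max_n` and `min_count` parameters, and the romanized toy data are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def slang_candidates(posts, dictionary, max_n=3, min_count=2):
    """Count n-grams (n = 1..max_n) that contain at least one unknown token.

    posts      : list of token lists (assumed to be pre-segmented posts)
    dictionary : set of known, formal-language words
    Returns (ngram, count) pairs with count >= min_count, most frequent first.
    """
    counts = Counter()
    for tokens in posts:
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                # Keep only n-grams built around an out-of-dictionary token.
                if any(tok not in dictionary for tok in gram):
                    counts[gram] += 1
    return [(g, c) for g, c in counts.most_common() if c >= min_count]


# Hypothetical toy data: "fin" stands in for a slang word absent from the dictionary.
toy_dictionary = {"i", "am", "so", "today", "very", "happy"}
toy_posts = [
    ["i", "am", "so", "fin", "today"],
    ["so", "fin", "very", "happy"],
    ["i", "am", "happy"],
]

for gram, count in slang_candidates(toy_posts, toy_dictionary):
    print(" ".join(gram), count)
```

Counting n-grams around an unknown token, rather than the token alone, lets multi-word slang expressions surface as candidates; the `min_count` threshold is a crude stand-in for whatever popularity measure a real system would use.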