Constructing and Analysing the MalaySarc Dataset: A Resource for Detecting and Understanding Sarcasm in Malay Language

Suziane Haslinda Suhaimi, Nur Azaliah AbuBakar, Nurulhuda Firdaus Mohd Azmi
World Congress on Electrical Engineering and Computer Systems and Science, 2023-08-01. DOI: 10.11159/cist23.126 (https://doi.org/10.11159/cist23.126)

Abstract

Social media platforms give users an efficient way to interact with content without lengthy or complex text. However, sarcasm in social media discourse has become a serious problem for researchers. Compared with English and several other major languages, research on sarcasm in Malay, and the availability of reference materials for it, still lags significantly. This study therefore develops a new dataset for Malay sarcasm detection and details each step of the process, from data collection through filtering to annotation. The dataset consists of two types of data, Facebook comments and their emotion reaction buttons, and includes 6,325 non-sarcastic texts and 1,380 sarcastic texts. A descriptive analysis of the dataset was also conducted to determine the usage patterns of the main features of Malay sarcasm. The analysis shows that emoji are among the features that play an essential role in identifying sarcastic comments. There are also pattern-based features derived from high-frequency terms in the text. The resulting dataset provides diverse examples of sarcasm that reflect the linguistic and cultural nuances of the language, which should improve the accuracy and reliability of identifying sarcasm on social media. The findings will aid future research on automatic Malay sarcasm detection models using machine learning.
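The class counts and the two feature families named in the abstract (emoji and high-frequency terms) can be illustrated with a minimal Python sketch. Only the two counts, 6,325 and 1,380, come from the abstract. The function names, the emoji character ranges, the tokenisation rule, and the sample comments are all assumptions made for illustration, and none of this is the authors' actual pipeline.

```python
import re
from collections import Counter

# Class counts as reported in the abstract.
N_NON_SARCASTIC = 6325
N_SARCASTIC = 1380

# Assumption: a rough emoji matcher covering common pictograph and symbol
# blocks. It is not a complete definition of "emoji".
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def sarcastic_share(n_sarcastic, n_non_sarcastic):
    """Fraction of all texts that are labelled sarcastic."""
    return n_sarcastic / (n_sarcastic + n_non_sarcastic)


def emoji_count(text):
    """Number of characters in `text` matched by the rough emoji pattern."""
    return len(EMOJI_RE.findall(text))


def top_terms(texts, k=3):
    """The k most frequent lowercase alphabetic tokens across `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts.most_common(k)


if __name__ == "__main__":
    share = sarcastic_share(N_SARCASTIC, N_NON_SARCASTIC)
    print(f"sarcastic share: {share:.1%}")

    # Hypothetical comments, invented for this sketch (not from MalaySarc).
    samples = ["bagus sangat 😂😂", "bagus betul"]
    print([emoji_count(s) for s in samples])
    print(top_terms(samples, k=1))
```

With the reported counts, sarcastic texts make up roughly 18% of the 7,705 texts, so a model trained on this data would probably need some form of class-imbalance handling. That is an inference from the counts, not a claim made in the abstract.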