Sarcasm detection using news headlines dataset

Rishabh Misra, Prahal Arora
AI Open, Volume 4, 2023, Pages 13-18. DOI: 10.1016/j.aiopen.2023.01.001
https://www.sciencedirect.com/science/article/pii/S2666651023000013
Citations: 7

Abstract

Sarcasm has long been an elusive concept for humans. Owing to its interesting linguistic properties, sarcasm detection has gained traction in the Natural Language Processing (NLP) research community over the past few years. However, predicting sarcasm in text remains difficult for machines as well, and there are limited insights into what makes a sentence sarcastic. Past studies in sarcasm detection use either large-scale datasets collected with tag-based supervision or small-scale manually annotated datasets. The former are noisy in terms of labels and language, whereas the latter, despite having high-quality labels, do not have enough instances to train deep learning models reliably. To overcome these shortcomings, we introduce a high-quality and relatively larger-scale dataset, which is a collection of news headlines from a sarcastic news website and a real news website. We describe the unique aspects of our dataset and compare its characteristics with those of other benchmark datasets in the sarcasm detection domain. Furthermore, we produce insights into what constitutes sarcasm in a text using a Hybrid Neural Network architecture. As the dataset was first released in 2019, we dedicate a section to how the NLP research community has extensively relied upon our contributions to push the state of the art further in the sarcasm detection domain. Lastly, we make the dataset as well as the framework implementation publicly available to facilitate continued research in this domain.
