Clickbait Detection in Indonesian News Title with Gray Unbalanced Class Based on BERT

Impact Factor: 1.5 | JCR: Q4 (Computer Science, Information Systems)

Journal of Advances in Information Technology | Pub Date: 2023-01-01 | DOI: 10.12720/jait.14.2.233-241

P. Andono, Pieter Santoso Hadi, Muljono Muljono, Catur Supriyanto


Citations: 0

Abstract



Bahasa Indonesia is spoken by about 263 million people worldwide, yet it is classified as an under-resourced language. Clickbait detection in news has gained attention in recent years, but resources for the task remain scarce for Indonesian. Clickbait attracts readers' attention even though the content is uninformative or misleading. Clickbait datasets are also imbalanced: classes are unequally distributed, which affects the classification result. This research proposes focal loss to improve classification accuracy without reducing the amount of original data. Clickbait data are normally separated into two classes, clickbait and non-clickbait. However, some titles are difficult to categorize, even for humans. This study therefore sorts titles into three categories: clickbait, non-clickbait, and gray-clickbait. The proposed method achieves an accuracy of 93.4% on two-class classification, which is better than previous studies, but only 73.3% on three-class classification. The results show a high similarity between gray-clickbait and clickbait data, which makes classification more challenging. They also indicate that titles alone are not enough to provide better classification performance in the three-category setting.
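To illustrate why focal loss can help with imbalanced classes, here is a minimal sketch of the standard multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. The abstract does not report the authors' settings, so the function names, `gamma=2.0`, the optional per-class `alpha` weights, and the class ordering (0 = non-clickbait, 1 = clickbait, 2 = gray-clickbait) are all assumptions made for illustration, not the paper's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy is the gamma = 0, unweighted special case."""
    return focal_loss(logits, target, gamma=0.0)


# Hypothetical logits for classes (non-clickbait, clickbait, gray-clickbait).
easy = [4.0, 0.5, 0.2]   # confidently correct for target 0
hard = [0.2, 0.5, 4.0]   # confidently wrong for target 0

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f} ratio={fl / ce:.4f}")
```

The ratio column shows the key property: the factor (1 - p_t)^gamma shrinks the loss of well-classified examples far more than that of misclassified ones, so abundant easy examples from the majority class contribute less to training while no data is discarded.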


Source journal

Journal of Advances in Information Technology (Computer Science, Information Systems)

CiteScore

4.20

Self-citation rate

20.00%

Publication count