P. Andono, Pieter Santoso Hadi, Muljono Muljono, Catur Supriyanto
Journal of Advances in Information Technology, 2023. DOI: 10.12720/jait.14.2.233-241 (https://doi.org/10.12720/jait.14.2.233-241)
Clickbait Detection in Indonesian News Title with Gray Unbalanced Class Based on BERT
Bahasa Indonesia is spoken by about 263 million people worldwide, yet it is still classified as an under-resourced language. Clickbait detection in news has gained attention in recent years, but resources for the task in Indonesian remain scarce. Clickbait attracts readers' attention even though the underlying content is uninformative or misleading. Clickbait datasets are also imbalanced: the classes are unequally distributed, which affects the classification result. This research proposes focal loss to improve classification accuracy without reducing the amount of original data, that is, without discarding samples. Clickbait data are normally separated into two classes, clickbait and non-clickbait. However, some titles are difficult to categorize, even for humans. This study therefore categorizes titles into three classes: clickbait, non-clickbait, and gray-clickbait. The proposed method achieves an accuracy of 93.4% on two-class classification, which is better than previous studies, but only 73.3% on three-class classification. Our research shows a high similarity between gray-clickbait and clickbait data, which makes classification more challenging. The results also indicate that titles alone are not enough to provide better classification performance in the three-class setting.
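To illustrate the idea behind focal loss, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the commonly cited form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The function name `focal_loss`, the default `gamma=2.0`, the optional per-class weights `alpha`, and the label order in `LABELS` are all assumptions made for illustration; the abstract does not state the hyperparameters used.

```python
import math
from typing import Optional, Sequence

# Hypothetical label order for the three-class setting described in the abstract.
LABELS = ["non-clickbait", "clickbait", "gray-clickbait"]


def focal_loss(
    logits: Sequence[float],
    target: int,
    gamma: float = 2.0,  # assumed default, not taken from the paper
    alpha: Optional[Sequence[float]] = None,  # assumed optional class weights
) -> float:
    """Focal loss for one example given raw class logits.

    Computes -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the
    softmax probability assigned to the true class.
    """
    # Log-softmax via the log-sum-exp trick for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p_t = logits[target] - log_norm
    p_t = math.exp(log_p_t)

    weight = 1.0 if alpha is None else alpha[target]
    # The (1 - p_t)**gamma factor shrinks the loss of well-classified examples,
    # so hard (often minority-class) examples dominate the total loss.
    return -weight * (1.0 - p_t) ** gamma * log_p_t


if __name__ == "__main__":
    gray = LABELS.index("gray-clickbait")
    confident = [0.0, 0.0, 5.0]   # model strongly predicts the true class
    mistaken = [0.0, 5.0, 0.0]    # model confuses gray-clickbait with clickbait
    print("easy example:", focal_loss(confident, gray))
    print("hard example:", focal_loss(mistaken, gray))
```

With `gamma=0` and no `alpha`, the function reduces to ordinary cross-entropy. Increasing `gamma` down-weights examples the model already classifies confidently, which is how the loss can address class imbalance without removing any training data.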