Evaluation of Feature Extraction for Indonesian News Classification

Kevin Djajadinata, Hussein Faisol, G. F. Shidik, Muljono, A. Z. Fanani
Published in: 2020 International Seminar on Application for Technology of Information and Communication (iSemantic)
Publication date: 2020-09-19
DOI: 10.1109/iSemantic50169.2020.9234252
Citations: 1

Abstract

News is information about knowledge or events occurring within a certain period. Text news can be classified into several categories. This research evaluates feature extraction methods for classifying Indonesian-language news. The datasets come from www.cnnindonesia.com (May 2018 to July 2018), with 4 categories and 3677 items in total, and from www.liputan6.com, with 4 categories and 3415 items in total. All data are processed into structured form, and features are then extracted with 8 feature extraction methods (TF, TF-IDF, TF-RF, TF-Prob, TF-CHI, TF-IDF-ISCDF, TF-IGM, and RTF-IGM) combined with 6 classification algorithms (Gaussian Naïve Bayes, k-NN, Decision Tree, Neural Network, Logistic Regression, and Support Vector Machine). The study concludes that the Gaussian Naïve Bayes algorithm with TF-Prob obtained the best accuracy, 99.701% (CNN Indonesia) and 99.824% (Liputan6), under 5-fold cross-validation.
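
To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of TF and TF-IDF term weighting together with a 5-fold split. It is not the authors' implementation: the function names, the toy corpus, the textbook `log(N / df)` form of IDF, and the contiguous (unshuffled, unstratified) fold layout are all assumptions made for illustration, and the other six weighting schemes and the classifiers are not shown.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Raw term counts for one tokenized document (the plain TF scheme)."""
    return Counter(tokens)


def inverse_document_frequency(docs):
    """Textbook IDF: idf(t) = log(N / df(t)), df(t) = documents containing t.

    Assumption: the paper may use a smoothed or otherwise different variant.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * idf."""
    idf = inverse_document_frequency(docs)
    return [
        {term: count * idf[term] for term, count in term_frequency(doc).items()}
        for doc in docs
    ]


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds.

    Assumption: no shuffling or stratification, which a real experiment
    would normally add.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical toy corpus; these sentences are not taken from either dataset.
docs = [
    "harga saham naik".split(),
    "tim nasional menang".split(),
    "harga emas turun".split(),
]
weights = tf_idf(docs)

# 3677 is the CNN Indonesia dataset size reported in the abstract.
folds = list(k_fold_indices(3677, k=5))
```

In this toy corpus, a term that appears in two of the three documents ("harga") receives a lower weight than a term unique to one document ("saham"), which is the intuition behind IDF-style weighting. Each fold holds out roughly one fifth of the 3677 items for testing while the rest are used for training.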