News Text Categorization using Random Forest and Naïve Bayes

Upasana Parida, M. Nayak, A. Nayak
Published in: 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON)
Publication date: 2021-01-08
DOI: 10.1109/ODICON50556.2021.9428925 (https://doi.org/10.1109/ODICON50556.2021.9428925)
Citations: 6

Abstract

As the world becomes increasingly digitized, the amount of data in every field is growing exponentially. Retrieving valuable information from this unstructured, unorganized raw data is challenging and time-consuming. Many techniques have been proposed in the field of information retrieval to organize such data efficiently. Text categorization, also known as topic spotting, is one of them: it assigns documents to predetermined categories based on their content, and it is a sub-technique of text classification. The experimental study is carried out on the standard Reuters benchmark news data set using the machine learning techniques Random Forest and Naïve Bayes. TF-IDF Vectorizer and Count Vectorizer are used to extract features from the data set. Chi-square is used to reduce the extracted feature set, selecting the best features to accelerate performance. Results are reported using two metrics, accuracy and the kappa statistic, to analyze the effect of the different feature extraction and classification techniques on the news data set. This experimental evaluation is intended to inform future research on applying different machine learning techniques to news data sets.
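The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn, not the authors' implementation: the tiny corpus is a hypothetical stand-in for the Reuters data set, and the names `evaluate` and `results`, the value `k=10`, and the classifier settings are all assumptions chosen for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Reuters news data set.
train_docs = [
    "oil prices rise as crude output falls",
    "crude oil exports drop after pipeline shutdown",
    "opec agrees to cut oil production quota",
    "refinery fire pushes petrol prices higher",
    "wheat harvest beats forecast after good rain",
    "corn and wheat exports climb on strong demand",
    "farmers expect record grain crop this season",
    "drought damages grain fields and cuts wheat yield",
]
train_labels = ["crude", "crude", "crude", "crude",
                "grain", "grain", "grain", "grain"]
test_docs = [
    "oil production falls as opec cuts output",
    "wheat crop suffers after drought",
    "crude prices climb on pipeline fire",
    "grain exports rise after record harvest",
]
test_labels = ["crude", "grain", "crude", "grain"]


def evaluate(vectorizer, classifier, k=10):
    """Fit vectorizer -> chi-square selection -> classifier, then score it."""
    pipe = Pipeline([
        ("vec", vectorizer),                 # text -> non-negative term features
        ("chi2", SelectKBest(chi2, k=k)),    # keep the k highest-scoring terms
        ("clf", classifier),
    ])
    pipe.fit(train_docs, train_labels)
    pred = pipe.predict(test_docs)
    return accuracy_score(test_labels, pred), cohen_kappa_score(test_labels, pred)


# Compare every feature extractor with every classifier.
vectorizers = {"tfidf": TfidfVectorizer, "count": CountVectorizer}
classifiers = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
    "naive_bayes": MultinomialNB,
}
results = {}
for vec_name, make_vec in vectorizers.items():
    for clf_name, make_clf in classifiers.items():
        results[(vec_name, clf_name)] = evaluate(make_vec(), make_clf())

for combo, (acc, kappa) in results.items():
    print(combo, f"accuracy={acc:.2f}", f"kappa={kappa:.2f}")
```

Chi-square scoring requires non-negative features, which both raw counts and TF-IDF weights satisfy. The kappa statistic complements accuracy by correcting for agreement expected by chance, which matters when category sizes are unbalanced.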