News Text Categorization using Random Forest and Naïve Bayes

2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON). Pub Date: 2021-01-08. DOI: 10.1109/ODICON50556.2021.9428925

Upasana Parida, M. Nayak, A. Nayak


Citations: 6

Abstract

As the world is gradually digitized in every aspect, the amount of data in every field is growing exponentially. Retrieving valuable information from this unstructured, unorganized raw data is challenging and time-consuming. Many techniques have been proposed in the field of information retrieval to organize such data efficiently. Text categorization is one of them: it assigns documents to predetermined categories according to their contents. It is a sub-technique of text classification and is also known as topic spotting. The experimental study is carried out on a standard benchmark news data set from Reuters using two machine learning techniques, Random Forest and Naïve Bayes. A TF-IDF vectorizer and a count vectorizer are used to extract features from the data set. The chi-square test is then used to reduce the extracted feature set, selecting the best features to accelerate performance. The results are reported using two metrics, accuracy and the kappa statistic, to analyze the effect of the different feature extraction and classification techniques on the news data set. This experimental evaluation is intended to support future research on the use of different machine learning techniques on news data sets.
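The abstract names chi-square feature selection and two evaluation metrics (accuracy and the kappa statistic) without giving formulas. The following is a minimal illustrative sketch of those three quantities in plain Python, not the authors' implementation. The toy documents, the labels `crude` and `grain`, and all function names are assumptions made up for this example; the kappa shown is Cohen's kappa, which is assumed to be the statistic the abstract refers to.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant row or column carries no information
    return n * (a * d - b * c) ** 2 / denom


def term_scores(docs, labels, target):
    """Score every term by its chi-square association with class `target`."""
    tokenized = [set(doc.split()) for doc in docs]
    vocab = set().union(*tokenized)
    scores = {}
    for term in vocab:
        a = b = c = d = 0
        for tokens, label in zip(tokenized, labels):
            has_term = term in tokens
            in_class = label == target
            if has_term and in_class:
                a += 1
            elif has_term:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[term] = chi_square_2x2(a, b, c, d)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if predictions were independent of the truth.
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # convention chosen here for the single-class case
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data, invented for illustration only.
docs = ["oil prices rise", "oil exports fall",
        "wheat harvest rise", "wheat exports grow"]
labels = ["crude", "crude", "grain", "grain"]

scores = term_scores(docs, labels, target="crude")
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]

y_pred = ["crude", "grain", "grain", "grain"]  # hypothetical classifier output
acc = accuracy(labels, y_pred)
kappa = cohen_kappa(labels, y_pred)
```

In this toy data, `oil` and `wheat` each separate the two classes perfectly and receive the highest scores, while `rise` and `exports` appear equally in both classes and score zero, so a top-k filter would discard them. Kappa is lower than accuracy here because part of the observed agreement would be expected by chance alone.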



