News Text Categorization using Random Forest and Naïve Bayes
Authors: Upasana Parida, M. Nayak, A. Nayak
Venue: 2021 1st Odisha International Conference on Electrical Power Engineering, Communication and Computing Technology (ODICON)
Published: 2021-01-08
DOI: 10.1109/ODICON50556.2021.9428925 (https://doi.org/10.1109/ODICON50556.2021.9428925)
Citations: 6
Abstract
As the world becomes increasingly digitized, the amount of data in every field is growing exponentially. Retrieving valuable information from this unstructured, unorganized raw data is challenging and time-consuming. Many techniques have been proposed in the field of information retrieval to organize such data efficiently. Text categorization, also known as topic spotting, is one of them: a sub-technique of text classification that assigns documents to predetermined categories based on their content. This experimental study is carried out on the standard Reuters benchmark news data set using the machine learning techniques Random Forest and Naïve Bayes. The TF-IDF Vectorizer and the Count Vectorizer are used to extract features from the data set. Chi-square is then used to reduce the extracted feature set to the best features in order to improve performance. The results are reported using two metrics, accuracy and the kappa statistic, to analyze the effect of the different feature extraction and classification techniques on the news data set. This experimental evaluation is intended to support future research on applying different machine learning techniques to news data sets.
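The two evaluation metrics named in the abstract can be illustrated with a small sketch. It assumes that "kappa statistic" refers to Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The function names and the toy labels are hypothetical and are not taken from the paper; the labels do not represent the paper's results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Both label lists contain a single identical class; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for illustration only (assumed, not the paper's data).
y_true = ["earn", "earn", "acq", "acq", "crude", "earn"]
y_pred = ["earn", "acq", "acq", "acq", "crude", "earn"]

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy = {acc:.4f}, kappa = {kappa:.4f}")
```

Kappa is lower than accuracy here because part of the agreement would be expected by chance given the class frequencies, which is one reason to report both on an imbalanced news corpus.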