Multilevel Classification of Pakistani News using Machine Learning

2021 22nd International Arab Conference on Information Technology (ACIT). Pub Date: 2021-12-21. DOI: 10.1109/acit53391.2021.9677431

Anum Ilyas, S. Obaid, N. Bawany



Abstract

The abundance of online news sources benefits readers, who can gather news from a diverse set of outlets. However, classifying the large volume of text generated on a regular basis is not a simple task. This textual information becomes valuable only when it is processed to maximize its usefulness, which automated text classification makes possible. Natural Language Processing (NLP) and machine learning techniques have been applied extensively to address this challenge. Text classification is useful in several scenarios, such as product mining and emotion or sentiment analysis. News classification is one such application, in which news content is processed and analyzed to assign one or more predefined labels. This research focuses on classifying Pakistani news obtained from a dataset available on Open Data Pakistan. We applied several machine learning algorithms, namely Logistic Regression, Random Forest, Support Vector Machine, and Naïve Bayes, for first-level classification, and Logistic Regression for multilevel classification. A comparative analysis of these algorithms is also presented. We achieved a maximum accuracy of 97.8% with Support Vector Machine in single-level classification and 83% with Logistic Regression in multilevel text classification.
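The two-stage idea in the abstract, where a first-level label is assigned and then refined into a second-level label, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names Logistic Regression for the multilevel step, while the sketch below swaps in a small hand-written multinomial Naïve Bayes so that it runs without external libraries. The class names (`NaiveBayes`, `TwoLevelClassifier`), the whitespace tokenization, the category and subcategory labels, and the toy headlines are all hypothetical; the actual corpus and label set come from Open Data Pakistan and are not reproduced here.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)         # number of documents per label
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class TwoLevelClassifier:
    """Level 1 picks a category; a per-category model picks the subcategory."""

    def fit(self, texts, categories, subcategories):
        self.top = NaiveBayes().fit(texts, categories)
        self.sub = {}
        for cat in set(categories):
            rows = [(t, s) for t, c, s in zip(texts, categories, subcategories)
                    if c == cat]
            self.sub[cat] = NaiveBayes().fit([t for t, _ in rows],
                                             [s for _, s in rows])
        return self

    def predict(self, text):
        cat = self.top.predict(text)
        return cat, self.sub[cat].predict(text)


# Hypothetical toy headlines and labels, for illustration only.
TRAIN = [
    ("cricket team wins the test match", "sports", "cricket"),
    ("bowler takes five wickets in cricket final", "sports", "cricket"),
    ("hockey team scores late goal", "sports", "hockey"),
    ("hockey federation names new coach", "sports", "hockey"),
    ("stock market index closes higher", "business", "markets"),
    ("shares rally as market gains", "business", "markets"),
    ("central bank raises interest rate", "business", "banking"),
    ("bank announces new loan policy", "business", "banking"),
]

texts, cats, subs = zip(*TRAIN)
model = TwoLevelClassifier().fit(texts, cats, subs)
print(model.predict("cricket match ends in a draw"))
print(model.predict("bank cuts interest rate"))
```

One consequence of this design is that a first-level mistake cannot be corrected at the second level, because the subcategory model is chosen from the predicted category. That is one plausible reason multilevel accuracy would be lower than single-level accuracy, although the abstract does not state the cause of the gap between 97.8% and 83%.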

