Urdu News Classification: An Empirical Study Using Machine Learning Techniques

2022 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), Pub Date: 2022-10-27, DOI: 10.1109/MAJICC56935.2022.9994152

Aman Farooq, Zainab Noreen, Safiyah Batool, Fouzia Naz


Citation count: 0

Abstract

Text is a rich source of information, and the internet holds a practically unlimited amount of it. Automatic text classification labels text documents with predefined categories and has various applications, including sentiment analysis, spam detection, and other NLP tasks. Much work has been done on English text classification, but far less on Urdu. No single algorithm is known to outperform all others. Classifiers also usually perform better on preprocessed text, yet no standard stemmer, stop word list, or tokenizer is available for Urdu. Urdu is morphologically rich, which makes designing preprocessing tools for it a challenge. This research aims to narrow the gap by testing different classification algorithms with different combinations of dimensionality reduction techniques on an Urdu news data set to determine which performs better. It also includes designing a stemmer and a tokenizer and preparing a stop word list. The study concluded that SVM performed better with the combination of both preprocessing techniques. The fastText library was also tested for Urdu text classification; it achieved 95% accuracy and an F-score 1% lower than SVM's. In another approach, topic modeling was performed using LDA and documents were weighted as topics. Classification using documents as topics did not perform well, although Random Forest performed better than Naive Bayes and SVM in that setting. Future work includes designing a POS tagger that may improve the performance of the stemmer and testing deep learning methods for Urdu text classification.
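
The abstract compares classifiers by accuracy and F-score. As a minimal sketch of how those two figures can be computed for a multi-class news data set, the Python below uses only the standard library. The function names, the category labels, and the choice of macro averaging for the F-score are assumptions made for illustration; the abstract does not say which averaging the authors used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores.

    Macro averaging is an assumption here, not something the abstract states.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for category t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the gold label but was missed

    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical gold labels and predictions for six news articles.
    gold = ["sports", "sports", "business", "business", "world", "world"]
    pred = ["sports", "business", "business", "business", "world", "sports"]
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Reporting both numbers is useful when categories are imbalanced: accuracy can stay high while a small category is classified poorly, whereas a macro-averaged F-score weights every category equally.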
