Aman Farooq, Zainab Noreen, Safiyah Batool, Fouzia Naz
Published in: 2022 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), 2022-10-27. DOI: 10.1109/MAJICC56935.2022.9994152
Urdu News Classification: An Empirical Study Using Machine Learning Techniques
Text is a rich source of information, and the volume of text on the internet is effectively unlimited. Automatic text classification labels text documents with predefined categories and has applications including sentiment analysis, spam detection, and other NLP tasks. Much work has been done on English text classification, but there is a large gap for Urdu. No standard algorithm is known to outperform all others. Classifiers also usually perform better when the text is preprocessed, but no standard stemmer, stop word list, or tokenizer is available for Urdu text. Urdu is morphologically rich, which makes designing preprocessing tools for it a challenge. This research aims to reduce the gap by testing different classification algorithms with different combinations of dimensionality reduction techniques on an Urdu news data set to determine which performs best. It also includes designing a stemmer and a tokenizer and preparing a stop word list. The study concluded that SVM performed best with the combination of both preprocessing techniques. The fastText library was also tested for Urdu text classification; it achieved 95% accuracy and an F-score 1% lower than SVM. In another approach, topic modeling was performed using LDA, and documents were weighted as topics. Classification using documents as topics did not perform well, although Random Forest performed better than Naive Bayes and SVM in that setting. Future work includes designing a POS tagger that may improve the performance of the stemmer and testing deep learning methods for Urdu text classification.
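The preprocess-then-classify pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word set, the punctuation rule, and the four training sentences are made up for illustration, stemming is omitted, and a small multinomial Naive Bayes (one of the baselines named above) stands in for the SVM so that the example needs only the Python standard library.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop words only (a few common Urdu function words);
# this is an assumption, not the stop word list prepared in the paper.
STOP_WORDS = {"کی", "کے", "کا", "میں", "ہے", "اور", "نے", "سے", "کو"}

# Assumed tokenization rule: split on whitespace and common Urdu/Latin punctuation.
_SPLIT = re.compile(r"[\s۔،؟!.,:;()]+")


def tokenize(text):
    """Split raw text into tokens, dropping empty strings."""
    return [t for t in _SPLIT.split(text) if t]


def preprocess(text):
    """Tokenize and remove stop words (no stemming in this sketch)."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(preprocess(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        tokens = preprocess(doc)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / n_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, not taken from the paper's news data set.
train_docs = [
    "پاکستان نے کرکٹ میچ جیت لیا۔",
    "ٹیم نے میچ میں شاندار کھیل پیش کیا۔",
    "حکومت نے نیا بجٹ پیش کیا۔",
    "اسٹاک مارکیٹ میں تیزی کا رجحان ہے۔",
]
train_labels = ["sports", "sports", "business", "business"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("کرکٹ ٹیم نے میچ جیت لیا"))
```

Swapping in a different classifier or adding a stemming step would only change `preprocess` or the model class, which is roughly how the abstract's comparison of algorithm and preprocessing combinations can be organized.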