Shaun Anthony Little, Kaushik Roy, Ahmed Al Hamoud
Title: Performance Benchmark of Machine Learning-Based Methodology for Swahili News Article Categorization
Published in: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)
Publication date: 2022-12-01
DOI: 10.1109/ICMLA55696.2022.00238
Citation count: 0
Abstract
As data volumes grow at unprecedented rates, so does the need to classify that data, including news articles. Unfortunately, most news article categorization research targets widely spoken languages such as English or Spanish, and little of it considers low-resource languages like Swahili. Testing multiple classifiers and preprocessing methods, we show that an SVM model with tokenization and stop word removal achieves the highest accuracy (85.13%) for Swahili news article categorization. These results, obtained on the first publicly available peer-reviewed Swahili news article dataset, provide a performance benchmark for Swahili news article categorization and contribute to lean Swahili text classification research.
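
The two preprocessing steps named in the abstract, tokenization and stop word removal, together with the accuracy metric it reports, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the stop word set, the regular expression, the function names, and the category labels in the usage note are all assumptions made for the example, since the abstract does not specify them. The classifier itself (an SVM in the paper) is left out; the cleaned tokens would be turned into features and passed to it.

```python
import re

# Assumption: a tiny hand-picked set of common Swahili function words,
# used only for illustration. The paper's actual stop word list is not
# given in the abstract.
SWAHILI_STOP_WORDS = {"na", "ya", "wa", "kwa", "ni", "katika", "la", "za"}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    The apostrophe is kept inside tokens so that spellings such as
    "ng'ombe" are not broken apart.
    """
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=SWAHILI_STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


def preprocess(text):
    """Tokenize an article and remove stop words, in that order."""
    return remove_stop_words(tokenize(text))


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For instance, `preprocess("Timu ya taifa ya Tanzania imeshinda mechi kwa mabao mawili")` keeps the content words and drops "ya" and "kwa". With hypothetical category labels, `accuracy(["michezo", "siasa", "uchumi", "afya"], ["michezo", "siasa", "afya", "afya"])` scores three correct predictions out of four.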