Shaun Anthony Little, Kaushik Roy, Ahmed Al Hamoud
2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), published 2022-12-01. DOI: 10.1109/ICMLA55696.2022.00238 (https://doi.org/10.1109/ICMLA55696.2022.00238)
Performance Benchmark of Machine Learning-Based Methodology for Swahili News Article Categorization
As data volumes grow at unprecedented rates, so does the need to classify that data, including news articles. Unfortunately, most research on news article categorization targets global languages such as English or Spanish, and little of it considers low-resource languages like Swahili. Testing multiple classifiers and preprocessing methods, we show that an SVM model with tokenization and stop word removal achieves the highest accuracy (85.13%) for Swahili news article categorization. These results, obtained on the first publicly available peer-reviewed Swahili news article dataset, provide a performance benchmark for Swahili news article categorization and contribute to the still-limited body of Swahili text classification research.
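The preprocessing named in the abstract (tokenization followed by stop word removal) can be sketched roughly as below. This is a minimal illustration and not the authors' pipeline: the regular expression, the function names `tokenize` and `preprocess`, the example sentence, and the short `STOP_WORDS` list of common Swahili function words are all assumptions made for the example. The abstract does not say which stop word list or tokenizer was used.

```python
import re

# Hypothetical stop word list for illustration only (not the paper's list):
# a handful of common Swahili function words.
STOP_WORDS = {"na", "ya", "wa", "kwa", "ni", "za", "la", "katika"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(text):
    """Tokenize the text, then drop every token found in STOP_WORDS."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


# Example headline-style sentence (made up for this sketch).
tokens = preprocess("Timu ya taifa imeshinda mechi katika uwanja wa Dar es Salaam")
print(tokens)
```

In a full categorization pipeline, token lists like this would typically be turned into numeric features (for example, term counts) before being passed to a classifier such as an SVM. That step is omitted here because the abstract gives no details about it.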