Francis Jonathan, Dong Yang, Glyn Gowing, Songjie Wei
2021 IEEE International Conference on Progress in Informatics and Computing (PIC). Published 2021-12-17. DOI: https://doi.org/10.1109/PIC53636.2021.9687001
Machine Learning Framework for Detecting Offensive Swahili Messages in Social Networks with Apache Spark Implementation
The morphological context of a language varies by community. Linguistic analysis has become more complex because of grammatical variation, cultural and traditional usage, slang, misspellings, and language variance. Many sentiment analysis studies have focused on natural language processing and people's opinions. Text processing is time-consuming and demands substantial storage and fast hardware when run over distributed networks. Many developers choose Hadoop and MapReduce to process big data. This study developed a methodology that uses Apache Spark as the processing engine for text classification because it is faster in cluster computing systems. Libraries and packages for lemmatization and stemming of African languages are still lacking. The proposed approach was used to detect offensive Swahili texts in social networks. Swahili is the third most widely spoken language in Africa. Four machine learning techniques were benchmarked, and the multinomial logistic model proved the most effective. The evaluation measures show that the proposed machine learning framework is versatile and suitable for use in both centralized and distributed systems.
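To make the classification step concrete, the following is a minimal sketch of a multinomial logistic (softmax) text classifier over bag-of-words counts. It is not the authors' Spark implementation: it is a single-machine, standard-library Python stand-in, and the function names, hyperparameters, toy messages, and the placeholder tokens `BADWORD_A` and `BADWORD_B` are all assumptions made for illustration. Tokens are only lowercased and split on whitespace, with no stemming or lemmatization, in keeping with the abstract's remark that such tools are lacking for African languages.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; no stemming or lemmatization.
    return text.lower().split()

def build_vocab(texts):
    # Map each distinct token to a feature index, in order of first appearance.
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    # Bag-of-words count vector; tokens outside the vocabulary are ignored.
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec

def softmax(scores):
    # Subtract the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, W, b):
    # One linear score per class: w_k . x + b_k
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]

def train(X, y, n_classes, epochs=200, lr=0.5):
    # Full-batch gradient descent on the softmax cross-entropy loss.
    n, n_features = len(X), len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = softmax(class_scores(x, W, b))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j, xj in enumerate(x):
                    grad_W[k][j] += err * xj
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(n_features):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b

def predict(text, vocab, W, b):
    # Return the index of the highest-scoring class.
    scores = class_scores(vectorize(text, vocab), W, b)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical toy data: label 0 = not offensive, label 1 = offensive.
# BADWORD_A and BADWORD_B are placeholders, not real vocabulary.
texts = [
    "asante sana rafiki",
    "habari za asubuhi",
    "BADWORD_A wewe",
    "wewe BADWORD_B sana",
]
labels = [0, 0, 1, 1]

vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
W, b = train(X, labels, n_classes=2)
predictions = [predict(t, vocab, W, b) for t in texts]
```

The same model form extends to more than two classes by raising `n_classes`. In a distributed setting, the tokenization, vectorization, and training steps would be handled by the cluster engine rather than by Python loops like these.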