Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation: ACM Transactions on Asian and Low-Resource Language Information Processing
IF 1.8 · CAS Tier 4, Computer Science · JCR Q3, Computer Science, Artificial Intelligence
{"title":"Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation: The Detection of Multilingual South African Abusive Language using Skip-gram and Domain Adaptation: ACM Transactions on Asian and Low-Resource Language Information Processing: Vol 0, No ja","authors":"Oluwafemi Oriola, Eduan Kotzé","doi":"10.1145/3638759","DOIUrl":null,"url":null,"abstract":"<p>The distinctiveness and sparsity of low-resource multilingual South African abusive language necessitate the development of a novel solution to automatically detect different classes of abusive language instances using machine learning. Skip-gram has been used to address sparsity in machine learning classification problems but is inadequate in detecting South African abusive language due to the considerable amount of rare features and class imbalance. Joint Domain Adaptation has been used to enlarge features of a low-resource target domain for improved classification outcomes by jointly learning from the target domain and large-resource source domain. This paper, therefore, builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. Contrary to the existing Joint Domain Adaptation approaches, a Joint Multilevel Domain Adaptation model involving adaptation of monolingual source domain instances and multilingual target domain instances with high frequency of rare features was executed at the first level, and adaptation of target-domain features and first-level features at the next level. Both surface-level and embedding word features were used to evaluate the proposed model. In the evaluation of surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models with accuracy of 0.92 and F1-score of 0.68. In the evaluation of embedding features, the proposed model outperformed the state-of-the-art models with accuracy of 0.88 and F1-score of 0.64. The Joint Multilevel Domain Adaptation model significantly improved the average information gain of the rare features in different language categories and reduced class imbalance.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"27 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2023-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3638759","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The distinctiveness and sparsity of low-resource multilingual South African abusive language necessitate a novel solution for automatically detecting different classes of abusive language instances using machine learning. Skip-gram has been used to address sparsity in machine-learning classification problems, but it is inadequate for detecting South African abusive language because of the considerable number of rare features and the class imbalance. Joint Domain Adaptation enlarges the feature set of a low-resource target domain, and thereby improves classification outcomes, by learning jointly from the target domain and a large-resource source domain. This paper therefore builds a Skip-gram model based on Joint Domain Adaptation to improve the detection of multilingual South African abusive language. In contrast to existing Joint Domain Adaptation approaches, the proposed Joint Multilevel Domain Adaptation model first adapts monolingual source-domain instances to multilingual target-domain instances with a high frequency of rare features, and then adapts the target-domain features to the features learned at the first level. Both surface-level and word-embedding features were used to evaluate the proposed model. With surface-level features, the Joint Multilevel Domain Adaptation model outperformed the state-of-the-art models, achieving an accuracy of 0.92 and an F1-score of 0.68. With embedding features, it outperformed the state-of-the-art models with an accuracy of 0.88 and an F1-score of 0.64. The Joint Multilevel Domain Adaptation model significantly improved the average information gain of rare features across the different language categories and reduced class imbalance.
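To make the core idea concrete, the sketch below illustrates, under stated assumptions, how pooling a large-resource monolingual source corpus with a low-resource multilingual target corpus before training Skip-gram embeddings can give rare target-domain words better context statistics, with the resulting vectors then feeding an abusive-language classifier. The corpora, hyperparameters, and classifier are illustrative placeholders; this is not the authors' actual pipeline or configuration.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# first level - jointly learn Skip-gram embeddings from source- and
# target-domain instances; next level - represent target instances with the
# jointly learned features and train the classifier on them.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Hypothetical tokenised corpora: a larger monolingual source domain and a
# small multilingual South African target domain (labels are made up).
source_corpus = [["you", "are", "stupid"], ["have", "a", "nice", "day"]]
target_corpus = [["jy", "is", "dom"], ["uyisilima"], ["enjoy", "the", "weekend"]]
target_labels = [1, 1, 0]  # 1 = abusive, 0 = not abusive

# Train Skip-gram (sg=1) on the pooled corpus so rare target words share
# training signal with the larger source corpus.
joint_corpus = source_corpus + target_corpus
w2v = Word2Vec(sentences=joint_corpus, vector_size=100, window=5,
               sg=1, min_count=1, epochs=50)

def doc_vector(tokens, model):
    """Average the Skip-gram vectors of the tokens found in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Build target-domain feature vectors from the jointly learned embeddings and
# fit a simple classifier on them.
X = np.vstack([doc_vector(doc, w2v) for doc in target_corpus])
clf = LogisticRegression(max_iter=1000).fit(X, target_labels)
print(clf.predict(X))
```

In practice the paper also evaluates surface-level features and a multilevel adaptation of features, which this embedding-only sketch does not attempt to reproduce.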
Journal description:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers that address theory, systems design, evaluation, and applications in the above subjects are appropriate for TALLIP. Emphasis will be placed on the originality and practical significance of the reported research.