A New Corpus and Lexicon for Offensive Tamazight Language Detection
K. Abainia, Kenza Kara, Tassadit Hamouni
Proceedings of the 7th International Workshop on Social Media World Sensors, 2022-06-28
DOI: 10.1145/3544795.3544852 (https://doi.org/10.1145/3544795.3544852)
Citations: 3
Abstract
In this paper, we address offensive language detection in Tamazight, one of the under-resourced languages for which research is still in its infancy and which lack a standard orthography. We are particularly interested in the Kabyle dialect, mainly spoken in some cities of northern Algeria (i.e. Tizi-Ouzou and Bejaïa). We propose a new corpus of offensive Tamazight language (the OTAM corpus) comprising 6.2k texts, as well as a new lexicon of offensive and abusive Tamazight words with 12.6k entries. We evaluated several machine learning and deep learning baseline classifiers, and the results show that acceptable performance can be achieved without feature engineering.