Title: Evaluation of Text Representation Method to Detect Cyber Aggression in Hindi English Code Mixed Social Media Text
Authors: Shikha Mundra, Namita Mittal
DOI: 10.1145/3474124.3474185
URL: https://doi.org/10.1145/3474124.3474185
Venue: 2021 Thirteenth International Conference on Contemporary Computing (IC3-2021)
Publication date: 2021-08-05
Citations: 2
Abstract
With the widespread growth of modern technologies and social media networks, people spend a large amount of their time communicating and gathering information online. As a result, the number of active social media users from multilingual societies increases every year, and monitoring the aggressive and harmful content posted informally on large-scale social media has become a major challenge. A recent study showed that victims of cyber aggression suffer effects such as depression and suicide attempts, or are forced to leave the social media platform, which highlights the emerging need to understand such offensive content automatically. The majority of text on social media is written in non-English languages, but research has so far concentrated on English text only. Text understanding is therefore a major issue on social media: non-English speakers do not always use Unicode to write in their language; they use phonetic typing, frequently insert English elements, and mix multiple languages. In our work, we studied existing work in depth and investigated multiple text embedding techniques on a cyber aggression detection dataset that poses the challenge of understanding Hindi-English code-mixed text. We found that character-based embedding performs best on noisy data and can be enhanced by including only the density of aggressive words as a feature, without in-depth preprocessing. Also, our model overcomes the constraint of the availability of pre-trained word embeddings.
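
The abstract does not define how the density of aggressive words is computed or which character-level model is used, so the following is only a minimal sketch under stated assumptions. It treats density as the share of tokens found in a lexicon, and it uses plain character n-gram counts as a simple stand-in for a learned character-based embedding. The lexicon `AGGRESSIVE_WORDS`, the function names, and the feature key `__aggr_density__` are hypothetical and chosen for illustration.

```python
import string
from collections import Counter

# Hypothetical lexicon for illustration only; it is not taken from the paper.
AGGRESSIVE_WORDS = {"stupid", "idiot", "pagal", "bewakoof"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding ASCII punctuation."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t]


def aggressive_density(text, lexicon=AGGRESSIVE_WORDS):
    """Assumed definition: fraction of tokens that appear in the lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def featurize(text, n=3):
    """Combine character n-gram counts with the single density feature."""
    features = dict(char_ngrams(text, n))
    features["__aggr_density__"] = aggressive_density(text)
    return features


if __name__ == "__main__":
    sample = "tum pagal ho, idiot!"  # romanized Hindi mixed with English
    print(aggressive_density(sample))
    print(len(featurize(sample)))
```

Because the features are built from characters rather than whole words, spelling variants produced by phonetic typing still share most of their n-grams, and no pre-trained word embedding is required. The density feature adds one lexicon-based signal on top without any further preprocessing.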