Shirley Anugrah Hayati, Aditi Chaudhary, Naoki Otani, A. Black
{"title":"Dataset Analysis and Augmentation for Emoji-Sensitive Irony Detection","authors":"Shirley Anugrah Hayati, Aditi Chaudhary, Naoki Otani, A. Black","doi":"10.18653/v1/d19-5527","DOIUrl":"https://doi.org/10.18653/v1/d19-5527","url":null,"abstract":"Irony detection is an important task with applications in identification of online abuse and harassment. With the ubiquitous use of non-verbal cues such as emojis in social media, in this work we aim to study the role of these structures in irony detection. Since the existing irony detection datasets have <10% ironic tweets with emoji, classifiers trained on them are insensitive to emojis. We propose an automated pipeline for creating a more balanced dataset.","PeriodicalId":414714,"journal":{"name":"Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)","volume":"7 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120842997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums","authors":"Ella Rabinovich, Masih Sultani, S. Stevenson","doi":"10.18653/v1/D19-5558","DOIUrl":"https://doi.org/10.18653/v1/D19-5558","url":null,"abstract":"In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far. We investigate whether findings in oral code-switching concerning content and style, as well as speaker proficiency, are carried over into written code-switching in discussion forums. The released dataset can further facilitate a range of research and practical activities.","PeriodicalId":414714,"journal":{"name":"Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133252921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}