HOMO-MEX: A Mexican Spanish Annotated Corpus for LGBT+phobia Detection on Twitter

Juan Vásquez, S. Andersen, G. Bel-Enguix, Helena Gómez-Adorno, Sergio-Luis Ojeda-Trueba
Published in: The 7th Workshop on Online Abuse and Harms (WOAH)
DOI: 10.18653/v1/2023.woah-1.20 (https://doi.org/10.18653/v1/2023.woah-1.20)

Abstract

In the past few years, the NLP community has actively worked on detecting LGBT+phobia in online spaces using publicly available textual data. Most of this work targets English and its variants, since English is the language most studied by the NLP community. Efforts to create corpora in other languages are nevertheless active worldwide. Even so, Spanish remains understudied with regard to digital LGBT+phobia. The only corpus we found in the literature covers Peninsular Spanish dialects, which use LGBT+phobic terms different from those in the Mexican dialect. For this reason, we present HOMO-MEX, a novel corpus for detecting LGBT+phobia in Mexican Spanish. In this paper, we describe our data-gathering and annotation process. We also present a classification benchmark using various traditional machine learning algorithms and two pre-trained deep learning models to showcase the classification potential of our corpus.
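
The abstract does not name the traditional algorithms used in the benchmark. As a rough illustration of what such a baseline can look like, the sketch below trains a bag-of-words multinomial Naive Bayes classifier using only the Python standard library. The choice of Naive Bayes, the label names `phobic` and `not_phobic`, the whitespace tokenizer, and the four example sentences are all assumptions made for illustration; none of them is taken from the paper or from the HOMO-MEX corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder sentences, NOT drawn from the HOMO-MEX corpus.
train_texts = [
    "odio a esa gente",
    "no soporto a esa gente",
    "feliz marcha del orgullo",
    "orgullo y respeto para todos",
]
# Hypothetical label names; the paper's annotation scheme may differ.
train_labels = ["phobic", "phobic", "not_phobic", "not_phobic"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("feliz orgullo")
print(prediction)
```

A baseline of this kind mainly serves as a reference point: if a pre-trained deep learning model does not clearly beat simple word-count features, that usually says more about the data or the evaluation setup than about the model.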