Hebrew offensive language taxonomy and dataset

Q2 Arts and Humanities
Chaya Liebeskind, N. Vanetik, Marina Litvak
Journal: Lodz Papers in Pragmatics, vol. 13, pp. 325-351
Published: 2023-12-01 (journal article)
DOI: 10.1515/lpp-2023-0017 (https://doi.org/10.1515/lpp-2023-0017)
Citation count: 0

Abstract

This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in a literature that has so far focused largely on Indo-European languages. Our taxonomy divides offensive language into seven levels: six explicit and one implicit. It builds on the simplified offensive language (SOL) taxonomy introduced by Lewandowska-Tomaszczyk et al. (2021a), which we adjust to Hebrew in the hope that the adjusted taxonomy reflects the language's unique linguistic and cultural nuances. The study combines linguistic and cultural analysis that goes beyond Natural Language Processing (NLP); in particular, we used manual linguistic analysis to understand the nuances of offensive language in Hebrew. We describe in detail an accompanying dataset, gathered from Twitter and manually curated by human annotators. The dataset was constructed both to validate the taxonomy and to serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals patterns and distributions that underscore the complexity and specificity of offensive expressions in Hebrew, which our work aims to capture beyond what automated NLP methods alone can provide. Our findings highlight the importance of considering linguistic and cultural variation when researching and correcting abusive language online. We believe that the taxonomy and its associated dataset will be crucial to advancing research in Hebrew sociocultural studies, natural language processing, and offensive language detection. The study also makes a substantial contribution to work on low-resource languages and can serve as a model for future research on other languages.
Source journal: Lodz Papers in Pragmatics (Arts and Humanities, Language and Linguistics). CiteScore: 1.10. Self-citation rate: 0.00%. Articles published: 0.