Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions

Saied Alshahrani, Esma Wali, Jeanna Neefe Matthews
{"title":"Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions","authors":"Saied Alshahrani, Esma Wali, Jeanna Neefe Matthews","doi":"10.18653/v1/2022.wanlp-1.34","DOIUrl":null,"url":null,"abstract":"Wikipedia is a common source of training data for Natural Language Processing (NLP) research, especially as a source for corpora in languages other than English. However, for many downstream NLP tasks, it is important to understand the degree to which these corpora reflect representative contributions of native speakers. In particular, many entries in a given language may be translated from other languages or produced through other automated mechanisms. Language models built using corpora like Wikipedia can embed history, culture, bias, stereotypes, politics, and more, but it is important to understand whose views are actually being represented. In this paper, we present a case study focusing specifically on differences among the Arabic Wikipedia editions (Modern Standard Arabic, Egyptian, and Moroccan). In particular, we document issues in the Egyptian Arabic Wikipedia with automatic creation/generation and translation of content pages from English without human supervision. These issues could substantially affect the performance and accuracy of Large Language Models (LLMs) trained from these corpora, producing models that lack the cultural richness and meaningful representation of native speakers. Fortunately, the metadata maintained by Wikipedia provides visibility into these issues, but unfortunately, this is not the case for all corpora used to train LLMs.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"4 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Arabic Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.wanlp-1.34","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Wikipedia is a common source of training data for Natural Language Processing (NLP) research, especially as a source for corpora in languages other than English. However, for many downstream NLP tasks, it is important to understand the degree to which these corpora reflect representative contributions of native speakers. In particular, many entries in a given language may be translated from other languages or produced through other automated mechanisms. Language models built using corpora like Wikipedia can embed history, culture, bias, stereotypes, politics, and more, but it is important to understand whose views are actually being represented. In this paper, we present a case study focusing specifically on differences among the Arabic Wikipedia editions (Modern Standard Arabic, Egyptian, and Moroccan). In particular, we document issues in the Egyptian Arabic Wikipedia with automatic creation/generation and translation of content pages from English without human supervision. These issues could substantially affect the performance and accuracy of Large Language Models (LLMs) trained from these corpora, producing models that lack the cultural richness and meaningful representation of native speakers. Fortunately, the metadata maintained by Wikipedia provides visibility into these issues, but unfortunately, this is not the case for all corpora used to train LLMs.
从阿拉伯语料库中学习,但并不总是从阿拉伯语使用者那里学习:阿拉伯语维基百科版本的案例研究
维基百科是自然语言处理(NLP)研究训练数据的常见来源,特别是作为非英语语言语料库的来源。然而,对于许多下游的NLP任务,了解这些语料库在多大程度上反映了母语人士的代表性贡献是很重要的。特别是,给定语言中的许多条目可能是从其他语言翻译过来的,或者通过其他自动化机制生成的。使用像维基百科这样的语料库构建的语言模型可以嵌入历史、文化、偏见、刻板印象、政治等等,但重要的是要了解谁的观点实际上被代表了。在本文中,我们提出了一个案例研究,特别关注阿拉伯语维基百科版本(现代标准阿拉伯语,埃及语和摩洛哥语)之间的差异。特别是,我们在埃及阿拉伯语维基百科中记录问题,在没有人工监督的情况下,自动创建/生成和翻译英语内容页。这些问题可能会严重影响从这些语料库中训练的大型语言模型(llm)的性能和准确性,从而产生缺乏文化丰富性和有意义的母语使用者表示的模型。幸运的是,维基百科维护的元数据提供了对这些问题的可见性,但不幸的是,并非所有用于培训法学硕士的语料库都是如此。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信