实现可靠的阿拉伯语人工智能文本检测:解决分音符难题

Information Pub Date : 2024-07-19 DOI:10.3390/info15070419
Hamed Alshammari, Khaled Elleithy
{"title":"实现可靠的阿拉伯语人工智能文本检测:解决分音符难题","authors":"Hamed Alshammari, Khaled Elleithy","doi":"10.3390/info15070419","DOIUrl":null,"url":null,"abstract":"Current AI detection systems often struggle to distinguish between Arabic human-written text (HWT) and AI-generated text (AIGT) due to the small marks present above and below the Arabic text called diacritics. This study introduces robust Arabic text detection models using Transformer-based pre-trained models, specifically AraELECTRA, AraBERT, XLM-R, and mBERT. Our primary goal is to detect AIGTs in essays and overcome the challenges posed by the diacritics that usually appear in Arabic religious texts. We created several novel datasets with diacritized and non-diacritized texts comprising up to 9666 HWT and AIGT training examples. We aimed to assess the robustness and effectiveness of the detection models on out-of-domain (OOD) datasets to assess their generalizability. Our detection models trained on diacritized examples achieved up to 98.4% accuracy compared to GPTZero’s 62.7% on the AIRABIC benchmark dataset. Our experiments reveal that, while including diacritics in training enhances the recognition of the diacritized HWTs, duplicating examples with and without diacritics is inefficient despite the high accuracy achieved. Applying a dediacritization filter during evaluation significantly improved model performance, achieving optimal performance compared to both GPTZero and the detection models trained on diacritized examples but evaluated without dediacritization. Although our focus was on Arabic due to its writing challenges, our detector architecture is adaptable to any language.","PeriodicalId":510156,"journal":{"name":"Information","volume":" 428","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Toward Robust Arabic AI-Generated Text Detection: Tackling Diacritics Challenges\",\"authors\":\"Hamed Alshammari, Khaled Elleithy\",\"doi\":\"10.3390/info15070419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current AI detection systems often struggle to distinguish between Arabic human-written text (HWT) and AI-generated text (AIGT) due to the small marks present above and below the Arabic text called diacritics. This study introduces robust Arabic text detection models using Transformer-based pre-trained models, specifically AraELECTRA, AraBERT, XLM-R, and mBERT. Our primary goal is to detect AIGTs in essays and overcome the challenges posed by the diacritics that usually appear in Arabic religious texts. We created several novel datasets with diacritized and non-diacritized texts comprising up to 9666 HWT and AIGT training examples. We aimed to assess the robustness and effectiveness of the detection models on out-of-domain (OOD) datasets to assess their generalizability. Our detection models trained on diacritized examples achieved up to 98.4% accuracy compared to GPTZero’s 62.7% on the AIRABIC benchmark dataset. Our experiments reveal that, while including diacritics in training enhances the recognition of the diacritized HWTs, duplicating examples with and without diacritics is inefficient despite the high accuracy achieved. Applying a dediacritization filter during evaluation significantly improved model performance, achieving optimal performance compared to both GPTZero and the detection models trained on diacritized examples but evaluated without dediacritization. Although our focus was on Arabic due to its writing challenges, our detector architecture is adaptable to any language.\",\"PeriodicalId\":510156,\"journal\":{\"name\":\"Information\",\"volume\":\" 428\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/info15070419\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/info15070419","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

目前的人工智能检测系统通常很难区分阿拉伯语人写文本(HWT)和人工智能生成文本(AIGT),原因是阿拉伯语文本上下存在一些小标记,这些标记被称为 "变音符"。本研究使用基于变换器的预训练模型,特别是 AraELECTRA、AraBERT、XLM-R 和 mBERT,引入了稳健的阿拉伯语文本检测模型。我们的主要目标是检测文章中的 AIGT,并克服通常出现在阿拉伯语宗教文本中的变音所带来的挑战。我们创建了几个新颖的数据集,其中包含9666个HWT和AIGT训练示例,包括变音和非变音文本。我们的目标是评估检测模型在域外(OOD)数据集上的稳健性和有效性,以评估其通用性。与 GPTZero 在 AIRABIC 基准数据集上 62.7% 的准确率相比,我们在变音示例上训练的检测模型达到了高达 98.4% 的准确率。我们的实验表明,虽然在训练中加入变音可以提高变音 HWT 的识别率,但重复有变音和无变音的示例效率很低,尽管能达到很高的准确率。在评估过程中应用去读音过滤器大大提高了模型的性能,与 GPTZero 和在去读音示例上训练但未进行去读音评估的检测模型相比,都达到了最佳性能。尽管由于阿拉伯语在书写方面的挑战,我们将重点放在了阿拉伯语上,但我们的检测器架构可适用于任何语言。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Toward Robust Arabic AI-Generated Text Detection: Tackling Diacritics Challenges
Current AI detection systems often struggle to distinguish between Arabic human-written text (HWT) and AI-generated text (AIGT) due to the small marks present above and below the Arabic text called diacritics. This study introduces robust Arabic text detection models using Transformer-based pre-trained models, specifically AraELECTRA, AraBERT, XLM-R, and mBERT. Our primary goal is to detect AIGTs in essays and overcome the challenges posed by the diacritics that usually appear in Arabic religious texts. We created several novel datasets with diacritized and non-diacritized texts comprising up to 9666 HWT and AIGT training examples. We aimed to assess the robustness and effectiveness of the detection models on out-of-domain (OOD) datasets to assess their generalizability. Our detection models trained on diacritized examples achieved up to 98.4% accuracy compared to GPTZero’s 62.7% on the AIRABIC benchmark dataset. Our experiments reveal that, while including diacritics in training enhances the recognition of the diacritized HWTs, duplicating examples with and without diacritics is inefficient despite the high accuracy achieved. Applying a dediacritization filter during evaluation significantly improved model performance, achieving optimal performance compared to both GPTZero and the detection models trained on diacritized examples but evaluated without dediacritization. Although our focus was on Arabic due to its writing challenges, our detector architecture is adaptable to any language.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信