AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture

Big Data and Cognitive Computing, Pub Date: 2024-03-18, DOI: 10.3390/bdcc8030032

Hamed Alshammari, Ahmed El-Sayed, Khaled Elleithy


Citation count: 0

Abstract

Existing AI detectors perform notably worse when processing Arabic texts. This study introduces a novel AI text classifier designed specifically for Arabic, tackling the distinct challenges inherent in processing this language. A particular focus is placed on accurately recognizing human-written texts (HWTs), an area where existing AI detectors have demonstrated significant limitations. To achieve this goal, the study fine-tuned two Transformer-based models, AraELECTRA and XLM-R, on two distinct datasets: a large dataset comprising 43,958 examples and a custom dataset of 3,078 examples containing HWTs and AI-generated texts (AIGTs) from various sources, including ChatGPT 3.5, ChatGPT-4, and BARD. The proposed architecture is adaptable to any language, but this work evaluates the models’ efficiency in distinguishing HWTs from AIGTs in Arabic as an example of a Semitic language. The proposed models were compared against two prominent existing AI detectors, GPTZero and OpenAI Text Classifier, in particular on the AIRABIC benchmark dataset. The proposed classifiers reach 81% accuracy, outperforming GPTZero (63%) and OpenAI Text Classifier (50%). Furthermore, integrating a Dediacritization Layer before the classification model significantly improved detection accuracy for both HWTs and AIGTs, raising it from 81% to as high as 99% and, in some instances, to 100%.
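The dediacritization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which characters the layer removes, so the character set below (Unicode U+064B to U+065F plus U+0670) is an assumption, and the names `dediacritize`, `classify_text`, and the `classifier` callable are hypothetical, introduced only for illustration.

```python
import re
from typing import Callable

# Assumption: "diacritics" means the Arabic combining marks U+064B-U+065F
# (tanwin, short vowels, shadda, sukun, and related marks) plus U+0670
# (superscript alef). The paper's exact character set is not given here.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u065F\u0670]")


def dediacritize(text: str) -> str:
    """Remove Arabic diacritic marks; all other characters are left untouched."""
    return _ARABIC_DIACRITICS.sub("", text)


def classify_text(text: str, classifier: Callable[[str], str]) -> str:
    """Apply dediacritization before handing the text to a classifier.

    `classifier` is a hypothetical stand-in for a fine-tuned model
    (for example AraELECTRA or XLM-R) that maps a string to a label.
    """
    return classifier(dediacritize(text))
```

In this sketch the preprocessing is a pure string transformation, so it can be placed in front of any classifier without retraining it; whether the accuracy gains reported in the abstract carry over depends on the model and data.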

