Ahmad Abdelaal, Abdallah Elsaadany, Abdelrhman Ahmed Medhat, Aysha Al Shamsi, Noha Gamal ElDin Saad Ali
PeerJ Computer Science, vol. 11, e3128. Published 2025-08-25. DOI: 10.7717/peerj-cs.3128 (https://doi.org/10.7717/peerj-cs.3128). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453725/pdf/
Plagiarism detection across languages: a comprehensive study of Arabic and English-to-Arabic long documents.
Plagiarism detection in Arabic texts remains a significant challenge due to the complex morphological structure, rich linguistic diversity, and scarcity of high-quality labeled datasets. This study proposes a robust framework for Arabic plagiarism detection by integrating Siamese neural networks (SNN) with state-of-the-art transformer architectures, specifically AraT5 and Longformer. The system employs a hybrid workflow, combining transformer-based encoders and a classification objective to implicitly learn textual similarity. To address the inherent imbalance in Arabic plagiarism datasets, both weighted cross-entropy loss and Dice loss functions were utilized to optimize model training. Extensive experiments were conducted using the ExAraCorpusPAN2015 dataset, demonstrating the effectiveness of the proposed architecture. Results indicate that AraT5 with weighted cross-entropy loss outperformed other configurations, achieving an F1-score of 0.9058. Additionally, comparative analysis with existing methodologies highlights the superiority of our approach in handling nuanced semantic and structural variations within Arabic texts. This study underscores the importance of transformer-based architectures and class-specific loss functions in enhancing plagiarism detection accuracy in under-resourced languages like Arabic.
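The abstract names two loss functions used against class imbalance but does not give their formulas. As a rough illustration only, the sketch below shows the commonly used binary forms of weighted cross-entropy and soft Dice loss in plain Python. The function names, the inverse-frequency weighting scheme, and the smoothing constant are assumptions made for this example and are not taken from the paper.

```python
import math


def inverse_frequency_weights(labels):
    """Return (pos_weight, neg_weight) so the rarer class counts more.

    Assumed scheme: weight = n / (2 * class_count). The paper may use another.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to derive weights")
    return n / (2.0 * n_pos), n / (2.0 * n_neg)


def weighted_cross_entropy(probs, labels, pos_weight=1.0, neg_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with a separate weight per class.

    probs:  predicted probability that each document pair is plagiarized.
    labels: 1 for a plagiarized pair, 0 otherwise.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)


def dice_loss(probs, labels, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(probs) + sum(labels) + smooth)."""
    overlap = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


if __name__ == "__main__":
    labels = [1, 0, 0, 0]          # one plagiarized pair out of four
    probs = [0.7, 0.2, 0.1, 0.4]   # hypothetical model outputs
    w_pos, w_neg = inverse_frequency_weights(labels)
    print(weighted_cross_entropy(probs, labels, w_pos, w_neg))
    print(dice_loss(probs, labels))
```

With weights derived this way, an error on the rare plagiarized class costs more than an error on the majority class. Dice loss instead scores overlap between predictions and positive labels, so the many easy negatives contribute little to it.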
About the journal:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.