ARABICNLP | Pub Date: 2024-01-23 | DOI: 10.18653/v1/2023.arabicnlp-1.81
Mohammed Elkomy, Amany Sarhan
{"title":"TCE at Qur’an QA 2023 Shared Task: Low Resource Enhanced Transformer-based Ensemble Approach for Qur’anic QA","authors":"Mohammed Elkomy, Amany Sarhan","doi":"10.18653/v1/2023.arabicnlp-1.81","DOIUrl":"https://doi.org/10.18653/v1/2023.arabicnlp-1.81","url":null,"abstract":"In this paper, we present our approach to tackle Qur’an QA 2023 shared tasks A and B. To address the challenge of low-resourced training data, we rely on transfer learning together with a voting ensemble to improve prediction stability across multiple runs. Additionally, we employ different architectures and learning mechanisms for a range of Arabic pre-trained transformer-based models for both tasks. To identify unanswerable questions, we propose using a thresholding mechanism. Our top-performing systems greatly surpass the baseline performance on the hidden split, achieving a MAP score of 25.05% for task A and a partial Average Precision (pAP) of 57.11% for task B.","PeriodicalId":503921,"journal":{"name":"ARABICNLP","volume":"17 6","pages":"728-742"},"PeriodicalIF":0.0,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139603885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ARABICNLP | Pub Date: 2023-12-16 | DOI: 10.18653/v1/2023.arabicnlp-1.69
Mohamed Lichouri, Khaled Lounnas, Aicha Zitouni, H. Latrache, R. Djeradi
{"title":"USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification","authors":"Mohamed Lichouri, Khaled Lounnas, Aicha Zitouni, H. Latrache, R. Djeradi","doi":"10.18653/v1/2023.arabicnlp-1.69","DOIUrl":"https://doi.org/10.18653/v1/2023.arabicnlp-1.69","url":null,"abstract":"In this paper, we conduct an in-depth analysis of several key factors influencing the performance of Arabic Dialect Identification NADI’2023, with a specific focus on the first subtask involving country-level dialect identification. Our investigation encompasses the effects of surface preprocessing, morphological preprocessing, FastText vector model, and the weighted concatenation of TF-IDF features. For classification purposes, we employ the Linear Support Vector Classification (LSVC) model. During the evaluation phase, our system demonstrates noteworthy results, achieving an F_1 score of 62.51%. This achievement closely aligns with the average F_1 scores attained by other systems submitted for the first subtask, which stands at 72.91%.","PeriodicalId":503921,"journal":{"name":"ARABICNLP","volume":"227 2","pages":"647-651"},"PeriodicalIF":0.0,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139176897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ARABICNLP | Pub Date: 2023-12-13 | DOI: 10.18653/v1/2023.arabicnlp-1.9
S. Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, M. Abdul-Mageed
{"title":"Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction","authors":"S. Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, M. Abdul-Mageed","doi":"10.18653/v1/2023.arabicnlp-1.9","DOIUrl":"https://doi.org/10.18653/v1/2023.arabicnlp-1.9","url":null,"abstract":"Large language models (LLMs) finetuned to follow human instruction have recently exhibited significant capabilities in various English NLP tasks. However, their performance in grammatical error correction (GEC), especially on languages other than English, remains significantly unexplored. In this work, we evaluate the abilities of instruction finetuned LLMs in Arabic GEC, a complex task due to Arabic’s rich morphology. Our findings suggest that various prompting methods, coupled with (in-context) few-shot learning, demonstrate considerable effectiveness, with GPT-4 achieving up to 65.49 F1 score under expert prompting (approximately 5 points higher than our established baseline). Despite these positive results, we find that instruction finetuned models, regardless of their size, are still outperformed by fully finetuned ones, even if they are significantly smaller in size. This disparity highlights substantial room for improvements for LLMs. Inspired by methods used in low-resource machine translation, we also develop a method exploiting synthetic data that significantly outperforms previous models on two standard Arabic benchmarks. Our best model achieves a new SOTA on Arabic GEC, with 73.29 and 73.26 F1 on the 2014 and 2015 QALB datasets, respectively, compared to peer-reviewed published baselines.","PeriodicalId":503921,"journal":{"name":"ARABICNLP","volume":"192 1","pages":"101-119"},"PeriodicalIF":0.0,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139181302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mavericks at ArAIEval Shared Task: Towards a Safer Digital Space - Transformer Ensemble Models Tackling Deception and Persuasion","authors":"Sudeep Mangalvedhekar, Kshitij Deshpande, Yash Patwardhan, Vedant Deshpande, Ravindra Murumkar","doi":"10.48550/arXiv.2311.18730","DOIUrl":"https://doi.org/10.48550/arXiv.2311.18730","url":null,"abstract":"In this paper, we highlight our approach for the “Arabic AI Tasks Evaluation (ArAiEval) Shared Task 2023”. We present our approaches for task 1-A and task 2-A of the shared task which focus on persuasion technique detection and disinformation detection respectively. Detection of persuasion techniques and disinformation has become imperative to avoid distortion of authentic information. The tasks use multigenre snippets of tweets and news articles for the given binary classification problem. We experiment with several transformer-based models that were pre-trained on the Arabic language. We fine-tune these state-of-the-art models on the provided dataset. Ensembling is employed to enhance the performance of the systems. We achieved a micro F1-score of 0.742 on task 1-A (8th rank on the leaderboard) and 0.901 on task 2-A (7th rank on the leaderboard) respectively.","PeriodicalId":503921,"journal":{"name":"ARABICNLP","volume":"14 1","pages":"513-518"},"PeriodicalIF":0.0,"publicationDate":"2023-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139200020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ARABICNLP | Pub Date: 2023-11-15 | DOI: 10.48550/arXiv.2311.08844
Abdelrahman Mohamed, Fakhraddin Alwajih, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, M. Abdul-Mageed
{"title":"Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder","authors":"Abdelrahman Mohamed, Fakhraddin Alwajih, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, M. Abdul-Mageed","doi":"10.48550/arXiv.2311.08844","DOIUrl":"https://doi.org/10.48550/arXiv.2311.08844","url":null,"abstract":"Although image captioning has a vast array of applications, it has not reached its full potential in languages other than English. Arabic, for instance, although the native language of more than 400 million people, remains largely underrepresented in this area. This is due to the lack of labeled data and powerful Arabic generative models. We alleviate this issue by presenting a novel vision-language model dedicated to Arabic, dubbed Violet. Our model is based on a vision encoder and a Gemini text decoder that maintains generation fluency while allowing fusion between the vision and language components. To train our model, we introduce a new method for automatically acquiring data from available English datasets. We also manually prepare a new dataset for evaluation. Violet performs sizeably better than our baselines on all of our evaluation datasets. For example, it reaches a CIDEr score of 61.2 on our manually annotated dataset and achieves an improvement of 13 points on Flickr8k.","PeriodicalId":503921,"journal":{"name":"ARABICNLP","volume":"11 3","pages":"1-11"},"PeriodicalIF":0.0,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139272911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}