Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts

Malik Sallam, Dhia Mousa
{"title":"Evaluating ChatGPT performance in Arabic dialects: A comparative study showing defects in responding to Jordanian and Tunisian general health prompts","authors":"Malik Sallam, Dhia Mousa","doi":"10.58496/mjaih/2024/001","DOIUrl":null,"url":null,"abstract":"Background: The role of artificial intelligence (AI) is increasingly recognized to enhance digital health literacy. There is of particular importance with widespread availability and popularity of AI chatbots such as ChatGPT and its possible impact on health literacy. The involves the need to understand AI models’ performance across different languages, dialects, and cultural contexts. This study aimed to evaluate ChatGPT performance in response to prompting in two different Arabic dialects, namely Tunisian and Jordanian. \nMethods: This descriptive study followed the METRICS checklist for the design and reporting of AI based studies in healthcare. Ten general health queries were translated into Tunisian and Jordanian dialects of Arabic by bilingual native speakers. The performance of two AI models, ChatGPT-3.5 and ChatGPT-4 in response to Tunisian, Jordanian, and English were evaluated using the CLEAR tool tailored for assessment of health information generated by AI models. \nResults: ChatGPT-3.5 performance was categorized as average in Tunisian Arabic, with an overall CLEAR score of 2.83, compared to above average score of 3.40 in Jordanian Arabic. ChatGPT-4 showed a similar pattern with marginally better outcomes with a CLEAR score of 3.20 in Tunisian rated as average and above average performance in Jordanian with a CLEAR score of 3.53. The CLEAR components consistently showed superior performance in the Jordanian dialect for both models despite the lack of statistical significance. Using English content as a reference, the responses to both Tunisian and Jordanian dialects were significantly inferior (P<.001). 
\nConclusion: The findings highlight a critical dialectical performance gap in ChatGPT, underlining the need to enhance linguistic and cultural diversity in AI models’ development, particularly for health-related content. Collaborative efforts among AI developers, linguists, and healthcare professionals are needed to improve the performance of AI models across different languages, dialects, and cultural contexts. Future studies are recommended to broaden the scope across an extensive range of languages and dialects, which would help in achieving equitable access to health information across various communities.","PeriodicalId":424250,"journal":{"name":"Mesopotamian Journal of Artificial Intelligence in Healthcare","volume":" 39","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mesopotamian Journal of Artificial Intelligence in Healthcare","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.58496/mjaih/2024/001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: The role of artificial intelligence (AI) in enhancing digital health literacy is increasingly recognized. This is of particular importance given the widespread availability and popularity of AI chatbots such as ChatGPT and their possible impact on health literacy, which entails the need to understand AI models' performance across different languages, dialects, and cultural contexts. This study aimed to evaluate ChatGPT performance in response to prompting in two different Arabic dialects, namely Tunisian and Jordanian.

Methods: This descriptive study followed the METRICS checklist for the design and reporting of AI-based studies in healthcare. Ten general health queries were translated into the Tunisian and Jordanian dialects of Arabic by bilingual native speakers. The performance of two AI models, ChatGPT-3.5 and ChatGPT-4, in response to Tunisian, Jordanian, and English prompts was evaluated using the CLEAR tool, which is tailored for the assessment of health information generated by AI models.

Results: ChatGPT-3.5 performance was categorized as average in Tunisian Arabic, with an overall CLEAR score of 2.83, compared to an above-average score of 3.40 in Jordanian Arabic. ChatGPT-4 showed a similar pattern with marginally better outcomes: a CLEAR score of 3.20 in Tunisian, rated as average, and above-average performance in Jordanian, with a CLEAR score of 3.53. The CLEAR components consistently showed superior performance in the Jordanian dialect for both models, although the differences did not reach statistical significance. Using the English content as a reference, the responses to both the Tunisian and Jordanian dialects were significantly inferior (P<.001).

Conclusion: The findings highlight a critical dialectal performance gap in ChatGPT, underlining the need to enhance linguistic and cultural diversity in AI model development, particularly for health-related content. Collaborative efforts among AI developers, linguists, and healthcare professionals are needed to improve the performance of AI models across different languages, dialects, and cultural contexts. Future studies are recommended to broaden the scope across an extensive range of languages and dialects, which would help in achieving equitable access to health information across various communities.
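The scoring described in the Results can be illustrated with a minimal sketch: an overall CLEAR score is the mean of per-item Likert ratings, and the descriptive bands ("average", "above average") follow from cut-offs on that mean. The equal-width five-band cut-offs below are an assumption, not taken from the paper, but they are consistent with how the reported scores (2.83 = average, 3.40 = above average) are categorized in the abstract.

```python
from statistics import mean

# Assumed equal-width bands on the 1-5 Likert scale; upper bounds are
# exclusive. These cut-offs are an illustration consistent with the
# categories reported in the abstract, not the study's published scheme.
BANDS = [
    (1.80, "poor"),
    (2.60, "below average"),
    (3.40, "average"),
    (4.20, "above average"),
]

def overall_clear(item_scores):
    """Overall CLEAR score = mean of the per-item Likert ratings (1-5)."""
    return round(mean(item_scores), 2)

def clear_category(score):
    """Map an overall CLEAR score to its descriptive band."""
    for upper, label in BANDS:
        if score < upper:
            return label
    return "excellent"

# Overall scores reported in the study, reproduced for illustration:
reported = {
    ("ChatGPT-3.5", "Tunisian"): 2.83,
    ("ChatGPT-3.5", "Jordanian"): 3.40,
    ("ChatGPT-4", "Tunisian"): 3.20,
    ("ChatGPT-4", "Jordanian"): 3.53,
}
for (model, dialect), score in reported.items():
    print(f"{model} / {dialect}: {score} -> {clear_category(score)}")
```

Under this banding, both Jordanian scores fall at or above the 3.40 threshold and are labeled above average, while both Tunisian scores fall below it and are labeled average, matching the pattern reported above.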