Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Idrees A. Zahid, Shahad Sabbar Joudar, A.S. Albahri, O.S. Albahri, A.H. Alamoodi, Jose Santamaría, Laith Alzubaidi
Intelligent Systems with Applications, volume 23, Article 200431. Published 2024-09-01.
DOI: 10.1016/j.iswa.2024.200431
URL: https://www.sciencedirect.com/science/article/pii/S2667305324001054

Abstract

Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, based on the GPT architecture, with Google AI, which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified to evaluate these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed, using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when the prompts contained verbal flaws, and it tolerated grammatical errors. Moreover, the GPT-4-based approach was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 outperformed Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperformed GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a panel of human judges assessed the model comparisons. Statistical analysis of the judges' ratings revealed more robust results based on examining the specific uses, limitations, and expectations of both the GPT-4 and Google AI-based approaches. Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.
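
The headline success rates can be related to prompt counts with a little arithmetic. The sketch below is illustrative only and rests on an assumption not stated explicitly in the abstract: that the "categories" are the eight capabilities, so that ten questions per category gives 80 prompts in total. The function names (`total_prompts`, `success_rate`, `implied_passes`) are hypothetical and do not come from the paper.

```python
# Illustrative sketch only. ASSUMPTION: the evaluation categories are the
# eight capabilities listed in the abstract, each with ten prompts.
CAPABILITIES = (
    "translation accuracy",
    "text generation",
    "factuality",
    "creativity",
    "intellect",
    "deception avoidance",
    "sentiment classification",
    "sarcasm detection",
)
PROMPTS_PER_CAPABILITY = 10  # "ten research questions per category"


def total_prompts() -> int:
    """Total number of prompts under the assumption above."""
    return len(CAPABILITIES) * PROMPTS_PER_CAPABILITY


def success_rate(passed: int, total: int) -> float:
    """Percentage of prompts answered successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must lie between 0 and total")
    return 100.0 * passed / total


def implied_passes(rate_percent: float, total: int) -> int:
    """Nearest whole number of successful prompts for a reported rate."""
    return round(rate_percent / 100.0 * total)


if __name__ == "__main__":
    n = total_prompts()
    for model, reported in (("GPT-4", 85.0), ("Google AI", 68.7)):
        passes = implied_passes(reported, n)
        print(f"{model}: about {passes} of {n} prompts "
              f"({success_rate(passes, n):.2f}%)")
```

Under this assumption, 85% would correspond to 68 of 80 prompts, and 68.7% to roughly 55 of 80 (68.75%). These counts are inferred from the reported percentages, not taken from the paper.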
