Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Intelligent Systems with Applications, Volume 23, Article 200431 | Pub Date: 2024-09-01 | DOI: 10.1016/j.iswa.2024.200431

Idrees A. Zahid, Shahad Sabbar Joudar, A.S. Albahri, O.S. Albahri, A.H. Alamoodi, Jose Santamaría, Laith Alzubaidi



Abstract



Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, based on the GPT architecture, with Google's artificial intelligence (Google AI), which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified to evaluate the models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instructions. Additionally, a categorized LLM evaluation system was developed using ten research questions per category, based on this paper's main contributions from a prompt engineering perspective. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when they contained verbal flaws, and tolerated grammatical errors. Moreover, GPT-4 was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. GPT-4 outperformed Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, whereas Google AI outperformed GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a panel of human judges assessed the model comparisons. Statistical analysis of the judges' ratings yielded more robust results by examining the specific uses, limitations, and expectations of both GPT-4 and Google AI-based approaches. Finally, the two approaches' transformers, parameter sizes, and attention mechanisms were examined.
