A Comparison of Numeric Assessments of Ideas From Two Large Language Models: With Implications for Validating and Choosing LLMs

IF 5.6 | CAS Zone 3, Computer Science | JCR Q1, Computer Science, Artificial Intelligence
Daniel E. O’Leary
IEEE Intelligent Systems. Published 2024-06-24. DOI: 10.1109/mis.2024.3396371 (https://doi.org/10.1109/mis.2024.3396371)

Abstract

This article compares numeric assessments generated by ChatGPT and Claude along four dimensions (novelty, feasibility, impact, and disruption) to study their ability to rate ideas. We find that both chatbots make numeric assessments that are consistent with the expected relationships between those dimensions; for example, novelty is negatively correlated with feasibility. We also find that the two chatbots make statistically significantly different numeric assessments of the same idea information. We suggest that this type of analysis can also be used to validate underlying chatbot capabilities. In addition, we suggest that, as part of their chatbot requirements analysis, enterprises use this approach to ensure that the chatbot appropriately “understands” the concepts in which they are directly interested.
Source journal: IEEE Intelligent Systems (Engineering and Technology: Electrical and Electronic Engineering)
CiteScore: 13.80
Self-citation rate: 3.10%
Articles published: 122
Review time: 1 month
About the journal: IEEE Intelligent Systems serves users, managers, developers, researchers, and purchasers who are interested in intelligent systems and artificial intelligence, with particular emphasis on applications. Typically they are degreed professionals with backgrounds in engineering, hard science, or business. The publication emphasizes current practice and experience, together with promising new ideas that are likely to be used in the near future. Sample topic areas for feature articles include knowledge-based systems, intelligent software agents, natural-language processing, technologies for knowledge management, machine learning, data mining, adaptive and intelligent robotics, knowledge-intensive processing on the Web, and social issues relevant to intelligent systems. Also encouraged are application features, covering practice at one or more companies or laboratories; full-length product stories (which require refereeing by at least three reviewers); tutorials; surveys; and case studies. Issues are often theme-based and collect articles around a contemporary topic under the auspices of a Guest Editor working with the EIC.