Using ChatGPT with Confidence for Biodiversity-Related Information Tasks

Michael Elliott, José Fortes
{"title":"Using ChatGPT with Confidence for Biodiversity-Related Information Tasks","authors":"Michael Elliott, José Fortes","doi":"10.3897/biss.7.112926","DOIUrl":null,"url":null,"abstract":"Recent advancements in conversational Artificial Intelligence (AI), such as OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT), present the possibility of using large language models (LLMs) as tools for retrieving, analyzing, and transforming scientific information. We have found that ChatGPT (GPT 3.5) can provide accurate biodiversity knowledge in response to questions about species descriptions, occurrences, and taxonomy, as well as structure information according to data sharing standards such as Darwin Core. A rigorous evaluation of ChatGPT's capabilities in biodiversity-related tasks may help to inform viable use cases for today's LLMs in research and information workflows. In this work, we test the extent of ChatGPT's biodiversity knowledge, characterize its mistakes, and suggest how LLM-based systems might be designed to complete knowledge-based tasks with confidence. To test ChatGPT's biodiversity knowledge, we compiled a question-and-answer test set derived from Darwin Core records available in Integrated Digitized Biocollections (iDigBio). Each question focuses on one or more Darwin Core terms to test the model’s ability to recall species occurrence information and its understanding of the standard. The test set covers a range of locations, taxonomic groups, and both common and rare species (defined by the number of records in iDigBio). The results of the tests will be presented. We also tested ChatGPT on generative tasks, such as creating species occurrence maps. A visual comparison of the maps with iDigBio data shows that for some species, ChatGPT can generate fairly accurate representationsof their geographic ranges (Fig. 1). ChatGPT's incorrect responses in our tests show several patterns of mistakes. First, responses can be self-conflicting. For example, when asked \"Does Acer saccharum naturally occur in Benton, Oregon?\", ChatGPT responded \"YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon\". ChatGPT can also be misled by semantics in species names. For Rafinesquia neomexicana , the word \"neomexicana\" leads ChatGPT to believe that the species primarily occurs in New Mexico, USA. ChatGPT may also confuse species, such as when attempting to describe a lesser-known species (e.g., a rare bee) within the same genus as a better-known species. Other causes of mistakes include hallucination (Ji et al. 2023), memorization (Chang and Bergen 2023), and user deception (Li et al. 2023). Some mistakes may be avoided by prompt engineering, e.g., few-shot prompting (Chang and Bergen 2023) and chain-of-thought prompting (Wei et al. 2022). These techniques assist Large Language Models (LLMs) by clarifying expectations or by guiding recollection. However, such methods cannot help when LLMs lack required knowledge. In these cases, alternative approaches are needed. A desired reliability can be theoretically guaranteed if responses that contain mistakes are discarded or corrected. This requires either detecting or predicting mistakes. Sometimes mistakes can be ruled out by verifying responses with a trusted source. For example, a trusted specimen record might be found that corroborates the response. 
The difficulty, however, is finding such records programmatically; e.g., using iDigBio and Global Biodiversity Information Facility's (GBIF) search Application Programming Interfaces (APIs) requires specifying indexed terms that might not appear in an LLM's response. This presents a secondary problem for which LLMs may be well suited. Note that with presence-only data, it can be difficult to disprove presence claims or prove absence claims. Besides verification, mistakes may be predicted using probabilistic methods. Formulating mistake probabilities often relies on heuristics. For example, variability in a model’s responses to a repeated query can be a sign of hallucination (Manakul et al. 2023). In practice, both probabilistic and verification methods may be needed to reach a desired reliability. LLM outputs that can be verified may be directly accepted (or discarded), while others are judged by estimating mistake probabilities. We will consider a set of heuristics and verification methods, and report empirical assessments of their impact on ChatGPT’s reliability.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"178 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodiversity Information Science and Standards","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3897/biss.7.112926","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent advancements in conversational Artificial Intelligence (AI), such as OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT), present the possibility of using large language models (LLMs) as tools for retrieving, analyzing, and transforming scientific information. We have found that ChatGPT (GPT-3.5) can provide accurate biodiversity knowledge in response to questions about species descriptions, occurrences, and taxonomy, as well as structure information according to data sharing standards such as Darwin Core. A rigorous evaluation of ChatGPT's capabilities in biodiversity-related tasks may help to inform viable use cases for today's LLMs in research and information workflows. In this work, we test the extent of ChatGPT's biodiversity knowledge, characterize its mistakes, and suggest how LLM-based systems might be designed to complete knowledge-based tasks with confidence.

To test ChatGPT's biodiversity knowledge, we compiled a question-and-answer test set derived from Darwin Core records available in Integrated Digitized Biocollections (iDigBio). Each question focuses on one or more Darwin Core terms to test the model's ability to recall species occurrence information and its understanding of the standard. The test set covers a range of locations, taxonomic groups, and both common and rare species (defined by the number of records in iDigBio). The results of the tests will be presented.

We also tested ChatGPT on generative tasks, such as creating species occurrence maps. A visual comparison of the maps with iDigBio data shows that, for some species, ChatGPT can generate fairly accurate representations of their geographic ranges (Fig. 1).

ChatGPT's incorrect responses in our tests show several patterns of mistakes. First, responses can be self-conflicting. For example, when asked "Does Acer saccharum naturally occur in Benton, Oregon?", ChatGPT responded "YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon". ChatGPT can also be misled by semantics in species names. For Rafinesquia neomexicana, the word "neomexicana" leads ChatGPT to believe that the species primarily occurs in New Mexico, USA. ChatGPT may also confuse species, such as when attempting to describe a lesser-known species (e.g., a rare bee) within the same genus as a better-known species. Other causes of mistakes include hallucination (Ji et al. 2023), memorization (Chang and Bergen 2023), and user deception (Li et al. 2023).

Some mistakes may be avoided by prompt engineering, e.g., few-shot prompting (Chang and Bergen 2023) and chain-of-thought prompting (Wei et al. 2022). These techniques assist LLMs by clarifying expectations or by guiding recollection. However, such methods cannot help when LLMs lack the required knowledge. In these cases, alternative approaches are needed.

A desired reliability can, in theory, be guaranteed if responses that contain mistakes are discarded or corrected. This requires either detecting or predicting mistakes. Sometimes mistakes can be ruled out by verifying responses against a trusted source. For example, a trusted specimen record might be found that corroborates the response. The difficulty, however, is finding such records programmatically; e.g., using iDigBio and the Global Biodiversity Information Facility's (GBIF) search Application Programming Interfaces (APIs) requires specifying indexed terms that might not appear in an LLM's response. This presents a secondary problem for which LLMs may be well suited.
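The abstract does not specify how questions were generated from Darwin Core records. As a minimal sketch of the general idea, the following Python snippet derives a yes/no occurrence question from the indexed terms of an iDigBio record; the question template, the choice of terms (county and stateProvince), and the helper names are illustrative assumptions rather than the authors' actual pipeline.

```python
import json

import requests

IDIGBIO_SEARCH = "https://search.idigbio.org/v2/search/records/"

def fetch_records(scientific_name, limit=5):
    """Query iDigBio's search API for occurrence records of one species."""
    params = {"rq": json.dumps({"scientificname": scientific_name}),
              "limit": limit}
    resp = requests.get(IDIGBIO_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

def make_question(record):
    """Turn one record's indexed Darwin Core terms into a yes/no question."""
    terms = record["indexTerms"]  # iDigBio lowercases indexed values
    name = terms.get("scientificname", "").capitalize()
    county = terms.get("county")
    state = terms.get("stateprovince")
    if not (name and county and state):
        return None  # skip records lacking the terms this template needs
    return {
        "question": f"Does {name} naturally occur in "
                    f"{county.title()}, {state.title()}?",
        "expected": "yes",
        "source_record": record["uuid"],
    }

questions = [q for r in fetch_records("Acer saccharum")
             if (q := make_question(r)) is not None]
print(questions[0]["question"])
```

Other templates could target different Darwin Core terms (e.g., recordedBy or eventDate) to probe the model's understanding of the standard itself, not just its recall of occurrences.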
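The iDigBio side of a visual comparison like Fig. 1 can be assembled in a similar way. This sketch scatters georeferenced occurrence points for one species, for eyeballing against a model-generated range map; it assumes iDigBio's geopoint index and its "exists" query type, and the plotting details are incidental.

```python
import json

import matplotlib.pyplot as plt
import requests

def fetch_geopoints(scientific_name, limit=500):
    """Collect (lon, lat) pairs from georeferenced iDigBio records."""
    rq = {"scientificname": scientific_name,
          "geopoint": {"type": "exists"}}  # only georeferenced records
    resp = requests.get("https://search.idigbio.org/v2/search/records/",
                        params={"rq": json.dumps(rq), "limit": limit},
                        timeout=30)
    resp.raise_for_status()
    return [(r["indexTerms"]["geopoint"]["lon"],
             r["indexTerms"]["geopoint"]["lat"])
            for r in resp.json()["items"]]

lons, lats = zip(*fetch_geopoints("Rafinesquia neomexicana"))
plt.scatter(lons, lats, s=4)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("iDigBio occurrences: Rafinesquia neomexicana")
plt.savefig("occurrence_points.png")
```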
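To make the prompt-engineering techniques concrete, the sketch below combines a few-shot example with a chain-of-thought style instruction, using OpenAI's current Python client (which postdates the GPT-3.5 experiments described here); the system prompt, the worked example, and the answer format are illustrative choices.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Few-shot: a worked example fixes the expected answer format.
# Chain-of-thought: asking for step-by-step reasoning guides recall.
messages = [
    {"role": "system",
     "content": ("Answer yes/no questions about species occurrence. "
                 "Reason step by step, then finish with 'Answer: yes' "
                 "or 'Answer: no'.")},
    {"role": "user",
     "content": "Does Quercus alba naturally occur in Alachua, Florida?"},
    {"role": "assistant",
     "content": ("Quercus alba (white oak) ranges across eastern North "
                 "America, reaching northern Florida, and Alachua County "
                 "is in northern Florida. Answer: yes")},
    {"role": "user",
     "content": "Does Acer saccharum naturally occur in Benton, Oregon?"},
]

reply = client.chat.completions.create(model="gpt-3.5-turbo",
                                       messages=messages)
print(reply.choices[0].message.content)
```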
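Once a response has been mapped onto indexed terms (the secondary problem noted above, which this sketch simply assumes is solved), verification against a trusted source can be a single occurrence search. A minimal sketch against GBIF's public occurrence API:

```python
import requests

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"

def corroborating_count(scientific_name, state_province, country_code="US"):
    """Count GBIF occurrence records matching a presence claim."""
    params = {"scientificName": scientific_name,
              "stateProvince": state_province,
              "country": country_code,  # ISO 3166-1 alpha-2 code
              "limit": 0}               # only the total count is needed
    reply = requests.get(GBIF_SEARCH, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["count"]

# A nonzero count corroborates "Acer saccharum occurs in Oregon";
# a zero count, by itself, settles nothing.
print(corroborating_count("Acer saccharum", "Oregon"))
```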
Note that with presence-only data, it can be difficult to disprove presence claims or prove absence claims.

Besides verification, mistakes may be predicted using probabilistic methods. Formulating mistake probabilities often relies on heuristics. For example, variability in a model's responses to a repeated query can be a sign of hallucination (Manakul et al. 2023). In practice, both probabilistic and verification methods may be needed to reach a desired reliability. LLM outputs that can be verified may be directly accepted (or discarded), while others are judged by estimating mistake probabilities. We will consider a set of heuristics and verification methods, and report empirical assessments of their impact on ChatGPT's reliability.
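As an example of such a heuristic, in the spirit of Manakul et al. (2023), one can re-ask the same question several times at a nonzero sampling temperature and treat disagreement among the answers as an elevated mistake probability. A minimal sketch, in which the agreement threshold and prompt wording are arbitrary illustrative choices:

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

def modal_answer(question, n=5, temperature=1.0):
    """Ask the same question n times; return the modal yes/no answer
    and the fraction of samples that agree with it."""
    answers = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=temperature,  # nonzero, so samples can disagree
            messages=[
                {"role": "system",
                 "content": "Answer with a single word: yes or no."},
                {"role": "user", "content": question},
            ],
        )
        answers.append(reply.choices[0].message.content.strip().lower())
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

answer, agreement = modal_answer(
    "Does Acer saccharum naturally occur in Benton, Oregon?")
if agreement < 0.8:  # arbitrary threshold for illustration
    print(f"Low agreement ({agreement:.0%}): flag '{answer}' for review")
```

In a deployed pipeline, answers whose agreement falls below the threshold would be routed to verification or discarded, in line with the accept-or-discard policy described above.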