Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database.

Impact factor: 2.4, JCR Q2, Mathematical & Computational Biology

Bioinformatics advances. Pub Date: 2025-03-24. eCollection Date: 2025-01-01. DOI: 10.1093/bioadv/vbaf064

Ash Sze, Soha Hassoun


Citations: 0

Abstract


Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database.

Motivation: Databases are indispensable in biological and biomedical research. They host vast amounts of structured and unstructured data and facilitate the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.

Results: In this study, we investigate the current state of using a pretrained, search-enabled LLM (ChatGPT-4o) for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight PubChem access protocols that were previously documented. We develop a methodology for adopting the protocols into an LLM prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval. We quantitatively and qualitatively show that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We provide insightful future directions in developing LLMs for database access.

Availability and implementation: All text used to prompt ChatGPT-4o is provided in the manuscript.
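To make "programmatic access" to PubChem concrete, the sketch below builds a request URL for PubChem's PUG REST interface and, optionally, fetches the result. It is an illustration only: it is not one of the eight protocols evaluated in the paper, nor code produced by ChatGPT-4o. The function names `build_property_url` and `fetch_properties` and the example compound and properties are assumptions chosen for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of PubChem's PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties, output="JSON"):
    """Build a PUG REST URL that looks up compound properties by name.

    `name` is a compound name (hypothetical example: "aspirin");
    `properties` is a list of PUG REST property names.
    """
    if not properties:
        raise ValueError("at least one property is required")
    # Percent-encode the name so spaces and slashes are safe in the path.
    encoded_name = quote(name, safe="")
    prop_list = ",".join(properties)
    return (
        f"{PUG_REST_BASE}/compound/name/{encoded_name}"
        f"/property/{prop_list}/{output}"
    )


def fetch_properties(name, properties, timeout=30):
    """Send the request and return the parsed JSON (needs network access)."""
    url = build_property_url(name, properties)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URL only; call fetch_properties(...) to query PubChem.
    print(build_property_url("aspirin", ["MolecularFormula", "MolecularWeight"]))
```

Keeping URL construction separate from the network call means the request can be inspected or logged before it is sent, which is useful when checking whether a generated query matches the intended retrieval task.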


Source journal

Bioinformatics advances

CiteScore

1.60

Self-citation rate

0.00%

Number of publications