{"title":"对PubChem数据库检索任务中支持搜索的预训练大型语言模型的评估。","authors":"Ash Sze, Soha Hassoun","doi":"10.1093/bioadv/vbaf064","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data, facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming, task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.</p><p><strong>Results: </strong>We investigate in this study the current state of using a pretrained, search-enabled LLMs (ChatGPT-4o), for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight PubChem access protocols that were previously documented. We develop a methodology for adopting the protocols into an LLM-prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval. We quantitatively and qualitatively show that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We provide insightful future directions in developing LLMs for database access.</p><p><strong>Availability and implementation: </strong>All text used to prompt ChatGPT-4o is provided in the manuscript.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf064"},"PeriodicalIF":2.4000,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12073969/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database.\",\"authors\":\"Ash Sze, Soha Hassoun\",\"doi\":\"10.1093/bioadv/vbaf064\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data, facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming, task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.</p><p><strong>Results: </strong>We investigate in this study the current state of using a pretrained, search-enabled LLMs (ChatGPT-4o), for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight PubChem access protocols that were previously documented. We develop a methodology for adopting the protocols into an LLM-prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval. We quantitatively and qualitatively show that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We provide insightful future directions in developing LLMs for database access.</p><p><strong>Availability and implementation: </strong>All text used to prompt ChatGPT-4o is provided in the manuscript.</p>\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":\"5 1\",\"pages\":\"vbaf064\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12073969/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbaf064\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbaf064","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Evaluation of search-enabled pretrained Large Language Models on retrieval tasks for the PubChem database.
Motivation: Databases are indispensable in biological and biomedical research, hosting vast amounts of structured and unstructured data, facilitating the organization, retrieval, and analysis of complex data. Database access, however, remains a manual, tedious, and sometimes overwhelming, task. The availability of Large Language Models (LLMs) has the potential to play a transformative role in accessing databases.
Results: We investigate in this study the current state of using a pretrained, search-enabled LLMs (ChatGPT-4o), for data retrieval from PubChem, a flagship database that plays a critical role in biological and biomedical research. We evaluate eight PubChem access protocols that were previously documented. We develop a methodology for adopting the protocols into an LLM-prompt, where we supplement the prompt with additional context through iterative prompt refinement as needed. To further evaluate the LLM capabilities, we instruct the LLM to perform the retrieval. We quantitatively and qualitatively show that instructing ChatGPT-4o to generate programmatic access is more likely to yield the correct answers. We provide insightful future directions in developing LLMs for database access.
Availability and implementation: All text used to prompt ChatGPT-4o is provided in the manuscript.