Jingwen Bai, Selvakumar Kamatchinathan, Deepti J Kundu, Chakradhar Bandla, Juan Antonio Vizcaíno, Yasset Perez-Riverol
Open-source large language models in action: A bioinformatics chatbot for PRIDE database.
Proteomics, e2400005. Epub 2024-03-31; publication date 2024-11-01. DOI: 10.1002/pmic.202400005 (https://doi.org/10.1002/pmic.202400005)
Abstract
We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework uses multiple large language models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework further includes a benchmark component based on the Elo ranking system, which is used both to evaluate the performance of each LLM and to improve the PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search for PRIDE datasets through an LLM-based recommendation system, improving dataset discoverability. Importantly, while our infrastructure is exemplified through its application to the PRIDE database, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. Together, the integration of advanced LLMs, vector-based construction, the benchmarking framework, and optimized documentation forms a robust and transferable chatbot assistant infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
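The abstract does not spell out how the Elo-based benchmark is computed. As a rough illustration only, the standard Elo update for pairwise comparisons between two models' answers can be sketched as below. The function names, the starting rating of 1000, the K-factor of 32, and the example comparison are all assumptions made for this sketch; they are not taken from the PRIDE chatbot code or its results.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. k (assumed 32 here) controls the step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_models(comparisons: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0,
                k: float = 32.0) -> Dict[str, float]:
    """Apply Elo updates over (model_a, model_b, score_a) records in order."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical single comparison between two placeholder models.
    print(rank_models([("model_x", "model_y", 1.0)]))
```

In a setup like this, each record would correspond to one human or automated judgment about which of two models answered a question better. Because each update adds to one rating exactly what it subtracts from the other, the sum of the two ratings stays constant.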
About the journal:
PROTEOMICS is the premier international source for information on all aspects of applications and technologies, including software, in proteomics and other "omics". The journal includes but is not limited to proteomics, genomics, transcriptomics, metabolomics and lipidomics, and systems biology approaches. Papers describing novel applications of proteomics and integration of multi-omics data and approaches are especially welcome.