SeedLLM·Rice: A large language model integrated with rice biological knowledge graph.
Fan Yang, Huanjun Kong, Jie Ying, Zihong Chen, Tao Luo, Wanli Jiang, Zhonghang Yuan, Zhefan Wang, Zhaona Ma, Shikuan Wang, Wanfeng Ma, Xiaoyi Wang, Xiaoying Li, Zhengyin Hu, Xiaodong Ma, Minguo Liu, Xiqing Wang, Fan Chen, Nanqing Dong
Molecular Plant, pp. 1118-1129. Published 2025-07-07 (Epub 2025-05-28). DOI: 10.1016/j.molp.2025.05.013 (https://doi.org/10.1016/j.molp.2025.05.013)
Abstract
Rice biology research involves complex decision-making, requiring researchers to navigate a rapidly expanding body of knowledge encompassing extensive literature and multiomics data. The exponential increase in biological data and scientific publications presents significant challenges for efficiently extracting meaningful insights. Although large language models (LLMs) show promise for knowledge retrieval, their application to rice-specific research has been limited by the absence of specialized models and the challenge of synthesizing multimodal data integral to the field. Moreover, the lack of standardized evaluation frameworks for domain-specific tasks impedes the effective assessment of model performance. To address these challenges, we introduce SeedLLM·Rice (SeedLLM), a 7-billion-parameter model trained on 1.4 million rice-related publications, representing nearly 98.24% of global rice research output. Additionally, we present a novel human-centric evaluation framework designed to assess LLM performance in rice biology tasks. Initial evaluations demonstrate that SeedLLM outperforms general-purpose models, including OpenAI GPT-4o1 and DeepSeek-R1, achieving win rates of 57% to 88% on rice-specific tasks. Furthermore, SeedLLM is integrated with the Rice Biological Knowledge Graph (RBKG), which consolidates genome annotations for Nipponbare and large-scale synthesis of transcriptomic and proteomic information from over 1800 studies. This integration enhances the ability of SeedLLM to address complex research questions requiring the fusion of textual and multiomics data. To facilitate global collaboration, we provide free access to SeedLLM and the RBKG via an interactive web portal (https://seedllm.org.cn/). SeedLLM represents a transformative tool for rice biology research, enabling unprecedented discoveries in crop improvement and climate adaptation through advanced reasoning and comprehensive data integration.
About the journal:
Molecular Plant is dedicated to serving the plant science community by publishing novel and exciting findings with high significance in plant biology. The journal focuses broadly on cellular biology, physiology, biochemistry, molecular biology, genetics, development, plant-microbe interaction, genomics, bioinformatics, and molecular evolution.
Molecular Plant publishes original research articles, reviews, Correspondence, and Spotlights on the most important developments in plant biology.