Annotation of biological samples data to standard ontologies with support from large language models.

IF 4.4 · Zone 2 (Biology) · JCR Q2 · BIOCHEMISTRY & MOLECULAR BIOLOGY

Computational and structural biotechnology journal Pub Date : 2025-05-26 eCollection Date: 2025-01-01 DOI:10.1016/j.csbj.2025.05.020

Andrea Riquelme-García, Juan Mulero-Hernández, Jesualdo Tomás Fernández-Breis

Volume 27, pages 2155-2167. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12162076/pdf/

Citation count: 0

Abstract

The semantic integration of biological data is hindered by the vast heterogeneity of data sources and their limited semantic formalization. A crucial step in this process is mapping data elements to ontological concepts, which typically involves substantial manual effort. Large Language Models (LLMs) have demonstrated potential in automating complex language-related tasks and may offer a solution to streamline biological data annotation. This study investigates the utility of LLMs, specifically various base and fine-tuned GPT models, for the automatic assignment of ontological identifiers to biological sample labels. We evaluated model performance in annotating labels to four widely used ontologies: the Cell Line Ontology (CLO), Cell Ontology (CL), Uber-anatomy Ontology (UBERON), and BRENDA Tissue Ontology (BTO). Our dataset was compiled from publicly available, high-quality databases containing biologically relevant sequence information, which suffers from inconsistent annotation practices, complicating integrative analyses. Model outputs were compared against annotations generated by text2term, a state-of-the-art annotation tool. The fine-tuned GPT model outperformed both the base models and text2term in annotating cell lines and cell types, particularly for the CL and UBERON ontologies, achieving a precision of 47-64% and a recall of 88-97%. In contrast, base models exhibited significantly lower performance. These results suggest that fine-tuned LLMs can accelerate and improve the accuracy of biological data annotation. Nonetheless, our evaluation highlights persistent challenges, including variable precision across ontology categories and the continued need for expert curation to ensure annotation validity.
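
To make the reported precision and recall figures concrete, the following is a minimal sketch of how such metrics could be computed when each sample label is mapped to a set of ontology identifiers. The abstract does not describe the exact evaluation protocol, so the set-based, micro-averaged scoring used here is an assumption, and the function name `precision_recall`, the sample labels, and the identifiers are illustrative only, not taken from the study's dataset.

```python
from typing import Dict, Set, Tuple


def precision_recall(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (label, ontology ID) pairs.

    Assumption: a predicted ID counts as correct only if it exactly matches
    one of the reference IDs for the same sample label.
    """
    tp = fp = fn = 0
    for label in set(predicted) | set(gold):
        pred_ids = predicted.get(label, set())
        gold_ids = gold.get(label, set())
        tp += len(pred_ids & gold_ids)   # correct identifiers
        fp += len(pred_ids - gold_ids)   # proposed but not in the reference
        fn += len(gold_ids - pred_ids)   # in the reference but not proposed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical reference annotations (illustrative labels and IDs).
gold = {
    "T cell": {"CL:0000084"},
    "liver": {"UBERON:0002107"},
    "B cell": {"CL:0000236"},
}

# Hypothetical model output: every reference ID is found (high recall),
# but two extra candidates are proposed (lower precision).
predicted = {
    "T cell": {"CL:0000084", "CL:0000542"},
    "liver": {"UBERON:0002107"},
    "B cell": {"CL:0000236", "CL:0000785"},
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the annotator recovers every reference identifier but also proposes extra candidates, which mirrors the high-recall, moderate-precision pattern described in the abstract and illustrates why expert curation of the proposed identifiers remains necessary.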

Source journal

Computational and structural biotechnology journal (Biochemistry, Genetics and Molecular Biology: Biophysics)

CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks

Journal description: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules; Structure and function of multi-component complexes; Protein folding, processing and degradation; Enzymology; Computational and structural studies of plant systems; Microbial Informatics; Genomics; Proteomics; Metabolomics; Algorithms and Hypothesis in Bioinformatics; Mathematical and Theoretical Biology; Computational Chemistry and Drug Discovery; Microscopy and Molecular Imaging; Nanotechnology; Systems and Synthetic Biology.