Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity.

IF 3.6, Zone 4 (Biology), Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation Pub Date : 2025-01-18 DOI:10.1093/database/baaf063

Daniela Raciti, Kimberly M Van Auken, Valerio Arnaboldi, Christopher J Tabone, Hans-Michael Muller, Paul W Sternberg


Citations: 0

Abstract

Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labour-intensive and thus high-performing machine learning (ML) methods that improve biocuration efficiency are needed. Here, we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either (i) fully curatable, (ii) fully and partially curatable, or (iii) all language-related. We evaluated various ML models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, resulting in F1 performance scores ranging from 0.89 to 0.99 depending upon the task. Moreover, our inter-annotator agreement analyses and curator timing exercises demonstrated that curators readily converged on classification of high-quality training sentences that take a relatively short period of time to collect, making expansion of this approach to other data types a realistic addition to existing biocuration workflows. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.
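As an illustration only, the three nested tasks and the F1 metric mentioned in the abstract can be sketched as follows. The category names, the `TASK_POSITIVES` mapping, and the helper functions are hypothetical and not taken from the paper. The sketch assumes each task is a binary classification whose positive class widens from task (i) to task (iii), which is one reading of the abstract rather than the authors' confirmed setup.

```python
from typing import Dict, List, Set

# Hypothetical sentence categories (names are assumptions, not from the paper).
FULLY = "fully_curatable"
PARTIALLY = "partially_curatable"
RELATED = "language_related"
OTHER = "other"

# Assumed reading of the three tasks: each task widens the positive class.
TASK_POSITIVES: Dict[int, Set[str]] = {
    1: {FULLY},                      # (i) fully curatable
    2: {FULLY, PARTIALLY},           # (ii) fully and partially curatable
    3: {FULLY, PARTIALLY, RELATED},  # (iii) all language-related
}


def binarize(categories: List[str], task: int) -> List[bool]:
    """Map per-sentence categories to binary labels for one task."""
    positives = TASK_POSITIVES[task]
    return [category in positives for category in categories]


def f1_score(gold: List[bool], predicted: List[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy annotations, invented for the example.
    gold_categories = [FULLY, PARTIALLY, RELATED, OTHER]
    predicted_categories = [FULLY, OTHER, RELATED, RELATED]
    for task in (1, 2, 3):
        gold = binarize(gold_categories, task)
        predicted = binarize(predicted_categories, task)
        print(task, round(f1_score(gold, predicted), 2))
```

Under this reading, the same annotated sentence can be negative for task (i) and positive for task (iii), so one set of curator annotations could serve all three classifiers.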


Source journal

Database: The Journal of Biological Databases and Curation (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 9.00

Self-citation rate: 3.40%

Publication volume: 100

Review time: >12 weeks

About the journal: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.