Biomedical literature-based clinical phenotype definition discovery using large language models.

IF 3.6 · Zone 4 (Biology) · JCR Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation Pub Date : 2025-01-18 DOI:10.1093/database/baaf047

Samar Binkheder, Xiaofu Liu, Michael Wu, Lei Wang, Aditi Shendre, Sara K Quinney, Wei-Qi Wei, Lang Li



Abstract

Electronic health record (EHR) phenotyping is in high demand because most phenotypes are not readily defined. The objective of this study was to develop an effective text-mining approach that automatically extracts sentences related to clinical phenotype definitions from the biomedical literature. We developed abstract-level and full-text sentence-level classifiers for clinical phenotype discovery from PubMed. For the abstract-level classifier, we compared four machine learning algorithms: support vector machine (SVM), logistic regression (LR), naïve Bayes, and decision tree. The SVM classifier performed best (F-measure = 98%) in identifying clinical phenotype-relevant abstracts, and it predicted 459,406 clinical phenotype-related abstracts. For the full-text sentence-level classifier, we compared SVM, LR, naïve Bayes, decision tree, convolutional neural network, Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) models. The BioBERT model performed best among the full-text sentence-level classifiers (F-measure = 91%). We used these two best-performing classifiers for large-scale screening of the PubMed database, starting with abstract retrieval and followed by prediction of clinical phenotype-related sentences from full texts. The large-scale screening predicted over two million clinical phenotype-related sentences. Lastly, we built a knowledgebase from the positively predicted sentences, which allows users to query clinical phenotype-related sentences with a phenotype term of interest. The Clinical Phenotype Knowledgebase (CliPheKB) enables users to search for clinical phenotype terms and retrieve sentences related to a specific clinical phenotype of interest (https://cliphekb.shinyapps.io/phenotype-main/).
Building upon prior methods, we developed a text-mining pipeline to automatically extract clinical phenotype definition-related sentences from the literature. This high-throughput phenotyping approach is generalizable and scalable, and it is complementary to existing EHR phenotyping methods.
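The two-stage screening described in the abstract (filter abstracts first, then classify sentences from the full texts of the retained articles, then query the positive sentences by phenotype term) can be illustrated with a minimal Python sketch. This is not the authors' implementation: simple keyword rules stand in for the trained SVM and BioBERT classifiers, and the cue words, function names, record fields, and toy articles are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical cue words; they stand in for the trained classifiers.
CUE_WORDS = ("defined as", "identified by", "icd-9", "icd-10")


def mentions_cue(text: str) -> bool:
    """Stand-in classifier: true if the text contains any cue word."""
    lowered = text.lower()
    return any(cue in lowered for cue in CUE_WORDS)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def screen(
    articles: List[Dict[str, str]],
    abstract_clf: Callable[[str], bool] = mentions_cue,
    sentence_clf: Callable[[str], bool] = mentions_cue,
) -> List[Dict[str, str]]:
    """Stage 1 filters abstracts; stage 2 filters full-text sentences."""
    knowledgebase = []
    for article in articles:
        if not abstract_clf(article["abstract"]):
            continue  # full text is never read for rejected abstracts
        for sentence in split_sentences(article["full_text"]):
            if sentence_clf(sentence):
                knowledgebase.append(
                    {"pmid": article["pmid"], "sentence": sentence}
                )
    return knowledgebase


def query(knowledgebase: List[Dict[str, str]], term: str) -> List[Dict[str, str]]:
    """Return stored sentences that mention the phenotype term (case-insensitive)."""
    needle = term.lower()
    return [row for row in knowledgebase if needle in row["sentence"].lower()]


# Toy records (invented for illustration, not real PubMed data).
ARTICLES = [
    {
        "pmid": "toy-1",
        "abstract": "Type 2 diabetes cases were identified by ICD-9 codes in the EHR.",
        "full_text": (
            "We studied an EHR cohort. "
            "Type 2 diabetes was defined as two or more ICD-9 250.x codes. "
            "Patients were followed for five years."
        ),
    },
    {
        "pmid": "toy-2",
        "abstract": "We report the crystal structure of a bacterial enzyme.",
        "full_text": "The enzyme was defined as a hydrolase. Crystals diffracted well.",
    },
]

if __name__ == "__main__":
    kb = screen(ARTICLES)
    for row in query(kb, "diabetes"):
        print(row["pmid"], "|", row["sentence"])
```

Passing the classifiers as callables keeps the two stages independent, so a trained model's prediction function could replace either keyword rule without changing the screening loop. The second toy article shows why the order matters: its full text contains a cue phrase, but it is never examined because the abstract is rejected in stage 1.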




Source journal

Database: The Journal of Biological Databases and Curation (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 9.00
Self-citation rate: 3.40%
Articles published: 100
Review time: >12 weeks

Journal description: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.