Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics Pub Date : 2025-04-15 DOI:10.1016/j.jbi.2025.104825

Fangwen Zhou , Rick Parrish , Muhammad Afzal , Ashirbani Saha , R. Brian Haynes , Alfonso Iorio , Cynthia Lokker

{"title":"Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies","authors":"Fangwen Zhou , Rick Parrish , Muhammad Afzal , Ashirbani Saha , R. Brian Haynes , Alfonso Iorio , Cynthia Lokker","doi":"10.1016/j.jbi.2025.104825","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of the models for classifying the methodological rigor of randomized controlled trials is necessary to identify the more robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics.</div></div><div><h3>Methods</h3><div>Seven transformer-based language models were fine-tuned on the title and abstract of 42,575 articles from 2003 to 2023 in McMaster University’s Premium LiteratUre Service database under different configurations. The studies reported in the articles addressed questions related to treatment, prevention, or quality improvement for which randomized controlled trials are the gold standard with defined criteria for rigorous methods. Models were evaluated on the validation set using 12 schemes and metrics, including optimization for cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy, among others. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested in hold-out and external datasets.</div></div><div><h3>Results</h3><div>A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes. Three BioLinkBERT models outperformed others on 8 of the 12 schemes. BioBERT, BiomedBERT, and SciBERT were best on 1, 1 and 2 schemes, respectively. While model performance remained robust on the hold-out test set, it declined in external datasets. Class weight adjustments improved performance in most instances.</div></div><div><h3>Conclusion</h3><div>BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics and threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternate imbalance strategies, and examine training on full-text articles.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104825"},"PeriodicalIF":4.5000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1532046425000541","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Objective

Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of the models for classifying the methodological rigor of randomized controlled trials is necessary to identify the more robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics.

Methods

Seven transformer-based language models were fine-tuned on the title and abstract of 42,575 articles from 2003 to 2023 in McMaster University’s Premium LiteratUre Service database under different configurations. The studies reported in the articles addressed questions related to treatment, prevention, or quality improvement for which randomized controlled trials are the gold standard with defined criteria for rigorous methods. Models were evaluated on the validation set using 12 schemes and metrics, including optimization for cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy, among others. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested in hold-out and external datasets.

Results

A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes. Three BioLinkBERT models outperformed others on 8 of the 12 schemes. BioBERT, BiomedBERT, and SciBERT were best on 1, 1 and 2 schemes, respectively. While model performance remained robust on the hold-out test set, it declined in external datasets. Class weight adjustments improved performance in most instances.

Conclusion

BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics and threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternate imbalance strategies, and examine training on full-text articles.

Abstract Image

查看原文本刊更多论文

对特定领域的预训练语言模型进行基准测试，以确定临床研究中方法严谨性的最佳模型

目的：仅基于编码的转换器的语言模型在临床文献的自动化批判性评价中显示出前景。然而，有必要对随机对照试验的方法严谨性分类模型进行综合评估，以确定更稳健的模型。本研究使用一组不同的性能指标对几种最先进的基于转换器的语言模型进行基准测试。方法对2003 ~ 2023年麦克马斯特大学优质文献服务数据库中42575篇论文在不同配置下的标题和摘要进行7种基于变换的语言模型微调。文章中报道的研究解决了与治疗、预防或质量改进相关的问题，随机对照试验是严格方法的黄金标准。在验证集上使用12种方案和指标对模型进行评估，包括交叉熵损失优化、Brier评分、AUROC、平均精度、灵敏度、特异性和准确性等。执行阈值调优以优化与阈值相关的指标。在验证集上的一个或多个方案中获得最佳性能的模型在保留和外部数据集中进一步测试。结果共对210个模型进行了微调。6个模型在一个或多个评估方案中获得了最高性能。在12个方案中，有3个BioLinkBERT模型在8个方案中表现优于其他模型。BioBERT、BiomedBERT和SciBERT分别在1、1和2方案上表现最佳。虽然模型性能在保留测试集上保持稳健，但在外部数据集上却有所下降。类权重调整在大多数情况下提高了性能。结论biolinkbert模型总体上优于其他模型。使用全面的评估指标和阈值调优可以优化实际应用程序的模型选择。未来的工作应该评估其他数据集的泛化性，探索替代的失衡策略，并检查全文文章的训练。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Biomedical Informatics 医学-计算机：跨学科应用

CiteScore

8.90

自引率

6.70%

发文量

243

审稿时长

32 days

期刊介绍： The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.