基于变压器的生物医学文献分组随机试验分类语言模型:模型开发与验证。

IF 3.1 3区 医学 Q2 MEDICAL INFORMATICS
Elaheh Aghaarabi, David Murray
{"title":"基于变压器的生物医学文献分组随机试验分类语言模型:模型开发与验证。","authors":"Elaheh Aghaarabi, David Murray","doi":"10.2196/63267","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.</p><p><strong>Objective: </strong>Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.</p><p><strong>Methods: </strong>We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institute of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs. The model performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.</p><p><strong>Results: </strong>When our proposed model was tested for generalizability with unseen data, it delivered high sensitivity and specificity for each class as follows: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99), respectively.</p><p><strong>Conclusions: </strong>Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63267"},"PeriodicalIF":3.1000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12148241/pdf/","citationCount":"0","resultStr":"{\"title\":\"Transformer-Based Language Models for Group Randomized Trial Classification in Biomedical Literature: Model Development and Validation.\",\"authors\":\"Elaheh Aghaarabi, David Murray\",\"doi\":\"10.2196/63267\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.</p><p><strong>Objective: </strong>Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.</p><p><strong>Methods: </strong>We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institute of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs. The model performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.</p><p><strong>Results: </strong>When our proposed model was tested for generalizability with unseen data, it delivered high sensitivity and specificity for each class as follows: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99), respectively.</p><p><strong>Conclusions: </strong>Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.</p>\",\"PeriodicalId\":56334,\"journal\":{\"name\":\"JMIR Medical Informatics\",\"volume\":\"13 \",\"pages\":\"e63267\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12148241/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Medical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/63267\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/63267","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0

摘要

背景:对于公共卫生界来说,监测最近发表的文章对于了解最新的研究进展至关重要。然而,从广泛的公共卫生出版物中确定具有特定研究设计的研究出版物是目前可用方法的一个挑战。目的:我们的目标是开发一种微调的预训练语言模型,该模型可以准确识别生物医学文献中使用组或组随机试验(GRT)、单独随机组治疗试验(IRGT)或阶梯楔形组或组随机试验(SWGRT)设计的临床试验出版物。方法:我们使用来自美国国立卫生研究院疾病预防办公室的生物医学文献数据集对生物医学bert语言模型进行了微调。该模型经过训练,将出版物分为三类使用嵌套设计的临床试验。模型性能在未见数据上进行评估,并显示出对每个类别的高灵敏度和特异性。结果:当我们提出的模型用未见数据进行通用性测试时,它对每个类别都具有高灵敏度和特异性,分别为阴性(0.95和0.93)、GRTs(0.94和0.90)、irts(0.81和0.97)和SWGRTs(0.96和0.99)。结论:我们的工作证明了微调的、特定领域的语言模型的潜力,可以准确识别复杂和专业研究设计的出版物,解决公共卫生研究界的关键需求。该模型为公共卫生界提供了一个有价值的工具,可以直接识别使用三类嵌套设计之一的临床试验出版物。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Transformer-Based Language Models for Group Randomized Trial Classification in Biomedical Literature: Model Development and Validation.

Background: For the public health community, monitoring recently published articles is crucial for staying informed about the latest research developments. However, identifying publications about studies with specific research designs from the extensive body of public health publications is a challenge with the currently available methods.

Objective: Our objective is to develop a fine-tuned pretrained language model that can accurately identify publications from clinical trials that use a group- or cluster-randomized trial (GRT), individually randomized group-treatment trial (IRGT), or stepped wedge group- or cluster-randomized trial (SWGRT) design within the biomedical literature.

Methods: We fine-tuned the BioMedBERT language model using a dataset of biomedical literature from the Office of Disease Prevention at the National Institute of Health. The model was trained to classify publications into three categories of clinical trials that use nested designs. The model performance was evaluated on unseen data and demonstrated high sensitivity and specificity for each class.

Results: When our proposed model was tested for generalizability with unseen data, it delivered high sensitivity and specificity for each class as follows: negatives (0.95 and 0.93), GRTs (0.94 and 0.90), IRGTs (0.81 and 0.97), and SWGRTs (0.96 and 0.99), respectively.

Conclusions: Our work demonstrates the potential of fine-tuned, domain-specific language models to accurately identify publications reporting on complex and specialized study designs, addressing a critical need in the public health research community. This model offers a valuable tool for the public health community to directly identify publications from clinical trials that use one of the three classes of nested designs.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
JMIR Medical Informatics
JMIR Medical Informatics Medicine-Health Informatics
CiteScore
7.90
自引率
3.10%
发文量
173
审稿时长
12 weeks
期刊介绍: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (emphasizing more on applications for clinicians and health professionals rather than consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信