Journal of Biomedical Informatics: latest articles

HEART: Learning better representation of EHR data with a heterogeneous relation-aware transformer
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104741
Tinglin Huang , Syed Asad Rizvi , Rohan Krishna Thakur , Vimig Socrates , Meili Gupta , David van Dijk , R. Andrew Taylor , Rex Ying
Abstract
Objective: Pretrained language models have recently demonstrated their effectiveness in modeling Electronic Health Record (EHR) data by modeling the encounters of patients as sentences. However, existing methods fall short of utilizing the inherent heterogeneous correlations between medical entities, which include diagnoses, medications, procedures, and lab tests. Existing studies either focus merely on diagnosis entities or encode different entities in a homogeneous space, leading to suboptimal performance. Motivated by this, we aim to develop a foundational language model pre-trained on EHR data that explicitly incorporates the heterogeneous correlations among these entities.
Methods: In this study, we propose HEART, a heterogeneous relation-aware transformer for EHR. Our model includes a range of heterogeneous entities within each input sequence and represents pairwise relationships between entities as a relation embedding. Such a higher-order representation allows the model to perform complex reasoning and derive attention weights in the heterogeneous context. Additionally, a multi-level attention scheme is employed to exploit the connection between different encounters while alleviating the high computational costs. For pretraining, HEART engages with two tasks, missing entity prediction and anomaly detection, which both effectively enhance the model’s performance on various downstream tasks.
Results: Extensive experiments on two EHR datasets and five downstream tasks demonstrate HEART’s superior performance compared to four SOTA foundation models. For instance, HEART achieves improvements of 12.1% and 4.1% over Med-BERT in death and readmission prediction, respectively. Additionally, case studies show that HEART offers interpretable insights into the relationships between entities through the learned relation embeddings.
Conclusion: We study the problem of EHR representation learning and propose HEART, a model that leverages the heterogeneous relationships between medical entities. Our approach includes a multi-level encoding scheme and two specialized pretraining objectives, designed to boost both the efficiency and effectiveness of the model. We have comprehensively evaluated HEART across five clinically significant downstream tasks using two EHR datasets. The experimental results verify the model’s strong performance and validate its practical utility in healthcare applications.
Code: https://github.com/Graph-and-Geometric-Learning/HEART
Volume 159, Article 104741
Citations: 0
Algorithms for evaluation of minimal cut sets
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104740
Marcin Radom , Agnieszka Rybarczyk , Igor Piekarz , Piotr Formanowicz
Abstract
Objective: We propose a way to enhance the evaluation of minimal cut sets (MCSs) in biological systems modeled by Petri nets, by providing criteria and methodology for determining their optimality in disabling specific processes without affecting critical system components.
Methods: This study uses Petri nets to model biological systems and applies two primary approaches to MCS evaluation. The first analyzes the impact on t-invariants to identify structural dependencies. The second assesses the impact on transitions that may be starved when specific MCSs are inactive; this approach deals with net dynamics. These methodologies aim to offer practical tools for assessing the quality and effectiveness of MCSs.
Results: The proposed methodologies were applied to two case studies. In the first case, a cholesterol metabolism network was analyzed to investigate how local inflammation and oxidative stress, in conjunction with cholesterol imbalances, influence the progression of atherosclerosis. The MCSs were ranked, with the top sets presented, focusing on those that disabled the fewest t-invariants. In the second case, a carbohydrate metabolism disorder model was examined to understand its impact on atherosclerosis progression. The analysis aimed to identify MCSs that could inhibit the atherosclerosis process by targeting specific transitions. Both studies utilized the Holmes software for calculations, demonstrating the effectiveness of the proposed evaluation methodologies in ranking MCSs for practical biological applications.
Conclusion: The algorithms proposed in this paper offer an analytical approach for evaluating the quality of MCSs in biological systems. By providing criteria for MCS optimality, these approaches have the potential to enhance the utility of MCS analysis in systems biology, aiding in the understanding and manipulation of complex biological networks.
The algorithms are implemented within the Holmes software, an open-source project available at https://github.com/bszawulak/HolmesPN.
Volume 159, Article 104740
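The first ranking criterion in the abstract above (prefer cut sets that disable the fewest t-invariants) can be illustrated with a small sketch. It is not the Holmes implementation. It assumes that a t-invariant counts as disabled as soon as any transition in its support belongs to the cut set, and the function names and the toy transition labels used in the usage note are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

Transitions = FrozenSet[str]


def disabled_invariants(cut_set: Transitions, t_invariants: Sequence[Transitions]) -> int:
    """Count t-invariants whose support shares at least one transition with the cut set.

    Assumption: knocking out any transition of a t-invariant's support disables it.
    """
    return sum(1 for support in t_invariants if support & cut_set)


def rank_cut_sets(
    cut_sets: Sequence[Transitions], t_invariants: Sequence[Transitions]
) -> List[Tuple[int, List[str]]]:
    """Return (disabled count, sorted transitions) pairs, fewest disabled first."""
    scored = [(disabled_invariants(cs, t_invariants), sorted(cs)) for cs in cut_sets]
    return sorted(scored)
```

With toy supports `{t1, t2}`, `{t2, t3}`, `{t4}` and candidate cut sets `{t2}`, `{t1}`, `{t1, t4}`, the cut set `{t1}` ranks first because it touches only one invariant support. Ties are broken alphabetically here, which is an arbitrary choice.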
Citations: 0
Call for Papers: Data Generation in Healthcare Environments
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104742
Ricardo Cardoso Pereira (Guest Editors) , Pedro Pereira Rodrigues , Irina Sousa Moreira , Pedro Henriques Abreu (Managing Guest Editor)
Volume 159, Article 104742 (no abstract provided)
Citations: 0
PLRTE: Progressive learning for biomedical relation triplet extraction using large language models
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104738
Yi-Kai Zheng , Bi Zeng , Yi-Chun Feng , Lu Zhou , Yi-Xue Li
Abstract
Document-level relation triplet extraction is crucial in biomedical text mining, aiding in drug discovery and the construction of biomedical knowledge graphs. Current language models face challenges in generalizing to unseen datasets and relation types in biomedical relation triplet extraction, which limits their effectiveness in these crucial tasks. To address this challenge, our study optimizes models along two critical dimensions, data-task relevance and granularity of relations, aiming to enhance their generalization capabilities significantly. We introduce a novel progressive learning strategy to obtain the PLRTE model. This strategy not only enhances the model’s capability to comprehend diverse relation types in the biomedical domain but also implements a structured four-level progressive learning process through semantic relation augmentation, compositional instruction, and dual-axis level learning. Our experiments on the DDI and BC5CDR document-level biomedical relation triplet datasets demonstrate a significant performance improvement of 5% to 20% over the current state-of-the-art baselines. Furthermore, our model exhibits exceptional generalization capabilities on the unseen Chemprot and GDA datasets, further validating the effectiveness of optimizing data-task association and relation granularity for enhancing model generalizability.
Volume 159, Article 104738
Citations: 0
Adapting the open-source Gen3 platform and kubernetes for the NIH HEAL IMPOWR and MIRHIQL clinical trial data commons: Customization, cloud transition, and optimization
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104749
Meredith C.B. Adams , Colin Griffin , Hunter Adams , Stephen Bryant , Robert W. Hurley , Umit Topaloglu
Abstract
Objective: This study aims to provide the decision-making framework, strategies, and software used to successfully deploy the first combined chronic pain and opioid use clinical trial data commons using the Gen3 platform.
Materials and Methods: The approach involved adapting the open-source Gen3 platform and Kubernetes for the needs of the NIH HEAL IMPOWR and MIRHIQL networks. Key steps included customizing the Gen3 architecture, transitioning from Amazon to Google Cloud, adapting data ingestion and harmonization processes, ensuring security and compliance for the Kubernetes environment, and optimizing performance and user experience.
Results: The primary result was a fully operational IMPOWR data commons built on Gen3. Key features include a modular architecture supporting diverse clinical trial data types, automated processes for data management, fine-grained access control and auditing, and researcher-friendly interfaces for data exploration and analysis.
Discussion: The successful development of the Wake Forest IDEA-CC data commons represents a significant milestone for chronic pain and addiction research. Harmonized, FAIR data from diverse studies can be discovered in a secure, scalable repository. Challenges remain in long-term maintenance and governance, but the commons provides a foundation for accelerating scientific progress. Key lessons learned include the importance of engaging both technical and domain experts, the need for flexible yet robust infrastructure, and the value of building on established open-source platforms.
Conclusion: The WF IDEA-CC Gen3 data commons demonstrates the feasibility and value of developing a shared data infrastructure for chronic pain and opioid use research. The lessons learned can inform similar efforts in other clinical domains.
Volume 159, Article 104749
Citations: 0
A mother-child data linkage approach using data from the information system for the development of research in primary care (SIDIAP) in Catalonia
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-11-01 DOI: 10.1016/j.jbi.2024.104747
E. Segundo, M. Far, C.I. Rodríguez-Casado, J.M. Elorza, J. Carrere-Molina, R. Mallol-Parera, M. Aragón
Abstract
Background: Large-scale clinical databases containing routinely collected electronic health record (EHR) data are a valuable source of information for research studies. For example, they can be used in pharmacoepidemiology studies to evaluate the effects of maternal medication exposure on neonatal and pediatric outcomes. Yet this type of study is infeasible without proper mother–child linkage.
Methods: We leveraged all eligible active records (N = 8,553,321) of the Information System for Research in Primary Care (SIDIAP) database. Mothers and infants were linked using a deterministic approach, and linkage accuracy was evaluated in terms of the number of records from candidate mothers that failed to link. We validated the identified mother–child links by comparing linked and unlinked records for both candidate mothers and descendants. Differences across these two groups were evaluated by means of effect size calculations instead of p-values. Overall, we described our data linkage process following the GUidance for Information about Linking Data sets (GUILD) principles.
Results: We were able to identify 744,763 unique mother–child relationships, linking 83.8% of candidate mothers with delivery dates within a period of 15 years. Of note, we provide a record-level category label used to derive a global confidence metric for the presented linkage process. Our validation analysis showed that the two groups were similar in terms of a number of aggregated attributes.
Conclusions: Complementing the SIDIAP database with mother–child links will allow clinical researchers to expand their epidemiologic studies with the ultimate goal of improving outcomes for pregnant women and their children. Importantly, the reported information at each step of the data linkage process will contribute to the validity of analyses and interpretation of results in future studies using this resource.
Volume 159, Article 104747
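As a rough illustration of what a deterministic linkage step and its "failed to link" accuracy measure can look like, the sketch below joins candidate mothers to infants on an exact key match and reports the share of mothers that linked. The abstract does not state which fields SIDIAP uses, so the key fields (`household_id`, `event_date`) and record layout are purely hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def deterministic_link(
    mothers: Sequence[Record],
    infants: Sequence[Record],
    key: Tuple[str, ...] = ("household_id", "event_date"),  # hypothetical fields
) -> Tuple[List[Tuple[str, str]], List[str], float]:
    """Link records whose key fields match exactly; no fuzzy or probabilistic matching."""
    # Index infants by the exact values of the key fields.
    index: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for infant in infants:
        index[tuple(infant[k] for k in key)].append(infant["infant_id"])

    links: List[Tuple[str, str]] = []
    unlinked: List[str] = []
    for mother in mothers:
        matches = index.get(tuple(mother[k] for k in key), [])
        if matches:
            # One mother may link to several infants (e.g. twins).
            links.extend((mother["mother_id"], infant_id) for infant_id in matches)
        else:
            unlinked.append(mother["mother_id"])

    # Share of candidate mothers with at least one linked infant.
    rate = 1 - len(unlinked) / len(mothers) if mothers else 0.0
    return links, unlinked, rate
```

The returned `unlinked` list is the group one would compare against the linked group when validating the linkage, as the abstract describes.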
Citations: 0
Triple and quadruple optimization for feature selection in cancer biomarker discovery
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-10-11 DOI: 10.1016/j.jbi.2024.104736
L. Cattelani, V. Fortino
Abstract
The proliferation of omics data has advanced cancer biomarker discovery but often falls short in external validation, mainly due to a narrow focus on prediction accuracy that neglects clinical utility and validation feasibility. We introduce three- and four-objective optimization strategies based on genetic algorithms to identify clinically actionable biomarkers in omics studies, addressing classification tasks aimed at distinguishing hard-to-differentiate cancer subtypes beyond histological analysis alone. Our hypothesis is that by optimizing more than one characteristic of cancer biomarkers, we may identify biomarkers that are more likely to succeed in external validation. Our objectives are to: (i) assess the biomarker panel’s accuracy using a machine learning (ML) framework; (ii) ensure the biomarkers exhibit significant fold-changes across subtypes, thereby boosting the success rate of PCR or immunohistochemistry validations; (iii) select a concise set of biomarkers to simplify the validation process and reduce clinical costs; and (iv) identify biomarkers crucial for predicting overall survival, which plays a significant role in determining the prognostic value of cancer subtypes. We implemented and applied triple and quadruple optimization algorithms to renal carcinoma gene expression data from TCGA. The study targets kidney cancer subtypes that are difficult to distinguish through histopathology methods. Selected RNA-seq biomarkers were assessed against the gold standard method, which relies solely on clinical information, and in external microarray-based validation datasets. Notably, these biomarkers achieved an accuracy above 0.8 in external validations and added significant value to survival predictions, outperforming the use of clinical data alone with a superior c-index. The provided tool also helps explore the trade-off between objectives, offering multiple solutions for clinical evaluation before proceeding to costly validation or clinical trials.
Volume 159, Article 104736
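The trade-off between objectives mentioned in the abstract above is typically explored through a set of non-dominated (Pareto-optimal) solutions. The sketch below shows only that filtering step, not the authors' genetic algorithm. It assumes every objective has been oriented so that larger is better (panel size is negated), and the panel names and scores used in the usage note are invented for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, objs in candidates.items()
        if not any(dominates(other, objs) for other in candidates.values())
    ]
```

For example, with objectives ordered as (accuracy, fold-change score, negated panel size), the hypothetical panels `A = (0.85, 2.0, -5)`, `B = (0.80, 1.5, -8)`, `C = (0.90, 1.0, -12)` and `D = (0.70, 3.0, -3)` yield the front `A`, `C`, `D`: panel `B` is dropped because `A` is better on all three objectives.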
Citations: 0
Improving tabular data extraction in scanned laboratory reports using deep learning models
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-10-10 DOI: 10.1016/j.jbi.2024.104735
Yiming Li , Qiang Wei , Xinghan Chen , Jianfu Li , Cui Tao , Hua Xu
Abstract
Objective: Medical laboratory testing is essential in healthcare, providing crucial data for diagnosis and treatment. Nevertheless, patients’ lab testing results are often transferred via fax across healthcare organizations and are not immediately available for timely clinical decision making. Thus, it is important to develop new technologies to accurately extract lab testing information from scanned laboratory reports. This study aims to develop an advanced deep learning-based Optical Character Recognition (OCR) method to identify tables containing lab testing results in scanned laboratory reports.
Methods: Extracting tabular data from scanned lab reports involves two stages: table detection (i.e., identifying the area of a table object) and table recognition (i.e., identifying and extracting tabular structures and contents). The DETR R18 and YOLOv8s models were used for table detection, and we compared the performance of PaddleOCR and the encoder-dual-decoder (EDD) model for table recognition. 650 tables from 632 randomly selected laboratory test reports were annotated and used to train and evaluate those models. For table detection evaluation, we used metrics such as Average Precision (AP), Average Recall (AR), AP50, and AP75. For table recognition evaluation, we employed Tree-Edit-Distance-based Similarity (TEDS).
Results: For table detection, fine-tuned DETR R18 demonstrated superior performance (AP50: 0.774; AP75: 0.644; AP: 0.601; AR: 0.766). In terms of table recognition, fine-tuned EDD outperformed other models with a TEDS score of 0.815. The proposed OCR pipeline (fine-tuned DETR R18 and fine-tuned EDD) achieved a TEDS score of 0.699 and a TEDS structure score of 0.764.
Conclusions: Our study presents a dedicated OCR pipeline for scanned clinical documents, utilizing state-of-the-art deep learning models for region-of-interest detection and table recognition. The high TEDS scores demonstrate the effectiveness of our approach, which has significant implications for clinical data analysis and decision-making.
Volume 159, Article 104735
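For readers unfamiliar with the TEDS scores quoted above, the metric is commonly defined as a similarity derived from the tree edit distance between two tables represented as HTML trees. The notation below is an assumption chosen for illustration: T_a is the predicted table tree, T_b the ground-truth tree, and |T| the number of nodes in a tree.

```latex
\mathrm{TEDS}(T_a, T_b) \;=\; 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\bigl(|T_a|,\, |T_b|\bigr)}
```

A score of 1 means the two trees are identical, and lower values mean more edits are needed. The "TEDS structure" variant reported in the abstract presumably compares table structure only and ignores cell text, which would explain why it is higher than the full score for the same pipeline.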
Citations: 0
Learning to match patients to clinical trials using large language models
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-10-09 DOI: 10.1016/j.jbi.2024.104734
Maciej Rybinski , Wojciech Kusa , Sarvnaz Karimi , Allan Hanbury
Abstract
Objective: This study investigates the use of Large Language Models (LLMs) for matching patients to clinical trials (CTs) within an information retrieval pipeline. Our objective is to enhance the process of patient-trial matching by leveraging the semantic processing capabilities of LLMs, thereby improving the effectiveness of patient recruitment for clinical trials.
Methods: We employed a multi-stage retrieval pipeline integrating various methodologies, including BM25 and Transformer-based rankers, along with LLM-based methods. Our primary datasets were the TREC Clinical Trials 2021–23 track collections. We compared LLM-based approaches, focusing on methods that leverage LLMs in query formulation, filtering, relevance ranking, and re-ranking of CTs.
Results: Our results indicate that LLM-based systems, particularly those involving re-ranking with a fine-tuned LLM, outperform traditional methods in terms of nDCG and Precision measures. The study demonstrates that fine-tuning LLMs enhances their ability to find eligible trials. Moreover, our LLM-based approach is competitive with state-of-the-art systems in the TREC challenges.
The study shows the effectiveness of LLMs in CT matching, highlighting their potential in handling complex semantic analysis and improving patient-trial matching. However, the use of LLMs increases the computational cost and reduces efficiency. We provide a detailed analysis of effectiveness-efficiency trade-offs.
Conclusion: This research demonstrates the promising role of LLMs in enhancing the patient-to-clinical trial matching process, offering a significant advancement in the automation of patient recruitment. Future work should explore optimising the balance between computational cost and retrieval effectiveness in practical applications.
Volume 159, Article 104734
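The nDCG and Precision measures named in the abstract above can be computed from a ranked list of relevance labels. The sketch below uses one common nDCG variant (linear gain with a log2 rank discount) and is not the evaluation code used in the study. It assumes graded integer labels where higher means more relevant, and it computes the ideal ranking from the supplied list only rather than from all judged trials.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG normalised by the DCG of the same labels sorted from best to worst."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def precision_at_k(relevances: Sequence[int], k: int, threshold: int = 1) -> float:
    """Fraction of the top k results whose label is at or above the threshold."""
    return sum(1 for rel in relevances[:k] if rel >= threshold) / k
```

A re-ranker improves nDCG when it moves highly relevant trials towards the top of the list, even if the set of retrieved trials is unchanged, which is why the measure suits the re-ranking comparison described in the abstract.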
Citations: 0
Augmenting biomedical named entity recognition with general-domain resources
IF 4, Zone 2, Medicine
Journal of Biomedical Informatics Pub Date : 2024-10-04 DOI: 10.1016/j.jbi.2024.104731
Yu Yin , Hyunjae Kim , Xiao Xiao , Chih Hsuan Wei , Jaewoo Kang , Zhiyong Lu , Hua Xu , Meng Fang , Qingyu Chen
Abstract
Objective: Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.
Methods: We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.
Results: We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
Conclusion: This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, code, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Volume 159, Article 104731
Citations: 0