Shuai Liu , Xiao Yan , Xiao Guo , Shun Qi , Huaning Wang , Xiangyu Chang
{"title":"Federated Bayesian network learning from multi-site data","authors":"Shuai Liu , Xiao Yan , Xiao Guo , Shun Qi , Huaning Wang , Xiangyu Chang","doi":"10.1016/j.jbi.2025.104784","DOIUrl":"10.1016/j.jbi.2025.104784","url":null,"abstract":"<div><h3>Objective:</h3><div>Identifying functional connectivity biomarkers of major depressive disorder (MDD) patients is essential to advance the understanding of disorder mechanisms and early intervention. Multi-site data arise naturally and could enhance the statistical power of single-site-based methods. However, the main concerns are inter-site heterogeneity and data-sharing barriers between sites. Our objective is to overcome these barriers to learn multiple Bayesian networks (BNs) from rs-fMRI data.</div></div><div><h3>Methods:</h3><div>We propose a federated joint estimator and the corresponding optimization algorithm, called NOTEARS-PFL. Specifically, we incorporate both shared and site-specific information into NOTEARS-PFL by utilizing the sparse group lasso penalty. To address the data-sharing constraint, we develop an alternating direction method of multipliers (ADMM) procedure for the optimization of NOTEARS-PFL. This entails processing neuroimaging data locally at each site, followed by the transmission of the learned network structures for central global updates.</div></div><div><h3>Results:</h3><div>The effectiveness and accuracy of the NOTEARS-PFL method are validated through its application to both synthetic and real-world multi-site resting-state functional magnetic resonance imaging (rs-fMRI) datasets. This demonstrates its superior efficiency and precision in comparison to alternative approaches.</div></div><div><h3>Conclusion:</h3><div>We proposed a toolbox called NOTEARS-PFL to learn the heterogeneous brain functional connectivity in MDD patients efficiently from multi-site data under the data-sharing constraint. 
The comprehensive experiments on both synthetic data and real-world multi-site rs-fMRI datasets with MDD highlight the excellent efficacy of our proposed method.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"163 ","pages":"Article 104784"},"PeriodicalIF":4.0,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143255682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Naimin Jing , Yiwen Lu , Jiayi Tong , James Weaver , Patrick Ryan , Hua Xu , Yong Chen
{"title":"Evaluating the Bias, type I error and statistical power of the prior Knowledge-Guided integrated likelihood estimation (PIE) for bias reduction in EHR based association studies","authors":"Naimin Jing , Yiwen Lu , Jiayi Tong , James Weaver , Patrick Ryan , Hua Xu , Yong Chen","doi":"10.1016/j.jbi.2025.104787","DOIUrl":"10.1016/j.jbi.2025.104787","url":null,"abstract":"<div><h3>Objectives</h3><div>Binary outcomes in electronic health records (EHR) derived using automated phenotype algorithms may suffer from phenotyping error, resulting in bias in association estimation. Huang et al. <span><span>[1]</span></span> proposed the Prior Knowledge-Guided Integrated Likelihood Estimation (PIE) method to mitigate the estimation bias; however, their investigation focused on point estimation without statistical inference, and the evaluation of PIE therein using simulation was a proof-of-concept with only a limited scope of scenarios. This study aims to comprehensively assess PIE’s performance including (1) how well PIE performs under a wide spectrum of operating characteristics of phenotyping algorithms under real-world scenarios (e.g., low prevalence, low sensitivity, high specificity); (2) beyond point estimation, how much variation of the PIE estimator was introduced by the prior distribution; and (3) from a hypothesis testing point of view, whether PIE improves type I error and statistical power relative to the naïve method (i.e., ignoring the phenotyping error).</div></div><div><h3>Methods</h3><div>Synthetic data and use-case analysis were utilized to evaluate PIE. The synthetic data were generated under diverse outcome prevalence, phenotyping algorithm sensitivity, and association effect sizes. Simulation studies compared PIE under different prior distributions with the naïve method, assessing bias, variance, type I error, and power. 
Use-case analysis compared the performance of PIE and the naïve method in estimating the association of multiple predictors with COVID-19 infection.</div></div><div><h3>Results</h3><div>PIE exhibited reduced bias compared to the naïve method across varied simulation settings, with comparable type I error and power. The bias reduction achieved by PIE grew as the effect size increased. PIE performed best when prior distributions aligned closely with true phenotyping algorithm characteristics. The impact of prior quality was minor for low-prevalence outcomes but large for common outcomes. In the use-case analysis, PIE maintained relatively accurate estimates across different scenarios, particularly outperforming the naïve approach under large effect sizes.</div></div><div><h3>Conclusion</h3><div>PIE effectively mitigates estimation bias in a wide spectrum of real-world settings, particularly with accurate prior information. Its main benefit lies in bias reduction rather than hypothesis testing. The impact of the prior is small for low-prevalence outcomes.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"163 ","pages":"Article 104787"},"PeriodicalIF":4.0,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143189348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Source drug combination and Omnidirectional feature fusion approach for predicting Drug-Drug interaction events","authors":"Shiwei Gao, Jingjing Xie, Yizhao Zhao","doi":"10.1016/j.jbi.2025.104772","DOIUrl":"10.1016/j.jbi.2025.104772","url":null,"abstract":"<div><h3>Background</h3><div>In the medical context where polypharmacy is increasingly common, accurately predicting drug-drug interactions (DDIs) is necessary for enhancing clinical medication safety and personalized treatment. Despite progress in identifying potential DDIs, understanding of the underlying mechanisms of DDIs remains limited, constraining the rapid development and clinical application of new drugs.</div></div><div><h3>Methods</h3><div>This study introduces a novel multimodal drug-drug interaction (MMDDI) model based on multi-source drug data and comprehensive feature fusion techniques, aiming to improve the accuracy and depth of DDI prediction. We utilized the real-world DrugBank dataset, which contains rich drug information. Our task was to predict multiple interaction events between drug pairs and analyze the underlying mechanisms of these interactions. The MMDDI model achieves precise predictions through four key stages: feature extraction, drug pairing strategy, fusion network, and multi-source feature integration. We employed advanced data fusion techniques and machine learning algorithms for multidimensional analysis of drug features and interaction events.</div></div><div><h3>Results</h3><div>The MMDDI model was comprehensively evaluated on three representative prediction tasks. Experimental results demonstrated that the MMDDI model outperforms existing technologies in terms of predictive accuracy, generalization ability, and interpretability. Specifically, the MMDDI model achieved an accuracy of 93% on the test set, and the area under the ROC curve (AUC-ROC) reached 0.9505, showing excellent predictive performance. 
Furthermore, the model’s interpretability analysis revealed the complex relationships between drug features and interaction mechanisms, providing new insights for clinical medication decisions.</div></div><div><h3>Conclusion</h3><div>The MMDDI model not only improves the accuracy of DDI prediction but also provides significant scientific support for clinical medication safety and drug development by deeply analyzing the mechanisms of drug interactions. These findings have the potential to improve patient medication outcomes and contribute to the development of personalized medicine.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104772"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143006081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Valley-Forecast: Forecasting Coccidioidomycosis incidence via enhanced LSTM models trained on comprehensive meteorological data","authors":"Leif Huender , Mary Everett , John Shovic","doi":"10.1016/j.jbi.2025.104774","DOIUrl":"10.1016/j.jbi.2025.104774","url":null,"abstract":"<div><div>Coccidioidomycosis (cocci), more commonly known as Valley Fever, is a fungal infection caused by Coccidioides species that poses a significant public health challenge, particularly in the semi-arid regions of the Americas, with notable prevalence in California and Arizona. Previous epidemiological studies have established a correlation between cocci incidence and regional weather patterns, indicating that climatic factors influence the fungus’s life cycle and subsequent disease transmission. This study hypothesizes that Long Short-Term Memory (LSTM) and extended Long Short-Term Memory (xLSTM) models, known for their ability to capture long-term dependencies in time-series data, can outperform traditional statistical methods in predicting cocci outbreak cases. Our research analyzed daily meteorological features from 2001 to 2022 across 48 counties in California, covering diverse microclimates and cocci incidence. The study evaluated 846 LSTM models and 176 xLSTM models with various fine-tuning metrics. To ensure the reliability of our results, these advanced neural network architectures were compared against Baseline Regression and Multi-Layer Perceptron (MLP) models, providing a comprehensive comparative framework. We found that LSTM-type architectures outperform traditional methods, with xLSTM achieving the lowest test RMSE of 282.98 (95% CI: 259.2-306.8) compared to the baseline’s 468.51 (95% CI: 458.2-478.8), demonstrating a reduction of 39.60% in prediction error. 
While both LSTM (283.50, 95% CI: 259.7-307.3) and MLP (293.14, 95% CI: 268.3-318.0) also showed substantial improvements over the baseline, the overlapping confidence intervals suggest similar predictive capabilities among the advanced models. This improvement in predictive capability suggests a strong correlation between temporal microclimatic variations and regional cocci incidences. The increased predictive power of these models has significant public health implications, potentially informing strategies for cocci outbreak prevention and control. Moreover, this study represents the first application of the novel xLSTM architecture in epidemiological research and pioneers the evaluation of modern machine learning methods’ accuracy in predicting cocci outbreaks. These findings contribute to the ongoing efforts to address cocci, offering a new approach to understanding and potentially mitigating the impact of the disease in affected regions.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104774"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143006147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shixu Lin , Lucas Garay , Yining Hua , Zhijiang Guo , Wanxin Li , Minghui Li , Yujie Zhang , Xiaolin Xu , Jie Yang
{"title":"Analysis of longitudinal social media for monitoring symptoms during a pandemic","authors":"Shixu Lin , Lucas Garay , Yining Hua , Zhijiang Guo , Wanxin Li , Minghui Li , Yujie Zhang , Xiaolin Xu , Jie Yang","doi":"10.1016/j.jbi.2025.104778","DOIUrl":"10.1016/j.jbi.2025.104778","url":null,"abstract":"<div><h3>Objective</h3><div>Current studies leveraging social media data for disease monitoring face challenges like noisy colloquial language and insufficient tracking of user disease progression in longitudinal data settings. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic.</div></div><div><h3>Materials and methods</h3><div>The pipeline begins by screening COVID-19 cases from tweets spanning February 1, 2020, to April 30, 2022. Longitudinal data is collected for each patient, two months before and three months after self-reporting. Symptoms are extracted using Named Entity Recognition (NER), followed by denoising with a combination of Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT) models to retain only User-experienced Symptom Mentions (USM). Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, this study conducts symptom pattern analysis and visualization to illustrate temporal changes in symptom prevalence and co-occurrence.</div></div><div><h3>Results</h3><div>This study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-experienced mentions. The trained USM model achieved an average F1 score of 0.927. 
Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period compared to the Delta and Wild-type periods. Additionally, there was a pronounced co-occurrence of lower respiratory tract and nervous system symptoms in the Wild-type strain and Delta variant.</div></div><div><h3>Conclusion</h3><div>This study established a robust framework for analyzing longitudinal social media data to monitor symptoms during a pandemic. By integrating denoising of user-experienced symptom mentions, our findings reveal the duration of different symptoms over time and by variant within a cohort of nearly 200,000 patients, providing critical insights into symptom trends that are often difficult to capture through traditional data sources.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104778"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143006056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Wen , Hao Xue , Everett Rush , Vidul A. Panickan , Tianrun Cai , Doudou Zhou , Yuk-Lam Ho , Lauren Costa , Edmon Begoli , Chuan Hong , J. Michael Gaziano , Kelly Cho , Katherine P. Liao , Junwei Lu , Tianxi Cai
{"title":"DOME: Directional medical embedding vectors from Electronic Health Records","authors":"Jun Wen , Hao Xue , Everett Rush , Vidul A. Panickan , Tianrun Cai , Doudou Zhou , Yuk-Lam Ho , Lauren Costa , Edmon Begoli , Chuan Hong , J. Michael Gaziano , Kelly Cho , Katherine P. Liao , Junwei Lu , Tianxi Cai","doi":"10.1016/j.jbi.2024.104768","DOIUrl":"10.1016/j.jbi.2024.104768","url":null,"abstract":"<div><h3>Motivation:</h3><div>The increasing availability of Electronic Health Record (EHR) systems has created enormous potential for translational research. Recent developments in representation learning techniques have led to effective large-scale representations of EHR concepts along with knowledge graphs that empower downstream EHR studies. However, most existing methods require training with patient-level data, limiting their abilities to expand the training with multi-institutional EHR data. On the other hand, scalable approaches that only require summary-level data do not incorporate temporal dependencies between concepts.</div></div><div><h3>Methods:</h3><div>We introduce a DirectiOnal Medical Embedding (DOME) algorithm to encode temporally directional relationships between medical concepts, using summary-level EHR data. Specifically, DOME first aggregates patient-level EHR data into an asymmetric co-occurrence matrix. Then it computes two Positive Pointwise Mutual Information (PPMI) matrices to correspondingly encode the pairwise prior and posterior dependencies between medical concepts. Following that, a joint matrix factorization is performed on the two PPMI matrices, which results in three vectors for each concept: a semantic embedding and two directional context embeddings. They collectively provide a comprehensive depiction of the temporal relationship between EHR concepts.</div></div><div><h3>Results:</h3><div>We highlight the advantages and translational potential of DOME through three sets of validation studies. 
First, DOME consistently improves on existing direction-agnostic embedding vectors for disease risk prediction in several diseases, for example achieving a relative gain of 5.5% in the area under the receiver operating characteristic curve (AUROC) for lung cancer. Second, DOME excels in directional drug-disease relationship inference by successfully differentiating between drug side effects and indications, achieving relative AUROC gains over state-of-the-art methods of 10.8% and 6.6%, respectively. Finally, DOME effectively constructs directional knowledge graphs, which distinguish disease risk factors from comorbidities, thereby revealing disease progression trajectories. The source code is provided at <span><span>https://github.com/celehs/Directional-EHR-embedding</span></span>.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104768"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142926986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenzhong Liu , Kelong Chen , Shuai Wang , Yijun Xiao , Guobin Zhang
{"title":"Deep learning in surgical process modeling: A systematic review of workflow recognition","authors":"Zhenzhong Liu , Kelong Chen , Shuai Wang , Yijun Xiao , Guobin Zhang","doi":"10.1016/j.jbi.2025.104779","DOIUrl":"10.1016/j.jbi.2025.104779","url":null,"abstract":"<div><div>Objective: The application of artificial intelligence (AI) in health care has led to a surge of interest in surgical process modeling (SPM). The objective of this study is to investigate the role of deep learning in recognizing surgical workflows and extracting reliable patterns from datasets used in minimally invasive surgery, thereby advancing the development of context-aware intelligent systems in endoscopic surgeries. Methods: We conducted a comprehensive search of articles related to SPM from 2018 to April 2024 in the PubMed, Web of Science, Google Scholar, and IEEE Xplore databases. We selected articles on surgical process modeling that used annotated surgical videos and focused on examining the specific methods and research results of each study. Results: The search initially yielded 2937 articles. After filtering on the basis of the relevance of titles, abstracts, and content, 59 articles were selected for full-text review. These studies highlight the widespread adoption of neural networks and transformers for surgical workflow analysis (SWA). They focus on minimally invasive surgeries performed with laparoscopes and microscopes. However, the process of surgical annotation lacks detailed description, and there are significant differences in the annotation process for different surgical procedures. Conclusion: Temporal and spatial sequences are key factors in identifying surgical phases. RNN, TCN, and transformer networks are commonly used to extract long-distance temporal relationships. Multimodal data input is beneficial, as it combines information from surgical instruments. 
However, publicly available datasets often lack clinical knowledge, and establishing large annotated datasets for surgery remains a challenge. To reduce annotation costs, methods such as semi-supervised learning, self-supervised learning, contrastive learning, transfer learning, and active learning are commonly used.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104779"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143006134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling diversity in hospital strategies in city-scale ambulance dispatching with coupled game-theoretic model and discrete-event simulation","authors":"Xinyu Fu , Valeria Krzhizhanovskaya , Alexey Yakovlev , Sergey Kovalchuk","doi":"10.1016/j.jbi.2025.104777","DOIUrl":"10.1016/j.jbi.2025.104777","url":null,"abstract":"<div><div>Optimizing the ambulance dispatching process is important for patients who need early treatment. However, the problem of dynamic ambulance redeployment for destination hospital selection has rarely been investigated. The paper proposes an approach to model and simulate the ambulance dispatching process in multi-agent healthcare environments of large cities. The proposed approach couples a game-theoretic (GT) model, used to identify hospital strategies (considering hospitals as players within a non-cooperative game), with discrete-event simulation (DES) of patient delivery and provision of healthcare services, used to evaluate ambulance dispatching (selection of target hospital). Assuming the collective nature of decisions on patient delivery, the approach assesses the influence of the diverse behaviors of hospitals on system performance with possible further optimization of this performance. The approach is studied through a series of cases starting with a simplified 1D model and proceeding with a coupled 2D model and real-world application. The study considers the problem of dispatching ambulances to patients with Acute Coronary Syndrome (ACS) who are directed to Percutaneous Coronary Intervention (PCI) in the target hospital. 
A real-world case study using data from Saint Petersburg (Russia) shows better conformity of the global characteristics (mortality rate) of the healthcare system when the proposed approach is applied to discover the agents’ diverse behavior.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104777"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143006137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Areej Alhassan , Viktor Schlegel , Monira Aloud , Riza Batista-Navarro , Goran Nenadic
{"title":"Discontinuous named entities in clinical text: A systematic literature review","authors":"Areej Alhassan , Viktor Schlegel , Monira Aloud , Riza Batista-Navarro , Goran Nenadic","doi":"10.1016/j.jbi.2025.104783","DOIUrl":"10.1016/j.jbi.2025.104783","url":null,"abstract":"<div><h3>Objective</h3><div>Extracting named entities from clinical free-text presents unique challenges, particularly when dealing with discontinuous entities—mentions that are separated by unrelated words. Traditional NER methods often struggle to accurately identify these entities, prompting the development of specialised computational solutions. This paper systematically reviews and presents the methodologies developed for Discontinuous Named Entity Recognition in clinical texts, highlighting their effectiveness and the challenges they face.</div></div><div><h3>Method</h3><div>We conducted a systematic literature review focused on discontinuous named entities, using structured searches across four Computer Science-related and one medical-related electronic database. A combination of search terms, grouped into three synonym categories—problem, entity/approach, and task—yielded 2,442 articles. Guided by our research objectives, we identified five key dimensions to systematically annotate and normalise the data for comprehensive analysis.</div></div><div><h3>Result</h3><div>The review included 44 studies which were coded across several key dimensions: the chronological development of approaches, the corpora used, the downstream tasks affected by discontinuous named entities, the methodological approaches proposed to address the issue, and the reported performance outcomes. 
The discussion section examines the challenges encountered in this area and suggests potential directions for future research.</div></div><div><h3>Conclusion</h3><div>Significant progress has been made in discontinuous named entity recognition; however, there remains a need for more adaptable, generalisable solutions that are independent of custom annotation schemes. Exploring various configurations of generative language models presents a promising avenue for advancing this area. Additionally, future research should investigate the impact of precise versus imprecise recognition of discontinuous entities on clinical downstream tasks to better understand its practical implications in healthcare applications.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104783"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143038921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziming Gan , Doudou Zhou , Everett Rush , Vidul A. Panickan , Yuk-Lam Ho , George Ostrouchovm , Zhiwei Xu , Shuting Shen , Xin Xiong , Kimberly F. Greco , Chuan Hong , Clara-Lea Bonzel , Jun Wen , Lauren Costa , Tianrun Cai , Edmon Begoli , Zongqi Xia , J. Michael Gaziano , Katherine P. Liao , Kelly Cho , Junwei Lu
{"title":"ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis","authors":"Ziming Gan , Doudou Zhou , Everett Rush , Vidul A. Panickan , Yuk-Lam Ho , George Ostrouchovm , Zhiwei Xu , Shuting Shen , Xin Xiong , Kimberly F. Greco , Chuan Hong , Clara-Lea Bonzel , Jun Wen , Lauren Costa , Tianrun Cai , Edmon Begoli , Zongqi Xia , J. Michael Gaziano , Katherine P. Liao , Kelly Cho , Junwei Lu","doi":"10.1016/j.jbi.2024.104761","DOIUrl":"10.1016/j.jbi.2024.104761","url":null,"abstract":"<div><h3>Objective:</h3><div>Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient <strong>A</strong>ggregated na<strong>R</strong>rative <strong>C</strong>odified <strong>H</strong>ealth (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.</div></div><div><h3>Methods:</h3><div>Using data from 12.5 million Veterans Affairs patients, ARCH first derives embedding vectors and generates similarities along with associated <span><math><mi>p</mi></math></span>-values to measure the strength of relatedness between clinical features with statistical certainty quantification. Next, ARCH performs a sparse embedding regression to remove indirect linkage between features to build a sparse KG. Finally, ARCH was validated on various clinical tasks, including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer’s disease patients.</div></div><div><h3>Results:</h3><div>ARCH produces high-quality clinical embeddings and KG for over 60,000 codified and narrative EHR concepts. 
The KG and embeddings are visualized in the R-shiny powered web-API. ARCH achieved high accuracy in detecting EHR concept relationships, with AUCs of 0.926 (codified) and 0.861 (NLP) for similar EHR concepts, and 0.810 (codified) and 0.843 (NLP) for related pairs. It detected drug side effects with a 0.723 AUC, which improved to 0.826 after fine-tuning. Using both codified and NLP features, the detection power increased significantly. Compared to other methods, ARCH has superior accuracy and enhances weakly supervised phenotyping algorithms’ performance. Notably, it successfully categorized Alzheimer’s patients into two subgroups with varying mortality rates.</div></div><div><h3>Conclusion:</h3><div>The proposed ARCH algorithm generates large-scale high-quality semantic representations and a knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"162 ","pages":"Article 104761"},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143038917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}