{"title":"PregAN-NET: Addressing Class Imbalance with GANs in Interpretable Computational Framework for Predicting Safety Profile of Drugs Considering Adverse Reactions During Pregnancy","authors":"Anushka Chaurasia , Deepak Kumar , Yogita","doi":"10.1016/j.jbi.2025.104832","DOIUrl":"10.1016/j.jbi.2025.104832","url":null,"abstract":"<div><div>Adverse Drug Reactions (ADRs) during pregnancy pose significant risks to both the mother and the fetus. Conventional approaches to predict ADR are inadequate due to ethical restrictions that prevent performing medication studies in pregnant women, leading to restricted data samples. Hence, computational techniques have been promising for ADR predictions. However, most of these techniques have focused on the general population and face the challenge of class imbalance and lack of model interpretability. In the present work, an ensemble learning-based PregAN-NET framework has been proposed that addresses the issue of class imbalance by generating synthetic data employing Conditional Tabular Generative Adversarial Network (CTGAN) and integrates neural network and gradient boosting as a Boosted Neural Ensemble (BNE) architecture to predict safe and unsafe drugs considering their adverse reactions during pregnancy. Furthermore, the SHAP method has been employed to enhance the post-hoc interpretability of the BNE architecture by analyzing the contribution of different features towards prediction. The proposed framework has been applied to chemical and biological properties from PubChem and DrugBank, along with class labels from the ADReCS database. CTGAN has been evaluated for data balancing, showing a 2% to 5% performance improvement over SMOTE. The BNE architecture has outperformed six state-of-the-art methods by achieving mean ROC-AUC scores between 77.00% and 90.00% for chemical data, 66.00% and 74.00% for biological data, and 70.00% to 75.00% for combined datasets. 
Further, the top 20 contributory features in prediction corresponding to the different drug properties have been identified.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104832"},"PeriodicalIF":4.0,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143891351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas A. Lasko , William W. Stead , John M. Still , Thomas Z. Li , Michael Kammer , Marco Barbero-Mota , Eric V. Strobl , Bennett A. Landman , Fabien Maldonado
{"title":"Unsupervised discovery of clinical disease signatures using probabilistic independence","authors":"Thomas A. Lasko , William W. Stead , John M. Still , Thomas Z. Li , Michael Kammer , Marco Barbero-Mota , Eric V. Strobl , Bennett A. Landman , Fabien Maldonado","doi":"10.1016/j.jbi.2025.104837","DOIUrl":"10.1016/j.jbi.2025.104837","url":null,"abstract":"<div><h3>Objective</h3><div>This study uses probabilistic independence to disentangle patient-specific sources of disease and their signatures in Electronic Health Record (EHR) data.</div></div><div><h3>Materials and Methods</h3><div>We model a disease source as an unobserved root node in the causal graph of observed EHR variables (laboratory test results, medication exposures, billing codes, and demographics), and a signature as the set of downstream effects that a given source has on those observed variables. We used probabilistic independence to infer 2000 sources and their signatures from 9195 variables in <span><math><mrow><mn>630</mn><mo>,</mo><mn>000</mn></mrow></math></span> cross-sectional training instances sampled at random times from 269,099 longitudinal patient records. We evaluated the learned sources by using them to infer and explain the causes of benign vs. malignant pulmonary nodules in 13,252 records, comparing the inferred causes to an external reference list and other medical literature. We compared models trained by three different algorithms and used corresponding models trained directly from the observed variables as baselines.</div></div><div><h3>Results</h3><div>The model recovered 92% of malignant and 30% of benign causes in the reference standard. Of the top 20 inferred causes of malignancy, 14 were not listed in the reference standard, but had supporting evidence in the literature, as did 11 of the top 20 inferred causes of benign nodules. 
The model decomposed listed malignant causes by an average factor of 5.5 and benign causes by 4.1, with most stratifying by disease course or treatment regimen. Predictive accuracy of causal predictive models trained on source expressions (Random Forest AUC 0.788) was similar to (p = 0.058) their associational baselines (0.738).</div></div><div><h3>Discussion</h3><div>Most of the unrecovered causes were due to the rarity of the condition or lack of sufficient detail in the input data. Surprisingly, the causal model found many patients with apparently undiagnosed cancer as the source of the malignant nodules. Causal model AUC also suggests that some sources remained undiscovered in this cohort.</div></div><div><h3>Conclusion</h3><div>These promising results demonstrate the potential of using probabilistic independence to disentangle complex clinical signatures from noisy, asynchronous, and incomplete EHR data that represent the confluence of multiple simultaneous conditions, and to identify patient-specific causes that support precise treatment decisions.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104837"},"PeriodicalIF":4.0,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143894732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yaniv Alon , Etti Naimi , Chedva Levin , Hila Videl , Mor Saban
{"title":"Leveraging natural language processing to elucidate real-world clinical decision-making paradigms: A proof of concept study","authors":"Yaniv Alon , Etti Naimi , Chedva Levin , Hila Videl , Mor Saban","doi":"10.1016/j.jbi.2025.104829","DOIUrl":"10.1016/j.jbi.2025.104829","url":null,"abstract":"<div><h3>Background</h3><div>Understanding how clinicians arrive at decisions in actual practice settings is vital for advancing personalized, evidence-based care. However, systematic analysis of qualitative decision data poses challenges.</div></div><div><h3>Methods</h3><div>We analyzed transcribed interviews with Hebrew-speaking clinicians on decision processes using natural language processing (NLP). Word frequency analysis characterized terminology use, while large language models (ChatGPT from OpenAI and Gemini by Google) identified potential cognitive paradigms.</div></div><div><h3>Results</h3><div>Word frequency analysis of clinician interviews identified experience and knowledge as most influential on decision-making. NLP tentatively recognized heuristics-based reasoning grounded in past cases and intuition as dominant cognitive paradigms. Elements of shared decision-making through individualizing care with patients and families were also observed. Limited Hebrew clinical language resources required developing preliminary lexicons and dynamically adjusting stopwords. Findings also provided preliminary support for heuristics guiding clinical judgment while highlighting needs for broader sampling and enhanced analytical frameworks.</div></div><div><h3>Conclusions</h3><div>This study represents the first use of integrated qualitative and computational methods to systematically elucidate clinical decision-making. Findings supported experience-based heuristics guiding cognition. With methodological enhancements, similar analyses could transform global understanding of tailored care delivery. 
Standardizing interdisciplinary collaborations on developing NLP tools and analytical frameworks may advance equitable, evidence-based healthcare by elucidating real-world clinical reasoning processes across diverse populations and settings.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104829"},"PeriodicalIF":4.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143869304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Louis Adedapo Gomez , Jan Claassen , Samantha Kleinberg
{"title":"Causal inference for time series datasets with partially overlapping variables","authors":"Louis Adedapo Gomez , Jan Claassen , Samantha Kleinberg","doi":"10.1016/j.jbi.2025.104828","DOIUrl":"10.1016/j.jbi.2025.104828","url":null,"abstract":"<div><h3>Objective:</h3><div>Healthcare data provides a unique opportunity to learn causal relationships but the largest datasets, such as from hospitals or intensive care units, are often observational and do not standardize variables collected for all patients. Rather, the variables depend on a patient’s health status, treatment plan, and differences between providers. This poses major challenges for causal inference, which either must restrict analysis to patients with complete data (reducing power) or learn patient-specific models (making it difficult to generalize). While missing variables can lead to confounding, variables absent for one individual are often measured in another.</div></div><div><h3>Methods:</h3><div>We propose a novel method, called Causal Model Combination for Time Series (CMC-TS), to learn causal relationships from time series with partially overlapping variable sets. CMC-TS overcomes errors by specifically leveraging partial overlap between datasets (e.g., patients) to iteratively reconstruct missing variables and correct errors by reweighting inferences using shared information across datasets. We evaluated CMC-TS and compared it to the state of the art on both simulated data and real-world data from stroke patients admitted to a neurological intensive care unit.</div></div><div><h3>Results:</h3><div>On simulated data, CMC-TS had the fewest false discoveries and highest F1-score compared to baselines. 
On real data from stroke patients in a neurological intensive care unit, we found fewer implausible and more highly ranked plausible causes of a clinically important adverse event.</div></div><div><h3>Conclusion:</h3><div>Our approach may lead to better use of observational healthcare data for causal inference, by enabling causal inference from patient data with partially overlapping variable sets.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104828"},"PeriodicalIF":4.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143869243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jennifer Cooper , Thomas Jackson , Shamil Haroon , Francesca L. Crowe , Eleanor Hathaway , Leah Fitzsimmons , Krishnarajah Nirantharakumar
{"title":"Defining phenotypes of disease severity for long-term cardiovascular, renal, metabolic, and mental health conditions in primary care electronic health records: A mixed-methods study using the nominal group technique","authors":"Jennifer Cooper , Thomas Jackson , Shamil Haroon , Francesca L. Crowe , Eleanor Hathaway , Leah Fitzsimmons , Krishnarajah Nirantharakumar","doi":"10.1016/j.jbi.2025.104831","DOIUrl":"10.1016/j.jbi.2025.104831","url":null,"abstract":"<div><h3>Objective</h3><div>Inclusion of severity measures for long-term conditions (LTC) could improve prediction models for multiple long-term conditions (MLTC) but some severity measures have limited availability in electronic health records (EHR). We aimed to develop consensus on feasible severity phenotypes for nine cardio-renal-metabolic and mental health conditions.</div></div><div><h3>Methods</h3><div>This was a mixed-methods study using novel methodology. From existing literature, we identified potential severity phenotypes and explored feasibility of their use in EHR through analysis of data from 31 randomly selected general practices in the Clinical Practice Research Datalink (CPRD) Aurum database, a large UK-based primary care EHR database. We recruited clinical academic experts to participate in a survey and nominal group technique workshop. Participants used a Likert scale to rate clinical importance and feasibility for each severity phenotype independently (informed by the exploratory analysis). For the optimal severity phenotype (highest combined score) for each condition, adjusted hazard ratios (aHR) of five-year mortality were calculated using Cox regression on the full CPRD database.</div></div><div><h3>Results</h3><div>Fifteen existing severity indexes for nine conditions informed the survey. Eighteen clinical academics participated in the survey, twelve also participated in the workshops. 
Combined mean scores for clinical importance and feasibility were highest for estimated glomerular filtration rate (eGFR) for chronic kidney disease (CKD) (9.42/10) and for microvascular complications of diabetes (9.08/10). Mortality was higher for each reduction in eGFR stage; Stage 3b aHR 1.42, 95 %CI 1.41–1.44 versus Stage 3a CKD and for each additional microvascular complication of diabetes; one complication aHR 1.44, 95 %CI 1.32–1.57 versus none. Some phenotypes (e.g., aneurysm diameter) were not well recorded within the database and could not feasibly be applied.</div></div><div><h3>Conclusion</h3><div>We developed a methodology for identifying severity phenotypes in EHRs. Severity phenotypes were identified for diabetes (type 1 and 2), ischaemic heart disease, CKD and peripheral vascular disease. Data quality in EHR should be improved for under-recorded severity measures.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104831"},"PeriodicalIF":4.0,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143877475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Changlong Wang , You Zhou , Yuanshu Li , Wei Pang , Liupu Wang , Wei Du , Hui Yang , Ying Jin
{"title":"ICPPNet: A semantic segmentation network model based on inter-class positional prior for scoliosis reconstruction in ultrasound images","authors":"Changlong Wang , You Zhou , Yuanshu Li , Wei Pang , Liupu Wang , Wei Du , Hui Yang , Ying Jin","doi":"10.1016/j.jbi.2025.104827","DOIUrl":"10.1016/j.jbi.2025.104827","url":null,"abstract":"<div><h3>Objective:</h3><div>Considering the radiation hazard of X-ray, safer, more convenient and cost-effective ultrasound methods are gradually becoming new diagnostic approaches for scoliosis. In ultrasound images, accurately identifying spine regions is challenging because the target areas are relatively small and there is a lot of interfering information. Therefore, we developed a novel neural network that incorporates prior knowledge to precisely segment spine regions in ultrasound images.</div></div><div><h3>Materials and methods:</h3><div>We constructed a dataset of ultrasound images of spine regions for semantic segmentation. The dataset contains 3136 images of 30 patients with scoliosis. We propose a network model (ICPPNet) that fully utilizes inter-class positional prior knowledge by combining an inter-class positional probability heatmap to achieve accurate segmentation of target areas.</div></div><div><h3>Results:</h3><div>ICPPNet achieved an average Dice similarity coefficient of 70.83% and an average 95% Hausdorff distance of 11.28 mm on the dataset, demonstrating its excellent performance. The average error between the Cobb angle measured by our method and the Cobb angle measured by X-ray images is 1.41 degrees, and the coefficient of determination is 0.9879, indicating a strong correlation.</div></div><div><h3>Discussion and conclusion:</h3><div>ICPPNet provides a new solution for the medical image segmentation task with positional prior knowledge between target classes. 
ICPPNet also strongly supports the subsequent reconstruction of spine models using ultrasound images.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104827"},"PeriodicalIF":4.0,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143874964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abel Corrêa Dias, Viviane Pereira Moreira, João Luiz Dihl Comba
{"title":"RoBIn: A Transformer-based model for risk of bias inference with machine reading comprehension","authors":"Abel Corrêa Dias, Viviane Pereira Moreira, João Luiz Dihl Comba","doi":"10.1016/j.jbi.2025.104819","DOIUrl":"10.1016/j.jbi.2025.104819","url":null,"abstract":"<div><h3>Objective:</h3><div>Scientific publications are essential for uncovering insights, testing new drugs, and informing healthcare policies. Evaluating the quality of these publications often involves assessing their Risk of Bias (RoB), a task traditionally performed by human reviewers. The goal of this work is to create a dataset and develop models that allow automated RoB assessment in clinical trials.</div></div><div><h3>Methods:</h3><div>We use data from the Cochrane Database of Systematic Reviews (CDSR) as ground truth to label open-access clinical trial publications from PubMed. This process enabled us to develop training and test datasets specifically for machine reading comprehension and RoB inference. Additionally, we created extractive (RoBIn<sup>Ext</sup>) and generative (RoBIn<sup>Gen</sup>) Transformer-based approaches to extract relevant evidence and classify the RoB effectively.</div></div><div><h3>Results:</h3><div>RoBIn was evaluated across various settings and benchmarked against state-of-the-art methods, including large language models (LLMs). In most cases, the best-performing RoBIn variant surpasses traditional machine learning and LLM-based approaches, achieving an AUROC of 0.83.</div></div><div><h3>Conclusion:</h3><div>This work addresses RoB assessment in clinical trials by introducing RoBIn, two Transformer-based models for RoB inference and evidence retrieval, which outperform traditional models and LLMs, demonstrating its potential to improve efficiency and scalability in clinical research evaluation. 
We also introduce a public dataset that is automatically annotated and can be used to enable future research to enhance automated RoB assessment.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104819"},"PeriodicalIF":4.0,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143843115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fangwen Zhou , Rick Parrish , Muhammad Afzal , Ashirbani Saha , R. Brian Haynes , Alfonso Iorio , Cynthia Lokker
{"title":"Benchmarking domain-specific pretrained language models to identify the best model for methodological rigor in clinical studies","authors":"Fangwen Zhou , Rick Parrish , Muhammad Afzal , Ashirbani Saha , R. Brian Haynes , Alfonso Iorio , Cynthia Lokker","doi":"10.1016/j.jbi.2025.104825","DOIUrl":"10.1016/j.jbi.2025.104825","url":null,"abstract":"<div><h3>Objective</h3><div>Encoder-only transformer-based language models have shown promise in automating critical appraisal of clinical literature. However, a comprehensive evaluation of the models for classifying the methodological rigor of randomized controlled trials is necessary to identify the more robust ones. This study benchmarks several state-of-the-art transformer-based language models using a diverse set of performance metrics.</div></div><div><h3>Methods</h3><div>Seven transformer-based language models were fine-tuned on the title and abstract of 42,575 articles from 2003 to 2023 in McMaster University’s Premium LiteratUre Service database under different configurations. The studies reported in the articles addressed questions related to treatment, prevention, or quality improvement for which randomized controlled trials are the gold standard with defined criteria for rigorous methods. Models were evaluated on the validation set using 12 schemes and metrics, including optimization for cross-entropy loss, Brier score, AUROC, average precision, sensitivity, specificity, and accuracy, among others. Threshold tuning was performed to optimize threshold-dependent metrics. Models that achieved the best performance in one or more schemes on the validation set were further tested in hold-out and external datasets.</div></div><div><h3>Results</h3><div>A total of 210 models were fine-tuned. Six models achieved top performance in one or more evaluation schemes. Three BioLinkBERT models outperformed others on 8 of the 12 schemes. BioBERT, BiomedBERT, and SciBERT were best on 1, 1 and 2 schemes, respectively. 
While model performance remained robust on the hold-out test set, it declined in external datasets. Class weight adjustments improved performance in most instances.</div></div><div><h3>Conclusion</h3><div>BioLinkBERT generally outperformed the other models. Using comprehensive evaluation metrics and threshold tuning optimizes model selection for real-world applications. Future work should assess generalizability to other datasets, explore alternate imbalance strategies, and examine training on full-text articles.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104825"},"PeriodicalIF":4.0,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143843116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Silvia Cascianelli, Iva Milojkovic, Marco Masseroli
{"title":"A novel machine learning-based workflow to capture intra-patient heterogeneity through transcriptional multi-label characterization and clinically relevant classification","authors":"Silvia Cascianelli, Iva Milojkovic, Marco Masseroli","doi":"10.1016/j.jbi.2025.104817","DOIUrl":"10.1016/j.jbi.2025.104817","url":null,"abstract":"<div><h3>Objectives:</h3><div>Patient classification into specific molecular subtypes is paramount in biomedical research and clinical practice to face complex, heterogeneous diseases. Existing methods, especially for gene expression-based cancer subtyping, often simplify patient molecular portraits, neglecting the potential co-occurrence of traits from multiple subtypes. Yet, recognizing intra-sample heterogeneity is essential for more precise patient characterization and improved personalized treatments.</div></div><div><h3>Methods:</h3><div>We developed a novel computational workflow, named MULTI-STAR, which addresses current limitations and provides tailored solutions for reliable multi-label patient subtyping. MULTI-STAR uses state-of-the-art subtyping methods to obtain promising machine learning-based multi-label classifiers, leveraging gene expression profiles. It modifies standard single-label similarity-based techniques to obtain multi-label patient characterizations. Then, it employs these characterizations to train single-sample predictors using different multi-label strategies and find the best-performing classifiers.</div></div><div><h3>Results:</h3><div>MULTI-STAR classifiers offer advanced multi-label recognition of all the subtypes contributing to the molecular and clinical traits of a patient, also distinguishing the primary from the additional relevant secondary subtype(s). 
The efficacy was demonstrated by developing multi-label solutions for breast and colorectal cancer subtyping that outperform existing methods in terms of prognostic value, primarily for overall survival predictions, and ability to work on a single sample at a time, as required in clinical practice.</div></div><div><h3>Conclusions:</h3><div>This work emphasizes the importance of moving to multi-label subtyping to capture all the molecular traits of individual patients, considering also previously overlooked secondary assignments and paving the way for improved clinical decision-making processes in diverse heterogeneous disease contexts. Indeed, the novel, reproducible and generalizable MULTI-STAR approach provides comprehensive representations of patient inner heterogeneity and clinically relevant insights, contributing to precision medicine and personalized treatments.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104817"},"PeriodicalIF":4.0,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143816805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Javier Petri , Pilar Barcena Barbeira , Martina Pesce , Verónica Xhardez , Rodrigo Laje , Viviana Cotik
{"title":"Low-cost algorithms for clinical notes phenotype classification to enhance epidemiological surveillance: A case study","authors":"Javier Petri , Pilar Barcena Barbeira , Martina Pesce , Verónica Xhardez , Rodrigo Laje , Viviana Cotik","doi":"10.1016/j.jbi.2025.104795","DOIUrl":"10.1016/j.jbi.2025.104795","url":null,"abstract":"<div><h3>Objective:</h3><div>Our study aims to enhance epidemic intelligence through event-based surveillance in an emerging pandemic context. We classified electronic health records (EHRs) from La Rioja, Argentina, focusing on predicting COVID-19-related categories in a scenario with limited disease knowledge, evolving symptoms, non-standardized coding practices, and restricted training data due to privacy issues.</div></div><div><h3>Methods:</h3><div>Using natural language processing techniques, we developed rapid, cost-effective methods suitable for implementation with limited resources. We annotated a corpus for training and testing classification models, ranging from simple logistic regression to more complex fine-tuned transformers.</div></div><div><h3>Results:</h3><div>The transformer-based, Spanish-adapted models BETO Clínico and RoBERTa Clínico, further pre-trained with an unannotated portion of our corpus, were the best-performing models (F1=88.13% and 87.01%). A simple logistic regression (LR) model ranked third (F1=85.09%), outperforming more complex models like XGBoost and BiLSTM. Data classified as COVID-confirmed using LR and BETO Clínico exhibit stronger time-series Pearson correlation with official COVID-19 case counts from the National Health Surveillance System (SNVS 2.0) in La Rioja province compared to the correlations observed between the International Classification of Diseases (ICD-10) codes and the SNVS 2.0 data (0.840, 0.873, and 0.663, p-values <span><math><mrow><mo>≤</mo><mn>3</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>7</mn></mrow></msup></mrow></math></span>). 
Both models have a good Pearson correlation with ICD-10 codes assigned to the clinical notes for confirmed (0.940 and 0.902) and for suspected cases (0.960 and 0.954), p-values <span><math><mrow><mo>≤</mo><mn>1</mn><mo>.</mo><mn>7</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>18</mn></mrow></msup></mrow></math></span>.</div></div><div><h3>Conclusion:</h3><div>This study shows that simple, resource-efficient methods can achieve results comparable to complex approaches. BETO Clínico and LR strongly correlate with official data, revealing uncoded confirmed cases at the pandemic’s onset. Our results suggest that annotating a smaller set of EHRs and training a simple model may be more cost-effective than manual coding. This points to potentially efficient strategies in public health emergencies, particularly in resource-limited settings, and provides valuable insights for future epidemic response efforts.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104795"},"PeriodicalIF":4.0,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143833466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}