{"title":"Identifying cooperating cancer driver genes in individual patients through hypergraph random walk","authors":"Tong Zhang , Shao-Wu Zhang , Ming-Yu Xie , Yan Li","doi":"10.1016/j.jbi.2024.104710","DOIUrl":"10.1016/j.jbi.2024.104710","url":null,"abstract":"<div><h3>Objective</h3><p>Identifying cancer driver genes, especially rare or patient-specific cancer driver genes, is a primary goal in cancer therapy. Although researchers have proposed some methods to tackle this problem, these methods mostly identify cancer driver genes at single gene level, overlooking the cooperative relationship among cancer driver genes. Identifying cooperating cancer driver genes in individual patients is pivotal for understanding cancer etiology and advancing the development of personalized therapies.</p></div><div><h3>Methods</h3><p>Here, we propose a novel Personalized Cooperating cancer Driver Genes (PCoDG) method by using hypergraph random walk to identify the cancer driver genes that cooperatively drive individual patient cancer progression. By leveraging the powerful ability of hypergraph in representing multi-way relationships, PCoDG first employs the personalized hypergraph to depict the complex interactions among mutated genes and differentially expressed genes of an individual patient. Then, a hypergraph random walk algorithm based on hyperedge similarity is utilized to calculate the importance scores of mutated genes, integrating these scores with signaling pathway data to identify the cooperating cancer driver genes in individual patients.</p></div><div><h3>Results</h3><p>The experimental results on three TCGA cancer datasets (i.e., BRCA, LUAD, and COADREAD) demonstrate the effectiveness of PCoDG in identifying personalized cooperating cancer driver genes. 
These genes identified by PCoDG not only offer valuable insights into patient stratification correlating with clinical outcomes, but also provide a useful reference resource for tailoring personalized treatments.</p></div><div><h3>Conclusion</h3><p>We propose a novel method that can effectively identify cooperating cancer driver genes for individual patients, thereby deepening our understanding of the cooperative relationship among personalized cancer driver genes and advancing the development of precision oncology.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104710"},"PeriodicalIF":4.0,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142004329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lara J. Kanbar , Anagh Mishra , Alexander Osborn , Andrew Cifuentes , Jennifer Combs , Michael Sorter , Drew Barzman , Judith W. Dexheimer
{"title":"Investigation of bias in the automated assessment of school violence","authors":"Lara J. Kanbar , Anagh Mishra , Alexander Osborn , Andrew Cifuentes , Jennifer Combs , Michael Sorter , Drew Barzman , Judith W. Dexheimer","doi":"10.1016/j.jbi.2024.104709","DOIUrl":"10.1016/j.jbi.2024.104709","url":null,"abstract":"<div><h3>Objectives</h3><p>Natural language processing and machine learning have the potential to lead to biased predictions. We designed a novel Automated RIsk Assessment (ARIA) machine learning algorithm that assesses risk of violence and aggression in adolescents using natural language processing of transcribed student interviews. This work evaluated the possible sources of bias in the study design and the algorithm, tested how much of a prediction was explained by demographic covariates, and investigated the misclassifications based on demographic variables.</p></div><div><h3>Methods</h3><p>We recruited students 10–18 years of age who were enrolled in middle or high schools in Ohio, Kentucky, Indiana, and Tennessee. The reference standard outcome was determined by a forensic psychiatrist as either a “high” or “low” risk level. ARIA used L2-regularized logistic regression to predict a risk level for each student using contextual and semantic features. We conducted three analyses: a PROBAST analysis of risk in study design; analysis of demographic variables as covariates; and a prediction analysis. Covariates were included in the linear regression analyses and comprised race, sex, ethnicity, household education, annual household income, age at the time of visit, and utilization of public assistance.</p></div><div><h3>Results</h3><p>We recruited 412 students from 204 schools. ARIA performed with an AUC of 0.92, sensitivity of 71%, NPV of 77%, and specificity of 95%. Of these, 387 students with complete demographic information were included in the analysis. 
Individual linear regressions resulted in a coefficient of determination less than 0.08 across all demographic variables. When using all demographic variables to predict ARIA’s risk assessment score, the multiple linear regression model resulted in a coefficient of determination of 0.189. ARIA performed with a lower False Negative Rate (FNR) of 15.2% (CI [0 – 40]) for the Black subgroup and 12.7% (CI [0 – 41.4]) for Other races, compared to an FNR of 26.1% (CI [14.1 – 41.8]) in the White subgroup.</p></div><div><h3>Conclusions</h3><p>Bias assessment is needed to address shortcomings within machine learning. In our work, student race, ethnicity, sex, use of public assistance, and annual household income did not explain ARIA’s risk assessment score of students. ARIA will continue to be evaluated regularly with increased subject recruitment.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104709"},"PeriodicalIF":4.0,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141995770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Majid Afshar , Yanjun Gao , Deepak Gupta , Emma Croxford , Dina Demner-Fushman
{"title":"On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models","authors":"Majid Afshar , Yanjun Gao , Deepak Gupta , Emma Croxford , Dina Demner-Fushman","doi":"10.1016/j.jbi.2024.104707","DOIUrl":"10.1016/j.jbi.2024.104707","url":null,"abstract":"<div><h3>Objective:</h3><p>Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) to supplant traditional systems poses questions of the quality and extent of the medical knowledge in the models’ internal knowledge representations and the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing the UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlations between human judgments and the UMLS-based metrics for generations by LLMs.</p></div><div><h3>Methods:</h3><p>We evaluated diagnoses generated by LLMs from consumer health questions and daily care notes in the electronic health records using the ConsumerQA and Problem Summarization datasets. Probing LLMs for the UMLS knowledge was performed by prompting the LLM to complete the diagnosis-related UMLS knowledge paths. Grounding the predictions was examined in an approach that integrated the UMLS graph paths and clinical notes in prompting the LLMs. The results were compared to prompting without the UMLS paths. The final experiments examined the alignment of different evaluation metrics, UMLS-based and non-UMLS, with human expert evaluation.</p></div><div><h3>Results:</h3><p>In probing the UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. 
Grounding diagnosis predictions with the UMLS paths improved the results for both models on both tasks, with the highest improvement (4%) in SapBERT score. There was a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments.</p></div><div><h3>Conclusion:</h3><p>We found that while popular LLMs contain some medical knowledge in their internal representations, augmentation with the UMLS knowledge provides performance gains around diagnosis generation. The UMLS needs to be tailored for the task to improve the LLMs’ predictions. Finding evaluation metrics that are aligned with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104707"},"PeriodicalIF":4.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141982357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuqing Lei , Adam Christian Naj , Hua Xu , Ruowang Li , Yong Chen
{"title":"Balancing the efforts of chart review and gains in PRS prediction accuracy: An empirical study","authors":"Yuqing Lei , Adam Christian Naj , Hua Xu , Ruowang Li , Yong Chen","doi":"10.1016/j.jbi.2024.104705","DOIUrl":"10.1016/j.jbi.2024.104705","url":null,"abstract":"<div><h3>Objective</h3><p>Phenotypic misclassification in genetic association analyses can impact the accuracy of PRS-based prediction models. The bias reduction method proposed by Tong et al. (2019) has demonstrated its efficacy in reducing the effects of bias on the estimation of association parameters between genotype and phenotype while minimizing variance by employing chart reviews on a subset of the data for validating phenotypes; however, its improvement of subsequent PRS prediction accuracy remains unclear. Our study aims to fill this gap by assessing the performance of simulated PRS models and estimating the optimal number of chart reviews needed for validation.</p></div><div><h3>Methods</h3><p>To comprehensively assess the efficacy of the bias reduction method proposed by Tong et al. in enhancing the accuracy of PRS-based prediction models, we simulated each phenotype under different correlation structures (an independent model, a weakly correlated model, a strongly correlated model) and introduced error-prone phenotypes using two distinct error mechanisms (differential and non-differential phenotyping errors). To facilitate this, we used genotype and phenotype data from 12 case-control datasets in the Alzheimer’s Disease Genetics Consortium (ADGC) to produce simulated phenotypes. The evaluation included analyses across various misclassification rates of original phenotypes as well as validation set sizes. 
Additionally, we determined the median threshold, identifying the minimal validation size required for a meaningful improvement in the accuracy of PRS-based predictions across a broad spectrum.</p></div><div><h3>Results</h3><p>This simulation study demonstrated that incorporating chart review does not universally guarantee enhanced performance of PRS-based prediction models. Specifically, in scenarios with minimal misclassification rates and limited validation sizes, PRS models utilizing debiased regression coefficients demonstrated inferior predictive capabilities compared to models using error-prone phenotypes. Put differently, the effectiveness of the bias reduction method is contingent upon the misclassification rates of phenotypes and the size of the validation set employed during chart reviews. Notably, when dealing with datasets featuring higher misclassification rates, the advantages of utilizing this bias reduction method become more evident, requiring a smaller validation set to achieve better performance.</p></div><div><h3>Conclusion</h3><p>This study highlights the importance of choosing an appropriate validation set size to balance between the efforts of chart review and the gain in PRS prediction accuracy. 
Consequently, our study provides valuable guidance for validation planning across a diverse array of sensitivity and specificity combinations.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104705"},"PeriodicalIF":4.0,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141971201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Boran Hao , Yang Hu , William G. Adams , Sabrina A. Assoumou , Heather E. Hsu , Nahid Bhadelia , Ioannis Ch. Paschalidis
{"title":"A GPT-based EHR modeling system for unsupervised novel disease detection","authors":"Boran Hao , Yang Hu , William G. Adams , Sabrina A. Assoumou , Heather E. Hsu , Nahid Bhadelia , Ioannis Ch. Paschalidis","doi":"10.1016/j.jbi.2024.104706","DOIUrl":"10.1016/j.jbi.2024.104706","url":null,"abstract":"<div><h3>Objective</h3><p>To develop an <em>Artificial Intelligence (AI)</em>-based anomaly detection model as a complement of an “astute physician” in detecting novel disease cases in a hospital and preventing emerging outbreaks<em>.</em></p></div><div><h3>Methods</h3><p>Data included hospitalized patients (n = 120,714) at a safety-net hospital in Massachusetts. A novel <em>Generative Pre-trained Transformer (GPT)</em>-based clinical anomaly detection system was designed and further trained using <em>Empirical Risk Minimization (ERM)</em>, which can model a hospitalized patient’s <em>Electronic Health Records (EHR)</em> and detect atypical patients. Methods and performance metrics, similar to the ones behind the recent <em>Large Language Models (LLMs)</em>, were leveraged to capture the dynamic evolution of the patient’s clinical variables and compute an <em>Out-Of-Distribution (OOD)</em> anomaly score.</p></div><div><h3>Results</h3><p>In a completely unsupervised setting, hospitalizations for <em>Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)</em> infection could have been predicted by our GPT model at the beginning of the COVID-19 pandemic, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 92.2 %, using 31 extracted clinical variables and a 3-day detection window. Our GPT achieves individual patient-level anomaly detection and mortality prediction AUC of 78.3 % and 94.7 %, outperforming traditional linear models by 6.6 % and 9 %, respectively. 
Different types of clinical trajectories of a SARS-CoV-2 infection are captured by our model to make interpretable detections, while a trend of over-pessimistic outcome prediction yields a more effective detection pathway. Furthermore, our comprehensive GPT model can potentially assist clinicians with forecasting patient clinical variables and developing personalized treatment plans.</p></div><div><h3>Conclusion</h3><p>This study demonstrates that an emerging outbreak can be accurately detected within a hospital, by using a GPT to model patient EHR time sequences and labeling them as anomalous when actual outcomes are not supported by the model. Such a GPT is also a comprehensive model with the functionality of generating future patient clinical variables, which can potentially assist clinicians in developing personalized treatment plans.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104706"},"PeriodicalIF":4.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141912806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rodrigo Bonacin , Elaine Barbosa de Figueiredo , Ferrucio de Franco Rosa , Julio Cesar dos Reis , Mariangela Dametto
{"title":"The reuse of electronic health records information models in the oncology domain: Studies with the bioframe framework","authors":"Rodrigo Bonacin , Elaine Barbosa de Figueiredo , Ferrucio de Franco Rosa , Julio Cesar dos Reis , Mariangela Dametto","doi":"10.1016/j.jbi.2024.104704","DOIUrl":"10.1016/j.jbi.2024.104704","url":null,"abstract":"<div><h3>Objective:</h3><p>The reuse of Electronic Health Records (EHR) information models (<em>e.g.</em>, templates and archetypes) may bring various benefits, including higher standardization, integration, interoperability, increased productivity in developing EHR systems, and the unlocking of potential Artificial Intelligence applications built on top of medical records. The literature presents recent advances in standards for modeling EHR, in Knowledge Organization Systems (KOS) and EHR data reuse. However, methods, development processes, and frameworks to improve the reuse of EHR information models are still scarce. This study proposes a software engineering framework, named BioFrame, and analyzes how the reuse of EHR information models can be improved during the development of EHR systems.</p></div><div><h3>Methods:</h3><p>EHR standards and KOS, including ontologies, identified from systematic reviews were considered in developing the BioFrame framework. We used the structure of the OpenEHR to model templates and archetypes, as well as its relationship to international KOS used in the oncology domain. Our framework was applied in the context of pediatric oncology. Three data entry forms concerning nutrition and one utilized during the first pediatric oncology consultations were analyzed to measure the reuse of information models.</p></div><div><h3>Results:</h3><p>The adherence rate to international KOS increased by 18% relative to the original forms. 
There was an increase in the concepts reused in all 12 scenarios analyzed, with an average reuse of 6.55% in the original forms compared to 17.1% using BioFrame, resulting in significant differences.</p></div><div><h3>Conclusions:</h3><p>Our results point to higher reuse rates achieved due to an engineering process that provided greater adherence to EHR standards combined with semantic artifacts. This reveals the potential to develop new methods and frameworks aimed at EHR information model reuse. Additional research is needed to evaluate the impacts of the reuse of the EHR information model on interoperability, EHR data reuse, and data quality and assess the proposed framework in other health domains.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"157 ","pages":"Article 104704"},"PeriodicalIF":4.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141912807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"fmi-ii: Table of Contents","authors":"","doi":"10.1016/S1532-0464(24)00116-3","DOIUrl":"10.1016/S1532-0464(24)00116-3","url":null,"abstract":"","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"156 ","pages":"Article 104698"},"PeriodicalIF":4.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1532046424001163/pdfft?md5=76710ccc769127af9cdd07b59ff00b67&pid=1-s2.0-S1532046424001163-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141960189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}