{"title":"Large language models are less effective at clinical prediction tasks than locally trained machine learning models.","authors":"Katherine E Brown, Chao Yan, Zhuohang Li, Xinmeng Zhang, Benjamin X Collins, You Chen, Ellen Wright Clayton, Murat Kantarcioglu, Yevgeniy Vorobeychik, Bradley A Malin","doi":"10.1093/jamia/ocaf038","DOIUrl":"10.1093/jamia/ocaf038","url":null,"abstract":"<p><strong>Objectives: </strong>To determine the extent to which current large language models (LLMs) can serve as substitutes for traditional machine learning (ML) as clinical predictors using data from electronic health records (EHRs), we investigated various factors that can impact their adoption, including overall performance, calibration, fairness, and resilience to privacy protections that reduce data fidelity.</p><p><strong>Materials and methods: </strong>We evaluated GPT-3.5, GPT-4, and traditional ML (as gradient-boosting trees) on clinical prediction tasks in EHR data from Vanderbilt University Medical Center (VUMC) and MIMIC IV. We measured predictive performance with area under the receiver operating characteristic (AUROC) and model calibration using Brier Score. To evaluate the impact of data privacy protections, we assessed AUROC when demographic variables are generalized. We evaluated algorithmic fairness using equalized odds and statistical parity across race, sex, and age of patients. 
We also considered the impact of using in-context learning by incorporating labeled examples within the prompt.</p><p><strong>Results: </strong>Traditional ML [AUROC: 0.847, 0.894 (VUMC, MIMIC)] substantially outperformed GPT-3.5 (AUROC: 0.537, 0.517) and GPT-4 (AUROC: 0.629, 0.602) (with and without in-context learning) in predictive performance and output probability calibration [Brier Score (ML vs GPT-3.5 vs GPT-4): 0.134 vs 0.384 vs 0.251, 0.042 vs 0.06 vs 0.219].</p><p><strong>Discussion: </strong>Traditional ML is more robust than GPT-3.5 and GPT-4 in generalizing demographic information to protect privacy. GPT-4 is the fairest model according to our selected metrics but at the cost of poor model performance.</p><p><strong>Conclusion: </strong>These findings suggest that non-fine-tuned LLMs are less effective and robust than locally trained ML for clinical prediction tasks, but they are improving across releases.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143582390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigation of outcome conflation in predicting patient outcomes using electronic health records.","authors":"S Momsen Reincke, Camilo Espinosa, Philip Chung, Tomin James, Eloïse Berson, Nima Aghaeepour","doi":"10.1093/jamia/ocaf033","DOIUrl":"10.1093/jamia/ocaf033","url":null,"abstract":"<p><strong>Objectives: </strong>Artificial intelligence (AI) models utilizing electronic health record data for disease prediction can enhance risk stratification but may lack specificity, which is crucial for reducing the economic and psychological burdens associated with false positives. This study aims to evaluate the impact of confounders on the specificity of single-outcome prediction models and assess the effectiveness of a multi-class architecture in mitigating outcome conflation.</p><p><strong>Materials and methods: </strong>We evaluated a state-of-the-art model predicting pancreatic cancer from disease code sequences in an independent cohort of 2.3 million patients and compared this single-outcome model with a multi-class model designed to predict multiple cancer types simultaneously. Additionally, we conducted a clinical simulation experiment to investigate the impact of confounders on the specificity of single-outcome prediction models.</p><p><strong>Results: </strong>While we were able to independently validate the pancreatic cancer prediction model, we found that its prediction scores were also correlated with ovarian cancer, suggesting conflation of outcomes due to underlying confounders. Building on this observation, we demonstrate that the specificity of single-outcome prediction models is impaired by confounders using a clinical simulation experiment. 
Introducing a multi-class architecture improves specificity in predicting cancer types compared to the single-outcome model while preserving performance, mitigating the conflation of outcomes in both the real-world and simulated contexts.</p><p><strong>Discussion: </strong>Our results highlight the risk of outcome conflation in single-outcome AI prediction models and demonstrate the effectiveness of a multi-class approach in mitigating this issue.</p><p><strong>Conclusion: </strong>The number of predicted outcomes needs to be carefully considered when employing AI disease risk prediction models.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143582391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to: Development and evaluation of a training curriculum to engage researchers on accessing and analyzing the All of Us data.","authors":"","doi":"10.1093/jamia/ocaf044","DOIUrl":"https://doi.org/10.1093/jamia/ocaf044","url":null,"abstract":"","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143574590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development and validation of a multi-stage self-supervised learning model for optical coherence tomography image classification.","authors":"Sungho Shim, Min-Soo Kim, Che Gyem Yae, Yong Koo Kang, Jae Rock Do, Hong Kyun Kim, Hyun-Lim Yang","doi":"10.1093/jamia/ocaf021","DOIUrl":"https://doi.org/10.1093/jamia/ocaf021","url":null,"abstract":"<p><strong>Objective: </strong>This study aimed to develop a novel multi-stage self-supervised learning model tailored for the accurate classification of optical coherence tomography (OCT) images in ophthalmology reducing reliance on costly labeled datasets while maintaining high diagnostic accuracy.</p><p><strong>Materials and methods: </strong>A private dataset of 2719 OCT images from 493 patients was employed, along with 3 public datasets comprising 84 484 images from 4686 patients, 3231 images from 45 patients, and 572 images. Extensive internal, external, and clinical validation were performed to assess model performance. Grad-CAM was employed for qualitative analysis to interpret the model's decisions by highlighting relevant areas. Subsampling analyses evaluated the model's robustness with varying labeled data availability.</p><p><strong>Results: </strong>The proposed model outperformed conventional supervised or self-supervised learning-based models, achieving state-of-the-art results across 3 public datasets. In a clinical validation, the model exhibited up to 17.50% higher accuracy and 17.53% higher macro F-1 score than a supervised learning-based model under limited training data.</p><p><strong>Discussion: </strong>The model's robustness in OCT image classification underscores the potential of the multi-stage self-supervised learning to address challenges associated with limited labeled data. 
The availability of source codes and pre-trained models promotes the use of this model in a variety of clinical settings, facilitating broader adoption.</p><p><strong>Conclusion: </strong>This model offers a promising solution for advancing OCT image classification, achieving high accuracy while reducing the cost of extensive expert annotation and potentially streamlining clinical workflows, thereby supporting more efficient patient management.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expectations of healthcare AI and the role of trust: understanding patient views on how AI will impact cost, access, and patient-provider relationships.","authors":"Paige Nong, Molin Ji","doi":"10.1093/jamia/ocaf031","DOIUrl":"https://doi.org/10.1093/jamia/ocaf031","url":null,"abstract":"<p><strong>Objectives: </strong>Although efforts to effectively govern AI continue to develop, relatively little work has been done to systematically measure and include patient perspectives or expectations of AI in governance. This analysis is designed to understand patient expectations of healthcare AI.</p><p><strong>Materials and methods: </strong>Cross-sectional nationally representative survey of US adults fielded from June to July of 2023. A total of 2039 participants completed the survey and cross-sectional population weights were applied to produce national estimates.</p><p><strong>Results: </strong>Among US adults, 19.55% expect AI to improve their relationship with their doctor, while 19.4% expect it to increase affordability and 30.28% expect it will improve their access to care. Trust in providers and the healthcare system are positively associated with expectations of AI when controlling for demographic factors, general attitudes toward technology, and other healthcare-related variables.</p><p><strong>Discussion: </strong>US adults generally have low expectations of benefit from AI in healthcare, but those with higher trust in their providers and health systems are more likely to expect to benefit from AI.</p><p><strong>Conclusion: </strong>Trust and provider relationships should be key considerations for health systems as they create their AI governance processes and communicate with patients about AI tools. 
Evidence of patient benefit should be prioritized to preserve or promote trust.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The health data utility and the resurgence of health information exchanges as a national resource.","authors":"Anjum Khurshid, Indra Neil Sarkar","doi":"10.1093/jamia/ocaf032","DOIUrl":"https://doi.org/10.1093/jamia/ocaf032","url":null,"abstract":"<p><strong>Objectives: </strong>(1) Describe the evolution of Health Information Exchanges (HIEs) into Health Data Utilities (HDUs); (2) Provide motivation for HDUs as a public strategic investment target.</p><p><strong>Materials and methods: </strong>We examine trends in developing HIEs into HDUs and compare their criticality to that of the national highway system as an investment in the public good.</p><p><strong>Results: </strong>We propose that investment in HDUs is essential for our nation's healthcare data ecosystem. This investment will address the increased need for healthcare delivery and public health data.</p><p><strong>Discussion: </strong>HDUs can meet the current and future needs of healthcare delivery and public health surveillance. Their structure and capabilities will underpin their success to support data-driven decision-making.</p><p><strong>Conclusion: </strong>Transforming HIEs into HDUs is essential to realizing the vision of a distributed and connected healthcare data system. 
Public funding is critical for this model's success, similar to the continued investment in the national highway system.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing patient representation learning with inferred family pedigrees improves disease risk prediction.","authors":"Xiayuan Huang, Jatin Arora, Abdullah Mesut Erzurumluoglu, Stephen A Stanhope, Daniel Lam, Hongyu Zhao, Zhihao Ding, Zuoheng Wang, Johann de Jong","doi":"10.1093/jamia/ocae297","DOIUrl":"10.1093/jamia/ocae297","url":null,"abstract":"<p><strong>Background: </strong>Machine learning and deep learning are powerful tools for analyzing electronic health records (EHRs) in healthcare research. Although family health history has been recognized as a major predictor for a wide spectrum of diseases, research has so far adopted a limited view of family relations, essentially treating patients as independent samples in the analysis.</p><p><strong>Methods: </strong>To address this gap, we present ALIGATEHR, which models inferred family relations in a graph attention network augmented with an attention-based medical ontology representation, thus accounting for the complex influence of genetics, shared environmental exposures, and disease dependencies.</p><p><strong>Results: </strong>Taking disease risk prediction as a use case, we demonstrate that explicitly modeling family relations significantly improves predictions across the disease spectrum. We then show how ALIGATEHR's attention mechanism, which links patients' disease risk to their relatives' clinical profiles, successfully captures genetic aspects of diseases using longitudinal EHR diagnosis data. 
Finally, we use ALIGATEHR to successfully distinguish the 2 main inflammatory bowel disease subtypes with highly shared risk factors and symptoms (Crohn's disease and ulcerative colitis).</p><p><strong>Conclusion: </strong>Overall, our results highlight that family relations should not be overlooked in EHR research and illustrate ALIGATEHR's great potential for enhancing patient representation learning for predictive and interpretable modeling of EHRs.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"435-446"},"PeriodicalIF":4.7,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11833479/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142900000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An electronic health record metadata-mining approach to identifying patient-level interprofessional clinician teams in the intensive care unit.","authors":"Olga Yakusheva, Lara Khadr, Kathryn A Lee, Hannah C Ratliff, Deanna J Marriott, Deena Kelly Costa","doi":"10.1093/jamia/ocae275","DOIUrl":"10.1093/jamia/ocae275","url":null,"abstract":"<p><strong>Objectives: </strong>Advances in health informatics rapidly expanded use of big-data analytics and electronic health records (EHR) by clinical researchers seeking to optimize interprofessional ICU team care. This study developed and validated a program for extracting interprofessional teams assigned to each patient each shift from EHR event logs.</p><p><strong>Materials and methods: </strong>A retrospective analysis of EHR event logs for mechanically-ventilated patients 18 and older from 5 ICUs in an academic medical center during 1/1/2018-12/31/2019. We defined interprofessional teams as all medical providers (physicians, physician assistants, and nurse practitioners), registered nurses, and respiratory therapists assigned to each patient each shift. We created an EHR event logs-mining program that extracts clinicians who interact with each patient's medical record each shift. The algorithm was validated using the Message Understanding Conference (MUC-6) method against manual chart review of a random sample of 200 patient-shifts from each ICU by two independent reviewers.</p><p><strong>Results: </strong>Our sample included 4559 ICU encounters and 72 846 patient-shifts. Our program extracted 3288 medical providers, 2702 registered nurses, and 219 respiratory therapists linked to these encounters. Eighty-three percent of patient-shift teams included medical providers, 99.3% included registered nurses, and 74.1% included respiratory therapists; 63.4% of shift-level teams included clinicians from all three professions. 
The program demonstrated 95.9% precision, 96.2% recall, and high face validity.</p><p><strong>Discussion: </strong>Our EHR event logs-mining program has high precision, recall, and validity for identifying patient-level shift interprofessional teams in ICUs.</p><p><strong>Conclusions: </strong>Algorithmic and artificial intelligence approaches have a strong potential for informing research to optimize patient team assignments and improve ICU care and outcomes.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"426-434"},"PeriodicalIF":4.7,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11833494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142839957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessment of health conditions from patient electronic health record portals vs self-reported questionnaires: an analysis of the INSPIRE study.","authors":"Rohan Khera, Mitsuaki Sawano, Frederick Warner, Andreas Coppi, Aline F Pedroso, Erica S Spatz, Huihui Yu, Michael Gottlieb, Sharon Saydah, Kari A Stephens, Kristin L Rising, Joann G Elmore, Mandy J Hill, Ahamed H Idris, Juan Carlos C Montoy, Kelli N O'Laughlin, Robert A Weinstein, Arjun Venkatesh","doi":"10.1093/jamia/ocaf027","DOIUrl":"https://doi.org/10.1093/jamia/ocaf027","url":null,"abstract":"<p><strong>Objectives: </strong>Direct electronic access to multiple electronic health record (EHR) systems through patient portals offers a novel avenue for decentralized research. Given the critical value of patient characterization, we sought to compare computable evaluation of health conditions from patient-portal EHR against the traditional self-report.</p><p><strong>Materials and methods: </strong>In the nationwide Innovative Support for Patients with SARS-CoV-2 Infections Registry (INSPIRE) study, which linked self-reported questionnaires with multiplatform patient-portal EHR data, we compared self-reported health conditions across different clinical domains against computable definitions based on diagnosis codes, medications, vital signs, and laboratory testing. We assessed their concordance using Cohen's Kappa and the prognostic significance of differentially captured features as predictors of 1-year all-cause hospitalization risk.</p><p><strong>Results: </strong>Among 1683 participants (mean age 41 ± 15 years, 67% female, 63% non-Hispanic Whites), the prevalence of conditions varied substantially between EHR and self-report (-13.2% to +11.6% across definitions). Compared with comprehensive EHR phenotypes, self-report under-captured all conditions, including hypertension (27.9% vs 16.2%), diabetes (10.1% vs 6.2%), and heart disease (8.5% vs 4.3%). However, diagnosis codes alone were insufficient. 
The risk for 1-year hospitalization was better defined by the same features from patient-portal EHR (area under the receiver operating characteristic curve [AUROC] 0.79) than from self-report (AUROC 0.68).</p><p><strong>Discussion: </strong>EHR-derived computable phenotypes identified a higher prevalence of comorbidities than self-report, with prognostic value of additionally identified features. However, definitions based solely on diagnosis codes often under-captured self-reported conditions, suggesting a role of broader EHR elements.</p><p><strong>Conclusion: </strong>In this nationwide study, patient-portal-derived EHR data enabled extensive capture of patient characteristics across multiple EHR platforms, allowing better disease phenotyping compared with self-report.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing systematic literature reviews with generative artificial intelligence: development, applications, and performance evaluation.","authors":"Ying Li, Surabhi Datta, Majid Rastegar-Mojarad, Kyeryoung Lee, Hunki Paek, Julie Glasgow, Chris Liston, Long He, Xiaoyan Wang, Yingxin Xu","doi":"10.1093/jamia/ocaf030","DOIUrl":"https://doi.org/10.1093/jamia/ocaf030","url":null,"abstract":"<p><strong>Objectives: </strong>We developed and validated a large language model (LLM)-assisted system for conducting systematic literature reviews in health technology assessment (HTA) submissions.</p><p><strong>Materials and methods: </strong>We developed a five-module system using abstracts acquired from PubMed: (1) literature search query setup; (2) study protocol setup using population, intervention/comparison, outcome, and study type (PICOs) criteria; (3) LLM-assisted abstract screening; (4) LLM-assisted data extraction; and (5) data summarization. The system incorporates a human-in-the-loop design, allowing real-time PICOs criteria adjustment. This is achieved by collecting information on disagreements between the LLM and human reviewers regarding inclusion/exclusion decisions and their rationales, enabling informed PICOs refinement. We generated four evaluation sets including relapsed and refractory multiple myeloma (RRMM) and advanced melanoma to evaluate the LLM's performance in three key areas: (1) recommending inclusion/exclusion decisions during abstract screening, (2) providing valid rationales for abstract exclusion, and (3) extracting relevant information from included abstracts.</p><p><strong>Results: </strong>The system demonstrated relatively high performance across all evaluation sets. For abstract screening, it achieved an average sensitivity of 90%, F1 score of 82, accuracy of 89%, and Cohen's κ of 0.71, indicating substantial agreement between human reviewers and LLM-based results. 
In identifying specific exclusion rationales, the system attained accuracies of 97% and 84%, and F1 scores of 98 and 89 for RRMM and advanced melanoma, respectively. For data extraction, the system achieved an F1 score of 93.</p><p><strong>Discussion: </strong>Results showed high sensitivity, Cohen's κ, and PABAK (prevalence-adjusted bias-adjusted κ) for abstract screening, and high F1 scores for data extraction. This human-in-the-loop, AI-assisted systematic literature review (SLR) system demonstrates the potential of GPT-4's in-context learning capabilities by eliminating the need for manually annotated training data. In addition, this LLM-based system offers subject matter experts greater control through prompt adjustment and real-time feedback, enabling iterative refinement of PICOs criteria based on performance metrics.</p><p><strong>Conclusion: </strong>The system demonstrates potential to streamline systematic literature reviews, potentially reducing time, cost, and human errors while enhancing evidence generation for HTA submissions.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}