{"title":"Large language models for identifying depression concerns in cancer patients.","authors":"Yu Wang, Xin Ye, Huiping Luo, Wei Feng","doi":"10.1093/jamia/ocaf072","DOIUrl":"https://doi.org/10.1093/jamia/ocaf072","url":null,"abstract":"","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143990465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Intrahospital Transport Time via Proximity-Based Staff Assignments.","authors":"Christopher Sun, Martin S Copenhaver, Ana Cecilia Zenteno Langle, Bruno Viscomi, Ed Raeke, Bethany J Daily, Peter Dunn, Retsef Levi","doi":"10.1093/jamia/ocaf081","DOIUrl":"https://doi.org/10.1093/jamia/ocaf081","url":null,"abstract":"<p><strong>Background: </strong>Intrahospital patient transport is pivotal in enabling hospital operations and facilitating safe and efficient patient movement. However, transport delays are common in hospitals, signaling a need for improvement. This study develops, implements, and evaluates a proximity-based transporter-to-request assignment system aimed at improving transport service system efficiency.</p><p><strong>Methods: </strong>In this observational study, we used discrete-event simulation to design and optimize an enhancement to an electronic medical record's original first-in, first-out transporter-to-request assignment system, and we implemented it at a quaternary care academic medical center. Our enhancement prioritizes requests based on the proximity of available transporters within pre-specified areas. We compared transport request completion time (primary outcome) and the percentage of transports exceeding 45 minutes (secondary outcome) during control (01/2021-02/2022) and intervention (02/2022-03/2023) periods and estimated their differences using multivariate generalized linear models to adjust for confounding factors including variable workforce levels and workload.</p><p><strong>Results: </strong>A total of 136,414 transport requests were included in the study. 
The intervention was associated with an adjusted 5.0% (95% confidence interval 1.8%-8.5%) reduction in completion times and a 16.0% (7.4%-23.9%) relative reduction in the percentage of trips exceeding the 45-minute completion time target.</p><p><strong>Discussion: </strong>The intervention's improvements stem from reductions in unnecessary travel time between transport requests, common to first-in first-out assignment systems. The intervention was designed to be natively integrated into existing electronic health record systems, reducing barriers to real-world adoption.</p><p><strong>Conclusion: </strong>Implementing a proximity-based assignment system designed based on simulation-optimization modeling improved intrahospital patient transport efficiency without requiring additional staff.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144019956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative analysis of privacy-preserving large language models for automated echocardiography report analysis.","authors":"Elham Mahmoudi, Sanaz Vahdati, Chieh-Ju Chao, Bardia Khosravi, Ajay Misra, Francisco Lopez-Jimenez, Bradley J Erickson","doi":"10.1093/jamia/ocaf056","DOIUrl":"https://doi.org/10.1093/jamia/ocaf056","url":null,"abstract":"<p><strong>Background: </strong>Automated data extraction from echocardiography reports could facilitate large-scale registry creation and clinical surveillance of valvular heart diseases (VHD). We evaluated the performance of open-source large language models (LLMs) guided by prompt instructions and chain of thought (CoT) for this task.</p><p><strong>Methods: </strong>From consecutive transthoracic echocardiographies performed in our center, we utilized 200 random reports from 2019 for prompt optimization and 1000 from 2023 for evaluation. Five instruction-tuned LLMs (Qwen2.0-72B, Llama3.0-70B, Mixtral8-46.7B, Llama3.0-8B, and Phi3.0-3.8B) were guided by prompt instructions with and without CoT to classify prosthetic valve presence and VHD severity. Performance was evaluated using classification metrics against expert-labeled ground truth. Mean squared error (MSE) was also calculated for predicted severity's deviation from actual severity.</p><p><strong>Results: </strong>With CoT prompting, Llama3.0-70B and Qwen2.0 achieved the highest performance (accuracy: 99.1% and 98.9% for VHD severity; 100% and 99.9% for prosthetic valve; MSE: 0.02 and 0.05, respectively). Smaller models showed lower accuracy for VHD severity (54.1%-85.9%) but maintained high accuracy for prosthetic valve detection (>96%). Chain of thought reasoning yielded higher accuracy for larger models while increasing processing time from 2-25 to 67-154 seconds per report. 
Based on the models' CoT reasoning, incorrect predictions were mainly due to model outputs being influenced by irrelevant information in the text or to failure to follow the prompt instructions.</p><p><strong>Conclusions: </strong>Our study demonstrates the near-perfect performance of open-source LLMs for automated echocardiography report interpretation with the purpose of registry formation and disease surveillance. While larger models achieved exceptional accuracy through prompt optimization, practical implementation requires balancing performance with computational efficiency.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144051988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CDEMapper: enhancing National Institutes of Health common data element use with large language models.","authors":"Yan Wang, Jimin Huang, Huan He, Vincent Zhang, Yujia Zhou, Xubing Hao, Pritham Ram, Lingfei Qian, Qianqian Xie, Ruey-Ling Weng, Fongci Lin, Yan Hu, Licong Cui, Xiaoqian Jiang, Hua Xu, Na Hong","doi":"10.1093/jamia/ocaf064","DOIUrl":"https://doi.org/10.1093/jamia/ocaf064","url":null,"abstract":"<p><strong>Objective: </strong>Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs presents challenges due to the broad range and variety of data elements. This study aims to develop a CDE mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs.</p><p><strong>Methods: </strong>We propose CDEMapper, a large language model (LLM)-powered mapping tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has 3 core modules: (1) CDE indexing and embeddings. NIH CDEs were indexed and embedded to support semantic search; (2) CDE recommendations. The tool combines Elasticsearch (BM25 methods) with GPT services to recommend candidate CDEs and their permissible values; and (3) Human review. Users review and select the best match for their data elements and value sets. We evaluate the tool's recommendation accuracy and usability against manual annotations and testing.</p><p><strong>Results: </strong>CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. The evaluation results demonstrated that the augmented BM25 with GPT embeddings and a GPT ranker achieved the overall best performance. 
The usability test also highlighted the effectiveness and efficiency of our tool.</p><p><strong>Discussions and conclusions: </strong>This work opens up the potential of using LLMs to assist with CDE mapping when aligning local data elements with NIH CDEs. Additionally, this effort helps researchers better understand the gaps between their data elements and NIH CDEs while promoting CDE reusability.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144005421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adoption of artificial intelligence in healthcare: survey of health system priorities, successes, and challenges.","authors":"Eric G Poon, Christy Harris Lemak, Juan C Rojas, Janet Guptill, David Classen","doi":"10.1093/jamia/ocaf065","DOIUrl":"https://doi.org/10.1093/jamia/ocaf065","url":null,"abstract":"<p><strong>Importance: </strong>The US healthcare system faces significant challenges, including clinician burnout, operational inefficiencies, and concerns about patient safety. Artificial intelligence (AI), particularly generative AI, has the potential to address these challenges, but its adoption, effectiveness, and barriers to implementation are not well understood.</p><p><strong>Objective: </strong>To evaluate the current state of AI adoption in US healthcare systems and assess successes and barriers to implementation during the early generative AI era.</p><p><strong>Design, setting, and participants: </strong>This cross-sectional survey was conducted in Fall 2024 and included 67 health system members of the Scottsdale Institute, a collaborative of US non-profit healthcare organizations. Forty-three health systems completed the survey (64% response rate). Respondents provided data on the deployment status and perceived success of 37 AI use cases across 10 categories.</p><p><strong>Main outcomes and measures: </strong>The primary outcomes were the extent of AI use case development, piloting, or deployment, the degree of reported success for AI use cases, and the most significant barriers to adoption.</p><p><strong>Results: </strong>Across the 43 responding health systems, AI adoption and perceptions of success varied significantly. Ambient Notes, a generative AI tool for clinical documentation, was the only use case with 100% of respondents reporting adoption activities, and 53% reported a high degree of success with using AI for Clinical Documentation. 
Imaging and radiology emerged as the most widely deployed clinical AI use case, with 90% of organizations reporting at least partial deployment, although successes with diagnostic use cases were limited. Similarly, many organizations have deployed AI for clinical risk stratification such as early sepsis detection, but only 38% report high success in this area. Immature AI tools were identified as a significant barrier to adoption, cited by 77% of respondents, followed by financial concerns (47%) and regulatory uncertainty (40%).</p><p><strong>Conclusions and relevance: </strong>Ambient Notes is rapidly advancing in US healthcare systems and demonstrating early success. Other AI use cases show varying degrees of adoption and success, constrained by barriers such as immature AI tools, financial concerns, and regulatory uncertainty. Addressing these challenges through robust evaluations, shared strategies, and governance models will be essential to ensure effective integration and adoption of AI into healthcare practice.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144057241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expectations of healthcare AI and the role of trust: understanding patient views on how AI will impact cost, access, and patient-provider relationships.","authors":"Paige Nong, Molin Ji","doi":"10.1093/jamia/ocaf031","DOIUrl":"10.1093/jamia/ocaf031","url":null,"abstract":"<p><strong>Objectives: </strong>Although efforts to effectively govern AI continue to develop, relatively little work has been done to systematically measure and include patient perspectives or expectations of AI in governance. This analysis is designed to understand patient expectations of healthcare AI.</p><p><strong>Materials and methods: </strong>Cross-sectional nationally representative survey of US adults fielded from June to July of 2023. A total of 2039 participants completed the survey and cross-sectional population weights were applied to produce national estimates.</p><p><strong>Results: </strong>Among US adults, 19.55% expect AI to improve their relationship with their doctor, while 19.4% expect it to increase affordability and 30.28% expect it will improve their access to care. Trust in providers and the healthcare system are positively associated with expectations of AI when controlling for demographic factors, general attitudes toward technology, and other healthcare-related variables.</p><p><strong>Discussion: </strong>US adults generally have low expectations of benefit from AI in healthcare, but those with higher trust in their providers and health systems are more likely to expect to benefit from AI.</p><p><strong>Conclusion: </strong>Trust and provider relationships should be key considerations for health systems as they create their AI governance processes and communicate with patients about AI tools. 
Evidence of patient benefit should be prioritized to preserve or promote trust.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"795-799"},"PeriodicalIF":4.7,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012342/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A call for the informatics community to define priority practice and research areas at the intersection of climate and health: report from 2023 mini-summit.","authors":"Titus Schleyer, Manijeh Berenji, Monica Deck, Hana Chung, Joshua Choi, Theresa A Cullen, Timothy Burdick, Amanda Zaleski, Kelly Jean Thomas Craig, Oluseyi Fayanju, Muhammad Muinul Islam","doi":"10.1093/jamia/ocae292","DOIUrl":"10.1093/jamia/ocae292","url":null,"abstract":"<p><strong>Objective: </strong>Although biomedical informatics has multiple roles to play in addressing the climate crisis, collaborative action and research agendas have yet to be developed. As a first step, AMIA's new Climate, Health, and Informatics Working Group held a mini-summit entitled Climate and health: How can informatics help? during the AMIA 2023 Fall Symposium to define an initial set of areas of interest and begin mobilizing informaticians to confront the urgent challenges of climate change.</p><p><strong>Materials and methods: </strong>The AMIA Climate, Health, and Informatics Working Group (at the time, an AMIA Discussion Forum), the International Medical Informatics Association (IMIA), the International Academy of Health Sciences Informatics (IAHSI), and the Regenstrief Institute hosted a mini-summit entitled Climate and health: How can informatics help? on November 11, 2023, during the AMIA 2023 Annual Symposium (New Orleans, LA, USA). Using an affinity diagramming approach, the mini-summit organizers posed 2 questions to ∼50 attendees (40 in-person, 10 virtual).</p><p><strong>Results: </strong>Participants expressed a broad array of viewpoints on actions that can be undertaken now and areas needing research to support future actions. Areas of current action ranged from enhanced education to expanded telemedicine to assessment of community vulnerability. 
Areas of research ranged from emergency preparedness to climate-specific clinical coding to risk prediction models.</p><p><strong>Discussion: </strong>The mini-summit was intended as a first step in helping the informatics community at large set application and research priorities for climate, health, and informatics.</p><p><strong>Conclusion: </strong>The working group will use these perspectives as it seeks further input, and begins to establish priorities for climate-related biomedical informatics actions and research.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"971-979"},"PeriodicalIF":4.7,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012334/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143626625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development and validation of a multi-stage self-supervised learning model for optical coherence tomography image classification.","authors":"Sungho Shim, Min-Soo Kim, Che Gyem Yae, Yong Koo Kang, Jae Rock Do, Hong Kyun Kim, Hyun-Lim Yang","doi":"10.1093/jamia/ocaf021","DOIUrl":"10.1093/jamia/ocaf021","url":null,"abstract":"<p><strong>Objective: </strong>This study aimed to develop a novel multi-stage self-supervised learning model tailored for the accurate classification of optical coherence tomography (OCT) images in ophthalmology reducing reliance on costly labeled datasets while maintaining high diagnostic accuracy.</p><p><strong>Materials and methods: </strong>A private dataset of 2719 OCT images from 493 patients was employed, along with 3 public datasets comprising 84 484 images from 4686 patients, 3231 images from 45 patients, and 572 images. Extensive internal, external, and clinical validation were performed to assess model performance. Grad-CAM was employed for qualitative analysis to interpret the model's decisions by highlighting relevant areas. Subsampling analyses evaluated the model's robustness with varying labeled data availability.</p><p><strong>Results: </strong>The proposed model outperformed conventional supervised or self-supervised learning-based models, achieving state-of-the-art results across 3 public datasets. In a clinical validation, the model exhibited up to 17.50% higher accuracy and 17.53% higher macro F-1 score than a supervised learning-based model under limited training data.</p><p><strong>Discussion: </strong>The model's robustness in OCT image classification underscores the potential of the multi-stage self-supervised learning to address challenges associated with limited labeled data. 
The availability of source codes and pre-trained models promotes the use of this model in a variety of clinical settings, facilitating broader adoption.</p><p><strong>Conclusion: </strong>This model offers a promising solution for advancing OCT image classification, achieving high accuracy while reducing the cost of extensive expert annotation and potentially streamlining clinical workflows, thereby supporting more efficient patient management.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"800-810"},"PeriodicalIF":4.7,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012341/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing the efficiency and effectiveness of data quality assurance in a multicenter clinical dataset.","authors":"Anne Fu, Trong Shen, Surain B Roberts, Weihan Liu, Shruthi Vaidyanathan, Kayley-Jasmin Marchena-Romero, Yuen Yu Phyllis Lam, Kieran Shah, Denise Y F Mak, Fahad Razak, Amol A Verma","doi":"10.1093/jamia/ocaf042","DOIUrl":"10.1093/jamia/ocaf042","url":null,"abstract":"<p><strong>Objectives: </strong>Electronic health records (EHRs) data are increasingly used for research and analysis, but there is little empirical evidence to inform how automated and manual assessments can be combined to efficiently assess data quality in large EHR repositories.</p><p><strong>Materials and methods: </strong>The GEMINI database collected data from 462 226 patient admissions across 32 hospitals from 2021 to 2023. We report data quality issues identified through semi-automated and manual data quality assessments completed during the data collection phase. We conducted a simulation experiment to evaluate the relationship between the number of records reviewed manually, the detection of true data errors (true positives) and the number of manual chart abstraction errors (false positives) that required unnecessary investigation.</p><p><strong>Results: </strong>The semi-automated data quality assessments identified 79 data quality issues requiring correction, of which 14 had a large impact, affecting at least 50% of records in the data. After resolving issues identified through semi-automated assessments, manual validation of 2676 patient encounters at 19 hospitals identified 4 new meaningful data errors (3 in transfusion data and 1 in physician identifiers), distributed across 4 hospitals. There were 365 manual chart abstraction errors, which required investigation by data analysts to identify as \"false positives.\" These errors increased linearly with the number of charts reviewed manually. 
Simulation results demonstrate that all 3 transfusion data errors were identified with 95% sensitivity after manual review of 5 records, whereas 18 records were needed for the physician's table.</p><p><strong>Discussion and conclusion: </strong>The GEMINI approach represents a scalable framework for data quality assessment and improvement in multisite EHR research databases. Manual data review is important but can be minimized to optimize the trade-off between true and false identification of data quality errors.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"835-844"},"PeriodicalIF":4.7,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012372/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143626627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance automated abstract screening with large language model ensembles.","authors":"Rohan Sanghera, Arun James Thirunavukarasu, Marc El Khoury, Jessica O'Logbon, Yuqing Chen, Archie Watt, Mustafa Mahmood, Hamid Butt, George Nishimura, Andrew A S Soltan","doi":"10.1093/jamia/ocaf050","DOIUrl":"10.1093/jamia/ocaf050","url":null,"abstract":"<p><strong>Objective: </strong>Abstract screening is a labor-intensive component of systematic review involving repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.</p><p><strong>Materials and methods: </strong>LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).</p><p><strong>Results: </strong>On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, the best performing LLM-prompt combinations exhibited consistent sensitivity (range 0.756-1.000) but diminished precision (range 0.004-0.096) due to class imbalance. 
In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity with a maximal precision of 0.458 on the development dataset, decreasing to 0.1450 over the comprehensive dataset but conferring workload reductions ranging between 37.55% and 99.11%.</p><p><strong>Discussion: </strong>Automated abstract screening can reduce the screening workload in systematic review while maintaining quality. Performance variation between reviews highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight over all records.</p><p><strong>Conclusion: </strong>LLMs may reduce the human labor cost of systematic review with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":"893-904"},"PeriodicalIF":4.7,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012331/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143677361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}