NEJM AI — Latest Articles

Expert-Level Detection of Epilepsy Markers in EEG on Short and Long Timescales.
NEJM AI. Pub Date: 2025-07-01. Epub Date: 2025-06-26. DOI: 10.1056/aioa2401221
J Li, D M Goldenholz, M Alkofer, C Sun, F A Nascimento, J J Halford, B C Dean, M Galanti, A F Struck, A S Greenblatt, A D Lam, A Herlopian, C Nwankwo, D Weber, D Maus, H A Haider, I Karakis, J Y Yoo, M C Ng, O Selioutski, O Taraschenko, G Osman, R Katyal, S E Schmitt, S Benbadis, S S Cash, W O Tatum, Z Sheikh, W Y Kong, G Bayas, N Turley, S Hong, M B Westover, J Jing
Background: Epileptiform discharges, or spikes, within electroencephalogram (EEG) recordings are essential for diagnosing epilepsy and localizing seizure origins. Artificial intelligence (AI) offers a promising approach to automating detection, but current models are often hindered by artifact-related false positives and often target either event- or EEG-level classification, thus limiting clinical utility.
Methods: We developed SpikeNet2, a deep-learning model based on a residual network architecture, and enhanced it with hard-negative mining to reduce false positives. Our study analyzed 17,812 EEG recordings from 13,523 patients across multiple institutions, including Massachusetts General Brigham (MGB) hospitals. Data from the Human Epilepsy Project (HEP) and SCORE-AI (SAI) were also included. A total of 32,433 event-level samples, labeled by experts, were used for training and evaluation. Performance was assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), calibration error, and a modified area under the curve (mAUC) metric. The model's generalizability was evaluated using external datasets.
Results: SpikeNet2 demonstrated strong performance in event-level spike detection, achieving an AUROC of 0.973 and an AUPRC of 0.995, with 44% of experts surpassing the model on the MGB dataset. In external validation, the model achieved an AUROC of 0.942 and an AUPRC of 0.948 on the HEP dataset. For EEG-level classification, SpikeNet2 recorded an AUROC of 0.958 and an AUPRC of 0.959 on the MGB dataset, an AUROC of 0.888 and an AUPRC of 0.823 on the HEP dataset, and an AUROC of 0.995 and an AUPRC of 0.991 on the SAI dataset, with 32% of experts outperforming the model. The false-positive rate was reduced to an average of nine spikes per hour.
Conclusions: SpikeNet2 offers expert-level accuracy in both event-level spike detection and EEG-level classification, while significantly reducing false positives. Its dual functionality and robust performance across diverse datasets make it a promising tool for clinical and telemedicine applications, particularly in resource-limited settings. (Funded by the National Institutes of Health and others.)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12276842/pdf/
Citations: 0
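The hard-negative mining step named in the Methods retrains a detector on its most confident false positives (for example, artifacts scored as spikes). A minimal sketch of one mining round, assuming per-window model scores and binary expert labels; the function and toy data are illustrative, not the SpikeNet2 code:

```python
import numpy as np

def mine_hard_negatives(scores, labels, top_k):
    """Return indices of the top_k highest-scoring true negatives.

    These confidently wrong windows (e.g., artifacts mistaken for spikes)
    are added back into the training set so the next training round
    concentrates on them.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg_idx = np.flatnonzero(labels == 0)        # expert-labeled non-spikes
    order = np.argsort(scores[neg_idx])[::-1]    # most confident first
    return neg_idx[order[:top_k]].tolist()

# Toy round: scores for six EEG windows; 1 = spike, 0 = no spike.
scores = [0.95, 0.10, 0.80, 0.40, 0.92, 0.05]
labels = [1, 0, 0, 0, 1, 0]
print(mine_hard_negatives(scores, labels, top_k=2))  # → [2, 3]
```

Window 2 is the classic hard negative: a non-spike the model scored at 0.80.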
Longitudinal Risk Prediction for Pediatric Glioma with Temporal Deep Learning.
NEJM AI. Pub Date: 2025-05-01. Epub Date: 2025-04-24. DOI: 10.1056/aioa2400703
Divyanshu Tak, Biniam A Garomsa, Anna Zapaishchykova, Zezhong Ye, Sridhar Vajapeyam, Maryam Mahootiha, Juan Carlos Climent Pardo, Ceilidh Smith, Ariana M Familiar, Tafadzwa L Chaunzwa, Kevin X Liu, Sanjay P Prabhu, Pratiti Bandopadhayay, Ali Nabavizadeh, Sabine Mueller, Hugo J W L Aerts, Daphne Haas-Kogan, Tina Y Poussaint, Benjamin H Kann
Background: Pediatric glioma recurrence can cause morbidity and mortality; however, recurrence patterns and severity are heterogeneous and challenging to predict with established clinical and genomic markers. As a result, almost all children undergo frequent, long-term magnetic resonance imaging (MRI) brain surveillance regardless of individual recurrence risk. Longitudinal deep-learning analysis of serial MRI scans may be an effective approach for improving individualized recurrence prediction in gliomas and other cancers, but progress has thus far been limited by data availability and current machine-learning approaches.
Methods: We developed a self-supervised temporal deep-learning approach tailored for longitudinal medical imaging analysis, wherein a multistep model encodes patients' serial MRI scans and is trained to classify the correct chronological order as a pretext task. The pretrained model is then fine-tuned to predict the primary end point of interest - in this case, 1-year recurrence prediction for pediatric gliomas from the point of last scan - by leveraging a patient's historical postoperative surveillance scans. We applied the model across 3994 scans from 715 patients followed at three separate institutions in the setting of pediatric low- and high-grade gliomas.
Results: Longitudinal imaging analysis with temporal learning improved recurrence prediction performance (F1 score) by up to 58.5% (range, 6.6 to 58.5%) compared with traditional approaches across datasets, with performance improvements in both low- and high-grade gliomas and areas under the receiver operating characteristic curve ranging from 75 to 89% across all datasets. Recurrence prediction performance increased incrementally with the number of historical scans available per patient, reaching plateaus between three and six scans, depending on the dataset.
Conclusions: Temporal deep learning enables high-performing longitudinal medical imaging analysis and point-of-care decision support for pediatric brain tumors. Temporal learning may be broadly adaptable to track and predict risk in patients with other cancers and chronic diseases undergoing surveillance imaging. (Funded in part by the National Institutes of Health/National Cancer Institute [U54 CA274516 and P50 CA165962] and the Botha-Chan Low Grade Glioma Consortium.)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12176428/pdf/
Citations: 0
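The pretext task described in the Methods - classifying whether a patient's scan sequence is in correct chronological order - needs no labels beyond the scan dates themselves. A toy sketch of how such self-supervised training pairs might be generated (a binary in-order/out-of-order framing is an assumption; the paper's exact task formulation may differ):

```python
import random

def make_order_pretext_sample(scan_sequence):
    """Build one self-supervised sample: the sequence either in its true
    chronological order (label 1) or genuinely shuffled (label 0).
    Assumes at least two scans with distinct identifiers. A model trained
    on many such pairs must learn how serial imaging evolves over time
    before being fine-tuned on recurrence prediction."""
    order = list(range(len(scan_sequence)))
    if random.random() < 0.5:
        label = 1                        # keep chronological order
    else:
        label = 0
        while order == sorted(order):    # reshuffle until truly out of order
            random.shuffle(order)
    return [scan_sequence[i] for i in order], label

random.seed(7)
scans = ["2020-01-scan", "2020-07-scan", "2021-02-scan", "2021-09-scan"]
sequence, label = make_order_pretext_sample(scans)
```

In the real pipeline each element would be an encoded MRI volume rather than a string.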
Machine Learning Achieves Pathologist-Level Coeliac Disease Diagnosis.
NEJM AI. Pub Date: 2025-03-27. DOI: 10.1056/AIoa2400738
F Jaeckle, J Denholm, B Schreiber, S C Evans, M N Wicks, J Y H Chan, A C Bateman, S Natu, M J Arends, E Soilleux
Background: The diagnosis of coeliac disease (CD), an autoimmune disorder with an estimated global prevalence of around 1%, generally relies on the histological examination of duodenal biopsies. However, inter-pathologist agreement for coeliac disease diagnosis is estimated to be no more than 80%. We aimed to improve coeliac disease diagnosis by developing a novel, accurate, machine-learning-based diagnostic classifier.
Methods: We present a machine-learning model that diagnoses the presence or absence of coeliac disease from a set of duodenal biopsies representative of real-world clinical data. Our model was trained on a diverse dataset of 3,383 whole-slide images (WSIs) of H&E-stained duodenal biopsies from four hospitals, featuring five different WSI scanners, along with their clinical diagnoses. We trained our model using the multiple-instance-learning paradigm in a weakly supervised manner with cross-validation and evaluated it on an independent test set featuring 644 unseen scans from a different regional NHS Trust. Additionally, we compared the model's predictions to independent diagnoses from four specialist pathologists on a subset of the test data.
Results: Our model diagnosed coeliac disease in an independent test set from a previously unseen source with accuracy, sensitivity, and specificity exceeding 95% and an area under the ROC curve exceeding 99%. These results indicate that the model has the potential to outperform pathologists. In comparing the model's predictions to diagnoses on unseen test data from four independent pathologists, we found statistically indistinguishable results between pathologist-pathologist and pathologist-model inter-observer agreement (p > 96%).
Conclusions: Our model achieved pathologist-level performance in diagnosing the presence or absence of coeliac disease from a representative set of duodenal biopsies, representing a significant advancement towards the adoption of machine learning in clinical practice. Additionally, it demonstrated strong generalisability, performing equally well on biopsies from a previously unseen hospital. We concluded that our model has the potential to revolutionise duodenal biopsy diagnosis by accurately identifying or ruling out coeliac disease, thereby significantly reducing the time required for pathologists to make a diagnosis.
NEJM AI 2(4): aioa2400738. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7617718/pdf/
Citations: 0
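Weakly supervised multiple-instance learning, as used in the Methods, trains on slide-level diagnoses only: each WSI is a bag of tiles, and a pooling step aggregates tile features into one slide embedding. A common choice is attention-based pooling; this sketch illustrates that general technique, not the paper's published architecture, and all array shapes here are made up:

```python
import numpy as np

def mil_attention_pool(tile_feats, V, w):
    """Attention-based MIL pooling: score each tile's relevance, softmax
    the scores into weights, and return the attention-weighted sum as the
    slide-level embedding. Only a slide-level label is needed to train
    the downstream classifier end to end."""
    scores = np.tanh(tile_feats @ V) @ w      # one relevance score per tile
    a = np.exp(scores - scores.max())
    a /= a.sum()                              # attention weights, sum to 1
    return a @ tile_feats, a

rng = np.random.default_rng(0)
tiles = rng.normal(size=(6, 8))               # 6 tiles, 8-dim features each
V = rng.normal(size=(8, 4))                   # learned projection (toy)
w = rng.normal(size=4)                        # learned attention vector (toy)
slide_embedding, attn = mil_attention_pool(tiles, V, w)
```

At inference the attention weights double as a crude interpretability map over tiles.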
Large Language Models for More Efficient Reporting of Hospital Quality Measures.
NEJM AI. Pub Date: 2024-10-24. Epub Date: 2024-10-21. DOI: 10.1056/aics2400420
Aaron Boussina, Rishivardhan Krishnamoorthy, Kimberly Quintero, Shreyansh Joshi, Gabriel Wardi, Hayden Pour, Nicholas Hilbert, Atul Malhotra, Michael Hogarth, Amy M Sitapati, Chad VanDenBerg, Karandeep Singh, Christopher A Longhurst, Shamim Nemati
Hospital quality measures are a vital component of a learning health system, yet they can be costly to report, statistically underpowered, and inconsistent due to poor interrater reliability. Large language models (LLMs) have recently demonstrated impressive performance on health care-related tasks and offer a promising way to provide accurate abstraction of complete charts at scale. To evaluate this approach, we deployed an LLM-based system that ingests Fast Healthcare Interoperability Resources (FHIR) data and outputs a completed Severe Sepsis and Septic Shock Management Bundle (SEP-1) abstraction. We tested the system on a sample of 100 manual SEP-1 abstractions that University of California San Diego Health reported to the Centers for Medicare & Medicaid Services in 2022. The LLM system achieved agreement with manual abstractors on the measure category assignment in 90 of the abstractions (90%; κ=0.82; 95% confidence interval, 0.71 to 0.92). Expert review of the 10 discordant cases identified four that were mistakes introduced by manual abstraction. This pilot study suggests that LLMs using interoperable electronic health record data may perform accurate abstractions for complex quality measures. (Funded by the National Institute of Allergy and Infectious Diseases [1R42AI177108-1] and others.)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11658346/pdf/
Citations: 0
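The κ=0.82 reported above is Cohen's kappa: raw agreement corrected for the agreement two raters would reach by chance. A minimal implementation of the statistic itself (illustrative; the study's exact computation over SEP-1 measure categories is not shown here):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Two raters agreeing on 3 of 4 binary labels:
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # → 0.5
```

Note that 75% raw agreement shrinks to κ=0.5 once chance agreement is removed, which is why kappa is the preferred interrater statistic.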
Prospective Multi-Site Validation of AI to Detect Tuberculosis and Chest X-Ray Abnormalities.
NEJM AI. Pub Date: 2024-10-01. Epub Date: 2024-09-26. DOI: 10.1056/aioa2400018
Sahar Kazemzadeh, Atilla P Kiraly, Zaid Nabulsi, Nsala Sanjase, Minyoi Maimbolwa, Brian Shuma, Shahar Jamshy, Christina Chen, Arnav Agharwal, Charles T Lau, Andrew Sellergren, Daniel Golden, Jin Yu, Eric Wu, Yossi Matias, Katherine Chou, Greg S Corrado, Shravya Shetty, Daniel Tse, Krish Eswaran, Yun Liu, Rory Pilgrim, Monde Muyoyeta, Shruthi Prabhakara
Background: Using artificial intelligence (AI) to interpret chest X-rays (CXRs) could support accessible triage tests for active pulmonary tuberculosis (TB) in resource-constrained settings.
Methods: The performance of two cloud-based CXR AI systems - one to detect TB and the other to detect CXR abnormalities - in a population with a high TB and human immunodeficiency virus (HIV) burden was evaluated. We recruited 1978 adults who had TB symptoms, were close contacts of known TB patients, or were newly diagnosed with HIV at three clinical sites. The TB-detecting AI (TB AI) scores were converted to binary using two thresholds: a high-sensitivity threshold and an exploratory threshold designed to resemble radiologist performance. Ten radiologists reviewed images for signs of TB, blinded to the reference standard. Primary analysis measured AI detection noninferiority to radiologist performance. Secondary analysis evaluated AI detection as compared with the World Health Organization (WHO) targets (90% sensitivity, 70% specificity). Both used an absolute margin of 5%. The abnormality-detecting AI (abnormality AI) was evaluated for noninferiority to a high-sensitivity target suitable for triaging (90% sensitivity, 50% specificity).
Results: Of the 1910 patients analyzed, 1827 (96%) had conclusive TB status, of which 649 (36%) were HIV positive and 192 (11%) were TB positive. The TB AI's sensitivity and specificity were 87% and 70%, respectively, at the high-sensitivity threshold and 78% and 82%, respectively, at the balanced threshold. Radiologists' mean sensitivity was 76% and mean specificity was 82%. At the high-sensitivity threshold, the TB AI was noninferior to average radiologist sensitivity (P<0.001) but not to average radiologist specificity (P=0.99) and was higher than the WHO target for specificity but not sensitivity. At the balanced threshold, the TB AI was comparable to radiologists. The abnormality AI's sensitivity and specificity were 97% and 79%, respectively, with both meeting the prespecified targets.
Conclusions: The CXR TB AI was noninferior to radiologists for active pulmonary TB triaging in a population with a high TB and HIV burden. Neither the TB AI nor the radiologists met WHO recommendations for sensitivity in the study population. AI can also be used to detect other CXR abnormalities in the same population.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737584/pdf/
Citations: 0
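The Methods convert continuous TB AI scores to binary calls at two operating thresholds. One common way such a high-sensitivity cutoff is chosen on a tuning set is sketched below; the helper is hypothetical and not the study's actual threshold-selection procedure:

```python
import numpy as np

def cutoff_for_target_sensitivity(scores, labels, target_sens):
    """Largest cutoff (score >= cutoff called positive) whose sensitivity
    on a tuning set still meets the target. Taking the largest such
    cutoff keeps specificity as high as possible at that sensitivity."""
    pos = np.sort(np.asarray(scores, dtype=float)[np.asarray(labels) == 1])
    # At most floor((1 - target) * n_pos) positives may fall below the cutoff.
    # The tiny epsilon guards against float error in the product.
    k = int(np.floor((1 - target_sens) * len(pos) + 1e-9))
    return pos[k]

# Toy tuning set: 10 TB-positive and 5 TB-negative scans.
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95,
          0.02, 0.08, 0.12, 0.30, 0.60]
labels = [1] * 10 + [0] * 5
t_hi = cutoff_for_target_sensitivity(scores, labels, 0.90)  # → 0.15
```

At cutoff 0.15, nine of the ten positives score at or above it, giving exactly the 90% target sensitivity on this toy set.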
Validation of a Mobile App for Remote Autism Screening in Toddlers.
NEJM AI. Pub Date: 2024-10-01. Epub Date: 2024-09-26. DOI: 10.1056/AIcs2400510
Pradeep Raj Krishnappa Babu, J Matias Di Martino, Rachel Aiello, Brian Eichner, Steven Espinosa, Jennifer Green, Jill Howard, Sam Perochon, Marina Spanos, Saritha Vermeer, Geraldine Dawson, Guillermo Sapiro
Early detection of autism is important for timely access to diagnostic evaluation and early intervention services, which improve children's outcomes. Despite the ability of clinicians to reliably diagnose autism in toddlers, diagnosis is often delayed. SenseToKnow is a mobile autism screening application (app) delivered on a smartphone or tablet that provides an objective and quantitative assessment of early behavioral signs of autism based on computer vision (CV) and machine learning (ML). This study examined the accuracy of SenseToKnow for autism detection when the app was downloaded and administered remotely at home by caregivers using their own devices. The SenseToKnow app was administered by caregivers of 620 toddlers between 16 and 40 months of age, 188 of whom were subsequently diagnosed with autism by expert clinicians. The app displayed strategically designed movies and a bubble-popping game on an iPhone or iPad while recording the child's behavioral responses through the device's front-facing camera and touch/inertial sensors. Recordings of the child's behavior were then automatically analyzed using CV. Multiple behavioral phenotypes were quantified and combined using ML in an algorithm for autism prediction. SenseToKnow demonstrated a high level of diagnostic accuracy with an area under the receiver operating characteristic curve of 0.92, sensitivity of 83.0%, specificity of 93.3%, positive predictive value of 84.3%, and negative predictive value of 92.6%. Accuracy of the app for detecting autism was similar when administered on either a caregiver's iPhone or iPad. These results demonstrate that a mobile autism screening app based on CV can be delivered remotely by caregivers at home on their own devices and can provide a high level of accuracy for autism detection. Remote screening for autism potentially lowers barriers to autism screening, which could reduce disparities in early access to services and support and improve children's outcomes.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12107789/pdf/
Citations: 0
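The four screening metrics reported above all derive from one confusion matrix. The counts below are back-solved approximately from the reported cohort (620 toddlers, 188 diagnosed, sensitivity 83.0%, specificity 93.3%) and are illustrative, not the study's raw data:

```python
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # precision among positive calls
        "npv": tn / (tn + fn),          # reassurance value of a negative call
    }

# Approximate counts: 188 autistic (156 detected, 32 missed) and
# 432 non-autistic (403 correctly screened negative, 29 false alarms).
m = screening_metrics(tp=156, fp=29, fn=32, tn=403)
```

Plugging these counts in reproduces the abstract's figures to within rounding (PPV ≈ 84.3%, NPV ≈ 92.6%), a useful sanity check on reported screening statistics.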
CORAL: Expert-Curated Oncology Reports to Advance Language Model Inference.
NEJM AI. Pub Date: 2024-04-01. Epub Date: 2024-03-13. DOI: 10.1056/aidbp2300110
Madhumita Sushil, Vanessa E Kennedy, Divneet Mandair, Brenda Y Miao, Travis Zack, Atul J Butte
Background: Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented within clinical notes. As large language models (LLMs) are being considered for use within medical workflows, it becomes important to evaluate their potential in oncology. However, no current information representation schema fully encapsulates the diversity of oncology information within clinical notes, and no comprehensively annotated oncology notes exist publicly, thereby limiting a thorough evaluation.
Methods: We curated a new fine-grained, expert-labeled dataset of 40 deidentified breast and pancreatic cancer progress notes at the University of California, San Francisco, and assessed the abilities of three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) in zero-shot extraction of detailed oncological information from two narrative sections of clinical progress notes. Model performance was quantified with BLEU-4, ROUGE-1, and exact-match (EM) F1 score metrics.
Results: Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an average EM F1 score of 0.51, and an average accuracy of 68% (expert manual evaluation on a subset). Notably, GPT-4 was proficient in tumor characteristic and medication extraction and demonstrated superior performance in the advanced reasoning tasks of inferring symptoms due to cancer and considerations of future medications. Common errors included partial responses with missing information and hallucinations with note-specific information.
Conclusions: By developing a comprehensive schema and benchmark of oncology-specific information in oncology notes, we uncovered both the strengths and the limitations of LLMs. Our evaluation showed variable zero-shot extraction capability among the GPT-3.5-turbo, GPT-4, and FLAN-UL2 models and highlighted a need for further improvements, particularly in complex medical reasoning, before performing reliable information extraction for clinical research, complex population management, and documenting quality patient care. (Funded by the National Institutes of Health, the Food and Drug Administration, and others.)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12007910/pdf/
Citations: 0
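Unlike BLEU and ROUGE, the exact-match F1 metric used above gives no partial credit: a predicted entity counts only if it matches a gold annotation verbatim. A minimal set-based version of the metric (the entity strings are toy examples, not CORAL data):

```python
def exact_match_f1(predicted, gold):
    """Set-level exact-match F1 over extracted entities: F1 of precision
    (correct predictions / all predictions) and recall
    (correct predictions / all gold annotations)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Model found two of three gold entities plus one spurious one:
score = exact_match_f1({"tamoxifen", "letrozole", "fatigue"},
                       {"tamoxifen", "letrozole", "nausea"})
```

Here precision and recall are both 2/3, so F1 ≈ 0.67; the strictness of exact matching is one reason EM F1 (0.51) trails BLEU and ROUGE in the results above.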