{"title":"Activities of Daily Living Detection through Energy Consumption Data and Machine Learning to Support Independent Aging.","authors":"Alejandro Pérez-Vereda, Jesús Fontecha, Adrián Sanchez-Miguel, Luis Cabañero, Iván González, Christopher Nugent","doi":"10.1007/s10916-025-02256-2","DOIUrl":"https://doi.org/10.1007/s10916-025-02256-2","url":null,"abstract":"<p><p>The aging population presents significant challenges for healthcare and social services, emphasizing the need for innovative solutions that support independent living. This study explores the feasibility of identifying Instrumental Activities of Daily Living (IADLs) through power consumption data collected from a smart plug-based system. Using a combination of unsupervised and supervised machine learning techniques, including K-Means clustering and Long Short-Term Memory (LSTM) networks, we developed a method to classify and predict IADLs based on energy usage patterns. The REFIT dataset was used to train and validate the models, ensuring generalizability across different households. Results demonstrate that K-Means clustering effectively groups energy consumption patterns in a reasonable time, as evaluated with the Silhouette and Davies-Bouldin (DB) indices (Silhouette score of 0.88 and Davies-Bouldin Index of 0.29), while LSTM models trained on monthly household data demonstrated high rates of activities classified over time (with an F1-score of 0.99). IADLs like cooking, cleaning, and entertainment showed the highest classification accuracy due to their distinct energy features. This approach enables non-intrusive monitoring of daily routines, offering potential applications in Ambient Assisted Living (AAL) environments. Despite limitations in detecting activities without direct energy consumption, this study highlights the potential of energy-based activity recognition for promoting independent aging. 
Future work will focus on refining abnormal behavior detection and integrating additional contextual factors to improve accuracy.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"124"},"PeriodicalIF":5.7,"publicationDate":"2025-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145212421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAN-Enhanced Hybrid Deep Learning with Explainable AI for Automated Cataract Diagnosis.","authors":"Shashank Mouli Satapathy, Mitali Gopinath Paul, Anusha Garg, Suhani Bhatnagar","doi":"10.1007/s10916-025-02249-1","DOIUrl":"https://doi.org/10.1007/s10916-025-02249-1","url":null,"abstract":"<p><p>Cataracts, among the most prevalent eye disorders, result in diminished vision due to cloudiness in the eye's natural lens. Timely diagnosis is crucial for preventing irreversible damage. While effective, existing automated systems encounter difficulties like limited dataset variety, lack of interpretability, and suboptimal generalization in real-world scenarios. This study presents a novel deep learning-based method that incorporates Generative AI (GenAI) and Explainable AI (XAI) to enhance cataract detection. The proposed methodology leverages a fine-tuned InceptionResNetV2 with additional layers, trained on a hybrid dataset enriched by merging six open-source datasets, along with synthetic images generated via Generative Adversarial Networks (GANs). Class weights address data imbalance, while stratified K-Fold cross-validation ensures robust evaluation. Our system offers graphical interpretation through Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps, supporting clinical transparency and reliability. The model evaluation reports a mean K-Fold accuracy of 97.58% with a standard deviation of 0.0040, and a 95% confidence interval (CI) of (0.9702, 0.9814). On the external dataset, the model achieved an overall accuracy of 97%, an AUC of 0.9944, and for the cataract class, a precision of 96%, recall (sensitivity) of 94%, F1-score of 95%. 
Our method, by incorporating synthetic images and explainable AI, ensures enhanced data diversity, addresses class imbalance, reduces dependency on large annotated datasets, and offers greater interpretability that facilitates expert validation and builds stronger clinical trust, making it superior to existing cataract detection systems.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"123"},"PeriodicalIF":5.7,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145206401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance of Large Language Models in Complex Anesthesia Decision-Making: A Comparative Study of Four LLMs in High-Risk Patients.","authors":"Qian Ruan, Jinghong Shi, Yunke Dai, Pingliang Yang, Na Zhu, Shun Wang","doi":"10.1007/s10916-025-02247-3","DOIUrl":"https://doi.org/10.1007/s10916-025-02247-3","url":null,"abstract":"<p><p>To evaluate and compare the performance of four Large Language Models (LLMs) in anesthesia decision-making for critically ill obstetric and geriatric patients and analyze their decision reliability across different surgical specialties. Prospective comparative analysis using standardized case evaluations. Four LLMs (ChatGPT-4o, Claude 3.5 Sonnet, DeepSeek-R1, and Grok 3). Thirty complex surgical cases (10 obstetric, 20 geriatric; 8 specialties) were analyzed. A 12-dimensional framework tested the models using unified prompts and decision points. Five trained anesthesiologists independently evaluated the models across six dimensions (patient assessment, anesthesia plan, risk management, individualization, contingency planning, decision logic; 1-10 scale, total 6-60). Overall, DeepSeek performed best (51.43 ± 2.74 points), significantly outperforming other models (P < 0.001). For obstetric cases, the mean scores were: DeepSeek (52.00 ± 1.83), Grok (49.40 ± 3.06), ChatGPT (47.60 ± 2.88), and Claude (46.60 ± 2.17). For geriatric cases, scores were: DeepSeek (51.15 ± 3.10), Grok (48.60 ± 2.33), ChatGPT (47.35 ± 2.50), and Claude (45.75 ± 2.05). Across specialties, all models performed best in hepatobiliary surgery, burn surgery, and thoracic surgery. DeepSeek demonstrated consistent performance across all dimensions, with notable advantages in decision logic (8.80 ± 0.40) and contingency planning (8.27 ± 0.45). All LLMs demonstrated strong anesthesia decision-making capabilities, with DeepSeek showing the best overall performance. 
Exploratory analysis revealed performance variations across specialties, although small sample sizes preclude definitive conclusions. Clinical implementation should consider specialty-specific factors and decision process characteristics.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"122"},"PeriodicalIF":5.7,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145199733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clinical Risk Computation by Large Language Models Using Validated Risk Scores.","authors":"Kaan Kara, Tuba Gunel","doi":"10.1007/s10916-025-02261-5","DOIUrl":"https://doi.org/10.1007/s10916-025-02261-5","url":null,"abstract":"<p><p>Recent advances in artificial intelligence have propelled Large Language Models (LLMs) in natural language understanding, enabling new healthcare applications. While LLMs can analyze health data, directly predicting patient risk scores can be unreliable due to inaccuracies, biases, and difficulty interpreting complex medical data. A more trustworthy approach uses LLMs to calculate traditional clinical risk scores: validated, evidence-based formulas widely accepted in medicine. This improves validity, transparency, and safety by relying on established scoring systems rather than LLM-generated risk assessments, while still allowing LLMs to enhance clinical workflows through clear and interpretable explanations. In this study, we evaluated three public LLMs (GPT-4o-mini, DeepSeek v3, and Google Gemini 2.5 Flash) in calculating five clinical risk scores: CHA₂DS₂-VASc, HAS-BLED, Wells Score, Charlson Comorbidity Index, and Framingham Risk Score. We created 100 patient profiles (20 per score) representing diverse clinical scenarios and converted them into natural language clinical notes. These served as prompts for the LLMs to extract information and compute risk scores. We compared LLM-generated scores to reference scores from validated formulas using accuracy, precision, recall, F1 score, and Pearson correlation. GPT-4o-mini and Gemini 2.5 Flash outperformed DeepSeek v3, showing near-perfect agreement on most scores. 
However, all models struggled with the complex Framingham Risk Score, indicating challenges for general LLMs in complex risk calculations.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"121"},"PeriodicalIF":5.7,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145191673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Pretraining Approach for Small-sample Training Employing Radiographs (PASTER): a Multimodal Transformer Trained by Chest Radiography and Free-text Reports.","authors":"Kai-Chieh Chen, Matthew Kuo, Chun-Ho Lee, Hao-Chun Liao, Dung-Jang Tsai, Shing-An Lin, Chih-Wei Hsiang, Cheng-Kuang Chang, Kai-Hsiung Ko, Yi-Chih Hsu, Wei-Chou Chang, Guo-Shu Huang, Wen-Hui Fang, Chin-Sheng Lin, Shih-Hua Lin, Yuan-Hao Chen, Yi-Jen Hung, Chien-Sung Tsai, Chin Lin","doi":"10.1007/s10916-025-02263-3","DOIUrl":"https://doi.org/10.1007/s10916-025-02263-3","url":null,"abstract":"<p><p>While deep convolutional neural networks (DCNNs) have achieved remarkable performance in chest X-ray interpretation, their success typically depends on access to large-scale, expertly annotated datasets. However, collecting such data in real-world clinical settings can be difficult because of limited labeling resources, privacy concerns, and patient variability. In this study, we applied a multimodal Transformer pretrained on free-text reports and their paired CXRs to evaluate the effectiveness of this method in settings with limited labeled data. Our dataset consisted of more than 1 million CXRs, each accompanied by reports from board-certified radiologists and 31 structured labels. The results indicated that a linear model trained on embeddings from the pretrained model achieved AUCs of 0.907 and 0.903 on internal and external test sets, respectively, using only 128 cases and 384 controls; the results were comparable to those of DenseNet trained on the entire dataset, whose AUCs were 0.908 and 0.903, respectively. Additionally, we demonstrated similar results by extending the application of this approach to a subset annotated with structured echocardiographic reports. Furthermore, this multimodal model exhibited excellent small sample learning capabilities when tested on external validation sets such as CheXpert and ChestX-ray14. 
This research significantly reduces the sample size necessary for future artificial intelligence advancements in CXR interpretation.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"120"},"PeriodicalIF":5.7,"publicationDate":"2025-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145191630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Patient Identification Accuracy in Shared Child Health Records: a Hybrid Approach for the Lao Language Context.","authors":"Thepphouthone Sorsavanh, Chang Liu, Goshiro Yamamoto, Yukiko Mori, Shinji Kobayashi, Tomohiro Kuroda","doi":"10.1007/s10916-025-02260-6","DOIUrl":"10.1007/s10916-025-02260-6","url":null,"abstract":"<p><p>The Shared Child Health Record (SCHR) project in Lao People's Democratic Republic (PDR) aims to enhance pediatric health care services and health outcomes by enabling data exchange between health care systems. However, patient identification is hindered by non-Latin script complexities, including phonetic variations, a tonal alphabet, and temporary naming practices (e.g., placeholder names such as \"Eanoi\"), which leads to persistent record duplication. Existing patient-matching algorithms designed for Latin scripts underperform in this context. We assessed deterministic, probabilistic, and hybrid matching approaches using a Lao SCHR dataset of 20,433 records. A manual gold standard review (3,191 matches) validated their performance. Probabilistic matching employed the Fellegi-Sunter model with Jaro‒Winkler similarity, whereas the hybrid method combined deterministic rules (exact name/DOB matches) and probabilistic adjustments for unresolved cases. The hybrid and probabilistic methods consistently outperformed deterministic matching, achieving a 90% recall rate on the SCHR dataset. Despite its lower performance in Lao health records, the hybrid method resolved approximately 2,872 duplicates in SCHR. Challenges included twin records (shared identifiers) and temporary-to-permanent name transitions. This study is the first to adapt patient-matching methodologies for Lao's linguistic and infrastructural context. While hybrid methods show promise, performance gaps persist compared with those of Latin-based systems. 
These findings have significant implications with respect to improving the accuracy and efficiency of HIE systems in Lao PDR and other resource-limited settings. Clinical trial number: Not applicable.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"119"},"PeriodicalIF":5.7,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12474658/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145149405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Resectability Classification of Pancreatic Cancer CT Reports with Privacy-Preserving Open-Weight Large Language Models: A Multicenter Study.","authors":"Jeong Hyun Lee, Ji Hye Min, Kyowon Gu, Seungchul Han, Jeong Ah Hwang, Seo-Youn Choi, Kyoung Doo Song, Jeong Eun Lee, Jisun Lee, Ji Eun Moon, Hasmik Adetyan, Ju Dong Yang","doi":"10.1007/s10916-025-02248-2","DOIUrl":"10.1007/s10916-025-02248-2","url":null,"abstract":"<p><strong>Purpose: </strong> To evaluate the effectiveness of open-weight large language models (LLMs) in extracting key radiological features and determining National Comprehensive Cancer Network (NCCN) resectability status from free-text radiology reports for pancreatic ductal adenocarcinoma (PDAC). Methods. Prompts were developed using 30 fictitious reports, internally validated on 100 additional fictitious reports, and tested using 200 real reports from two institutions (January 2022 to December 2023). Two radiologists established ground truth for 18 key features and resectability status. Gemma-2-27b-it and Llama-3-70b-instruct models were evaluated using recall, precision, F1-score, extraction accuracy, and overall resectability accuracy. Statistical analyses included McNemar's test and mixed-effects logistic regression. Results. In internal validation, Llama had significantly higher recall than Gemma (99% vs. 95%, p < 0.01) and slightly higher extraction accuracy (98% vs. 97%). Llama also demonstrated higher overall resectability accuracy (93% vs. 91%). In the internal test set, both models achieved 96% recall and 96% extraction accuracy. Overall resectability accuracy was 95% for Llama and 93% for Gemma. In the external test set, both models had 93% recall. Extraction accuracy was 93% for Llama and 95% for Gemma. Gemma achieved higher overall resectability accuracy (89% vs. 83%), but the difference was not statistically significant (p > 0.05). Conclusion. 
Open-weight models accurately extracted key radiological features and determined NCCN resectability status from free-text PDAC reports. While internal dataset performance was robust, performance on external data decreased, highlighting the need for institution-specific optimization.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"118"},"PeriodicalIF":5.7,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145131065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Pediatric Anesthesia Sustainability Metrics into Native Electronic Health Records: A Clinical Informatics Approach.","authors":"Mandy Lam, Ashley Wu, Karna Patel, Elaine Ng, Eric Greenwood, Clyde Matava","doi":"10.1007/s10916-025-02259-z","DOIUrl":"https://doi.org/10.1007/s10916-025-02259-z","url":null,"abstract":"<p><p>The growing emphasis on sustainability in healthcare has highlighted anesthetic gases as notable contributors to the sector's greenhouse gas emissions. While adult anesthesia practices have increasingly adopted mitigation strategies, such as using lower fresh gas flows and total intravenous anesthesia, pediatric anesthesia poses distinct challenges due to the unique physiological and pharmacological requirements of neonates, infants, and children. The use of third-party applications for accessing anesthesia medical record data is costly. This technical report describes the development and implementation of pediatric-specific anesthesia sustainability metrics and the integration of these metrics into native electronic health record systems for real-time data capture and feedback. Using a nominal consensus group process, 24 pediatric-focused metrics were identified across key perioperative phases. Subsequent integration into Epic's Anesthesia module facilitated automated data collection and the creation of interactive dashboards, which offer both department-wide and individualized provider feedback. 
Our report describes the feasibility of designing novel pediatric-specific sustainability metrics that can be used within the electronic medical record to benchmark environmental goals in pediatric anesthesia practice.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"117"},"PeriodicalIF":5.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145113341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Social Network Analysis of Secure Text Messaging Metadata During Clinical Deterioration in an Inpatient Children's Hospital Setting.","authors":"Andrew Harold Smith, Brant Tudor, Vishnu Mohan, Mohamed A Rehman, Luis Ahumada","doi":"10.1007/s10916-025-02250-8","DOIUrl":"10.1007/s10916-025-02250-8","url":null,"abstract":"<p><p>Mitigating clinical deterioration relies upon recognition (afferent limb) and interventions (efferent limb) by a healthcare team. Healthcare provider (HP) communication by text messaging plays a role in facilitating both limbs in the inpatient setting. We sought to quantitatively characterize healthcare provider team communications through the social network analysis (SNA) of secure text messages exchanged in the inpatient setting, and as they relate to a subgroup of patients demonstrating a deterioration during their hospitalization. Messages linked to inpatients exchanged between HPs over a 12-month period, including a cohort of messages linked to patients experiencing deterioration, were analyzed using SNA. Subnetworks corresponding to individual patient encounters were constructed, including a series of subnetworks pertaining to patients with an impending clinical deterioration. Network and network participant characteristics were calculated and analyzed. From October 2022 through September 2023 there were 1,065,225 messages delivered by 3,272 HPs, associated with 4,328 inpatient hospital encounters, of which 120 hospital encounters were associated with a deterioration. SNA demonstrated significantly higher measures of eigenvector centrality among frontline providers (FLP) including advanced practice providers and housestaff, relative to attending physicians (p < 0.001) and registered nurses (p < 0.001), consistent with greater influence of the FLP on information dissemination through the entire network. 
Within individual subnetworks associated with the care of patients experiencing a clinical deterioration, FLP participants demonstrated greater overall network influence (p = 0.032) relative to FLP counterparts in networks not associated with a deterioration, despite comparable numbers of participants and connections. Using SNA, we quantitatively characterized a text messaging network within an inpatient hospital setting, demonstrating the importance of FLPs on information dissemination, a finding demonstrated specifically within subnetworks dedicated to the care of individual deteriorating patients. Understanding characteristics of a dynamic communication network of healthcare providers may prove a valuable target in facilitating communication and in mitigating the risks of deterioration. IRB Approval: Johns Hopkins Medicine IRB (#CIR00419339). Clinical trial number: Not applicable.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"116"},"PeriodicalIF":5.7,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449403/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145086339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Language Models in Neurology Treatment Decision-Making: a Scoping Review.","authors":"Rushabh Shah, Fabrice Jotterand","doi":"10.1007/s10916-025-02254-4","DOIUrl":"10.1007/s10916-025-02254-4","url":null,"abstract":"<p><p>This scoping review evaluates the expanding role of large language models (LLMs) in neurology, an area drawing growing interest of researchers and clinicians alike. A substantial existing body of literature supports the efficacy of LLMs for diagnostic applications. However, clinicians' emerging point of interest now lies in understanding the applications of LLMs in guiding treatment decisions. Our study therefore aims to synthesize and evaluate existing neurological studies focused on LLMs in treatment decision-making. A comprehensive search was conducted in the electronic databases OVID/Medline, Web of Science, and the Cochrane Library through September 18th, 2024. Inclusion criteria included original studies published within the last five years focused on evaluating the efficacy of LLMs in treatment decision-making in neurology. The protocol was registered on the Open Science Framework ( https://doi.org/10.17605/OSF.IO/Y6N3E ). Four studies were identified. ChatGPT was the LLM utilized in each article, though varying in model versions. Each study demonstrated positive outcomes across varying metrics, with models generally aligning with clinician decisions. However, the lack of observed studies and variability of neurological topics limit the generalizability of these AI tools. This scoping review analyzes the existing body of evidence on LLMs in treatment decision-making in neurology. While current studies suggest potential to support clinical care, there is insufficient evidence at this stage to claim outcome improvement. Findings are not yet generalizable across neurological practice, as existing promise appears limited to narrow use cases. 
Prospective validation across subspecialties is needed to support broader clinical application.</p>","PeriodicalId":16338,"journal":{"name":"Journal of Medical Systems","volume":"49 1","pages":"115"},"PeriodicalIF":5.7,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145069653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}