{"title":"Integrating Confidence, Difficulty, and Language Model Calibration for Better Explainability in Clinical Documents Coding: Applications of AI.","authors":"Mihai Horia Popescu, Kevin Roitero, Vincenzo Della Mea","doi":"10.2196/78764","DOIUrl":"https://doi.org/10.2196/78764","url":null,"abstract":"<p><strong>Background: </strong>In recent years, there has been increasing interest in developing machine and deep learning models capable of annotating clinical documents with semantically relevant labels. However, the complex nature of these models often leads to significant challenges regarding interpretability and transparency.</p><p><strong>Objective: </strong>This study aims to improve the interpretability of transformer models and evaluate the explainability of a deep learning-based annotation of coded clinical documents derived from death certificates. Specifically, the focus is on interpreting and explaining model behavior and predictions by leveraging calibrated confidence, saliency maps, and measures of instance difficulty applied to textualized representations coded using the International Statistical Classification of Diseases and Related Health Problems (ICD). In particular, the instance difficulty approach has previously proven effective in interpreting image-based models.</p><p><strong>Methods: </strong>We used disease language bidirectional encoder representations from transformers, a domain-specific bidirectional encoder representations from transformers model pretrained on ICD classification-related data, to analyze reverse-coded representations of death certificates from the US National Center for Health Statistics, covering the years 2014 to 2017 and comprising 12,919,268 records. The model inputs consist of textualized representations of ICD-coded fields derived from death certificates, obtained by mapping codes to the corresponding ICD concept titles. 
For this study, we extracted a subset of 400,000 certificates for training, 100,000 for testing, and 10,000 for validation. We assessed the model's calibration and applied a temperature scaling post-hoc calibration method to improve the reliability of its confidence scores. Additionally, we introduced mechanisms to rank instances by difficulty using Variance of Gradients scores, which also facilitate the detection of out-of-distribution cases. Saliency maps were also used to enhance interpretability by highlighting which tokens in the input text most influenced the model's predictions.</p><p><strong>Results: </strong>Experimental results on a pre-fine-tuned model for predicting the underlying cause of death from reverse-coded death certificate representations, which already achieves high accuracy (0.990), show good out-of-the-box calibration with respect to expected calibration error (1.40), though less so for maximum calibration error (30.91). Temperature scaling further reduces expected calibration error (1.13) while significantly increasing maximum calibration error (42.17). 
We report detailed Variance of Gradients analyses at the ICD category and chapter levels, including distributions of target and input categories, and provide word-level attributions using Integrated Gradients for both correctly classified and failure cases.</p><p><strong>Conclusions: </strong>This study","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e78764"},"PeriodicalIF":2.0,"publicationDate":"2026-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13113207/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
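The calibration pipeline described in this record — fit a single temperature on validation data, then score reliability with expected calibration error (ECE) — can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are mine, and a simple grid search over temperatures stands in for the gradient-based NLL minimization usually used for temperature scaling.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: T > 1 flattens confidences, T < 1 sharpens them."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: weighted mean gap between accuracy and mean confidence across bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature minimizing negative log-likelihood on held-out data."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

Because temperature scaling rescales all logits by one scalar, it can shrink the average (expected) calibration error while leaving the worst bin — and hence maximum calibration error — larger, consistent with the trade-off the abstract reports.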
{"title":"Acceptance and Readiness for AI Among United Arab Emirates-Based Health Care Practitioners: Exploratory Cross-Sectional Survey.","authors":"Ghufran Alsalloum, Yara Badr, Ayman Alzaatreh, Abdulrahim Shamayleh, Muhammad Kumail, Nour Aymn Ahmad, Yacine Hadjiat","doi":"10.2196/80173","DOIUrl":"10.2196/80173","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) can enhance diagnostic accuracy, efficiency, and decision-making in health care, but real-world impact depends on practitioners' acceptance and readiness to use AI in clinical workflows. The United Arab Emirates offers a policy-driven context to study these factors, given active national AI strategies and rapid health system digitization.</p><p><strong>Objective: </strong>This study aimed to develop and validate a model explaining how trust, perceptions, perceived risk, and perceived benefit shape practitioners' acceptance of AI and, in turn, their readiness to implement AI in clinical practice. The model integrates the Technology Acceptance Model, the Unified Theory of Acceptance and Use of Technology, and the Theory of Trust and Acceptance of Artificial Intelligence Technology.</p><p><strong>Methods: </strong>We conducted a cross-sectional online survey of 182 United Arab Emirates-based health care practitioners (physicians, nurses, dentists, and allied health staff). Constructs included trust, perception, perceived risk, perceived benefit, acceptance, and readiness. Knowledge of AI was also assessed using true or false statements. We performed confirmatory factor analysis and structural equation modeling, reporting standard fit indices. 
The survey adhered to the Checklist for Reporting Results of Internet E-Surveys guidelines, and ethics approval and electronic consent were obtained.</p><p><strong>Results: </strong>Trust was positively associated with perception (β=.704; P<.001) and perceived benefit (β=.191; P=.02) and negatively associated with perceived risk (β=-.301; P<.001). Acceptance was positively associated with trust (β=.452; P<.001), perception (β=.459; P<.001), and perceived benefit (β=.168; P=.002), and negatively associated with perceived risk (β=-.140; P=.009). Acceptance strongly predicted readiness (β=.874; P<.001). The model fit indices are standardized root-mean-square residual of 0.068, root-mean-square error of approximation of 0.0913, goodness-of-fit index of 0.802, adjusted goodness-of-fit index of 0.763, and comparative fit index of 0.906. Our knowledge assessment found notable gaps among participants, underscoring a need for education and training. Our study sample was predominantly drawn from Dubai-based health care settings (103/182, 57%) and nursing roles (71/182, 39%); therefore, these findings primarily reflect the Dubai health regulatory environment and nursing workflows and may not generalize to the broader federal health care system across all Emirates.</p><p><strong>Conclusions: </strong>Trust is a central lever for advancing AI acceptance and implementation readiness among the study cohort of United Arab Emirates-based health care practitioners. 
Implementation programs should prioritize building institutional and technical trust (transparency, safety, and governance), reducing perceived risk (privacy, security, and reliability), and amplifying perce","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80173"},"PeriodicalIF":2.0,"publicationDate":"2026-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13097271/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147730854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
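For readers unfamiliar with the structural-equation-model fit indices reported above, RMSEA and CFI are simple functions of the model and baseline chi-square statistics. The sketch below uses the standard textbook formulas; the numbers in the test are illustrative, not taken from the study.

```python
import math

def rmsea(chi2, df, n):
    """Root-mean-square error of approximation from the model chi-square,
    degrees of freedom, and sample size (clamped at 0 when chi2 < df)."""
    return math.sqrt(max((chi2 / df - 1) / (n - 1), 0.0))

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: model misfit relative to the baseline (null) model."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 - num / den if den > 0 else 1.0
```

Conventional cutoffs treat RMSEA below about 0.08 and CFI above about 0.90 as acceptable fit, which is the frame in which the reported values (RMSEA 0.0913, CFI 0.906) are usually read.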
{"title":"Primary Health Conditions Among Medical Crowdfunding Campaigns in the United States: Natural Language Processing Study.","authors":"Shaojun Yu, Shu Liu, K Robin Yabroff, Farhad Islami, Fumiko Chino, Jing Zhang, Zhiyuan Zheng","doi":"10.2196/83413","DOIUrl":"https://doi.org/10.2196/83413","url":null,"abstract":"","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e83413"},"PeriodicalIF":2.0,"publicationDate":"2026-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13135153/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147824367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.","authors":"Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li","doi":"10.2196/77988","DOIUrl":"10.2196/77988","url":null,"abstract":"<p><strong>Background: </strong>Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.</p><p><strong>Objective: </strong>This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air (general language model), through advanced prompt engineering and supervised fine-tuning (SFT).</p><p><strong>Methods: </strong>We constructed a curated dataset of 492 deidentified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using low-rank adaptation for the reasoning-intensive primary tumor characteristics (T) and regional lymph node involvement (N) staging tasks. 
The final hybrid model was evaluated on a completely held-out test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.</p><p><strong>Results: </strong>The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the black-box test set: 92% (95% CI 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for distant metastasis status (M), and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed 0 category I errors in M staging and fewer category I errors in T and N staging. 
Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (eg, 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.</p><p><strong>Conclusions: </strong>The proposed hybrid framework, integrating ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e77988"},"PeriodicalIF":2.0,"publicationDate":"2026-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13082344/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147693966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
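The macro-average F1-scores this record reports are the unweighted mean of per-class F1 computed from a confusion matrix. A minimal sketch — illustrative only, not the authors' evaluation code:

```python
import numpy as np

def macro_f1(cm):
    """Macro-averaged F1 from a confusion matrix (rows = true, cols = predicted).
    Classes with zero predictions or zero support contribute F1 = 0."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    col, row = cm.sum(axis=0), cm.sum(axis=1)
    precision = np.divide(tp, col, out=np.zeros_like(tp), where=col > 0)
    recall = np.divide(tp, row, out=np.zeros_like(tp), where=row > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()
```

Unlike accuracy, the macro average weights every stage category equally, which is why it is the more informative comparison for imbalanced T/N/M label distributions.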
{"title":"Implementation and User Evaluation of an On-Premise Large Language Model in a German University Hospital Setting: Cross-Sectional Survey.","authors":"Aliće Grünig, Jenifer Kriebel, Julian Varghese, Tim Herrmann, Sarah Sandmann, Christian Bruns","doi":"10.2196/84362","DOIUrl":"10.2196/84362","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) are increasingly used by employees at university hospitals for information retrieval or decision support. Self-hosted on-premise systems provide a secure environment and conform to data privacy and security regulations for handling sensitive personal data. Automation of standard procedures using an LLM application can substantially reduce time-consuming administrative tasks and facilitate the analysis of large datasets.</p><p><strong>Objective: </strong>The objective of our study was to gather feedback from registered artificial intelligence (AI) users on the usability and common use cases of the on-premise LLM infrastructure we established at the University Medicine Magdeburg to optimize the models to the needs of our facility.</p><p><strong>Methods: </strong>We developed an online questionnaire to which registered AI users were given access and were informed via email.</p><p><strong>Results: </strong>Of 322 registered AI users, 98 (30.4%) participated in the user survey. After filtering incomplete responses, results from 91 (28.3%) participants remained for further analysis. Speed and quality received overall high approval rates. Most of the users (n=57, 62.6%) used the platform at least once per week, and 44% (n=40) of the users reported saving at least 30 minutes of work per week by using our AI platform. 
A diverse set of use cases was observed, varying by profession; for example, health care and research professionals used the AI platform more frequently for creation and analysis tasks than administrative staff.</p><p><strong>Conclusions: </strong>Our data indicate that the implementation of a self-hosted on-premise LLM was associated with positive perceptions among a diverse group of professionals working at a university hospital, saving time and meeting their individual needs.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84362"},"PeriodicalIF":2.0,"publicationDate":"2026-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13082445/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147693989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Machine Learning to Improve Screening for Oropharyngeal Dysphagia in Hospitalized Versus Primary Care Adult Patients With COVID-19 Disease: Retrospective Observational Study.","authors":"Cristina Amadó Ruiz, Alberto Martín, Jaume Miró Ramos, Francisco Javier Ruz Torres, Antonio Ruiz, Adil El Haji, Pere Clavé, Omar Ortega","doi":"10.2196/81028","DOIUrl":"10.2196/81028","url":null,"abstract":"<p><strong>Background: </strong>Oropharyngeal dysphagia (OD) commonly occurs in patients with COVID-19 disease, posing diagnostic challenges due to isolation protocols.</p><p><strong>Objective: </strong>This study aimed to evaluate Artificial Intelligence Massive Screening for Oropharyngeal Dysphagia (AIMS-OD), a machine learning software for real-time OD screening, comparing OD prevalence and clinical outcomes using OD ICD-10 (International Statistical Classification of Diseases, Tenth Revision) R13 codes (R13-OD) and high-risk AIMS-OD (H-AIMS-OD) scores (>0.5), in hospital and primary care patients with COVID-19 disease. It explored clinical characteristics, OD risk factors, and clinical outcomes.</p><p><strong>Methods: </strong>This retrospective, observational study analyzed patients with SARS-CoV-2 aged 18 years and older in Catalonia from January 1 to August 31, 2020, including hospital and primary care data on clinical information, International Classification of Diseases, Tenth Revision (ICD-10) codes, hospital stay, discharge destination, and mortality. AIMS-OD assessed OD risk, stratifying patients by age (aged 18-69 years and 70 years and older).</p><p><strong>Results: </strong>Among 257,541 patients with COVID-19 disease, 59.3% (152,721/257,541) were aged 18-69 years and 40.7% (104,820/257,541) were aged 70 years and older. Hospital and primary care R13-OD prevalence was 3.5% and 4.3%, respectively; AIMS-OD showed 34.8% and 15.4%, with true prevalence at 16.7% and 7.4%. 
Patients aged 70 years and older had worse clinical outcomes and worse prognosis. Patients in R13-OD experienced significantly worse clinical outcomes than patients with H-AIMS-OD, who in turn fared worse than those with no R13-OD and with low AIMS-OD risk. Risk factors for patients with COVID-19 R13-OD included age, neuroleptic use, stroke, dementia, and delirium.</p><p><strong>Conclusions: </strong>AIMS-OD screening revealed high prevalence and significant underdiagnosis in patients with COVID-19 disease across settings. Early detection and risk stratification using AIMS-OD could improve clinical decision-making, diagnosis, and management, particularly in older patients with comorbidities.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81028"},"PeriodicalIF":2.0,"publicationDate":"2026-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13075777/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147679170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Immersive, Interactive, Intelligent Patient Educational System for Venous Thromboembolism (ChatVTE): Development and Validation Study.","authors":"Bin Bin Liu, Zhi Geng Jin, Zhe Qi Zhang, Hong Wang, Hao Wang, Hui Zhang, Chang Zhen Li, Fei Qi, Yu Tao Guo","doi":"10.2196/82775","DOIUrl":"10.2196/82775","url":null,"abstract":"<p><strong>Background: </strong>Effective patient education is crucial in preventing venous thromboembolism (VTE), improving patient outcomes, and reducing health care costs. However, traditional educational methods often lack engagement and fail to address individual patient needs comprehensively.</p><p><strong>Objective: </strong>This study aimed to develop and preliminarily validate an immersive, large language model-based patient education system for VTE designed to promote patient engagement and care adherence by delivering highly relevant, actionable, and patient-centered information.</p><p><strong>Methods: </strong>We developed ChatVTE, an interactive, intelligent patient education platform, by integrating a retrieval-augmented large language model (Qwen1.5-7B) with text-to-speech and lip-synch technologies. The system's performance was initially assessed through a comparative evaluation against ChatGPT. This involved using a standardized set of VTE-related questions, administered from December 10 to 31, 2024, with responses rigorously evaluated by 4 VTE domain experts using a 5-point Likert scale for accuracy, completeness, consistency, and safety. Subsequently, we consecutively enrolled a prospective cohort of 25 adult inpatients with VTE from the Departments of Pulmonary Vascular and Thrombotic Diseases and General Surgery at the Sixth Medical Center of the Chinese People's Liberation Army General Hospital between March 1 and May 31, 2025. 
These participants engaged with the ChatVTE system throughout their inpatient stay and completed postintervention assessments upon discharge.</p><p><strong>Results: </strong>Expert evaluation demonstrated that ChatVTE significantly outperformed ChatGPT in accuracy, completeness, consistency (all P<.001, r>0.5), and safety (P=.01, r=0.327). Among the 25 enrolled patients (age: mean 55.4, SD 13.2 years), ChatVTE achieved high average scores (mean score >4.0/5.0) in 8 of the 9 experience dimensions evaluated but received a notably lower score in the emotional support domain (1.92/5.0).</p><p><strong>Conclusions: </strong>This study validates the feasibility of ChatVTE in the management of patients with VTE, demonstrating its potential to enhance the quality of patient-health care provider interaction and the efficacy of knowledge dissemination. These preliminary findings suggest that ChatVTE could be a valuable tool for improving patient education and facilitating shared clinical decision-making.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e82775"},"PeriodicalIF":2.0,"publicationDate":"2026-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13052474/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147629404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
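ChatVTE's retrieval-augmented design can be illustrated with a minimal retrieval step: embed the query and candidate documents, rank documents by cosine similarity, and prepend the top hits to the LLM prompt. The toy bag-of-words embedding below is a stand-in for the neural sentence encoder a production system like ChatVTE would use; all names and documents are hypothetical.

```python
import math

def embed(text, vocab):
    """Toy bag-of-words vector, L2-normalized; a real RAG system would use
    a neural sentence encoder instead."""
    words = text.lower().split()
    counts = [words.count(w) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts

def retrieve(query, docs, vocab, k=2):
    """Rank documents by cosine similarity to the query and return the top k,
    which would then be inserted into the LLM prompt as grounding context."""
    q = embed(query, vocab)
    scores = [sum(a * b for a, b in zip(q, embed(d, vocab))) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:k]]
```

Grounding answers in retrieved, curated VTE material is what lets a system like this score well on accuracy and safety relative to an unconstrained chatbot.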
{"title":"AI-Assisted Rapid Quality Analysis in Implementation Science: Methodological Study.","authors":"Adeola Adegbemijo, Anna M Maw, Katy E Trinkley, Amoolya A Varghese, Stephanie Tulk Jesso","doi":"10.2196/81149","DOIUrl":"10.2196/81149","url":null,"abstract":"<p><strong>Background: </strong>Translating evidence-based therapies from \"bench to bedside\" remains challenging, and implementation science (IS) experts are crucial for this process. Qualitative analyses are essential, but require extensive time and cost for manual coding. Now, many turn to artificial intelligence (AI) to accelerate the pace of qualitative analysis, but significant questions remain about the quality, validity, and ethics of applying large language models like ChatGPT (OpenAI) to qualitative data. To this end, we have developed a method for AI-assisted rapid qualitative analysis that addresses these concerns.</p><p><strong>Objective: </strong>This study aimed to develop AI-assisted rapid qualitative analysis for implementation science as an open-source encoder-based small language model (SLM) to aid IS experts. We focus on 2 efficient and high-performing SLMs: distilled bidirectional encoder representations from transformers (DistilBERT) and efficiently learning an encoder that classifies token replacements accurately (ELECTRA). The objective is to assess these models' accuracy in reproducing expert coding, their generalizability to new coding scenarios, and enhancing their accessibility for nontechnical experts through user-friendly tools.</p><p><strong>Methods: </strong>Two previously coded IS datasets were used to train DistilBERT and ELECTRA models. These datasets were coded by IS experts using a mixed deductive and inductive approach, with initial categories derived from the domains of an IS framework: Practical, Robust Implementation, and Sustainability Model. 
We fine-tuned and evaluated DistilBERT and ELECTRA on these datasets, measuring performance by area under the precision-recall curve and Cohen κ. To facilitate use by nonprogrammers, we then developed an open-source Python package (pytranscripts) to streamline transcript processing, model classification, and evaluation. Additionally, a companion Streamlit web application allows users to upload interview transcripts and obtain automated coding and analytics without any coding expertise.</p><p><strong>Results: </strong>Our findings demonstrate the success of leveraging SLMs to significantly accelerate qualitative analysis while maintaining high levels of accuracy and agreement with human annotators, although results are not universal and depend on how researchers approach qualitative coding. On the original dataset, DistilBERT achieved near-perfect agreement with human coders (Cohen κ=0.95), while ELECTRA showed substantial agreement (Cohen κ=0.71). However, both models' performance declined on the second, more ambiguous dataset, with DistilBERT's Cohen κ dropping to 0.48 and ELECTRA's to 0.39. 
Two primary drivers of performance drop appear to be related to the number of codes applied to the dataset, and whether coders apply multiple codes to each piece of data or constrain themselves to applying one.</p><p><strong>Conclusions: </strong>This work demonstrates that SLMs ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e81149"},"PeriodicalIF":2.0,"publicationDate":"2026-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13096769/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147629420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
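Cohen κ, the agreement statistic used throughout this record, corrects raw agreement between a model and human coders for the agreement expected by chance. A minimal sketch — illustrative only, not the pytranscripts implementation:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two label sequences of equal length:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n  # observed agreement
    ca, cb = Counter(rater_a), Counter(rater_b)
    # chance agreement if each rater assigned labels independently at marginal rates
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

On the common Landis-Koch scale, κ above 0.8 is "almost perfect" and 0.6-0.8 "substantial," which is how the reported values of 0.95 and 0.71 versus 0.48 and 0.39 are typically interpreted.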
{"title":"Exploring the Ethical and Practical Considerations of Artificial Intelligence in Real-World Health Care Settings: Stakeholder Focus Group Study.","authors":"Carmen Wendy Ulizio, Devika Dua, Naya Meenkashi Mukul, Santosh Areti, Kristin Kostick-Quenet, Vasiliki Nataly Rahimzadeh","doi":"10.2196/85163","DOIUrl":"10.2196/85163","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) technologies continue to transform how we research human disease, diagnose and treat patients, and operate hospitals. However, emerging ethical dilemmas surrounding their design, use, and oversight demand both policy attention and empirical research.</p><p><strong>Objective: </strong>This study aims to explore current AI development, integration, and use activities across the Texas Medical Center (TMC), the largest medical center in the world, and identify emerging ethical priorities.</p><p><strong>Methods: </strong>We conducted a total of 3 qualitative focus groups via Zoom (Zoom Video Communications, Inc) between May and June 2025 to gauge the perspectives of 19 clinicians, developers, administrators, and patient advocates on core aspects of clinical AI tools at the point of care.</p><p><strong>Results: </strong>Participants described current development and deployment of AI tools across the TMC, with areas of high potential focused on extending clinical expertise, reducing administrative burden, and improving cross-specialty collaboration. However, they also identified many challenges, including significant barriers to accessing quality datasets for training, insufficient systematic governance on the validation, auditing, and use of AI tools in the clinic, and limited patient involvement in AI development decisions. Discussion on validation of models occurring primarily in well-resourced locations like the TMC raised worries about a potential digital divide in health care. 
These concerns were heightened for practitioners working in safety-net hospitals and in other underresourced health care settings. Participants also highlighted that discussions on AI ethics at the development stage are currently lacking and suggested embedding bioethicists into development teams to account for this issue. Clinicians and patient advocates differed in their views on patient notification about the use of AI at the point of care, justifying future research on this question. Accountability also remained an unresolved issue, with participants split on whether the provider should take full responsibility for any patient care errors resulting from AI.</p><p><strong>Conclusions: </strong>These contributions identify the ethical tensions currently occurring in the real-world daily lives of professionals involved with health AI within a large regional academic medical center. Addressing these challenges will require AI-specific governance that ensures contextual validation, easy access to data, independent auditing, meaningful stakeholder involvement, and support and education for frontline users who must integrate these tools into their daily practice.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e85163"},"PeriodicalIF":2.0,"publicationDate":"2026-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13087557/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147610891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training an AI Chatbot to Manage Health in Underserved Populations: Methodological Approach.","authors":"Allison Diane Ihle, Breann Wicks, Vangelis Metsis, Autumn Starfall, Fleur Clapham, Aleksei Gorbachev, Sean Shanley, Christina Strauser, Jacqueline M McGrath","doi":"10.2196/84145","DOIUrl":"10.2196/84145","url":null,"abstract":"<p><strong>Background: </strong>Health disparities such as morbidity and mortality among childbearing women remain high in the United States, especially among those with risks associated with criminal legal system involvement. These underserved women are often managed through community supervision such as probation. They have many needs and could benefit from easily accessible mobile health (mHealth) apps that specifically target their health and safety using artificial intelligence (AI).</p><p><strong>Objective: </strong>The purpose of this methodological case study is to provide our detailed strategies and findings for systematically designing, optimizing, and testing an AI chatbot.</p><p><strong>Methods: </strong>This methodological case study used an mHealth app's AI chatbot, JUN, that involved preliminary studies and development efforts to support childbearing women on community supervision. We applied the Information Systems Research framework to guide the steps on how we designed, tailored, configured, and tested the chatbot using a retrieval-augmented generation framework. We demonstrated the feasibility of using an in-context learning approach addressing relevance, design, and rigor cycles.</p><p><strong>Results: </strong>During both crisis and noncrisis situations, the JUN chatbot had an overall performance of 89% accuracy (N=178) in detecting a \"crisis.\" Qualitative findings displayed increased usability of JUN to manage health at night by participants. The findings also demonstrated that the role of caregiving or current pregnancy was a motivating factor to manage health using technology such as the JUN app. 
Collectively, the sample expressed that barriers to managing their health effectively were associated with limited transportation, time off work, and insurance coverage. Participants in the community supervision group also described that stress related to criminal legal system involvement placed limitations on how they managed their health and well-being. Altogether, participants from both groups discussed how an anonymous chat feature and app store accessibility would enhance the usability and acceptability of JUN among users. Pregnant women used the app to manage feelings of fatigue, shortness of breath, food cravings, anxiety, confidence, determination, frustration, excitement, happiness, hopefulness, irritation, love, as well as acknowledgment of their own feelings. Pregnant participants on community supervision had more housing (P=.05) and food (P=.01) insecurity, worry about electricity being turned off (P=.04), and needing resources (P=.01) compared to pregnant women without community supervision.</p><p><strong>Conclusions: </strong>We illustrate the methodological case study to design, optimize, and test an AI chatbot within an mHealth app to provide health and safety-related support for childbearing women on community supervision. This methodological case study poses possibilities for further development and testing of inter","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84145"},"PeriodicalIF":2.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13085989/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147596762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}