{"title":"Evaluating the Potential Impact of AI on Urinary Tract Infection Diagnosis in the Emergency Department Across Demographic Groups: Retrospective Cohort Study.","authors":"Mark Iscoe, Huan Li, Haipeng Xue, Vimig Socrates, Aidan Gilson, Thomas Huang, Richard Andrew Taylor","doi":"10.2196/91148","DOIUrl":"10.2196/91148","url":null,"abstract":"<p><strong>Background: </strong>Urinary tract infection (UTI) is a common emergency department (ED) presentation but can be challenging to diagnose; both overdiagnosis and underdiagnosis are common, and older adults may be at particular risk of misdiagnosis. Artificial intelligence (AI) shows promise in augmenting diagnosis, but performance across patient populations remains underexamined.</p><p><strong>Objective: </strong>We developed an AI model that combined urine culture positivity prediction and natural language processing (NLP) to predict UTI diagnosis using only information available at the time of a patient's ED visit. We then evaluated the model's performance relative to that of physicians in diagnosing UTI across intersectional patient groups.</p><p><strong>Methods: </strong>We conducted a single-center, multisite retrospective analysis of nonpregnant adult ED patients who had a urinalysis and urine culture test performed during their ED visit at 9 EDs in a single US health system from June 2013 to August 2021. Intersectional groups were defined by binned age (18-44, 45-64, 65-84, and ≥85 years), sex, race, and ethnicity. An Extreme Gradient Boosting classifier model was developed to predict culture positivity (≥10,000 colony-forming units per milliliter) from urinalysis data using 5-fold cross-validation and an 80%-20% train-test split. UTI signs and symptoms were identified using a previously described NLP model. UTI was defined as a positive urine culture and at least 1 UTI sign or symptom identified through NLP.
Model performance was evaluated using the area under the receiver operating characteristic curve and rates of overdiagnosis (proportion of patients without UTI mistakenly diagnosed with UTI) and underdiagnosis (proportion of patients with UTI who were not diagnosed). Model over- and underdiagnosis rates were compared to those of physicians, with physician diagnosis inferred from a composite proxy outcome of either explicit UTI diagnosis or prescription of a relevant antibiotic in the absence of an alternative infectious disease diagnosis. Cross-group performance variance was assessed through the coefficient of variation (CV) for accuracy and diagnostic odds ratio (DOR).</p><p><strong>Results: </strong>Of 149,449 included encounters, 22,521 (15.1%) had positive cultures and 20,080 (13.4%) met the definition of UTI. Model area under the receiver operating characteristic curve was 0.93 (95% CI 0.93-0.93). At a diagnostic threshold of 28%, the model had lower rates of overdiagnosis and underdiagnosis than physicians for each intersectional group. The model's cross-group CV was 0.039 (95% CI 0-0.36) for accuracy and 0.48 (95% CI 0.14-0.81) for DOR.
Physicians' CV was 0.080 (95% CI 0-0.40) for accuracy and 0.33 (95% CI 0.004-0.66) for DOR.</p><p><strong>Conclusions: </strong>In this proof-of-concept study, an AI model had lower overdiagnosis and underdiagnosis rates than a proxy for physician diagnosis across intersectional groups.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e91148"},"PeriodicalIF":2.0,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13148603/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147847255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
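The modeling setup described above (a gradient-boosted classifier of culture positivity with 5-fold cross-validation, an 80%-20% train-test split, and a 28% diagnostic threshold) can be sketched as follows. This is an illustrative reconstruction on synthetic data: the study used an Extreme Gradient Boosting (XGBoost) classifier on urinalysis features, whereas this sketch substitutes scikit-learn's GradientBoostingClassifier, and every variable name and parameter here is a stand-in.

```python
# Illustrative sketch only: synthetic data stands in for urinalysis features,
# and scikit-learn's GradientBoostingClassifier stands in for XGBoost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# ~15% positives, mirroring the culture-positivity rate reported in the study
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)  # 80%-20% split

model = GradientBoostingClassifier(random_state=0)
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)
pred_positive = proba >= 0.28  # the study's 28% diagnostic threshold
```

Group-level over- and underdiagnosis rates would then follow by comparing `pred_positive` against the reference standard within each intersectional group.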
{"title":"Expert Evaluation of the Perceived Accuracy, Relevance, and Safety of Large Language Model-Generated Patient Information in Geriatrics: Cross-Condition Study.","authors":"Sebastian Martini, Sabine Schluessel, Ughur Aghamaliyev, Michaela Rippl, Linda Deissler, Olivia Tausendfreund, Desiree Nuebler, Katharina Mueller, Ralf Schmidmaier, Michael Drey","doi":"10.2196/91369","DOIUrl":"10.2196/91369","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) are increasingly used to generate patient-oriented medical information. In geriatrics, such information must balance accuracy, relevance, and safety, as older adults may be particularly susceptible to misleading or harmful advice. However, systematic evaluations of expert perceptions across multiple geriatric conditions remain limited.</p><p><strong>Objective: </strong>This study aimed to explore geriatricians' perceptions of the accuracy, relevance, and potential harm of LLM-generated patient information across common geriatric conditions and to examine variability and interrater agreement in expert ratings.</p><p><strong>Methods: </strong>In this cross-sectional expert rating study, 10 geriatricians evaluated 50 LLM-generated statements covering 5 geriatric conditions (sarcopenia, osteoporosis, urinary incontinence, depression, and dementia). Statements addressed diagnostic, etiological, prognostic, risk-related, and therapeutic aspects. Experts rated perceived accuracy, relevance, and potential harm using 5-point Likert scales. Rating distributions were summarized using medians and IQRs. The Kendall coefficient of concordance (W) was used exploratorily to assess agreement in the relative ordering of statements within predefined strata. 
Readability was assessed using Flesch-Kincaid Grade Level and Flesch Reading Ease.</p><p><strong>Results: </strong>Expert ratings indicated high perceived accuracy (median 4.32, IQR 4.01-4.59) and perceived relevance (median 4.51, IQR 4.06-4.66), while perceived potential harm remained low (median 1.59, IQR 1.17-1.92). IQR values ranged from 0.00 to 1.38 with most values clustering below 0.5, indicating limited dispersion in expert ratings. Agreement in the relative ordering of statements varied across domains, with W values ranging from 0.27 to 0.62 (median 0.53, IQR 0.46-0.58), indicating moderate concordance. No statements combined low perceived accuracy with high perceived potential harm. Readability analysis indicated generally accessible language, with a median Flesch-Kincaid Grade Level of 8.3 (IQR 7.4-9.6) and a median Flesch Reading Ease score of 60.8 (IQR 50.1-66.9).</p><p><strong>Conclusions: </strong>LLM-generated patient information for common geriatric conditions was rated as largely accurate and relevant, with low potential harm in typical scenarios. Variability in expert emphasis and the exploratory nature of agreement analyses highlight the limitations of perception-based evaluation.
Future studies should incorporate guideline-based validation, readability optimization, and patient-centered outcomes to more comprehensively evaluate the safety and suitability of LLM-generated information for geriatric patient education.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e91369"},"PeriodicalIF":2.0,"publicationDate":"2026-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147824377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
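The Kendall coefficient of concordance (W) used in the agreement analysis above can be computed directly from a raters-by-items matrix of ratings. A minimal sketch on synthetic data shaped like the study's design (10 raters, 50 statements); no tie correction is applied, and all names are illustrative:

```python
# Kendall's W: agreement in the relative ordering of items across raters.
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """ratings: array of shape (m raters, n items); returns W in [0, 1]."""
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank items per rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()    # spread of rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))

rng = np.random.default_rng(0)
# 10 experts rating 50 statements, with a shared underlying ordering plus noise
ratings = rng.normal(size=(10, 50)) + np.linspace(0, 2, 50)
w = kendalls_w(ratings)
```

W is 0 when raters' orderings are unrelated and 1 when they are identical; the study's domain-level values (0.27-0.62) sit in the range conventionally read as moderate concordance.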
{"title":"Participant-Aware Model Validation for Repeated-Measures Data: Comparative Cross-Validation Study.","authors":"Abdolamir Karbalaie, Farhad Abtahi, Charlotte K Häger","doi":"10.2196/87728","DOIUrl":"https://doi.org/10.2196/87728","url":null,"abstract":"<p><strong>Background: </strong>Repeated-measures datasets are common in biomechanics and digital health, where each participant contributes multiple correlated trials. If cross-validation (CV) ignores this structure, information can leak from training to test folds, inflating performance and undermining clinical credibility.</p><p><strong>Objective: </strong>This study evaluates the impact of participant-aware validation strategies on model reliability in repeated-measures classification tasks, using fear of reinjury prediction following anterior cruciate ligament reconstruction (ACLR) as a case study.</p><p><strong>Methods: </strong>We analyzed 623 hop trials from 72 individuals after ACLR to classify fear of reinjury based on biomechanical features. Four CV strategies were compared: stratified 10-fold CV, leave-one-participant-out cross-validation (LOPOCV), group 3-fold CV, and a nested framework combining LOPOCV (outer loop) with group 3-fold CV (inner loop). Ten supervised classifiers were benchmarked across classification accuracy, train-test generalization gap, model ranking consistency, and computational efficiency.</p><p><strong>Results: </strong>Stratified 10-fold CV systematically overestimated model performance (eg, extra trees accuracy of 0.91 vs 0.66 under LOPOCV) due to participant-level data leakage. Group and nested CV strategies yielded more conservative and stable estimates. 
The nested LOPOCV + group CV framework achieved a good balance between generalization and participant-aware separation, with reduced bias and overfitting compared with nonnested alternatives.</p><p><strong>Conclusions: </strong>Participant-aware validation strategies are essential for trustworthy machine learning (ML) evaluation in repeated-measures settings. Nested CV designs improve reproducibility, reduce selection bias, and align with regulatory expectations for clinical ML tools. These findings support best practices in model validation for biomechanics and digital health applications.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e87728"},"PeriodicalIF":2.0,"publicationDate":"2026-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147824355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
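The validation strategies compared in this study map directly onto scikit-learn splitters. The sketch below uses synthetic repeated-measures data with a deliberate participant effect (all names and parameters are illustrative) to reproduce the leakage phenomenon: stratified k-fold scores are inflated because trials from one participant land in both training and test folds, whereas group-aware splits keep each participant's trials together.

```python
# Naive stratified k-fold vs participant-aware splits (LOPOCV, group k-fold).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GroupKFold, LeaveOneGroupOut,
                                     StratifiedKFold, cross_val_score)

rng = np.random.default_rng(0)
n_participants, trials = 20, 10
groups = np.repeat(np.arange(n_participants), trials)  # participant ID per trial
# Each participant has a distinctive feature "signature"; labels are constant
# within a participant, so any same-participant leakage reveals the label.
signature = rng.normal(size=(n_participants, 5))[groups]
X = signature + rng.normal(scale=0.3, size=(len(groups), 5))
y = (np.arange(n_participants) % 2)[groups]

clf = RandomForestClassifier(random_state=0)
naive = cross_val_score(clf, X, y,
                        cv=StratifiedKFold(10, shuffle=True, random_state=0)).mean()
lopo = cross_val_score(clf, X, y, cv=LeaveOneGroupOut(), groups=groups).mean()
group3 = cross_val_score(clf, X, y, cv=GroupKFold(3), groups=groups).mean()
# naive is inflated by participant-level leakage; lopo and group3 estimate
# generalization to unseen participants and are far more conservative.
```

A nested design, as in the study, would wrap a group-aware inner loop (for model selection) inside the LOPOCV outer loop.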
{"title":"A Fine-Tuned Multimodal AI Chatbot for Dietary Health and Nutrition, Purrfessor: Development and Mixed Methods Evaluation.","authors":"Linqi Lu, Yifan Deng, Chuan Tian, Sijia Yang, Dhavan V Shah","doi":"10.2196/74111","DOIUrl":"https://doi.org/10.2196/74111","url":null,"abstract":"<p><strong>Background: </strong>The integration of Large Language and Vision Assistant models with food and nutrition data enables multimodal meal analysis and contextual dietary guidance. Despite this potential, the reliability and practical usefulness of such systems for supporting everyday dietary decision-making remain underexplored.</p><p><strong>Objective: </strong>This study introduces Purrfessor, an innovative artificial intelligence (AI) chatbot designed to provide personalized dietary guidance through interactive, multimodal engagement. The study aimed to evaluate its performance in ingredient recognition and recipe generation.</p><p><strong>Methods: </strong>The Purrfessor chatbot was trained using a combination of the FoodData Central database from the US Department of Agriculture (USDA), the Recipe2img dataset featuring food images and corresponding recipes, a curated human-annotated dataset derived from Recipe1M, and a customized question-and-answer dialogue dataset. The system operates under a session-based, multiturn interaction paradigm, with memory retained only within an active session and no cross-session memory persistence. We implemented a 2-phase evaluation framework combining AI-based performance assessment and human scoring.</p><p><strong>Results: </strong>Purrfessor achieved a high average cosine similarity of 0.90 in ingredient recognition with human-coded references. 
In GPT-4.1-based (OpenAI) evaluation of recipe generation quality, Purrfessor outperformed the raw Large Language and Vision Assistant model across all evaluated dimensions, with the largest improvements in completeness (7.44 vs 6.52), consistency (8.90 vs 7.81), and clarity (9.13 vs 8.39). Overall recipe quality improved from 7.66 to 8.35. Automatic metrics indicated strong ingredient coverage (0.78) and moderate step complexity (0.74), with lower coherence (0.62) and temperature and time specification (0.59), yielding an overall structured score of 0.68. Human evaluators rated Purrfessor's question-and-answer accuracy highly: correctness (mean 8.71, SD 1.15), relevance (mean 9.99, SD 0.10), and clarity (mean 9.33, SD 0.68). Error analysis indicated that 56% of responses contained minor hallucinations (ie, inclusion of inferred secondary details or invisible garnishes). At the same time, core food identification and overall recipe logic remained accurate.</p><p><strong>Conclusions: </strong>Findings highlight the role of anthropomorphic chatbot design and multimodal AI in supporting engaging dietary health conversations. This study offers an example of AI-driven, evidence-based dietary guidance and underscores the potential of health chatbots to nudge informed health decision-making. 
Insights contribute to the development of digital health interventions and personalized health communication strategies, with implications for the design of engaging, user-centered AI health assistants.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e74111"},"PeriodicalIF":2.0,"publicationDate":"2026-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13132530/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147824321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
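The cosine-similarity evaluation of ingredient recognition reported above can be sketched as follows. The paper does not specify its text representation, so the TF-IDF vectors here are an assumption, and the ingredient strings are invented examples.

```python
# Hypothetical sketch: score a model-extracted ingredient list against a
# human-coded reference via cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

predicted = "chicken breast, olive oil, garlic, spinach, parmesan"
reference = "chicken breast, garlic, olive oil, spinach, parmesan cheese"

vec = TfidfVectorizer().fit([predicted, reference])
sim = cosine_similarity(vec.transform([predicted]),
                        vec.transform([reference]))[0, 0]
# sim close to 1.0 indicates near-identical ingredient vocabularies;
# averaging such scores over many meals yields a figure like the paper's 0.90.
```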
{"title":"Fine-Tuning and Benchmarking Transformer Models for Multiclass Classification of Clinical Research Papers: Retrospective Modeling Study.","authors":"Fangwen Zhou, Cynthia Lokker, Rick Parrish, R Brian Haynes, Alfonso Iorio, Ashirbani Saha, Muhammad Afzal","doi":"10.2196/77311","DOIUrl":"https://doi.org/10.2196/77311","url":null,"abstract":"<p><strong>Background: </strong>The exponential growth of digital information has led to an unprecedented expansion in the volume of unstructured text data. Efficient classification of these data is critical for timely evidence synthesis and informed decision-making in health care. Machine learning techniques have shown considerable promise for text classification tasks. However, multiclass classification of papers by study publication type has been largely overlooked compared to binary or multilabel classification. Addressing this gap could significantly enhance knowledge translation workflows and support systematic review processes.</p><p><strong>Objective: </strong>This study aimed to fine-tune and evaluate domain-specific transformer-based language models on a gold-standard dataset for multiclass classification of clinical literature into mutually exclusive categories: original studies, reviews, evidence-based guidelines, and nonexperimental studies.</p><p><strong>Methods: </strong>The titles and abstracts of McMaster's Premium Literature Service (PLUS) dataset comprising 162,380 papers were used for fine-tuning seven domain-specific transformers. Clinical experts classified the papers into four mutually exclusive publication types. PLUS data were split in an 80:10:10 ratio into training, validation, and testing sets, with the Clinical Hedges dataset used for external validation. A grid search evaluated the impact of class weight (CW) adjustments, learning rate (LR), batch size (BS), warmup ratio, and weight decay (WD), totaling 1890 configurations. 
Models were assessed using 10 metrics, including the area under the receiver operating characteristic curve (AUROC), the F<sub>1</sub>-score (harmonic mean of precision and recall), and the Matthews correlation coefficient (MCC). The performance of individual classes was assessed using a one-to-rest approach, and overall performance was assessed using the macro average. Optimal models identified from validation results were further tested on both PLUS and Clinical Hedges, with calibration assessed visually.</p><p><strong>Results: </strong>Ten best-performing models achieved macro AUROC≥0.99, F<sub>1</sub>-score≥0.89, and MCC≥0.88 on the validation and testing sets. Performance declined on Clinical Hedges. Models were consistently better at classifying original studies and reviews. Biomedical Bidirectional Encoder Representations from Transformers (fine-tuned on biomedical text; BioBERT)-based models had superior calibration performance, especially for original studies and reviews. Optimal configurations from the search included lower LRs (1 × 10<sup>-5</sup> and 3 × 10<sup>-5</sup>), midrange BSs (32-128), and lower WD (0.005-0.010). CW adjustments improved recall but generally reduced performance on other metrics. Models generally struggled with accurately classifying nonexperimental and guideline studies, potentially due to class imbalance and content heterogeneity.</p><p><strong>Conclusions: </strong>This study used a compr","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e77311"},"PeriodicalIF":2.0,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147824398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
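The headline metrics above (macro one-vs-rest AUROC, F<sub>1</sub>-score, Matthews correlation coefficient) can all be computed with scikit-learn. The sketch below substitutes simulated 4-class predictions for the PLUS test set, so the numbers it produces are not the study's; the class labeling scheme is only an illustration of the paper's 4 publication types.

```python
# Multiclass evaluation: one-vs-rest macro AUROC, macro F1, and MCC on
# simulated predictions for 4 classes (original studies, reviews,
# guidelines, nonexperimental studies).
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=500)

# Simulated predicted probabilities, strongly biased toward the true class.
proba = rng.dirichlet(np.ones(4), size=500)
proba[np.arange(500), y_true] += 1.0
proba /= proba.sum(axis=1, keepdims=True)
y_pred = proba.argmax(axis=1)

auroc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)
```

`multi_class="ovr"` with `average="macro"` performs the same one-to-rest decomposition and macro averaging described in the Methods.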
{"title":"A New Model for Youth-Driven Community Change: Exploratory Testing of Artificial Intelligence-Supported Citizen Science.","authors":"Eduardo De la Vega-Taboada, Sofia A Portillo, Lina Maria Gomez-Garcia, Ann Banchoff, Victoria Maria Bermudez, Diana M Chavez, Tate Isabella Sgaraglino, Eugenia Flores Millender, Olga L Sarmiento, Abby C King","doi":"10.2196/79464","DOIUrl":"https://doi.org/10.2196/79464","url":null,"abstract":"<p><strong>Background: </strong>Generative artificial intelligence (AI) systems are increasingly used in health and community settings, yet empirical evidence on how they function within participatory, youth-led action frameworks remains limited. Large language models can provide structured feedback to support planning and critical reflection, and AI-based image transformation can generate realistic visual prototypes to enhance shared understanding. However, risks include output variability, feasibility gaps when AI-generated recommendations or visualizations imply solutions that are not operationally workable, and the potential to displace adolescent voice and agency if AI outputs are treated as authoritative rather than as inputs for collective deliberation.</p><p><strong>Objective: </strong>This study examines how 2 generative AI tools-structured feedback using a GPT model and AI-based image transformation-functioned as deliberative and visualization supports within a youth-led citizen science intervention addressing environmental health concerns in El Pozón, Cartagena, Colombia.</p><p><strong>Methods: </strong>This exploratory action research study included a preparation phase and an implementation phase. During preparation, researchers iteratively tested SecureGPT (a privacy-enhanced version of ChatGPT 4.0) prompt configurations and compared DALL-E with Adobe Photoshop AI for place-based image modification, selecting a fixed prompt format requesting 3 strengths, 3 weaknesses, and 5 reflective questions (3-3-5). 
During implementation, 12 adolescent citizen scientists completed the Our Voice process. AI use was facilitator-mediated: prompts were co-developed through youth consensus, a facilitator entered prompts and operated tools while youth observed, and outputs were reviewed with the group in real time before use. Data sources included structured field notes, analytic memos, archived prompts and outputs, and session recordings. Analysis was descriptive and process-oriented, examining how AI shaped deliberation, solution refinement, and stakeholder engagement.</p><p><strong>Results: </strong>Structured GPT prompts supported deeper critical analysis and iterative refinement toward more feasible interventions. Model outputs varied in usefulness; role-based prompting often produced redundant responses, and early outputs were occasionally overly generic, requiring facilitator guidance and prompt refinement. The structured 3-3-5 format improved specificity and reduced wordiness. DALL-E did not generate sufficiently realistic place-based modifications, whereas Adobe Photoshop AI, used with iterative prompting and area-selection tools, produced visually plausible prototypes that supported group discussion and stakeholder communication. 
Highly realistic visualizations also introduced a feasibility gap when depicted infrastructure exceeded operational constraints, requiring explicit framing of images as aspirational prototypes rather than technical d","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e79464"},"PeriodicalIF":2.0,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing Artificial Intelligence in Radiology: Design Thinking Road Map.","authors":"Vitor Ulisses Monnaka, Jéssica Andrade-Silva, Gilberto Szarf, Henrique Min Ho Lee","doi":"10.2196/87360","DOIUrl":"10.2196/87360","url":null,"abstract":"<p><strong>Unlabelled: </strong>Despite its promising potential to transform medical care, particularly in the field of medical images, the integration of artificial intelligence (AI) into clinical practice remains a complex and multifaceted challenge. In real-world settings, AI tools may demonstrate limited clinical impact, suboptimal performance, and security vulnerabilities, and face regulatory constraints. This viewpoint explores how the principles of design thinking can provide a structured road map for AI implementation in radiology. By emphasizing user-centeredness, fostering multidisciplinary collaboration, and embedding iterative refinement, this approach offers practical guidance for identifying clinical and operational needs, selecting and validating appropriate solutions, and ensuring effective deployment with continuous improvement.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e87360"},"PeriodicalIF":2.0,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13128057/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Qualitative Exploration of Ethical Aspects of Using AI in Parkinson Disease: Patient Panel Study.","authors":"Jamie Linnea Luckhaus, Therese Scott Duncan, Anna Kharko, Anna Clareborn, Maria Hägglund, Charlotte Blease, Sara Riggare","doi":"10.2196/74144","DOIUrl":"https://doi.org/10.2196/74144","url":null,"abstract":"<p><strong>Background: </strong>As Parkinson disease (PD) rates increase, so does interest in finding new technological solutions for PD management. Despite substantial efforts to explore potential applications of artificial intelligence (AI) in PD management, research from the perspectives of people with PD on AI remains limited.</p><p><strong>Objective: </strong>This study aims to explore the ethical considerations of AI in PD management from the perspective of people with PD.</p><p><strong>Methods: </strong>A qualitative triangulation of 13 interviews and 2 focus groups (FGs) with a panel of expert-by-experience people with PD from 6 European countries was carried out using abductive thematic analysis. The 6 biomedical ethical principles conceptualized by Beauchamp and Childress guided the analysis. Participants varied in diagnosis, disease experiences, and technological backgrounds. A researcher with PD was involved from start to finish, providing valuable insights into data collection and analysis.</p><p><strong>Results: </strong>Although optimistic that AI could enhance autonomy and beneficence through personalized, actionable insights for people with PD and their health care professionals, concerns arose over patient involvement, model accuracy and privacy, ethical injustices, and the psychological impact. Risk prediction, prognosis, and medication response were viewed differently in terms of potential value and ethical considerations, with risk prediction being perceived as the most ethically complex. 
To uphold autonomy, it was considered important for AI insights to be patient-accessible, and sensitive insights should be communicated by a health care professional who recognizes individual differences in desiring and responding to AI predictions.</p><p><strong>Conclusions: </strong>While people with PD felt AI could personalize (self-)care and increase autonomy, concerns about psychological harm and widening inequalities highlight the importance of ethical safeguards. Our findings underscore the importance of AI integrations that prioritize individual needs, actively engage people with PD in the development, implementation, and interpretation of predictive AI, and establish guidelines to support health care professionals and minimize patient harm. Different forms of implementation and precautions should be taken for risk, progression, and medication response prediction.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e74144"},"PeriodicalIF":2.0,"publicationDate":"2026-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13123883/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI Applications Integrating Legal and Regulatory Perspectives in Mental Health: Systematic Review.","authors":"Moustafa Elmetwaly Kandeel, Eid G Abo Hamza, Alaa Abouahmed, Gehad Mohamed AbdelAziz, Adham Hashish, Tarek Abo El Wafa, Ahmed Khalil, Ahmed Eldakak","doi":"10.2196/84305","DOIUrl":"https://doi.org/10.2196/84305","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) offers new methods to improve diagnosis and treatment in mental health. However, its use raises legal and ethical concerns.</p><p><strong>Objective: </strong>AI is increasingly being used for mental health care, but its clinical prominence and ethical implications are yet to be determined. This systematic review examines the clinical efficacy and ethical issues of AI in mental health treatment, focusing on the main conclusions regarding diagnostic accuracy and therapeutic efficacy.</p><p><strong>Methods: </strong>The review comprises an analysis of 35 studies published from 2013 to 2024, drawing on multiple databases and following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 guidelines. This review searched PubMed (biomedical emphasis), IEEE Xplore (engineering or AI), PsycINFO (psychological literature), Scopus (multidisciplinary focus), and Cochrane Library (evidence-based treatment) from January 1, 2013, to December 31, 2024. Included studies focused on AI applications for diagnosis, treatment, or patient engagement, excluding tangential uses (eg, administrative tasks). Only English-language publications were searched to mitigate language bias, though this introduces potential geographic bias.</p><p><strong>Results: </strong>AI-enabled interventions based on natural language processing models showed up to 89% accuracy for depression detection.
Wearables such as the Empatica E4 showed an F1-score of 0.81 for predicting anxiety episodes. AI-enabled therapies, such as chat-based interventions and online cognitive behavioral therapy, have been shown to improve anxiety symptoms by about 30% in some studies, although there was considerable variability in the impact based on study design, intervention duration, and comparator conditions, as well as the overall methodological quality of the studies. However, challenges remain, including biases in training data, evidenced by performance declines of up to 15% in non-English datasets, and concerns over data privacy.</p><p><strong>Conclusions: </strong>AI has the potential to revolutionize mental health treatment, offering cost-saving, personalized, and culturally sensitive interventions while protecting privacy, equity, and human agency.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e84305"},"PeriodicalIF":2.0,"publicationDate":"2026-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13117227/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Wadie, Bishoy Zakher, Khalid Elgazzar, Abdulhamid Alsbakhi, Abdul-Mohsen G Alhejaily
{"title":"AI in Point-of-Care Imaging for Clinical Decision Support: Systematic Review of Diagnostic Accuracy, Task-Shifting, and Explainability.","authors":"Peter Wadie, Bishoy Zakher, Khalid Elgazzar, Abdulhamid Alsbakhi, Abdul-Mohsen G Alhejaily","doi":"10.2196/80928","DOIUrl":"https://doi.org/10.2196/80928","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) integrated with point-of-care imaging is a promising approach to expand access in settings with limited specialist availability. However, no systematic review has comprehensively evaluated AI-assisted clinical decision support across multiple point-of-care imaging modalities, assessed explainability implementation, or quantified clinical impact evidence gaps.</p><p><strong>Objective: </strong>We aim to systematically evaluate and synthesize evidence on AI-based clinical decision support systems using point-of-care imaging.</p><p><strong>Methods: </strong>We searched PubMed, Scopus, IEEE Xplore, and Web of Science (January 2018 to November 2025). We included research studies evaluating AI or machine learning systems applied to point-of-care-capable imaging modalities in clinical settings with clinical decision support outputs. Two reviewers independently screened studies, extracted data across 15 domains, and assessed methodological quality using QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies 2). Frameworks were developed to evaluate explainability implementation and clinical impact evidence. Narrative synthesis was performed due to substantial data heterogeneity.</p><p><strong>Results: </strong>Of 2113 records identified, 20 studies met inclusion criteria, encompassing approximately 78,000 patients across 15 countries. 
Studies evaluated tuberculosis (n=5), breast cancer (n=3), deep vein thrombosis (DVT) (n=2), and 9 other conditions using ultrasound (7/20, 35%), chest x-ray (5/20, 25%), photography-based and colposcopic imaging (3/20, 15%), fundus photography (2/20, 10%), microscopy (2/20, 10%), and dermoscopy (1/20, 5%). Median sensitivity was 93.6% (IQR 87%-98%), and median specificity was 90.6% (IQR 74.5%-96.7%). Task-shifting was demonstrated in 65% (13/20) of studies, with nonspecialists achieving specialist-level performance after a median of 1 hour of training (range 30 minutes to 6 months; n=6 studies reporting specific durations). The explainable artificial intelligence (XAI) implementation cascade revealed critical gaps: 75% (15/20) of studies did not mention explainability, 10% (2/20) provided explanations to users, and none evaluated whether clinicians understood explanations or whether XAI influenced decisions. The clinical impact pyramid showed 15% (3/20) of studies reported technical accuracy only, 65% (13/20) reported process outcomes, 20% (4/20) documented clinical actions, and none measured patient outcomes. Methodological quality was concerning, as 70% (14/20) of studies were at high or very high risk of bias, with verification bias (14/20, 70%) and selection bias (10/20, 50%) being the most common. 
The overall certainty of evidence was very low (GRADE [Grading of Recommendations, Assessment, Development, and Evaluation] ⊕◯◯◯), primarily due to risk of bias, heterogeneity, and imprecision.</p><p><strong>Conclusions: </strong>AI-assisted point-of-ca","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80928"},"PeriodicalIF":2.0,"publicationDate":"2026-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13119389/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147791243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
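The pooled accuracy figures in the review above (median sensitivity 93.6%, IQR 87%-98%) are order statistics over per-study estimates rather than a meta-analytic pooled effect. A minimal sketch of how such summaries are computed, with hypothetical per-study sensitivities standing in for the review's 20 studies:

```python
import statistics

# Hypothetical per-study sensitivities (%), illustrative only
sensitivities = [85.0, 87.0, 89.0, 91.0, 93.0, 94.0, 96.0, 97.0, 98.0, 99.0]

median = statistics.median(sensitivities)
# quantiles(n=4) returns the three quartile cut points (exclusive method)
q1, _, q3 = statistics.quantiles(sensitivities, n=4)
print(f"median {median}% (IQR {q1}%-{q3}%)")
```

Note that `statistics.quantiles` defaults to the exclusive method; different quantile conventions across studies are one source of the heterogeneity the review flags.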