{"title":"Application of Nudges to Design Clinical Decision Support Tools: Systematic Approach Guided by Implementation Science.","authors":"Katy E Trinkley, Danielle Maestas Duran, Shelley Zhang, Meagan Bean, Larry A Allen, Russell E Glasgow, Amy G Huebschmann, Chen-Tan Lin, Jason N Mansoori, Anna M Maw, James Mitchell, Laura D Scherer, Daniel D Matlock","doi":"10.2196/73189","DOIUrl":"10.2196/73189","url":null,"abstract":"<p><strong>Background: </strong>Clinical decision support (CDS) is one strategy to increase evidence-based practices by clinicians. Despite its potential, CDS tools produce mixed results and are often disliked by clinicians. Principles from behavioral economics such as \"nudges\" may improve the effectiveness and clinician satisfaction of CDS tools. This paper outlines a pragmatic approach grounded in implementation science to identify and prioritize how to incorporate different types of nudges into CDS tools.</p><p><strong>Objective: </strong>The purpose of this paper is to describe a systematic and pragmatic approach grounded in implementation science to identify and prioritize how best to incorporate different types of nudges into CDS tools. We provide a case example of how this systematic approach was applied to design a CDS tool to improve guideline-concordant prescribing of mineralocorticoid receptor antagonists for patients with heart failure and reduced ejection fraction.</p><p><strong>Methods: </strong>We applied the Messenger, Incentives, Norms, Defaults, Salience, Priming, Affect, Commitments, and Ego nudge framework and the Practical, Robust Implementation and Sustainability Model implementation science framework to systematically and pragmatically identify and prioritize different types of nudges for CDS tools. To illustrate how these frameworks can be applied in a real-life scenario, we use a case example of a CDS tool to improve guideline-concordant prescribing for patients with heart failure. 
We describe a process of how these frameworks can be used pragmatically by clinicians and informaticists or more technical CDS builders to apply nudge theory to CDS tools.</p><p><strong>Results: </strong>We defined four iterative steps guided by the Practical, Robust Implementation and Sustainability Model: (1) engage partners for user-centered design, (2) develop a shared understanding of the nudge types, (3) determine the overarching CDS format, and (4) brainstorm and prioritize nudge types to address each modifiable contextual issue. These steps are iterative and intended to be adapted to align with the local resources and needs of various clinical scenarios and settings. We provide illustrative examples of how this approach was applied to the case example, including who we engaged, details of nudge design decisions, and lessons learned.</p><p><strong>Conclusions: </strong>We present a pragmatic approach to guide the selection and prioritization of nudges, informed by implementation science. This approach can be used to comprehensively and systematically consider key issues when designing CDS to optimize clinician satisfaction, effectiveness, equity, and sustainability while minimizing the potential for unintended consequences. 
This approach can be adapted and generalized to other health settings and clinical situations, advancing the goals of learning health systems to expedite the translation of evidence into practice.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e73189"},"PeriodicalIF":6.0,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12463335/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145149351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
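Among the nudge types the framework names (Messenger, Incentives, Norms, Defaults, Salience, Priming, Affect, Commitments, and Ego), the "defaults" nudge lends itself to a concrete sketch. The fragment below illustrates how a CDS rule might pre-populate a guideline-concordant mineralocorticoid receptor antagonist (MRA) order while preserving clinician choice. This is a hypothetical illustration only: the function name, drug, dose, and safety thresholds are the editor's assumptions, not the paper's actual tool.

```python
# Hypothetical sketch of a "default" nudge in a CDS rule for HFrEF:
# the guideline-concordant order is pre-selected so the clinician can
# accept it with no extra effort, but remains free to dismiss it.
# All clinical thresholds and the suggested order are illustrative.

def suggest_mra_order(ef_percent, potassium, egfr, on_mra):
    """Return a pre-populated (default) MRA order suggestion, or None."""
    if on_mra or ef_percent > 40:        # not an HFrEF candidate (illustrative cutoff)
        return None
    if potassium >= 5.0 or egfr < 30:    # common safety exclusions (illustrative)
        return None
    return {
        "order": "spironolactone 12.5 mg daily",  # hypothetical default
        "preselected": True,                       # the "default" nudge
        "dismissible": True,                       # clinician choice preserved
    }
```

The design point is that the nudge changes the effort asymmetry, not the option set: accepting the evidence-based order is one click, declining it is equally available.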
{"title":"Diabetic Foot Ulcer Classification Models Using Artificial Intelligence and Machine Learning Techniques: Systematic Review.","authors":"Manuel Alberto Silva, Emma J Hamilton, David A Russell, Fran Game, Sheila C Wang, Sofia Baptista, Matilde Monteiro-Soares","doi":"10.2196/69408","DOIUrl":"10.2196/69408","url":null,"abstract":"<p><strong>Background: </strong>Diabetes-related foot ulceration (DFU) is a common complication of diabetes, with a significant impact on survival, health care costs, and health-related quality of life. The prognosis of DFU varies widely among individuals. The International Working Group on the Diabetic Foot recently updated their guidelines on how to classify ulcers using \"classical\" classification and scoring systems. No system was recommended for individual prognostication, and the group considered that more detail in ulcer characterization was needed and that machine learning (ML)-based models may be the solution. Despite advances in the field, no assessment of the available evidence had been done.</p><p><strong>Objective: </strong>This study aimed to identify and collect available evidence assessing the ability of ML-based models to predict clinical outcomes in people with DFU.</p><p><strong>Methods: </strong>We searched the MEDLINE database (PubMed), Scopus, Web of Science, and IEEE Xplore for papers published up to July 2023. Studies were eligible if they were anterograde analytical studies that examined the prognostic abilities of ML models in predicting clinical outcomes in a population that included at least 80% of adults with DFU. The literature was screened for eligibility independently by 2 investigators (MMS and DAR or EH in the first phase, and MMS and MAS in the second phase), and data were extracted. The risk of bias was evaluated using the Quality In Prognosis Studies tool and the Prediction model Risk Of Bias Assessment Tool by 2 investigators (MMS and MAS) independently. 
A narrative synthesis was conducted.</p><p><strong>Results: </strong>We retrieved a total of 2412 references after removing duplicates, of which 167 were subjected to full-text screening. Two references were added from searching relevant studies' lists of references. A total of 11 studies, comprising 13 papers, were included, focusing on 3 outcomes: wound healing, lower extremity amputation, and mortality. Overall, 55 predictive models were created using mostly clinical characteristics, random forest as the development method, and area under the receiver operating characteristic curve (AUROC) as a discrimination accuracy measure. AUROC varied from 0.56 to 0.94, with the majority of the models reporting an AUROC equal to or greater than 0.8 but lacking 95% CIs. All studies were found to have a high risk of bias, mainly due to a lack of uniform variable definitions, outcome definitions and follow-up periods, insufficient sample sizes, and inadequate handling of missing data.</p><p><strong>Conclusions: </strong>We identified several ML-based models predicting clinical outcomes with good discriminatory ability in people with DFU. Because the studies focused on development and internal validation of the models, proposed several models in each study without selecting the \"best one,\" and used nonexplainable techniques, the clinical use of this type of model is clearly impaired. Future 
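The review notes that most models reported an AUROC but no 95% CI. A minimal stdlib-only sketch of both quantities is below: AUROC computed as the rank-based probability that a positive case outscores a negative one, with a percentile bootstrap CI. This is the editor's illustration, not any included study's code.

```python
# Sketch: AUROC plus a percentile-bootstrap 95% CI, the interval the
# review found missing in most included models. Pure stdlib; illustrative.
import random

def auroc(labels, scores):
    """Probability a random positive outranks a random negative
    (ties count 0.5) — equivalent to the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; resamples missing a class are redrawn."""
    rng = random.Random(seed)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need both classes in the resample
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Reporting the interval alongside the point estimate makes clear how much of an "AUROC ≥ 0.8" claim survives sampling variability, especially at the small sample sizes the review flags.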
{"title":"How Well Do Older Adult Fitness Technologies Match User Needs and Preferences? Scoping Review of 2014-2024 Literature.","authors":"Christopher Tacca, Arturo Vazquez Galvez, Isobel Margaret Thompson, Alexander Dawid Bincalar, Christoph Tremmel, Richard Gomer, Martin Warner, Chris Freeman, M C Schraefel","doi":"10.2196/75667","DOIUrl":"10.2196/75667","url":null,"abstract":"<p><strong>Background: </strong>The population is aging, and interest in research on maintaining older adults' independent living is growing. Digital technologies have been developed to support older adults' independent living through fitness. However, reviews of current fitness technologies for older adults indicate that their success is considerably limited.</p><p><strong>Objective: </strong>This scoping review investigates older adult fitness by comparing current interventions to known needs and preferences of older adults from older adult-specific technology acceptance research, barriers and enablers to physical activity, and qualitative research on fitness technologies. The review questions are (1) How well do current older adult fitness technologies align with known preferences? (2) How well do current research methodologies evaluate the known needs and preferences?</p><p><strong>Methods: </strong>Research papers from the last 10 years were searched in the ACM Digital Library, IEEE Xplore, Medline, and PsycINFO databases using keywords related to older adults, technology, and exercise. Papers were included only if they specifically evaluated fitness technologies, focused on older adults, and mentioned a specific technology used in the intervention. To evaluate the fitness interventions, an assessment tool, the Older Adult Fitness Technology Translation Assessment tool, was synthesized through literature on technology acceptance, barriers and enablers to physical activity, and qualitative research on fitness technologies. 
Interventions were scored by 5 reviewers using a dual-review approach.</p><p><strong>Results: </strong>A total of 43 research papers were selected: 16 from medical journals, 15 from engineering journals, 7 from human-computer interaction journals, 3 from public health journals, and 2 from combined computing and engineering journals. The Older Adult Fitness Technology Translation Assessment tool contained six assessment factors: (1) compatibility with lifestyle, (2) similarity with experience, (3) dignity and independence, (4) privacy concerns, (5) social support, and (6) emotion. The average scores of the 6 factors were 2.93 (SD 0.86) on compatibility with lifestyle, 3.10 (SD 0.74) on similarity to experience, 3.49 (SD 0.64) on dignity and independence, 3.17 (SD 0.86) on privacy concerns, 3.74 (SD 0.81) on short-term outcomes, 2.75 (SD 1.21) on long-term outcomes, 2.79 (SD 0.88) on social support, and 3.17 (SD 1.19) on emotion. No research paper scored a 3 or above on all 6 factors.</p><p><strong>Conclusions: </strong>The results show a lack of alignment between the known preferences of older adults and the design and assessment of current older adult fitness technologies. Areas for growth include (1) alignment between the needs of older adults and fitness technology intervention design, (2) translation of findings from older adult design work to designs in practice, and (3) explicit usage of older adult-specific factors in research. 
We ","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e75667"},"PeriodicalIF":6.0,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145130969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Author's Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care.","authors":"Andre Python, HongYi Li, Jun-Fen Fu","doi":"10.2196/82729","DOIUrl":"10.2196/82729","url":null,"abstract":"","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e82729"},"PeriodicalIF":6.0,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12459737/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145137720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI Act Compliance within the MyHealth@EU Framework: A Tutorial.","authors":"Monika Simjanoska Misheva, Dragan Shahpaski, Jovana Dobreva, Djansel Bukovec, Blagojche Gjorgjioski, Marjan Nikolov, Dalibor Frtunikj, Petre Lameski, Azir Aliu, Kostadin Mishev, Matjaž Gams","doi":"10.2196/81184","DOIUrl":"https://doi.org/10.2196/81184","url":null,"abstract":"<p><strong>Unstructured: </strong>Background: The integration of AI into clinical workflows is advancing even before full compliance with the MyHealth@EU framework is achieved. While AI-based Clinical Decision Support Systems (CDSS) are automatically classified as high-risk under the EU AI Act, cross-border health data exchange must also satisfy MyHealth@EU interoperability requirements. This creates a dual-compliance challenge: vertical safety and ethics controls mandated by the AI Act, and horizontal semantic-transport requirements enforced through OpenNCP gateways, many of which are still maturing toward production readiness. Objective: This paper provides a practical, phase-oriented tutorial that enables developers and providers to embed AI Act safeguards before approaching MyHealth@EU interoperability tests. The goal is to show how AI-specific metadata can be included in HL7 CDA and FHIR messages without disrupting standard structures, ensuring both compliance and trustworthiness in AI-assisted clinical decisions. Regulatory foundations: We systematically analysed Regulation (EU) 2024/1689 (AI Act) and the MyHealth@EU/OpenNCP technical specifications, extracting a harmonised set of overlapping obligations. AI Act provisions on transparency, provenance, and robustness are mapped directly onto MyHealth@EU workflows, identifying the points where outgoing messages must record AI involvement, log provenance, and trigger validation. 
Walkthrough: To operationalise this mapping, we propose a minimal extension set, covering AI contribution status, rationale, risk classification, and Annex IV documentation links, together with a phase-based compliance checklist that aligns AI Act controls with MyHealth@EU conformance steps. Illustrative example: A simulated International Patient Summary (IPS) transmission demonstrates how CDA/FHIR extensions can annotate AI involvement, how OpenNCP processes such enriched payloads, and how clinicians in another Member State view the result with backward compatibility preserved. Discussion: We expand on security considerations (e.g., OWASP GenAI risks such as prompt injection and adversarial inputs), continuous post-market risk assessment, monitoring, and alignment with MyHealth@EU's incident aggregation system. Limitations reflect the immaturity of current infrastructures and regulations, with real-world validation pending the rollout of key dependencies. Conclusions: AI-enabled clinical software succeeds only when AI Act safeguards and MyHealth@EU interoperability rules are engineered together from \"day zero.\" This tutorial provides developers with a forward-looking blueprint that reduces duplication of effort, streamlines conformance testing, and embeds compliance early. 
While the concept is still in the early phases of practical adoption, it represents a necessary and worthwhile direction for ensuring that future AI-enabled clinical systems can meet both sets of EU regulatory requirements from day one.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":" ","pages":""},"PeriodicalIF":6.0,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145131123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
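The tutorial's "minimal extension set" (AI contribution status, rationale, risk classification) can be pictured as metadata riding alongside a FHIR resource without disturbing its standard fields. The sketch below is purely illustrative: the extension URL and field names are the editor's invented placeholders, not part of any published MyHealth@EU or HL7 profile.

```python
# Illustrative only: attaching AI-involvement metadata to a FHIR-style
# resource as a nested extension, in the spirit of the tutorial's
# "minimal extension set". The URL and sub-field names are hypothetical.
AI_EXT_URL = "urn:example:ai-contribution"  # placeholder identifier

def annotate_ai_contribution(resource, status, rationale, risk_class):
    """Return a copy of `resource` with AI metadata appended as an
    extension; the resource's own fields are left untouched."""
    ext = {
        "url": AI_EXT_URL,
        "extension": [
            {"url": "status", "valueCode": status},         # e.g. "ai-assisted"
            {"url": "rationale", "valueString": rationale},
            {"url": "riskClass", "valueCode": risk_class},  # e.g. "high" (AI Act)
        ],
    }
    annotated = dict(resource)
    annotated["extension"] = list(resource.get("extension", [])) + [ext]
    return annotated

summary = {"resourceType": "Composition", "title": "Patient Summary"}
enriched = annotate_ai_contribution(
    summary, "ai-assisted", "Drug-interaction check by CDSS", "high")
```

Because unknown extensions are ignorable by design in FHIR, a receiving gateway that does not understand the annotation can still process the payload, which is the backward-compatibility property the tutorial's example relies on.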
{"title":"Critical Limitations in Systematic Reviews of Large Language Models in Health Care.","authors":"Zvi Weizman","doi":"10.2196/81769","DOIUrl":"10.2196/81769","url":null,"abstract":"","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e81769"},"PeriodicalIF":6.0,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12459740/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145137705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: Toward Real-Time Discharge Volume Predictions in Multisite Health Care Systems: Longitudinal Observational Study.","authors":"Fernando Acosta-Perez, Justin Boutilier, Gabriel Zayas-Caban, Sabrina Adelaine, Frank Liao, Brian Patterson","doi":"10.2196/83802","DOIUrl":"10.2196/83802","url":null,"abstract":"<p><p>[This corrects the article DOI: 10.2196/63765.].</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e83802"},"PeriodicalIF":6.0,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12504887/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145124911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Associations Between Social Determinants of Health and Adherence in Mobile-Based Ecological Momentary Assessment: Scoping Review.","authors":"Yinan Sun, Aditi Jaiswal, Christopher Slade, Kristina T Phillips, Roberto M Benzo, Peter Washington","doi":"10.2196/69831","DOIUrl":"10.2196/69831","url":null,"abstract":"<p><strong>Background: </strong>Ecological momentary assessment (EMA) involves repeated prompts to capture real-time self-reported health outcomes and behaviors via mobile devices. With the rise of mobile health (mHealth) technologies, EMA has been applied across diverse populations and health domains. However, the extent to which EMA engagement and data quality vary across social determinants of health (SDoH) remains underexplored. Emerging evidence suggests that EMA adherence and data completeness may sometimes be associated with participant characteristics such as socioeconomic status, race/ethnicity, and education level. These associations may sometimes influence who engages with EMA protocols and the types of contextual data captured. Despite growing interest in these patterns, no review to date has synthesized evidence on how SDoH relate to EMA compliance and engagement.</p><p><strong>Objective: </strong>We conducted a scoping review to address two research questions: (R1) how EMA compliance rates in health studies can differ across SDoH and (R2) what types of SDoH have been identified through EMA health studies.</p><p><strong>Methods: </strong>Following PRISMA-ScR guidelines, we searched PubMed, Web of Science, and EBSCOhost using two sets of queries targeting EMA and its relationship to SDoH. Eligible studies were peer reviewed, were published in English between 2013 and 2024, and used mobile-based EMA methods. Studies were included if they (1) reported on differences in EMA compliance by SDoH or (2) reported at least one SDoH observed or uncovered during an EMA study. 
We used the social ecological model (SEM) as a guiding framework to categorize and interpret SDoH across individual, interpersonal, community, and societal levels. A qualitative thematic synthesis was conducted to iteratively and collaboratively extract, categorize, and review determinants.</p><p><strong>Results: </strong>We analyzed 48 eligible studies, of which 35 addressed R1 by examining compliance patterns across various SDoH. Using the SEM, we identified 13 determinants categorized across 4 levels: individual (eg, daily routine, biological sex, age, socioeconomic status, language, education, and race or ethnicity), interpersonal (eg, social support), community and organizational (eg, social context, social acceptance, stigmatization, and youth culture), and policy or societal (eg, systemic and structural barriers). These studies described differences in EMA response rates, compliance, and dropout associated with these determinants, often among vulnerable populations. The remaining 13 studies addressed R2, demonstrating examples of the types of SDoH that EMA research can uncover, including family culture, social support, social contexts, stigmatization, gender norms, heroic narratives, LGBTQ+ culture, racial discrimination, and systemic and structural barriers.</p><p><strong>Conclusions: </strong>This scoping review illustrates how EMA compliance rates can differ acros","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e69831"},"PeriodicalIF":6.0,"publicationDate":"2025-09-23","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456876/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145130988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Tuning Methods for Large Language Models in Clinical Medicine by Supervised Fine-Tuning and Direct Preference Optimization: Comparative Evaluation.","authors":"Thomas Savage, Stephen P Ma, Abdessalem Boukil, Ekanath Rangan, Vishwesh Patel, Ivan Lopez, Jonathan Chen","doi":"10.2196/76048","DOIUrl":"10.2196/76048","url":null,"abstract":"<p><strong>Background: </strong>Large language model (LLM) fine-tuning is the process of adjusting out-of-the-box model weights using a dataset of interest. Fine-tuning can be a powerful technique to improve model performance in fields like medicine, where LLMs may have poor out-of-the-box performance. The 2 common fine-tuning techniques are supervised fine-tuning (SFT) and direct preference optimization (DPO); however, little guidance is available for when to apply either method within clinical medicine or health care operations.</p><p><strong>Objective: </strong>This study aims to investigate the benefits of fine-tuning with SFT and DPO across a range of core natural language tasks in medicine to better inform clinical informaticists when either technique should be deployed.</p><p><strong>Methods: </strong>We use Llama3 8B (Meta) and Mistral 7B v2 (Mistral AI) to compare the performance of SFT alone and DPO across 4 common natural language tasks in medicine. The tasks we evaluate include text classification, clinical reasoning, text summarization, and clinical triage.</p><p><strong>Results: </strong>Clinical reasoning accuracy was 7% with base Llama3 and 22% with base Mistral2, increasing to 28% and 33%, respectively, with SFT, and to 36% and 40% with DPO (P=.003 and P=.004, respectively). Summarization quality, graded on a 5-point Likert scale, was 4.11 with base Llama3 and 3.93 with base Mistral2. Performance increased to 4.21 and 3.98 with SFT and then 4.34 and 4.08 with DPO (P<.001). 
F1-scores for provider triage were 0.55 for Llama3 and 0.49 for Mistral2, which increased to 0.58 and 0.52 with SFT and 0.74 and 0.66 with DPO (P<.001). F1-scores for urgency triage were 0.81 for Llama3 and 0.88 for Mistral2, which decreased with SFT to 0.79 and 0.87, and then experienced mixed results with DPO, achieving 0.91 and 0.85, respectively (P<.001 and P>.99, respectively). Finally, F1-scores for text classification were 0.63 for Llama3 and 0.73 for Mistral2, which increased to 0.98 and 0.97 with SFT, and then essentially did not change with DPO to 0.95 and 0.97, respectively (P=.55 and P>.99, respectively). DPO fine-tuning required approximately 2 to 3 times more compute resources than SFT alone.</p><p><strong>Conclusions: </strong>SFT alone is sufficient for simple tasks such as rule-based text classification, while DPO after SFT improves performance on the more complex tasks of triage, clinical reasoning, and summarization. We postulate that SFT alone is sufficient for simple tasks because SFT strengthens simple word-association reasoning, whereas DPO enables deeper comprehension because it is trained with both positive and negative examples, enabling the model to recognize more complex patterns. 
Ultimately, our results help inform clinical informaticists when to deploy either fine-tuning method and encourage commercial LLM providers to offer DPO fine-tuning for commonly used proprietary LLMs in medicine.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e76048"},"PeriodicalIF":6.0,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12457693/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145131026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
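The abstract's claim that DPO "is trained with both positive and negative examples" refers to its per-pair objective: the model is rewarded for widening the log-probability margin of a preferred response over a rejected one, relative to a frozen reference model. A minimal stdlib sketch of that loss (from Rafailov et al.'s DPO formulation, not this study's code; `beta` is a typical strength hyperparameter) is:

```python
# Minimal sketch of the per-pair DPO loss:
#   L = -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])
# where "c" is the chosen (preferred) response and "r" the rejected one.
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair; lower when the policy favors
    the chosen response more strongly than the reference model does."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is log 2; it falls as the policy's preference for the chosen response grows beyond the reference's, which is the extra signal SFT alone (next-token likelihood on positive examples only) does not provide.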
{"title":"Comparative Evaluation of a Medical Large Language Model in Answering Real-World Radiation Oncology Questions: Multicenter Observational Study.","authors":"Fabio Dennstädt, Max Schmerder, Elena Riggenbach, Lucas Mose, Katarina Bryjova, Nicolas Bachmann, Paul-Henry Mackeprang, Maiwand Ahmadsei, Dubravko Sinovcic, Paul Windisch, Daniel Zwahlen, Susanne Rogers, Oliver Riesterer, Martin Maffei, Eleni Gkika, Hathal Haddad, Jan Peeken, Paul Martin Putora, Markus Glatzer, Florian Putz, Daniel Hoefler, Sebastian M Christ, Irina Filchenko, Janna Hastings, Roberto Gaio, Lawrence Chiang, Daniel M Aebersold, Nikola Cihoric","doi":"10.2196/69752","DOIUrl":"10.2196/69752","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) hold promise for supporting clinical tasks, particularly in data-driven and technical disciplines such as radiation oncology. While prior evaluation studies have focused on examination-style settings for evaluating LLMs, their performance in real-life clinical scenarios remains unclear. In the future, LLMs might be used as general AI assistants to answer questions arising in clinical practice. It is unclear how well a modern LLM, locally executed within the infrastructure of a hospital, would answer such questions compared with clinical experts.</p><p><strong>Objective: </strong>This study aimed to assess the performance of a locally deployed, state-of-the-art medical LLM in answering real-world clinical questions in radiation oncology compared with clinical experts. The aim was to evaluate the overall quality of answers, as well as the potential harmfulness of the answers if used for clinical decision-making.</p><p><strong>Methods: </strong>Physicians from 10 departments of European hospitals collected questions arising in the clinical practice of radiation oncology. 
Fifty of these questions were answered by 3 senior radiation oncology experts with at least 10 years of work experience, as well as the LLM Llama3-OpenBioLLM-70B (Ankit Pal and Malaikannan Sankarasubbu). In a blinded review, physicians rated the overall answer quality on a 5-point Likert scale (quality), assessed whether an answer might be potentially harmful if used for clinical decision-making (harmfulness), and determined if responses were from an expert or the LLM (recognizability). Comparisons between clinical experts and LLMs were then made for quality, harmfulness, and recognizability.</p><p><strong>Results: </strong>There were no significant differences in answer quality between the LLM and the clinical experts (mean scores of 3.38 vs 3.63; median 4.00, IQR 3.00-4.00 vs median 3.67, IQR 3.33-4.00; P=.26; Wilcoxon signed rank test). The answers were deemed potentially harmful in 13% of cases for the clinical experts compared with 16% of cases for the LLM (P=.63; Fisher exact test). Physicians correctly identified whether an answer was given by a clinical expert or an LLM in 78% and 72% of cases, respectively.</p><p><strong>Conclusions: </strong>A state-of-the-art medical LLM can answer real-life questions from the clinical practice of radiation oncology about as well as clinical experts with regard to overall quality and potential harmfulness. Such LLMs can already be deployed within the local hospital environment at an affordable cost. While LLMs may not yet be ready for clinical implementation as general AI assistants, the technology continues to improve at a rapid pace. Evaluation studies based on real-life situations are important to better understand the weaknesses and limitations of LLMs in clinical practice. 
Such studies are also crucial to define when the technology is ready for clinical implementation.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e69752"},"PeriodicalIF":6.0,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12504895/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145130984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
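The harmfulness comparison in the study above (13% vs 16%, P=.63) uses the Fisher exact test on a 2×2 table. A stdlib-only sketch of the two-sided version is below; the procedure (summing hypergeometric probabilities no larger than the observed table's) is standard, but this is the editor's illustration, not the study's analysis code, and the example counts are not the study's raw data.

```python
# Sketch: two-sided Fisher exact test on a 2x2 contingency table
# [[a, b], [c, d]], via the hypergeometric distribution. Pure stdlib.
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables with the same margins whose
    probability does not exceed that of the observed table."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):  # P(top-left cell == x) given fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))  # smallest feasible top-left cell
    hi = min(row1, col1)            # largest feasible top-left cell
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)
```

The exact test is preferred over a chi-square approximation here because the "harmful" cell counts in a 50-question sample are small.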