{"title":"Training Language Models for Estimating Priority Levels in Ultrasound Examination Waitlists: Algorithm Development and Validation.","authors":"Kanato Masayoshi, Masahiro Hashimoto, Naoki Toda, Hirozumi Mori, Goh Kobayashi, Hasnine Haque, Mizuki So, Masahiro Jinzaki","doi":"10.2196/68020","DOIUrl":"https://doi.org/10.2196/68020","url":null,"abstract":"<p><strong>Background: </strong>Ultrasound examinations, while valuable, are time-consuming and often limited in availability. Consequently, many hospitals implement reservation systems; however, these systems typically lack prioritization for examination purposes. Hence, our hospital uses a waitlist system that prioritizes examination requests based on their clinical value when slots become available due to cancellations. This system, however, requires a manual review of examination purposes, which are recorded in free-form text. We hypothesized that artificial intelligence language models could preliminarily estimate the priority of requests before manual reviews.</p><p><strong>Objective: </strong>This study aimed to investigate potential challenges associated with using language models for estimating the priority of medical examination requests and to evaluate the performance of language models in processing Japanese medical texts.</p><p><strong>Methods: </strong>We retrospectively collected ultrasound examination requests from the waitlist system at Keio University Hospital, spanning January 2020 to March 2023. Each request comprised an examination purpose documented by the requesting physician and a 6-tier priority level assigned by a radiologist during the clinical workflow. We fine-tuned JMedRoBERTa, Luke, OpenCalm, and LLaMA2 under two conditions: (1) tuning only the final layer and (2) tuning all layers using either standard backpropagation or low-rank adaptation.</p><p><strong>Results: </strong>We had 2335 and 204 requests in the training and test datasets post cleaning. When only the final layers were tuned, JMedRoBERTa outperformed the other models (Kendall coefficient=0.225). With full fine-tuning, JMedRoBERTa continued to perform best (Kendall coefficient=0.254), though with reduced margins compared with the other models. The radiologist's retrospective re-evaluation yielded a Kendall coefficient of 0.221.</p><p><strong>Conclusions: </strong>Language models can estimate the priority of examination requests with accuracy comparable with that of human radiologists. The fine-tuning results indicate that general-purpose language models can be adapted to domain-specific texts (ie, Japanese medical texts) with sufficient fine-tuning. Further research is required to address priority rank ambiguity, expand the dataset across multiple institutions, and explore more recent language models with potentially higher performance or better suitability for this task.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68020"},"PeriodicalIF":0.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144692629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural Language Processing for Identification of Hospitalized People Who Use Drugs: Cohort Study.","authors":"Taisuke Sato, Emily D Grussing, Ruchi Patel, Jessica Ridgway, Joji Suzuki, Benjamin Sweigart, Robert Miller, Alysse G Wurcel","doi":"10.2196/63147","DOIUrl":"https://doi.org/10.2196/63147","url":null,"abstract":"<p><strong>Background: </strong>People who use drugs (PWUD) are at heightened risk of severe injection-related infections. Current research relies on billing codes to identify PWUD-a methodology with suboptimal accuracy that may underestimate the economic, racial, and ethnic diversity of hospitalized PWUD.</p><p><strong>Objective: </strong>The goal of this study is to examine the impact of natural language processing (NLP) on enhancing identification of PWUD in electronic medical records, with a specific focus on determining improved systems of identifying populations who may previously been missed, including people who have low income or those from racially and ethnically minoritized populations.</p><p><strong>Methods: </strong>Health informatics specialists assisted in querying a cohort of likely PWUD hospital admissions at Tufts Medical Center between 2020-2022 using the following criteria: (1) ICD-10 codes indicative of drug use, (2) positive drug toxicology results, (3) prescriptions for medications for opioid use disorder, and (4) applying NLP-detected presence of \"token\" keywords in the electronic medical records likely indicative of the patient being a PWUD. Hospital admissions were split into two groups: highly documented (all four criteria present) and minimally documented (NLP-only). These groups were examined to assess the impact of race, ethnicity, and social vulnerability index. With chart review as the \"gold standard,\" the positive predictive value was calculated.</p><p><strong>Results: </strong>The cohort included 4548 hospitalization admissions, with broad heterogeneity in how people entered the cohort and subcohorts; a total of 288 hospital admissions entered the cohort through NLP token presence alone. NLP demonstrated a 54% positive predictive value, outperforming biomarkers, prescription for medications for opioid use disorder, and ICD codes in identifying hospitalizations of PWUD. Additionally, NLP significantly enhanced these methods when integrated into the identification algorithm. The study also found that people from racially and ethnically minoritized communities and those with lower social vulnerability index were significantly more likely to have lower rates of PWUD-related documentation.</p><p><strong>Conclusions: </strong>NLP proved effective in identifying hospitalizations of PWUD, surpassing traditional methods. While further refinement is needed, NLP shows promising potential in minimizing health care disparities.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e63147"},"PeriodicalIF":0.0,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144664120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI-SDM: A Concept of Integrating AI Reasoning into Shared Decision-Making.","authors":"Mohammed As'ad, Nawarh Faran, Hala Joharji","doi":"10.2196/75866","DOIUrl":"https://doi.org/10.2196/75866","url":null,"abstract":"<p><strong>Unstructured: </strong>Shared decision-making is central to patient-centered care but is often hampered by AI systems that focus on technical transparency rather than delivering context-rich, clinically meaningful reasoning. Although XAI methods elucidate how decisions are made, they fall short in addressing the \"why\" that supports effective patient-clinician dialogue. To bridge this gap, we introduce AI-SDM, a conceptual framework designed to integrate AI-based reasoning into Shared decision-making to enhance care quality while preserving patient autonomy. AI-SDM is a structured, multi-model framework that synthesizes predictive modelling, evidence-based recommendations, and generative AI techniques to produce adaptive, context-sensitive explanations. The framework distinguishes conventional AI explainability from AI reasoning-prioritizing the generation of tailored, narrative justifications that inform shared decisions. A hypothetical clinical scenario in stroke management is used to illustrate how AI-SDM facilitates an iterative, triadic deliberation process between healthcare providers, patients, and AI outputs. This integration is intended to transform raw algorithmic data into actionable insights that directly support the decision-making process without supplanting human judgment.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144593085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Learning Multi Modal Melanoma Detection: Algorithm Development and Validation.","authors":"Nithika Vivek, Karthik Ramesh","doi":"10.2196/66561","DOIUrl":"https://doi.org/10.2196/66561","url":null,"abstract":"<p><strong>Background: </strong>The visual similarity of melanoma and seborrheic keratosis has made it difficult for elderly patients with disabilities to know when to seek medical attention, contributing to the metastasis of melanoma.</p><p><strong>Objective: </strong>In this paper, we present a novel multi-modal deep learning-based technique to distinguish between melanoma and seborrheic keratosis.</p><p><strong>Methods: </strong>Our strategy is three-fold: (1) utilize patient image data to train and test three deep learning models using transfer learning (ResNet50, InceptionV3, and VGG16) and one author designed model, (2) use patient metadata to train and test a deep learning model, and (3) combine the predictions of the image model with the best accuracy and the metadata model, using nonlinear least squares regression to specify ideal weights to each model for a combined prediction.</p><p><strong>Results: </strong>The accuracy of the combined model was 88% (195/221 classified correctly) on test data from the HAM10000 dataset. Model reliability was assessed by visualizing the output activation map of each model and comparing the diagnosis patterns to that of dermatologists. The addition of metadata to the image dataset was key to reducing the false negative and false positive rate simultaneously, thereby producing better metrics and improving overall model accuracy.</p><p><strong>Conclusions: </strong>Results from this experiment could be used to eliminate late diagnosis of melanoma via easy access to an app. Future experiments can utilize text data (subjective data pertaining to how the patient felt over a certain period of time) to allow this model to reflect the real hospital setting to a greater extent.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144577145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Large Language Models for Accurate Retrieval of Patient Information From Medical Reports: Systematic Evaluation Study.","authors":"Angel Manuel Garcia-Carmona, Maria-Lorena Prieto, Enrique Puertas, Juan-Jose Beunza","doi":"10.2196/68776","DOIUrl":"10.2196/68776","url":null,"abstract":"<p><strong>Background: </strong>The digital transformation of health care has introduced both opportunities and challenges, particularly in managing and analyzing the vast amounts of unstructured medical data generated daily. There is a need to explore the feasibility of generative solutions in extracting data from medical reports, categorized by specific criteria.</p><p><strong>Objective: </strong>This study aimed to investigate the application of large language models (LLMs) for the automated extraction of structured information from unstructured medical reports, using the LangChain framework in Python.</p><p><strong>Methods: </strong>Through a systematic evaluation of leading LLMs-GPT-4o, Llama 3, Llama 3.1, Gemma 2, Qwen 2, and Qwen 2.5-using zero-shot prompting techniques and embedding results into a vector database, this study assessed the performance of LLMs in extracting patient demographics, diagnostic details, and pharmacological data.</p><p><strong>Results: </strong>Evaluation metrics, including accuracy, precision, recall, and F<sub>1</sub>-score, revealed high efficacy across most categories, with GPT-4o achieving the highest overall performance (91.4% accuracy).</p><p><strong>Conclusions: </strong>The findings highlight notable differences in precision and recall between models, particularly in extracting names and age-related information. There were challenges in processing unstructured medical text, including variability in model performance across data types. Our findings demonstrate the feasibility of integrating LLMs into health care workflows; LLMs offer substantial improvements in data accessibility and support clinical decision-making processes. In addition, the paper describes the role of retrieval-augmented generation techniques in enhancing information retrieval accuracy, addressing issues such as hallucinations and outdated data in LLM outputs. Future work should explore the need for optimization through larger and more diverse training datasets, advanced prompting strategies, and the integration of domain-specific knowledge to improve model generalizability and precision.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68776"},"PeriodicalIF":0.0,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12271962/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144556010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Magnetic Resonance Imaging (MRI) Report Comprehension in Spinal Trauma: Readability Analysis of AI-Generated Explanations for Thoracolumbar Fractures.","authors":"David C Sing, Kishan S Shah, Michael Pompliano, Paul H Yi, Calogero Velluto, Ali Bagheri, Robert K Eastlack, Stephen R Stephan, Gregory M Mundis","doi":"10.2196/69654","DOIUrl":"10.2196/69654","url":null,"abstract":"<p><strong>Background: </strong>Magnetic resonance imaging (MRI) reports are challenging for patients to interpret and may subject patients to unnecessary anxiety. The advent of advanced artificial intelligence (AI) large language models (LLMs), such as GPT-4o, hold promise for translating complex medical information into layman terms.</p><p><strong>Objective: </strong>This paper aims to evaluate the accuracy, helpfulness, and readability of GPT-4o in explaining MRI reports of patients with thoracolumbar fractures.</p><p><strong>Methods: </strong>MRI reports of 20 patients presenting with thoracic or lumbar vertebral body fractures were obtained. GPT-4o was prompted to explain the MRI report in layman's terms. The generated explanations were then presented to 7 board-certified spine surgeons for evaluation on the reports' helpfulness and accuracy. The MRI report text and GPT-4o explanations were then analyzed to grade the readability of the texts using the Flesch Readability Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL) Scale.</p><p><strong>Results: </strong>The layman explanations provided by GPT-4o were found to be helpful by all surgeons in 17 cases, with 6 of 7 surgeons finding the information helpful in the remaining 3 cases. ChatGPT-generated layman reports were rated as \"accurate\" by all 7 surgeons in 11/20 cases (55%). In an additional 5/20 cases (25%), 6 out of 7 surgeons agreed on their accuracy. In the remaining 4/20 cases (20%), accuracy ratings varied, with 4 or 5 surgeons considering them accurate. Review of surgeon feedback on inaccuracies revealed that the radiology reports were often insufficiently detailed. The mean FRES score of the MRI reports was significantly lower than the GPT-4o explanations (32.15, SD 15.89 vs 53.9, SD 7.86; P<.001). The mean FKGL score of the MRI reports trended higher compared to the GPT-4o explanations (11th-12th grade vs 10th-11th grade level; P=.11).</p><p><strong>Conclusions: </strong>Overall helpfulness and readability ratings for AI-generated summaries of MRI reports were high, with few inaccuracies recorded. This study demonstrates the potential of GPT-4o to serve as a valuable tool for enhancing patient comprehension of MRI report findings.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e69654"},"PeriodicalIF":0.0,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12231343/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChatGPT-4-Driven Liver Ultrasound Radiomics Analysis: Diagnostic Value and Drawbacks in a Comparative Study.","authors":"Laith R Sultan, Shyam Sunder B Venkatakrishna, Sudha A Anupindi, Savvas Andronikou, Michael R Acord, Hansel J Otero, Kassa Darge, Chandra M Sehgal, John H Holmes","doi":"10.2196/68144","DOIUrl":"10.2196/68144","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is transforming medical imaging, with large language models such as ChatGPT-4 emerging as potential tools for automated image interpretation. While AI-driven radiomics has shown promise in diagnostic imaging, the efficacy of ChatGPT-4 in liver ultrasound analysis remains largely unexamined.</p><p><strong>Objective: </strong>This study aimed to evaluate the capability of ChatGPT-4 in liver ultrasound radiomics, specifically its ability to differentiate fibrosis, steatosis, and normal liver tissue, compared with conventional image analysis software.</p><p><strong>Methods: </strong>Seventy grayscale ultrasound images from a preclinical liver disease model, including fibrosis (n=31), fatty liver (n=18), and normal liver (n=21), were analyzed. ChatGPT-4 extracted texture features, which were compared with those obtained using interactive data language (IDL), a traditional image analysis software. One-way ANOVA was used to identify statistically significant features differentiating liver conditions, and logistic regression models were used to assess diagnostic performance.</p><p><strong>Results: </strong>ChatGPT-4 extracted 9 key textural features-echo intensity, heterogeneity, skewness, kurtosis, contrast, homogeneity, dissimilarity, angular second momentum, and entropy-all of which significantly differed across liver conditions (P<.05). Among individual features, echo intensity achieved the highest F<sub>1</sub>-score (0.85). When combined, ChatGPT-4 attained 76% accuracy and 83% sensitivity in classifying liver disease. Receiver operating characteristic analysis demonstrated strong discriminatory performance, with area under the curve values of 0.75 for fibrosis, 0.87 for normal liver, and 0.97 for steatosis. Compared with IDL image analysis software, ChatGPT-4 exhibited slightly lower sensitivity (0.83 vs 0.89) but showed moderate correlation (r=0.68, P<.001) with IDL-derived features. However, it significantly outperformed IDL in processing efficiency, reducing analysis time by 40%, and highlighting its potential for high throughput radiomic analysis.</p><p><strong>Conclusions: </strong>Despite slightly lower sensitivity than IDL, ChatGPT-4 demonstrated high feasibility for ultrasound radiomics, offering faster processing, high-throughput analysis, and automated multi-image evaluation. These findings support its potential integration into AI-driven imaging workflows, with further refinements needed to enhance feature reproducibility and diagnostic accuracy.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":"e68144"},"PeriodicalIF":0.0,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12260471/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144103214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying New Risk Associations Between Chronic Physical Illness and Mental Health Disorders in China: Machine Learning Approach to a Retrospective Population Analysis.","authors":"Lizhong Liang, Tianci Liu, William Ollier, Yonghong Peng, Yao Lu, Chao Che","doi":"10.2196/72599","DOIUrl":"10.2196/72599","url":null,"abstract":"<p><strong>Background: </strong>The mechanisms underlying the mutual relationships between chronic physical illnesses and mental health disorders, which potentially explain their association, remain unclear. Furthermore, how patterns of this comorbidity evolve over time are significantly underinvestigated.</p><p><strong>Objective: </strong>The main aim of this study was to use machine learning models to model and analyze the complex interplay between mental health disorders and chronic physical illnesses. Another aim was to investigate the evolving longitudinal trajectories of patients' \"health journeys.\" Moreover, the study intended to clarify the variability of comorbidity patterns within the patient population by considering the effects of age and gender in different patient subgroups.</p><p><strong>Methods: </strong>Four machine learning models were used to conduct the analysis of the relationship between mental health disorders and chronic physical illnesses.</p><p><strong>Results: </strong>Through systematic research and in-depth analysis, we found that 5 categories of chronic physical illnesses exhibit a higher risk of comorbidity with mental health disorders. Further analysis of comorbidity intensity revealed correlations between specific disease combinations, with the strongest association observed between prostate diseases and organic mental disorders (relative risk=2.055, Φ=0.212). Additionally, by examining patient subgroups stratified by age and gender, we clarified the variability of comorbidity patterns within the population. These findings highlight the complexity of disease interactions and emphasize the need for targeted monitoring and comprehensive management strategies in clinical practice.</p><p><strong>Conclusions: </strong>Machine learning models can effectively be used to study the comorbidity between mental health disorders and chronic physical illnesses. The identified high-risk chronic physical illness categories for comorbidity, the correlations between disease combinations, and the variability of comorbidity patterns according to age and gender provide valuable insights into the complex relationship between these two types of disorders.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72599"},"PeriodicalIF":0.0,"publicationDate":"2025-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12231344/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144562189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generative AI in Medicine: Pioneering Progress or Perpetuating Historical Inaccuracies? Cross-Sectional Study Evaluating Implicit Bias.","authors":"Philip Sutera, Rohini Bhatia, Timothy Lin, Leslie Chang, Andrea Brown, Reshma Jagsi","doi":"10.2196/56891","DOIUrl":"10.2196/56891","url":null,"abstract":"<p><strong>Background: </strong>Generative artificial intelligence (gAI) models, such as DALL-E 2, are promising tools that can generate novel images or artwork based on text input. However, caution is warranted, as these tools generate information based on historical data and are thus at risk of propagating past learned inequities. Women in medicine have routinely been underrepresented in academic and clinical medicine and the stereotype of a male physician persists.</p><p><strong>Objective: </strong>The primary objective is to evaluate implicit bias among gAI across medical specialties.</p><p><strong>Methods: </strong>To evaluate for potential implicit bias, 100 photographs for each medical specialty were generated using the gAI platform DALL-E2. For each specialty, DALL-E2 was queried with \"An American [specialty name].\" Our primary endpoint was to compare the gender distribution of gAI photos to the current distribution in the United States. Our secondary endpoint included evaluating the racial distribution. gAI photos were classified according to perceived gender and race based on a unanimous consensus among a diverse group of medical residents. The proportion of gAI women subjects was compared for each medical specialty to the most recent Association of American Medical Colleges report for physician workforce and active residents using χ2 analysis.</p><p><strong>Results: </strong>A total of 1900 photos across 19 medical specialties were generated. Compared to physician workforce data, AI significantly overrepresented women in 7/19 specialties and underrepresented women in 6/19 specialties. Women were significantly underrepresented compared to the physician workforce by 18%, 18%, and 27% in internal medicine, family medicine, and pediatrics, respectively. Compared to current residents, AI significantly underrepresented women in 12/19 specialties, ranging from 10% to 36%. Additionally, women represented <50% of the demographic for 17/19 specialties by gAI.</p><p><strong>Conclusions: </strong>gAI created a sample population of physicians that underrepresented women when compared to both the resident and active physician workforce. Steps must be taken to train datasets in order to represent the diversity of the incoming physician workforce.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e56891"},"PeriodicalIF":0.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223688/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144556008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digital Phenotyping for Detecting Depression Severity in a Large Payor-Provider System: Retrospective Study of Speech and Language Model Performance.","authors":"Bradley Karlin, Doug Henry, Ryan Anderson, Salvatore Cieri, Michael Aratow, Elizabeth Shriberg, Michelle Hoy","doi":"10.2196/69149","DOIUrl":"10.2196/69149","url":null,"abstract":"<p><strong>Background: </strong>There is considerable need to improve and increase the detection and measurement of depression. The use of speech as a digital biomarker of depression represents a considerable opportunity for transforming and accelerating depression identification and treatment; however, research to date has primarily consisted of small-sample feasibility or pilot studies incorporating highly controlled applications and settings. There has been limited examination of the technology in real-world use contexts.</p><p><strong>Objective: </strong>This study evaluated the performance of a machine learning (ML) model examining both semantic and acoustic properties of speech in predicting depression across more than 2000 real-world interactions between health plan members and case managers.</p><p><strong>Methods: </strong>A total of 2086 recordings of case management calls with verbally administered Patient Health Questionnaire-9 questions (PHQ-9) surveys were analyzed using the ML model after the portions of the recordings with the PHQ-9 survey were manually redacted. The recordings were divided into a Development Set (Dev Set) (n=1336) and a Blind Set (n=671), and Patient Health Questionnaire-8 questions (PHQ-8) scores were provided for the Dev Set for ML model refinement while PHQ-8 scores from the Blind Set were withheld until after ML model depression severity output was reported.</p><p><strong>Results: </strong>The Dev Set and the Blind Set were well matched for age (Dev Set: mean 53.7, SD 16.3 years; Blind Set: mean 51.7, SD 16.9 years), gender (Dev Set: 910/1336, 68.1% of female participants; Blind Set: 462/671, 68.9% of female participants), and depression severity (Dev Set: mean 10.5, SD 6.1 of PHQ-8 scores; Blind Set: mean 10.9, SD 6.0 of PHQ-8 scores). The concordance correlation coefficient was ρc=0.57 for the test of the ML model on the Dev Set and ρc=0.54 on the Blind Set, while the mean absolute error was 3.91 for the Dev Set and 4.06 for the Blind Set, demonstrating strong model performance. This performance was maintained when dividing each set into subgroups of age brackets (≤39, 40-64, and ≥65 years), biological sex, and the 4 categories of Social Vulnerability Index (an index based on 16 social factors), with concordance correlation coefficients ranging as ρc=0.44-0.61. 
Performance at PHQ-8 threshold score cutoffs of 5, 10, 15, and 20, representing the depression severity categories of none, mild, moderate, moderately severe, and severe (≥20), respectively, expressed as area under the receiver operating characteristic curve values, varied between 0.79 and 0.83 in both the Dev and Blind Sets.</p><p><strong>Conclusions: </strong>Overall, the findings suggest that speech may have significant potential for detection and measurement of depression severity over a variety of ages, gender, and socioeconomic categories that may enhance treatment, improve clinical decision-making, and enable truly personalized treatment recomm","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e69149"},"PeriodicalIF":0.0,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223686/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144556097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
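The agreement metrics reported above, Lin's concordance correlation coefficient (CCC) and mean absolute error, can be computed as in the sketch below; the scores are synthetic placeholders, not study data.

```python
# Lin's concordance correlation coefficient and mean absolute error between
# model-estimated depression scores and PHQ-8 totals.
import numpy as np

def concordance_correlation(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    covariance = np.mean((y_true - y_true.mean()) * (y_pred - y_pred.mean()))
    return 2 * covariance / (
        y_true.var() + y_pred.var() + (y_true.mean() - y_pred.mean()) ** 2
    )

phq8 = np.array([3, 8, 12, 17, 21, 6, 14, 9])          # hypothetical PHQ-8 totals
model_scores = np.array([5, 7, 10, 15, 18, 8, 16, 11])  # hypothetical model estimates
print(f"CCC={concordance_correlation(phq8, model_scores):.2f}, "
      f"MAE={np.mean(np.abs(phq8 - model_scores)):.2f}")
```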