Large Language Models for Thematic Summarization in Qualitative Health Care Research: Comparative Analysis of Model and Human Performance
Arturo Castellanos, Haoqiang Jiang, Paulo Gomes, Debra Vander Meer, Alfred Castillo
JMIR AI 2025;4:e64447. Published 2025-04-04. DOI: 10.2196/64447. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12231516/pdf/

Background: The application of large language models (LLMs) in analyzing expert textual online data is a topic of growing importance in computational linguistics and qualitative research within health care settings.

Objective: The objective of this study was to understand how LLMs can help analyze expert textual data. Topic modeling enables scaling the thematic analysis of a large corpus of data, but it still requires interpretation. We investigate the use of LLMs to help researchers scale this interpretation.

Methods: The primary methodological phases of this project were (1) collecting, cleaning, and preprocessing data representing posts to an online nurse forum; (2) using latent Dirichlet allocation (LDA) to derive topics; (3) using human categorization for topic modeling; and (4) using LLMs to complement and scale the interpretation of thematic analysis. The purpose is to compare the outcomes of human interpretation with those derived from LLMs.

Results: There is substantial agreement (247/310, 80%) between LLM and human interpretation. For two-thirds of the topics, human evaluation and LLMs agree on alignment and convergence of themes. Furthermore, LLM subthemes offer depth of analysis within LDA topics, providing detailed explanations that align with and build upon established human themes. Nonetheless, LLMs identify coherence and complementarity where human evaluation does not.

Conclusions: LLMs enable the automation of the interpretation task in qualitative research. There are challenges in the use of LLMs for evaluation of the resulting themes.
{"title":"Insights on the Side Effects of Female Contraceptive Products From Online Drug Reviews: Natural Language Processing-Based Content Analysis.","authors":"Nicole Groene, Audrey Nickel, Amanda E Rohn","doi":"10.2196/68809","DOIUrl":"10.2196/68809","url":null,"abstract":"<p><strong>Background: </strong>Most online and social media discussions about birth control methods for women center on side effects, highlighting a demand for shared experiences with these products. Online user reviews and ratings of birth control products offer a largely untapped supplementary resource that could assist women and their partners in making informed contraception choices.</p><p><strong>Objective: </strong>This study sought to analyze women's online ratings and reviews of various birth control methods, focusing on side effects linked to low product ratings.</p><p><strong>Methods: </strong>Using natural language processing (NLP) for topic modeling and descriptive statistics, this study analyzes 19,506 unique reviews of female contraceptive products posted on the website Drugs.com.</p><p><strong>Results: </strong>Ratings vary widely across contraception types. Hormonal contraceptives with high systemic absorption, such as progestin-only pills and extended-cycle pills, received more unfavorable reviews than other methods and women frequently described menstrual irregularities, continuous bleeding, and weight gain associated with their administration. Intrauterine devices were generally rated more positively, although about 1 in 10 users reported severe cramps and pain, which were linked to very poor ratings.</p><p><strong>Conclusions: </strong>While exploratory, this study highlights the potential of NLP in analyzing extensive online reviews to reveal insights into women's experiences with contraceptives and the impact of side effects on their overall well-being. In addition to results from clinical studies, NLP-derived insights from online reviews can provide complementary information for women and health care providers, despite possible biases in online reviews. The findings suggest a need for further research to validate links between specific side effects, contraceptive methods, and women's overall well-being.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68809"},"PeriodicalIF":0.0,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12006776/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143782240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harnessing Moderate-Sized Language Models for Reliable Patient Data Deidentification in Emergency Department Records: Algorithm Development, Validation, and Implementation Study
Océane Dorémus, Dylan Russon, Benjamin Contrand, Ariel Guerra-Adames, Marta Avalos-Fernandez, Cédric Gil-Jardiné, Emmanuel Lagarde
JMIR AI 2025;4:e57828. Published 2025-04-01. DOI: 10.2196/57828. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223680/pdf/

Background: The digitization of health care, facilitated by the adoption of electronic health record systems, has revolutionized data-driven medical research and patient care. While this digital transformation offers substantial benefits in health care efficiency and accessibility, it concurrently raises significant concerns over privacy and data security. Patient data protection initially saw a transition from rule-based systems to mixed approaches that include machine learning for deidentifying patient data. Subsequently, the emergence of large language models has represented a further opportunity in this domain, offering unparalleled potential for enhancing the accuracy of context-sensitive deidentification. However, despite their significant potential, deployment of the most advanced models in hospital environments is frequently hindered by data security issues and the extensive hardware resources required.

Objective: The objective of our study is to design, implement, and evaluate deidentification algorithms using fine-tuned moderate-sized open-source language models, ensuring their suitability for production inference tasks on personal computers.

Methods: We aimed to replace personal identifying information (PII) with generic placeholders or to label texts without PII as "ANONYMOUS," ensuring privacy while preserving textual integrity. Our dataset, derived from over 425,000 clinical notes from the adult emergency department of the Bordeaux University Hospital in France, underwent independent double annotation by 2 experts to create a validation reference of 3000 randomly selected clinical notes. Three open-source language models of manageable size were selected for their feasibility in hospital settings: Llama 2 7B (Meta), Mistral 7B, and Mixtral 8×7B (Mistral AI). Fine-tuning used the quantized low-rank adaptation (QLoRA) technique. Evaluation focused on PII-level metrics (recall, precision, and F1-score) and clinical note-level metrics (recall and the BLEU [bilingual evaluation understudy] metric), assessing deidentification effectiveness and content preservation.

Results: The generative model Mistral 7B performed best, with an overall F1-score of 0.9673 (vs 0.8750 for Llama 2 and 0.8686 for Mixtral 8×7B). At the clinical note level, its overall recall was 0.9326 (vs 0.6888 for Llama 2 and 0.6417 for Mixtral 8×7B). This rate increased to 0.9915 when Mistral 7B only deleted names. Four of the 3000 notes failed to be fully pseudonymized for names: in 1 case, the nondeleted name belonged to a patient, while in the others, it belonged to medical staff. Beyond the fifth epoch, the BLEU score consistently exceeded 0.9864, indicating no significant text alteration.

Conclusions: Our research underscores the significant capabilities of generative natural language processing …
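The QLoRA setup named in the methods (4-bit quantization plus low-rank adapters, which is what makes 7B-parameter models trainable and deployable on modest hardware) can be sketched with the Hugging Face transformers and peft libraries. The hyperparameters, target modules, and the instruction format below are assumptions for illustration; the paper may have used different values.

```python
# Sketch of a QLoRA fine-tuning setup for deidentification with Mistral 7B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # 4-bit quantization keeps memory needs modest
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # assumed values
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],      # assumed adapter placement
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()            # only the adapters are trained

# Training pairs take the shape the abstract describes: note in, pseudonymized
# note out, with PII replaced by placeholders (example is invented).
example = {
    "input": "Patient Jean Dupont seen on 12/03/2021 by Dr. Martin.",
    "output": "Patient [NAME] seen on [DATE] by Dr. [NAME].",
}
```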
Generative Large Language Model-Powered Conversational AI App for Personalized Risk Assessment: Case Study in COVID-19
Mohammad Amin Roshani, Xiangyu Zhou, Yao Qiang, Srinivasan Suresh, Steven Hicks, Usha Sethuraman, Dongxiao Zhu
JMIR AI 2025;4:e67363. Published 2025-03-27. DOI: 10.2196/67363. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11986386/pdf/

Background: Large language models (LLMs) have demonstrated powerful capabilities in natural language tasks and are increasingly being integrated into health care for tasks like disease risk assessment. Traditional machine learning methods rely on structured data and coding, limiting their flexibility in dynamic clinical environments. This study presents a novel approach to disease risk assessment using generative LLMs through conversational artificial intelligence (AI), eliminating the need for programming.

Objective: This study evaluates the use of pretrained generative LLMs, including LLaMA2-7b and Flan-T5-xl, for COVID-19 severity prediction, with the goal of enabling a real-time, no-code risk assessment solution through chatbot-based, question-answering interactions. To contextualize their performance, we compare LLMs with traditional machine learning classifiers, such as logistic regression, extreme gradient boosting (XGBoost), and random forest, which rely on tabular data.

Methods: We fine-tuned LLMs using few-shot natural language examples from a dataset of 393 pediatric patients, developing a mobile app that integrates these models to provide real-time, no-code COVID-19 severity risk assessment through clinician-patient interaction. The LLMs were compared with traditional classifiers across different experimental settings, using the area under the curve (AUC) as the primary evaluation metric. Feature importance derived from LLM attention layers was also analyzed to enhance interpretability.

Results: Generative LLMs demonstrated strong performance in low-data settings. In zero-shot scenarios, the T0-3b-T model achieved an AUC of 0.75, while other LLMs, such as T0pp(8bit)-T and Flan-T5-xl-T, reached 0.67 and 0.69, respectively. In 2-shot settings, logistic regression and random forest achieved an AUC of 0.57, while Flan-T5-xl-T and T0-3b-T obtained 0.69 and 0.65, respectively. By 32-shot settings, Flan-T5-xl-T reached 0.70, similar to logistic regression (0.69) and random forest (0.68), while XGBoost improved to 0.65. These results illustrate the differences in how generative LLMs and traditional models handle increasing data availability. LLMs perform well in low-data scenarios, whereas traditional models rely more on structured tabular data and labeled training examples. Furthermore, the mobile app provides real-time COVID-19 severity assessments and personalized insights through attention-based feature importance, adding value to the clinical interpretation of the results.

Conclusions: Generative LLMs provide a robust alternative to traditional classifiers, particularly in scenarios with limited labeled data. Their ability to handle unstructured inputs and deliver personalized, real-time assessments without coding makes them highly adaptable to clinical settings. This study underscores the potential of LLM-powered conversational …
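The core idea (verbalize a patient's tabular features into a prompt and let an LLM score a severity label, then evaluate with AUC) can be illustrated without fine-tuning. The sketch below uses zero-shot Flan-T5 label-word scoring as a stand-in for the paper's fine-tuned few-shot models; the feature wording, label tokens, and toy patients are assumptions.

```python
# Sketch: prompt-based severity scoring with a seq2seq LLM, evaluated by AUC.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sklearn.metrics import roc_auc_score

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

def severity_probability(features: str) -> float:
    prompt = f"Patient: {features}. Will this COVID-19 case be severe or mild? Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    # First subword of each candidate label word.
    label_ids = [tokenizer("severe").input_ids[0], tokenizer("mild").input_ids[0]]
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        out = model(**inputs, decoder_input_ids=start)
    logits = out.logits[0, -1, label_ids]           # scores for "severe" vs "mild"
    return torch.softmax(logits, dim=0)[0].item()   # P("severe")

patients = ["age 14, oxygen saturation 91%, asthma", "age 9, no comorbidities"]
labels = [1, 0]                                     # invented outcomes
scores = [severity_probability(p) for p in patients]
print(roc_auc_score(labels, scores))                # AUC, the paper's primary metric
```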
Limitations of Binary Classification for Long-Horizon Diagnosis Prediction and Advantages of a Discrete-Time Time-to-Event Approach: Empirical Analysis
De Rong Loh, Elliot D Hill, Nan Liu, Geraldine Dawson, Matthew M Engelhard
JMIR AI 2025;4:e62985. Published 2025-03-27. DOI: 10.2196/62985. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223692/pdf/

Background: A major challenge in using electronic health records (EHRs) is the inconsistency of patient follow-up, resulting in right-censored outcomes. This becomes particularly problematic in long-horizon event prediction, such as autism and attention-deficit/hyperactivity disorder (ADHD) diagnoses, where a significant number of patients are lost to follow-up before the outcome can be observed. Consequently, fully supervised methods such as binary classification (BC), which are trained to predict observed diagnoses, are substantially affected by the probability of sufficient follow-up, leading to biased results.

Objective: This empirical analysis aims to characterize BC's inherent limitations for long-horizon diagnosis prediction from EHRs and to quantify the benefits of a specific time-to-event (TTE) approach, the discrete-time neural network (DTNN).

Methods: Records within the Duke University Health System EHR were analyzed, extracting features such as ICD-10 (International Classification of Diseases, Tenth Revision) diagnosis codes, medications, laboratory results, and procedures. We compared a DTNN to 3 BC approaches and a deep Cox proportional hazards (DCPH) model across 4 clinical conditions to examine distributional patterns across various subgroups. Time-varying area under the receiver operating characteristic curve (AUCt) and time-varying average precision (APt) were the primary evaluation metrics.

Results: TTE models consistently had comparable or higher AUCt and APt than BC for all conditions. At clinically relevant operating time points, the AUC values for DTNN(YOB≤2020; YOB: year of birth) and DCPH(YOB≤2020) were 0.70 (95% CI 0.66-0.77) and 0.72 (95% CI 0.66-0.78) at t=5 for autism; 0.72 (95% CI 0.65-0.76) and 0.68 (95% CI 0.62-0.74) at t=7 for ADHD; 0.72 (95% CI 0.70-0.75) and 0.71 (95% CI 0.69-0.74) at t=1 for recurrent otitis media; and 0.74 (95% CI 0.68-0.82) and 0.71 (95% CI 0.63-0.77) at t=1 for food allergy, compared to 0.60 (95% CI 0.55-0.66), 0.47 (95% CI 0.40-0.54), 0.73 (95% CI 0.70-0.75), and 0.77 (95% CI 0.71-0.82) for BC(YOB≤2020), respectively. The probabilities predicted by BC models were positively correlated with censoring times, particularly for autism and ADHD prediction. Filtering strategies based on YOB or length of follow-up only partially corrected these biases. In subgroup analyses, only the DTNN predicted diagnosis probabilities that accurately reflect actual clinical prevalence and temporal trends.

Conclusions: BC models substantially underpredicted diagnosis likelihood and inappropriately assigned lower probability scores to individuals with earlier censoring. Common filtering strategies did not adequately address this limitation. TTE approaches, particularly the DTNN, effectively mitigated bias from the censoring distribution, resulting in …
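The key mechanism that lets a discrete-time TTE model handle right censoring is worth spelling out: the network outputs one hazard per follow-up interval, and a censored patient contributes likelihood terms only for the intervals in which they were actually observed. The architecture, bin count, and toy data below are illustrative assumptions in the spirit of a DTNN, not the authors' exact model.

```python
# Sketch of a discrete-time time-to-event network with a censoring-aware loss.
import torch
import torch.nn as nn

NUM_BINS = 10  # e.g., years of follow-up

class DTNN(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 64), nn.ReLU(),
            nn.Linear(64, NUM_BINS),   # one hazard logit per time interval
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # per-interval hazard h_t

def dtnn_loss(hazards, event_bin, observed):
    """Negative log-likelihood under right censoring.

    hazards: (batch, NUM_BINS); event_bin: interval of event or censoring;
    observed: 1 if the diagnosis was observed, 0 if the patient was censored.
    """
    loss = 0.0
    for h, t, e in zip(hazards, event_bin, observed):
        # Survive every interval before t ...
        loss -= torch.log(1 - h[:t] + 1e-8).sum()
        # ... and, only if the event was observed, experience it in interval t.
        if e == 1:
            loss -= torch.log(h[t] + 1e-8)
    return loss / len(hazards)

model = DTNN(num_features=5)
hazards = model(torch.randn(4, 5))
print(dtnn_loss(hazards, event_bin=[3, 7, 2, 9], observed=[1, 0, 1, 0]))
```

This is exactly where binary classification goes wrong: BC must assign a censored patient a hard 0 label, whereas the discrete-time likelihood simply stops accumulating terms at the censoring time.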
Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation
Per Niklas Waaler, Musarrat Hussain, Igor Molchanov, Lars Ailo Bongo, Brita Elvevåg
JMIR AI 2025;e69820. Published 2025-03-26. DOI: 10.2196/69820. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11982747/pdf/

Background: People with schizophrenia often present with cognitive impairments that may hinder their ability to learn about their condition. Education platforms powered by large language models (LLMs) have the potential to improve the accessibility of mental health information. However, the black-box nature of LLMs raises ethical and safety concerns regarding the controllability of chatbots. In particular, prompt-engineered chatbots may drift from their intended role as the conversation progresses and become more prone to hallucinations.

Objective: This study aimed to develop and evaluate a critical analysis filter (CAF) system that ensures that an LLM-powered prompt-engineered chatbot reliably complies with its predefined instructions and scope while delivering validated mental health information.

Methods: For a proof of concept, we prompt engineered an educational chatbot for schizophrenia powered by GPT-4 that could dynamically access information from a schizophrenia manual written for people with schizophrenia and their caregivers. In the CAF, a team of prompt-engineered LLM agents was used to critically analyze and refine the chatbot's responses and deliver real-time feedback to the chatbot. To assess the ability of the CAF to re-establish the chatbot's adherence to its instructions, we generated 3 conversations (by conversing with the chatbot with the CAF disabled) wherein the chatbot started to drift from its instructions toward various unintended roles. We used these checkpoint conversations to initialize automated conversations between the chatbot and adversarial chatbots designed to entice it toward unintended roles. Conversations were repeatedly sampled with the CAF enabled and disabled. In total, 3 human raters independently rated each chatbot response according to criteria developed to measure the chatbot's integrity, specifically, its transparency (such as admitting when a statement lacked explicit support from its scripted sources) and its tendency to faithfully convey the scripted information in the schizophrenia manual.

Results: In total, 36 responses (3 checkpoint conversations, 3 conversations per checkpoint, and 4 adversarial queries per conversation) were rated for compliance with the CAF enabled and disabled. Activating the CAF resulted in a compliance score that was considered acceptable (≥2) in 81% (29/36) of the responses, compared with only 8.3% (3/36) when the CAF was deactivated.

Conclusions: Although more rigorous testing in realistic scenarios is needed, our results suggest that self-reflection mechanisms could enable LLMs to be used effectively and safely in educational mental health platforms. This approach harnesses the flexibility of LLMs while reliably constraining their scope to appropriate and accurate interactions.
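A CAF-style loop is conceptually simple: a critic agent reviews each draft response against the chatbot's instructions, and failing drafts are regenerated with the critique as feedback. The sketch below is a loose conceptual rendering, not the authors' system; the prompts, the scoring scale, the acceptance threshold, and the use of the OpenAI client as GPT-4 access are all assumptions.

```python
# Conceptual sketch of a critical analysis filter around a prompt-engineered chatbot.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

CHATBOT_ROLE = ("You are an educational chatbot for schizophrenia. Only convey "
                "information from the provided manual; admit when it is silent.")
CRITIC_ROLE = ("You check a draft reply for role drift and unsupported claims. "
               "Reply with a 1-5 compliance score, then one line of feedback.")

def answer_with_caf(question: str, max_rounds: int = 2) -> str:
    draft = ask(CHATBOT_ROLE, question)
    for _ in range(max_rounds):
        review = ask(CRITIC_ROLE, f"Question: {question}\nDraft: {draft}")
        first = review.strip()[:1]
        score = int(first) if first.isdigit() else 0  # crude parse; a real system would validate
        if score >= 4:  # assumed acceptance threshold
            break
        # Feed the critique back so the chatbot can re-anchor to its instructions.
        draft = ask(CHATBOT_ROLE, f"{question}\nRevise your answer using this feedback: {review}")
    return draft
```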
Disease Prediction Using Machine Learning on Smartphone-Based Eye, Skin, and Voice Data: Scoping Review
Research Dawadi, Mai Inoue, Jie Ting Tay, Agustin Martin-Morales, Thien Vu, Michihiro Araki
JMIR AI 2025;4:e59094. Published 2025-03-25. DOI: 10.2196/59094. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11979540/pdf/

Background: The application of machine learning methods to data generated by ubiquitous devices like smartphones presents an opportunity to enhance the quality of health care and diagnostics. Smartphones are ideal for gathering data easily, providing quick feedback on diagnoses, and proposing interventions for health improvement.

Objective: We reviewed the existing literature to gather studies that have used machine learning models with smartphone-derived data for the prediction and diagnosis of health anomalies. We divided the studies into those that used machine learning models by conducting experiments to retrieve data and predict diseases, and those that used machine learning models on publicly available databases. The details of databases, experiments, and machine learning models are intended to help researchers working in the fields of machine learning and artificial intelligence in the health care domain. Researchers can use the information to design their experiments or determine the databases they could analyze.

Methods: A comprehensive search of the PubMed and IEEE Xplore databases was conducted, and an in-house keyword screening method was used to filter the articles based on the content of their titles and abstracts. Subsequently, studies related to the 3 areas of voice, skin, and eye were selected and analyzed based on how data for machine learning models were extracted (ie, the use of publicly available databases or through experiments). The machine learning methods used in each study were also noted.

Results: A total of 49 studies were identified as being relevant to the topic of interest, and among these studies, there were 31 different databases and 24 different machine learning methods.

Conclusions: The results provide a better understanding of how smartphone data are collected for predicting different diseases and what kinds of machine learning methods are used on these data. Similarly, publicly available databases having smartphone-based data that can be used for the diagnosis of various diseases have been presented. Our screening method could be used or improved in future studies, and our findings could be used as a reference to conduct similar studies, experiments, or statistical analyses.
Estimation of Static Lung Volumes and Capacities From Spirometry Using Machine Learning: Algorithm Development and Validation
Scott A Helgeson, Zachary S Quicksall, Patrick W Johnson, Kaiser G Lim, Rickey E Carter, Augustine S Lee
JMIR AI 2025;4:e65456. Published 2025-03-24. DOI: 10.2196/65456. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223454/pdf/

Background: Spirometry can be performed in an office setting or remotely using portable spirometers. Although basic spirometry is used for the diagnosis of obstructive lung disease, clinically relevant information such as restriction, hyperinflation, and air trapping requires additional testing, such as body plethysmography, which is not as readily available. We hypothesized that spirometry data contain information that can allow estimation of static lung volumes in certain circumstances by leveraging machine learning techniques.

Objective: The aim of the study was to develop artificial intelligence-based algorithms for estimating lung volumes and capacities using spirometry measures.

Methods: This study obtained spirometry and lung volume measurements from the Mayo Clinic pulmonary function test database for patient visits between February 19, 2001, and December 16, 2022. Preprocessing was performed, and various machine learning algorithms were applied, including a generalized linear model with regularization, random forests, extremely randomized trees, gradient-boosted trees, and XGBoost, for both classification and regression cohorts.

Results: A total of 121,498 pulmonary function tests were used in this study, with 85,017 allotted for exploratory data analysis and model development (ie, the training dataset) and 36,481 tests reserved for model evaluation (ie, the testing dataset). The median age of the cohort was 64.7 years (range 18-119.6), with a balanced distribution between genders, consisting of 48.2% (n=58,607) female and 51.8% (n=62,889) male patients. The regression models showed robust performance overall, with relatively low root mean square error and mean absolute error values observed across all predicted lung volumes. Across all lung volume categories, the classification models demonstrated strong discriminatory capacity, as indicated by high area under the receiver operating characteristic curve values, ranging from 0.85 to 0.99 in the training set and 0.81 to 0.98 in the testing set.

Conclusions: Overall, the models demonstrate robust performance across lung volume measurements, underscoring their potential utility in clinical practice for accurate diagnosis and prognosis of respiratory conditions, particularly in settings where access to body plethysmography or other lung volume measurement modalities is limited.
Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study
Saman Andalib, Aidin Spina, Bryce Picton, Sean S Solomon, John A Scolaro, Ariana M Nelson
JMIR AI 2025;4:e70222. Published 2025-03-21. DOI: 10.2196/70222. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223325/pdf/

Background: Language barriers contribute significantly to health care disparities in the United States, where a sizable proportion of patients are exclusively Spanish speakers. In orthopedic surgery, such barriers impact both patients' comprehension of and engagement with available resources. Studies have explored the utility of large language models (LLMs) for medical translation but have yet to robustly evaluate artificial intelligence (AI)-driven translation and simplification of orthopedic materials for Spanish speakers.

Objective: This study used the bilingual evaluation understudy (BLEU) method to assess translation quality and investigated the ability of AI to simplify patient education materials (PEMs) in Spanish.

Methods: PEMs (n=78) from the American Academy of Orthopaedic Surgeons were translated from English to Spanish using 2 automated tools (the LLM GPT-4 and Google Translate). The BLEU methodology was applied to compare the AI translations with professionally human-translated PEMs. The Friedman test and the Dunn multiple comparisons test were used to statistically quantify differences in translation quality. A readability analysis and a feature analysis were subsequently performed to evaluate text simplification success and the impact of English text features on BLEU scores. The capability of an LLM to simplify medical language written in Spanish was also assessed.

Results: As measured by BLEU scores, GPT-4 showed moderate success in translating PEMs into Spanish but was less successful than Google Translate. Simplified PEMs demonstrated improved readability when compared with the original versions (P<.001) but were unable to reach the targeted grade level for simplification. The feature analysis revealed that the total number of syllables and the average number of syllables per sentence had the highest impact on BLEU scores. GPT-4 was able to significantly reduce the complexity of medical text written in Spanish (P<.001).

Conclusions: Although Google Translate outperformed GPT-4 in translation accuracy, LLMs such as GPT-4 may provide significant utility in translating medical texts into Spanish and simplifying such texts. We recommend considering a dual approach, using Google Translate for translation and GPT-4 for simplification, to improve medical information accessibility and orthopedic surgery education among Spanish-speaking patients.
{"title":"Utility-based Analysis of Statistical Approaches and Deep Learning Models for Synthetic Data Generation With Focus on Correlation Structures: Algorithm Development and Validation.","authors":"Marko Miletic, Murat Sariyar","doi":"10.2196/65729","DOIUrl":"10.2196/65729","url":null,"abstract":"<p><strong>Background: </strong>Recent advancements in Generative Adversarial Networks and large language models (LLMs) have significantly advanced the synthesis and augmentation of medical data. These and other deep learning-based methods offer promising potential for generating high-quality, realistic datasets crucial for improving machine learning applications in health care, particularly in contexts where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets.</p><p><strong>Objective: </strong>This study evaluates the effectiveness of various Synthetic Data Generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using Random Forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula for LLMs.</p><p><strong>Methods: </strong>We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consist of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets include the body performance dataset with 13,393 samples for fitness classification, the Wisconsin Breast Cancer dataset with 569 samples for tumor diagnosis, and the diabetes dataset with 768 samples for diabetes prediction. Data quality is evaluated by comparing correlation matrices, the propensity score mean-squared error (pMSE) for general utility, and F<sub>1</sub>-scores for downstream tasks as a specific utility metric, using training on synthetic data and testing on real data.</p><p><strong>Results: </strong>Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including large LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. 
The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of correlation matrix differences.</p><p><strong>Conclusions: </strong>Statistical methods, particularly synthpop, demonstrate superi","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e65729"},"PeriodicalIF":0.0,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11969122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143671937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
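The pMSE metric used for general utility has a compact definition: a classifier tries to distinguish real from synthetic rows, and pMSE measures how far its propensity scores stray from the uninformative baseline c = n_synthetic / N (a score of 0 means the two datasets are indistinguishable). The sketch below uses logistic regression as the propensity model and random stand-in data; the real evaluation would use the actual real and synthetic tables.

```python
# Sketch: propensity score mean-squared error (pMSE) for synthetic data utility.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 5))
synthetic = rng.normal(0.1, 1.0, size=(500, 5))  # slightly off-distribution on purpose

X = np.vstack([real, synthetic])
z = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])  # 1 = synthetic

# Propensity model: predict which rows are synthetic.
propensity = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
c = len(synthetic) / len(X)              # expected score if indistinguishable
pmse = np.mean((propensity - c) ** 2)    # 0 indicates perfect general utility
print(pmse)
```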