{"title":"Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset.","authors":"Takuya Fukushima, Masae Manabe, Shuntaro Yada, Shoko Wakamiya, Akiko Yoshida, Yusaku Urakawa, Akiko Maeda, Shigeyuki Kan, Masayo Takahashi, Eiji Aramaki","doi":"10.2196/65047","DOIUrl":"10.2196/65047","url":null,"abstract":"<p><strong>Background: </strong>Advances in genetics have underscored a strong association between genetic factors and health outcomes, leading to an increased demand for genetic counseling services. However, a shortage of qualified genetic counselors poses a significant challenge. Large language models (LLMs) have emerged as a potential solution for augmenting support in genetic counseling tasks. Despite the potential, Japanese genetic counseling LLMs (JGCLLMs) are underexplored. To advance a JGCLLM-based dialogue system for genetic counseling, effective domain adaptation methods require investigation.</p><p><strong>Objective: </strong>This study aims to evaluate the current capabilities and identify challenges in developing a JGCLLM-based dialogue system for genetic counseling. The primary focus is to assess the effectiveness of prompt engineering, retrieval-augmented generation (RAG), and instruction tuning within the context of genetic counseling. Furthermore, we will establish an experts-evaluated dataset of responses generated by LLMs adapted to Japanese genetic counseling for the future development of JGCLLMs.</p><p><strong>Methods: </strong>Two primary datasets were used in this study: (1) a question-answer (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset included 899 QA pairs covering medical and genetic counseling topics, while the evaluation dataset contained 120 curated questions across 6 genetic counseling categories. Three enhancement techniques of LLMs-instruction tuning, RAG, and prompt engineering-were applied to a lightweight Japanese LLM to enhance its ability for genetic counseling. The performance of the adapted LLM was evaluated on the 120-question dataset by 2 certified genetic counselors and 1 ophthalmologist (SK, YU, and AY). Evaluation focused on four metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus.</p><p><strong>Results: </strong>The evaluation by certified genetic counselors and an ophthalmologist revealed varied outcomes across different methods. RAG showed potential, particularly in enhancing critical aspects of genetic counseling. In contrast, instruction tuning and prompt engineering produced less favorable outcomes. This evaluation process facilitated the creation an expert-evaluated dataset of responses generated by LLMs adapted with different combinations of these methods. Error analysis identified key ethical concerns, including inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements.</p><p><strong>Conclusions: </strong>RAG demonstrated notable improvements across all evaluation metrics, suggesting potential for further enhancement through the expansion of RAG data. The expert-evaluated dataset developed in this study provides valuable insights for future optimization efforts. 
However, the ethical issues observed underscore the need for careful expert oversight before such systems are used in real genetic counseling settings.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e65047"},"PeriodicalIF":3.1,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11783024/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143016961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
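A minimal sketch of the RAG setup this abstract describes (retrieve related QA pairs, then condition the generator on them) is shown below. The corpus entries, embedding model, and prompt wording are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal RAG sketch: embed a QA corpus, retrieve the top-k pairs for a new
# question, and build a grounded prompt for a downstream Japanese LLM.
# Corpus contents and model name are placeholders, not the study's data.
from sentence_transformers import SentenceTransformer, util

qa_corpus = [
    "Q: Is retinitis pigmentosa hereditary? A: Often, with several inheritance patterns ...",
    "Q: What does a genetic counselor do? A: Explains test options and risks ...",
]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_emb = embedder.encode(qa_corpus, convert_to_tensor=True)

def retrieve(question: str, k: int = 3) -> list[str]:
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
    return [qa_corpus[h["corpus_id"]] for h in hits]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return ("You are a genetic counseling assistant. Answer using the "
            f"reference QA pairs.\nReferences:\n{context}\n\n"
            f"Question: {question}\nAnswer:")
```

The resulting prompt would then be passed to the lightweight Japanese LLM being adapted.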
{"title":"Effectiveness of the Facility for Elderly Surveillance System (FESSy) in Two Public Health Center Jurisdictions in Japan: Prospective Observational Study.","authors":"Junko Kurita, Motomi Hori, Sumiyo Yamaguchi, Aiko Ogiwara, Yurina Saito, Minako Sugiyama, Asami Sunadori, Tomoko Hayashi, Akane Hara, Yukari Kawana, Youichi Itoi, Tamie Sugawara, Yoshiyuki Sugishita, Fujiko Irie, Naomi Sakurai","doi":"10.2196/58509","DOIUrl":"10.2196/58509","url":null,"abstract":"<p><strong>Background: </strong>Residents of facilities for older people are vulnerable to COVID-19 outbreaks. Nevertheless, timely recognition of outbreaks at facilities for older people at public health centers has been impossible in Japan since May 8, 2023, when the Japanese government discontinued aggressive countermeasures against COVID-19 because of the waning severity of the dominant Omicron strain. The Facility for Elderly Surveillance System (FESSy) has been developed to improve information collection.</p><p><strong>Objective: </strong>This study examined FESSy experiences and effectiveness in two public health center jurisdictions in Japan.</p><p><strong>Methods: </strong>This study assessed the use by public health centers of the detection mode of an automated AI detection system (ie, FESSy AI), as well as manual detection by the public health centers' staff (ie, FESSy staff) and direct reporting by facilities to the public health centers. We considered the following aspects: (1) diagnoses or symptoms, (2) numbers of patients as of their detection date, and (3) ultimate numbers of patients involved in incidents. Subsequently, effectiveness was assessed and compared based on detection modes. The study lasted from June 1, 2023, through January 2024.</p><p><strong>Results: </strong>In both areas, this study examined 31 facilities at which 87 incidents were detected. FESSy (AI or staff) detected significantly fewer patients than non-FESSy methods, that is, direct reporting to the public health center of the detection date and ultimate number of patients.</p><p><strong>Conclusions: </strong>FESSy was superior to direct reporting from facilities for the number of patients as of the detection date and for the ultimate outbreak size.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e58509"},"PeriodicalIF":3.1,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11741194/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study.","authors":"Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, Fang Zhang","doi":"10.2196/63731","DOIUrl":"10.2196/63731","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain-specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored.</p><p><strong>Objective: </strong>This study aims to evaluates the accuracy of 7 LLMs including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5 on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy.</p><p><strong>Methods: </strong>This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques.</p><p><strong>Results: </strong>Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F<sub>1</sub>-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977.</p><p><strong>Conclusions: </strong>This study is the first to evaluate the performance of 7 LLMs on the CNNLE and that the integration of models via machine learning significantly boosted accuracy, reaching 90.8%. 
These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63731"},"PeriodicalIF":3.1,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142962601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
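The ensemble step (feeding the 7 models' answers to a meta-classifier) can be sketched as below. The integer encoding of options and the data files are assumptions; XGBoost stands in for the best performer among the 9 combiners the authors evaluated.

```python
# Stack 7 LLMs' multiple-choice answers into features and let XGBoost
# predict the correct option. Files and encoding are illustrative.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.load("llm_choices.npy")      # shape (1200, 7): option index chosen by each LLM
y = np.load("correct_answers.npy")  # shape (1200,): correct option index (0-4)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("ensemble accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```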
{"title":"Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis.","authors":"Yong Zhang, Xiao Lu, Yan Luo, Ying Zhu, Wenwu Ling","doi":"10.2196/63924","DOIUrl":"10.2196/63924","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic.</p><p><strong>Objective: </strong>This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers.</p><p><strong>Methods: </strong>We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel.</p><p><strong>Results: </strong>Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis.</p><p><strong>Conclusions: </strong>Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63924"},"PeriodicalIF":3.1,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737282/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143016966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Patients' Experienced Usability and Satisfaction With Digital Health Solutions in a Home Setting: Instrument Validation Study.","authors":"Susan J Oudbier, Ellen Ma Smets, Pythia T Nieuwkerk, David P Neal, S Azam Nurmohamed, Hans J Meij, Linda W Dusseljee-Peute","doi":"10.2196/63703","DOIUrl":"10.2196/63703","url":null,"abstract":"<p><strong>Background: </strong>The field of digital health solutions (DHS) has grown tremendously over the past years. DHS include tools for self-management, which support individuals to take charge of their own health. The usability of DHS, as experienced by patients, is pivotal to adoption. However, well-known questionnaires that evaluate usability and satisfaction use complex terminology derived from human-computer interaction and are therefore not well suited to assess experienced usability of patients using DHS in a home setting.</p><p><strong>Objective: </strong>This study aimed to develop, validate, and assess an instrument that measures experienced usability and satisfaction of patients using DHS in a home setting.</p><p><strong>Methods: </strong>The development of the \"Experienced Usability and Satisfaction with Self-monitoring in the Home Setting\" (GEMS) questionnaire followed several steps. Step I consisted of assessing the content validity, by conducting a literature review on current usability and satisfaction questionnaires, collecting statements and discussing these in an expert meeting, and translating each statement and adjusting it to the language level of the general population. This phase resulted in a draft version of the GEMS. Step II comprised assessing its face validity by pilot testing with Amsterdam University Medical Center's patient panel. In step III, psychometric analysis was conducted and the GEMS was assessed for reliability.</p><p><strong>Results: </strong>A total of 14 items were included for psychometric analysis and resulted in 4 reliable scales: convenience of use, perceived value, efficiency of use, and satisfaction.</p><p><strong>Conclusions: </strong>Overall, the GEMS questionnaire demonstrated its reliability and validity in assessing experienced usability and satisfaction of DHS in a home setting. Further refinement of the instrument is necessary to confirm its applicability in other patient populations in order to promote the development of a steering mechanism that can be applied longitudinally throughout implementation, and can be used as a benchmarking instrument.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63703"},"PeriodicalIF":3.1,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142973340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous International Classification of Diseases Coding Using Pretrained Language Models and Advanced Prompt Learning Techniques: Evaluation of an Automated Analysis System Using Medical Text.","authors":"Yan Zhuang, Junyan Zhang, Xiuxing Li, Chao Liu, Yue Yu, Wei Dong, Kunlun He","doi":"10.2196/63020","DOIUrl":"10.2196/63020","url":null,"abstract":"<p><strong>Background: </strong>Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, it faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to a high error rate. We developed a fully automated pipeline based on the Key-bidirectional encoder representations from transformers (BERT) approach and large-scale medical records for continued pretraining, which effectively converts long free text into standard ICD codes. By adjusting parameter settings, such as mixed templates and soft verbalizers, the model can adapt flexibly to different requirements, enabling task-specific prompt learning.</p><p><strong>Objective: </strong>This study aims to propose a prompt learning real-time framework based on pretrained language models that can automatically label long free-text data with ICD-10 codes for cardiovascular diseases without the need for semiautomatic preprocessing.</p><p><strong>Methods: </strong>We integrated 4 components into our framework: a medical pretrained BERT, a keyword filtration BERT in a functional order, a fine-tuning phase, and task-specific prompt learning utilizing mixed templates and soft verbalizers. This framework was validated on a multicenter medical dataset for the automated ICD coding of 13 common cardiovascular diseases (584,969 records). Its performance was compared against robustly optimized BERT pretraining approach, extreme language network, and various BERT-based fine-tuning pipelines. Additionally, we evaluated the framework's performance under different prompt learning and fine-tuning settings. Furthermore, few-shot learning experiments were conducted to assess the feasibility and efficacy of our framework in scenarios involving small- to mid-sized datasets.</p><p><strong>Results: </strong>Compared with traditional pretraining and fine-tuning pipelines, our approach achieved a higher micro-F1-score of 0.838 and a macro-area under the receiver operating characteristic curve (macro-AUC) of 0.958, which is 10% higher than other methods. Among different prompt learning setups, the combination of mixed templates and soft verbalizers yielded the best performance. Few-shot experiments showed that performance stabilized and the AUC peaked at 500 shots.</p><p><strong>Conclusions: </strong>These findings underscore the effectiveness and superior performance of prompt learning and fine-tuning for subtasks within pretrained language models in medical practice. 
Our real-time ICD coding pipeline efficiently converts detailed medical free text into standardized labels, offering promising applications in clinical practice.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63020"},"PeriodicalIF":3.1,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11747532/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143025995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
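The prompt learning components named above (a template with a mask slot plus a soft verbalizer) can be illustrated with this PyTorch sketch: instead of mapping the masked position through hand-picked label words, a trainable head maps its hidden state to the 13 ICD classes. The encoder name and template text are assumptions, not the paper's configuration.

```python
# Soft-verbalizer sketch: classify an ICD code from the [MASK] hidden state.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # stand-in encoder
encoder = AutoModel.from_pretrained("bert-base-chinese")
num_codes = 13  # the 13 cardiovascular disease targets

soft_verbalizer = nn.Linear(encoder.config.hidden_size, num_codes)

def icd_logits(record: str) -> torch.Tensor:
    # Template idea: wrap the free text in a prompt with a mask slot
    # ("诊断为" means "diagnosed as"; the wording is illustrative).
    prompt = f"{record} 诊断为 {tok.mask_token}。"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    hidden = encoder(**enc).last_hidden_state            # (1, seq_len, hidden)
    mask_pos = (enc["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
    return soft_verbalizer(hidden[0, mask_pos])          # logits over ICD codes
```

Training would then jointly tune the encoder (or only the template and head) with cross-entropy loss over labeled records.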
{"title":"Development and Evaluation of a Mental Health Chatbot Using ChatGPT 4.0: Mixed Methods User Experience Study With Korean Users.","authors":"Boyoung Kang, Munpyo Hong","doi":"10.2196/63538","DOIUrl":"10.2196/63538","url":null,"abstract":"<p><strong>Background: </strong>Mental health chatbots have emerged as a promising tool for providing accessible and convenient support to individuals in need. Building on our previous research on digital interventions for loneliness and depression among Korean college students, this study addresses the limitations identified and explores more advanced artificial intelligence-driven solutions.</p><p><strong>Objective: </strong>This study aimed to develop and evaluate the performance of HoMemeTown Dr. CareSam, an advanced cross-lingual chatbot using ChatGPT 4.0 (OpenAI) to provide seamless support in both English and Korean contexts. The chatbot was designed to address the need for more personalized and culturally sensitive mental health support identified in our previous work while providing an accessible and user-friendly interface for Korean young adults.</p><p><strong>Methods: </strong>We conducted a mixed methods pilot study with 20 Korean young adults aged 18 to 27 (mean 23.3, SD 1.96) years. The HoMemeTown Dr CareSam chatbot was developed using the GPT application programming interface, incorporating features such as a gratitude journal and risk detection. User satisfaction and chatbot performance were evaluated using quantitative surveys and qualitative feedback, with triangulation used to ensure the validity and robustness of findings through cross-verification of data sources. Comparative analyses were conducted with other large language models chatbots and existing digital therapy tools (Woebot [Woebot Health Inc] and Happify [Twill Inc]).</p><p><strong>Results: </strong>Users generally expressed positive views towards the chatbot, with positivity and support receiving the highest score on a 10-point scale (mean 9.0, SD 1.2), followed by empathy (mean 8.7, SD 1.6) and active listening (mean 8.0, SD 1.8). However, areas for improvement were noted in professionalism (mean 7.0, SD 2.0), complexity of content (mean 7.4, SD 2.0), and personalization (mean 7.4, SD 2.4). The chatbot demonstrated statistically significant performance differences compared with other large language models chatbots (F=3.27; P=.047), with more pronounced differences compared with Woebot and Happify (F=12.94; P<.001). Qualitative feedback highlighted the chatbot's strengths in providing empathetic responses and a user-friendly interface, while areas for improvement included response speed and the naturalness of Korean language responses.</p><p><strong>Conclusions: </strong>The HoMemeTown Dr CareSam chatbot shows potential as a cross-lingual mental health support tool, achieving high user satisfaction and demonstrating comparative advantages over existing digital interventions. However, the study's limited sample size and short-term nature necessitate further research. 
Future studies should include larger-scale clinical trials, enhanced risk detection features, and integration with existing health care systems to fully realize its potential in supporting mental well-being.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e63538"},"PeriodicalIF":3.1,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11748427/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142928333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
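The chatbot's basic loop, including the risk-detection feature mentioned in the Methods, could look roughly like this; the keyword list, system prompt, and model name are assumptions rather than the authors' implementation.

```python
# Chat loop with a simple keyword-based risk check before calling the model.
from openai import OpenAI

client = OpenAI()
RISK_TERMS = ["suicide", "self-harm", "죽고 싶"]  # illustrative only

def reply(history: list[dict], user_msg: str) -> str:
    if any(term in user_msg.lower() for term in RISK_TERMS):
        return ("It sounds like you are going through something serious. "
                "Please contact a crisis line or a professional right now.")
    history.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(model="gpt-4", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

history = [{"role": "system", "content": "You are a supportive bilingual "
            "(Korean and English) mental health companion."}]
print(reply(history, "I feel lonely these days."))
```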
{"title":"The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis.","authors":"Amadeo Jesus Wals Zurita, Hector Miras Del Rio, Nerea Ugarte Ruiz de Aguirre, Cristina Nebrera Navarro, Maria Rubio Jimenez, David Muñoz Carmona, Carlos Miguez Sanchez","doi":"10.2196/58457","DOIUrl":"10.2196/58457","url":null,"abstract":"<p><strong>Background: </strong>In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.</p><p><strong>Objective: </strong>We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.</p><p><strong>Methods: </strong>We implemented a script using the OpenAI application programming interface to extract structured information in JavaScript object notation format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, κ index, and the McNemar test, in addition to examining the common causes of errors in both humans and generative pretrained transformer (GPT) models.</p><p><strong>Results: </strong>The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also resulted in more false positives.</p><p><strong>Conclusions: </strong>This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e58457"},"PeriodicalIF":3.1,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11739723/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142923859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Clinical Decision Making by Predicting Readmission Risk in Patients With Heart Failure Using Machine Learning: Predictive Model Development Study.","authors":"Xiangkui Jiang, Bingquan Wang","doi":"10.2196/58812","DOIUrl":"10.2196/58812","url":null,"abstract":"<p><strong>Background: </strong>Patients with heart failure frequently face the possibility of rehospitalization following an initial hospital stay, placing a significant burden on both patients and health care systems. Accurate predictive tools are crucial for guiding clinical decision-making and optimizing patient care. However, the effectiveness of existing models tailored specifically to the Chinese population is still limited.</p><p><strong>Objective: </strong>This study aimed to formulate a predictive model for assessing the likelihood of readmission among patients diagnosed with heart failure.</p><p><strong>Methods: </strong>In this study, we analyzed data from 1948 patients with heart failure in a hospital in Sichuan Province between 2016 and 2019. By applying 3 variable selection strategies, 29 relevant variables were identified. Subsequently, we constructed 6 predictive models using different algorithms: logistic regression, support vector machine, gradient boosting machine, Extreme Gradient Boosting, multilayer perception, and graph convolutional networks.</p><p><strong>Results: </strong>The graph convolutional network model showed the highest prediction accuracy with an area under the receiver operating characteristic curve of 0.831, accuracy of 75%, sensitivity of 52.12%, and specificity of 90.25%.</p><p><strong>Conclusions: </strong>The model crafted in this study proves its effectiveness in forecasting the likelihood of readmission among patients with heart failure, thus serving as a crucial reference for clinical decision-making.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"12 ","pages":"e58812"},"PeriodicalIF":3.1,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11706445/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effectiveness of Outpatient Chronic Pain Management for Middle-Aged Patients by Internet Hospitals: Retrospective Cohort Study.","authors":"Ling Sang, Bixin Zheng, Xianzheng Zeng, Huizhen Liu, Qing Jiang, Maotong Liu, Chenyu Zhu, Maoying Wang, Zengwei Yi, Keyu Song, Li Song","doi":"10.2196/54975","DOIUrl":"https://doi.org/10.2196/54975","url":null,"abstract":"<p><strong>Background: </strong>Chronic pain is widespread and carries a heavy disease burden, and there is a lack of effective outpatient pain management. As an emerging internet medical platform in China, internet hospitals have been successfully applied for the management of chronic diseases. There are also a certain number of patients with chronic pain that use internet hospitals for pain management. However, no studies have investigated the effectiveness of pain management via internet hospitals.</p><p><strong>Objective: </strong>The aim of this retrospective cohort study was to explore the effectiveness of chronic pain management by internet hospitals and their advantages and disadvantages compared to traditional physical hospital visits.</p><p><strong>Methods: </strong>This was a retrospective cohort study. Demographic information such as the patient's sex, age, and number of visits was obtained from the IT center. During the first and last patient visits, information on outcome variables such as the Brief Pain Inventory (BPI), medical satisfaction, medical costs, and adverse drug events was obtained through a telephone follow-up. All patients with chronic pain who had 3 or more visits (internet or offline) between September 2021, and February 2023, were included. The patients were divided into an internet hospital group and a physical hospital group, according to whether they had web-based or in-person consultations, respectively. To control for confounding variables, propensity score matching was used to match the two groups. Matching variables included age, sex, diagnosis, and number of clinic visits.</p><p><strong>Results: </strong>A total of 122 people in the internet hospital group and 739 people in the physical hospital group met the inclusion criteria. After propensity score matching, 77 patients in each of the two groups were included in the analysis. There was not a significant difference in the quality of life (QOL; QOL assessment was part of the BPI scale) between the internet hospital group and the physical hospital group (P=.80), but the QOL of both groups of patients improved after pain management (internet hospital group: P<.001; physical hospital group: P=.001). There were no significant differences in the pain relief rate (P=.25) or the incidence of adverse events (P=.60) between the two groups. The total cost (P<.001) and treatment-related cost (P<.001) of the physical hospital group were higher than those of the internet hospital group. In addition, the degree of satisfaction in the internet hospital group was greater than that in the physical hospital group (P=.01).</p><p><strong>Conclusions: </strong>Internet hospitals are an effective way of managing chronic pain. 
They can improve patients' QOL and satisfaction, reduce treatment costs, and serve as part of a multimodal strategy for chronic pain self-management.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"12 ","pages":"e54975"},"PeriodicalIF":3.1,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142933332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
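The propensity score matching step, using the matching variables named above, could be implemented along these lines; the data columns are assumptions, and matching here is 1:1 nearest neighbor with replacement.

```python
# Propensity scores from logistic regression, then 1:1 nearest-neighbor
# matching of internet hospital patients to physical hospital patients.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("pain_cohort.csv")  # hypothetical extract
X = pd.get_dummies(df[["age", "sex", "diagnosis", "n_visits"]], drop_first=True)
df["ps"] = LogisticRegression(max_iter=1000).fit(X, df["internet"]).predict_proba(X)[:, 1]

treated = df[df["internet"] == 1]
control = df[df["internet"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched = pd.concat([treated, control.iloc[idx.ravel()]])  # analysis cohort
```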