Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study
Hesam Mahmoudi, Doris Chang, Hannah Lee, Navid Ghaffarzadegan, Mohammad S Jalali
JMIR AI. 2025;4:e68097. Published September 11, 2025. doi: 10.2196/68097
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425462/pdf/

Background: Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.

Objective: Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluation, using ChatGPT (GPT-4).

Methods: We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently extracted 60 data elements using manual coding and compared them with ChatGPT's responses to 420 queries spanning 7 prompt iterations.

Results: ChatGPT's accuracy improved as prompts were refined, with gains of 33 and 23 percentage points between the initial and final iterations for extracting study settings and behavioral components, respectively. With the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. In the final iteration, ChatGPT correctly extracted 43 (71.7%) of the 60 data elements, performing better on explicitly stated study settings (28/30, 93.3%) than on subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations.

Conclusions: Our findings underscore LLMs' utility in extracting basic, explicitly stated data in SLRs when effective prompts are used. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.

Diagnostic and Screening AI Tools in Brazil's Resource-Limited Settings: Systematic Review
Leticia Medeiros Mancini, Luiz Eduardo Vanderlei Torres, Jorge Artur P de M Coelho, Nichollas Botelho da Fonseca, Pedro Fellipe Dantas Cordeiro, Samara Silva Noronha Cavalcante, Diego Dermeval
JMIR AI. 2025;4:e69547. Published September 10, 2025. doi: 10.2196/69547
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12422524/pdf/

Background: Artificial intelligence (AI) has the potential to transform global health care, with extensive application in Brazil, particularly for diagnosis and screening.

Objective: This study aimed to conduct a systematic review of AI applications in Brazilian health care, focusing especially on resource-constrained environments.

Methods: A systematic review was performed, searching PubMed, Cochrane Library, Embase, Web of Science, LILACS, and SciELO for papers published from 1993 to November 2023. The search returned 714 papers, of which 25 were selected for the final sample. The meta-analysis evaluated three main metrics (area under the receiver operating characteristic curve, sensitivity, and specificity), applying a random-effects model to each metric to account for between-study variability.

Results: Key specialties for AI tools included ophthalmology and infectious disease, with a significant concentration of studies conducted in São Paulo state (13/25, 52%). All papers included testing to evaluate and validate the tools; however, only two conducted secondary testing with a different population. In terms of risk of bias, 10 of 25 (40%) papers had medium risk, 8 (32%) had low risk, and 7 (28%) had high risk. Most studies were public initiatives (17/25, 68%), while 5 of 25 (20%) were private. In limited-income countries like Brazil, minimum technological requirements for implementing AI in health care must be carefully considered given financial limitations and often insufficient technological infrastructure. Of the papers reviewed, 19 of 25 (76%) used computers, and 18 of 25 (72%) required the Windows operating system. The most used AI approach was machine learning (11/25, 44%). The combined sensitivity was 0.8113, the combined specificity was 0.7417, and the combined area under the receiver operating characteristic curve was 0.8308, all with P<.001.

Conclusions: There is a relative balance in the use of both diagnostic and screening tools, with widespread application across Brazil in varied contexts. The need for secondary testing highlights opportunities for future research.

Exploring Named Entity Recognition Potential and the Value of Tailored Natural Language Processing Pipelines for Radiology, Pathology, and Progress Notes in Clinical Decision Support: Quantitative Study
Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia
JMIR AI. 2025;4:e59251. Published September 5, 2025. doi: 10.2196/59251
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449662/pdf/

Background: Clinical notes house rich, yet unstructured, patient data, making analysis challenging because medical jargon, abbreviations, and synonyms cause ambiguity. This complicates real-time extraction for decision support tools.

Objective: This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies those entities as present or absent in the patient through an NER assertion model.

Methods: We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, and central processing unit (CPU) and memory use were measured per note type. We also evaluated the precision of the NER outputs, and the precision and recall of the NER assertion models, against manual annotations by a clinical expert.

Results: Using Spark natural language processing (NLP) clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, peaking at 0.989 (95% CI 0.977-1.000) for procedures, and an assertion-model accuracy of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.

Conclusions: Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or NLP pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.

Evaluation of AI Tools Versus the PRISMA Method for Literature Search, Data Extraction, and Study Composition in Glaucoma Systematic Reviews: Content Analysis
Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni
JMIR AI. 2025;4:e68592. Published September 5, 2025. doi: 10.2196/68592
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413140/pdf/

Background: Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it allows for the analysis of extensive datasets, summarizes results, and assists in writing academic papers.

Objective: This study investigates the role of AI in conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages: study selection, data extraction, and study composition. Glaucoma-related SLRs serve as case studies, with Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.

Methods: Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to search for relevant records; we then assessed Elicit's and ChatPDF's ability to extract and organize the information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.

Results: Neither Connected Papers nor Elicit retrieved the totality of the results found using the PRISMA method. On average, data extracted by Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted by ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.

Conclusions: The PRISMA method continues to exhibit clear superiority in reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process remains crucial to maintain control over the quality, accuracy, and objectivity of the work.

AI-Driven Tacrolimus Dosing in Transplant Care: Cohort Study
Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati
JMIR AI. 2025;4:e67302. Published September 2, 2025. doi: 10.2196/67302
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404564/pdf/

Background: Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.

Objective: The aim of this study is to design a machine learning model to predict next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.

Methods: We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.

Results: The LSTM model achieved a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interacting drugs. When predicting underdosing, therapeutic dosing, and overdosing in a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).

Conclusions: Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.

Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study
Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner
JMIR AI. 2025;4:e55001. Published August 28, 2025. doi: 10.2196/55001
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12392689/pdf/

Background: Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.

Objective: The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.

Methods: This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated for diagnostic accuracy and user satisfaction, assessed through questionnaires.

Results: Three medical experts with extensive experience in the lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.

Conclusions: This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to …

Predicting Episodes of Hypovigilance in Intensive Care Units Using Routine Physiological Parameters and Artificial Intelligence: Derivation Study
Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault
JMIR AI. 2025;4:e60885. Published August 27, 2025. doi: 10.2196/60885
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384691/pdf/

Background: Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, lack of human resources, lack of reliable monitoring tools, and subtle clinical signs, including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.

Objective: To develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.

Methods: This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation), as well as other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature), were automatically collected and stored using a CARESCAPE Gateway (General Electric) or manually collected (for sociodemographic characteristics and medication) through chart review. Time series were generated around hypovigilance episodes for analysis. Random Forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis, extended for artificial intelligence) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).

Results: Out of 136 potentially eligible participants, data from 30 patients (mean age 69 y, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance. It achieved an average accuracy of 68% in detecting hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F…
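
A sketch of the evaluation setup named in the Methods: a Light Gradient Boosting Machine classifier scored with 10-fold group-based cross-validation (grouping keeps each patient's episodes in one fold) and interpreted with SHAP. The features and labels below are synthetic; only the episode counts echo the abstract.

```python
# Patient-grouped cross-validation with LightGBM plus a SHAP explainer,
# mirroring the study's setup; all data here are synthetic placeholders.
import numpy as np
import shap
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(2257, 8))             # e.g., windowed vital-sign features
y = rng.integers(0, 2, size=2257)          # 1 = hypovigilant episode (synthetic)
groups = rng.integers(0, 30, size=2257)    # patient IDs keep a patient in one fold

scores = []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    model = LGBMClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(f"mean accuracy across folds: {np.mean(scores):.2f}")

explainer = shap.TreeExplainer(model)      # explain the last fold's model
shap_values = explainer.shap_values(X[test_idx])
```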

Identification and Categorization of the Top 100 Articles and the Future of Large Language Models: Thematic Analysis Using Bibliometric Analysis
Ethan Bernstein, Anya Ramsamooj, Kelsey L Millar, Zachary C Lum
JMIR AI. 2025;4:e68603. Published August 27, 2025. doi: 10.2196/68603
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384689/pdf/

Background: Since the release of ChatGPT and other large language models (LLMs), there has been a significant increase in academic publications exploring their capabilities and implications across various fields, such as medicine, education, and technology.

Objective: This study aims to identify the most influential academic works on LLMs published in the past year and to categorize their research types and thematic focuses within different professional fields. The study also evaluates the ability of artificial intelligence (AI) tools, such as ChatGPT, to accurately classify academic research.

Methods: We conducted a bibliometric analysis using Clarivate's Web of Science (WOS) to extract the top 100 most cited papers on LLMs. Papers were manually categorized by field, journal, author, and research type. ChatGPT-4 was used to generate categorizations for the same papers, and its performance was compared to the human classifications. We summarized the distribution of research fields and assessed the concordance between AI-generated and manual classifications.

Results: Medicine emerged as the predominant field among the top 100 most cited papers, accounting for 43 (43%) papers, followed by education (26, 26%) and technology (15, 15%). Medical literature primarily focused on clinical applications of LLMs, limitations of AI in health care, and the role of AI in medical education. In education, research centered on ethical concerns and potential applications of AI for teaching and learning. ChatGPT demonstrated variable concordance with human reviewers, achieving an agreement rating of 47% for research types and 92% for fields of study.

Conclusions: While LLMs such as ChatGPT exhibit considerable potential in aiding research categorization, human oversight remains essential to address issues such as hallucinations, outdated information, and biases in AI-generated outputs. This study highlights the transformative potential of LLMs across multiple sectors and emphasizes the importance of continuous ethical evaluation and iterative improvement of AI systems to maximize their benefits while minimizing risks.
{"title":"Performance of DeepSeek and GPT Models on Pediatric Board Preparation Questions: Comparative Evaluation.","authors":"Masab Mansoor, Andrew Ibrahim, Ali Hamide","doi":"10.2196/76056","DOIUrl":"10.2196/76056","url":null,"abstract":"<p><strong>Background: </strong>Limited research exists evaluating artificial intelligence (AI) performance on standardized pediatric assessments. This study evaluated 3 leading AI models on pediatric board preparation questions.</p><p><strong>Objective: </strong>The aim of this study is to evaluate and compare the performance of 3 leading large language models (LLMs) on pediatric board examination preparation questions and contextualize their performance against human physician benchmarks.</p><p><strong>Methods: </strong>We analyzed DeepSeek-R1, ChatGPT-4, and ChatGPT-4.5 using 266 multiple-choice questions from the 2023 PREP Self-Assessment. Performance was compared to published American Board of Pediatrics first-time pass rates.</p><p><strong>Results: </strong>DeepSeek-R1 exhibited the highest accuracy at 98.1% (261/266 correct responses). ChatGPT-4.5 achieved 96.6% accuracy (257/266), performing at the upper threshold of human performance. ChatGPT-4 demonstrated 82.7% accuracy (220/266), comparable to the lower range of human pass rates. Error pattern analysis revealed that AI models most commonly struggled with questions requiring integration of complex clinical presentations with rare disease knowledge.</p><p><strong>Conclusions: </strong>DeepSeek-R1 demonstrated exceptional performance exceeding typical American Board of Pediatrics pass rates, suggesting potential applications in medical education and clinical support, though further research on complex clinical reasoning is needed.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76056"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intensive Care Unit Patient Outcome Prediction Using ν-Support Vector Classification and Stochastic Signal Processing-Based Feature Extraction Techniques: Algorithm Development and Validation Study.","authors":"Shaodong Wang, Yiqun Jiang, Qing Li, Wenli Zhang","doi":"10.2196/72671","DOIUrl":"10.2196/72671","url":null,"abstract":"<p><strong>Background: </strong>Intensive care units (ICUs) treat patients with life-threatening illnesses. Worldwide, intensive care demand is massive. Predicting patient outcomes in ICUs holds significant importance for health care operation management. Nevertheless, it remains a challenging problem that researchers and health care practitioners have yet to overcome. While the newly emerging health digital trace data offer new possibilities, such data contain complex time series and patterns. Although researchers have devised severity score systems, traditional machine learning models with feature engineering, and deep learning models that use raw clinical data to predict ICU outcomes, existing methods have limitations.</p><p><strong>Objective: </strong>This study aimed to develop a novel feature extraction and machine learning framework to repurpose and extract features with strong predictive power from patients' health digital traces for ICU outcome prediction.</p><p><strong>Methods: </strong>Guided by signal processing techniques and medical domain knowledge, the proposed framework introduces a novel, signal processing-based feature engineering method to extract highly predictive features from ICU digital trace data. We rigorously evaluated this method on a real-world ICU dataset, demonstrating significant improvements over both traditional and deep learning baseline methods. The method was then evaluated using a real-world database to assess prediction accuracy and feature representativeness.</p><p><strong>Results: </strong>The prediction results obtained by the proposed framework significantly outperformed state-of-the-art benchmarks. This demonstrated the framework's effectiveness in capturing key patterns from complex health digital traces for improving ICU outcome prediction.</p><p><strong>Conclusions: </strong>Our study contributes to health care operation management by leveraging digital traces from health care information systems to address challenges with significant implications for health care.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e72671"},"PeriodicalIF":2.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12421204/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}