{"title":"Global prevalence, trend and projection of myopia in children and adolescents from 1990 to 2050: a comprehensive systematic review and meta-analysis","authors":"Jinghong Liang, Yingqi Pu, Jiaqi Chen, Meiling Liu, Bowen Ouyang, Zhengge Jin, Wenxin Ge, Zhuowen Wu, Xiuzhi Yang, Chunsong Qin, Cong Wang, Shan Huang, Nan Jiang, Lixin Hu, Yushan Zhang, Zhaohuan Gui, Xueya Pu, Shaoyi Huang, Yajun Chen","doi":"10.1136/bjo-2024-325427","DOIUrl":"https://doi.org/10.1136/bjo-2024-325427","url":null,"abstract":"Background Myopia is a pervasive global public health concern, particularly among the younger population. However, the escalating prevalence of myopia remains uncertain. Hence, our research aims to ascertain the global and regional prevalence of myopia, along with its occurrence within specific demographic groups. Methods An exhaustive literature search was performed on several databases covering the period from their inception to 27 June 2023. The global prevalence of myopia was determined by employing pooled estimates with a 95% CI, and further analysis was conducted to assess variations in prevalence estimates across different subgroups. Additionally, a time series model was utilised to forecast and fit accurately the future prevalence of myopia for the next three decades. Results This study encompasses a comprehensive analysis of 276 studies, involving a total of 5 410 945 participants from 50 countries across all six continents. The findings revealed a gradual increase in pooled prevalence of myopia, ranging from 24.32% (95% CI 15.23% to 33.40%) to 35.81% (95% CI 31.70% to 39.91%), observed from 1990 to 2023, and projections indicate that this prevalence is expected to reach 36.59% in 2040 and 39.80% in 2050. Notably, individuals residing in East Asia (35.22%) or in urban areas (28.55%), female gender (33.57%), adolescents (47.00%), and high school students (45.71%) exhibit a higher proportion of myopia prevalence. Conclusion The global prevalence of childhood myopia is substantial, affecting approximately one-third of children and adolescents, with notable variations in prevalence across different demographic groups. It is anticipated that the global incidence of myopia will exceed 740 million cases by 2050. All data relevant to the study are included in the article or uploaded as supplementary information. All the data included in our study are from published studies.","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":4.1,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142317679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Outcomes of gonioscopy-assisted transluminal trabeculotomy in advanced pigmentary glaucoma","authors":"Arnav Panigrahi, Anurag Kumar, Shikha Gupta, Davinder S Grover, Viney Gupta","doi":"10.1136/bjo-2024-325749","DOIUrl":"https://doi.org/10.1136/bjo-2024-325749","url":null,"abstract":"Purpose To compare outcomes of gonioscopy-assisted transluminal trabeculotomy (GATT) over a 12-month period with trabeculectomy in patients with advanced pigmentary glaucoma (PG). Methods This was a pilot randomised controlled trial of patients with advanced PG (mean deviation worse than −12 dB), undergoing either GATT or a fornix-based trabeculectomy. Absolute success (criterion A) was defined as a postoperative intraocular pressure (IOP) between 6 and 18 mm Hg, with a drop of at least 30% from the treated preoperative value without need of any IOP-lowering medication. Success (criterion B) was also defined as per the target IOP, with an upper limit of 15 mm Hg for eyes with mean deviation (MD) between −12 and −24 dB, and 12 mm Hg or lower for MD values worse than −24 dB. Qualified success was a similar IOP standard on the same or fewer antiglaucoma medications. Results For GATT (n=10), mean preoperative IOP and number of glaucoma medications were 28.2±11.2 mm Hg and 4±0.8 that reduced to 11.8±2.5 mm Hg and 0.7 at 12 months postoperatively, while in the trabeculectomy (n=12) group, they were 27.3±5.5 mm Hg and 3.6±0.7 that reduced to 11.5±2.2 mm Hg and 0.5±0.9, respectively. All eyes (100%) achieved qualified success. Absolute success was 60% and 67.7% by criterion A and 50% and 58.3% by criterion B for GATT and trabeculectomy, respectively. Two eyes in the trabeculectomy group developed hypotony while none of the GATT group had any sight-threatening complications (p=0.4). Conclusions GATT alone demonstrated a significant reduction in IOP and number of glaucoma medications in patients with advanced PG. Data are available upon reasonable request. Not Applicable.","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":4.1,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142317595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis.","authors":"Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Danli Shi","doi":"10.1136/bjo-2023-325054","DOIUrl":"10.1136/bjo-2023-325054","url":null,"abstract":"<p><strong>Purpose: </strong>To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in interpreting ocular multimodal images.</p><p><strong>Methods: </strong>We developed a digital ophthalmologist app using GPT-4V and evaluated its performance with a dataset (60 images, 60 ophthalmic conditions, 6 modalities) that included slit-lamp, scanning laser ophthalmoscopy, fundus photography of the posterior pole (FPP), optical coherence tomography, fundus fluorescein angiography and ocular ultrasound images. The chatbot was tested with ten open-ended questions per image, covering examination identification, lesion detection, diagnosis and decision support. The responses were manually assessed for accuracy, usability, safety and diagnosis repeatability. Auto-evaluation was performed using sentence similarity and GPT-4-based auto-evaluation.</p><p><strong>Results: </strong>Out of 600 responses, 30.6% were accurate, 21.5% were highly usable and 55.6% were deemed as no harm. GPT-4V performed best with slit-lamp images, with 42.0%, 38.5% and 68.5% of the responses being accurate, highly usable and no harm, respectively. However, its performance was weaker in FPP images, with only 13.7%, 3.7% and 38.5% in the same categories. GPT-4V correctly identified 95.6% of the imaging modalities and showed varying accuracies in lesion identification (25.6%), diagnosis (16.1%) and decision support (24.0%). The overall repeatability of GPT-4V in diagnosing ocular images was 63.3% (38/60). The overall sentence similarity between responses generated by GPT-4V and human answers is 55.5%, with Spearman correlations of 0.569 for accuracy and 0.576 for usability.</p><p><strong>Conclusion: </strong>GPT-4V currently is not yet suitable for clinical decision-making in ophthalmology. Our study serves as a benchmark for enhancing ophthalmic multimodal models.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141092871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases.","authors":"Daniel Milad, Fares Antaki, Jason Milad, Andrew Farah, Thomas Khairy, David Mikhail, Charles-Édouard Giguère, Samir Touma, Allison Bernstein, Andrei-Alexandru Szigiato, Taylor Nayman, Guillaume A Mullie, Renaud Duval","doi":"10.1136/bjo-2023-325053","DOIUrl":"10.1136/bjo-2023-325053","url":null,"abstract":"<p><strong>Background/aims: </strong>This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.</p><p><strong>Methods: </strong>We tested GPT-4 on 422 <i>Journal of the American Medical Association</i> Ophthalmology Clinical Challenges, and prompted the model to determine the diagnosis (open-ended question) and identify the next-step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model. We compared the best-performing model to human graders in a benchmarking effort.</p><p><strong>Results: </strong>Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI (43.1% to 52.9%)) and 63.0% (95% CI (58.2% to 67.6%)) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI (68.6% to 80.9%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and 0.049) and in accuracy of next step (p=0.002 and 0.020).</p><p><strong>Conclusion: </strong>Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139746091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions.","authors":"Thomas Fowler, Simon Pullen, Liam Birkett","doi":"10.1136/bjo-2023-324091","DOIUrl":"10.1136/bjo-2023-324091","url":null,"abstract":"<p><strong>Background: </strong>Chat Generative Pre-trained Transformer (ChatGPT), a large language model by OpenAI, and Bard, Google's artificial intelligence (AI) chatbot, have been evaluated in various contexts. This study aims to assess these models' proficiency in the part 1 Fellowship of the Royal College of Ophthalmologists (FRCOphth) Multiple Choice Question (MCQ) examination, highlighting their potential in medical education.</p><p><strong>Methods: </strong>Both models were tested on a sample question bank for the part 1 FRCOphth MCQ exam. Their performances were compared with historical human performance on the exam, focusing on the ability to comprehend, retain and apply information related to ophthalmology. We also tested it on the book 'MCQs for FRCOpth part 1', and assessed its performance across subjects.</p><p><strong>Results: </strong>ChatGPT demonstrated a strong performance, surpassing historical human pass marks and examination performance, while Bard underperformed. The comparison indicates the potential of certain AI models to match, and even exceed, human standards in such tasks.</p><p><strong>Conclusion: </strong>The results demonstrate the potential of AI models, such as ChatGPT, in processing and applying medical knowledge at a postgraduate level. However, performance varied among different models, highlighting the importance of appropriate AI selection. The study underlines the potential for AI applications in medical education and the necessity for further investigation into their strengths and limitations.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71478299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development and evaluation of a large language model of ophthalmology in Chinese.","authors":"Ce Zheng, Hongfei Ye, Jinming Guo, Junrui Yang, Ping Fei, Yuanzhi Yuan, Danqing Huang, Yuqiang Huang, Jie Peng, Xiaoling Xie, Meng Xie, Peiquan Zhao, Li Chen, Mingzhi Zhang","doi":"10.1136/bjo-2023-324526","DOIUrl":"10.1136/bjo-2023-324526","url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT's training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China.</p><p><strong>Methods: </strong>We developed an LLM of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions and diagnostic accuracy for clinical vignettes. Additionally, we compared MOPH's performance to that of human doctors.</p><p><strong>Results: </strong>In the ophthalmic exam, MOPH's average score closely aligned with the mean score of trainees (64.7 (range 62-68) vs 66.2 (range 50-92), p=0.817), but achieving a score above 60 in all seven mock exams. In answering ophthalmic questions, MOPH demonstrated an adherence of 83.3% (25/30) of responses following Chinese guidelines (Likert scale 4-5). Only 6.7% (2/30, Likert scale 1-2) and 10% (3/30, Likert scale 3) of responses were rated as 'poor or very poor' or 'potentially misinterpretable inaccuracies' by reviewers. In diagnostic accuracy, although the rate of correct diagnosis by ophthalmologists was superior to that by MOPH (96.1% vs 81.1%, p>0.05), the difference was not statistically significant.</p><p><strong>Conclusion: </strong>This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503135/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141632683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.","authors":"Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham","doi":"10.1136/bjo-2023-324533","DOIUrl":"10.1136/bjo-2023-324533","url":null,"abstract":"<p><strong>Background/aims: </strong>To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).</p><p><strong>Methods: </strong>We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as 'Good', 'Borderline' or 'Poor' quality.</p><p><strong>Results: </strong>Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤8.4×10<sup>-3</sup>). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated as 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤1.4×10<sup>-2</sup>). ChatGPT-4 and ChatGPT-3.5 had no 'Poor' rated responses. Google Bard produced 6.7% Poor responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others.</p><p><strong>Conclusion: </strong>ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140944063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large language models: a new frontier in paediatric cataract patient education.","authors":"Qais Dihan, Muhammad Z Chauhan, Taher K Eleiwa, Andrew D Brown, Amr K Hassan, Mohamed M Khodeiry, Reem H Elsheikh, Isdin Oke, Bharti R Nihalani, Deborah K VanderVeen, Ahmed B Sallam, Abdelrahman M Elhusseiny","doi":"10.1136/bjo-2024-325252","DOIUrl":"10.1136/bjo-2024-325252","url":null,"abstract":"<p><strong>Background/aims: </strong>This was a cross-sectional comparative study. We evaluated the ability of three large language models (LLMs) (ChatGPT-3.5, ChatGPT-4, and Google Bard) to generate novel patient education materials (PEMs) and improve the readability of existing PEMs on paediatric cataract.</p><p><strong>Methods: </strong>We compared LLMs' responses to three prompts. Prompt A requested they write a handout on paediatric cataract that was 'easily understandable by an average American.' Prompt B modified prompt A and requested the handout be written at a 'sixth-grade reading level, using the Simple Measure of Gobbledygook (SMOG) readability formula.' Prompt C rewrote existing PEMs on paediatric cataract 'to a sixth-grade reading level using the SMOG readability formula'. Responses were compared on their quality (DISCERN; 1 (low quality) to 5 (high quality)), understandability and actionability (Patient Education Materials Assessment Tool (≥70%: understandable, ≥70%: actionable)), accuracy (Likert misinformation; 1 (no misinformation) to 5 (high misinformation) and readability (SMOG, Flesch-Kincaid Grade Level (FKGL); grade level <7: highly readable).</p><p><strong>Results: </strong>All LLM-generated responses were of high-quality (median DISCERN ≥4), understandability (≥70%), and accuracy (Likert=1). All LLM-generated responses were not actionable (<70%). ChatGPT-3.5 and ChatGPT-4 prompt B responses were more readable than prompt A responses (p<0.001). ChatGPT-4 generated more readable responses (lower SMOG and FKGL scores; 5.59±0.5 and 4.31±0.7, respectively) than the other two LLMs (p<0.001) and consistently rewrote them to or below the specified sixth-grade reading level (SMOG: 5.14±0.3).</p><p><strong>Conclusion: </strong>LLMs, particularly ChatGPT-4, proved valuable in generating high-quality, readable, accurate PEMs and in improving the readability of existing materials on paediatric cataract.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142035322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases.","authors":"Matteo Mario Carlà, Gloria Gambini, Antonio Baldascino, Federico Giannuzzi, Francesco Boselli, Emanuele Crincoli, Nicola Claudio D'Onofrio, Stanislao Rizzo","doi":"10.1136/bjo-2023-325143","DOIUrl":"10.1136/bjo-2023-325143","url":null,"abstract":"<p><strong>Background: </strong>We aimed to define the capability of three different publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4 and Google Gemini in analysing retinal detachment cases and suggesting the best possible surgical planning.</p><p><strong>Methods: </strong>Analysis of 54 retinal detachments records entered into ChatGPT and Gemini's interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the given answers, we assessed the level of agreement with the common opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded 1-5 (from poor to excellent quality), according to the Global Quality Score (GQS).</p><p><strong>Results: </strong>After excluding 4 controversial cases, 50 cases were included. Overall, ChatGPT-3.5, ChatGPT-4 and Google Gemini surgical choices agreed with those of vitreoretinal surgeons in 40/50 (80%), 42/50 (84%) and 35/50 (70%) of cases. Google Gemini was not able to respond in five cases. Contingency analysis showed significant differences between ChatGPT-4 and Gemini (p=0.03). ChatGPT's GQS were 3.9±0.8 and 4.2±0.7 for versions 3.5 and 4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPTs (p=0.22), while both outperformed Gemini scores (p=0.03 and p=0.002, respectively). The main source of error was endotamponade choice (14% for ChatGPT-3.5 and 4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach.</p><p><strong>Conclusion: </strong>In conclusion, Google Gemini and ChatGPT evaluated vitreoretinal patients' records in a coherent manner, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were much more accurate and precise.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140048849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering.","authors":"Fares Antaki, Daniel Milad, Mark A Chia, Charles-Édouard Giguère, Samir Touma, Jonathan El-Khoury, Pearse A Keane, Renaud Duval","doi":"10.1136/bjo-2023-324438","DOIUrl":"10.1136/bjo-2023-324438","url":null,"abstract":"<p><strong>Background: </strong>Evidence on the performance of Generative Pre-trained Transformer 4 (GPT-4), a large language model (LLM), in the ophthalmology question-answering domain is needed.</p><p><strong>Methods: </strong>We tested GPT-4 on two 260-question multiple choice question sets from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions question banks. We compared the accuracy of GPT-4 models with varying temperatures (creativity setting) and evaluated their responses in a subset of questions. We also compared the best-performing GPT-4 model to GPT-3.5 and to historical human performance.</p><p><strong>Results: </strong>GPT-4-0.3 (GPT-4 with a temperature of 0.3) achieved the highest accuracy among GPT-4 models, with 75.8% on the BCSC set and 70.0% on the OphthoQuestions set. The combined accuracy was 72.9%, which represents an 18.3% raw improvement in accuracy compared with GPT-3.5 (p<0.001). Human graders preferred responses from models with a temperature higher than 0 (more creative). Exam section, question difficulty and cognitive level were all predictive of GPT-4-0.3 answer accuracy. GPT-4-0.3's performance was numerically superior to human performance on the BCSC (75.8% vs 73.3%) and OphthoQuestions (70.0% vs 63.0%), but the difference was not statistically significant (p=0.55 and p=0.09).</p><p><strong>Conclusion: </strong>GPT-4, an LLM trained on non-ophthalmology-specific data, performs significantly better than its predecessor on simulated ophthalmology board-style exams. Remarkably, its performance tended to be superior to historical human performance, but that difference was not statistically significant in our study.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71478288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}