Exploring the Capabilities of ChatGPT in Women's Health

Magdalena Elisabeth Bachmann, Ioana Duta, Emily Mazey, William Cooke, Manu Vatish, Gabriel Davis Jones
{"title":"探索 ChatGPT 在妇女健康方面的功能","authors":"Magdalena Elisabeth Bachmann, Ioana Duta, Emily Mazey, William Cooke, Manu Vatish, Gabriel Davis Jones","doi":"10.1101/2024.02.27.23300005","DOIUrl":null,"url":null,"abstract":"Introduction: Artificial Intelligence (AI) is redefining healthcare, with Large Language Models (LLMs) like ChatGPT offering novel and powerful capabilities in processing and generating human-like information. These advancements offer potential improvements in Women's Health, particularly Obstetrics and Gynaecology (O&G), where diagnostic and treatment gaps have long existed. Despite its generalist nature, ChatGPT is increasingly being tested in healthcare, necessitating a critical analysis of its utility, limitations and safety. This study examines ChatGPT's performance in interpreting and responding to international gold standard benchmark assessments in O&G: the RCOG's MRCOG Part One and Two examinations. We evaluate ChatGPT's domain- and knowledge area-specific accuracy, the influence of linguistic complexity on performance and its self-assessment confidence and uncertainty, essential for safe clinical decision-making. Methods: A dataset of MRCOG examination questions from sources beyond the reach of LLMs was developed to mitigate the risk of ChatGPT's prior exposure. A dual-review process validated the technical and clinical accuracy of the questions, omitting those dependent on previous content, duplicates, or requiring image interpretation. Single Best Answer (SBA) and Extended Matching (EMQ) Questions were converted to JSON format to facilitate ChatGPT's interpretation, incorporating question types and background information. Interaction with ChatGPT was conducted via OpenAI's API, structured to ensure consistent, contextually informed responses from ChatGPT. The response from ChatGPT was recorded and compared against the known accurate response. Linguistic complexity was evaluated using unique token counts and Type-Token ratios (vocabulary breadth and diversity) to explore their influence on performance. ChatGPT was instructed to assign confidence scores to its answers (0-100%), reflecting its self-perceived accuracy. Responses were categorized by correctness and statistically analysed through entropy calculation, assessing ChatGPT's capacity for self-evaluating certainty and knowledge boundaries. Findings: Of 1,824 MRCOG Part One and Two questions, ChatGPT's accuracy on MRCOG Part One was 72.2% (95% CI 69.2-75.3). For Part Two, it achieved 50.4% accuracy (95% CI 47.2-53.5) with 534 correct out of 989 questions, performing better on SBAs (54.0%, 95% CI 50.0-58.0) than on EMQs (45.0%, 95% CI 40.1-49.9). In domain-specific performance, the highest accuracy was in Biochemistry (79.8%, 95% CI 71.4-88.1) and the lowest in Biophysics (51.4%, 95% CI 35.2-67.5). The best-performing subject in Part Two was Urogynaecology (63.0%, 95% CI 50.1-75.8) and the worst was Management of Labour (35.6%, 95% CI 21.6-49.5). Linguistic complexity analysis showed a marginal increase in unique token count for correct answers in Part One (median 122, IQR 114-134) compared to incorrect (median 120, IQR 112-131, p=0.05). TTR analysis revealed higher medians for correct answers with negligible effect sizes (Part One: 0.66, IQR 0.63-0.68; Part Two: 0.62, IQR 0.57-0.67) and p-values <0.001. Regarding self-assessed confidence, the median confidence for correct answers was 70.0% (IQR 60-90), the same as for incorrect choices identified as correct (p<0.001). 
For correct answers deemed incorrect, the median confidence was 10.0% (IQR 0-10), and for incorrect answers accurately identified, it was 5.0% (IQR 0-10, p<0.001). Entropy values were identical for correct and incorrect responses (median 1.46, IQR 0.44-1.77), indicating no discernible distinction in ChatGPT's prediction certainty. Conclusions: ChatGPT demonstrated commendable accuracy in basic medical queries on the MRCOG Part One, yet its performance was markedly reduced in the clinically demanding Part Two exam. The model's high self-confidence across correct and incorrect responses necessitates scrutiny for its application in clinical decision-making. These findings suggest that while ChatGPT has potential, its current form requires significant refinement before it can enhance diagnostic efficacy and clinical workflow in women's health.","PeriodicalId":501409,"journal":{"name":"medRxiv - Obstetrics and Gynecology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring the Capabilities of ChatGPT in Women's Health\",\"authors\":\"Magdalena Elisabeth Bachmann, Ioana Duta, Emily Mazey, William Cooke, Manu Vatish, Gabriel Davis Jones\",\"doi\":\"10.1101/2024.02.27.23300005\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Introduction: Artificial Intelligence (AI) is redefining healthcare, with Large Language Models (LLMs) like ChatGPT offering novel and powerful capabilities in processing and generating human-like information. These advancements offer potential improvements in Women's Health, particularly Obstetrics and Gynaecology (O&G), where diagnostic and treatment gaps have long existed. Despite its generalist nature, ChatGPT is increasingly being tested in healthcare, necessitating a critical analysis of its utility, limitations and safety. This study examines ChatGPT's performance in interpreting and responding to international gold standard benchmark assessments in O&G: the RCOG's MRCOG Part One and Two examinations. We evaluate ChatGPT's domain- and knowledge area-specific accuracy, the influence of linguistic complexity on performance and its self-assessment confidence and uncertainty, essential for safe clinical decision-making. Methods: A dataset of MRCOG examination questions from sources beyond the reach of LLMs was developed to mitigate the risk of ChatGPT's prior exposure. A dual-review process validated the technical and clinical accuracy of the questions, omitting those dependent on previous content, duplicates, or requiring image interpretation. Single Best Answer (SBA) and Extended Matching (EMQ) Questions were converted to JSON format to facilitate ChatGPT's interpretation, incorporating question types and background information. Interaction with ChatGPT was conducted via OpenAI's API, structured to ensure consistent, contextually informed responses from ChatGPT. The response from ChatGPT was recorded and compared against the known accurate response. Linguistic complexity was evaluated using unique token counts and Type-Token ratios (vocabulary breadth and diversity) to explore their influence on performance. ChatGPT was instructed to assign confidence scores to its answers (0-100%), reflecting its self-perceived accuracy. 
Responses were categorized by correctness and statistically analysed through entropy calculation, assessing ChatGPT's capacity for self-evaluating certainty and knowledge boundaries. Findings: Of 1,824 MRCOG Part One and Two questions, ChatGPT's accuracy on MRCOG Part One was 72.2% (95% CI 69.2-75.3). For Part Two, it achieved 50.4% accuracy (95% CI 47.2-53.5) with 534 correct out of 989 questions, performing better on SBAs (54.0%, 95% CI 50.0-58.0) than on EMQs (45.0%, 95% CI 40.1-49.9). In domain-specific performance, the highest accuracy was in Biochemistry (79.8%, 95% CI 71.4-88.1) and the lowest in Biophysics (51.4%, 95% CI 35.2-67.5). The best-performing subject in Part Two was Urogynaecology (63.0%, 95% CI 50.1-75.8) and the worst was Management of Labour (35.6%, 95% CI 21.6-49.5). Linguistic complexity analysis showed a marginal increase in unique token count for correct answers in Part One (median 122, IQR 114-134) compared to incorrect (median 120, IQR 112-131, p=0.05). TTR analysis revealed higher medians for correct answers with negligible effect sizes (Part One: 0.66, IQR 0.63-0.68; Part Two: 0.62, IQR 0.57-0.67) and p-values <0.001. Regarding self-assessed confidence, the median confidence for correct answers was 70.0% (IQR 60-90), the same as for incorrect choices identified as correct (p<0.001). For correct answers deemed incorrect, the median confidence was 10.0% (IQR 0-10), and for incorrect answers accurately identified, it was 5.0% (IQR 0-10, p<0.001). Entropy values were identical for correct and incorrect responses (median 1.46, IQR 0.44-1.77), indicating no discernible distinction in ChatGPT's prediction certainty. Conclusions: ChatGPT demonstrated commendable accuracy in basic medical queries on the MRCOG Part One, yet its performance was markedly reduced in the clinically demanding Part Two exam. The model's high self-confidence across correct and incorrect responses necessitates scrutiny for its application in clinical decision-making. These findings suggest that while ChatGPT has potential, its current form requires significant refinement before it can enhance diagnostic efficacy and clinical workflow in women's health.\",\"PeriodicalId\":501409,\"journal\":{\"name\":\"medRxiv - Obstetrics and Gynecology\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv - Obstetrics and Gynecology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.02.27.23300005\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Obstetrics and Gynecology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.02.27.23300005","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: Artificial Intelligence (AI) is redefining healthcare, with Large Language Models (LLMs) like ChatGPT offering novel and powerful capabilities in processing and generating human-like information. These advancements offer potential improvements in Women's Health, particularly Obstetrics and Gynaecology (O&G), where diagnostic and treatment gaps have long existed. Despite its generalist nature, ChatGPT is increasingly being tested in healthcare, necessitating a critical analysis of its utility, limitations and safety. This study examines ChatGPT's performance in interpreting and responding to international gold standard benchmark assessments in O&G: the RCOG's MRCOG Part One and Part Two examinations. We evaluate ChatGPT's domain- and knowledge-area-specific accuracy, the influence of linguistic complexity on performance, and its self-assessed confidence and uncertainty, which are essential for safe clinical decision-making.

Methods: A dataset of MRCOG examination questions from sources beyond the reach of LLMs was developed to mitigate the risk of ChatGPT's prior exposure. A dual-review process validated the technical and clinical accuracy of the questions, omitting those dependent on previous content, duplicates, or questions requiring image interpretation. Single Best Answer (SBA) and Extended Matching Questions (EMQs) were converted to JSON format, incorporating question type and background information, to facilitate ChatGPT's interpretation. Interaction with ChatGPT was conducted via OpenAI's API, structured to ensure consistent, contextually informed responses. Each response was recorded and compared against the known correct answer. Linguistic complexity was evaluated using unique token counts and Type-Token Ratios (vocabulary breadth and diversity) to explore their influence on performance. ChatGPT was instructed to assign confidence scores (0-100%) to its answers, reflecting its self-perceived accuracy. Responses were categorised by correctness and statistically analysed through entropy calculation, assessing ChatGPT's capacity to self-evaluate certainty and its knowledge boundaries.
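To make the Methods concrete, below is a minimal sketch of how a JSON-encoded SBA question might be posed to ChatGPT through OpenAI's Python API, with a request for an answer and a self-assessed confidence score. The field names, prompt wording, model choice and decoding settings here are illustrative assumptions rather than the authors' exact configuration.

```python
# Illustrative sketch only: a JSON-encoded SBA question sent via the OpenAI API,
# asking for the chosen answer plus a 0-100% confidence score. Requires
# openai>=1.0 and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical JSON encoding of one Single Best Answer (SBA) question.
question = {
    "type": "SBA",
    "background": "A 32-year-old woman presents at 28 weeks' gestation with ...",
    "question": "What is the most appropriate next step in management?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "...", "E": "..."},
}

system_prompt = (
    "You are sitting the MRCOG examination. Read the JSON-encoded question, "
    "choose the single best answer, and give a confidence score from 0 to 100. "
    'Reply only with JSON of the form {"answer": "<letter>", "confidence": <number>}.'
)

response = client.chat.completions.create(
    model="gpt-4",      # assumed model; the study refers to it only as "ChatGPT"
    temperature=0,      # assumed setting, chosen here for repeatable marking
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": json.dumps(question)},
    ],
)

# In practice the reply may need more defensive parsing than shown here.
reply = json.loads(response.choices[0].message.content)
print(reply["answer"], reply["confidence"])
```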
Findings: Of 1,824 MRCOG Part One and Part Two questions, ChatGPT's accuracy on Part One was 72.2% (95% CI 69.2-75.3). On Part Two it achieved 50.4% accuracy (95% CI 47.2-53.5), answering 534 of 989 questions correctly, and performed better on SBAs (54.0%, 95% CI 50.0-58.0) than on EMQs (45.0%, 95% CI 40.1-49.9). In domain-specific performance, accuracy was highest in Biochemistry (79.8%, 95% CI 71.4-88.1) and lowest in Biophysics (51.4%, 95% CI 35.2-67.5). The best-performing subject in Part Two was Urogynaecology (63.0%, 95% CI 50.1-75.8) and the worst was Management of Labour (35.6%, 95% CI 21.6-49.5). Linguistic complexity analysis showed a marginal increase in unique token count for correct answers in Part One (median 122, IQR 114-134) compared with incorrect answers (median 120, IQR 112-131; p=0.05). TTR analysis revealed higher medians for correct answers, with negligible effect sizes (Part One: 0.66, IQR 0.63-0.68; Part Two: 0.62, IQR 0.57-0.67) and p-values <0.001. Regarding self-assessed confidence, the median confidence for correct answers was 70.0% (IQR 60-90), the same as for incorrect answers judged correct (p<0.001). For correct answers judged incorrect, the median confidence was 10.0% (IQR 0-10), and for incorrect answers accurately identified as such, it was 5.0% (IQR 0-10; p<0.001). Entropy values were identical for correct and incorrect responses (median 1.46, IQR 0.44-1.77), indicating no discernible distinction in ChatGPT's prediction certainty.

Conclusions: ChatGPT demonstrated commendable accuracy on the basic medical questions of MRCOG Part One, yet its performance was markedly reduced on the clinically demanding Part Two examination. The model's high self-confidence across both correct and incorrect responses calls for scrutiny before it is applied to clinical decision-making. These findings suggest that while ChatGPT has potential, its current form requires significant refinement before it can enhance diagnostic efficacy and clinical workflow in women's health.
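For illustration only, the linguistic-complexity and uncertainty measures referred to above (unique token count, Type-Token Ratio and entropy) can be computed along the following lines. Whitespace tokenisation and treating a normalised confidence spread over the answer options as the distribution for the entropy calculation are assumptions of this sketch; the study's exact implementation may differ.

```python
# Illustrative sketch: unique token count, Type-Token Ratio (TTR) and Shannon
# entropy. Tokenisation and the choice of distribution are assumptions.
import math


def unique_tokens(text: str) -> int:
    """Number of distinct whitespace-delimited tokens (vocabulary breadth)."""
    return len(set(text.lower().split()))


def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens (vocabulary diversity)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(weights: list[float]) -> float:
    """Entropy in bits of a normalised distribution, e.g. the confidence
    assigned to each answer option of a question."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


question_text = "A 32-year-old woman presents at 28 weeks' gestation with ..."
print(unique_tokens(question_text), round(type_token_ratio(question_text), 2))

# Hypothetical confidence spread over the five options (A-E) of an SBA question.
print(round(shannon_entropy([70, 10, 10, 5, 5]), 2))
```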