Large Language Models for Pre-mediation Counseling in Medical Disputes: A Comparative Evaluation against Human Experts

Min Seo Kim, Jung Su Lee, Hyuna Bae

Healthcare Informatics Research, 31(2):200-208, published 2025-04-01 (Epub 2025-04-30). DOI: https://doi.org/10.4258/hir.2025.31.2.200
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086436/pdf/
Abstract
Objectives: Assessing medical disputes requires both medical and legal expertise, presenting challenges for patients seeking clarity regarding potential malpractice claims. This study aimed to develop and evaluate a chatbot based on a chain-of-thought pipeline using a large language model (LLM) for providing medical dispute counseling and compare its performance with responses from human experts.
Methods: Retrospective counseling cases (n = 279) were collected from the Korea Medical Dispute Mediation and Arbitration Agency's website, from which 50 cases were randomly selected as a validation dataset. The Claude 3.5 Sonnet model processed each counseling request through a five-step chain-of-thought pipeline. Thirty-eight experts evaluated the chatbot's responses against the original human expert responses, rating them across four dimensions on a 5-point Likert scale. Statistical analyses were conducted using Wilcoxon signed-rank tests.
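The paired comparison described above (chatbot vs. human expert ratings on the same cases) can be illustrated with a minimal pure-Python sketch of the Wilcoxon signed-rank statistic; the rating pairs below are hypothetical examples, not the study's data:

```python
def wilcoxon_w(paired):
    """Return the Wilcoxon signed-rank statistic min(W+, W-) for paired ratings.

    paired: list of (score_a, score_b) tuples, e.g. 5-point Likert ratings
    of a chatbot response and the matched human expert response.
    """
    # Compute differences, dropping zero differences per the standard procedure.
    diffs = [a - b for a, b in paired if a != b]
    # Order indices by |difference| to assign ranks.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied |differences| and give them the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical (chatbot, expert) Likert pairs for five cases:
print(wilcoxon_w([(5, 3), (4, 4), (5, 2), (3, 4), (4, 3)]))  # → 1.5
```

A small observed statistic relative to the number of non-zero pairs indicates a consistent direction of difference; in practice one would use a library routine (e.g. `scipy.stats.wilcoxon`) that also returns the p-value.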
Results: The chatbot significantly outperformed human experts in quality of information (p < 0.001), understanding and reasoning (p < 0.001), and overall satisfaction (p < 0.001). It also demonstrated a stronger tendency to produce opinion-driven content (p < 0.001). Despite generally high scores, evaluators noted specific instances where the chatbot encountered difficulties.
Conclusions: A chain-of-thought-based LLM chatbot shows promise for enhancing the quality of medical dispute counseling, outperforming human experts across key evaluation metrics. Future research should address inaccuracies resulting from legal and contextual variability, investigate patient acceptance, and further refine the chatbot's performance in domain-specific applications.