Advancing Korean Medical Large Language Models: Automated Pipeline for Korean Medical Preference Dataset Construction
Jean Seo, Sumin Park, Sungjoo Byun, Jinwook Choi, Jinho Choi, Hyopil Shin
Healthcare Informatics Research, 31(2), 166-174. doi: 10.4258/hir.2025.31.2.166
Abstract
Objectives: Developing large language models (LLMs) in biomedicine requires access to high-quality training and alignment tuning datasets. However, publicly available Korean medical preference datasets are scarce, hindering the advancement of Korean medical LLMs. This study constructs the Korean Medical Preference Dataset (KoMeP), an alignment tuning dataset built with an automated pipeline that minimizes the high cost of human annotation, and evaluates its efficacy.
Methods: KoMeP was generated using the DAHL score, an automated hallucination evaluation metric. Five LLMs (Dolly-v2-3B, MPT-7B, GPT-4o, Qwen-2-7B, Llama-3-8B) produced responses to 8,573 biomedical examination questions, from which 5,551 preference pairs were extracted. Each pair consisted of a "chosen" response and a "rejected" response, as determined by their DAHL scores. The dataset was evaluated by training five different models with each of two alignment tuning methods: direct preference optimization (DPO) and odds ratio preference optimization (ORPO). The KorMedMCQA benchmark was employed to assess the effectiveness of alignment tuning.
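As a rough illustration of the pair-extraction step described above, the sketch below shows how a "chosen"/"rejected" pair could be selected from several model responses to the same question. The `dahl_score` helper and the convention that a higher score indicates fewer hallucinations are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of preference-pair extraction from DAHL-scored responses.
# `dahl_score` is a hypothetical stand-in for the DAHL hallucination metric;
# "higher score = fewer hallucinations" is an assumed convention here.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    prompt: str    # the biomedical examination question
    chosen: str    # response with the better DAHL score
    rejected: str  # response with the worse DAHL score


def build_pair(question: str,
               responses: list[str],
               dahl_score: Callable[[str], float]) -> Optional[PreferencePair]:
    """Pick the best- and worst-scoring responses to one question as a pair."""
    if len(responses) < 2:
        return None
    ranked = sorted(responses, key=dahl_score, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if dahl_score(chosen) == dahl_score(rejected):
        return None  # no clear preference signal; drop this question
    return PreferencePair(prompt=question, chosen=chosen, rejected=rejected)
```

Triples in this prompt/chosen/rejected form match the format commonly expected by preference-optimization trainers (for example, the DPOTrainer in Hugging Face's TRL library), which could then carry out the DPO or ORPO stage.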
Results: Models trained with DPO consistently improved KorMedMCQA performance; notably, Llama-3.1-8B showed a 43.96% increase. In contrast, ORPO training produced inconsistent results. Additionally, English-to-Korean transfer learning proved effective, particularly for English-centric models like Gemma-2, whereas Korean-to-English transfer learning achieved limited success. Instruction tuning with KoMeP yielded mixed outcomes, which suggests challenges in dataset formatting.
Conclusions: KoMeP is the first publicly available Korean medical preference dataset and significantly improves alignment tuning performance in LLMs. The DPO method outperforms ORPO in alignment tuning. Future work should focus on expanding KoMeP, developing a Korean-native dataset, and refining alignment tuning methods to produce safer and more reliable Korean medical LLMs.