基于循证患者教育的代理法学硕士RAG框架的开发和评估。

IF 4.4 Q1 HEALTH CARE SCIENCES & SERVICES
AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ
{"title":"基于循证患者教育的代理法学硕士RAG框架的开发和评估。","authors":"AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ","doi":"10.1136/bmjhci-2025-101570","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content.</p><p><strong>Methods: </strong>We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.</p><p><strong>Results: </strong>ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.</p><p><strong>Discussion: </strong>Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.</p><p><strong>Conclusion: </strong>ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.</p>","PeriodicalId":9050,"journal":{"name":"BMJ Health & Care Informatics","volume":"32 1","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education.\",\"authors\":\"AlHasan AlSammarraie, Ali Al-Saifi, Hassan Kamhia, Mohamed Aboagla, Mowafa Househ\",\"doi\":\"10.1136/bmjhci-2025-101570\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content.</p><p><strong>Methods: </strong>We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.</p><p><strong>Results: </strong>ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.</p><p><strong>Discussion: </strong>Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.</p><p><strong>Conclusion: </strong>ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.</p>\",\"PeriodicalId\":9050,\"journal\":{\"name\":\"BMJ Health & Care Informatics\",\"volume\":\"32 1\",\"pages\":\"\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMJ Health & Care Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1136/bmjhci-2025-101570\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ Health & Care Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjhci-2025-101570","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

摘要

目的:开发和评估使用开源大型语言模型(llm)生成基于证据的阿拉伯患者教育材料(PEMs)的代理检索增强生成(ARAG)框架,并评估llm作为阻止有害内容的验证代理的能力。方法:选取12个llm,采用基础、基础+提示工程、ARAG、ARAG+提示工程4种实验设置。PEM生成质量通过两阶段评估(自动化LLM,然后是专家评审)进行评估,使用5个指标(准确性、可读性、全面性、适当性和安全性)来评估接地事实。使用有害/安全PEM数据集分别评估验证剂(VA)的性能,测量阻塞准确性。结果:启用arag的设置为10/12 llm产生了最佳的生成性能。以阿拉伯语为主的模特占据了前9名。专家评估排名反映了自动排名。采用ARAG和快速工程(安装4)的AceGPT-v2-32B被证实性能最好。VA精度与模型尺寸密切相关;只有参数≥27B的模型精度达到>.80。Fanar-7B在生成中表现良好,但作为va表现不佳。讨论:以阿拉伯语为中心的模型展示了阿拉伯语PEM生成任务的优势。ARAG提高了生成质量,尽管上下文限制影响了大上下文模型。验证任务强调模型大小是可靠性能的关键。结论:ARAG显著改善阿拉伯语PEM生成,特别是以阿拉伯语为中心的模型,如AceGPT-v2-32B。更大的模型对于可靠的有害内容验证似乎是必要的。自动评估显示了排名系统的潜力,与专家对最佳表现的判断保持一致。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education.

Objectives: To develop and evaluate an agentic retrieval augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs) and assess the LLMs capabilities as validation agents tasked with blocking harmful content.

Methods: We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via two-stage evaluation (automated LLM, then expert review) using 5 metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately using a harmful/safe PEM dataset, measuring blocking accuracy.

Results: ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.

Discussion: Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.

Conclusion: ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
6.10
自引率
4.90%
发文量
40
审稿时长
18 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信