Maria Camila Villa, Natalia Castano-Villegas, Isabella Llano, Julian Martinez, Maria Fernanda Guevara, Jose Zea, Laura Velásquez
Intelligence-Based Medicine, Volume 12, Article 100274. Published 2025-01-01. DOI: 10.1016/j.ibmed.2025.100274. Available at: https://www.sciencedirect.com/science/article/pii/S266652122500078X
Arkangel AI: A conversational agent for real-time, evidence-based medical question-answering
Introduction
Large Language Models (LLMs) have been trained and tested on several medical question-answering (QA) datasets, built from medical licensing exams and from natural doctor-patient interactions, in order to fine-tune them for specific health-related tasks.
Objective
We aimed to develop LLM-powered Conversational Agents (CAs) equipped to produce fast, accurate, and real-time responses to medical queries in different clinical and scientific scenarios. This paper presents Arkangel AI, our first conversational agent and research assistant.
Methods
The model is based on a system of five LLMs; each is assigned to a specific workflow with pre-defined instructions to produce the best search strategy and provide evidence-based answers. We assessed accuracy, intra- and inter-class variability, and Cohen's kappa on the MedQA question-answering dataset. Additionally, we used the PubMedQA dataset and assessed both datasets with the RAGAS framework, including its Context Relevance, Response Relevance, and Faithfulness metrics. Traditional statistical analysis was performed with hypothesis tests and 95 % confidence intervals (CIs).
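As a point of reference for the agreement metrics named above (this is not the authors' evaluation code), accuracy and Cohen's kappa over multiple-choice answer keys can be sketched as follows; the answer lists are hypothetical:

```python
from collections import Counter

def accuracy(pred, gold):
    """Fraction of predicted answers matching the reference key."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(pred, gold):
    """Chance-corrected agreement between predicted and gold labels."""
    n = len(gold)
    po = sum(p == g for p, g in zip(pred, gold)) / n  # observed agreement
    pc, gc = Counter(pred), Counter(gold)
    # expected agreement if the two label distributions were independent
    pe = sum(pc[k] * gc[k] for k in set(pc) | set(gc)) / (n * n)
    return (po - pe) / (1 - pe)

# hypothetical MedQA-style A-D answer keys
gold = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
pred = ["A", "B", "C", "D", "A", "B", "C", "A", "A", "C"]
print(round(accuracy(pred, gold), 2))       # → 0.8
print(round(cohens_kappa(pred, gold), 2))   # → 0.73
```

Kappa discounts the agreement expected by chance from the marginal label frequencies, which is why it is lower than raw accuracy here.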
Results
Accuracy on MedQA (n = 1273) was 90.26 % and Cohen's kappa was 87 %, surpassing current state-of-the-art results for other LLMs (GPT-4o, MedPaLM2). The model retrieved 80 % of the expected articles and provided relevant answers for 82 % of PubMedQA questions.
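For context on how a 95 % CI around an accuracy like the one reported here can be obtained, below is a sketch using the Wilson score interval, a common choice for binomial proportions (the paper does not state which interval method was used, so this is illustrative only):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 → 95 %)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 90.26 % accuracy on n = 1273 MedQA items (figures from the Results)
lo, hi = wilson_ci(round(0.9026 * 1273), 1273)
print(f"{lo:.4f}, {hi:.4f}")  # prints approximately 0.8851, 0.9177
```

With n = 1273 the interval is fairly tight (roughly ±1.6 percentage points), which supports comparing the reported accuracy against other models' published scores.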
Conclusion
Arkangel AI showed proficient retrieval and reasoning abilities and unbiased responses. Evenly distributed medical QA datasets for training improved LLMs, as well as external validation of the model with physicians in real-world clinical scenarios, are still needed. Clinical decision-making remains in the hands of trained healthcare professionals.