Maria Camila Villa, Natalia Castano-Villegas, Isabella Llano, Julian Martinez, Maria Fernanda Guevara, Jose Zea, Laura Velásquez
Intelligence-Based Medicine, Volume 12, Article 100274. Published 2025-01-01. DOI: 10.1016/j.ibmed.2025.100274. Available at: https://www.sciencedirect.com/science/article/pii/S266652122500078X
Arkangel AI: A conversational agent for real-time, evidence-based medical question-answering
Introduction
Large Language Models (LLMs) have been trained and tested on several medical question-answering (QA) datasets, built from medical licensing exams and from natural doctor-patient interactions, in order to fine-tune them for specific health-related tasks.
Objective
We aimed to develop LLM-powered Conversational Agents (CAs) equipped to produce fast, accurate, and real-time responses to medical queries in different clinical and scientific scenarios. This paper presents Arkangel AI, our first conversational agent and research assistant.
Methods
The model is based on a system of five LLMs; each is assigned to a specific workflow with pre-defined instructions to produce the best search strategy and provide evidence-based answers. We assessed accuracy, intra- and inter-class variability, and Cohen's kappa on the MedQA question-answering dataset. Additionally, we used the PubMedQA dataset and assessed both datasets with the RAGAS framework, including its Context Relevance, Response Relevance, and Faithfulness metrics. Traditional statistical analysis was performed with hypothesis tests and 95 % confidence intervals (CIs).
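As a point of reference for the agreement metrics named above (this is not the authors' evaluation code), accuracy and Cohen's kappa over multiple-choice answer keys can be sketched as follows; the answer lists are hypothetical:

```python
from collections import Counter

def accuracy(pred, gold):
    """Fraction of predicted answers matching the reference key."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(pred, gold):
    """Chance-corrected agreement between predicted and gold labels."""
    n = len(gold)
    po = sum(p == g for p, g in zip(pred, gold)) / n  # observed agreement
    pc, gc = Counter(pred), Counter(gold)
    # expected agreement if the two label distributions were independent
    pe = sum(pc[k] * gc[k] for k in set(pc) | set(gc)) / (n * n)
    return (po - pe) / (1 - pe)

# hypothetical MedQA-style A-D answer keys
gold = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
pred = ["A", "B", "C", "D", "A", "B", "C", "A", "A", "C"]
print(round(accuracy(pred, gold), 2))       # → 0.8
print(round(cohens_kappa(pred, gold), 2))   # → 0.73
```

Kappa discounts the agreement expected by chance from the marginal label frequencies, which is why it is lower than raw accuracy here.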
Results
Accuracy on MedQA (n = 1273) was 90.26 % and Cohen's kappa was 87 %, surpassing current state-of-the-art results for other LLMs (GPT-4o, MedPaLM2). The model retrieved 80 % of the expected articles and provided relevant answers for 82 % of PubMedQA questions.
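For context on how a 95 % CI around an accuracy like the one reported here can be obtained, below is a sketch using the Wilson score interval, a common choice for binomial proportions (the paper does not state which interval method was used, so this is illustrative only):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 → 95 %)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 90.26 % accuracy on n = 1273 MedQA items (figures from the Results)
lo, hi = wilson_ci(round(0.9026 * 1273), 1273)
print(f"{lo:.4f}, {hi:.4f}")  # prints approximately 0.8851, 0.9177
```

With n = 1273 the interval is fairly tight (roughly ±1.6 percentage points), which supports comparing the reported accuracy against other models' published scores.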
Conclusion
Arkangel AI showed proficient retrieval and reasoning abilities and unbiased responses. Evenly distributed medical QA datasets for training improved LLMs, as well as external validation of the model with physicians in real-world clinical scenarios, are still needed. Clinical decision-making remains in the hands of trained healthcare professionals.