Towards Community-Based Evaluation of AI in Neurology: Development of a Headache Diagnosis Dataset for Large Language Models.

Studies in Health Technology and Informatics. Pub Date: 2025-10-02. DOI: 10.3233/SHTI251535

Anika Zahn, Sebastian Strauss, Dorian Zwanzig

Volume 332, pages 237-241. Publication type: Journal Article.


Abstract

Diagnosing headache disorders remains a clinical challenge due to the heterogeneity of headache phenotypes and the absence of objective biomarkers. This study presents a curated dataset of 50 clinical headache case examples, comprising both real (n = 34) and synthetic (n = 16) cases, categorized across 20 diagnoses according to ICHD-3 criteria. The dataset enables the evaluation of large language models (LLMs) for diagnostic accuracy in headache medicine. Three GPT-based models were tested using different prompting strategies, with diagnostic performance assessed at both diagnosis and group levels. Top-1 accuracy ranged from 24% to 63% at the diagnosis level and up to 92% at the group level. The results highlight the potential of LLMs in supporting differential diagnosis of headache disorders, while also emphasizing the need for further validation with larger, diverse datasets. Future efforts will focus on expanding real-world data through clinical collaborations and benchmarking LLMs against medical professionals to assess their utility in clinical decision-making.
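To make the two reporting levels concrete, the following is a minimal sketch of how top-1 accuracy could be scored at the diagnosis level and at the group level. It is not the authors' evaluation code. It assumes that each case carries an ICHD-3 code as its reference diagnosis, that the model's first-ranked answer is also an ICHD-3 code, and that "group" means the top-level ICHD-3 chapter (the part of the code before the first dot). The function names and the four toy cases are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def ichd_group(code: str) -> str:
    """Return the top-level ICHD-3 group of a code, e.g. '1.2' -> '1'.

    Assumption: 'group level' in the abstract refers to this top-level chapter.
    """
    return code.split(".")[0]


def top1_accuracy(gold: Sequence[str], predicted: Sequence[str],
                  level: str = "diagnosis") -> float:
    """Fraction of cases whose first-ranked prediction matches the reference.

    level='diagnosis' compares full codes; level='group' compares only the
    top-level ICHD-3 group of each code.
    """
    if level not in ("diagnosis", "group"):
        raise ValueError("level must be 'diagnosis' or 'group'")
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one case is required")

    # Choose how codes are normalised before comparison.
    key = ichd_group if level == "group" else (lambda code: code)
    hits = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


# Hypothetical toy cases (illustrative only, not from the published dataset).
gold = ["1.1", "1.2", "2.1", "3.1"]        # reference diagnoses
predicted = ["1.1", "1.1", "2.2", "4.1"]   # model's top-ranked answers

diagnosis_acc = top1_accuracy(gold, predicted, level="diagnosis")
group_acc = top1_accuracy(gold, predicted, level="group")
print(f"diagnosis-level top-1: {diagnosis_acc:.2f}")
print(f"group-level top-1: {group_acc:.2f}")
```

In this toy example the group-level score is higher than the diagnosis-level score, because a prediction of the wrong subtype within the correct group still counts as a hit at the group level. The gap between the two levels reported in the abstract is consistent with this kind of scoring, though the authors' exact mapping may differ.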
