Towards Community-Based Evaluation of AI in Neurology: Development of a Headache Diagnosis Dataset for Large Language Models
Anika Zahn, Sebastian Strauss, Dorian Zwanzig
Studies in Health Technology and Informatics, vol. 332, pp. 237-241. Journal Article, published 2025-10-02. DOI: https://doi.org/10.3233/SHTI251535
Citations: 0
Abstract
Diagnosing headache disorders remains a clinical challenge due to the heterogeneity of headache phenotypes and the absence of objective biomarkers. This study presents a curated dataset of 50 clinical headache case examples, comprising both real (n = 34) and synthetic (n = 16) cases, categorized across 20 diagnoses according to ICHD-3 criteria. The dataset enables the evaluation of large language models (LLMs) for diagnostic accuracy in headache medicine. Three GPT-based models were tested using different prompting strategies, with diagnostic performance assessed at both the diagnosis level and the group level. Top-1 accuracy ranged from 24% to 63% at the diagnosis level and reached up to 92% at the group level. The results highlight the potential of LLMs in supporting differential diagnosis of headache disorders, while also emphasizing the need for further validation with larger, more diverse datasets. Future efforts will focus on expanding real-world data through clinical collaborations and benchmarking LLMs against medical professionals to assess their utility in clinical decision-making.
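The two evaluation levels mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each case has one reference ICHD-3 code and one top-ranked model prediction, and that the "group" is the top-level ICHD-3 chapter (the number before the first dot); the abstract does not state how groups were defined, so this mapping is an assumption. The function names and the four toy cases are hypothetical and are not taken from the dataset.

```python
from typing import Callable, Sequence


def ichd_group(code: str) -> str:
    """Return the top-level ICHD-3 chapter of a code, e.g. '1.2' -> '1'.

    Assumption: group-level scoring compares top-level chapters only.
    """
    return code.split(".")[0]


def top1_accuracy(
    gold: Sequence[str],
    predicted: Sequence[str],
    key: Callable[[str], str] = lambda code: code,
) -> float:
    """Fraction of cases whose top-ranked prediction matches the reference.

    `key` maps a code to the level being compared: the identity function
    for diagnosis-level scoring, `ichd_group` for group-level scoring.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one case is required")
    hits = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


# Toy, made-up cases (not from the paper's dataset).
gold = ["1.1", "1.2", "2.1", "3.1"]
pred = ["1.1", "1.1", "2.1", "1.2"]

diagnosis_acc = top1_accuracy(gold, pred)               # exact code must match
group_acc = top1_accuracy(gold, pred, key=ichd_group)   # only the chapter must match

print(f"diagnosis-level top-1 accuracy: {diagnosis_acc:.2f}")
print(f"group-level top-1 accuracy:     {group_acc:.2f}")
```

In this toy example the second case is wrong at the diagnosis level (1.1 predicted for 1.2) but correct at the group level, which shows why group-level accuracy is never lower than diagnosis-level accuracy under this scoring scheme.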