Ayush Tripathi, Samaneh Nasiri, Wolfgang Ganglberger, Thijs Nassi, Erik-Jan Meulenbrugge, Haoqi Sun, Katie L Stone, Emmanuel Mignot, Dennis Hwang, Lynn Marie Trotti, Matthew A Reyna, Gari D Clifford, Umakanth Katwa, Robert J Thomas, M Brandon Westover
{"title":"自动分析AASM内部评分者可靠性金标准多导睡眠图数据集。","authors":"Ayush Tripathi, Samaneh Nasiri, Wolfgang Ganglberger, Thijs Nassi, Erik-Jan Meulenbrugge, Haoqi Sun, Katie L Stone, Emmanuel Mignot, Dennis Hwang, Lynn Marie Trotti, Matthew A Reyna, Gari D Clifford, Umakanth Katwa, Robert J Thomas, M Brandon Westover","doi":"10.5664/jcsm.11848","DOIUrl":null,"url":null,"abstract":"<p><strong>Study objectives: </strong>To compare the performance of a comprehensive automated polysomnogram (PSG) analysis algorithm-CAISR (Complete Artificial Intelligence Sleep Report)-to a multi-expert gold standard panel, crowdsourced scorers, and experienced technicians for sleep staging and detecting arousals, respiratory events, and limb movements.</p><p><strong>Methods: </strong>A benchmark dataset of 57 PSG records (Inter-Scorer Reliability dataset) with 200 30-second epochs scored per AASM guidelines was used. Annotations were obtained from (1) the AASM multi-expert gold standard panel, (2) AASM Inter-Scorer Reliability (ISR) platform users (\"crowd,\" averaging 6,818 raters per epoch), (3) three experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's Kappa (κ) and percent agreement.</p><p><strong>Results: </strong>Across tasks, CAISR achieved performance comparable to experienced technicians but did not match consensus-level agreement between the multi-expert gold standard and the crowd. For sleep staging, CAISR's agreement with multi-expert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). 
Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% (κ = 0.34), and limb movement detection 94.89% (κ = 0.11), each aligning with performance equivalent to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78 and 0.86 for detection of arousal, respiratory events and limb movements respectively).</p><p><strong>Conclusions: </strong>CAISR achieves experienced technician-level accuracy for PSG scoring tasks but does not surpass the consensus-level agreement of a multi-expert gold standard or the crowd. These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multi-rater consensus.</p>","PeriodicalId":50233,"journal":{"name":"Journal of Clinical Sleep Medicine","volume":" ","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automated analysis of the AASM Inter-Scorer Reliability gold standard polysomnogram dataset.\",\"authors\":\"Ayush Tripathi, Samaneh Nasiri, Wolfgang Ganglberger, Thijs Nassi, Erik-Jan Meulenbrugge, Haoqi Sun, Katie L Stone, Emmanuel Mignot, Dennis Hwang, Lynn Marie Trotti, Matthew A Reyna, Gari D Clifford, Umakanth Katwa, Robert J Thomas, M Brandon Westover\",\"doi\":\"10.5664/jcsm.11848\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Study objectives: </strong>To compare the performance of a comprehensive automated polysomnogram (PSG) analysis algorithm-CAISR (Complete Artificial Intelligence Sleep Report)-to a multi-expert gold standard panel, crowdsourced scorers, and experienced technicians for sleep staging and detecting arousals, respiratory events, and limb movements.</p><p><strong>Methods: </strong>A benchmark dataset of 57 PSG records (Inter-Scorer Reliability dataset) with 200 30-second epochs scored per AASM guidelines was used. 
Annotations were obtained from (1) the AASM multi-expert gold standard panel, (2) AASM Inter-Scorer Reliability (ISR) platform users (\\\"crowd,\\\" averaging 6,818 raters per epoch), (3) three experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's Kappa (κ) and percent agreement.</p><p><strong>Results: </strong>Across tasks, CAISR achieved performance comparable to experienced technicians but did not match consensus-level agreement between the multi-expert gold standard and the crowd. For sleep staging, CAISR's agreement with multi-expert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% (κ = 0.34), and limb movement detection 94.89% (κ = 0.11), each aligning with performance equivalent to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78 and 0.86 for detection of arousal, respiratory events and limb movements respectively).</p><p><strong>Conclusions: </strong>CAISR achieves experienced technician-level accuracy for PSG scoring tasks but does not surpass the consensus-level agreement of a multi-expert gold standard or the crowd. 
These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multi-rater consensus.</p>\",\"PeriodicalId\":50233,\"journal\":{\"name\":\"Journal of Clinical Sleep Medicine\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-08-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Clinical Sleep Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.5664/jcsm.11848\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Clinical Sleep Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5664/jcsm.11848","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Automated analysis of the AASM Inter-Scorer Reliability gold standard polysomnogram dataset.
Study objectives: To compare the performance of a comprehensive automated polysomnogram (PSG) analysis algorithm, CAISR (Complete Artificial Intelligence Sleep Report), to a multi-expert gold standard panel, crowdsourced scorers, and experienced technicians for sleep staging and detecting arousals, respiratory events, and limb movements.
Methods: A benchmark dataset of 57 PSG records (Inter-Scorer Reliability dataset) with 200 30-second epochs scored per AASM guidelines was used. Annotations were obtained from (1) the AASM multi-expert gold standard panel, (2) AASM Inter-Scorer Reliability (ISR) platform users ("crowd," averaging 6,818 raters per epoch), (3) three experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's Kappa (κ) and percent agreement.
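The two agreement metrics named above can be made concrete with a short sketch. This is illustrative only (not the study's code or data): it computes percent agreement and Cohen's kappa between two hypothetical scorers' epoch-level sleep stage labels, where kappa corrects observed agreement for the agreement expected by chance from each scorer's label frequencies.

```python
# Illustrative sketch: percent agreement and Cohen's kappa between two
# scorers' per-epoch sleep stage labels. The label sequences below are
# hypothetical, not taken from the ISR dataset.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of epochs on which the two scorers assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    # Expected chance agreement from each scorer's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical stage labels (W, N1, N2, N3, R) for eight 30-second epochs.
scorer1 = ["W", "N1", "N2", "N2", "N3", "R", "N2", "W"]
scorer2 = ["W", "N2", "N2", "N2", "N3", "R", "N1", "W"]
print(round(percent_agreement(scorer1, scorer2), 3))  # 0.75
print(round(cohens_kappa(scorer1, scorer2), 3))       # 0.667
```

Note how kappa (0.667) sits below raw agreement (0.75): for rare events such as limb movements, a scorer can reach high percent agreement by chance alone, which is why the Results report both metrics.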
Results: Across tasks, CAISR achieved performance comparable to experienced technicians but did not match the consensus-level agreement between the multi-expert gold standard and the crowd. For sleep staging, CAISR's agreement with the multi-expert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% (κ = 0.34), and limb movement detection 94.89% (κ = 0.11), each comparable to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78, and 0.86 for arousals, respiratory events, and limb movements, respectively).
Conclusions: CAISR achieves experienced technician-level accuracy for PSG scoring tasks but does not surpass the consensus-level agreement of a multi-expert gold standard or the crowd. These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multi-rater consensus.
About the journal:
Journal of Clinical Sleep Medicine focuses on clinical sleep medicine. Its emphasis is publication of papers with direct applicability and/or relevance to the clinical practice of sleep medicine. This includes clinical trials, clinical reviews, clinical commentary and debate, medical economic/practice perspectives, case series and novel/interesting case reports. In addition, the journal will publish proceedings from conferences, workshops and symposia sponsored by the American Academy of Sleep Medicine or other organizations related to improving the practice of sleep medicine.