Automated analysis of the AASM Inter-Scorer Reliability gold standard polysomnogram dataset.

Impact Factor 2.9 · CAS Medicine Zone 3 · JCR Q1 (Clinical Neurology)
Ayush Tripathi, Samaneh Nasiri, Wolfgang Ganglberger, Thijs Nassi, Erik-Jan Meulenbrugge, Haoqi Sun, Katie L Stone, Emmanuel Mignot, Dennis Hwang, Lynn Marie Trotti, Matthew A Reyna, Gari D Clifford, Umakanth Katwa, Robert J Thomas, M Brandon Westover
Journal of Clinical Sleep Medicine · DOI: 10.5664/jcsm.11848 · Published: 2025-08-12
Citations: 0

Abstract


Study objectives: To compare the performance of a comprehensive automated polysomnogram (PSG) analysis algorithm, CAISR (Complete Artificial Intelligence Sleep Report), against a multi-expert gold standard panel, crowdsourced scorers, and experienced technicians on sleep staging and on detection of arousals, respiratory events, and limb movements.

Methods: A benchmark dataset of 57 PSG records (the Inter-Scorer Reliability dataset), with 200 30-second epochs scored according to AASM guidelines, was used. Annotations were obtained from (1) the AASM multi-expert gold standard panel, (2) AASM Inter-Scorer Reliability (ISR) platform users ("crowd," averaging 6,818 raters per epoch), (3) three experienced technicians, and (4) CAISR. Agreement was assessed via Cohen's kappa (κ) and percent agreement.
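The two agreement metrics used in the study can be sketched in a few lines. This is a minimal illustration, not the study's analysis code; the epoch labels below are invented for demonstration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of epochs where two scorers assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = percent_agreement(a, b)                     # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each scorer's marginal label frequencies.
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical sleep-stage labels (W, N1, N2, N3, R) for 10 epochs.
scorer_1 = ["W", "W", "N1", "N2", "N2", "N2", "N3", "N3", "R", "R"]
scorer_2 = ["W", "N1", "N1", "N2", "N2", "N3", "N3", "N3", "R", "W"]

print(percent_agreement(scorer_1, scorer_2))  # 0.7 (7 of 10 epochs match)
print(cohens_kappa(scorer_1, scorer_2))
```

Kappa is lower than raw percent agreement whenever some agreement is expected by chance alone, which is why the paper's limb-movement result can pair 94.89% agreement with κ = 0.11: almost all epochs contain no limb movement, so chance agreement is high.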

Results: Across tasks, CAISR performed comparably to experienced technicians but did not match the consensus-level agreement between the multi-expert gold standard and the crowd. For sleep staging, CAISR's agreement with the multi-expert gold standard was 82.1% (κ = 0.70), comparable to experienced technicians but below the crowd (κ = 0.88). Arousal detection showed 87.81% agreement (κ = 0.45), respiratory event detection 83.18% (κ = 0.34), and limb movement detection 94.89% (κ = 0.11), each comparable to experienced technicians but trailing crowd agreement (κ = 0.83, 0.78, and 0.86 for arousals, respiratory events, and limb movements, respectively).

Conclusions: CAISR achieves experienced technician-level accuracy for PSG scoring tasks but does not surpass the consensus-level agreement of a multi-expert gold standard or the crowd. These findings highlight the potential of automated scoring to match experienced technician-level performance while emphasizing the value of multi-rater consensus.

Source journal: Journal of Clinical Sleep Medicine
CiteScore: 6.20
Self-citation rate: 7.00%
Annual article count: 321
Review time: 1 month
Journal description: Journal of Clinical Sleep Medicine focuses on clinical sleep medicine. Its emphasis is publication of papers with direct applicability and/or relevance to the clinical practice of sleep medicine. This includes clinical trials, clinical reviews, clinical commentary and debate, medical economic/practice perspectives, case series and novel/interesting case reports. In addition, the journal will publish proceedings from conferences, workshops and symposia sponsored by the American Academy of Sleep Medicine or other organizations related to improving the practice of sleep medicine.