Evaluating crowdsourcing for ICU EEG annotation: A comparison with expert performance

Wan-Yee Kong, Fábio A Nascimento, Aaron Struck, Erik Duhaime, Srishti Kapur, Edilberto Amorim, Gregory Kapinos, Andres Rodriguez, Brendan Thomas, Masoom Desai, Jong Woo Lee, M Brandon Westover, Jin Jing

Epilepsia, published 2025-08-06. DOI: 10.1111/epi.18547
Abstract
Objective: Detection of seizures and rhythmic or periodic patterns (SRPPs) on electroencephalography (EEG) is crucial for the diagnosis and management of patients with neurological critical illness. Although automated detection methods are advancing, they require large, high-quality, expert-annotated datasets for training. However, expert annotation is limited by the availability of trained neurophysiologists. Crowdsourcing, or soliciting contributions from a large group of people, may present a potential solution. This study evaluates the feasibility of crowdsourcing annotations of short epochs of EEG recordings by comparing the performance of experts and non-experts in identifying six SRPPs.
Methods: We conducted an EEG scoring contest using a mobile app, involving expert and non-expert participants. Non-experts in our study included physicians (MDs), medical students, nurse practitioners (NPs), physician assistants (PAs), pharmacists, students in those professions and other healthcare fields, and others. Performance was assessed using pairwise agreement and Fleiss' kappa between experts, and by comparing the accuracy of experts and of the crowd using individual and weighted majority votes.
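The abstract names Fleiss' kappa as the inter-expert agreement statistic but does not show the computation. As a minimal illustrative sketch (not the authors' code), kappa can be computed from a matrix where each row is one EEG epoch and each column counts how many raters assigned it to a given SRPP category:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings count matrix.

    ratings[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(ratings)          # number of subjects (e.g., EEG epochs)
    n = sum(ratings[0])       # raters per subject
    k = len(ratings[0])       # number of categories (e.g., six SRPPs)

    # Observed per-subject agreement P_i, averaged over subjects
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Chance agreement P_e from the marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)
```

Kappa is 1 when all raters agree on every subject, 0 when agreement is no better than chance, and negative when it is worse than chance.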
Results: A total of 1542 participants (8 experts and 1534 non-experts) answered 478 834 questions across six SRPPs: seizures, generalized and lateralized periodic discharges (GPDs and LPDs), generalized and lateralized rhythmic delta activity (GRDA and LRDA), and "Other." Using individual, non-weighted votes, the crowd's performance was inferior to that of experts, both overall and for each of the six SRPPs. Using weighted majority votes, the crowd was non-inferior to experts for overall SRPP identification, with an accuracy of .70 (95% confidence interval [CI]: .69-.70) compared to the experts' accuracy of .68 (95% CI: .68-.70). The crowd performed comparably to or better than experts in identifying most SRPPs, except for LPDs and "Other." No individual expert outperformed the crowd on overall metrics.
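The weighted-majority result above depends on how crowd votes are aggregated. The abstract does not specify the weighting scheme; a minimal sketch, assuming each annotator carries a scalar weight (for instance, accuracy on calibration epochs) and the label with the largest total weight wins, would be:

```python
from collections import defaultdict

def weighted_majority_vote(votes, weights):
    """Aggregate crowd labels for one EEG epoch.

    votes:   {annotator_id: label} — each annotator's chosen SRPP label
    weights: {annotator_id: weight} — e.g., calibration accuracy (hypothetical)
    Returns the label with the largest total weight; unknown annotators count 0.
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += weights.get(annotator, 0.0)
    return max(totals, key=totals.get)
```

Under this scheme a single well-calibrated annotator can outvote several unreliable ones, which is what lets the weighted crowd reach expert-level accuracy while the unweighted crowd does not.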
Significance: This proof-of-concept highlights the promise of crowd reviewers for obtaining expert-level annotations of SRPPs, which could potentially accelerate the development of large, diverse datasets for training automated detection algorithms. Challenges remain to be addressed, such as the varying calibration/test splits across crowd participants in this study and the absence of gold-standard labels in real-world settings.
Journal description:
Epilepsia is the leading, authoritative source for innovative clinical and basic science research for all aspects of epilepsy and seizures. In addition, Epilepsia publishes critical reviews, opinion pieces, and guidelines that foster understanding and aim to improve the diagnosis and treatment of people with seizures and epilepsy.