ASELMAR: Active and semi-supervised learning-based framework to reduce multi-labeling efforts for activity recognition

IF 4.3 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)

Computer Vision and Image Understanding · Pub Date: 2025-02-01 · DOI: 10.1016/j.cviu.2024.104269

Aydin Saribudak, Sifan Yuan, Chenyang Gao, Waverly V. Gestrich-Thompson, Zachary P. Milestone, Randall S. Burd, Ivan Marsic

Volume 251, Article 104269. Full text: https://www.sciencedirect.com/science/article/pii/S1077314224003503

Citations: 0

Abstract

Manual annotation of unlabeled data for model training is expensive and time-consuming, especially for visual datasets that require domain-specific experience for multi-labeling, such as video records generated in hospital settings. Frameworks are needed that reduce human labeling effort while improving training performance. Semi-supervised learning is widely used to generate predictions for unlabeled samples in partially labeled datasets. Active learning can be combined with semi-supervised learning to annotate unlabeled samples and reduce the sampling bias introduced by the label predictions. We developed the ASELMAR framework, based on active and semi-supervised learning techniques, to reduce the time and effort associated with multi-labeling of unlabeled samples for activity recognition. ASELMAR (i) categorizes the predictions for unlabeled data by prediction confidence, using fixed and adaptive threshold settings, (ii) applies a label verification procedure to samples with ambiguous predictions, and (iii) retrains the model iteratively using samples with either high-confidence predictions or manual annotations. We also designed a software tool to guide domain experts in verifying ambiguous predictions. We applied ASELMAR to recognize eight selected activities from our trauma resuscitation video dataset and evaluated its performance based on the label verification time and the mean average precision (AP) score. The label verification required by ASELMAR was 12.1% of the manual annotation effort for the unlabeled video records. Compared to the baseline model, the fixed threshold-based method improved the mean AP score by 5.7% in the first iteration and 8.3% in the second iteration. The p-values were below 0.05 for the target activities. Using an adaptive-threshold method, ASELMAR achieved a decrease in AP score deviation, implying an improvement in model robustness. For a speech-based case study, the word error rate decreased by 6.2%, and the average transcription factor increased 2.6 times, supporting the broad applicability of ASELMAR in reducing labeling efforts from domain experts.
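The confidence-based categorization in step (i) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the function name `triage`, the clip identifiers, and the activity names below are all hypothetical, and the abstract does not say how the fixed or adaptive thresholds are actually chosen. The sketch only shows the general idea of routing each per-activity prediction either to a pseudo-label (high confidence) or to an expert verification queue (ambiguous).

```python
from typing import Dict, List, Tuple

# sample id -> activity -> predicted probability (multi-label: one score per activity)
Scores = Dict[str, Dict[str, float]]


def triage(scores: Scores, low: float = 0.2, high: float = 0.8):
    """Split per-activity predictions into pseudo-labels and a verification queue.

    `low` and `high` are hypothetical fixed thresholds, not values from the paper.
    """
    pseudo_labels: Dict[Tuple[str, str], int] = {}
    to_verify: List[Tuple[str, str]] = []
    for sample_id, per_activity in scores.items():
        for activity, p in per_activity.items():
            if p >= high:
                pseudo_labels[(sample_id, activity)] = 1  # confident positive
            elif p <= low:
                pseudo_labels[(sample_id, activity)] = 0  # confident negative
            else:
                to_verify.append((sample_id, activity))   # ambiguous: ask an expert
    return pseudo_labels, to_verify


# Made-up scores for two clips and two illustrative activity names.
scores = {
    "clip_01": {"intubation": 0.93, "cpr": 0.05},
    "clip_02": {"intubation": 0.55, "cpr": 0.10},
}
pseudo_labels, to_verify = triage(scores)
print(pseudo_labels)
print(to_verify)
```

In an iterative setup of the kind the abstract describes, the pseudo-labels and the expert-verified labels would be merged into the training set before the next retraining round. An adaptive variant could pass different `low` and `high` values per activity or per iteration, though how ASELMAR adapts them is not stated here.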




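Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days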

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems