A Hybrid Sampling Methodology Based Cost Effective Active Learning for Tabular Data

Sharath M. Shankaranarayana, Anubhab Samal, Chikka Veera Raghavendra, Vijay Sankar, Arindam Dey, Sangam Kumar Singh
Published in: 2024 16th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 147-152
Publication date: 2024-01-03
DOI: 10.1109/COMSNETS59351.2024.10427050 (https://doi.org/10.1109/COMSNETS59351.2024.10427050)
Citations: 0

Abstract

A Hybrid Sampling Methodology Based Cost Effective Active Learning for Tabular Data
Active learning (AL) is an attractive machine learning paradigm that can be very useful in real-world domains with high labelling costs. The main objective of AL is to efficiently acquire data for annotation from a typically large pool of unlabeled data, thereby reducing human experts' labelling effort. This lets the experts focus on the most informative data points, those that would most improve model performance on the task at hand. Thus, given an unlabeled dataset and a fixed labelling budget, AL aims to select a subset of examples to be labelled such that model performance improves. The central component of an AL workflow is the sampling process, which identifies the most valuable samples to label. The most common sampling strategy is uncertainty sampling, in which each AL training iteration takes the points that the model is most uncertain about. Another important strategy is diversity sampling, which selects a collection of samples that represents the entire data distribution well. In this paper, we propose a hybrid-sampling-based active learning method for classification tasks on tabular data, wherein we jointly sample uncertain as well as diverse points at every AL iteration. This hybrid sampling is achieved by jointly training a classification model along with an outlier detection model. In addition, we propose an improved cost-effective active learning (CEAL) method in which we automatically select high-confidence data points and assign pseudo-labels based not only on the model's confidence but also on the outlier detection module.
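
The abstract does not say how the uncertainty and outlier signals are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a classifier has already produced per-point class probabilities and that an outlier detector has produced per-point scores (higher meaning more outlying). It further assumes that "diverse" points can be approximated by the highest outlier scores, and that pseudo-labels are given only to points that are both high-confidence and low-outlier. The function names, the entropy-based uncertainty measure, and the thresholds `min_conf` and `max_outlier` are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hybrid_query(class_probs, outlier_scores, n_uncertain, n_diverse):
    """Return indices to send to the human annotator.

    Assumption: the most uncertain points are ranked by entropy, and
    "diverse" points are approximated by the highest outlier scores.
    """
    indices = range(len(class_probs))
    by_uncertainty = sorted(indices, key=lambda i: entropy(class_probs[i]), reverse=True)
    chosen = by_uncertainty[:n_uncertain]
    by_outlier = sorted(indices, key=lambda i: outlier_scores[i], reverse=True)
    for i in by_outlier:
        if len(chosen) >= n_uncertain + n_diverse:
            break
        if i not in chosen:  # avoid querying the same point twice
            chosen.append(i)
    return chosen


def pseudo_label(class_probs, outlier_scores, exclude=(), min_conf=0.95, max_outlier=0.2):
    """Assign the predicted class to confident, non-outlying points (CEAL-style).

    Assumption: a point qualifies only if its top class probability is at
    least min_conf AND its outlier score is at most max_outlier.
    """
    labels = {}
    for i, probs in enumerate(class_probs):
        if i in exclude:
            continue  # already sent to the annotator
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= min_conf and outlier_scores[i] <= max_outlier:
            labels[i] = best
    return labels


# Toy pool of five unlabeled points (made-up numbers, for illustration only).
class_probs = [[0.5, 0.5], [0.99, 0.01], [0.9, 0.1], [0.98, 0.02], [0.6, 0.4]]
outlier_scores = [0.1, 0.05, 0.9, 0.8, 0.3]

queried = hybrid_query(class_probs, outlier_scores, n_uncertain=1, n_diverse=1)
pseudo = pseudo_label(class_probs, outlier_scores, exclude=queried)
print(queried)  # [0, 2]
print(pseudo)   # {1: 0}
```

In this toy pool, point 3 is confidently classified but has a high outlier score, so it is not pseudo-labelled. That is the kind of case where consulting the outlier module in addition to model confidence could change the outcome.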