Active learning sample selection - based on multicriteria

IF 1.6 | Zone 4 (Chemistry) | JCR Q3 | CHEMISTRY, APPLIED
Zhonghai He, Kun Shen, Xiaofang Zhang
Journal of Near Infrared Spectroscopy. Published 2023-11-08. DOI: 10.1177/09670335231211618 (https://doi.org/10.1177/09670335231211618)
Citations: 0

Abstract

In multivariate calibration problems, model performance depends strongly on the calibration samples used to build the model. In recent years, active learning has become one of the most effective approaches to sample selection. However, most active learning methods select instances using only prediction uncertainty or distance in sample space, and such single-criterion methods tend to select undesirable samples. In addition, sample density characterizes the spatial information carried by a sample, yet few studies in quantitative analysis use sample density on its own to select calibration samples. To address these issues, this paper proposes DIDAL, an active learning sample selection method based on the k-means clustering algorithm that combines three criteria: diversity, informativeness and sample density. The most representative sample is iteratively selected and added to the calibration set, which is then used to build a model and estimate the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed with DIDAL, and the results were compared with those of existing sample selection methods. The prediction results show that DIDAL significantly outperforms several existing algorithms and comes close to the performance of full-sample modeling. With DIDAL, a model with high prediction accuracy can be built from only a few selected samples.
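The abstract names the three criteria but does not give DIDAL's scoring formulas, so the following is only an illustrative sketch of how a multi-criteria selection loop of this kind might look, not the authors' implementation. Every function name, the k-nearest-neighbour density estimate, the bootstrap least-squares ensemble used as an uncertainty proxy, the equal weights and the synthetic data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def minmax(v):
    """Rescale a score vector to [0, 1] so the criteria are comparable."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def density_score(X, n_neighbors=5):
    """Assumed density proxy: inverse mean distance to the nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the self-distance
    return 1.0 / (knn.mean(axis=1) + 1e-12)

def diversity_score(X, selected):
    """Distance from each sample to its nearest already-selected sample."""
    d = np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2)
    return d.min(axis=1)

def informativeness_score(X, y, selected, n_models=10, seed=0):
    """Assumed uncertainty proxy: prediction spread of bootstrap least-squares fits."""
    rng = np.random.default_rng(seed)
    sel = np.asarray(selected)
    A = np.hstack([X, np.ones((len(X), 1))])  # add an intercept column
    preds = []
    for _ in range(n_models):
        idx = rng.choice(sel, size=len(sel), replace=True)
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        preds.append(A @ coef)
    return np.std(preds, axis=0)

def select_samples(X, y, n_select, n_clusters=3, weights=(1.0, 1.0, 1.0)):
    """Greedy selection; y[i] is only read after sample i has been selected."""
    labels = kmeans(X, n_clusters)
    dens = minmax(density_score(X))
    # Seed the calibration set with the densest sample of each cluster.
    selected = []
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        selected.append(int(members[dens[members].argmax()]))
    while len(selected) < n_select:
        score = (weights[0] * minmax(diversity_score(X, selected))
                 + weights[1] * minmax(informativeness_score(X, y, selected))
                 + weights[2] * dens)
        score[selected] = -np.inf  # never pick a sample twice
        selected.append(int(score.argmax()))
    return selected[:n_select]

# Synthetic stand-in for spectra (X) and reference concentrations (y).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=60)
chosen = select_samples(X, y, n_select=10)
print(chosen)
```

In this sketch the reference value of a sample is used only once that sample is in the calibration set, which mirrors the active learning setting where measuring a concentration is the expensive step. The weighted sum is one simple way to combine the criteria; the paper may combine them differently.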
Source journal: CiteScore 3.30; self-citation rate 5.60%; articles published: 35; review time: 6 months
About the journal: JNIRS (Journal of Near Infrared Spectroscopy) is a peer-reviewed journal that publishes original research papers, short communications, review articles and letters concerned with near infrared spectroscopy and technology, its applications, new instrumentation, and the use of chemometric and data handling techniques within NIR.