Active learning sample selection - based on multicriteria

IF 1.9 | Zone 4 (Chemistry) | JCR Q3 | CHEMISTRY, APPLIED

Journal of Near Infrared Spectroscopy Pub Date : 2023-11-08 DOI:10.1177/09670335231211618

Zhonghai He, Kun Shen, Xiaofang Zhang


Citations: 0

Abstract

In multivariate calibration problems, model performance is affected significantly by the calibration samples used during model building. In recent years, active learning methods have become one of the best methods for sample selection. However, most active learning methods select instances using only prediction uncertainty or distance in sample space, and these single-criterion methods tend to select undesired samples. In addition, sample density characterizes the spatial information carried by a sample, but few studies in quantitative analysis use sample density alone to select calibration samples. To address these issues, this paper proposes an active learning sample selection method (DIDAL), based on the k-means clustering algorithm, which combines the three criteria of diversity, informativeness and sample density. The most representative sample is iteratively selected for addition to the calibration set, which is used for modeling and for estimating the chemical concentration of analytes. Soybean meal and soy sauce samples were analyzed by DIDAL and compared with existing sample selection methods. The prediction results show that the DIDAL algorithm significantly outperforms several existing algorithms and is close to the performance of full-sample modeling. A model with high prediction accuracy can be constructed by selecting only a few samples using the DIDAL method.
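The abstract does not say how DIDAL computes or weights its three criteria, so the following is not the authors' algorithm. It is a minimal Python sketch of the general idea of multi-criteria selection, under stated assumptions: diversity is taken as the distance to the nearest sample already in the calibration set, density as the inverse mean distance to the `k` nearest neighbours, informativeness as an externally supplied uncertainty score, and the three are combined by an equal-weight sum after min-max scaling. The k-means step mentioned in the abstract is omitted, and the names `select_next` and `min_max` are hypothetical.

```python
import numpy as np


def min_max(v):
    """Scale a vector to [0, 1]; a constant vector maps to all zeros."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def select_next(X, labeled_idx, uncertainty, k=5):
    """Return the index of the unlabeled sample with the highest combined score.

    X           : (n_samples, n_features) array, e.g. spectra or their scores
    labeled_idx : list of indices already in the calibration set
    uncertainty : (n_samples,) informativeness proxy (assumed to be given)
    k           : neighbours used for the density estimate (assumed value)
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    # Diversity: distance to the nearest sample already selected.
    diversity = d[:, labeled_idx].min(axis=1)

    # Density: inverse mean distance to the k nearest neighbours
    # (column 0 of the sorted distances is the sample itself).
    knn = np.sort(d, axis=1)[:, 1:k + 1]
    density = 1.0 / (knn.mean(axis=1) + 1e-12)

    # Equal-weight sum of the scaled criteria (an assumption of this sketch).
    score = min_max(diversity) + min_max(uncertainty) + min_max(density)
    score[labeled_idx] = -np.inf  # never re-select a calibration sample
    return int(np.argmax(score))


# Toy data: a labeled cluster, a distant dense cluster, and one outlier.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
                   [-20.0, 5.0]])
chosen = select_next(X_demo, labeled_idx=[0], uncertainty=np.ones(7), k=2)
print(chosen)
```

In this toy case the outlier at (-20, 5) is the farthest sample from the calibration set, so a distance-only rule would pick it. It has the lowest density, however, and the combined score instead picks a sample from the distant dense cluster, which illustrates the weakness of single-criterion selection that the abstract describes.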


Source journal

Journal of Near Infrared Spectroscopy (Physics: Spectroscopy)

CiteScore: 3.30

Self-citation rate: 5.60%

Review time: 6 months

About the journal: The Journal of Near Infrared Spectroscopy (JNIRS) is a peer-reviewed journal publishing original research papers, short communications, review articles and letters concerned with near infrared spectroscopy and technology, its application, new instrumentation and the use of chemometric and data handling techniques within NIR.