On the efficacy of conditioned and progressive Latin hypercube sampling in supervised machine learning

Ioannis Iordanis, Christos Koukouvinos, Iliana Silou

Applied Numerical Mathematics, Volume 208, Pages 256-270 (February 2025)
DOI: 10.1016/j.apnum.2023.12.016
Citations: 0
Abstract
In this paper, the Latin Hypercube Sampling (LHS) method is assessed for its effectiveness in supervised machine learning procedures. LHS saves processing time and, owing to the Latin hypercube design properties and their space-filling ability, is considered one of the most advanced sampling mechanisms. Although more data usually deliver better results, LHS techniques can produce outputs of the same quality with less data, reducing both storage cost and training time. Conditioned Latin Hypercube Sampling (cLHS) is one such technique and performs well in supervised machine learning tasks. Unfortunately, the minimum sufficient training dataset size cannot be known in advance. In that case, progressive sampling is recommended: it begins with a small sample and progressively increases its size until model accuracy no longer improves. Combining Latin hypercube sampling with the idea of sequentially incremented sampling, we test Progressive Latin Hypercube Sampling (PLHS) while monitoring the performance of the sampling-based training as the sample size grows. The PLHS and cLHS algorithms are applied to datasets with discrete variables, ensuring that each sample satisfies the Latin hypercube design properties and preserves the principal space-filling ability of LHS, as illustrated in the respective sample projection diagrams. The performance of these LHS methods in supervised machine learning is evaluated by the degree of training of the model, certified through the accuracy of the confusion matrices produced on test files. The results obtained with these Latin Hypercube Sampling techniques, compared against a benchmark sampling method, empirically show that the machine learning training process becomes less costly while remaining reliable.
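To make the progressive-sampling idea concrete, below is a minimal sketch (not the authors' PLHS algorithm) of LHS-based progressive training, using SciPy's `qmc.LatinHypercube` and a scikit-learn classifier. The synthetic labelling function, the sample-size schedule, and the `plateau_tol` stopping threshold are illustrative assumptions. A true PLHS scheme would additionally keep the growing design nested, so that every intermediate sample retains the Latin hypercube properties; this sketch simply redraws a fresh LHS at each size.

```python
# Sketch of progressive training on Latin hypercube samples.
# Assumptions (not from the paper): the synthetic task, the size
# schedule, and the plateau_tol stopping rule are all illustrative.
import numpy as np
from scipy.stats import qmc
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic classification task on [0, 1]^2: label 1 above the diagonal.
def label(points):
    return (points[:, 1] > points[:, 0]).astype(int)

X_test = rng.random((500, 2))
y_test = label(X_test)

prev_acc, plateau_tol = 0.0, 0.005   # stop when the accuracy gain falls below tol
for n in (16, 32, 64, 128, 256):     # progressively larger LHS designs
    sampler = qmc.LatinHypercube(d=2, seed=0)
    X_train = sampler.random(n=n)    # one stratum per row in each dimension
    y_train = label(X_train)
    model = LogisticRegression().fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"n={n:4d}  test accuracy={acc:.3f}")
    if acc - prev_acc < plateau_tol:  # accuracy no longer improves: stop
        break
    prev_acc = acc
```

The stopping rule mirrors the progressive-sampling logic described in the abstract: training halts at the smallest sample size after which accuracy stops improving, rather than at a size fixed in advance.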
About the journal
The purpose of the journal is to provide a forum for the publication of high-quality research and tutorial papers in computational mathematics. In addition to the traditional issues and problems in numerical analysis, the journal also publishes papers describing relevant applications in such fields as physics, fluid dynamics, engineering and other branches of applied science with a computational mathematics component. The journal strives to be flexible in the type of papers it publishes and their format. Equally desirable are:
(i) Full papers, which should be complete and relatively self-contained original contributions with an introduction that can be understood by the broad computational mathematics community. Both rigorous and heuristic styles are acceptable. Of particular interest are papers about new areas of research, in which arguments other than strictly mathematical ones may be important in establishing a basis for further developments.
(ii) Tutorial review papers, covering some of the important issues in Numerical Mathematics, Scientific Computing and their Applications. The journal will occasionally publish contributions which are larger than the usual format for regular papers.
(iii) Short notes, which present specific new results and techniques in a brief communication.