Interpretation of high dimensional definitive screening designs assisted by bootstrapped partial least squares regression

Impact factor 3.7 · Zone 2 (Chemistry) · JCR Q2 (Automation & Control Systems)
Knut Dyrstad, Frank Westad
Chemometrics and Intelligent Laboratory Systems, volume 253, Article 105218. Published 2024-08-24. DOI: 10.1016/j.chemolab.2024.105218. https://www.sciencedirect.com/science/article/pii/S0169743924001588

Abstract

Definitive screening design (DSD) has become a widely used type of design of experiments (DOE) for chemical, pharmaceutical and biopharmaceutical process and product development because it supports optimization: main, interaction, and squared variable effects can be estimated with a minimum number of experiments. These high-dimensional DOEs, with more variables than samples and with partly correlated variables, frequently make statistical interpretation challenging. The purpose of the study was to test bootstrap partial least squares regression (PLSR) combined with a heredity procedure to select the variable subset that is finally evaluated by multiple linear regression (MLR). The heredity selection was applied to bootstrap T values, computed as the original PLSR coefficients (B) divided by the bootstrap-estimated standard deviation. The investigated fractional weighted and non-parametric bootstrap PLSR variants resulted in the same variable selection outcome and final models in this study.
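The bootstrap T value described above (original coefficient divided by its bootstrap standard deviation) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a one-component PLS1 model, the non-parametric bootstrap only (rows resampled with replacement, not the fractional weighted variant), a made-up toy design rather than a real DSD, and hypothetical function names such as `pls1_one_component` and `bootstrap_t_values`.

```python
import random
import statistics


def center(X, y):
    """Mean-center the columns of X and the response y."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    return Xc, yc


def pls1_one_component(X, y):
    """Coefficients B of a one-component PLS1 model (assumed model size)."""
    Xc, yc = center(X, y)
    p = len(Xc[0])
    # Weight vector w is proportional to X'y, scaled to unit length.
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:  # degenerate resample: no covariance with y
        return [0.0] * p
    w = [v / norm for v in w]
    # Scores t = Xw and y-loading q = t'y / t't; with one component B = w * q.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return [wj * q for wj in w]


def bootstrap_t_values(X, y, n_boot=200, seed=0):
    """Original coefficients divided by their bootstrap standard deviation."""
    rng = random.Random(seed)
    n = len(X)
    b_orig = pls1_one_component(X, y)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows with replacement
        draws.append(pls1_one_component([X[i] for i in idx],
                                        [y[i] for i in idx]))
    sds = [statistics.stdev(col) for col in zip(*draws)]
    return [b / s if s > 0 else 0.0 for b, s in zip(b_orig, sds)]


# Toy data (illustrative only): 2^3 factorial plus a center point,
# with a response driven by the first variable and small fixed noise.
X = [[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
     [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1], [0, 0, 0]]
noise = [0.2, -0.1, 0.1, -0.2, -0.1, 0.2, -0.2, 0.1, 0.0]
y = [5 * row[0] + e for row, e in zip(X, noise)]

t_vals = bootstrap_t_values(X, y)
print([round(t, 2) for t in t_vals])
```

In this toy case the first variable should receive by far the largest absolute T value, because it is the only one that drives the response. A real analysis would add interaction and squared columns to `X` and would choose the number of PLS components deliberately.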

A simulation study with 7 main variables and 12 real-data DSDs from the literature with 4, 5, 7 and 8 main variables showed improved model performance for small and particularly for large DSDs for the bootstrap PLSR MLR methods compared with two common DSD reference methods: DSD fit definitive screening and AICc forward stepwise regression (AICc FSR). Variable selection accuracy and predictive ability were significantly improved by the investigated method in 6 out of 13 DSDs compared with the best model from either of the two reference methods. The remaining 7 DSDs gave the same model as the best reference model. Strong heredity was found to provide the best models for all real data in this study. Applying the heredity procedure to the percent non-zero SVEM FSR variable effects, followed by MLR, showed promising results. AICc Lasso regression was among the other methods that were partially tested; it set almost all variable effects to zero when tested on three large minimum DSDs. While the DSD fit definitive screening method may often be the first choice for DSD, heredity bootstrap PLSR MLR and heredity SVEM FSR MLR may be alternative methods that improve variable selection and model precision.
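The difference between strong and weak heredity can be sketched as a simple filter over T values. This is an illustration under stated assumptions, not the authors' procedure: the |T| cutoff of 2.0, the term naming scheme (`A*B` for interactions, `A^2` for squared terms), the example T values, and the function names `parents` and `heredity_select` are all hypothetical. Strong heredity keeps a second-order term only if all of its parent main effects are selected; weak heredity requires at least one.

```python
def parents(term):
    """Return the main-effect names that a model term is built from."""
    if "*" in term:
        return term.split("*")      # interaction, e.g. "A*B" -> ["A", "B"]
    if term.endswith("^2"):
        return [term[:-2]]          # squared term, e.g. "A^2" -> ["A"]
    return []                       # main effect has no parents


def heredity_select(t_values, threshold=2.0, strong=True):
    """Keep terms with |T| >= threshold that also satisfy heredity."""
    active = {name for name, t in t_values.items() if abs(t) >= threshold}
    mains = {name for name in active if not parents(name)}
    selected = []
    for name in t_values:           # preserve the input order of terms
        if name not in active:
            continue
        par = parents(name)
        if not par:
            selected.append(name)
            continue
        hits = [p in mains for p in par]
        if all(hits) if strong else any(hits):
            selected.append(name)
    return selected


# Made-up T values for illustration only.
t_values = {"A": 6.1, "B": -3.4, "C": 0.7, "A*B": 2.9,
            "A*C": 2.5, "C^2": 3.0, "A^2": -2.2, "B*C": 0.3}

strong_terms = heredity_select(t_values, strong=True)
weak_terms = heredity_select(t_values, strong=False)
print(strong_terms)
print(weak_terms)
```

Here `A*C` passes only under weak heredity, because `C` falls below the cutoff, and `C^2` is dropped in both cases despite its large T value. The selected subset would then be refitted by MLR, as described in the abstract.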
