Multiple hypothesis testing for pre-screening of factor selection for classification of high-dimensional data

IF 3.9 | Zone 2, Computer Science | JCR Q2, AUTOMATION & CONTROL SYSTEMS

Journal of Process Control | Pub Date: 2025-06-25 | DOI: 10.1016/j.jprocont.2025.103469

Halil Arici, Fatir A. Qureshi, Jay Mulmule, Juergen Hahn

Journal of Process Control, Volume 152, Article 103469. Full text: https://www.sciencedirect.com/science/article/pii/S0959152425000976

Citations: 0

Abstract

Modern instrumentation, such as mass spectrometry, enables the measurement of concentrations of hundreds or even thousands of compounds in individual samples. These measurements are often used in process data analytics to build classification models that determine whether a process is operating satisfactorily or whether a product meets specifications, or that diagnose specific health conditions in patients. A common challenge in these applications is that the number of measured compounds far exceeds the number of available samples, which increases the risk of overfitting. Typically, it is advisable to have 10–20 samples per input factor of the classification model, so only a handful of concentrations can be selected from potentially thousands. However, identifying the best combination of compounds from such a large pool by exhaustive search is computationally infeasible.
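
To make the scale of the problem concrete, the following is a minimal sketch of the arithmetic behind the 10–20 samples-per-factor guideline and the size of an exhaustive search. The dataset sizes (100 samples, 1000 compounds) and the helper name `max_factors` are hypothetical values chosen for illustration; they do not come from the paper.

```python
from math import comb


def max_factors(n_samples: int, samples_per_factor: int) -> int:
    """Largest number of model inputs allowed by a samples-per-factor rule."""
    return n_samples // samples_per_factor


# Hypothetical dataset sizes, chosen only for illustration.
n_samples = 100
n_compounds = 1000

# The 10-20 samples-per-factor guideline gives a range of allowed inputs.
fewest = max_factors(n_samples, 20)  # strictest reading of the guideline
most = max_factors(n_samples, 10)    # most permissive reading

# Number of distinct compound subsets an exhaustive search would have to fit.
subsets_fewest = comb(n_compounds, fewest)
subsets_most = comb(n_compounds, most)

print(f"allowed inputs: {fewest} to {most}")
print(f"subsets of size {fewest}: {subsets_fewest:.3e}")
print(f"subsets of size {most}: {subsets_most:.3e}")
```

Even at the strict end of the guideline, the number of candidate subsets is in the trillions for these assumed sizes, which is why some form of pre-screening is applied before model building.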

A common approach to address this issue is pre-screening the compounds for statistically significant differences between groups, then limiting model inputs to only those identified as significant. The simplest form of pre-screening involves a Student's t-test; however, with a commonly used p-value threshold of 0.05, one expects 5% of the compounds to be false positives, even when no true differences exist. Multiple hypothesis testing techniques, such as the Bonferroni correction and the Benjamini–Hochberg procedure, can reduce the number of compounds considered by accounting for these false positives. However, these methods often make assumptions about the data that are not valid in practice, leading to overly conservative results and potentially missing important compounds.
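
The effect of the two standard corrections can be illustrated with a short sketch. It implements the textbook Bonferroni and Benjamini–Hochberg rules on a list of p-values; the function names, the example p-values, and the simulated null data are illustrative assumptions, not values from the paper. The null p-values are drawn uniformly on [0, 1], which is how p-values from a continuous test behave when no true difference exists.

```python
import random


def bonferroni(p_values, alpha=0.05):
    """Indices whose p-value passes the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Find the largest rank k with p_(k) <= k * alpha / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])


# Illustrative p-values (hypothetical, not from the paper).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
raw = [i for i, p in enumerate(example_p) if p < 0.05]
print("uncorrected:", raw)
print("Bonferroni:", bonferroni(example_p))
print("Benjamini-Hochberg:", benjamini_hochberg(example_p))

# Simulated screening of 1000 compounds with no true differences.
random.seed(0)
null_p = [random.random() for _ in range(1000)]
raw_hits = sum(p < 0.05 for p in null_p)
print("false positives at p < 0.05 among 1000 null compounds:", raw_hits)
```

On the simulated null data, the uncorrected threshold flags roughly 5% of the compounds, while both corrections are designed to suppress most or all of those spurious hits.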

In this work, we present a screening procedure that computes the false discovery rate of p-values using a Leave-n-Out approach. By omitting n samples at a time and repeatedly calculating the p-values, we assess the robustness of statistical significance against small changes in the dataset. We compare this technique to the Bonferroni correction and Benjamini–Hochberg procedure using both synthetic examples and two experimental datasets from the life sciences. Our results demonstrate that while the proposed approach is more conservative than a simple t-test, it identifies compounds that lead to better-performing models compared to those selected using existing multiple hypothesis testing methods.
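
The abstract does not specify how the Leave-n-Out p-values are turned into a false discovery rate, so the following is only a rough sketch of the resampling idea, not the authors' procedure. For one compound, it omits every possible set of n samples, recomputes a two-sample p-value, and reports the fraction of subsets in which the compound stays significant. To stay dependency-free it uses Welch's statistic with a normal approximation for the p-value, which is an assumption made here; an exact test such as `scipy.stats.ttest_ind` could be substituted. The function names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_p_value(a, b):
    """Two-sided p-value for Welch's t statistic (normal approximation)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def leave_n_out_fraction(a, b, n=2, alpha=0.05):
    """Fraction of leave-n-out subsets in which the p-value stays below alpha."""
    labelled = [(x, 0) for x in a] + [(x, 1) for x in b]
    hits = 0
    total = 0
    # Enumerate every way of omitting n samples from the pooled data.
    for omitted in combinations(range(len(labelled)), n):
        skip = set(omitted)
        kept = [s for i, s in enumerate(labelled) if i not in skip]
        group_a = [x for x, g in kept if g == 0]
        group_b = [x for x, g in kept if g == 1]
        total += 1
        hits += welch_p_value(group_a, group_b) < alpha
    return hits / total


# Hypothetical concentrations of one compound in two groups of 8 samples.
control = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1]
shifted = [x + 2.0 for x in control]  # clearly different group
print("stability, shifted group:", leave_n_out_fraction(control, shifted))
print("stability, identical group:", leave_n_out_fraction(control, list(control)))
```

A compound whose significance survives in nearly all subsets is robust to small changes in the dataset, whereas one that is significant only for particular subsets is a likelier false positive. Enumerating all subsets grows combinatorially with n, so larger datasets would need random sampling of subsets instead.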

Source journal

Journal of Process Control (Engineering & Technology, Engineering: Chemical)

CiteScore: 7.00

Self-citation rate: 11.90%

Articles published: 159

Review time: 74 days

Journal description: This international journal covers the application of control theory, operations research, computer science and engineering principles to the solution of process control problems. In addition to the traditional chemical processing and manufacturing applications, the scope of process control problems involves a wide range of applications that includes energy processes, nano-technology, systems biology, bio-medical engineering, pharmaceutical processing technology, energy storage and conversion, smart grid, and data analytics, among others. Papers on the theory in these areas will also be accepted provided the theoretical contribution is aimed at the application and the development of process control techniques.

Topics covered include:

• Control applications
• Process monitoring
• Plant-wide control
• Process control systems
• Control techniques and algorithms
• Process modelling and simulation
• Design methods

Advanced design methods exclude well-established and widely studied traditional design techniques such as PID tuning and its many variants. Applications in fields such as control of automotive engines, machinery and robotics are not deemed suitable unless a clear motivation for the relevance to process control is provided.