Halil Arici , Fatir A. Qureshi , Jay Mulmule , Juergen Hahn
{"title":"高维数据分类因子选择预筛选的多元假设检验","authors":"Halil Arici , Fatir A. Qureshi , Jay Mulmule , Juergen Hahn","doi":"10.1016/j.jprocont.2025.103469","DOIUrl":null,"url":null,"abstract":"<div><div>Modern instrumentation, such as mass spectrometry, enables the measurement of concentrations of hundreds or even thousands of compounds in individual samples. These measurements are often used in process data analytics to build classification models for determining whether a process is operating satisfactorily, if a product meets specifications, or to diagnose specific health conditions in patients. A common challenge associated with these applications is that the number of measured compounds far exceeds the number of available samples, increasing the risk of overfitting. Typically, it is advisable to have 10–20 samples per input factor of the classification model, thereby requiring the selection of only a handful of concentrations from potentially thousands. However, identifying the best combination of compounds from such a large pool by an exhaustive search is computationally infeasible.</div><div>A common approach to address this issue is pre-screening the compounds for statistically significant differences between groups, then limiting model inputs to only those identified as significant. The simplest form of pre-screening involves a student’s t-test, however, with a commonly-used <span><math><mi>p</mi></math></span>-value threshold of 0.05, one expects 5% of the compounds to be false positives, even when no true differences exist. Multiple hypothesis testing techniques, such as the Bonferroni correction and the Benjamini–Hochberg procedure, can reduce the number of compounds considered by accounting for these false positives. However, these methods often make assumptions about the data that are not valid in practice, leading to overly conservative results and potentially missing important compounds.</div><div>In this work, we present a screening procedure that computes the false discovery rate of p-values using a Leave-n-Out approach. By omitting <span><math><mi>n</mi></math></span> samples at a time and repeatedly calculating the p-values, we assess the robustness of statistical significance against small changes in the dataset. We compare this technique to the Bonferroni correction and Benjamini–Hochberg procedure using both synthetic examples and two experimental datasets from the life sciences. Our results demonstrate that while the proposed approach is more conservative than a simple t-test, it identifies compounds that lead to better-performing models compared to those selected using existing multiple hypothesis testing methods.</div></div>","PeriodicalId":50079,"journal":{"name":"Journal of Process Control","volume":"152 ","pages":"Article 103469"},"PeriodicalIF":3.3000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multiple hypothesis testing for pre-screening of factor selection for classification of high-dimensional data\",\"authors\":\"Halil Arici , Fatir A. Qureshi , Jay Mulmule , Juergen Hahn\",\"doi\":\"10.1016/j.jprocont.2025.103469\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Modern instrumentation, such as mass spectrometry, enables the measurement of concentrations of hundreds or even thousands of compounds in individual samples. 
These measurements are often used in process data analytics to build classification models for determining whether a process is operating satisfactorily, if a product meets specifications, or to diagnose specific health conditions in patients. A common challenge associated with these applications is that the number of measured compounds far exceeds the number of available samples, increasing the risk of overfitting. Typically, it is advisable to have 10–20 samples per input factor of the classification model, thereby requiring the selection of only a handful of concentrations from potentially thousands. However, identifying the best combination of compounds from such a large pool by an exhaustive search is computationally infeasible.</div><div>A common approach to address this issue is pre-screening the compounds for statistically significant differences between groups, then limiting model inputs to only those identified as significant. The simplest form of pre-screening involves a student’s t-test, however, with a commonly-used <span><math><mi>p</mi></math></span>-value threshold of 0.05, one expects 5% of the compounds to be false positives, even when no true differences exist. Multiple hypothesis testing techniques, such as the Bonferroni correction and the Benjamini–Hochberg procedure, can reduce the number of compounds considered by accounting for these false positives. However, these methods often make assumptions about the data that are not valid in practice, leading to overly conservative results and potentially missing important compounds.</div><div>In this work, we present a screening procedure that computes the false discovery rate of p-values using a Leave-n-Out approach. By omitting <span><math><mi>n</mi></math></span> samples at a time and repeatedly calculating the p-values, we assess the robustness of statistical significance against small changes in the dataset. We compare this technique to the Bonferroni correction and Benjamini–Hochberg procedure using both synthetic examples and two experimental datasets from the life sciences. Our results demonstrate that while the proposed approach is more conservative than a simple t-test, it identifies compounds that lead to better-performing models compared to those selected using existing multiple hypothesis testing methods.</div></div>\",\"PeriodicalId\":50079,\"journal\":{\"name\":\"Journal of Process Control\",\"volume\":\"152 \",\"pages\":\"Article 103469\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Process Control\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0959152425000976\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Process Control","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0959152425000976","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Multiple hypothesis testing for pre-screening of factor selection for classification of high-dimensional data
Modern instrumentation, such as mass spectrometry, enables the measurement of concentrations of hundreds or even thousands of compounds in individual samples. These measurements are often used in process data analytics to build classification models that determine whether a process is operating satisfactorily, whether a product meets specifications, or whether a patient has a specific health condition. A common challenge in these applications is that the number of measured compounds far exceeds the number of available samples, increasing the risk of overfitting. Typically, it is advisable to have 10–20 samples per input factor of the classification model, thereby requiring the selection of only a handful of concentrations from potentially thousands. However, identifying the best combination of compounds from such a large pool by exhaustive search is computationally infeasible.
A common approach to address this issue is to pre-screen the compounds for statistically significant differences between groups and then limit the model inputs to only those identified as significant. The simplest form of pre-screening uses a Student's t-test; however, with the commonly used p-value threshold of 0.05, one expects 5% of the compounds to be false positives even when no true differences exist. Multiple hypothesis testing techniques, such as the Bonferroni correction and the Benjamini–Hochberg procedure, can reduce the number of compounds considered by accounting for these false positives. However, these methods often make assumptions about the data that do not hold in practice, leading to overly conservative results and potentially missing important compounds.
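For illustration only, the sketch below shows how such pre-screening might look in Python with NumPy and SciPy: per-compound Welch t-test p-values, followed by Bonferroni and Benjamini–Hochberg cutoffs. This is not code from the paper; the function names and the choice of the Welch variant are assumptions made here for the example.

```python
import numpy as np
from scipy import stats

def prescreen_pvalues(group_a, group_b):
    """Per-compound Welch t-test p-values between two groups.

    group_a, group_b: arrays of shape (samples, compounds).
    """
    _, pvals = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)
    return np.asarray(pvals)

def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni correction: reject H0 where p < alpha / m (controls the FWER)."""
    return pvals < alpha / len(pvals)

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR): reject the k
    smallest p-values, where k is the largest index with p_(k) <= (k / m) * alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # last position meeting its threshold
        reject[order[:k + 1]] = True
    return reject
```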
In this work, we present a screening procedure that computes the false discovery rate of p-values using a Leave-n-Out approach. By omitting n samples at a time and repeatedly calculating the p-values, we assess the robustness of statistical significance against small changes in the dataset. We compare this technique to the Bonferroni correction and the Benjamini–Hochberg procedure using both synthetic examples and two experimental datasets from the life sciences. Our results demonstrate that while the proposed approach is more conservative than a simple t-test, it identifies compounds that lead to better-performing models compared to those selected using existing multiple hypothesis testing methods.
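The abstract describes the Leave-n-Out idea only at a high level, so the following is a plausible sketch rather than the authors' exact procedure: n samples are dropped at random in each repeat, the per-compound t-test p-values are recomputed, and a compound is retained if it remains significant in a chosen fraction of repeats. The function name leave_n_out_screen, the random omission scheme, and the stability threshold are all hypothetical choices made for this example.

```python
import numpy as np
from scipy import stats

def leave_n_out_screen(group_a, group_b, n_out=2, n_repeats=200,
                       alpha=0.05, stability=0.9, seed=None):
    """Hypothetical Leave-n-Out stability screen (not the paper's exact method):
    repeatedly drop n_out samples at random, recompute per-compound Welch t-test
    p-values, and keep compounds that remain significant in at least a
    `stability` fraction of the repeats."""
    rng = np.random.default_rng(seed)
    X = np.vstack([group_a, group_b])                  # (samples, compounds)
    labels = np.array([0] * len(group_a) + [1] * len(group_b))
    hits = np.zeros(X.shape[1])
    for _ in range(n_repeats):
        keep = rng.choice(len(X), size=len(X) - n_out, replace=False)
        a = X[keep][labels[keep] == 0]
        b = X[keep][labels[keep] == 1]
        _, p = stats.ttest_ind(a, b, axis=0, equal_var=False)
        hits += np.asarray(p) < alpha                  # count significant repeats
    return hits / n_repeats >= stability
```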
About the journal:
This international journal covers the application of control theory, operations research, computer science, and engineering principles to the solution of process control problems. In addition to traditional chemical processing and manufacturing applications, the journal's scope encompasses a wide range of applications including energy processes, nanotechnology, systems biology, biomedical engineering, pharmaceutical processing technology, energy storage and conversion, smart grids, and data analytics, among others.
Papers on the theory in these areas will also be accepted provided the theoretical contribution is aimed at the application and the development of process control techniques.
Topics covered include:
• Control applications
• Process monitoring
• Plant-wide control
• Process control systems
• Control techniques and algorithms
• Process modelling and simulation
• Design methods
Advanced design methods exclude well-established and widely studied traditional design techniques such as PID tuning and its many variants. Applications in fields such as automotive engine control, machinery, and robotics are not deemed suitable unless a clear motivation for their relevance to process control is provided.