Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets.

Impact factor 0.9; CAS journal ranking: Zone 4, Mathematics; JCR Q3, Mathematics

Statistical Applications in Genetics and Molecular Biology Pub Date : 2023-07-25 eCollection Date: 2023-01-01 DOI:10.1515/sagmb-2022-0031

Bo Zhang, Jianghua He, Jinxiang Hu, Prabhakar Chalise, Devin C Koestler



Abstract

Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations of the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (namely Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC, and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
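
To make the feature-selection idea concrete, the following is a minimal sketch, not the authors' implementation and not the CSMR algorithm. It assumes the textbook special case of an orthonormal design (X^T X = I), where both Lasso and Adaptive-Lasso reduce to soft-thresholding of the least-squares coefficients, and it scores the selected features by TPR and TNR against a known true support. All function names, the coefficient values, and the penalty level `lam` are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def adaptive_lasso_orthonormal(ols_coefs, lam, gamma=1.0):
    """Adaptive-Lasso estimate in the special case X^T X = I (assumption).

    Minimises 0.5 * ||y - X b||^2 + lam * sum_j w_j * |b_j| with weights
    w_j = 1 / |ols_j|**gamma, which reduces to coordinate-wise
    soft-thresholding of the OLS coefficients. gamma = 0 gives w_j = 1,
    i.e. the ordinary Lasso.
    """
    estimates = []
    for b in ols_coefs:
        if b == 0.0:
            # Infinite weight when gamma > 0: the coefficient stays at zero.
            estimates.append(0.0)
            continue
        weight = 1.0 / abs(b) ** gamma
        estimates.append(soft_threshold(b, lam * weight))
    return estimates


def selection_rates(estimates, true_coefs):
    """Return (TPR, TNR) of the selected support against the true support."""
    tp = fn = tn = fp = 0
    for est, truth in zip(estimates, true_coefs):
        selected = est != 0.0
        relevant = truth != 0.0
        if relevant and selected:
            tp += 1
        elif relevant:
            fn += 1
        elif selected:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Hypothetical toy example: two relevant features, three noise features.
true_coefs = [2.0, -1.5, 0.0, 0.0, 0.0]
ols_coefs = [2.1, -1.4, 0.3, -0.2, 0.4]  # made-up noisy OLS estimates
lam = 0.25

lasso_fit = adaptive_lasso_orthonormal(ols_coefs, lam, gamma=0.0)
adaptive_fit = adaptive_lasso_orthonormal(ols_coefs, lam, gamma=1.0)

print("Lasso          (TPR, TNR):", selection_rates(lasso_fit, true_coefs))
print("Adaptive-Lasso (TPR, TNR):", selection_rates(adaptive_fit, true_coefs))
```

In this toy setting the adaptive weights penalize small initial coefficients more heavily than large ones, so the noise features are removed while the strong signals are shrunk less. Whether the same behaviour holds inside CSMR on correlated, high-dimensional data is exactly what the paper's simulations evaluate; this sketch does not reproduce them.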


Source journal: Statistical Applications in Genetics and Molecular Biology (category: Biology, Biochemistry and Molecular Biology)

CiteScore: 1.20

Self-citation rate: 11.10%

Review time: 6-12 weeks

Journal description: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should also contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.