Data Driven Dimensionality Reduction to Improve Modeling Performance

Joshua Chung, Marcos M. López de Prado, Horst Simon, Kesheng Wu
Proceedings of the 35th International Conference on Scientific and Statistical Database Management
Published: 2023-07-10
DOI: 10.1145/3603719.3603744 (https://doi.org/10.1145/3603719.3603744)

Abstract

In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer features for tasks such as machine learning. Instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class of methods named feature clustering. These methods can use different forms of clustering to help evaluate the relative importance of features, and they take on properties different from those of the well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses parallel computing power to examine a large number of parameter combinations and enables hyperparameter tuning and model tuning based purely on observed performance. It also provides a mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known from domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows that the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms, including the well-known KPCA, LLE, and UMAP, on two different tests.
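To make the idea of distance correlation-based feature clustering more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the clustering procedure, so the single-linkage grouping, the 0.7 threshold, the rule for picking one representative per cluster, and all function and column names are assumptions chosen for illustration. The data are synthetic.

```python
import math
import random


def _centered_distances(x):
    """Double-centered matrix of pairwise distances |x_i - x_j|."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row_mean = [sum(row) / n for row in d]  # equals column means (d is symmetric)
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation (biased estimator) of two 1-D samples."""
    a, b = _centered_distances(x), _centered_distances(y)
    dvar_product = _mean_product(a, a) * _mean_product(b, b)
    if dvar_product <= 0.0:  # a constant column carries no dependence signal
        return 0.0
    dcov2 = max(_mean_product(a, b), 0.0)
    return math.sqrt(dcov2 / math.sqrt(dvar_product))


def cluster_features(columns, threshold=0.7):
    """Assumed rule: link features whose distance correlation is at least
    `threshold`, then take connected components (single linkage)."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if distance_correlation(columns[u], columns[v]) >= threshold:
                parent[find(u)] = find(v)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def select_representatives(columns, clusters, target):
    """Assumed rule: keep, per cluster, the feature most dependent on the target."""
    return [max(cluster, key=lambda f: distance_correlation(columns[f], target))
            for cluster in clusters]


# Synthetic demo data (hypothetical column names, not the paper's dataset).
random.seed(0)
n = 80
temp = [random.gauss(0, 1) for _ in range(n)]
columns = {
    "temp": temp,
    "temp_dup": [t + random.gauss(0, 0.05) for t in temp],  # near-duplicate
    "solar": [random.gauss(0, 1) for _ in range(n)],
    "noise": [random.gauss(0, 1) for _ in range(n)],
}
target = [2 * t + s + random.gauss(0, 0.1)
          for t, s in zip(columns["temp"], columns["solar"])]

clusters = cluster_features(columns)
representatives = select_representatives(columns, clusters, target)
print(clusters)
print(representatives)
```

Distance correlation is used here instead of Pearson correlation because it also detects nonlinear dependence between features. The pure-Python loops cost O(n²) per feature pair, so a practical version would vectorize the distance matrices and, in the spirit of the framework described above, evaluate different threshold values in parallel.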