数据驱动的降维，以提高建模性能

Proceedings of the 35th International Conference on Scientific and Statistical Database Management Pub Date : 2023-07-10 DOI:10.1145/3603719.3603744

Joshua Chung, Marcos M. López de Prado, Horst Simon, Kesheng Wu

{"title":"数据驱动的降维，以提高建模性能","authors":"Joshua Chung, Marcos M. López de Prado, Horst Simon, Kesheng Wu","doi":"10.1145/3603719.3603744","DOIUrl":null,"url":null,"abstract":"In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer the features for tasks such as machine learning, instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class named feature clustering. These new methods could utilize different forms of clustering to help evaluate the relative importance of features and take on properties different from the well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses the parallel computing power to examine a large number of parameter combinations and enables hyperparameter tuning and model tuning purely based on observed performance. This optimization framework provides mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known based on domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms including the well-known KPCA, LLE, and UMAP on two different tests.","PeriodicalId":314512,"journal":{"name":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Data Driven Dimensionality Reduction to Improve Modeling Performance✱\",\"authors\":\"Joshua Chung, Marcos M. López de Prado, Horst Simon, Kesheng Wu\",\"doi\":\"10.1145/3603719.3603744\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer the features for tasks such as machine learning, instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class named feature clustering. These new methods could utilize different forms of clustering to help evaluate the relative importance of features and take on properties different from the well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses the parallel computing power to examine a large number of parameter combinations and enables hyperparameter tuning and model tuning purely based on observed performance. This optimization framework provides mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known based on domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms including the well-known KPCA, LLE, and UMAP on two different tests.\",\"PeriodicalId\":314512,\"journal\":{\"name\":\"Proceedings of the 35th International Conference on Scientific and Statistical Database Management\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 35th International Conference on Scientific and Statistical Database Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3603719.3603744\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 35th International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3603719.3603744","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在许多应用程序中，数据可能是匿名的、模糊的或高度嘈杂的。在这种情况下，很难使用领域知识或低维可视化来设计诸如机器学习等任务的特征，相反，我们探索降维(DR)作为设计这些低维表示的数据驱动方法。通过对现有特征选择和特征提取技术的仔细研究，我们提出了一个新的类，命名为特征聚类。这些新方法可以利用不同形式的聚类来帮助评估特征的相对重要性，并采取不同于众所周知的DR算法的属性。为了评估这些算法，我们开发了一个并行计算框架，在应用程序数据集的样本上优化它们的超参数。该框架利用并行计算能力来检查大量的参数组合，并支持纯粹基于观察到的性能进行超参数调优和模型调优。该优化框架为用户提供了控制计算成本的机制，并能够在几秒钟内检查多个参数选择。在一组基于领域知识已知关键特征的建筑能源数据上，优化的DR算法确实确定了建筑用电量的预期主要驱动因素:室外温度和太阳辐射。这表明自动优化过程能够找到已知的特征。在建模精度方面，基于距离相关的特征聚类方法在两个不同的测试中优于其他DR算法，包括众所周知的KPCA、LLE和UMAP。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Data Driven Dimensionality Reduction to Improve Modeling Performance✱

In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer the features for tasks such as machine learning, instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class named feature clustering. These new methods could utilize different forms of clustering to help evaluate the relative importance of features and take on properties different from the well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses the parallel computing power to examine a large number of parameter combinations and enables hyperparameter tuning and model tuning purely based on observed performance. This optimization framework provides mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known based on domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms including the well-known KPCA, LLE, and UMAP on two different tests.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 35th International Conference on Scientific and Statistical Database Management

自引率

0.00%

发文量