PALLADIO: A Parallel Framework for Robust Variable Selection in High-Dimensional Data

2016 6th Workshop on Python for High-Performance and Scientific Computing (PyHPC). Pub Date: 2016-11-13. DOI: 10.1109/PYHPC.2016.13

Matteo Barbieri, Samuele Fiorini, Federico Tomasi, A. Barla

Citations: 7

Abstract

PALLADIO: A Parallel Framework for Robust Variable Selection in High-Dimensional Data

The main goal of supervised data analytics is to model a target phenomenon from a limited number of samples, each represented by an arbitrarily large number of variables. When the number of variables is much larger than the number of available samples, variable selection is a key step because it identifies a possibly reduced subset of relevant variables that describe the observed phenomenon. Obtaining interpretable and reliable results in this highly indeterminate scenario is often a non-trivial task. In this work we present PALLADIO, a framework designed for HPC cluster architectures that provides robust variable selection in high-dimensional problems. PALLADIO is developed in Python and integrates CUDA kernels to reduce the computation time of several independent element-wise operations. The scalability of the proposed framework is assessed on synthetic data of different sizes that represent realistic scenarios.
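
The abstract does not spell out how the selection is made robust, so the following is only an illustrative sketch of one common idea: repeat a selection step on many random subsamples and keep the variables that are chosen consistently. The selector used here (ranking variables by absolute correlation with the target) is a stand-in chosen for brevity, not PALLADIO's method, and every name in the block (`pearson`, `select_top_k`, `selection_frequencies`, `make_synthetic`) is hypothetical rather than part of the PALLADIO API.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (vx * vy) ** 0.5


def select_top_k(X, y, k):
    """Stand-in selector (assumption): the k variables most correlated with y."""
    n_vars = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)[:k]


def selection_frequencies(X, y, k, n_runs=50, subsample=0.75, seed=0):
    """Fraction of random subsamples in which each variable is selected."""
    rng = random.Random(seed)
    n, n_vars = len(X), len(X[0])
    m = max(2, int(subsample * n))
    counts = [0] * n_vars
    # Each run depends only on its own subsample, so runs are independent
    # and could be distributed across workers.
    for _ in range(n_runs):
        idx = rng.sample(range(n), m)
        chosen = select_top_k([X[i] for i in idx], [y[i] for i in idx], k)
        for j in chosen:
            counts[j] += 1
    return [c / n_runs for c in counts]


def make_synthetic(n=40, n_vars=100, n_relevant=3, noise=0.1, seed=1):
    """Synthetic data with more variables than samples.

    Only the first n_relevant variables influence y.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n)]
    y = [sum(row[:n_relevant]) + rng.gauss(0, noise) for row in X]
    return X, y


X, y = make_synthetic()
freqs = selection_frequencies(X, y, k=3)
# Keep variables selected in at least 80% of the runs (threshold is arbitrary).
stable = [j for j, f in enumerate(freqs) if f >= 0.8]
```

Because each run selects exactly `k` variables, the frequencies always sum to `k`; which variables end up in `stable` depends on the data, the subsample size, and the threshold, all of which are arbitrary choices in this sketch.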

Source journal

2016 6th Workshop on Python for High-Performance and Scientific Computing (PyHPC)

Self-citation rate

0.00%

Publication count