POP: A Parallel Optimized Preparation of Data for Data Mining

2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). Pub Date: 2015-11-12. DOI: 10.5220/0005594700360045

Christian Ernst, Youssef Hmamouche, Alain Casali

Citations: 1

Abstract

POP: A Parallel Optimized Preparation of data for data mining

Because data preparation has a substantial impact on data mining results, we provide an original framework for automatically preparing the data of any given database. For each attribute of the database, our research focuses on two points: (i) specifying an optimized outlier detection method, and (ii) identifying the most appropriate discretization method. Concerning the former, we show that the detection of an outlier depends on whether the data distribution is normal. Concerning the latter, what matters is the shape followed by the density function of the attribute's distribution law. For this reason, we propose an automatic choice of the optimized discretization method based on a multi-criteria (Entropy, Variance, Stability) evaluation. Processing is performed in parallel using multicore capabilities. The experiments conducted validate our approach, showing that the same discretization method is not always the best.
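The per-attribute pipeline described in the abstract could be sketched roughly as follows. This is an illustration under stated assumptions, not the POP implementation: the abstract does not name the concrete rules, so the Jarque-Bera normality check, the 3-sigma and Tukey-fence outlier rules, the two candidate discretization methods, and the combined score are all stand-ins chosen here. The Stability criterion is omitted because its definition is not given, and all function names are hypothetical.

```python
import math
import statistics
from concurrent.futures import ThreadPoolExecutor


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return n / 6 * ((m3 / m2 ** 1.5) ** 2 + (m4 / m2 ** 2 - 3) ** 2 / 4)


def find_outliers(values, jb_critical=5.991):
    """Choose the outlier rule from the distribution shape (assumed rules)."""
    if jarque_bera(values) < jb_critical:  # 5.991: 95% quantile of chi-square, 2 df
        # Treated as normal: flag points beyond 3 standard deviations.
        mean, sd = statistics.fmean(values), statistics.pstdev(values)
        low, high = mean - 3 * sd, mean + 3 * sd
    else:
        # Treated as non-normal: Tukey's fences on the quartiles.
        q1, _, q3 = statistics.quantiles(values, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]


def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid a zero width for constant attributes
    bins = [[] for _ in range(k)]
    for v in values:
        bins[min(int((v - lo) / width), k - 1)].append(v)
    return bins


def equal_frequency_bins(values, k):
    ordered, n = sorted(values), len(values)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Candidate methods are an assumption; the paper may compare others.
METHODS = {"equal_width": equal_width_bins, "equal_frequency": equal_frequency_bins}


def entropy(bins):
    n = sum(len(b) for b in bins)
    return -sum(len(b) / n * math.log2(len(b) / n) for b in bins if b)


def within_variance(bins):
    n = sum(len(b) for b in bins)
    return sum(len(b) * statistics.pvariance(b) for b in bins if b) / n


def best_discretization(values, k=4):
    """Hypothetical score: normalised entropy minus normalised within-bin variance."""
    total_var = statistics.pvariance(values) or 1.0
    scores = {}
    for name, method in METHODS.items():
        bins = method(values, k)
        scores[name] = entropy(bins) / math.log2(k) - within_variance(bins) / total_var
    return max(scores, key=scores.get)


def prepare_attribute(values):
    outliers = find_outliers(values)
    dropped = set(outliers)
    kept = [v for v in values if v not in dropped]
    return outliers, best_discretization(kept)


def prepare(table):
    """Process every attribute (column) concurrently."""
    # A thread pool keeps the sketch portable; a ProcessPoolExecutor would be
    # the option that actually spreads CPU-bound work across cores in CPython.
    with ThreadPoolExecutor() as pool:
        return dict(zip(table, pool.map(prepare_attribute, table.values())))
```

Calling `prepare({"age": [...], "income": [...]})` returns, for each attribute name, the list of flagged outliers and the name of the discretization method that scored best on the remaining values, so different attributes can end up with different methods.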

Source journal

2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K)

Self-citation rate

0.00%

Number of publications