Data Understanding using Semi-Supervised Clustering

2012 Conference on Intelligent Data Understanding. Pub Date: 2012-10-01. DOI: 10.1109/CIDU.2012.6382192

Vasudha Bhatnagar, Rashmi Dobariyal, P. Jain, A. Mahabal


Citations: 4

Abstract

In the era of e-science, most scientific endeavors depend on intensive data analysis to understand the underlying physical phenomena. Predictive modeling is one of the popular machine learning tasks undertaken in such endeavors. The labeled data used to train a predictive model reflects the current understanding of the domain. In this paper we introduce data understanding as a computational problem and propose a solution for enhancing domain understanding based on semi-supervised clustering. The proposed DU-SSC (Data Understanding using Semi-Supervised Clustering) algorithm is incremental, parameterless, and performs a single scan of the data. The given labeled (training) data is discretized at a user-specified resolution, and finer (micro) data distributions are identified within classes, along with outliers. The discovery process is based on grouping similar instances in the data space, while taking into account the degree of influence each attribute exercises on the class label. The Maximal Information Coefficient measure is used during similarity computations for this purpose. The study is supported by experiments, and a detailed account of the understanding gained is presented for two selected UCI data sets. General observations on nine other UCI data sets are presented, along with experiments that demonstrate the use of the discovered knowledge for improved classification.
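The abstract does not give the algorithm's details, so the following is only a minimal illustrative sketch of the general idea it describes: discretizing labeled data at a user-specified resolution in a single pass, and comparing grid cells with per-attribute weights. It is not the authors' DU-SSC implementation. The function names (`discretize`, `scan_cells`, `weighted_cell_distance`), the equal-width binning, the weighted L1 distance, and the `weights` argument (standing in for an attribute-relevance score such as the Maximal Information Coefficient, which is not computed here) are all assumptions made for illustration.

```python
from collections import defaultdict


def discretize(value, lo, hi, resolution):
    """Map a value in [lo, hi] to an equal-width bin index in 0..resolution-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * resolution)
    # Clamp so that value == hi falls in the last bin.
    return min(max(idx, 0), resolution - 1)


def scan_cells(rows, labels, bounds, resolution):
    """Single pass over the data: count instances per (class label, grid cell)."""
    cells = defaultdict(int)
    for row, label in zip(rows, labels):
        cell = tuple(
            discretize(v, lo, hi, resolution) for v, (lo, hi) in zip(row, bounds)
        )
        cells[(label, cell)] += 1
    return dict(cells)


def weighted_cell_distance(cell_a, cell_b, weights):
    """Weighted L1 distance between two grid cells.

    `weights` is a hypothetical per-attribute relevance vector; in the paper's
    setting a measure such as MIC would play this role.
    """
    return sum(w * abs(a - b) for a, b, w in zip(cell_a, cell_b, weights))


# Toy data: two attributes scaled to [0, 1], two classes.
rows = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8)]
labels = ["a", "a", "b"]
cells = scan_cells(rows, labels, bounds=[(0.0, 1.0), (0.0, 1.0)], resolution=4)
print(cells)
print(weighted_cell_distance((0, 0), (3, 3), weights=[1.0, 0.5]))
```

Under these assumptions, dense cells within a class would be candidates for micro-distributions and sparsely populated cells would be candidates for outliers; how DU-SSC actually makes those decisions is described in the full paper, not in this abstract.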


