Data Understanding using Semi-Supervised Clustering

Vasudha Bhatnagar, Rashmi Dobariyal, P. Jain, A. Mahabal
2012 Conference on Intelligent Data Understanding, published 2012-10-01
DOI: 10.1109/CIDU.2012.6382192 (https://doi.org/10.1109/CIDU.2012.6382192)
Citations: 4

Abstract

In the era of e-science, most scientific endeavors depend on intensive data analysis to understand the underlying physical phenomena. Predictive modeling is one of the popular machine learning tasks undertaken in such endeavors, and the labeled data used to train a predictive model reflects the current understanding of the domain. In this paper we introduce data understanding as a computational problem and propose a solution for enhancing domain understanding based on semi-supervised clustering. The proposed DU-SSC (Data Understanding using Semi-Supervised Clustering) algorithm is incremental, parameterless, and performs a single scan of the data. The given labeled (training) data is discretized at a user-specified resolution, and finer (micro) data distributions are identified within classes, along with outliers. The discovery process groups similar instances in data space while taking into account the degree of influence each attribute exercises on the class label; the Maximal Information Coefficient (MIC) measure is used during similarity computations for this purpose. The study is supported by experiments, and a detailed account of the understanding gained is presented for two selected UCI data sets. General observations on nine other UCI data sets are also presented, along with experiments that demonstrate the use of the discovered knowledge for improved classification.
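The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes, not the DU-SSC algorithm itself. It assumes numeric attributes with known ranges, bins each instance into a grid cell at a given resolution in one pass, and then groups occupied cells of the same class whose attribute-weighted distance is at most 1. The function names, the `min_count` outlier threshold, and the hand-supplied `weights` are all hypothetical; in the paper the attribute influence comes from the Maximal Information Coefficient, which is not computed here.

```python
from collections import Counter

def to_cell(x, lo, hi, resolution):
    """Map a numeric instance to a tuple of bin indices, one per attribute."""
    cell = []
    for v, a, b in zip(x, lo, hi):
        k = int((v - a) / (b - a) * resolution) if b > a else 0
        cell.append(min(max(k, 0), resolution - 1))  # clamp v == hi into the last bin
    return tuple(cell)

def scan(records, lo, hi, resolution):
    """Single pass over the labeled data: count instances per (label, cell)."""
    counts = Counter()
    for x, label in records:
        counts[(label, to_cell(x, lo, hi, resolution))] += 1
    return counts

def weighted_distance(c1, c2, weights):
    """Weighted Manhattan distance between two cells (weights are assumed inputs)."""
    return sum(w * abs(i - j) for i, j, w in zip(c1, c2, weights))

def micro_clusters(counts, weights, min_count=2):
    """Group same-class cells within weighted distance 1; sparse cells are outliers.

    min_count is a hypothetical threshold introduced for this illustration only.
    """
    outliers = sorted(k for k, n in counts.items() if n < min_count)
    unvisited = {k for k, n in counts.items() if n >= min_count}
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        group, stack = [seed], [seed]
        while stack:
            label, cell = stack.pop()
            near = [k for k in unvisited
                    if k[0] == label
                    and weighted_distance(cell, k[1], weights) <= 1.0]
            for k in near:
                unvisited.remove(k)
                group.append(k)
                stack.append(k)
        clusters.append(sorted(group))
    return clusters, outliers

# Toy two-attribute data set with two classes (made up for illustration).
records = [
    ((1.0, 1.0), "a"), ((1.5, 1.2), "a"),
    ((3.0, 1.0), "a"), ((3.5, 1.5), "a"),
    ((9.0, 9.0), "a"), ((9.5, 9.1), "a"),
    ((5.0, 9.0), "a"),
    ((1.0, 1.9), "b"), ((0.5, 0.5), "b"),
]
counts = scan(records, lo=(0, 0), hi=(10, 10), resolution=5)
clusters, outliers = micro_clusters(counts, weights=(1.0, 1.0))
```

In this sketch, lowering the weight of an attribute makes cells that differ along it merge more readily, which is one simple way a weaker influence on the class label could loosen the similarity test. Because `scan` only updates counters, new records can be added incrementally before regrouping.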