Clustering for high dimensional categorical data based on text similarity

Proceedings of the 2nd International Conference on Communication and Information Processing. Pub Date: 2016-11-26. DOI: 10.1145/3018009.3018022

G. S. Narayana, D. Vasumathi


Citations: 1

Abstract

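Many cluster analysis techniques exist for grouping objects that share related characteristics, but applying them is difficult because much of the data in today's databases is categorical. Despite recent advances in algorithms for clustering categorical data, some cannot handle uncertainty in the clustering process and others have stability issues. This paper proposes a clustering method based on text similarity. First, the relevant features are selected from the input dataset. These features are then clustered with the Possibilistic Fuzzy C-Means (PFCM) clustering algorithm, where the quantity used for clustering is the similarity between categorical data objects, measured with SMTP (similarity measure for text processing). The clustering-based method has a high probability of producing a subset of useful and independent features. To improve efficiency, a minimum spanning tree is constructed by an optimization algorithm; here the adaptive artificial bee colony (AABC) algorithm is used to select the optimal features. Performance is evaluated by clustering accuracy, the Jaccard coefficient, and Dice's coefficient. The proposed method will be implemented in MATLAB using data from a machine learning repository.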
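The abstract names the Jaccard coefficient and Dice's coefficient as evaluation measures but does not say how they are computed for a clustering. A minimal sketch, assuming the common pair-counting convention (two objects count as "agreeing" when both the reference labels and the predicted clusters place them in the same group), is shown below in Python rather than the authors' MATLAB. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count object pairs grouped together by both, only the reference, or only the prediction."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    both = only_true = only_pred = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            both += 1
        elif same_true:
            only_true += 1
        elif same_pred:
            only_pred += 1
    return both, only_true, only_pred


def jaccard_coefficient(true_labels, pred_labels):
    """Pair-counting Jaccard: a / (a + b + c)."""
    a, b, c = pair_counts(true_labels, pred_labels)
    total = a + b + c
    # Assumption: if neither labeling groups any pair, treat the two as identical.
    return a / total if total else 1.0


def dice_coefficient(true_labels, pred_labels):
    """Pair-counting Dice: 2a / (2a + b + c)."""
    a, b, c = pair_counts(true_labels, pred_labels)
    total = 2 * a + b + c
    return 2 * a / total if total else 1.0


# Toy example (made-up labels): one object is assigned to the wrong cluster.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
jaccard = jaccard_coefficient(reference, predicted)  # 4 / 9
dice = dice_coefficient(reference, predicted)        # 8 / 13
```

Both scores lie in [0, 1] and reach 1 only when the two groupings agree on every pair; Dice weights the agreeing pairs twice, so it is never lower than Jaccard for the same inputs.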


