Document grouping with concept based discriminative analysis and feature partition
S. Kajapriya, K. N. Vimal Shankar
International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1-4, 2014-02-01
DOI: 10.1109/ICICES.2014.7033763 (https://doi.org/10.1109/ICICES.2014.7033763)
Citations: 0
Abstract
Clustering is one of the most important techniques in machine learning and data mining. Clustering techniques group similar documents, and a similarity measure is used to determine transaction associations. Hierarchical clustering methods produce tree-structured results, while partition-based clustering models produce results in grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods need the cluster count (K) before the document grouping process, and clustering accuracy decreases drastically when the cluster count is unsuitable. Document word features are automatically partitioned into two groups: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; non-discriminative words confuse the clustering process and lead to poor cluster solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it utilizes both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and it runs without requiring the number of clusters as input. The discriminative word identification process is enhanced with a labeled document analysis mechanism. Concept relationships are analyzed with ontology support, and semantic weight analysis is used for the document similarity measure. This method increases scalability by using labels and concept relations in the dimensionality reduction process.
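The "clustering property of the Dirichlet Process" mentioned in the abstract is commonly illustrated with the Chinese Restaurant Process, in which each document joins an existing cluster with probability proportional to its size or opens a new cluster with probability proportional to a concentration parameter. The following is a minimal sketch of that prior only. It is not the authors' DPMFP model or their variational inference procedure; the function name `crp_assignments` and the parameters `alpha` and `seed` are assumptions made for illustration.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese Restaurant Process.

    alpha is the (assumed) DP concentration parameter and must be positive;
    larger values tend to open more clusters. No cluster count K is supplied.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of documents currently in cluster k
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_docs=50, alpha=1.0)
    print("clusters discovered:", len(set(labels)))
```

The number of clusters is an outcome of the sampling rather than an input, which is the behavior the abstract relies on when it says the model works without requiring the number of clusters. A full DPM model would combine this prior with a data likelihood over document words.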