web文档聚类的测地线距离

2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) Pub Date : 2011-04-11 DOI:10.1109/CIDM.2011.5949449

Selma Tekir, Florian Mansmann, D. Keim

{"title":"web文档聚类的测地线距离","authors":"Selma Tekir, Florian Mansmann, D. Keim","doi":"10.1109/CIDM.2011.5949449","DOIUrl":null,"url":null,"abstract":"While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Geodesic distances for web document clustering\",\"authors\":\"Selma Tekir, Florian Mansmann, D. Keim\",\"doi\":\"10.1109/CIDM.2011.5949449\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.\",\"PeriodicalId\":211565,\"journal\":{\"name\":\"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIDM.2011.5949449\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIDM.2011.5949449","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

虽然传统的距离测量通常能够正确地描述物体之间的相似性，但在某些应用领域，仍然有可能根据数据集中提供的额外信息对这些测量进行微调。在这项工作中，我们将这种传统的文档分析距离度量与文档之间的链接信息相结合，以提高聚类结果。特别地，我们在0球的球面几何空间假设下测试了测地线距离作为相似度量的有效性。因此，我们提出的距离度量是术语-文档矩阵的余弦距离和测地线距离公式中的一些曲率值的组合。为了估计这些曲率值，我们从数据集的链接图中计算每个文档的聚类系数值，并通过启发式方法增加它们的独特性，因为这些聚类系数值是曲率的粗略估计。为了评估我们的工作，我们使用k-means算法对英语维基百科超链接数据集进行聚类测试，该数据集具有传统的余弦距离和我们提出的测地线距离。我们的方法的有效性是通过基于每篇文章提供的分类信息计算聚类的微精度值来衡量的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Geodesic distances for web document clustering

While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)

自引率

0.00%

发文量