Graph-Theoretic Techniques for Web Content Mining

Series in Machine Perception and Artificial Intelligence. Pub Date: 2005-05-31. DOI: 10.1142/5832

A. Schenker, A. Kandel, H. Bunke, Mark Last

Citations: 203

Abstract

Graph-Theoretic Techniques for Web Content Mining

In this dissertation we introduce several novel techniques for data mining on web documents that use graph representations of document content. Graphs are more robust than typical vector representations because they can model structural information that is usually lost when the original web document content is converted to a vector. For example, graphs can capture the location, order, and proximity of term occurrences, all of which are discarded under standard document vector models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. As a result, many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available.

We introduce the novel Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled as graphs. We compare the system we built around this new algorithm and its prior version with similar web search clustering systems to gauge its usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and easier for users to browse.

Next we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, that allow graphs to be used as the fundamental data items instead of vectors. We perform experiments on three web document collections, comparing the new graph-based methods with the traditional vector-based methods. Our experimental results show an improvement of the graph approaches over the vector approaches for both clustering and classification of web documents. An important advantage of the graph representations we propose is that they allow graph similarity to be computed in polynomial time; in general, determining graph similarity with the techniques we use is an NP-complete problem. In fact, in some cases the graph-oriented approach ran faster than the vector approaches.
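
The abstract does not spell out the graph model or the similarity measure, so the following is only a minimal sketch under stated assumptions: each unique term becomes one node, a directed edge links a term to the term that directly follows it, and the distance between two graphs is derived from the size of their maximum common subgraph. Under the unique-label assumption, the common subgraph reduces to plain set intersections, which illustrates how graph similarity could become cheap to compute. All function names (`doc_to_graph`, `mcs_distance`, `knn_classify`) are hypothetical and chosen for illustration.

```python
from collections import Counter


def doc_to_graph(text):
    """Assumed model: nodes are unique terms, edges are adjacent-term pairs."""
    terms = text.lower().split()
    nodes = set(terms)
    edges = set(zip(terms, terms[1:]))  # edge a -> b when b directly follows a
    return nodes, edges


def graph_size(graph):
    """Size of a graph, taken here as node count plus edge count."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance in [0, 1] based on the maximum common subgraph (assumed measure).

    Because node labels are unique, the common subgraph is simply the
    intersection of the node sets and of the edge sets; no search is needed.
    """
    common = (g1[0] & g2[0], g1[1] & g2[1])
    denom = max(graph_size(g1), graph_size(g2))
    return 1.0 - graph_size(common) / denom if denom else 0.0


def knn_classify(query, labeled_graphs, k=3):
    """Majority vote among the k labeled graphs closest to the query graph."""
    nearest = sorted(labeled_graphs, key=lambda item: mcs_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    (doc_to_graph("graph matching for web mining"), "mining"),
    (doc_to_graph("web mining with graph models"), "mining"),
    (doc_to_graph("fresh pasta recipe with tomato"), "cooking"),
]
print(knn_classify(doc_to_graph("graph models for web mining"), train, k=1))
```

The only change relative to a vector-based k-Nearest Neighbors classifier is the distance function: graphs replace vectors as the data items, and the voting step is untouched. A graph-based k-means would additionally need a graph analogue of the centroid, which this sketch does not attempt.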
