Semi-Supervised Clustering of XML Documents: Getting the Most from Structural Information

22nd International Conference on Data Engineering Workshops (ICDEW'06). Pub Date: 2006-04-03. DOI: 10.1109/ICDEW.2006.136

Eduardo Goncalves da Silva, M. Mattoso, G. Xexéo


Citations: 3

Abstract

As document providers can express more contextualized and complex information, semi-structured documents are becoming a major source of information in many areas, e.g., digital libraries, e-commerce, and Web applications. A distinguishing characteristic of such document collections is that some structure or metadata accompanies the data. Clustering methods that can exploit this structural information to better organize such collections are therefore highly relevant. Semi-structured documents nevertheless pose new challenges to document clustering methods, since it is not clear how the structural information can be used to improve the quality of the generated clustering models. At the same time, there has recently been growing interest in the semi-supervised clustering task, in which a small amount of prior knowledge is provided to guide the algorithm toward a better clustering model. One particular type of semi-supervision takes the form of user-provided constraints defined over pairs of objects, where each pair indicates whether its objects must be placed in the same cluster or in different clusters. In this paper, we consider the problem of constrained clustering of documents that carry some form of structural information. Specifically, we consider one particular kind of data to be clustered: textual documents whose logical structure is represented in XML. We define and extend methods that improve the quality of clustering results by using this structural information to guide the execution of the constrained clustering algorithm. Experimental results on the OHSUMED document collection show the effectiveness of our approach.
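The pairwise constraints mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows one common way to represent "same cluster" (must-link) and "different cluster" (cannot-link) pairs and to check a candidate cluster assignment against them. The function name `violated_constraints`, the document ids, and the cluster labels are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def violated_constraints(
    assignment: Dict[str, int],
    must_link: List[Pair],
    cannot_link: List[Pair],
) -> List[Tuple[str, Pair]]:
    """Return the pairwise constraints that a cluster assignment breaks."""
    violations = []
    for a, b in must_link:
        # A must-link pair is violated when its documents land in different clusters.
        if assignment[a] != assignment[b]:
            violations.append(("must-link", (a, b)))
    for a, b in cannot_link:
        # A cannot-link pair is violated when its documents share a cluster.
        if assignment[a] == assignment[b]:
            violations.append(("cannot-link", (a, b)))
    return violations


# Hypothetical document ids and cluster labels, for illustration only.
assignment = {"doc1": 0, "doc2": 0, "doc3": 1, "doc4": 1}
must_link = [("doc1", "doc2"), ("doc2", "doc3")]
cannot_link = [("doc1", "doc4"), ("doc3", "doc4")]

result = violated_constraints(assignment, must_link, cannot_link)
for kind, pair in result:
    print(kind, pair)
```

A constrained clustering algorithm could use a check of this kind either to reject assignments that break a constraint or to penalize them in its objective function; which of the two the paper adopts is not stated in the abstract.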


