Clustering Wikipedia infoboxes to discover their types

Proceedings of the 21st ACM International Conference on Information and Knowledge Management. Pub Date: 2012-10-29. DOI: 10.1145/2396761.2398588

T. Nguyen, Huong Nguyen, V. Moreira, J. Freire

Citations: 1

Abstract

Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
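The abstract does not spell out the clustering algorithm, so the following is only a minimal sketch of the general idea of grouping infoboxes by their structured content rather than by category or template labels. It assumes, purely for illustration, that each infobox is reduced to its set of attribute names, that similarity is measured with the Jaccard coefficient, and that infoboxes are grouped by single-link thresholding. The function names, the 0.5 threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two attribute-name sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_infoboxes(
    infoboxes: Dict[str, Set[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Group infoboxes whose attribute sets are similar.

    Assumption (not from the paper): single-link grouping, i.e. two
    infoboxes end up in the same cluster if a chain of pairwise
    similarities >= threshold connects them.
    """
    names = sorted(infoboxes)
    parent = {name: name for name in names}  # union-find forest

    def find(name: str) -> str:
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge every pair of infoboxes that is similar enough.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(infoboxes[first], infoboxes[second]) >= threshold:
                parent[find(second)] = find(first)

    # Collect members under their representative.
    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical toy infoboxes: article title -> attribute names.
toy = {
    "Casablanca": {"director", "starring", "released", "runtime"},
    "Vertigo": {"director", "starring", "runtime", "music"},
    "Paris": {"country", "population", "mayor", "area"},
    "Lyon": {"country", "population", "area", "region"},
}
clusters = cluster_infoboxes(toy)
print(clusters)
```

In this toy input the two film infoboxes share three of five distinct attributes, as do the two city infoboxes, so each pair clears the assumed threshold while cross-pair similarity is zero. Real infoboxes use inconsistent attribute names for the same concept, which a plain set comparison like this one would not handle.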
