Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2015-08-10. DOI: 10.1145/2783258.2783374

Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, D. Roth, Ming Zhang, Jiawei Han


Citations: 48

Abstract

One of the key obstacles to making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are therefore how to adapt world knowledge to specific domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to make the world knowledge domain-specific by resolving the ambiguity of entities and their types, and we represent the data together with the world knowledge as a heterogeneous information network. We then propose a clustering algorithm that can cluster multiple types and incorporate sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, a collaboratively collected knowledge base about entities and their organizations. The other is YAGO2, a knowledge base that is automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base WordNet. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
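
To make the representation described in the abstract more concrete, the following is a minimal Python sketch of a heterogeneous information network with four node types (document, word, entity, sub-type), plus one possible way to turn shared sub-types into pairwise constraints. Everything here is an assumption for illustration: the toy documents, the entity-to-sub-type mapping, the function names `build_hin` and `must_link_pairs`, and the must-link form of the constraints are not taken from the paper, which obtains entities and types from Freebase or YAGO2 and defines its own clustering algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Toy, hand-written data (hypothetical). In the paper, entities and their
# types come from a knowledge base, not from a hard-coded dictionary.
DOCS = {
    "d1": {"words": ["election", "senate"], "entities": ["Obama"]},
    "d2": {"words": ["election", "vote"], "entities": ["Obama", "Clinton"]},
    "d3": {"words": ["game", "season"], "entities": ["Lakers"]},
}
ENTITY_SUBTYPE = {
    "Obama": "Politician",
    "Clinton": "Politician",
    "Lakers": "SportsTeam",
}


def build_hin(docs, entity_subtype):
    """Return the typed nodes and typed edges of a toy heterogeneous network."""
    nodes = defaultdict(set)  # node type -> set of node ids
    edges = defaultdict(set)  # (type_a, type_b) -> set of (id_a, id_b)
    for doc_id, content in docs.items():
        nodes["document"].add(doc_id)
        for word in content["words"]:
            nodes["word"].add(word)
            edges[("document", "word")].add((doc_id, word))
        for entity in content["entities"]:
            nodes["entity"].add(entity)
            edges[("document", "entity")].add((doc_id, entity))
            subtype = entity_subtype[entity]
            nodes["subtype"].add(subtype)
            edges[("entity", "subtype")].add((entity, subtype))
    return nodes, edges


def must_link_pairs(edges):
    """Pairs of entities that share a sub-type (an assumed constraint form)."""
    by_subtype = defaultdict(set)
    for entity, subtype in edges[("entity", "subtype")]:
        by_subtype[subtype].add(entity)
    pairs = set()
    for members in by_subtype.values():
        # Sort so each unordered pair has one canonical representation.
        pairs.update(combinations(sorted(members), 2))
    return pairs


nodes, edges = build_hin(DOCS, ENTITY_SUBTYPE)
constraints = must_link_pairs(edges)
```

In this sketch, `nodes` and `edges` hold the multi-typed network, and `constraints` holds the entity pairs that a constrained clustering step could be encouraged to place in the same cluster. How the constraints are actually weighted and enforced is left to the paper itself.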



