Entity Matching across Heterogeneous Sources

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2015-08-10. DOI: 10.1145/2783258.2783353

Yang Yang, Yizhou Sun, Jie Tang, B. Ma, Juan-Zi Li


Citations: 31

Abstract

Given an entity in a source domain, finding its matched entities in another (target) domain is an important task in many applications. Traditionally, the problem has been addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, this method inevitably fails if the two domains have little or no overlap in content. An extreme case is one in which the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve it. The model integrates topic extraction and entity matching, the two core subtasks of the problem, into a unified model. Specifically, to handle the problem of disjoint text, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and we leverage existing matching relations through latent topic layers instead of at the text layer. Thanks to the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model can substantially improve the matching performance (+19.8% and +7.1% in the two applications, respectively) compared with several alternative methods.
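The contrast the abstract draws between matching at the text layer and matching at a latent topic layer can be illustrated with a small sketch. This is not the paper's model: the term lists, the three-topic distributions, and the helper names `jaccard` and `cosine` are all hypothetical, and the topic vectors are written by hand rather than inferred by any cross-sampling process. The sketch only shows why keyword overlap gives no signal when the source is in English and the target is in Chinese, while a shared topic space still can.

```python
import math


def jaccard(a, b):
    """Keyword overlap between two term lists (text-layer matching)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(p, q):
    """Cosine similarity between two topic distributions (topic-layer matching)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Hypothetical source (English) and target (Chinese) term lists.
source_terms = ["topic", "model", "entity", "matching"]
target_terms = ["主题", "模型", "实体", "匹配"]

# Hypothetical topic distributions over three shared latent topics.
# In the paper these would come from the learned topic model; here
# they are hand-written purely for illustration.
source_topics = [0.7, 0.2, 0.1]
target_topics = [0.6, 0.3, 0.1]
unrelated_topics = [0.05, 0.15, 0.8]

text_score = jaccard(source_terms, target_terms)
topic_score = cosine(source_topics, target_topics)
unrelated_score = cosine(source_topics, unrelated_topics)

print(f"text-layer score:        {text_score:.3f}")
print(f"topic-layer score:       {topic_score:.3f}")
print(f"topic-layer (unrelated): {unrelated_score:.3f}")
```

Under these assumed inputs the text-layer score is zero because the two vocabularies are disjoint, while the topic-layer score still separates the related target from the unrelated one.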

