Entity Matching across Heterogeneous Sources

Yang Yang, Yizhou Sun, Jie Tang, B. Ma, Juan-Zi Li
Published in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publication date: 2015-08-10
DOI: 10.1145/2783258.2783353
URL: https://doi.org/10.1145/2783258.2783353
Citations: 31

Abstract

Given an entity in a source domain, finding its matched entities in another (target) domain is an important task in many applications. Traditionally, the problem has been addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, this method inevitably fails if the two domains have little or no overlap in content. An extreme case is one in which the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve it. The model integrates topic extraction and entity matching, the two core subtasks of the problem, into a unified model. Specifically, to handle the disjoint-text problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and we leverage existing matching relations through latent topic layers instead of at the text layer. Benefiting from the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model substantially improves the matching performance (+19.8% and +7.1% in the two applications, respectively) compared with several alternative methods.
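The idea of matching at the latent topic layer rather than at the text layer can be illustrated with a minimal sketch. This is not the authors' model: it assumes that per-document topic mixtures over a shared topic space have already been obtained somehow (the paper does this with its cross-sampling topic model), and it simply ranks target documents by cosine similarity of those mixtures. The function names, document identifiers, and topic vectors below are all hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-mixture vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def match(query_topics, target_topics, top_k=2):
    """Rank target documents by topic-layer similarity to the query."""
    scored = [(cosine(query_topics, vec), name)
              for name, vec in target_topics.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

def shared_topics(p, q, top_k=1):
    """Topic indices contributing most to the match (the 'explanation')."""
    contrib = [(a * b, k) for k, (a, b) in enumerate(zip(p, q))]
    contrib.sort(reverse=True)
    return [k for _, k in contrib[:top_k]]

# Hypothetical topic mixtures over three shared topics (assumed, not from the paper).
# The query is an English document; the targets are Chinese documents, so there is
# no term overlap, but the topic space is common to both.
query = [0.7, 0.2, 0.1]
targets = {
    "zh_doc_a": [0.6, 0.3, 0.1],
    "zh_doc_b": [0.1, 0.1, 0.8],
    "zh_doc_c": [0.2, 0.7, 0.1],
}

ranking = match(query, targets)
best_name = ranking[0][0]
explanation = shared_topics(query, targets[best_name])
print(ranking, explanation)
```

Because the comparison happens between topic mixtures, the two documents never need to share a single word; the highest-contributing topic index doubles as a rough explanation of why the top match was returned.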