Accuracy vs. Speed: Scalable Entity Coreference on the Semantic Web with On-the-Fly Pruning

2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Pub Date: 2012-12-04. DOI: 10.1109/WI-IAT.2012.24

Dezhao Song, J. Heflin


Citations: 3

Abstract

One challenge for the Semantic Web is to scalably establish high-quality owl:sameAs links between coreferent ontology instances in different data sources. Traditional approaches that exhaustively compare every pair of instances do not scale well to large datasets. In this paper, we propose a pruning-based algorithm for reducing the complexity of entity coreference. First, we discard candidate pairs of instances that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method is proposed to automatically adjust the threshold for such commonality on the fly. In our prior work, each instance is associated with a context graph consisting of neighboring RDF nodes. In this paper, we speed up the comparison for a single pair of instances by pruning insignificant context in the graph; this is accomplished by evaluating each piece of context's potential contribution to the final similarity measure. We evaluate our system on three Semantic Web instance categories. We verify the effectiveness of our thresholding and context pruning methods by comparing to nine state-of-the-art systems. We show that our algorithm frequently outperforms those systems with a runtime speedup factor of 18 to 24 while maintaining competitive F1-scores. For datasets of up to 1 million instances, this translates to as much as 370 hours of improvement in runtime.
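The abstract does not give the paper's formulas, so the following is only a minimal sketch of the two ideas it names, written under stated assumptions rather than as the authors' method. It assumes that a pair's similarity is a weighted sum of per-context similarities in [0, 1], and that the threshold is a sigmoid of some running statistic in [0, 1]. The function names `sigmoid_threshold` and `pruned_similarity`, the parameters `low`, `high`, `steepness`, and `midpoint`, and the upper-bound stopping rule are all hypothetical.

```python
import math


def sigmoid_threshold(x, low=0.2, high=0.8, steepness=10.0, midpoint=0.5):
    """Map a running statistic x in [0, 1] to a threshold between low and high.

    Assumption: the paper's actual input and parameters are not stated in
    the abstract; these defaults are illustrative only.
    """
    s = 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))
    return low + (high - low) * s


def pruned_similarity(parts, threshold):
    """Accumulate a weighted similarity and stop early when it cannot reach threshold.

    parts: list of (weight, similarity) pairs, one per piece of context,
           with similarity in [0, 1] (hypothetical representation).
    Returns (score, pruned), where pruned is True if the comparison was cut short.
    """
    # Visit the heaviest context first so the upper bound tightens quickly.
    ordered = sorted(parts, key=lambda p: p[0], reverse=True)
    remaining = sum(weight for weight, _ in ordered)
    score = 0.0
    for weight, sim in ordered:
        # Upper bound: every unseen piece of context matches perfectly.
        if score + remaining < threshold:
            return score, True
        score += weight * sim
        remaining -= weight
    return score, False


if __name__ == "__main__":
    t = sigmoid_threshold(0.7)
    print(pruned_similarity([(0.5, 0.0), (0.3, 1.0), (0.2, 1.0)], t))
```

The stopping rule skips the remaining context as soon as even perfect matches on it could not lift the pair over the threshold, which is one plausible reading of "evaluating its potential contribution to the final similarity measure".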
