Towards task-based parallelization for entity resolution

Impact factor: 2.4 · JCR Q1 · Computer Science
Leonardo Gazzarri, Melanie Herschel
Journal article, pages 1-8. Published 2019-08-26.
DOI: https://doi.org/10.1007/s00450-019-00409-6
Citations: 3

Abstract

Entity resolution (ER) is the problem of determining which virtual representations in one or more data sources refer to the same real-world entity. A central question in ER is how to find matching entity representations (so-called duplicates) efficiently and scalably. One general technique for addressing these issues is parallelization. Almost all work on parallel ER focuses on data parallelism; this paper instead focuses on task parallelism for ER. Task parallelism makes it possible to support incremental ER, which computes the solution incrementally by streaming the results of intermediate ER stages as soon as they are computed. This may yield results in a more timely fashion and can also be useful in a service-oriented setting with a limited time or monetary budget. In summary, this paper presents a framework for task-parallelization of ER that supports, in particular, ER over large amounts of semi-structured and heterogeneous data. We also discuss a possible implementation of our framework.
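To make the idea of streaming intermediate results concrete, the following is a minimal sketch of a two-stage ER pipeline in which each stage runs as its own task and passes results downstream as soon as they exist. It is an illustration only, not the framework presented in the paper: the stage names (`blocking_stage`, `matching_stage`), the first-letter blocking key, the token-based Jaccard similarity, and the 0.5 threshold are all assumptions chosen for brevity.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream between stages


def blocking_stage(records, out_q):
    """Emit candidate pairs as soon as two records share a blocking key.

    Assumption: the blocking key is the lowercased first letter of the name.
    """
    blocks = {}
    for rec in records:
        key = rec["name"][:1].lower()
        for other in blocks.setdefault(key, []):
            out_q.put((other, rec))  # streamed immediately, not batched
        blocks[key].append(rec)
    out_q.put(SENTINEL)


def similarity(a, b):
    """Jaccard similarity over lowercased name tokens (illustrative choice)."""
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def matching_stage(in_q, out_q, threshold=0.5):
    """Consume candidate pairs while blocking is still running."""
    while (pair := in_q.get()) is not SENTINEL:
        a, b = pair
        if similarity(a, b) >= threshold:
            out_q.put((a["id"], b["id"]))
    out_q.put(SENTINEL)


def resolve_incrementally(records):
    """Yield duplicate pairs one by one as the pipeline produces them."""
    pairs_q, matches_q = queue.Queue(), queue.Queue()
    threading.Thread(
        target=blocking_stage, args=(records, pairs_q), daemon=True
    ).start()
    threading.Thread(
        target=matching_stage, args=(pairs_q, matches_q), daemon=True
    ).start()
    while (match := matches_q.get()) is not SENTINEL:
        yield match
```

A caller can iterate over `resolve_incrementally(records)` and stop early once a time or cost budget is exhausted, keeping whatever duplicates have been reported so far. In CPython the threads mainly illustrate the pipeline structure; a real deployment would likely place the stages on separate processes or workers.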
Source journal: SICS Software-Intensive Cyber-Physical Systems (Computer Science, Hardware & Architecture)