Technical Perspective: Toward Building Entity Matching Management Systems

Pradap Konda, Sanjib Das, C. PaulSuganthanG., A. Doan, A. Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, J. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, V. Raghavendra
Journal: SIGMOD Rec., volume 118 1, pages 33-40
Publication date: 2016-08-01
Publication type: Journal Article
DOI: 10.14778/2994509.2994535 (https://doi.org/10.14778/2994509.2994535)
Citations: 187

Abstract

Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then describe Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users execute these steps; the tools seek to cover the entire EM pipeline, not just blocking and matching as current EM systems do. (3) Tools are built into the Python open-source data science ecosystem, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges and present extensive experiments that show the promise of the Magellan approach.
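The abstract names blocking and matching as the two pipeline steps that current EM systems cover. As a rough illustration of what those two steps mean, here is a minimal sketch in plain Python. It is not Magellan's API: the tables, field names, helper functions (`block`, `name_similarity`, `match`) and the 0.8 threshold are all hypothetical, and the string similarity comes from the standard library's `difflib` rather than from any learned matcher.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy tables; the records and field names are illustrative only.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Dan Brown", "city": "Madison"},
    {"id": "b3", "name": "Joseph Wilson", "city": "Austin"},
]


def block(left, right, key):
    """Blocking: keep only the pairs that agree exactly on `key`."""
    return [(l, r) for l, r in product(left, right) if l[key] == r[key]]


def name_similarity(l, r):
    """Similarity of the two lower-cased names, in the range [0, 1]."""
    return SequenceMatcher(None, l["name"].lower(), r["name"].lower()).ratio()


def match(pairs, threshold=0.8):
    """Matching: declare a pair a match if its name similarity reaches the threshold."""
    return [(l["id"], r["id"]) for l, r in pairs if name_similarity(l, r) >= threshold]


# Blocking shrinks the 2 x 3 cross product to the pairs sharing a city.
candidates = block(table_a, table_b, key="city")
# Matching then scores only the surviving candidate pairs.
matches = match(candidates)
print(matches)
```

In this toy data the blocking rule also discards the pair (a2, b3) because the cities differ, so the matcher never scores it. Deciding whether such a blocking rule is too aggressive is the kind of question an end-to-end EM workflow has to help the user answer.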