A hybrid machine-crowdsourcing system for matching web tables

2014 IEEE 30th International Conference on Data Engineering. Pub Date: 2014-05-19. DOI: 10.1109/ICDE.2014.6816716

Ju Fan, Meiyu Lu, B. Ooi, W. Tan, Meihui Zhang


Citations: 110

Abstract

The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses the above difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that sometimes values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for “difficult” columns. Our overall framework assigns the most “beneficial” column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the rest of the columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
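As a rough illustration of the two ideas in the abstract, the sketch below maps columns to concepts in a toy knowledge base and then picks the most ambiguous columns to send to the crowd under a budget. It is not the authors' algorithm: the knowledge base `KB`, the coverage score, the margin-based `uncertainty` measure, and all function names are assumptions made for this example, and the paper's notion of a "beneficial" task is likely richer than plain uncertainty.

```python
# Hypothetical toy knowledge base: concept -> known instance values.
KB = {
    "Country": {"france", "japan", "brazil", "kenya", "canada"},
    "City": {"paris", "tokyo", "lima", "nairobi", "toronto"},
}


def concept_scores(column):
    """Score each concept by the fraction of column values it covers."""
    values = {v.strip().lower() for v in column}
    if not values:
        return {concept: 0.0 for concept in KB}
    return {
        concept: len(values & instances) / len(values)
        for concept, instances in KB.items()
    }


def best_concept(column):
    """Return the concept with the highest coverage score."""
    scores = concept_scores(column)
    return max(scores, key=scores.get)


def uncertainty(column):
    """Assumed difficulty measure: 1 minus the gap between the top two scores."""
    top, second = sorted(concept_scores(column).values(), reverse=True)[:2]
    return 1.0 - (top - second)


def select_for_crowd(columns, budget):
    """Pick up to `budget` column names, most uncertain first."""
    ranked = sorted(columns, key=lambda name: uncertainty(columns[name]), reverse=True)
    return ranked[:budget]


columns = {
    "a": ["France", "Japan"],   # shares no values with column "b"
    "b": ["Brazil", "Kenya"],
    "c": ["Paris", "Japan"],    # ambiguous between the two concepts
}

# Columns "a" and "b" are value-disjoint yet resolve to the same concept.
matched = best_concept(columns["a"]) == best_concept(columns["b"])
crowd_tasks = select_for_crowd(columns, budget=1)
```

Here columns `a` and `b` have no values in common, so a purely value-overlap matcher would miss the correspondence, while both still resolve to the same concept. Column `c` is evenly split between two concepts, so it is the one a budget of one task would send to human workers.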

