RED: Redundancy-Driven Data Extraction from Result Pages

The World Wide Web Conference. Pub Date: 2019-05-13. DOI: 10.1145/3308558.3313529

Jinsong Guo, Valter Crescenzi, Tim Furche, G. Grasso, G. Gottlob


Citation count: 5

Abstract

Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not yet been fully exploited for unsupervised data extraction: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present RED, an automatic approach and prototype system for extracting data records from sites that follow this publishing pattern. RED leverages the inherent redundancy between result records and their corresponding detail pages to provide an effective yet fully unsupervised and domain-independent method. It extracts from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. Compared with previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and can achieve significantly higher accuracy while automatically selecting only object attributes, a task that is out of scope for traditional fully unsupervised approaches. Compared with previous supervised or semi-supervised methods, RED can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
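The redundancy idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm, only a minimal illustration under simplifying assumptions: each result record and its detail page are already reduced to lists of text nodes, and a value counts as an object attribute when its normalized text appears in both. The function names and the sample job-posting data are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def redundant_values(record_texts, detail_texts):
    """Return the record text nodes whose normalized value also occurs on the detail page."""
    detail_set = {normalize(t) for t in detail_texts}
    return [
        t for t in record_texts
        if normalize(t) and normalize(t) in detail_set
    ]


# Hypothetical text nodes from one result record and its linked detail page.
record = ["Senior Data Engineer", "Oxford, UK", "View details", "£60,000"]
detail = ["Senior  Data Engineer", "Location", "Oxford, UK",
          "Salary", "£60,000", "Apply now"]

print(redundant_values(record, detail))
```

In this sketch, navigation text such as "View details" is dropped because it has no counterpart on the detail page, which mirrors the abstract's claim that only object attributes are selected. A real system would also need to locate records, follow links, and tolerate partial or reformatted matches, none of which is covered here.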
