Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings

Zhenyu Zhang, Bowen Yu, Tingwen Liu, Tianyun Liu, Yubin Wang, Li Guo
{"title":"Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings","authors":"Zhenyu Zhang, Yu Bowen, Tingwen Liu, Tianyun Liu, Yubin Wang, Li Guo","doi":"10.1145/3543507.3583387","DOIUrl":null,"url":null,"abstract":"Extracting structured information from all manner of webpages is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, they typically optimize the model to learn the semantic co-occurrence of elements and labels in the same webpage, thus their effectiveness depends on sufficient labeled data, which is labor-intensive. In this paper, we further observe structural co-occurrences in different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes in this position also share similar surface forms. Motivated by this, we propose a novel method, Structor, to effectively incorporate the structural co-occurrences over DOM tree and surface form into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we study two challenging experimental scenarios: website-level low-resource setting and webpage-level low-resource setting, to evaluate our approach. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms the state-of-the-art models in both settings, and even achieves three times the performance of the strong baseline model in the case of extreme lack of training data.","PeriodicalId":296351,"journal":{"name":"Proceedings of the ACM Web Conference 2023","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Web Conference 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3543507.3583387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Extracting structured information from webpages of all kinds is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, these methods typically optimize the model to learn the semantic co-occurrence of elements and labels within the same webpage, so their effectiveness depends on sufficient labeled data, which is labor-intensive to obtain. In this paper, we further observe structural co-occurrences across different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes at that position also share similar surface forms. Motivated by this, we propose a novel method, Structor, that effectively incorporates these structural co-occurrences over the DOM tree and surface forms into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we evaluate our approach in two challenging experimental scenarios: a website-level low-resource setting and a webpage-level low-resource setting. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms state-of-the-art models in both settings, and even achieves three times the performance of a strong baseline model when training data is extremely scarce.
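
The structural co-occurrence observation can be made concrete with a small illustration. Below is a minimal Python sketch (standard library only) of how text nodes from different pages of the same website might be grouped by an XPath-like position string; the toy pages and the PathCollector / group_by_path names are invented for illustration and are not the paper's Structor implementation.

from collections import defaultdict
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects (dom_path, text) pairs from one webpage.

    The path is an XPath-like string such as /html[1]/body[1]/div[1]/span[1],
    built from the open-tag stack with per-parent sibling indices.
    """
    VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}

    def __init__(self):
        super().__init__()
        self.stack = []            # [(tag, index), ...] from root to current node
        self.child_counts = [{}]   # per-depth counters used to number siblings
        self.records = []          # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.stack:
            return
        if self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/" + "/".join(f"{t}[{i}]" for t, i in self.stack)
            self.records.append((path, text))

def group_by_path(pages):
    """Group node texts from several pages of one website by their DOM path."""
    groups = defaultdict(list)
    for html in pages:
        parser = PathCollector()
        parser.feed(html)
        for path, text in parser.records:
            groups[path].append(text)
    return groups

if __name__ == "__main__":
    # Two toy detail pages generated from the same (hypothetical) website template.
    page_a = "<html><body><div><h1>iPhone 14</h1><span>$799</span></div></body></html>"
    page_b = "<html><body><div><h1>Pixel 7</h1><span>$599</span></div></body></html>"
    for path, texts in group_by_path([page_a, page_b]).items():
        print(path, texts)
    # The h1 path collects product names and the span path collects prices:
    # the same DOM position plays the same semantic role across pages, which is
    # the kind of cross-page signal a model could exploit under low-resource settings.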