低资源环境下结构化Web数据抽取的结构共现学习

Proceedings of the ACM Web Conference 2023 Pub Date : 2023-04-30 DOI:10.1145/3543507.3583387

Zhenyu Zhang, Yu Bowen, Tingwen Liu, Tianyun Liu, Yubin Wang, Li Guo

{"title":"低资源环境下结构化Web数据抽取的结构共现学习","authors":"Zhenyu Zhang, Yu Bowen, Tingwen Liu, Tianyun Liu, Yubin Wang, Li Guo","doi":"10.1145/3543507.3583387","DOIUrl":null,"url":null,"abstract":"Extracting structured information from all manner of webpages is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, they typically optimize the model to learn the semantic co-occurrence of elements and labels in the same webpage, thus their effectiveness depends on sufficient labeled data, which is labor-intensive. In this paper, we further observe structural co-occurrences in different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes in this position also share similar surface forms. Motivated by this, we propose a novel method, Structor, to effectively incorporate the structural co-occurrences over DOM tree and surface form into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we study two challenging experimental scenarios: website-level low-resource setting and webpage-level low-resource setting, to evaluate our approach. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms the state-of-the-art models in both settings, and even achieves three times the performance of the strong baseline model in the case of extreme lack of training data.","PeriodicalId":296351,"journal":{"name":"Proceedings of the ACM Web Conference 2023","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings\",\"authors\":\"Zhenyu Zhang, Yu Bowen, Tingwen Liu, Tianyun Liu, Yubin Wang, Li Guo\",\"doi\":\"10.1145/3543507.3583387\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting structured information from all manner of webpages is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, they typically optimize the model to learn the semantic co-occurrence of elements and labels in the same webpage, thus their effectiveness depends on sufficient labeled data, which is labor-intensive. In this paper, we further observe structural co-occurrences in different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes in this position also share similar surface forms. Motivated by this, we propose a novel method, Structor, to effectively incorporate the structural co-occurrences over DOM tree and surface form into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we study two challenging experimental scenarios: website-level low-resource setting and webpage-level low-resource setting, to evaluate our approach. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms the state-of-the-art models in both settings, and even achieves three times the performance of the strong baseline model in the case of extreme lack of training data.\",\"PeriodicalId\":296351,\"journal\":{\"name\":\"Proceedings of the ACM Web Conference 2023\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM Web Conference 2023\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3543507.3583387\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Web Conference 2023","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3543507.3583387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

从各种形式的网页中提取结构化信息是一个重要的问题，它有可能使许多现实世界的应用程序自动化。最近的工作已经证明了利用DOM树和预训练的语言模型来描述和编码网页的有效性。然而，他们通常会优化模型来学习同一网页中元素和标签的语义共现，因此他们的有效性取决于足够的标记数据，这是劳动密集型的。在本文中，我们进一步观察到在同一网站的不同网页中结构共现:DOM树中相同的位置通常扮演相同的语义角色，并且该位置的DOM节点也具有相似的表面形式。基于此，我们提出了一种新的方法Structor，将DOM树和表面形式的结构共现有效地整合到预训练的语言模型中。这种结构共现有助于模型在低资源设置下更好地学习任务，我们研究了两个具有挑战性的实验场景:网站级低资源设置和网页级低资源设置，以评估我们的方法。在公共SWDE数据集上进行的大量实验表明，Structor在这两种设置下的性能都明显优于最先进的模型，在训练数据极度缺乏的情况下，其性能甚至达到强基线模型的三倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings

Extracting structured information from all manner of webpages is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, they typically optimize the model to learn the semantic co-occurrence of elements and labels in the same webpage, thus their effectiveness depends on sufficient labeled data, which is labor-intensive. In this paper, we further observe structural co-occurrences in different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes in this position also share similar surface forms. Motivated by this, we propose a novel method, Structor, to effectively incorporate the structural co-occurrences over DOM tree and surface form into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we study two challenging experimental scenarios: website-level low-resource setting and webpage-level low-resource setting, to evaluate our approach. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms the state-of-the-art models in both settings, and even achieves three times the performance of the strong baseline model in the case of extreme lack of training data.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the ACM Web Conference 2023

自引率

0.00%

发文量