Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal

Yu-hao Wu, Chia-Hui Chang
{"title":"Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal","authors":"Yu-hao Wu, Chia-Hui Chang","doi":"10.1145/3486622.3493938","DOIUrl":null,"url":null,"abstract":"Although web pages are rich in resources, they are usually intertwined with advertisements, banners, navigation bars, footer copyrights and other templates, which are often not of interest to users. In this paper, we study the problem of extracting the main content and removing irrelevant information from web pages. The common solution is to classify each web component into boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve an impressive score in CleanEval EN dataset. However, the model uses only the top 50 HTML tags as input features, which does not fully utilize the power of tag information. In addition, the most frequent 1,000 words used for text content representation cannot effectively support a real-world environment in which web pages appear in multiple languages. In this paper, we propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation to deal with any web pages without language limitations. The experiments show that HTML tag embedding and multi-task learning frameworks achieve much higher scores than using BoilerNet on CleanEval EN datasets. Secondly, the pre-trained text block representation based on multilingual BERT will degrade the performance on EN test sets; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) have a performance consistent with the five-fold cross-validation of the respective language, which indicates the possibility of providing cross-lingual support in one model.","PeriodicalId":89230,"journal":{"name":"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","volume":"25 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3486622.3493938","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Although web pages are rich in resources, they are usually intertwined with advertisements, banners, navigation bars, footer copyrights, and other template elements that are often of no interest to users. In this paper, we study the problem of extracting the main content of a web page and removing irrelevant information. The common solution is to classify each web component as either boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve an impressive score on the CleanEval EN dataset. However, the model uses only the top 50 HTML tags as input features, which does not fully exploit the tag information. In addition, the 1,000 most frequent words used for text content representation cannot effectively support a real-world environment in which web pages appear in multiple languages. In this paper, we propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation, so that web pages in any language can be handled. The experiments show that HTML tag embedding and the multi-task learning framework achieve much higher scores than BoilerNet on the CleanEval EN dataset. Second, the pre-trained text block representation based on multilingual BERT degrades performance on the EN test set; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) show performance consistent with five-fold cross-validation in the respective language, which indicates the possibility of providing cross-lingual support in a single model.
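
To make the described architecture concrete, here is a minimal PyTorch sketch of a model along these lines: per-block HTML tag embeddings are concatenated with multilingual-BERT text vectors, a BiLSTM labels each block as boilerplate or content, and two auxiliary heads predict DOM depth and page position. All layer sizes, the auxiliary loss weight, and every name (MultiTaskBoilerplateTagger, encode_blocks, etc.) are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch of a multi-task neural sequence labeler for boilerplate
# removal. Hyperparameters and names are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MBERT = "bert-base-multilingual-cased"

def encode_blocks(texts, tokenizer, bert):
    """Encode each text block with multilingual BERT ([CLS] pooling)."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**enc).last_hidden_state[:, 0]  # (n_blocks, 768)

class MultiTaskBoilerplateTagger(nn.Module):
    def __init__(self, num_tags=100, tag_dim=32, text_dim=768, hidden=128):
        super().__init__()
        # Learned embedding for the HTML tag of each block (assumed
        # vocabulary of num_tags distinct tags on the tag path).
        self.tag_emb = nn.Embedding(num_tags, tag_dim)
        # BiLSTM over the sequence of text blocks on a page.
        self.encoder = nn.LSTM(tag_dim + text_dim, hidden,
                               batch_first=True, bidirectional=True)
        # Main task: boilerplate (0) vs. main content (1) per block.
        self.content_head = nn.Linear(2 * hidden, 2)
        # Auxiliary task 1: predict each block's DOM depth (regression).
        self.depth_head = nn.Linear(2 * hidden, 1)
        # Auxiliary task 2: predict each block's relative position in the page.
        self.pos_head = nn.Linear(2 * hidden, 1)

    def forward(self, tag_ids, text_vecs):
        # tag_ids: (batch, seq); text_vecs: (batch, seq, text_dim)
        x = torch.cat([self.tag_emb(tag_ids), text_vecs], dim=-1)
        h, _ = self.encoder(x)  # (batch, seq, 2 * hidden)
        return (self.content_head(h),
                self.depth_head(h).squeeze(-1),
                self.pos_head(h).squeeze(-1))

def multitask_loss(outputs, labels, depths, positions, aux_weight=0.1):
    """Main cross-entropy loss plus weighted auxiliary regression losses."""
    logits, depth_pred, pos_pred = outputs
    main = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
    aux = (nn.functional.mse_loss(depth_pred, depths)
           + nn.functional.mse_loss(pos_pred, positions))
    return main + aux_weight * aux
```

At inference time only the content head is used; the auxiliary heads exist solely to shape the shared representation during training, which is the usual rationale for auxiliary tasks in multi-task sequence labeling.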