Web Page Information Extraction Service Based on Graph Convolutional Neural Network and Multimodal Data Fusion

2021 IEEE International Conference on Web Services (ICWS). Pub Date: 2021-09-01. DOI: 10.1109/ICWS53863.2021.00094

Mingzhu Zhang, Zhongguo Yang, Sikandar Ali, Weilong Ding


Citations: 2

Abstract

Information extraction and the services built on it are an active research topic. Most prior work focuses on extracting information from a given web page and ignores the problem of locating the web page that contains the useful information. A complete extraction system, however, must both locate the relevant web page and extract information from it, and neither step can be skipped. Extracting lecture news from university websites is a typical hard case: the system must first locate the relevant pages and then extract the news from them. Because layouts and visual appearances differ across sites, statistics-based and vision-based methods fail to find these pages. In this study, we propose a holistic method to locate lecture news on university websites. A Graph Convolutional Network (GCN) fuses the multimodal data, learning useful features from three views: the link relationships, the visual similarity, and the semantics of web pages. First, we apply a link model to explore the parent-child relationships between web pages; we then compute the similarity of parent-child pages with a visual model and obtain semantic features with a BERT model. Specifically, the visual similarity features are learned with a triplet loss function, which forces the Convolutional Neural Network (CNN) model to learn the parts that pages in the same group share. Finally, these features are fused in the GCN model to find the target web page, and the approach adapts to various university websites. Experiments on 50 websites show that our method outperforms state-of-the-art methods.
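
The abstract gives no implementation details, so the following is only a minimal sketch of the two building blocks it names: a triplet loss and one GCN propagation step over a page-link graph. It assumes a squared Euclidean distance in the triplet loss, treats parent-child links as undirected edges, and uses the common symmetric normalization D^{-1/2}(A + I)D^{-1/2}; none of these choices is confirmed by the paper. All function names, the toy features, and the weights are hypothetical.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with squared Euclidean distance (assumed)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    # Start from the identity so every page has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(xv * yv for xv, yv in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_hat, features, weights):
    """One GCN propagation step: ReLU(A_hat X W)."""
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Toy site: page 0 is the parent of pages 1 and 2 (hypothetical data).
edges = [(0, 1), (0, 2)]
a_hat = normalized_adjacency(3, edges)

# Each row concatenates one visual-similarity score with a 2-d stand-in
# for a semantic embedding (a real BERT vector would be far larger).
features = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.0],
    [0.1, 0.0, 0.7],
]
# Hypothetical 3x2 weight matrix; in practice it would be learned.
weights = [
    [0.5, -0.2],
    [0.1, 0.4],
    [-0.3, 0.6],
]
fused = gcn_layer(a_hat, features, weights)
print(fused)
```

Each output row mixes a page's own features with those of its linked neighbours, which is one way link structure, visual similarity, and semantics can end up in a single representation. A classifier on top of `fused` would then score each page as a lecture-news page or not.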
