Web Page Information Extraction Service Based on Graph Convolutional Neural Network and Multimodal Data Fusion
Mingzhu Zhang, Zhongguo Yang, Sikandar Ali, Weilong Ding
2021 IEEE International Conference on Web Services (ICWS), 2021-09-01. DOI: 10.1109/ICWS53863.2021.00094 (https://doi.org/10.1109/ICWS53863.2021.00094)
Information extraction and the services built on it are an active research topic. Most existing work focuses on extracting information from a given web page and ignores the problem of locating the web page that contains the useful information. A holistic extraction system, however, must both locate the page and extract information from it, and neither step can be skipped. For instance, extracting lecture news from university websites is a typical hard task that requires locating the relevant web pages and then extracting the news from them. Because layouts and visual appearance differ across sites, statistics-based and vision-based methods fail to find these pages. In this study, we propose a holistic method for locating lecture news on university websites. A Graph Convolutional Network (GCN) is applied to fuse multimodal data, so that it can learn useful features from three views: the link relationships, the visual similarity, and the semantics of web pages. First, we apply a link model to explore the parent-child relationships between web pages; we then calculate the similarity of parent-child pages using a visual model and obtain semantic features with a BERT model. Specifically, the visual similarity features are learned with a triplet loss function, which forces the Convolutional Neural Network (CNN) model to learn the parts that are similar within the same group. Finally, these features are fused in the GCN model to find the target web page, and the approach adapts to a variety of university websites. Experiments on 50 websites show that our method outperforms state-of-the-art methods.
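The two learning components named in the abstract, a triplet loss for visual similarity and a graph convolution over the page-link graph, can be illustrated with a minimal sketch. The abstract does not give the authors' architecture, feature dimensions, or propagation rule, so everything here is an assumption: the loss is the standard margin form max(0, d(a,p) - d(a,n) + margin), the graph layer uses the common symmetric-normalized form ReLU(Â X W), parent-child links are treated as undirected, and the function names, the toy three-page site, and the feature values are all hypothetical.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances (assumed form)."""
    d_ap = math.dist(anchor, positive)  # anchor to same-group sample
    d_an = math.dist(anchor, negative)  # anchor to other-group sample
    return max(0.0, d_ap - d_an + margin)


def normalized_adjacency(edges, n):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n nodes."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops keep each page's own features
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_hat, features, weights):
    """One graph convolution: ReLU(a_hat @ features @ weights)."""
    h = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in h]


# Hypothetical site: 0 = home page, 1 = news list, 2 = lecture news page.
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(edges, 3)

# Hypothetical per-page features: one visual-similarity score concatenated
# with a two-dimensional stand-in for a semantic (e.g. BERT-derived) embedding.
features = [
    [0.1, 0.2, 0.0],
    [0.6, 0.5, 0.3],
    [0.9, 0.8, 0.7],
]
# Fixed illustrative weights; in practice these would be learned.
weights = [
    [0.5, -0.2],
    [0.3, 0.4],
    [-0.1, 0.6],
]
hidden = gcn_layer(a_hat, features, weights)
```

In this sketch each row of `hidden` mixes a page's own features with those of its linked neighbours, which is the sense in which a GCN can fuse link, visual, and semantic views; a classifier on top of such rows would then score each page as the target or not.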